Python’s concurrency story is often summarized as “there’s a GIL, so use multiprocessing.” That’s dangerously incomplete. The truth involves understanding what the GIL actually does, when it doesn’t matter, and the real costs of multiprocessing that nobody talks about.
The GIL: What It Actually Is ¶
The Global Interpreter Lock is a mutex that protects access to Python objects. Only one thread can execute Python bytecode at a time.
Why it exists: CPython’s memory management (reference counting) isn’t thread-safe. Without the GIL, two threads incrementing the same object’s reference count could corrupt it:
# Without GIL, this would be a race condition inside CPython:
# Thread 1: reads refcount (1)
# Thread 2: reads refcount (1)
# Thread 1: writes refcount (2)
# Thread 2: writes refcount (2) # Should be 3!
The GIL is a design choice — simpler implementation, faster single-threaded performance, easier C extension development. Other Python implementations (Jython, IronPython) don’t have it.
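Reference counting itself is easy to observe. A small illustration using sys.getrefcount (the reported count includes the temporary reference created by the call itself, and exact values can vary between CPython versions):
import sys

x = []
print(sys.getrefcount(x))  # 2: the name x plus the temporary argument reference

y = x                      # bind a second name to the same list
print(sys.getrefcount(x))  # 3

del y
print(sys.getrefcount(x))  # back to 2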
When the GIL Releases ¶
The GIL isn’t held 100% of the time. It releases:
- At regular time intervals (every 5 ms by default, configurable via sys.setswitchinterval; see the snippet below)
- During I/O operations (file reads, network calls, time.sleep)
- During certain C extension calls (NumPy operations, some database drivers)
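You can inspect and tune that interval at runtime. A quick check (values are in seconds):
import sys

print(sys.getswitchinterval())  # 0.005 by default: a thread holding the GIL is asked to yield every 5 ms
sys.setswitchinterval(0.01)     # a longer interval may reduce switching overhead at the cost of responsiveness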
This is why threading does work for I/O-bound Python:
import threading
import requests
def fetch(url):
    return requests.get(url)  # GIL released during network I/O
# These run concurrently, not sequentially
threads = [threading.Thread(target=fetch, args=(url,)) for url in urls]
for t in threads: t.start()
for t in threads: t.join()
When the GIL Hurts ¶
CPU-bound pure Python code cannot parallelize with threads:
import threading
def cpu_work():
    total = 0
    for i in range(10_000_000):
        total += i * i
    return total
# With threads - actually SLOWER than sequential due to GIL contention
threads = [threading.Thread(target=cpu_work) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
Four threads doing CPU work on four cores will be slower than one thread, because they’re constantly fighting over the GIL.
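You can verify this on your own machine with a rough timing sketch (cpu_work repeated here so the snippet stands alone; exact numbers depend on hardware and Python version):
import threading
import time

def cpu_work():
    total = 0
    for i in range(10_000_000):
        total += i * i
    return total

start = time.perf_counter()
for _ in range(4):
    cpu_work()  # sequential baseline
print(f"Sequential: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
threads = [threading.Thread(target=cpu_work) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(f"Threaded:   {time.perf_counter() - start:.2f}s")  # typically no faster, often slightly slower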
Threading vs Multiprocessing vs asyncio ¶
Python offers three concurrency models. Each has its place.
Threading ¶
What it is: OS threads, shared memory, GIL-limited
Good for:
- I/O-bound work (network calls, file I/O, database queries)
- Blocking on external resources
- Simple producer/consumer patterns
Bad for:
- CPU-bound work (GIL prevents parallelism)
- Heavy shared mutable state (threads share memory, but coordinating writes safely is tricky)
import requests
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls):
    with ThreadPoolExecutor(max_workers=10) as executor:
        return list(executor.map(requests.get, urls))
asyncio ¶
What it is: Single-threaded event loop, cooperative multitasking
Good for:
- High-concurrency I/O (thousands of connections)
- When you control the code (can make everything async)
- Network services, web scraping
Bad for:
- CPU-bound work (still single-threaded)
- Mixing with blocking code (a single blocking call stalls the whole event loop; see the sketch after the example below)
- Libraries that aren’t async-aware
import asyncio
import aiohttp
async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [session.get(url) for url in urls]
        return await asyncio.gather(*tasks)
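The blocking-code pitfall from the list above is worth seeing concretely. A sketch, where slow_query is a stand-in for any synchronous library call you can't make async (asyncio.to_thread requires Python 3.9+):
import asyncio
import time

def slow_query():
    time.sleep(1)  # placeholder for a blocking library call
    return 42

async def bad():
    return slow_query()  # stalls the entire event loop for a full second

async def good():
    # Hand the blocking call to a worker thread so the loop keeps running
    return await asyncio.to_thread(slow_query)

print(asyncio.run(good()))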
Multiprocessing ¶
What it is: Separate Python processes, no shared memory (by default), no GIL limitation
Good for:
- CPU-bound work that needs true parallelism
- Isolation (one process crashing doesn’t kill others)
- Memory-intensive independent tasks
Bad for:
- Tasks requiring shared state (coordination is expensive)
- Small tasks (process overhead dominates)
- I/O-bound work (threading or asyncio is simpler and often faster)
from multiprocessing import Pool
def cpu_work(n):
    return sum(i * i for i in range(n))

with Pool(4) as p:
    results = p.map(cpu_work, [10_000_000] * 4)  # True parallelism
The Real Cost of Multiprocessing ¶
Multiprocessing isn’t free. Understanding the costs helps you decide when it’s worth it.
Process Creation Overhead ¶
Spawning a process is expensive — typically 10-100ms:
import time
from multiprocessing import Process

def noop():
    pass  # a named function: lambdas can't be pickled under the spawn start method

start = time.time()
processes = [Process(target=noop) for _ in range(100)]
for p in processes: p.start()
for p in processes: p.join()
print(f"100 processes: {time.time() - start:.2f}s") # ~1-5 seconds
For small tasks, this overhead dominates. A task that takes 1ms but costs 50ms to spawn into a new process is a net loss.
Solution: Reuse processes with Pool:
from multiprocessing import Pool
# Process creation happens once
with Pool(4) as p:
    # Thousands of tasks, only 4 processes
    results = p.map(small_task, items)
Serialization (Pickling) Costs ¶
Data passed between processes must be serialized. Python uses pickle by default.
What gets pickled:
- Function arguments
- Return values
- Any data shared via Queue, Pipe, etc.
Costs:
- CPU time to serialize/deserialize
- Memory to hold serialized data
- I/O to transfer between processes
import pickle
import numpy as np
# Large NumPy array
arr = np.random.rand(1000, 1000)
# How much does pickling cost?
import time
start = time.time()
for _ in range(100):
    data = pickle.dumps(arr)
    pickle.loads(data)
print(f"Pickle roundtrip: {(time.time() - start) / 100 * 1000:.2f}ms")
# Typically 5-20ms per roundtrip for this size
If you’re passing large objects and the work per object is small, pickling dominates:
# Bad: Pickle overhead > work
def tiny_work(large_array):
    return large_array.sum()  # Microseconds of work

with Pool(4) as p:
    # Each call pickles the large array - terrible performance
    results = p.map(tiny_work, large_arrays)

# Better: Pass indices, let workers load their own data
def worker_with_shared_data(indices, data_path):
    data = load_data(data_path)  # Each process loads once
    return [data[i].sum() for i in indices]
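A sketch of how that index-based worker might be driven, splitting the index range so each worker receives only a small range object (load_data, data_path, and the worker count are placeholders):
from functools import partial
from multiprocessing import Pool

def dispatch(n_items, data_path, n_workers=4):
    # One contiguous block of indices per worker
    step = max(1, n_items // n_workers)
    blocks = [range(i, min(i + step, n_items)) for i in range(0, n_items, step)]
    worker = partial(worker_with_shared_data, data_path=data_path)
    with Pool(n_workers) as p:
        return p.map(worker, blocks)  # only small index ranges get pickled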
Memory Overhead ¶
Each process has its own Python interpreter and memory space:
import os
from multiprocessing import Pool
def memory_hog(_):
    # Each process allocates this independently
    big_list = list(range(10_000_000))
    return sum(big_list)

# 4 processes × 400MB each = 1.6GB
with Pool(4) as p:
    results = p.map(memory_hog, range(4))
On a machine with 8GB RAM, spawning too many memory-hungry processes leads to swapping and terrible performance.
Estimate before running:
import sys

# Rough estimate of per-process memory; sys.getsizeof doesn't follow references,
# so treat this as a lower bound (load_typical_workload stands in for your own loader)
data = load_typical_workload()
print(f"Estimated memory per process: {sys.getsizeof(data) / 1e6:.1f}MB")

# Don't spawn more processes than memory allows
max_processes = available_memory_mb // memory_per_process_mb
Patterns That Work ¶
The Worker Pool ¶
The most common pattern — reuse processes, distribute work:
from multiprocessing import Pool
from functools import partial
def process_item(item, config):
    # Do CPU-intensive work
    result = heavy_computation(item, config)
    return result

def main():
    items = load_items()
    config = load_config()
    # partial lets us pass extra arguments
    worker = partial(process_item, config=config)
    with Pool() as p:  # Default: cpu_count() processes
        results = p.map(worker, items)
    return results
Choosing pool size:
- CPU-bound: Pool(os.cpu_count()) or slightly less
- Mixed I/O and CPU: Experiment; often 2 * os.cpu_count() works
- Memory-constrained: Calculate based on per-process memory (see the sketch below)
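A rough sizing sketch that combines the CPU and memory constraints above (memory_per_process_mb and available_memory_mb are estimates you supply, e.g. from the measurement in the Memory Overhead section):
import os

def pool_size(memory_per_process_mb, available_memory_mb, io_bound=False):
    # CPU limit: one worker per core for CPU-bound work, oversubscribe a bit when mixed with I/O
    cpu_limit = (os.cpu_count() or 1) * (2 if io_bound else 1)
    # Memory limit: don't plan to use more than the machine has
    mem_limit = max(1, available_memory_mb // memory_per_process_mb)
    return max(1, min(cpu_limit, mem_limit))

print(pool_size(memory_per_process_mb=400, available_memory_mb=8000))  # e.g. 8 on an 8-core box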
Chunking for Small Tasks ¶
If individual tasks are small, the overhead of dispatching each one hurts:
# Bad: High dispatch overhead
with Pool(4) as p:
    results = p.map(tiny_function, million_items)

# Better: Chunk the work
def process_chunk(chunk):
    return [tiny_function(item) for item in chunk]

chunks = [million_items[i:i+1000] for i in range(0, len(million_items), 1000)]
with Pool(4) as p:
    chunk_results = p.map(process_chunk, chunks)
results = [r for chunk in chunk_results for r in chunk]

# Or use the chunksize parameter
with Pool(4) as p:
    results = p.map(tiny_function, million_items, chunksize=1000)
Shared Memory for Large Data ¶
When multiple processes need the same read-only data, don’t pickle it repeatedly:
from multiprocessing import shared_memory, Pool
import numpy as np
def create_shared_array(data):
    """Create a shared memory array from a numpy array."""
    shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
    shared_array = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
    shared_array[:] = data[:]
    return shm

def worker(args):
    shm_name, shape, dtype, indices = args
    # Attach to existing shared memory
    shm = shared_memory.SharedMemory(name=shm_name)
    data = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    # Work with data (read-only!)
    result = data[indices].sum()
    shm.close()  # Detach, don't unlink
    return result

def main():
    # Large dataset - only stored once in memory
    data = np.random.rand(10000, 10000)
    shm = create_shared_array(data)
    # Workers receive only small arguments
    work_items = [
        (shm.name, data.shape, data.dtype, slice(i*1000, (i+1)*1000))
        for i in range(10)
    ]
    with Pool(4) as p:
        results = p.map(worker, work_items)
    shm.close()
    shm.unlink()  # Clean up
    return results
Progress Tracking ¶
Long-running multiprocessing jobs need visibility:
from multiprocessing import Pool
from tqdm import tqdm
def process_item(item):
    # ... work ...
    return result

def main():
    items = load_items()
    with Pool() as p:
        # imap returns results as they complete
        results = list(tqdm(
            p.imap(process_item, items),
            total=len(items),
            desc="Processing"
        ))
    return results
For unordered results (faster when tasks vary in duration):
results = list(tqdm(
    p.imap_unordered(process_item, items),
    total=len(items)
))
The Gotchas ¶
Zombie Processes ¶
Processes that aren’t joined become zombies, consuming resources:
from multiprocessing import Process, Pool

# Bad: No cleanup
def bad_parallel():
    processes = [Process(target=work) for _ in range(10)]
    for p in processes: p.start()
    # Function returns without joining - zombies!

# Good: Always join or use a context manager
def good_parallel():
    processes = [Process(target=work) for _ in range(10)]
    for p in processes: p.start()
    for p in processes: p.join()  # Wait for completion

# Better: Use Pool with a context manager
def better_parallel():
    with Pool(10) as p:  # Automatic cleanup
        results = p.map(work, items)
Pickling Failures ¶
Not everything pickles. Common failures:
# Lambda functions - don't pickle
with Pool() as p:
    p.map(lambda x: x*2, items)  # PicklingError!

# Closures over unpicklable objects
connection = db.connect()  # Can't pickle connections

def worker(item):
    return connection.execute(...)  # Fails

# Fix: Create resources inside the worker
def worker(item):
    connection = db.connect()  # Each process creates its own
    try:
        return connection.execute(...)
    finally:
        connection.close()
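If connecting per item is too costly, a common middle ground is one connection per worker process via a Pool initializer. A sketch along the lines of the hypothetical db module above:
from multiprocessing import Pool

connection = None  # populated once in each worker process

def init_worker():
    global connection
    connection = db.connect()  # one connection per process, created at worker startup

def worker(item):
    return connection.execute(...)  # reuses the process-local connection

if __name__ == "__main__":
    with Pool(4, initializer=init_worker) as p:
        results = p.map(worker, items)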
Global State Confusion ¶
Processes don’t share memory. Global modifications don’t propagate:
from multiprocessing import Pool

counter = 0

def increment(_):
    global counter
    counter += 1
    return counter

with Pool(4) as p:
    results = p.map(increment, range(100))

print(counter)  # Still 0! Each process has its own counter
print(results)  # Low values that restart per worker - each process counted independently
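If you genuinely need a shared counter, multiprocessing offers explicit shared state, at a synchronization cost. A minimal sketch using multiprocessing.Value:
from multiprocessing import Process, Value

def increment(counter, n):
    for _ in range(n):
        with counter.get_lock():  # explicit locking - shared state isn't free
            counter.value += 1

if __name__ == "__main__":
    counter = Value("i", 0)  # a shared C int, visible to all child processes
    procs = [Process(target=increment, args=(counter, 1000)) for _ in range(4)]
    for p in procs: p.start()
    for p in procs: p.join()
    print(counter.value)  # 4000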
The Fork Bomb ¶
Accidentally spawning processes in a loop:
# Bad: Each worker spawns more workers
def recursive_worker(depth):
    if depth > 0:
        with Pool(2) as p:  # Spawns a new pool inside every process!
            p.map(recursive_worker, [depth - 1] * 4)

# The number of processes grows exponentially with depth - crashes fast
Rule: Only spawn processes from the main process:
if __name__ == "__main__":
    with Pool() as p:
        results = p.map(worker, items)
macOS/Windows Fork vs Spawn ¶
On Linux, multiprocessing uses fork() by default — child processes get a copy of parent memory.
On macOS (Python 3.8+) and Windows, it uses spawn — child processes start fresh and import your module.
This breaks code that assumes fork:
# Works on Linux (fork), fails on macOS/Windows (spawn)
def worker(idx):
    return big_data[idx]  # Assumes big_data already exists in the child

if __name__ == "__main__":
    big_data = load_data()  # Loaded once, in the parent only
    with Pool() as p:
        results = p.map(worker, range(100))  # NameError in the workers on macOS!
Fix: Pass data explicitly or use initializers:
def init_worker(data):
    global big_data
    big_data = data

def worker(idx):
    return big_data[idx]

if __name__ == "__main__":
    data = load_data()
    with Pool(initializer=init_worker, initargs=(data,)) as p:
        results = p.map(worker, range(100))
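You can also check which start method is in effect, and pin one explicitly when you want identical behavior across platforms. A small sketch (using a context keeps the choice local instead of mutating global state):
import multiprocessing as mp

if __name__ == "__main__":
    print(mp.get_start_method())  # 'fork' on Linux, 'spawn' on macOS and Windows
    # Request spawn everywhere so the code behaves the same on all platforms
    ctx = mp.get_context("spawn")
    with ctx.Pool(2) as p:
        print(p.map(abs, [-2, -1, 3]))  # [2, 1, 3]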
Decision Framework ¶
When to use what:
| Scenario | Best Choice | Why |
|---|---|---|
| Web requests, file I/O | threading or asyncio | GIL releases during I/O |
| 1000s of network connections | asyncio | Lower overhead than threads |
| CPU-bound, independent tasks | multiprocessing | Bypasses GIL |
| CPU-bound, shared large data | multiprocessing + shared memory | Avoids pickle overhead |
| Small CPU tasks, many items | multiprocessing.Pool with chunking | Amortizes overhead |
| NumPy/Pandas heavy computation | Often neither! | These release the GIL internally |
The NumPy exception: Many NumPy operations release the GIL while they crunch large arrays in C. If your “CPU work” is mostly NumPy:
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def numpy_work(arr):
    # These operations release the GIL
    return np.fft.fft(arr).sum()

# Threading actually works here!
with ThreadPoolExecutor(4) as e:
    results = list(e.map(numpy_work, arrays))
Measuring Before Committing ¶
Don’t assume parallelism helps. Measure:
import time
from multiprocessing import Pool
def benchmark(func, args, n_runs=5):
    times = []
    for _ in range(n_runs):
        start = time.time()
        func(args)
        times.append(time.time() - start)
    return sum(times) / len(times)

# Sequential
def sequential(items):
    return [process(item) for item in items]

# Parallel
def parallel(items):
    with Pool() as p:
        return p.map(process, items)

items = load_items()
print(f"Sequential: {benchmark(sequential, items):.2f}s")
print(f"Parallel: {benchmark(parallel, items):.2f}s")
If parallel isn’t at least 2-3x faster with 4+ cores, the overhead is eating your gains.