Python Parallelism: The GIL, Multiprocessing, and When Each Matters


Python’s concurrency story is often summarized as “there’s a GIL, so use multiprocessing.” That’s dangerously incomplete. The truth involves understanding what the GIL actually does, when it doesn’t matter, and the real costs of multiprocessing that nobody talks about.

The Global Interpreter Lock is a mutex that protects access to Python objects. Only one thread can execute Python bytecode at a time.

Why it exists: CPython’s memory management (reference counting) isn’t thread-safe. Without the GIL, two threads incrementing the same object’s reference count could corrupt it:

# Without GIL, this would be a race condition inside CPython:
# Thread 1: reads refcount (1)
# Thread 2: reads refcount (1)
# Thread 1: writes refcount (2)
# Thread 2: writes refcount (2)  # Should be 3!
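
Reference counts are real, observable values; sys.getrefcount reports them (note that the count includes the temporary reference created by the call itself):

import sys

x = []
print(sys.getrefcount(x))  # typically 2: the name x plus the argument passed to getrefcount
y = x
print(sys.getrefcount(x))  # 3: one more reference now points at the same list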

The GIL is a design choice — simpler implementation, faster single-threaded performance, easier C extension development. Other Python implementations (Jython, IronPython) don’t have it.

The GIL isn’t held 100% of the time. It is released:

  • Periodically: every 5 ms by default (time-based since Python 3.2, configurable via sys.setswitchinterval; see the snippet after this list)
  • During I/O operations (file reads, network calls, time.sleep)
  • During certain C extension calls (NumPy operations, some database drivers)
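
You can inspect and tune that switch interval at runtime:

import sys

# The interval is a time in seconds, not a bytecode count
print(sys.getswitchinterval())  # 0.005 by default (5 ms)
sys.setswitchinterval(0.01)     # longer interval: fewer forced GIL switches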

This is why threading does work for I/O-bound Python:

import threading
import requests

def fetch(url):
    return requests.get(url)  # GIL released during network I/O

# These run concurrently, not sequentially
threads = [threading.Thread(target=fetch, args=(url,)) for url in urls]
for t in threads: t.start()
for t in threads: t.join()

CPU-bound pure Python code cannot parallelize with threads:

import threading

def cpu_work():
    total = 0
    for i in range(10_000_000):
        total += i * i
    return total

# With threads - no faster than sequential, and often slower, due to GIL contention
threads = [threading.Thread(target=cpu_work) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()

Four threads doing CPU work on four cores end up no faster than one thread, and often slower, because they spend their time trading the GIL back and forth instead of computing.

Python offers three concurrency models. Each has its place.

Threading. What it is: OS threads, shared memory, GIL-limited

Good for:

  • I/O-bound work (network calls, file I/O, database queries)
  • Blocking on external resources
  • Simple producer/consumer patterns

Bad for:

  • CPU-bound work (GIL prevents parallelism)
  • Heavy shared mutable state (threads share memory, but coordinating writes safely is tricky)

import requests
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls):
    with ThreadPoolExecutor(max_workers=10) as executor:
        return list(executor.map(requests.get, urls))

Asyncio. What it is: a single-threaded event loop with cooperative multitasking

Good for:

  • High-concurrency I/O (thousands of connections)
  • When you control the code (can make everything async)
  • Network services, web scraping

Bad for:

  • CPU-bound work (still single-threaded)
  • Mixing with blocking code (one blocking call stalls the whole event loop; see the workaround after the example below)
  • Libraries that aren’t async-aware

import asyncio
import aiohttp
import aiohttp

async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [session.get(url) for url in urls]
        return await asyncio.gather(*tasks)
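
When you can’t avoid a blocking call inside async code, a standard escape hatch is asyncio.to_thread (Python 3.9+), which runs the call in a worker thread so the event loop keeps servicing other tasks. A minimal sketch:

import asyncio
import requests  # a blocking HTTP client, standing in for any blocking call

async def fetch_blocking(url):
    # requests.get runs in a thread; the event loop is free while it waits
    return await asyncio.to_thread(requests.get, url)

async def fetch_many(urls):
    return await asyncio.gather(*(fetch_blocking(url) for url in urls))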

Multiprocessing. What it is: separate Python processes, no shared memory (by default), no GIL limitation

Good for:

  • CPU-bound work that needs true parallelism
  • Isolation (one process crashing doesn’t kill others)
  • Memory-intensive independent tasks

Bad for:

  • Tasks requiring shared state (coordination is expensive)
  • Small tasks (process overhead dominates)
  • I/O-bound work (threading or asyncio is simpler and often faster)

from multiprocessing import Pool

def cpu_work(n):
    return sum(i * i for i in range(n))

with Pool(4) as p:
    results = p.map(cpu_work, [10_000_000] * 4)  # True parallelism

Multiprocessing isn’t free. Understanding the costs helps you decide when it’s worth it.

Spawning a process is expensive — typically 10-100ms:

import time
from multiprocessing import Process

def noop():
    pass  # a named function: lambdas can't be pickled under the spawn start method

start = time.time()
processes = [Process(target=noop) for _ in range(100)]
for p in processes: p.start()
for p in processes: p.join()
print(f"100 processes: {time.time() - start:.2f}s")  # roughly 1-5 seconds, platform-dependent

For small tasks, this overhead dominates. A task that takes 1ms but costs 50ms to spawn into a new process is a net loss.

Solution: Reuse processes with Pool:

from multiprocessing import Pool

# Process creation happens once
with Pool(4) as p:
    # Thousands of tasks, only 4 processes
    results = p.map(small_task, items)

Data passed between processes must be serialized. Python uses pickle by default.

What gets pickled:

  • Function arguments
  • Return values
  • Any data shared via Queue, Pipe, etc.

Costs:

  • CPU time to serialize/deserialize
  • Memory to hold serialized data
  • I/O to transfer between processes

import pickle
import numpy as np

# Large NumPy array
arr = np.random.rand(1000, 1000)

# How much does pickling cost?
import time
start = time.time()
for _ in range(100):
    data = pickle.dumps(arr)
    pickle.loads(data)
print(f"Pickle roundtrip: {(time.time() - start) / 100 * 1000:.2f}ms")
# Typically 5-20ms per roundtrip for this size

If you’re passing large objects and the work per object is small, pickling dominates:

# Bad: Pickle overhead > work
def tiny_work(large_array):
    return large_array.sum()  # Microseconds of work

with Pool(4) as p:
    # Each call pickles the large array - terrible performance
    results = p.map(tiny_work, large_arrays)

# Better: Pass indices, let workers load their own data
def worker_with_shared_data(indices, data_path):
    data = load_data(data_path)  # Each process loads once
    return [data[i].sum() for i in indices]

Each process has its own Python interpreter and memory space:

import os
from multiprocessing import Pool

def memory_hog(_):
    # Each process allocates this independently
    big_list = list(range(10_000_000))
    return sum(big_list)

# 4 processes × 400MB each = 1.6GB
with Pool(4) as p:
    results = p.map(memory_hog, range(4))

On a machine with 8GB RAM, spawning too many memory-hungry processes leads to swapping and terrible performance.

Estimate before running:

import sys

# Rough per-process estimate - note that sys.getsizeof is shallow: it measures
# the container itself, not the objects it references, so treat this as a lower bound
data = load_typical_workload()  # placeholder: load one representative workload
print(f"Estimated memory per process: {sys.getsizeof(data) / 1e6:.1f}MB")

# Don't spawn more processes than memory allows (both values in MB)
max_processes = available_memory_mb // memory_per_process_mb
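
To fill in available_memory_mb programmatically, one option (assuming the third-party psutil package is installed) is:

import os
import psutil  # third-party: pip install psutil

available_memory_mb = psutil.virtual_memory().available / 1e6
memory_per_process_mb = 400  # measured or estimated for your workload
cpu_limit = os.cpu_count() or 1
max_processes = max(1, min(cpu_limit, int(available_memory_mb // memory_per_process_mb)))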

The most common pattern — reuse processes, distribute work:

from multiprocessing import Pool
from functools import partial

def process_item(item, config):
    # Do CPU-intensive work
    result = heavy_computation(item, config)
    return result

def main():
    items = load_items()
    config = load_config()
    
    # partial lets us pass extra arguments
    worker = partial(process_item, config=config)
    
    with Pool() as p:  # Default: cpu_count() processes
        results = p.map(worker, items)
    
    return results

Choosing pool size:

  • CPU-bound: Pool(os.cpu_count()) or slightly less (see the sketch after this list)
  • Mixed I/O and CPU: Experiment, often 2 * os.cpu_count() works
  • Memory-constrained: Calculate based on per-process memory
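
A minimal sketch of the CPU-bound case, leaving one core free for the parent process and the OS (the exact margin is a judgment call):

import os
from multiprocessing import Pool

n_workers = max(1, (os.cpu_count() or 1) - 1)  # keep one core free for the parent / OS

with Pool(n_workers) as p:
    results = p.map(worker, items)  # worker and items as in the pattern above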

If individual tasks are small, the overhead of dispatching each one hurts:

# Bad: High dispatch overhead
with Pool(4) as p:
    results = p.map(tiny_function, million_items)

# Better: Chunk the work
def process_chunk(chunk):
    return [tiny_function(item) for item in chunk]

chunks = [million_items[i:i+1000] for i in range(0, len(million_items), 1000)]
with Pool(4) as p:
    chunk_results = p.map(process_chunk, chunks)
results = [r for chunk in chunk_results for r in chunk]

# Or use chunksize parameter
with Pool(4) as p:
    results = p.map(tiny_function, million_items, chunksize=1000)

When multiple processes need the same read-only data, don’t pickle it repeatedly:

from multiprocessing import shared_memory, Pool
import numpy as np

def create_shared_array(data):
    """Create a shared memory array from numpy array."""
    shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
    shared_array = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
    shared_array[:] = data[:]
    return shm

def worker(args):
    shm_name, shape, dtype, indices = args
    # Attach to existing shared memory
    shm = shared_memory.SharedMemory(name=shm_name)
    data = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    
    # Work with data (read-only!)
    result = data[indices].sum()
    
    shm.close()  # Detach, don't unlink
    return result

def main():
    # Large dataset - only stored once in memory
    data = np.random.rand(10000, 10000)
    
    shm = create_shared_array(data)
    
    # Workers receive only small arguments
    work_items = [
        (shm.name, data.shape, data.dtype, slice(i*1000, (i+1)*1000))
        for i in range(10)
    ]
    
    with Pool(4) as p:
        results = p.map(worker, work_items)
    
    shm.close()
    shm.unlink()  # Clean up
    
    return results

Long-running multiprocessing jobs need visibility:

from multiprocessing import Pool
from tqdm import tqdm

def process_item(item):
    # ... work ...
    return result

def main():
    items = load_items()
    
    with Pool() as p:
        # imap returns results as they complete
        results = list(tqdm(
            p.imap(process_item, items),
            total=len(items),
            desc="Processing"
        ))
    
    return results

For unordered results (faster when tasks vary in duration):

results = list(tqdm(
    p.imap_unordered(process_item, items),
    total=len(items)
))

Processes that aren’t joined become zombies, consuming resources:

# Bad: No cleanup
def bad_parallel():
    processes = [Process(target=work) for _ in range(10)]
    for p in processes: p.start()
    # Function returns without joining - zombies!

# Good: Always join or use context manager
def good_parallel():
    processes = [Process(target=work) for _ in range(10)]
    for p in processes: p.start()
    for p in processes: p.join()  # Wait for completion

# Better: Use Pool with context manager
def better_parallel():
    with Pool(10) as p:  # Automatic cleanup
        results = p.map(work, items)

Not everything pickles. Common failures:

# Lambda functions - don't pickle
with Pool() as p:
    p.map(lambda x: x*2, items)  # PicklingError!

# Closures over unpicklable objects
connection = db.connect()  # Can't pickle connections
def worker(item):
    return connection.execute(...)  # Fails

# Fix: Create resources inside the worker
def worker(item):
    connection = db.connect()  # Each process creates its own
    try:
        return connection.execute(...)
    finally:
        connection.close()
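
If per-item connection setup is too slow, a per-worker resource created in a Pool initializer amortizes the cost across every item that worker handles (a sketch reusing the hypothetical db.connect from above):

from multiprocessing import Pool

_connection = None

def init_worker():
    global _connection
    _connection = db.connect()  # one connection per worker process, created at pool startup

def worker(item):
    return _connection.execute(...)  # each call reuses that worker's connection

if __name__ == "__main__":
    with Pool(4, initializer=init_worker) as p:
        results = p.map(worker, items)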

Processes don’t share memory. Global modifications don’t propagate:

counter = 0

def increment(_):
    global counter
    counter += 1
    return counter

with Pool(4) as p:
    results = p.map(increment, range(100))

print(counter)  # Still 0! Each worker process has its own copy of counter
print(results)  # e.g. [1, 2, 1, 3, ...] - each worker counts up independently
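
If you genuinely need a cross-process counter, multiprocessing.Value places a single C-typed value in shared memory; passing it through a pool initializer keeps it working under both fork and spawn. A minimal sketch:

from multiprocessing import Pool, Value

def init_worker(shared_counter):
    global counter
    counter = shared_counter

def increment(_):
    with counter.get_lock():  # Value carries its own lock for atomic updates
        counter.value += 1

if __name__ == "__main__":
    counter = Value("i", 0)  # a shared C int, initialized to 0
    with Pool(4, initializer=init_worker, initargs=(counter,)) as p:
        p.map(increment, range(100))
    print(counter.value)  # 100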

Accidentally spawning processes in a loop:

# Bad: Each worker spawns more workers
def recursive_worker(depth):
    if depth > 0:
        with Pool(2) as p:  # Spawns in each process!
            p.map(recursive_worker, [depth-1] * 4)

# This creates 2^n processes - crashes fast

Rule: Only spawn processes from the main process:

if __name__ == "__main__":
    with Pool() as p:
        results = p.map(worker, items)

On Linux, multiprocessing uses fork() by default: child processes start with a copy-on-write copy of the parent’s memory.

On macOS (Python 3.8+) and Windows, it uses spawn — child processes start fresh and import your module.
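
You can check which start method is active and opt into a specific one through a context, which is handy for reproducing macOS/Windows behavior on Linux:

import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    print(mp.get_start_method())         # 'fork' on Linux, 'spawn' on macOS and Windows
    spawn_ctx = mp.get_context("spawn")  # use spawn everywhere to surface portability bugs early
    with spawn_ctx.Pool(4) as p:
        print(p.map(square, range(8)))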

This breaks code that assumes fork:

# Works on Linux (fork), fails on macOS/Windows (spawn)
def worker(idx):
    return big_data[idx]  # assumes big_data already exists in the worker

if __name__ == "__main__":
    big_data = load_data()  # defined only inside the parent's __main__ block

    with Pool() as p:
        results = p.map(worker, range(100))  # NameError in spawned workers!

Fix: Pass data explicitly or use initializers:

def init_worker(data):
    global big_data
    big_data = data

def worker(idx):
    return big_data[idx]

if __name__ == "__main__":
    data = load_data()
    with Pool(initializer=init_worker, initargs=(data,)) as p:
        results = p.map(worker, range(100))

When to use what:

Scenario                          Best Choice                           Why
Web requests, file I/O            threading or asyncio                  GIL is released during I/O
1000s of network connections      asyncio                               Lower overhead than threads
CPU-bound, independent tasks      multiprocessing                       Bypasses the GIL
CPU-bound, shared large data      multiprocessing + shared memory       Avoids pickle overhead
Small CPU tasks, many items       multiprocessing.Pool with chunking    Amortizes per-task overhead
NumPy/Pandas heavy computation    Often neither!                        These release the GIL internally

The NumPy exception: many NumPy operations release the GIL while they run compiled code. If your “CPU work” is mostly NumPy:

import numpy as np

def numpy_work(arr):
    # These operations release the GIL
    return np.fft.fft(arr).sum()

# Threading actually works here!
from concurrent.futures import ThreadPoolExecutor
with ThreadPoolExecutor(4) as e:
    results = list(e.map(numpy_work, arrays))

Don’t assume parallelism helps. Measure:

import time
from multiprocessing import Pool

def benchmark(func, args, n_runs=5):
    times = []
    for _ in range(n_runs):
        start = time.time()
        func(args)
        times.append(time.time() - start)
    return sum(times) / len(times)

# Sequential
def sequential(items):
    return [process(item) for item in items]

# Parallel
def parallel(items):
    with Pool() as p:
        return p.map(process, items)

items = load_items()
print(f"Sequential: {benchmark(sequential, items):.2f}s")
print(f"Parallel: {benchmark(parallel, items):.2f}s")

If parallel isn’t at least 2-3x faster with 4+ cores, the overhead is eating your gains.