Go Concurrency: Beyond Goroutines and Channels


Go’s concurrency model is often reduced to “goroutines are cheap, channels are cool.” That’s true, but it misses the deeper story: why Go’s approach works, when it doesn’t, and the subtle bugs waiting to bite you in production.

Before Go, you had two mainstream options for concurrent programming:

Threads: OS-level constructs. Heavy (1-8MB stack each), expensive to create, limited by kernel scheduling overhead. A server with 10,000 concurrent connections needs 10,000 threads — good luck.

Async/Await (Event loops): Single-threaded with non-blocking I/O. Efficient, but your code becomes callback spaghetti or colored functions (async infects everything it touches). Node.js, Python’s asyncio, Rust’s tokio.

Go introduced a third path: goroutines with a userspace scheduler.

A goroutine is a function executing concurrently with other goroutines in the same address space. But it’s not an OS thread.

What makes them cheap:

  • Small initial stack: 2KB (vs 1-8MB for threads), grows dynamically
  • Userspace scheduling: Go’s runtime multiplexes goroutines onto OS threads, no kernel context switches
  • Fast creation: ~200ns to spawn a goroutine, versus microseconds or more for an OS thread

You can realistically run millions of goroutines. This changes how you think about concurrency — spawning a goroutine per request isn’t just acceptable, it’s the intended design.
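
To get a sense of the scale, here's a minimal, self-contained sketch that spawns a million goroutines; the trivial increment is just a placeholder for real work:

package main

import (
    "fmt"
    "sync"
)

func main() {
    const n = 1_000_000

    var (
        wg    sync.WaitGroup
        mu    sync.Mutex
        total int
    )

    for i := 0; i < n; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            mu.Lock()
            total++ // placeholder work; the point is the goroutine count
            mu.Unlock()
        }()
    }
    wg.Wait()
    fmt.Println("goroutines completed:", total) // 1000000
}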

Go’s scheduler uses three entities:

  • G (Goroutine): The unit of work
  • M (Machine): An OS thread
  • P (Processor): A logical processor, holds the run queue
    P0              P1              P2
    ┌─────┐         ┌─────┐         ┌─────┐
    │ Run │         │ Run │         │ Run │
    │Queue│         │Queue│         │Queue│
    │G G G│         │G G  │         │G G G│
    └──┬──┘         └──┬──┘         └──┬──┘
       │               │               │
       ▼               ▼               ▼
    ┌─────┐         ┌─────┐         ┌─────┐
    │  M  │         │  M  │         │  M  │
    │(OS) │         │(OS) │         │(OS) │
    └─────┘         └─────┘         └─────┘

The number of P's equals GOMAXPROCS, which defaults to the number of CPU cores. This is the true parallelism limit: you can have millions of G's, but only GOMAXPROCS of them run at any instant.
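
You can query or change the limit at runtime; a quick sketch (passing 0 to GOMAXPROCS only reads the current value):

package main

import (
    "fmt"
    "runtime"
)

func main() {
    fmt.Println("CPU cores: ", runtime.NumCPU())
    fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0)) // 0 = query without changing
    runtime.GOMAXPROCS(4)                             // cap parallelism at 4 P's (example value)
}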

Work stealing: When a P’s run queue is empty, it steals goroutines from other P’s. This keeps all cores busy without explicit load balancing.

Goroutines yield control at specific points:

  • Channel operations (send/receive)
  • System calls (I/O, sleep)
  • Function calls (allows stack check, potential preemption)
  • Explicit runtime.Gosched()

The gotcha: A tight CPU-bound loop without function calls can block other goroutines on that P:

// Bad: This can starve other goroutines
func cpuHog() {
    for {
        // Pure computation, no function calls
        x := 0
        for i := 0; i < 1e9; i++ {
            x += i
        }
    }
}

Go 1.14 introduced asynchronous preemption (via signals) to mitigate this, but it’s not perfect. Design your code to yield.
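
One way to design for yielding is to call runtime.Gosched() every so often inside a hot loop. A sketch of the idea (the yield interval is arbitrary):

// Periodically yield so other goroutines on this P get scheduled.
func cpuFriendly(n int) int {
    x := 0
    for i := 0; i < n; i++ {
        x += i
        if i%100_000 == 0 {
            runtime.Gosched() // voluntarily hand the P back to the scheduler
        }
    }
    return x
}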

Channels are typed conduits for communication. But thinking of them as just “concurrent queues” misses the point.

Channels are synchronization primitives that happen to transfer data.

An unbuffered channel blocks both sender and receiver until both are ready:

ch := make(chan int) // unbuffered

// Goroutine A
ch <- 42  // blocks until someone receives

// Goroutine B
x := <-ch // blocks until someone sends

This is a rendezvous — both goroutines must arrive at the channel operation for either to proceed. It’s a synchronization point, not just data transfer.

Use case: When you need to ensure one goroutine has completed a step before another proceeds.
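
A minimal sketch of that kind of rendezvous, using a close as the completion signal (loadConfig and serveRequests are hypothetical stand-ins):

func startup() {
    done := make(chan struct{})

    go func() {
        loadConfig() // step that must complete first (hypothetical)
        close(done)  // closing never blocks and wakes every receiver
    }()

    <-done          // blocks until the goroutine above has finished
    serveRequests() // hypothetical next step
}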

Buffered channels allow sends to proceed without a receiver (up to the buffer size):

ch := make(chan int, 10) // buffer of 10

ch <- 1  // doesn't block (buffer not full)
ch <- 2  // doesn't block
// ... up to 10 sends without blocking

Use case: Decoupling producer and consumer speeds, work queues, rate limiting.

The trap: People buffer channels to “fix” deadlocks. This usually masks the bug temporarily — the deadlock reappears under load when the buffer fills.
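
A sketch of how that plays out: the code below "works" as long as sends fit in the buffer, then deadlocks once load exceeds it (n stands in for request volume):

func maskedDeadlock(n int) {
    ch := make(chan int, 10)
    for i := 0; i < n; i++ {
        ch <- i // fine while the buffer has room; blocks forever once it's full
    }
    // There is no receiver. With n <= 10 this appears to work;
    // with n > 10 the 11th send blocks and the deadlock is back.
}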

Fan-out: Multiple goroutines reading from the same channel.

func worker(id int, jobs <-chan Job, results chan<- Result) {
    for job := range jobs {
        results <- process(job)
    }
}

func main() {
    jobs := make(chan Job, 100)
    results := make(chan Result, 100)
    
    // Start workers
    for i := 0; i < 10; i++ {
        go worker(i, jobs, results)
    }
    
    // Send jobs
    for _, job := range allJobs {
        jobs <- job
    }
    close(jobs)
    
    // Collect results
    for range allJobs {
        <-results
    }
}

Fan-in: Multiple goroutines sending to the same channel.

func merge(channels ...<-chan int) <-chan int {
    out := make(chan int)
    var wg sync.WaitGroup
    
    for _, ch := range channels {
        wg.Add(1)
        go func(c <-chan int) {
            defer wg.Done()
            for v := range c {
                out <- v
            }
        }(ch)
    }
    
    go func() {
        wg.Wait()
        close(out)
    }()
    
    return out
}
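
A usage sketch: merge composes naturally with the gen generator defined in the pipeline example below.

func main() {
    a := gen(1, 2, 3) // gen as defined in the pipeline example below
    b := gen(4, 5, 6)
    for v := range merge(a, b) {
        fmt.Println(v) // 1..6, in nondeterministic order
    }
}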

Pipeline: Chain of stages connected by channels.

func gen(nums ...int) <-chan int {
    out := make(chan int)
    go func() {
        for _, n := range nums {
            out <- n
        }
        close(out)
    }()
    return out
}

func square(in <-chan int) <-chan int {
    out := make(chan int)
    go func() {
        for n := range in {
            out <- n * n
        }
        close(out)
    }()
    return out
}

func main() {
    // Pipeline: gen -> square -> print
    for n := range square(gen(1, 2, 3, 4)) {
        fmt.Println(n)
    }
}

A goroutine that never terminates is a memory leak. Common causes:

Blocked on channel forever:

func leak() {
    ch := make(chan int)
    
    go func() {
        val := <-ch  // blocks forever - nothing sends to ch
        fmt.Println(val)
    }()
    
    // Function returns, but goroutine lives on, waiting forever
}
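
One fix is to give the receive an escape hatch, for example by selecting on a cancellation signal (a sketch; the context is assumed to come from the caller):

func noLeak(ctx context.Context) {
    ch := make(chan int)

    go func() {
        select {
        case val := <-ch:
            fmt.Println(val)
        case <-ctx.Done():
            return // exit instead of waiting forever
        }
    }()

    // Even if nothing ever sends on ch, cancelling ctx lets the goroutine exit.
}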

Unbounded goroutine spawning:

func handler(requests <-chan Request) {
    for req := range requests {
        // New goroutine per request - if processing is slow,
        // these accumulate
        go process(req)
    }
}
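
One common bound is a counting semaphore built from a buffered channel (a sketch; the limit of 100 is a placeholder):

func boundedHandler(requests <-chan Request) {
    sem := make(chan struct{}, 100) // at most 100 goroutines in flight

    for req := range requests {
        sem <- struct{}{} // blocks here when the limit is reached
        go func(r Request) {
            defer func() { <-sem }() // free the slot when done
            process(r)
        }(req)
    }
}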

Detection: Monitor runtime.NumGoroutine() over time. In tests, check goroutine count before and after.

func TestNoLeaks(t *testing.T) {
    before := runtime.NumGoroutine()
    
    // ... run test
    
    // Give goroutines time to exit
    time.Sleep(100 * time.Millisecond)
    
    after := runtime.NumGoroutine()
    if after > before {
        t.Errorf("Goroutine leak: %d before, %d after", before, after)
    }
}

Circular dependency:

func deadlock() {
    ch1 := make(chan int)
    ch2 := make(chan int)
    
    go func() {
        <-ch1      // waits for ch1
        ch2 <- 1   // then sends to ch2
    }()
    
    go func() {
        <-ch2      // waits for ch2
        ch1 <- 1   // then sends to ch1
    }()
    
    // Both goroutines wait forever
}

Self-deadlock:

func selfDeadlock() {
    ch := make(chan int)
    
    ch <- 1   // blocks - no receiver
    x := <-ch // never reached
    fmt.Println(x)
}

Go’s runtime detects some deadlocks (“all goroutines are asleep”) but not all — if there’s any goroutine that could theoretically make progress (even if it won’t), no panic.

Data races occur when goroutines share memory without synchronization:

func race() {
    counter := 0
    
    for i := 0; i < 1000; i++ {
        go func() {
            counter++  // DATA RACE: read-modify-write without sync
        }()
    }
    
    time.Sleep(time.Second)
    fmt.Println(counter)  // Not 1000. Different every run.
}

Detection: Run with -race flag:

go run -race main.go
go test -race ./...

The race detector has ~10x CPU overhead and ~5-10x memory overhead. Use it in tests, not production.

Fixes:

// Option 1: Mutex
var mu sync.Mutex
mu.Lock()
counter++
mu.Unlock()

// Option 2: Atomic
var counter int64
atomic.AddInt64(&counter, 1)

// Option 3: Channel (move data, not share it)
results := make(chan int, 1000)
for i := 0; i < 1000; i++ {
    go func() { results <- 1 }()
}
total := 0
for i := 0; i < 1000; i++ {
    total += <-results
}

When a context is cancelled, goroutines should exit promptly:

func badWorker(ctx context.Context) {
    for {
        // Does work but never checks ctx
        doExpensiveWork()
    }
}

func goodWorker(ctx context.Context) {
    for {
        select {
        case <-ctx.Done():
            return  // Exit when cancelled
        default:
            doExpensiveWork()
        }
    }
}

For long-running operations, check ctx.Done() periodically:

func goodWorker(ctx context.Context) error {
    for i := 0; i < 1000000; i++ {
        if i%1000 == 0 {  // Check every 1000 iterations
            select {
            case <-ctx.Done():
                return ctx.Err()
            default:
            }
        }
        doWork(i)
    }
    return nil
}
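
A usage sketch tying it together (the 2-second timeout is arbitrary):

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    if err := goodWorker(ctx); err != nil {
        fmt.Println("stopped early:", err) // typically context.DeadlineExceeded
    }
}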

Concurrency isn’t free. Each goroutine has overhead, channels have synchronization costs, and parallel code is harder to reason about.

For CPU-bound tasks, more goroutines than cores just adds scheduling overhead:

// Bad: 10,000 goroutines for CPU work on 8 cores
for i := 0; i < 10000; i++ {
    go cpuIntensiveTask(data[i])
}

// Better: Worker pool sized to cores
numWorkers := runtime.GOMAXPROCS(0)
jobs := make(chan Data, len(data))
results := make(chan Result, len(data))

for i := 0; i < numWorkers; i++ {
    go worker(jobs, results)
}

for _, d := range data {
    jobs <- d
}
close(jobs)

// Drain one result per job so the work is actually collected
for range data {
    <-results
}

Channels have overhead (~50-100ns per operation). For very fine-grained work, this dominates:

// Bad: Channel send per number to sum
func sumViaChan(nums []int) int {
    ch := make(chan int)
    go func() {
        for _, n := range nums {
            ch <- n
        }
        close(ch)
    }()
    
    sum := 0
    for n := range ch {
        sum += n
    }
    return sum
}

// Good: Just sum directly (or batch if parallelizing)
func sumDirect(nums []int) int {
    sum := 0
    for _, n := range nums {
        sum += n
    }
    return sum
}

Rule of thumb: If the work per channel operation is less than ~1µs, the channel overhead matters.
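
If you do want to parallelize something like a sum, batch the work so each goroutine and each channel operation covers many elements. A sketch that splits the slice into per-worker chunks:

func sumParallel(nums []int) int {
    workers := runtime.GOMAXPROCS(0)
    if workers > len(nums) {
        workers = len(nums)
    }
    if workers == 0 {
        return 0
    }

    chunk := (len(nums) + workers - 1) / workers // ceiling division
    partials := make(chan int, workers)
    spawned := 0

    for start := 0; start < len(nums); start += chunk {
        end := start + chunk
        if end > len(nums) {
            end = len(nums)
        }
        spawned++
        go func(part []int) {
            s := 0
            for _, n := range part {
                s += n
            }
            partials <- s // one send per chunk, not per element
        }(nums[start:end])
    }

    total := 0
    for i := 0; i < spawned; i++ {
        total += <-partials
    }
    return total
}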

When goroutines on different cores write to adjacent memory locations that share a CPU cache line, the line ping-pongs between cores. This is false sharing:

type Counters struct {
    a int64  // These are likely on the same cache line
    b int64
}

var c Counters

// Two goroutines incrementing different fields
// but causing cache line contention
go func() { 
    for i := 0; i < 1e8; i++ { 
        atomic.AddInt64(&c.a, 1) 
    } 
}()
go func() { 
    for i := 0; i < 1e8; i++ { 
        atomic.AddInt64(&c.b, 1) 
    } 
}()

Fix: Pad to separate cache lines:

type Counters struct {
    a int64
    _ [56]byte  // Padding to push b to next cache line
    b int64
}
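
If you'd rather not hard-code the 56 bytes, the golang.org/x/sys/cpu package provides an architecture-sized pad (one option, not a requirement):

type Counters struct {
    a int64
    _ cpu.CacheLinePad // from golang.org/x/sys/cpu, sized per architecture
    b int64
}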

Let’s measure actual overhead on a real task: fetching 100 URLs.

func BenchmarkSequential(b *testing.B) {
    for i := 0; i < b.N; i++ {
        for _, url := range urls {
            fetch(url)
        }
    }
}

func BenchmarkConcurrent(b *testing.B) {
    for i := 0; i < b.N; i++ {
        var wg sync.WaitGroup
        for _, url := range urls {
            wg.Add(1)
            go func(u string) {
                defer wg.Done()
                fetch(u)
            }(url)
        }
        wg.Wait()
    }
}

func BenchmarkWorkerPool(b *testing.B) {
    for i := 0; i < b.N; i++ {
        jobs := make(chan string, len(urls))
        var wg sync.WaitGroup
        
        // 10 workers
        for w := 0; w < 10; w++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for url := range jobs {
                    fetch(url)
                }
            }()
        }
        
        for _, url := range urls {
            jobs <- url
        }
        close(jobs)
        wg.Wait()
    }
}

Typical results (100 URLs, ~100ms average latency each):

BenchmarkSequential-8      1    10234567890 ns/op  (~10s)
BenchmarkConcurrent-8     10     123456789 ns/op  (~120ms)
BenchmarkWorkerPool-8      1    1123456789 ns/op  (~1.1s)

Concurrent is ~80x faster than sequential for I/O-bound work. The worker pool caps in-flight requests at 10, so 100 URLs take roughly ten rounds (~1.1s): slower than unbounded fan-out, but with predictable, bounded resource usage. Size the pool to the concurrency you actually want.

For CPU-bound work, the story is different — concurrent won’t beat sequential unless you have multiple cores and can actually parallelize.

Go’s concurrency model is powerful because it’s simple enough to use casually but sophisticated enough to scale. The key insights:

  1. Goroutines are cheap — spawn them freely for I/O-bound work
  2. Channels synchronize, not just communicate — think about synchronization needs first
  3. The runtime does a lot — but you can still block it with bad code
  4. Use the race detector — data races are subtle and deadly
  5. Not everything needs concurrency — measure before optimizing

The best concurrent code is the simplest code that achieves the required parallelism. Start sequential, add concurrency where profiling shows it helps.