Quick take
Profile first, fix allocations, bound your concurrency, tune your HTTP and DB layers. Everything else is noise.
At the fintech startup we run a bunch of Go services that handle financial data ingestion, NLP pipelines, and real-time news delivery. When I joined as CTO, some of these services were already showing strain under growing traffic. The Go runtime gives you great defaults, but defaults only take you so far. I spent a good chunk of 2018 hunting down latency spikes and memory bloat across our backend. Here is what I learned.
Measure Before You Touch Anything
I can’t stress this enough. I wasted a full day “optimizing” a JSON serialization path that turned out to account for 2% of our request latency. The actual bottleneck was connection pool exhaustion against Postgres. Embarrassing.
Before changing code, decide what “fast” means for your service:
- Latency percentiles — p50, p95, p99. Averages lie.
- Requests per second at a fixed error rate
- Allocations per operation and total heap size
- GC pause time and frequency
- Goroutine count — if this is climbing unbounded, you have a leak
Profile With pprof. Always.
Guessing is the enemy. pprof is built in and costs almost nothing to leave running on a debug port.
import (
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof handlers on the default mux
)

func main() {
	go func() {
		_ = http.ListenAndServe("localhost:6060", nil)
	}()
	// service startup
}
go tool pprof http://localhost:6060/debug/pprof/profile
go tool pprof http://localhost:6060/debug/pprof/heap
For contention and scheduling issues, execution traces are invaluable. Unlike go tool pprof, go tool trace reads a local file, so capture a few seconds first:
curl -o trace.out "http://localhost:6060/debug/pprof/trace?seconds=5"
go tool trace trace.out
I keep pprof enabled on every staging deployment. The one time I disabled it to “reduce overhead” was the one time I needed it most. Leave it on.
Allocations Are the Performance Killer
This was the single biggest lesson from our services. Allocation rate drives GC pressure, GC pressure drives latency spikes, latency spikes make your p99 look terrible. In one of our data ingestion services, cutting allocations by 40% dropped our p99 from 180ms to 45ms. Same hardware. Same traffic.
Preallocate Slices
If you know the size, tell Go. This is free performance.
items := make([]Item, 0, len(input))
for _, v := range input {
	items = append(items, transform(v))
}
Without the capacity hint, Go grows the backing array each time it runs out of space (roughly doubling for small slices). For a loop processing 10,000 items, that’s a series of unnecessary allocations and copies.
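You can see the difference directly with testing.AllocsPerRun. This sketch compares the same append loop with and without the capacity hint (the fill helper and sink variable are made up for the demo):

```go
package main

import (
	"fmt"
	"testing"
)

var sink int // keeps the result live so the compiler can't elide the loop

// fill appends n ints, optionally preallocating capacity up front.
func fill(n int, prealloc bool) {
	var s []int
	if prealloc {
		s = make([]int, 0, n) // one correctly sized allocation
	}
	for i := 0; i < n; i++ {
		s = append(s, i) // without the hint, append regrows repeatedly
	}
	sink = len(s)
}

func main() {
	grown := testing.AllocsPerRun(100, func() { fill(10000, false) })
	hinted := testing.AllocsPerRun(100, func() { fill(10000, true) })
	fmt.Printf("no hint: %.0f allocs/op, with hint: %.0f allocs/op\n", grown, hinted)
}
```

The hinted version does one allocation per run; the unhinted one does a dozen or more, and every regrowth also copies the existing elements.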
Use sync.Pool for Hot Paths
Our news processing pipeline allocates byte buffers on every request. A sync.Pool cut per-request allocations roughly in half.
var bufPool = sync.Pool{
	New: func() interface{} {
		b := make([]byte, 4096)
		return &b // store a pointer: putting a bare slice into the pool's
		// interface{} would itself allocate on every Put
	},
}

buf := bufPool.Get().(*[]byte)
defer bufPool.Put(buf)
One gotcha: the runtime may drain pools during GC. They smooth out allocation bursts but they’re not a cache. Don’t store anything you can’t afford to recreate.
Watch for Heap Escapes
Go’s escape analysis decides what lives on the stack versus the heap. Stack allocations are basically free. Heap allocations aren’t.
go build -gcflags="-m" ./...
Run that and read the output. Common things that force heap allocation:
- Returning a pointer to a local variable
- Capturing locals in a closure
- Storing a concrete value into an interface{}
That last one bit us. We had a logging middleware that accepted interface{} arguments for structured fields. Every log call was causing heap escapes. Switching to typed fields fixed it.
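The effect is easy to demonstrate: storing a concrete int into an interface{} forces a heap allocation. In this sketch, box and sink are illustrative names, and the //go:noinline directive keeps the compiler from constant-folding the conversion away:

```go
package main

import (
	"fmt"
	"testing"
)

var sink interface{} // package-level, so the boxed value must escape

// box stores a concrete int into an interface{}. For values outside the
// runtime's small-integer cache (0-255), the conversion heap-allocates.
//
//go:noinline
func box(n int) interface{} { return n }

func main() {
	allocs := testing.AllocsPerRun(100, func() { sink = box(12345) })
	fmt.Printf("allocs per boxed int: %.0f\n", allocs)
}
```

Multiply that one allocation by every field of every log call on a hot path and the GC pressure adds up quickly, which is exactly what we saw in the middleware.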
Strings
Repeated concatenation with + allocates a new string every time. Use strings.Builder.
var b strings.Builder
for _, part := range parts {
	b.WriteString(part)
}
result := b.String()
Seems obvious, but I’ve found string concatenation in hot loops in production code more times than I want to admit.
Concurrency: Goroutines Are Cheap, Chaos Isn’t
Goroutines cost about 2KB of stack. You can spin up millions. But should you? No. Unbounded goroutine creation is how you get cascading failures.
Worker Pools
We process incoming news articles through a pipeline of NLP stages. Early on, we spawned a goroutine per article. At 5,000 articles per minute, that was 5,000 goroutines competing for CPU and slamming downstream services. A fixed worker pool solved it immediately.
func runWorkers(jobs <-chan Job, results chan<- Result, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range jobs {
				results <- process(job)
			}
		}()
	}
	// Blocks until the caller closes jobs and the workers drain it, so
	// run this in its own goroutine if you consume results concurrently.
	wg.Wait()
	close(results)
}
We typically size n to 2x the number of CPU cores for CPU-bound work, and higher for I/O-bound work. But measure.
Buffered Channels as Backpressure
An unbounded queue hides overload. It grows silently until your process gets OOM-killed. Buffered channels give you explicit backpressure.
jobs := make(chan Job, 1000)
When the buffer fills, senders block. That’s the signal that your consumers can’t keep up. It’s much better to slow down the producer than to let memory grow until the kernel kills you.
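If blocking the producer isn’t acceptable either, say in an HTTP handler, a select with a default case turns a full buffer into an explicit rejection you can surface as a 429. A sketch (trySubmit and the Job type are illustrative):

```go
package main

import "fmt"

type Job struct{ ID int }

// trySubmit attempts a non-blocking send. When the buffer is full it
// returns false instead of blocking, so the caller can shed load
// rather than queue work unboundedly.
func trySubmit(jobs chan<- Job, j Job) bool {
	select {
	case jobs <- j:
		return true
	default:
		return false
	}
}

func main() {
	jobs := make(chan Job, 2) // deliberately tiny buffer for the demo
	for i := 1; i <= 3; i++ {
		fmt.Printf("job %d accepted: %v\n", i, trySubmit(jobs, Job{ID: i}))
	}
	// prints true, true, false: the third job is shed, not queued
}
```

Whether you block or shed depends on the workload: block when the producer can tolerate latency, shed when it can’t.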
Always Use Context for Cancellation
A goroutine without a cancellation mechanism is a goroutine that might run forever. Every outbound call, every slow operation needs a timeout.
func handle(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()

	select {
	case res := <-slowWork(ctx):
		return use(res)
	case <-ctx.Done():
		return ctx.Err()
	}
}
We had a goroutine leak in one of our services that went unnoticed for weeks. A downstream API started timing out, but our goroutines just… waited. Forever. Adding context deadlines everywhere was tedious but it eliminated that entire class of problem.
HTTP Server and Client Tuning
The default http.Server has no timeouts. Read that again. No timeouts. A slow client can hold a connection open indefinitely.
server := &http.Server{
	Addr:           ":8080",
	Handler:        handler,
	ReadTimeout:    5 * time.Second,
	WriteTimeout:   10 * time.Second,
	IdleTimeout:    120 * time.Second,
	MaxHeaderBytes: 1 << 20,
}
Set all four. I’ve seen production outages caused by nothing more than missing ReadTimeout.
For outbound HTTP, create one http.Client and reuse it. The zero-value client has no timeout, and its default transport keeps only two idle connections per host, so under concurrency you churn through fresh connections — which sounds fine until you’re opening 10,000 connections to the same host.
var client = &http.Client{
	Transport: &http.Transport{
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 100,
		IdleConnTimeout:     90 * time.Second,
	},
	Timeout: 10 * time.Second,
}
We were creating a new http.Client per request in one of our services. Swapping to a shared client with a tuned transport cut external API call latency by 30% just from connection reuse.
Database Access
DB latency dominates most request paths. Our Postgres pools were the source of more production incidents than anything else in 2018.
db.SetMaxOpenConns(25)
db.SetMaxIdleConns(25)
db.SetConnMaxLifetime(5 * time.Minute)
Key things I learned the hard way:
- MaxOpenConns — too high and you overwhelm the database; too low and requests queue up waiting for a connection.
- MaxIdleConns — should usually match MaxOpenConns. If idle connections are constantly being closed and reopened, you pay the TCP handshake cost on every query.
- ConnMaxLifetime — prevents stale connections after a database failover. Without it, your app can hold connections to a node that’s no longer primary.
For write-heavy paths, batch inserts inside a transaction. The round-trip cost of individual inserts adds up fast. We had a service doing 500 individual inserts per batch. Wrapping them in a transaction with a multi-value INSERT cut that path from 2 seconds to 80 milliseconds.
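A sketch of the multi-value approach: build one parameterized INSERT instead of N round trips. The table and column names here are hypothetical, and in production the resulting query runs inside a transaction via db.ExecContext:

```go
package main

import (
	"fmt"
	"strings"
)

type Article struct {
	Title  string
	Source string
}

// buildBatchInsert builds a single multi-value INSERT using Postgres
// placeholders ($1, $2, ...), returning the query and its arguments.
func buildBatchInsert(rows []Article) (string, []interface{}) {
	var b strings.Builder
	b.WriteString("INSERT INTO articles (title, source) VALUES ")
	args := make([]interface{}, 0, len(rows)*2)
	for i, r := range rows {
		if i > 0 {
			b.WriteString(", ")
		}
		fmt.Fprintf(&b, "($%d, $%d)", i*2+1, i*2+2)
		args = append(args, r.Title, r.Source)
	}
	return b.String(), args
}

func main() {
	q, args := buildBatchInsert([]Article{{"a", "x"}, {"b", "y"}})
	fmt.Println(q)
	fmt.Println("args:", len(args))
}
```

Mind the placeholder limit: Postgres caps a statement at 65,535 parameters, so very large batches still need chunking.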
Runtime Knobs
A few runtime settings worth knowing:
- GOMAXPROCS — should match available CPU cores. In containers, this defaults to the host CPU count, not your cgroup limit. Use uber-go/automaxprocs to fix it automatically.
- GOGC — controls GC aggressiveness. Default is 100 (GC triggers when the heap doubles). For latency-sensitive services with enough memory, bumping this to 200 or higher reduces GC frequency at the cost of higher memory use.
- runtime/pprof and expvar — lightweight runtime visibility. Expose goroutine count, heap size, and request latency. Trends matter more than snapshots.
Benchmarks: Trust Numbers, Not Feelings
Go has built-in benchmarking. Use it.
func BenchmarkProcess(b *testing.B) {
	input := generateInput()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		process(input)
	}
}
go test -bench=. -benchmem -count=5 ./...
The -benchmem flag is crucial. It shows allocations per operation. And -count=5 gives you enough samples for benchstat to tell you whether a change is real or noise.
go get golang.org/x/perf/cmd/benchstat
benchstat old.txt new.txt
I’ve been fooled by benchmark improvements that turned out to be within noise. benchstat keeps you honest.
What It Comes Down To
Performance work in Go isn’t glamorous. It’s running pprof, staring at flame graphs, moving allocations to the stack, tuning pool sizes, and setting timeouts that should have been set from the start. But the payoff is real. We run services at the fintech startup that handle significant traffic on modest infrastructure, and most of the wins came from the patterns above. Not clever algorithms. Not exotic data structures. Just the basics, applied consistently.