Kubernetes Requests and Limits: Lessons From Getting It Wrong

| 5 min read |
kubernetes resources capacity-planning devops

CPU is compressible. Memory is not. That one sentence explains 80% of Kubernetes resource problems.

Last year, a team I was working with shipped a service to production with no resource requests and no limits. Default namespace, no LimitRange, no guardrails. The service had a slow memory leak – about 50 MB per hour. Harmless-looking. For the first two days it ran fine.

On day three, during a traffic spike, the container hit 4 GB. Kubernetes evicted it. The replacement pod landed on the same node, inherited the same traffic, and the cycle repeated. The node started thrashing. Other pods on that node – including a critical payment service – got evicted too. What started as one leaky container turned into a 45-minute partial outage.

The fix was three lines of YAML. The investigation took most of a day.

The One Thing You Need to Understand

CPU is compressible. When a container tries to use more CPU than its limit allows, the kernel throttles it. The process slows down but keeps running.

Memory isn’t compressible. When a container exceeds its memory limit, the kernel kills it. OOMKilled. The process restarts from scratch.

That asymmetry drives almost every resource management decision. CPU limits trade throughput for predictability. Memory limits are a hard safety net. Treat them differently.

Requests vs Limits

Requests tell the scheduler how much capacity to reserve. They’re a guarantee at scheduling time, not a cap at runtime: the scheduler won’t place your pod on a node unless that capacity is available, but nothing stops the container from using more.

Limits tell the runtime when to intervene. CPU limits trigger throttling. Memory limits trigger OOMKill.

A cluster can overcommit limits – the sum of all limits can exceed the node capacity. It can’t overcommit requests. This is why requests matter more for scheduling stability.

Setting Them

Start with measurements, not guesses. If your monitoring shows a Go service using 150m CPU on average and 200 MB of memory, set requests near those values:

resources:
  requests:
    cpu: "200m"
    memory: "256Mi"
  limits:
    memory: "512Mi"

Notice: no CPU limit. For latency-sensitive services, CPU limits can cause throttling spikes even when the node has spare capacity. The CFS throttling behavior in Linux is well-documented and surprising – a service using 180m of CPU can still get throttled with a 200m limit because of how the quota is applied per scheduling period.

I default to: set CPU requests, skip CPU limits, always set memory limits. If you need CPU limits for noisy-neighbor isolation, set them generously – 2x to 3x the request.
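If you do need that noisy-neighbor isolation, the spec might look like this – a sketch with illustrative values, following the 3x guideline above:

```yaml
resources:
  requests:
    cpu: "200m"
    memory: "256Mi"
  limits:
    cpu: "600m"      # 3x the CPU request: headroom for bursts, still bounded
    memory: "512Mi"  # memory limit always set
```

The generous CPU limit keeps CFS throttling rare under normal load while still capping a runaway process.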

Measuring and Iterating

Don’t set values once and forget them. Traffic patterns change. Code changes. Dependencies change.

Look at these metrics over at least two weeks:

  • Container CPU usage (the actual usage, not the request)
  • Memory working set (not RSS – working set is what Kubernetes uses for eviction decisions)
  • Throttling events (container_cpu_cfs_throttled_seconds_total)

Set requests around the P50 usage. Set memory limits around the P99 plus a buffer. If you’re seeing throttling events on a latency-sensitive service, either raise the CPU limit or remove it.
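One way to keep an eye on throttling is an alert on the throttled-periods ratio, a common companion to the seconds counter listed above. A sketch, assuming the Prometheus Operator’s PrometheusRule CRD is installed – the name, threshold, and severity label are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cpu-throttling
spec:
  groups:
    - name: resources
      rules:
        - alert: HighCPUThrottling
          # Fraction of CFS scheduling periods in which the container was throttled.
          expr: |
            sum(rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (namespace, pod, container)
              /
            sum(rate(container_cpu_cfs_periods_total{container!=""}[5m])) by (namespace, pod, container)
              > 0.25
          for: 15m
          labels:
            severity: warning
```

A sustained ratio above a few percent on a latency-sensitive service is usually the signal to raise or remove the CPU limit.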

Tools like VPA (Vertical Pod Autoscaler) can generate recommendations. Use them as input, not as autopilot. I’ve seen VPA recommendations that would have cut memory limits below the application’s startup footprint. Trust but verify.

QoS Classes

Kubernetes assigns a quality-of-service class based on how you configure requests and limits:

  • Guaranteed: requests equal limits for both CPU and memory. Last to be evicted.
  • Burstable: requests set but not equal to limits. Most common. Middle priority for eviction.
  • BestEffort: no requests, no limits. First to be evicted. Never appropriate for production.
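For example, a container spec that lands in the Guaranteed class sets requests equal to limits for both resources (values illustrative):

```yaml
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "500m"
    memory: "1Gi"
```

Setting only the limits also yields Guaranteed, since Kubernetes defaults the requests to match.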

The team from my opening story had BestEffort pods running alongside Guaranteed pods. When the node ran out of memory, guess which ones died first. Theirs – at first. But once enough BestEffort pods got evicted and rescheduled, the cascading resource pressure hit everything.

Guardrails

Two cluster-level mechanisms prevent the “no resources specified” scenario:

LimitRange sets default requests and limits for containers that don’t specify their own. Set this on every namespace. The defaults don’t need to be perfect – they need to prevent BestEffort pods from accidentally running in production.
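A minimal LimitRange along those lines might look like this – namespace and values are illustrative, and note that it deliberately supplies no default CPU limit, matching the advice above:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: team-a         # illustrative namespace
spec:
  limits:
    - type: Container
      defaultRequest:       # applied when a container omits requests
        cpu: "100m"
        memory: "128Mi"
      default:              # applied when a container omits limits
        memory: "512Mi"     # memory only: no default CPU limit
```

Any container deployed to this namespace without resource settings becomes Burstable instead of BestEffort.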

ResourceQuota caps the total resource usage in a namespace. This prevents one team from consuming the cluster. Set quotas per team namespace and review them quarterly.
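A corresponding ResourceQuota might look like this (name, namespace, and caps are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a        # illustrative namespace
spec:
  hard:
    requests.cpu: "20"     # total CPU requests across the namespace
    requests.memory: "40Gi"
    limits.memory: "80Gi"
    pods: "100"
```

Once the namespace hits a cap, new pods are rejected at admission rather than starving neighbors at runtime.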

At a large consumer platform, we enforced both. LimitRange provided sane defaults, and ResourceQuota kept growth predictable. Teams that needed more resources had to justify it, which had the side effect of surfacing inefficient services early.

The Three Failure Modes

OOMKilled. Memory limit is too low, or the application has a leak. Check the limit against actual usage. If usage grows linearly over time, it’s a leak – fix the code, not the limit.

CPU throttling. Shows up as latency spikes with available CPU on the node. The container is being throttled by CFS even though the node isn’t overloaded. Raise or remove the CPU limit.

Pending pods. Requests are too large for available capacity, or scheduling constraints (node selectors, taints, affinity rules) are too restrictive. Either reduce requests, add capacity, or relax the constraints.

All three are observable. All three are fixable. The problem is usually not that teams don’t know how – it’s that nobody is looking at the metrics until an incident forces it.

The Minimum

  1. Set memory limits on every container. No exceptions.
  2. Set CPU requests based on measured usage.
  3. Apply LimitRange defaults on every namespace.
  4. Apply ResourceQuota on team namespaces.
  5. Monitor throttling, OOMKill events, and pending pods.
  6. Review resource settings quarterly or after significant traffic changes.

Resource management isn’t glamorous work. It’s the kind of thing that only gets attention after an outage. But three lines of YAML and fifteen minutes of looking at Grafana dashboards can prevent the kind of cascading failure that ruins your weekend. Discipline over heroics.