Quick take
Most Kubernetes clusters run at 20-40% actual utilization. The rest is wasted money. Right-size your resource requests using real data, stop guessing at CPU limits, use spot instances for stateless workloads, and review your spend monthly. I’ve helped teams cut 30-50% off their K8s bills without a single reliability regression.
Last month I looked at a company spending $47k/month on their Kubernetes clusters. Actual utilization across the board? 23%. They were paying for ghost capacity that nobody asked for and nobody noticed.
This isn’t unusual. I see it at almost every enterprise engagement. Kubernetes makes scaling trivially easy, which makes over-provisioning trivially easy too. Some developer copy-pasted resource requests from a Stack Overflow answer two years ago. Nobody questioned it. The cluster autoscaler dutifully spun up nodes to satisfy those fictional requests. The bill grew.
Why Your Bill Keeps Growing
Kubernetes schedules based on resource requests, not actual usage. Your cloud bill pays for the nodes backing those requests. If every pod requests 2 CPU cores but uses 0.3, you’re paying for 2 cores of node capacity per pod. Multiply that across hundreds of pods. Yeah.
The usual culprits:
- Copy-pasted resource blocks that become tribal defaults
- Memory requests inflated to “never get OOM-killed” levels, without anyone measuring actual usage
- CPU limits set defensively and never revisited
- Zero cost visibility per team or per service
Measure First. Seriously.
I can’t stress this enough. Before you touch a single resource request, get visibility into what you’re actually using.
Build a dashboard (Grafana, Kubecost, whatever) that shows:
- CPU requested vs CPU used at p95
- Memory requested vs peak working set
- Utilization broken down by namespace
- Nodes that can’t scale down because of inflated requests
The first time an engineering lead sees their team requesting 64 cores and using 11, the conversation changes fast.
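If you run the Prometheus Operator with kube-state-metrics and cAdvisor metrics (the usual Grafana setup), the requested-vs-used numbers behind that dashboard can be captured as recording rules. This is a sketch under those assumptions; the rule names and the `utilization-visibility` object are hypothetical:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: utilization-visibility   # hypothetical name
  namespace: monitoring
spec:
  groups:
    - name: cost-visibility
      rules:
        # CPU cores requested, per namespace
        - record: namespace:cpu_requested:sum
          expr: sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})
        # CPU cores actually used, per namespace (5m rate)
        - record: namespace:cpu_used:sum
          expr: sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
        # Memory requested vs actual working set, per namespace
        - record: namespace:memory_requested:sum
          expr: sum by (namespace) (kube_pod_container_resource_requests{resource="memory"})
        - record: namespace:memory_working_set:sum
          expr: sum by (namespace) (container_memory_working_set_bytes{container!=""})
```

Graph `cpu_used` against `cpu_requested` per namespace and read the p95 off the panel. The gap between the two lines is the money.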
The Cost Comparison
Here is what I typically see before and after right-sizing:
| Category | Before | After | Savings |
|---|---|---|---|
| CPU requests vs actual | 4x over-provisioned | 1.3x buffer | ~60% node reduction |
| Memory requests | 2-3x peak usage | Peak + 20% buffer | ~40% node reduction |
| Node types | Single large instance type | Mixed instance pools | 15-25% better bin packing |
| Spot usage | 0% | 40-60% of stateless workloads | 60-70% on those nodes |
| Dev/staging environments | Same specs as prod | Right-sized, spot-heavy | 50-70% reduction |
| Typical monthly bill | $45-50k | $20-25k | 45-55% |
These are real numbers from a mid-sized company running about 200 microservices. Your mileage will vary, but the pattern holds.
Right-Size Resource Requests
Stop guessing. Use p95 CPU as your request baseline. Use peak memory working set plus a 15-20% buffer for memory requests. Set memory limits to protect against runaway processes. Drop CPU limits entirely for most services – they cause throttling that hurts latency more than it helps anything.
```yaml
resources:
  requests:
    cpu: 250m        # based on p95 actual usage
    memory: 1Gi      # based on peak working set + buffer
  limits:
    memory: 2Gi      # hard cap for runaway protection
    # no CPU limit -- let it burst
```
Run VPA in recommendation mode first. It watches actual usage and suggests request values. Don’t let it auto-apply yet. Review the recommendations manually, make sure they make sense, then update your deployments. Build trust before you automate.
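Recommendation mode is one field on the VPA object. A minimal sketch, assuming the VPA controller is installed and targeting a hypothetical `payments-api` Deployment:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payments-api   # hypothetical workload
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  updatePolicy:
    updateMode: "Off"   # recommend only; never evict or mutate pods
```

`kubectl describe vpa payments-api` then shows the recommended target plus lower and upper bounds in the status. That's your review queue.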
Fix the Node Layer
Right-sized pods on wrong-sized nodes still waste money. If you have a bunch of pods requesting 500m CPU and 512Mi memory, shoving them onto m5.4xlarge instances is terrible bin packing.
Mix your instance types. Use smaller instances for smaller workloads. Create separate node pools for workloads with different profiles (CPU-heavy vs memory-heavy vs general). Enable Cluster Autoscaler and let it remove nodes that are actually idle.
Autoscaling only works when requests are honest. Inflated requests mean the autoscaler sees “full” nodes everywhere and keeps adding capacity you don’t need.
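If you provision nodes with Karpenter (v1 API), a mixed pool is one NodePool with several allowed instance types; Karpenter then picks the cheapest shape that fits the pending pods and consolidates underused nodes. A sketch — the pool name and instance list are illustrative, not a recommendation:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general   # hypothetical pool name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        # several sizes and families -> better bin packing than one big type
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m5.large", "m5.xlarge", "m6i.large", "m6i.xlarge", "c5.xlarge"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized   # remove/replace idle nodes
```

With the classic Cluster Autoscaler the equivalent is multiple node groups of different sizes; the principle is the same either way.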
Spot Instances: Free Money (Almost)
Spot instances are 60-70% cheaper than on-demand. The catch: they can be reclaimed with two minutes' notice.
Good candidates for spot:
- Stateless services with 3+ replicas
- Batch jobs and queue processors
- Dev and staging environments (all of it, honestly)
- Anything that handles graceful shutdown
Bad candidates: single-replica stateful services, databases, anything where losing a node means losing data.
Use taints and tolerations to pin critical workloads to on-demand nodes. Spread spot across multiple instance types and AZs so reclamation doesn’t take out your entire fleet at once.
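Assuming your spot node pools are created with a taint like `capacity=spot:NoSchedule` (the key and value here are illustrative), only workloads that explicitly tolerate it can land on spot — critical services without the toleration stay on on-demand by default:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: queue-processor   # hypothetical spot-friendly workload
spec:
  replicas: 3
  selector:
    matchLabels: {app: queue-processor}
  template:
    metadata:
      labels: {app: queue-processor}
    spec:
      tolerations:
        - key: capacity
          operator: Equal
          value: spot
          effect: NoSchedule
      # spread replicas across zones so one reclamation wave
      # can't take out the whole fleet
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels: {app: queue-processor}
```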
Guardrails That Stick
Optimization without guardrails is a one-time win. Things drift back within months.
Set up:
- ResourceQuota per namespace so no team can accidentally claim half the cluster
- LimitRange to enforce minimum and maximum resource requests
- Labels for team, environment, and service – you need these for cost allocation
- Monthly review cadence – look at the top 10 over-provisioned workloads, adjust, repeat
That last one matters most. Cost optimization isn’t a project. It’s a habit. A 30-minute monthly review catches drift before it becomes a $10k/month problem.
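The quota and limit-range guardrails are plain Kubernetes objects. A sketch for a hypothetical `team-checkout` namespace — the numbers are placeholders you'd derive from your own measurements:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-checkout-quota
  namespace: team-checkout   # hypothetical team namespace
spec:
  hard:
    requests.cpu: "64"       # team-wide ceiling on claimed CPU
    requests.memory: 128Gi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: team-checkout-limits
  namespace: team-checkout
spec:
  limits:
    - type: Container
      min:
        cpu: 10m             # no unschedulably tiny requests
        memory: 32Mi
      max:
        cpu: "4"             # no single container claims half a node
        memory: 8Gi
      defaultRequest:        # applied when a pod declares nothing
        cpu: 100m
        memory: 256Mi
```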
The Uncomfortable Truth
Most Kubernetes cost problems aren’t technical problems. They’re ownership problems. Nobody owns the bill. Nobody sees the bill broken down by team. Nobody gets asked “why are you requesting 8 cores for a service that peaks at 0.5?”
Fix the visibility and the accountability first. The technical optimization follows naturally.
I’ve watched teams cut their bills in half within two months just by making resource usage visible to the teams that own the workloads. No fancy tooling. No replatforming. Just data, ownership, and a monthly conversation.