Quick take
Reserved instances aren’t a strategy. Visibility is. Most teams burn money on idle resources they can’t even find, then negotiate a discount on the waste. Fix the waste first.
I’ve managed cloud spend at three startups: a fintech where every penny of margin mattered; Dropbyke, a mobility platform in Seoul, where we ran real-time services on a shoestring; and now Decloud, where – ironically – the whole product is about making cloud infrastructure sane.
I keep seeing the same pattern. Someone gets a scary AWS bill. Leadership asks engineering to “optimize cloud costs.” Engineering buys reserved instances, saves 20%, and calls it done. Meanwhile, half the staging environments are running 24/7, there are EBS volumes attached to instances that were terminated months ago, and nobody knows which team owns the NAT gateway that’s costing $400/month.
Reserved instances are a pricing mechanism, not a strategy. Here’s what actually works, ranked by effort vs. impact.
The Real Comparison
| Strategy | Effort | Typical Savings | When It Helps | When It’s a Trap |
|---|---|---|---|---|
| Kill idle resources | Low | 15-30% | Always. Literally always. | Never a trap. Just do it. |
| Right-size instances | Medium | 10-25% | After you have utilization data | Before you have data – you’ll guess wrong |
| Storage lifecycle policies | Low | 5-15% | When you have >1TB of objects | When access patterns are unpredictable |
| Reserved instances / savings plans | Low (to buy) | 20-40% off on-demand | Steady, predictable workloads | Fast-changing services, early-stage products |
| Spot / preemptible instances | High | 60-90% off on-demand | CI, batch jobs, stateless workers | Anything stateful or latency-sensitive |
| Architecture changes | Very high | 10-50% | When compute patterns are fundamentally wasteful | When the org isn’t ready for the migration cost |
That table tells a story most FinOps consultants won’t: the highest-ROI work is the boring stuff at the top.
Start With Visibility, Not Discounts
At the fintech startup, I inherited an AWS account with zero tagging discipline. Costs were allocated to “engineering” as a single line item. Completely useless. My first week, I enforced four tags on every resource:
- Environment: production | staging | dev
- Team: platform | data | product
- Service: api | worker | web | ml-pipeline
- Owner: <person who provisioned it>
Within two weeks, we found three full staging environments nobody was using. One had been running for five months. That single discovery saved more than the next quarter’s reserved instance commitment would have.
Tags are free. Use them.
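If you want to enforce that discipline rather than just recommend it, a small gate in provisioning or CI does the job. Here's a minimal sketch: the tag keys and allowed values mirror the four above; everything else (function name, wiring it to actual AWS resources) is illustrative, not a specific tool.

```python
# Sketch of a tag-policy gate: reject any resource whose tags are missing
# a required key or use a value outside the allowed set. Keys and values
# mirror the four-tag scheme above; feeding it real resources (e.g. from
# a boto3 describe_instances sweep) is left out for brevity.
REQUIRED_TAGS = {
    "Environment": {"production", "staging", "dev"},
    "Team": {"platform", "data", "product"},
    "Service": {"api", "worker", "web", "ml-pipeline"},
    "Owner": None,  # any non-empty value: the person who provisioned it
}

def tag_violations(tags: dict[str, str]) -> list[str]:
    """Return human-readable violations for one resource's tags."""
    problems = []
    for key, allowed in REQUIRED_TAGS.items():
        value = tags.get(key, "").strip()
        if not value:
            problems.append(f"missing tag: {key}")
        elif allowed is not None and value not in allowed:
            problems.append(f"bad value for {key}: {value!r}")
    return problems

print(tag_violations({"Environment": "staging", "Team": "platform",
                      "Service": "api", "Owner": "jane"}))  # []
print(tag_violations({"Environment": "prod", "Owner": "jane"}))
```

Run it as a pre-merge check on Terraform plans or as a nightly sweep; either way, a resource with no owner never makes it to month two unnoticed.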
Right-Sizing: The Unsexy Middle Ground
Most instances are oversized. This isn’t controversial. Every cloud provider’s own tooling will tell you this. The question is what to do about it.
My approach, applied at every place I’ve worked:
- Collect two weeks of utilization data. Not one day. Not peak hour. Two weeks across normal traffic patterns.
- Start with non-production. Dev and staging environments are almost always 2-4x oversized because someone copied the production config.
- Target p95 CPU below 60% after resizing. You want headroom. This isn’t about running hot.
| Instance Type | Monthly Cost | Avg CPU | p95 CPU | Action |
|---|---|---|---|---|
| m5.2xlarge | $280 | 8% | 15% | Downsize to m5.large ($70) |
| r5.xlarge | $183 | 45% | 72% | Keep. Good fit. |
| c5.4xlarge | $496 | 12% | 20% | Downsize to c5.xlarge ($124) |
| t3.medium | $30 | 3% | 5% | Ask if this service is still needed |
That last column is the one that matters most. Sometimes the right size is zero.
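That Action column reduces to a crude rule of thumb. This sketch encodes my thresholds, not any official AWS guidance; tune the cutoffs to your own risk tolerance.

```python
def rightsizing_action(p95_cpu: float) -> str:
    """Crude right-sizing triage from two weeks of CPU data (percent).

    Thresholds are rules of thumb, not AWS guidance:
    - near-zero p95 suggests the service may not be needed at all;
    - low p95 with lots of headroom suggests a smaller instance;
    - anything well above that is already a reasonable fit.
    """
    if p95_cpu < 10:
        return "question whether this service is still needed"
    if p95_cpu < 30:
        return "downsize; re-check that p95 stays under 60% after"
    return "keep"

# Replaying the table above:
for name, p95 in [("m5.2xlarge", 15), ("r5.xlarge", 72),
                  ("c5.4xlarge", 20), ("t3.medium", 5)]:
    print(name, "->", rightsizing_action(p95))
```

The point of automating the triage isn't precision; it's that the "is this even needed?" question gets asked every month instead of never.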
Spot Instances: High Reward, High Effort
I’m a fan of spot capacity for the right workloads. At Decloud, we run all CI builds on spot instances. Saves us roughly 70% on build infrastructure. But I wouldn’t touch spot for anything user-facing or stateful.
The honest comparison:
| Workload | Spot-safe? | Why / Why Not |
|---|---|---|
| CI/CD pipelines | Yes | Builds are idempotent. Retry is cheap. |
| Batch ETL jobs | Yes | Checkpointable. Interruption adds minutes, not risk. |
| Dev environments | Yes | Nobody cares if dev goes down for 2 minutes. |
| Stateless API workers | Maybe | Only if you have enough replicas and proper drain handling. |
| Databases | No | Just… no. |
| Single-instance anything | No | Spot + single instance = eventual outage |
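The table condenses into a short checklist. This is a sketch of my heuristics, not a law; the attribute names are made up for illustration.

```python
def spot_safe(stateful: bool, latency_sensitive: bool,
              replicas: int, drain_handling: bool) -> str:
    """Condense the spot-safety table into a checklist. Personal
    heuristics, not a guarantee: returns 'yes', 'maybe', or 'no'."""
    if stateful:
        return "no"     # databases, single-writer state: just... no
    if replicas < 2:
        return "no"     # spot + single instance = eventual outage
    if latency_sensitive and not drain_handling:
        return "no"     # interruptions will surface to users
    if latency_sensitive:
        return "maybe"  # OK only with enough replicas + proper draining
    return "yes"        # CI, batch, dev: retry is cheap

# A CI runner fleet: stateless, retryable, plenty of replicas.
print(spot_safe(stateful=False, latency_sensitive=False,
                replicas=10, drain_handling=True))  # yes
```

Note the ordering: statefulness and replica count veto everything else, which matches how the failures actually play out in production.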
The Thing Nobody Talks About: Data Transfer
Cross-AZ traffic on AWS is $0.01/GB each way. Doesn’t sound like much until you’re pushing 10TB/month between services in different availability zones. That’s $200/month just for services talking to each other inside your own VPC.
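The arithmetic is worth making explicit, because both sides of the transfer pay. A tiny calculator, assuming the $0.01/GB-each-way rate above (check current pricing for your region):

```python
CROSS_AZ_RATE_PER_GB = 0.01  # USD, charged in EACH direction

def cross_az_monthly_cost(tb_per_month: float) -> float:
    """Monthly cost of cross-AZ chatter. Both the sending and the
    receiving side are billed, hence the factor of 2."""
    gb = tb_per_month * 1000  # decimal TB -> GB, matching AWS billing
    return gb * CROSS_AZ_RATE_PER_GB * 2

print(cross_az_monthly_cost(10))  # 10 TB/month -> 200.0
```

The factor of 2 is the part teams miss: the line item looks like a $0.01 rate but behaves like $0.02.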
Three fixes that cost nothing but attention:
- Keep chatty services in the same AZ. Schedule pods to co-locate when latency and bandwidth matter.
- Use VPC endpoints for AWS services. S3 traffic through a NAT gateway is money on fire.
- Cache aggressively at the edge. A CDN in front of your API responses (where appropriate) cuts both latency and egress.
At Dropbyke, data transfer was our third-highest line item after compute and RDS. Moving our real-time location services into a single AZ and adding a VPC endpoint for S3 cut that bill by 40% in one sprint.
What I Actually Do Every Month
I don’t have a FinOps team. At startup scale, cost discipline is an engineering habit, not a department. Here’s my actual routine:
Weekly: Glance at the cost dashboard. Look for anything that jumped more than 20% week-over-week. If something spiked, find it before the month closes.
Monthly: Review top 10 services by cost. Check for idle resources – unattached volumes, orphaned load balancers, forgotten test clusters. Five minutes of cleanup saves hundreds.
Before any commitment: Right-size first. Always. Buying a reserved instance for an oversized instance just locks in the waste at a discount.
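The weekly check is mechanical enough to script. A sketch of the week-over-week spike flag; the service names and numbers are made up, and in practice the input comes from a Cost Explorer export or your dashboard's API.

```python
def wow_spikes(costs: dict[str, list[float]],
               threshold: float = 0.20) -> list[str]:
    """Flag services whose latest weekly cost jumped more than
    `threshold` over the prior week. Input maps service name to
    weekly cost history, oldest first. Illustrative data only."""
    flagged = []
    for service, history in costs.items():
        if len(history) < 2 or history[-2] <= 0:
            continue  # not enough history to compare
        change = (history[-1] - history[-2]) / history[-2]
        if change > threshold:
            flagged.append(f"{service}: +{change:.0%} week-over-week")
    return flagged

print(wow_spikes({"api": [410, 420, 415],
                  "ml-pipeline": [300, 310, 520]}))
```

Anything this flags gets investigated before month close, while the person who caused the spike still remembers what they deployed.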
Cloud cost optimization isn’t a project with a finish line. It’s hygiene. Like writing tests or reviewing pull requests – you either build the habit or you pay the tax forever.
The teams that manage cloud spend well aren’t the ones with the best FinOps tooling. They’re the ones where every engineer can see what their service costs and cares enough to keep it reasonable.