I opened our AWS bill last month and genuinely thought there was an error. We had budgeted for compute. What showed up was compute plus a zoo of line items I never agreed to. Data transfer fees. Storage request charges. Snapshot costs for things nobody remembered creating. The bill was nearly double what the pricing calculator said it would be.
This wasn’t the first time. At the fintech startup we processed large volumes of financial data. At Dropbyke we were shuffling GPS pings across regions. Both times the same gut punch: the pricing page and the invoice are two completely different documents.
The pricing page is a fantasy
Every team I’ve worked with starts the same way. You go to the AWS calculator, punch in your instance types, maybe add an RDS box, and land on a number that feels reasonable. Then the invoice arrives and it’s stuffed with charges you never consciously chose.
That isn’t a bug. AWS isn’t one service. It’s hundreds of services, each with its own meter running.
Where the money actually goes
Data transfer. This one got us at the fintech startup. We were pulling market data, processing it, and pushing results to users. Egress charges per gigabyte. Cross-AZ traffic. NAT gateway fees. None of it obvious until the bill. A workload that looks cheap on compute can bleed money on transfer alone.
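Here is roughly how those transfer line items stack up against a compute-only estimate. The per-GB and hourly rates below are illustrative placeholders I picked for the sketch, not quoted AWS prices; check the pricing page for your region.

```python
# Rough monthly data-transfer estimate. Rates are assumptions for
# illustration, not current AWS pricing.
EGRESS_PER_GB = 0.09     # internet egress (assumed rate)
CROSS_AZ_PER_GB = 0.01   # cross-AZ, charged in each direction (assumed)
NAT_PER_GB = 0.045       # NAT gateway data processing (assumed)
NAT_HOURLY = 0.045       # NAT gateway hourly charge (assumed)

def monthly_transfer_cost(egress_gb, cross_az_gb, nat_gb, hours=730):
    """Sum the transfer line items a compute-only estimate misses."""
    return round(
        egress_gb * EGRESS_PER_GB
        + cross_az_gb * CROSS_AZ_PER_GB * 2  # billed on both sides
        + nat_gb * NAT_PER_GB
        + hours * NAT_HOURLY,
        2,
    )

# 2 TB egress, 5 TB cross-AZ, 1 TB through NAT in a month
print(monthly_transfer_cost(2000, 5000, 1000))
```

Even with modest volumes, that is hundreds of dollars a month that never appears in a calculator session focused on instance types.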
Storage that never stops billing. You pay for capacity, sure. But also for IOPS if you exceed baseline, for every snapshot that piles up because nobody set a retention policy, for every API call against S3. Cold storage looks cheap until you actually need to retrieve something. Then the retrieval fees hit.
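The cold-storage trap is easy to show with arithmetic. The rates here are assumed round numbers for the sketch, not quoted S3 or Glacier prices: storage is billed per GB-month, retrieval per GB on top.

```python
# Why "cheap" cold storage can surprise you: holding is billed per
# GB-month, retrieval separately. Rates are illustrative assumptions.
STANDARD_PER_GB = 0.023   # assumed hot-tier monthly rate
ARCHIVE_PER_GB = 0.004    # assumed archive-tier monthly rate
RETRIEVAL_PER_GB = 0.03   # assumed archive retrieval rate

def yearly_cost(gb, retrieved_gb_per_year, cold=True):
    storage = gb * (ARCHIVE_PER_GB if cold else STANDARD_PER_GB) * 12
    retrieval = retrieved_gb_per_year * (RETRIEVAL_PER_GB if cold else 0.0)
    return round(storage + retrieval, 2)

# 10 TB archived: cheap to hold, but one full restore narrows the gap.
print(yearly_cost(10_000, 0))              # hold only
print(yearly_cost(10_000, 10_000))         # hold plus one full retrieval
print(yearly_cost(10_000, 0, cold=False))  # same data kept hot
```

Archive tiers still win here, but the margin shrinks the moment you restore, which is exactly the moment nobody budgeted for.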
Zombie resources. Stopped instances still bill for their EBS volumes. Elastic IPs sitting unused. Load balancers running with nothing behind them. Serverless invocations that seem free individually but cost real money at scale. I’ve found forgotten dev environments running for months. Months.
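A zombie sweep is simple enough to automate. This is a sketch over hypothetical inventory dicts; in practice you would build these lists from your cloud provider's describe/list APIs rather than hand-written data.

```python
# Sketch of a zombie-resource sweep. The dict shapes are hypothetical
# stand-ins for what describe-API calls would return.
def find_zombies(volumes, addresses, load_balancers):
    zombies = []
    # Unattached EBS volumes still bill every month.
    zombies += [("volume", v["id"]) for v in volumes if not v["attached_to"]]
    # Elastic IPs bill when allocated but not associated.
    zombies += [("address", a["ip"]) for a in addresses if not a["associated"]]
    # Load balancers with zero targets run (and bill) for nothing.
    zombies += [("lb", lb["name"]) for lb in load_balancers if lb["targets"] == 0]
    return zombies

report = find_zombies(
    volumes=[{"id": "vol-1", "attached_to": None},
             {"id": "vol-2", "attached_to": "i-9"}],
    addresses=[{"ip": "1.2.3.4", "associated": False}],
    load_balancers=[{"name": "old-api", "targets": 0}],
)
print(report)
```

Run something like this on a schedule and route the report to the owning team; finding zombies once is easy, keeping them dead is the hard part.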
Managed databases. RDS is convenient but expensive. Multi-AZ doubles your compute and storage cost. Backups cost money. Provisioned IOPS costs more money. ElastiCache bills per node hour whether your hit rate is 99% or 5%. DynamoDB charges for capacity units and storage separately.
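To see how the RDS line items compound, here is the arithmetic with assumed rates (the per-GB and hourly figures are placeholders, not quoted RDS pricing). The key mechanic is real: a Multi-AZ deployment runs a synchronous standby, so compute and storage both double.

```python
# How managed-database line items stack. Rates are illustrative
# assumptions, not quoted RDS pricing.
def rds_monthly(instance_hourly, storage_gb, storage_per_gb=0.115,
                piops=0, piops_rate=0.10, backup_gb=0, backup_rate=0.095,
                multi_az=False, hours=730):
    compute = instance_hourly * hours
    storage = storage_gb * storage_per_gb + piops * piops_rate
    if multi_az:
        # Multi-AZ runs a synchronous standby: compute and storage double.
        compute *= 2
        storage *= 2
    return round(compute + storage + backup_gb * backup_rate, 2)

single = rds_monthly(0.20, 500)
ha = rds_monthly(0.20, 500, multi_az=True)
print(single, ha)
```

The high-availability flag alone doubles the bill before you add provisioned IOPS or backup storage on top.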
Support and monitoring. AWS support is a percentage of your total bill. Your bill goes up, support goes up. CloudWatch charges for custom metrics, log ingestion, dashboards. Config rules cost per evaluation. Observability isn’t free. Not even close.
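Percentage-of-spend support means every other mistake on the bill is taxed again. A sketch of a tiered-percentage fee, with tier breakpoints that are simplified assumptions for illustration; check the AWS Support pricing page for the real schedule.

```python
# Support priced as a tiered percentage of spend. Breakpoints and
# rates here are simplified assumptions.
def support_fee(monthly_spend):
    """Tiered percentage of spend, with an assumed minimum fee."""
    tiers = [(10_000, 0.10), (80_000, 0.07),
             (250_000, 0.05), (float("inf"), 0.03)]
    fee, remaining, prev_cap = 0.0, monthly_spend, 0
    for cap, rate in tiers:
        band = min(remaining, cap - prev_cap)
        fee += band * rate
        remaining -= band
        prev_cap = cap
        if remaining <= 0:
            break
    return round(max(fee, 100.0), 2)

print(support_fee(20_000))
```

Notice what this implies: cutting $10k of waste from the bill also cuts the support fee, so cleanup savings are slightly better than they look.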
Tag everything or drown
Before you optimize anything, you need to know what costs what. The answer is tagging. Boring, tedious, absolutely essential.
Resources:
  MyInstance:
    Type: AWS::EC2::Instance
    Properties:
      Tags:
        - Key: Environment
          Value: production
        - Key: Team
          Value: platform
        - Key: Service
          Value: api
Tag every resource. Environment, team, service. No exceptions. If something isn’t tagged, you can’t attribute the cost, and if you can’t attribute it, nobody owns it. Unowned costs grow forever.
Set up weekly cost reviews. Not monthly. By the time a monthly review catches a spike, you have already burned four weeks of money. Set budget alerts. Use Cost Explorer daily. Make cost visible to the people creating the resources.
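A tagging policy is only useful if it is enforced. Here is a minimal sketch of a coverage check you could run in CI or a scheduled job; the resource dicts are hypothetical stand-ins for what you would build from describe-API output.

```python
# Tag-coverage check: flag any resource missing a required tag.
# The resource dicts are hypothetical describe-API stand-ins.
REQUIRED_TAGS = {"Environment", "Team", "Service"}

def untagged(resources):
    """Return (resource_id, missing_tags) for anything out of policy."""
    report = []
    for r in resources:
        missing = REQUIRED_TAGS - set(r.get("tags", {}))
        if missing:
            report.append((r["id"], sorted(missing)))
    return report

print(untagged([
    {"id": "i-abc", "tags": {"Environment": "prod", "Team": "platform",
                             "Service": "api"}},
    {"id": "vol-def", "tags": {"Environment": "dev"}},
]))
```

Fail the build or page the owning team when the report is non-empty; a policy nobody enforces decays back to an untagged account within a quarter.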
What actually saves money
Reserved instances for anything with a stable baseline. Don’t over-commit. One-year terms first. Only go multi-year when you’re dead certain the workload isn’t moving.
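Before committing, it is worth computing the break-even point: how many months the workload must keep running before the upfront payment beats on-demand. The rates in this sketch are illustrative assumptions.

```python
# Reserved-instance break-even in months. Rates are illustrative
# assumptions, not quoted pricing.
def breakeven_months(on_demand_hourly, reserved_hourly, upfront,
                     hours_per_month=730):
    saving_per_month = (on_demand_hourly - reserved_hourly) * hours_per_month
    if saving_per_month <= 0:
        return None  # the reservation never pays off
    return round(upfront / saving_per_month, 1)

# e.g. $0.10/hr on-demand vs $0.06/hr reserved with $300 upfront
print(breakeven_months(0.10, 0.06, 300))
```

If the break-even lands uncomfortably close to the end of the term, or past your confidence in the workload's lifetime, stay on-demand.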
Spot instances for batch jobs, CI builds, anything that can handle interruption. We used these heavily for data processing at the fintech startup. Massive savings, but you have to design for the instance disappearing mid-job.
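"Design for the instance disappearing" mostly means chunking the work and checkpointing progress, so a reclaim costs you only the current chunk. A simplified sketch; a real worker would also watch the instance-metadata interruption notice rather than a test flag.

```python
# Interruption-tolerant batch job: process in chunks, checkpoint after
# each one, resume from the checkpoint on retry. Simplified sketch.
processed = []

def process(chunk):
    processed.append(chunk)  # stand-in for the real work

def run_job(chunks, checkpoint, interrupted_at=None):
    """Process chunks, resuming from the last completed index."""
    start = checkpoint.get("done", 0)
    for i in range(start, len(chunks)):
        if interrupted_at is not None and i == interrupted_at:
            return False  # simulate the spot instance being reclaimed
        process(chunks[i])
        checkpoint["done"] = i + 1  # persist progress after each chunk
    return True

ckpt = {}
run_job(["a", "b", "c", "d"], ckpt, interrupted_at=2)  # first attempt dies
run_job(["a", "b", "c", "d"], ckpt)                    # retry resumes at "c"
print(processed)
```

In production the checkpoint lives somewhere durable (a database row, an object in S3), not in memory, so the retry can come from a completely different instance.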
Right-sizing. This is the boring, never-ending work. Pull actual CPU and memory utilization, compare to what you’re paying for, downsize. Most instances are over-provisioned because someone picked a size “just in case” and never looked again.
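The core of right-sizing is a simple rule: if peak utilization plus headroom would still fit in half the machine, go one size down. The thresholds and the size ladder in this sketch are assumptions for illustration, not AWS instance families.

```python
# Right-sizing heuristic: suggest one size down when peak usage plus
# headroom still fits in half the machine. The ladder and thresholds
# are illustrative assumptions.
SIZES = ["large", "xlarge", "2xlarge", "4xlarge"]  # hypothetical ladder

def recommend(current, peak_cpu_pct, peak_mem_pct, headroom=0.30):
    peak = max(peak_cpu_pct, peak_mem_pct) / 100
    idx = SIZES.index(current)
    if idx > 0 and peak * (1 + headroom) <= 0.5:
        return SIZES[idx - 1]  # half the capacity still leaves headroom
    return current

print(recommend("2xlarge", peak_cpu_pct=18, peak_mem_pct=22))  # downsize
print(recommend("2xlarge", peak_cpu_pct=70, peak_mem_pct=40))  # keep
```

Feed it peak (not average) utilization over a representative window, and re-run it on a schedule, because workloads drift.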
Cleanup. Delete unused snapshots. Remove unattached volumes. Kill forgotten dev environments. Set lifecycle policies on S3. This isn’t glamorous work. It saves real money.
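An S3 lifecycle rule makes the cleanup automatic instead of a recurring chore. This follows the real S3 lifecycle-configuration format; the prefix and the day counts are example values you would tune per bucket.

```json
{
  "Rules": [
    {
      "ID": "expire-old-logs",
      "Filter": { "Prefix": "logs/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    }
  ]
}
```

Objects step down to cheaper tiers as they age and are deleted after a year, with no human in the loop.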
Architecture decisions. Keep compute and data in the same region. Use CloudFront for static content. Prefer VPC endpoints over NAT gateways when possible. These choices compound.
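The NAT-versus-endpoint choice is the clearest example of these decisions compounding. With assumed rates (placeholders, not quoted pricing), the NAT path bills hourly plus per-GB processing, while a gateway endpoint to S3 or DynamoDB moves the same traffic without either charge.

```python
# NAT gateway path vs gateway VPC endpoint for S3-bound traffic.
# Rates are illustrative assumptions, not quoted pricing.
def nat_path_cost(gb, hourly=0.045, per_gb=0.045, hours=730):
    return round(hourly * hours + gb * per_gb, 2)

def gateway_endpoint_cost(gb):
    # S3/DynamoDB gateway endpoints carry no hourly or per-GB charge.
    return 0.0

# 5 TB/month of S3 traffic through each path
print(nat_path_cost(5000), gateway_endpoint_cost(5000))
```

One routing decision, made once, and the difference recurs every month for the life of the workload.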
Make cost someone’s problem
The single biggest thing you can do is make cost visible and owned. A shared dashboard where every team sees their spend. A monthly review where you discuss it like you discuss uptime. Cost as a non-functional requirement, right next to latency and availability.
Nobody optimizes what nobody sees. Make it visible. Make it owned. That’s the whole trick.