Quick take
You don’t need a chaos team, a million users, or Netflix’s infrastructure. You need a hypothesis, a kill command, and the guts to run it during work hours. That’s it. That’s chaos engineering.
Every talk I’ve seen on chaos engineering opens with the same slide: Netflix’s Chaos Monkey. And then everyone in the audience nods, thinks “cool, but we’re not Netflix,” and goes back to hoping their services don’t fall over on a Friday afternoon.
I get it. I was there too. At the fintech startup where I worked, we had a small engineering team, a handful of microservices, and real users depending on real-time financial data. Not exactly the scale where you hire a dedicated Site Reliability team. But that’s precisely why we needed chaos experiments. We couldn’t afford a mystery outage at 2 AM with no runbook and no idea which service just took everything else down with it.
What it actually is
Strip away the branding and chaos engineering is dead simple. You pick something that could break. You guess what will happen when it does. You break it. Then you check if you were right.
That’s the whole thing. Hypothesis, experiment, observation. Science fair stuff.
The part most people skip is the hypothesis. They just kill a pod and watch the fireworks. That’s not chaos engineering. That’s sabotage. The hypothesis is what makes it useful. “If service A goes down, service B should degrade gracefully and return cached results within 500ms.” Now you have something testable. Now you’ll actually learn something when the answer is no.
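A hypothesis that concrete is checkable with a few lines of shell. A rough sketch, assuming a hypothetical service-b.internal endpoint and GNU date for millisecond timestamps:

```shell
# Sketch: test "service B still answers within 500ms while A is down".
# The URL is a placeholder for your own endpoint.
start=$(date +%s%3N)                        # GNU date, milliseconds
curl -sf http://service-b.internal/quote >/dev/null
status=$?
elapsed=$(( $(date +%s%3N) - start ))
if [ "$status" -eq 0 ] && [ "$elapsed" -le 500 ]; then
  echo "hypothesis holds (${elapsed}ms)"
else
  echo "hypothesis falsified: status=$status after ${elapsed}ms"
fi
```

When that prints “falsified,” the experiment has already paid for itself.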
Start with a game day, not a tool
Forget Chaos Monkey. Forget Gremlin. Forget LitmusChaos. At least for now.
What worked for us was embarrassingly low-tech. We’d pick a Thursday afternoon, gather around a whiteboard, and ask: “What happens if Postgres becomes unreachable for 30 seconds?” Then we’d talk through it. Just talk. Who gets paged? What errors do users see? Does the queue back up? Does it recover automatically or does someone need to SSH in and restart something?
Half the time, the answer was “nobody knows.” That’s the finding. That’s the value. You don’t even need to touch a terminal yet.
When you’re ready to actually inject failures, staging is your friend. We ran our first real experiments there. Killed processes, added latency with tc, dropped DNS. Found out our health checks were lying to us. Found out one service had a retry loop that would hammer a dead dependency 200 times a second. In staging. Before any customer saw it.
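That retry-loop finding has a standard fix: back off exponentially instead of hammering. A minimal sketch, where probe_dependency is a placeholder for a real health check that here pretends to recover on the third call:

```shell
# Capped exponential backoff sketch. probe_dependency stands in for a
# real health check; this fake one "recovers" on the third call.
probe_dependency() { calls=$((${calls:-0} + 1)); [ "$calls" -ge 3 ]; }

delay=1
attempts=0
for try in 1 2 3 4 5; do
  attempts=$((attempts + 1))
  if probe_dependency; then
    echo "dependency back after $attempts attempts"
    break
  fi
  sleep "$delay"
  delay=$((delay * 2))   # 1s, 2s, 4s... not 200 requests a second
done
```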
The failures that teach you the most
Kill a process. See if it restarts. See if anyone gets alerted. You’d be surprised how many teams discover their alerting doesn’t work during their first chaos experiment. We did.
Add network latency. 100ms, then 500ms, then 2 seconds. Watch what happens to upstream callers. Timeouts configured wrong will cascade faster than you can type kubectl get pods.
# Kill a process, see if your system notices
kill -9 $PID
# Add 100ms latency, see who complains (needs root)
tc qdisc add dev eth0 root netem delay 100ms
# ...and know the undo before you start
tc qdisc del dev eth0 root netem
Starve a service of memory or CPU. We had a data ingestion job at the fintech startup that would occasionally eat all available memory on the box. We only discovered this pattern because we simulated it first. Fixed it with proper resource limits before it hit production hard.
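You can rehearse that failure locally before it finds you. One low-tech sketch: cap a child process’s virtual memory with ulimit and confirm the allocation dies inside the limit instead of taking the host with it. (Tools like stress-ng do this more thoroughly; python3 here is just a convenient allocator.)

```shell
# Starvation sketch: cap virtual memory at 64 MB in a subshell, then
# try to grab ~100 MB inside it. The allocation should die, not the box.
outcome=$(
  ulimit -v 65536
  if python3 -c 'x = bytearray(100 * 1024 * 1024)' 2>/dev/null; then
    echo survived
  else
    echo killed
  fi
)
echo "allocation: $outcome"
```

The same idea, done properly, is what resource limits in your orchestrator give you for free.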
Cut off a dependency. What does your app do when Redis is gone? If the answer is “500 errors for everyone,” you’ve got work to do. If the answer is “falls back to the database, slightly slower,” you’re in good shape.
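On a test box you can make Redis vanish for real, for example by dropping its port with iptables -A OUTPUT -p tcp --dport 6379 -j DROP (and the matching -D to undo it). The fallback behavior you’re hoping to see looks roughly like this; redis_get and db_get are placeholders for real client calls, with the cache call failing on purpose to simulate the outage:

```shell
# Fallback sketch: try the cache, fall back to the primary store.
# redis_get fails on purpose to simulate Redis being unreachable.
redis_get() { false; }
db_get()    { echo "value-from-db"; }

value=$(redis_get || db_get)
echo "got: $value"   # degraded but alive, not a 500
```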
The only rule: you must be able to stop it
This is non-negotiable. If you can’t kill the experiment in seconds, don’t run it. Blast radius has to be small. Rollback has to be instant. Everyone on the team has to know the experiment is happening.
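The stop rule is easy to automate in a shell-driven experiment: register the revert command in a trap before you inject anything, so cleanup runs on a clean finish, a Ctrl-C, or a crash mid-script. A sketch with placeholder inject/revert functions; for the tc latency example, revert would be tc qdisc del dev eth0 root netem:

```shell
# Kill-switch sketch: the trap guarantees revert runs however the
# experiment ends. inject/revert are placeholders for real commands.
flag=$(mktemp)
inject() { echo active   > "$flag"; }
revert() { echo reverted > "$flag"; }

(
  trap revert EXIT INT TERM
  inject
  sleep 1            # the experiment window
)
cat "$flag"          # prints "reverted": the trap fired on exit
```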
We had a simple rule: no experiments without a Slack message first, and no experiments that touch more than one service at a time. Boring? Sure. But nobody got surprised, and we never turned a controlled test into a real incident.
Progress looks like: staging first, then one production instance, then a few, then wider. If staging experiments are still surprising you, you’re not ready for production. Stay there. There’s no shame in it.
Make it boring
The goal is for chaos experiments to become routine. Wednesday afternoon, we break something. Thursday morning, we file tickets for what we found. Next week, we verify the fixes. Repeat.
The best outcome is when experiments stop being interesting. When you kill a service and everything just… works. Traffic reroutes. Alerts fire. Dashboards light up. Recovery happens automatically. Boring. Beautiful.
That’s resilience. Not hoping things won’t break, but knowing exactly what happens when they do. Any team can get there. You just have to be willing to break things on purpose first.