Most Chaos Engineering Is Theater

3 min read
Tags: chaos-engineering, reliability, sre, hot-takes

Teams love saying they do chaos engineering. Few actually have hypotheses. Even fewer fix what they find.

Most teams doing “chaos engineering” are just killing pods and watching what happens. No hypothesis. No metrics. No follow-up. That’s not engineering. That’s a demo.

I’ve been running failure experiments at Decloud for about a year now. The honest version: the first three months were mostly us proving we had no idea how our own systems behaved. Which, I guess, is the point. But it only worked because we actually did something with what we learned.

The hypothesis problem

Here’s what separates real chaos engineering from the conference-talk version: you need a hypothesis before you break things.

“Let’s kill the cache and see what happens” isn’t a hypothesis. “If Redis goes down, API latency stays under 300ms because we fall back to direct DB reads” – that’s a hypothesis. It’s testable. It’s specific. And when it turns out to be wrong (it was wrong for us), you have a clear thing to fix.
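A hypothesis like that can live as a pass/fail check instead of a sentence in a doc. Here's a minimal sketch in Python; the endpoint URL, sample count, and helper names are stand-ins for illustration, not anything from our actual tooling:

```python
import statistics
import time
import urllib.request

# Hypothetical values -- substitute your own endpoint and budget.
API_URL = "http://localhost:8080/orders"
LATENCY_BUDGET_MS = 300
SAMPLES = 50

def measure_latencies(url: str, samples: int) -> list[float]:
    """Time a series of GET requests, returning per-request latency in ms."""
    latencies = []
    for _ in range(samples):
        start = time.perf_counter()
        urllib.request.urlopen(url, timeout=5).read()
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies

def hypothesis_holds(latencies: list[float]) -> bool:
    """The hypothesis itself: p99 latency stays under budget during the outage."""
    p99 = statistics.quantiles(latencies, n=100)[98]
    return p99 < LATENCY_BUDGET_MS
```

Run `measure_latencies` while the cache is down and feed the result to `hypothesis_holds`; a False is a finding, not a shrug.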

I’d estimate 80% of teams doing chaos engineering skip this step. They run Chaos Monkey, things break, they shrug, they move on. The experiment taught them nothing because they weren’t looking for anything specific.

What you actually need first

Before you even think about chaos experiments, you need:

  • Observability that works. If you can’t trace a request end to end and see error rates in near-real-time, you’re injecting failure into a black box. Pointless.
  • An abort button. Every experiment needs a kill switch with a defined threshold. Ours is simple: error rate over 5% for two minutes, we stop.
  • Someone on-call who knows it’s happening. Sounds obvious. We forgot once. Fun Slack thread.
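The abort rule above ("error rate over 5% for two minutes") is simple enough to state in a few lines of code. A sketch, assuming you already have an error-rate feed to poll; the class and thresholds here are illustrative, not our production implementation:

```python
import time

ERROR_RATE_THRESHOLD = 0.05   # 5%, per the rule described above
BREACH_WINDOW_SECONDS = 120   # two minutes

class AbortSwitch:
    """Abort when the error rate stays above threshold for a full window."""

    def __init__(self, threshold: float, window: float):
        self.threshold = threshold
        self.window = window
        self.breach_started = None  # when the current breach began, if any

    def should_abort(self, error_rate: float, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        if error_rate <= self.threshold:
            self.breach_started = None  # recovered; reset the clock
            return False
        if self.breach_started is None:
            self.breach_started = now   # breach just began
        return now - self.breach_started >= self.window
```

The reset-on-recovery matters: a brief spike shouldn't kill an experiment, but a sustained one must.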

If you don’t have these three things, you’re not ready. Go build observability first. That’s more valuable anyway.

Start boring

Our first real experiment was killing a single instance during off-peak hours. Incredibly boring. Also incredibly revealing – we discovered our health checks were too slow, so the load balancer kept routing traffic to the dead instance for almost 30 seconds.

Thirty seconds. For a health check issue. We would never have found that in staging because our staging environment has three instances and production has forty. Different behavior entirely.
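The arithmetic behind a finding like this is worth doing up front. A dead instance keeps receiving traffic until it fails enough consecutive probes; the numbers below are hypothetical values that happen to produce roughly the window we saw, not our actual config:

```python
def worst_case_detection_seconds(interval_s: float, failure_threshold: int) -> float:
    """Upper bound on how long a dead instance keeps receiving traffic:
    the instance can die just after passing a probe, then must fail
    `failure_threshold` consecutive probes before being marked unhealthy."""
    return interval_s * (failure_threshold + 1)

# Illustrative config: a 10s probe interval with 2 required failures
# already gives a 30-second worst case.
print(worst_case_detection_seconds(10, 2))  # 30.0
```

If that number is bigger than your latency SLO can tolerate, fix the probe config before running the experiment again.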

Second experiment: adding 500ms latency to our payment provider’s API. We thought we had a 2-second timeout. We didn’t. We had a 30-second default timeout from the HTTP client library that nobody had overridden. That one could have caused a real outage.
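The general lesson: never trust a client library's default timeout; set one explicitly. A sketch using Python's stdlib HTTP client as a stand-in for whatever client you use (the host and helper name are hypothetical):

```python
import http.client
import socket

# The budget we intended all along -- make it explicit, never implicit.
PAYMENT_API_TIMEOUT = 2.0  # seconds

def payment_connection(host: str) -> http.client.HTTPSConnection:
    """Build a connection with an explicit timeout instead of the library default."""
    return http.client.HTTPSConnection(host, timeout=PAYMENT_API_TIMEOUT)

# Without the argument, you inherit the global socket default, which is
# typically no timeout at all -- i.e., hang forever:
print(socket.getdefaulttimeout())  # usually None
```

A latency-injection experiment is exactly what surfaces this class of bug, because in normal operation the default never gets exercised.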

The part everyone skips

Running the experiment is the fun part. Fixing what you find is the boring part. Guess which one actually matters.

We track one metric religiously: percentage of chaos experiment findings that result in a shipped fix within 30 days. If that number drops below 80%, we stop running new experiments until we’ve caught up. No point discovering problems faster than you fix them.
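That metric is cheap to compute if you record two dates per finding. A minimal sketch; the dict schema and function names are made up for illustration, not a real tracker:

```python
from datetime import date, timedelta

FIX_SLA_DAYS = 30
PAUSE_THRESHOLD = 0.80  # below this, stop running new experiments

def fix_rate(findings: list[dict], today: date) -> float:
    """Share of findings past their 30-day window that shipped a fix in time.
    Each finding is a dict: {"found": date, "fixed": date or None}."""
    due = [f for f in findings
           if today - f["found"] >= timedelta(days=FIX_SLA_DAYS)]
    if not due:
        return 1.0  # nothing due yet; no reason to pause
    fixed_in_time = [
        f for f in due
        if f["fixed"] is not None
        and f["fixed"] - f["found"] <= timedelta(days=FIX_SLA_DAYS)
    ]
    return len(fixed_in_time) / len(due)

def should_pause_experiments(findings: list[dict], today: date) -> bool:
    return fix_rate(findings, today) < PAUSE_THRESHOLD
```

Note that findings still inside their 30-day window don't count against you; only overdue, unfixed ones drag the rate down.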

Most teams I’ve talked to don’t track this at all. They have a spreadsheet of “findings” from six months ago that nobody’s looked at since the game day. That’s not chaos engineering. That’s a compliance checkbox.

Keep it honest

Chaos engineering is a genuinely useful practice when done with discipline. Hypothesis, experiment, measure, fix. Four steps. Not complicated. But most of what I see in the wild skips steps one and four, which makes it just… breaking things for content.

If your chaos experiments aren’t changing how you build software, you’re doing it wrong.