Your Staging Environment Is Lying to You

| 5 min read |
testing production feature-flags deployment

Staging never catches the real bugs. Here's how I learned to test in production without burning everything down.

Two weeks into my EF batch in 2019, I pushed a payments integration for Decloud that had passed every test we had. Unit tests. Integration tests against a sandbox API. Manual QA on staging. All green.

It broke within forty minutes of hitting real users.

The sandbox API returned amounts as integers. The production API returned them as strings. Our staging environment had been confirming our assumptions, not testing our code. That fifteen-minute fire drill taught me more about testing strategy than any conference talk ever could.

Staging is a comfortable lie

I’ve been CTO at a fintech startup, a mobility startup, and now I’m building Decloud. Across all of them, the pattern repeats: staging looks just enough like production to give you confidence, but differs in all the ways that actually matter.

Different data shapes. Different load patterns. Different third-party API behaviors. Different timing. Different everything that’s hard to fake and easy to ignore.

The bugs that wake you up at 3am are never the ones your test suite catches. They are the ones that only exist when a real user in a real timezone hits a real edge case with real data. Staging can’t reproduce that. Full stop.

This isn’t an argument against pre-release testing. It’s an argument that pre-release testing is necessary but not sufficient. You need to verify behavior where it actually runs.

The rules I follow

After burning myself enough times – at the fintech startup, at Dropbyke, and now at Decloud – I’ve landed on a few non-negotiable principles.

Small blast radius, always. Start with one percent of traffic. Not ten. Not “just the beta users.” One percent. If something is broken, one percent is a learning opportunity. Ten percent is an incident. I learned this the hard way at Dropbyke, where a “small” rollout to a single city still meant thousands of angry riders.

Define success before you ship, not after. If you can’t write down what “working” looks like in two sentences, you don’t understand the change well enough to ship it. Error rate below X. Latency under Y. Conversion not worse than Z. Write it down. Pin it in Slack. Make it boring.
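Writing the criteria down in code makes them even harder to fudge after the fact. Here's a minimal sketch of what that might look like in Go; the struct name and every threshold are illustrative, not numbers from any real system.

```go
package main

import "fmt"

// SuccessCriteria pins down, before shipping, what "working" means.
// All thresholds here are made-up examples.
type SuccessCriteria struct {
	MaxErrorRate  float64 // fraction of requests allowed to fail
	MaxP99Latency float64 // milliseconds
	MinConversion float64 // must not drop below the pre-rollout baseline
}

// Healthy reports whether the observed metrics meet the criteria we
// committed to in advance. No post-hoc goalpost moving.
func (c SuccessCriteria) Healthy(errorRate, p99, conversion float64) bool {
	return errorRate <= c.MaxErrorRate &&
		p99 <= c.MaxP99Latency &&
		conversion >= c.MinConversion
}

func main() {
	crit := SuccessCriteria{MaxErrorRate: 0.005, MaxP99Latency: 400, MinConversion: 0.031}
	fmt.Println(crit.Healthy(0.002, 350, 0.033)) // true: within every threshold
	fmt.Println(crit.Healthy(0.002, 350, 0.020)) // false: conversion regressed
}
```

Two sentences of prose, three fields of code. If you can't fill in the struct, you aren't ready to ship.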

Rollback must be one click. If reverting a change requires a redeploy, a database migration, or waking someone up, your deployment pipeline isn’t ready for production testing. Feature flags make this trivial. We use them for everything at Decloud, even infrastructure changes.

Never create real side effects. Production tests must not send real emails, charge real cards, or corrupt real analytics. Synthetic users, idempotent operations, and explicit safety checks aren’t optional. They are the whole point.

How I actually do it

Feature flags and progressive rollout

This is the bread and butter. Ship the code dark, then turn it on for a sliver of traffic.

func handleCheckout(user User) Response {
    // Per-user flag check: the flag service decides whether this user
    // falls inside the current rollout percentage.
    if featureFlags.Enabled("new_checkout", user.ID) {
        return newCheckout(user)
    }
    // Flag off (or flag service unreachable): fall back to the
    // battle-tested path.
    return legacyCheckout(user)
}

I write this in Go because that’s what I reach for, but the idea is language-agnostic. The flag gives you a kill switch. The progressive rollout gives you data. Together, they turn a risky deployment into a controlled experiment.

We stage rollouts by percentage: 1%, 5%, 25%, 100%. Each step gets at least a few hours of observation. If the error rate ticks up at 5%, we kill it at 5%. No drama.
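The detail that makes staged percentages work is stable bucketing: the same user must stay in (or out of) the rollout as you ramp up, or your metrics compare different populations at every step. A common way to get that is hashing the user ID into a 0-99 bucket; this sketch uses FNV-1a, though any stable hash works.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bucket maps a user ID to a stable 0-99 bucket. Salting with the flag
// name keeps different rollouts from selecting the same users.
func bucket(flag, userID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(flag + ":" + userID))
	return h.Sum32() % 100
}

// enabled reports whether userID is inside the current rollout percentage.
// Because the bucket is stable, everyone in at 1% is still in at 5%.
func enabled(flag, userID string, percent uint32) bool {
	return bucket(flag, userID) < percent
}

func main() {
	for _, p := range []uint32{1, 5, 25, 100} {
		fmt.Printf("%d%%: user-42 in rollout = %v\n", p, enabled("new_checkout", "user-42", p))
	}
}
```

Notice there's no database of "rollout members" to maintain: membership is a pure function of the flag name and the user ID.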

Canary releases

When a change is bigger than a single feature – say, a new version of a service – canary deployments do the same thing at the infrastructure level. Route a small slice of traffic to the new version, compare its behavior against the baseline, and promote only when the numbers look right.
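The promotion decision itself can be boring code. Here's a deliberately simple sketch of the "do the numbers look right" check; the 10% relative tolerance is an arbitrary choice for illustration, and a real comparison would cover latency and business metrics too.

```go
package main

import "fmt"

// promoteCanary decides whether the canary's error rate is close enough
// to the baseline's to promote it. The 10% relative tolerance is an
// illustrative threshold, not a universal rule.
func promoteCanary(baselineErrRate, canaryErrRate float64) bool {
	return canaryErrRate <= baselineErrRate*1.10
}

func main() {
	fmt.Println(promoteCanary(0.010, 0.010)) // true: canary matches baseline
	fmt.Println(promoteCanary(0.010, 0.025)) // false: canary is clearly worse
}
```

The point isn't the arithmetic; it's that promotion is a comparison against the baseline, not a gut call made at 6pm on a Friday.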

Shadow traffic

For really scary changes, mirror production requests to the new code path without returning its results to users. Compare outputs offline. This is how I validated Decloud’s pricing engine rewrite: the old engine served every request while the new one ran in shadow mode for two weeks. We caught three discrepancies that would have been billing bugs.
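The mechanics are simple: serve every request from the old path, run the new path on the side, and log disagreements. The sketch below compresses that into one function with stand-in pricing engines (`price` and `shadowPrice` are invented for this example; the drift in `shadowPrice` is deliberate so the comparison has something to catch).

```go
package main

import (
	"fmt"
	"strconv"
)

// price is the stand-in legacy engine; shadowPrice is the rewrite under
// test, with a deliberate drift at large quantities for illustration.
func price(units int) int       { return units * 100 }
func shadowPrice(units int) int { return units*100 + units/50 }

// serve answers every request from the legacy engine, runs the new engine
// in shadow, and records any mismatch instead of showing it to the user.
func serve(units int, mismatches *[]string) int {
	old := price(units)
	// In a real service the shadow call runs async (e.g. in a goroutine)
	// so it can never slow down or break the user-facing response.
	if candidate := shadowPrice(units); candidate != old {
		*mismatches = append(*mismatches, "units="+strconv.Itoa(units))
	}
	return old // users only ever see the legacy result
}

func main() {
	var mismatches []string
	for _, u := range []int{1, 10, 50, 120} {
		serve(u, &mismatches)
	}
	fmt.Println("discrepancies:", mismatches) // discrepancies: [units=50 units=120]
}
```

Users never see the shadow output, so a buggy rewrite costs you a log line instead of a billing incident.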

Synthetic monitoring

Always-on health checks that hit your critical paths every minute. Signup. Login. Checkout. API token refresh. If synthetic checks fail, you know before your users do. This is table stakes but I’m amazed how many teams skip it.

When not to do this

Some things should never be tested in production, no matter how good your flags and rollbacks are.

Irreversible data migrations. Authorization changes. Billing logic where a bug means overcharging real people. Compliance workflows where a mistake has legal consequences. If you can’t undo it in thirty seconds, don’t experiment with it live.

These areas need exhaustive pre-release testing, careful code review, and a deployment plan that reads more like a checklist than a YOLO push.

What this actually requires

Testing in production isn’t a shortcut. It isn’t “move fast and break things.” It’s the opposite: move deliberately, instrument everything, and learn from real conditions while keeping the blast radius small enough that learning doesn’t become damage.

Every production system I’ve run – from the fintech startup’s financial news pipeline to Dropbyke’s real-time fleet tracking to whatever Decloud becomes – has taught me the same lesson. The gap between staging and production is where your real bugs live. You can ignore that gap and be surprised, or you can instrument it and be prepared.

I prefer being prepared. It’s less exciting, but I sleep better.