Feature Flags at Scale: What Nobody Warns You About

Last year at Decloud, I ran a script to count our feature flags. We had 847. Active, production feature flags spread across 23 services. I knew we had a lot. I didn’t know we had that many.

The worst part: when I cross-referenced with our flag metadata, only about 200 had an owner listed. The rest were orphans. Nobody knew who created them, what they controlled, or whether they were safe to remove. One flag had been set to “false” for 14 months. The code path behind it was unreachable. But nobody dared delete it because nobody knew what it did.

That’s what feature flags look like at scale when you don’t have discipline. And I’m writing this post because I don’t want you to end up there.

Flags are great. Stale flags are debt.

The pitch for feature flags is solid. Separate deployment from release. Ship code to production behind a flag. Roll out gradually. Roll back instantly. Run experiments. Gate features by customer plan. Use kill switches during incidents.

All of that works. I’m a believer. At Decloud, flags saved us multiple times during incidents. Flipping a kill switch is faster than deploying a rollback. Having a gradual rollout catch a regression at 5% of traffic instead of 100% is genuinely valuable.

The problem isn’t the concept. The problem is lifecycle management. Flags are easy to create. Nobody wants to remove them.

Categorize your flags or drown

After the 847-flag audit, we introduced four categories at Decloud:

Release flags – short-lived. Control rollout of a new feature. Expected lifetime: days to weeks. Must be removed once the feature is fully rolled out.

Experiment flags – medium-lived. Power A/B tests. Expected lifetime: duration of the experiment. Must be retired with a clear outcome documented.

Operational flags – potentially long-lived. Kill switches, circuit breakers, graceful degradation. These can stay, but they need periodic review.

Permission flags – permanent. Gate access by customer plan or contract. These are really entitlements, not feature flags, and should be treated differently in your code.

The categories matter because the cleanup rules are different. A release flag that’s still around after a month is a problem. An operational kill switch that’s been around for a year is fine.

The rules that saved us

After the cleanup, we implemented a simple policy. Every new flag must have:

An owner (the person who will remove it)
A type (release, experiment, operational, permission)
An expected removal date (except operational and permission flags)
A documented default value and failure mode

If any of those are missing, the flag can’t be created. We enforced this in our flag management wrapper.

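import (
    "errors"
    "time"
)

// FlagType, FailMode, and FlagService are defined elsewhere in the wrapper.
type Flag struct {
    Name       string
    Owner      string    // the person who will remove the flag
    Type       FlagType  // Release, Experiment, Operational, or Permission
    RemoveBy   time.Time // zero for operational/permission
    DefaultVal bool
    FailMode   FailMode // FailOpen or FailClosed
}

// Create rejects flags that are missing required metadata.
func (s *FlagService) Create(f Flag) error {
    if f.Owner == "" {
        return errors.New("flag must have an owner")
    }
    // Release and experiment flags are temporary, so both need a removal date.
    if (f.Type == Release || f.Type == Experiment) && f.RemoveBy.IsZero() {
        return errors.New("release and experiment flags must have a removal date")
    }
    // ... (remaining checks and persistence elided)
    return nil
}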

We also added a weekly report: flags past their removal date, flags with no evaluations in the last 30 days, and flags where the “off” path has never been exercised in production. That last one is subtle but important – if you’ve never seen the fallback path in production, you don’t actually know it works.

Fail-open or fail-closed

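This is the decision most teams skip. What happens when the flag service is down? Fail-open means the caller falls back to the flag’s documented default and keeps serving. Fail-closed means it refuses to proceed and returns an error.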

For a payment flow, you probably want fail-closed. Better to show an error than process a payment through an untested code path.

For a UI experiment, fail-open makes sense. Show the default experience. Nobody notices.

At one enterprise, the flag service went down for 20 minutes and their checkout flow broke because every flag defaulted to “false” – including the flag that enabled their only working checkout flow. They’d never tested what “all flags off” looked like. Don’t be that team.

The cleanup problem is a people problem

Technical solutions help. Expiration dates, automated reports, lint rules that flag dead code paths. But the real problem is incentives.

Nobody gets credit for removing a feature flag. Shipping a new feature is visible. Cleaning up after it isn’t. So flags accumulate.

At Decloud, we made flag cleanup part of the definition of done. A feature isn’t finished when it’s rolled out to 100%. It’s finished when the flag is removed and the old code path is deleted. We tracked “flag debt” on our engineering dashboard alongside other health metrics.

At larger organizations, I recommend a quarterly flag review. Pull up the full inventory. For every flag without a recent evaluation, ask: is this still needed? For every flag past its removal date, assign cleanup as a sprint task. It’s unglamorous work. It prevents the kind of mess that makes your codebase feel hostile to new engineers.

Keep it simple

The best feature flag setup I’ve seen is also the simplest. Boolean flags with deterministic evaluation. Clear categories. Mandatory ownership. Aggressive cleanup.
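For gradual rollouts, one common way to get deterministic evaluation is to hash the flag name and user ID into a fixed bucket. The sketch below shows that general technique under my own assumptions (FNV-1a hashing, 100 buckets, a hypothetical InRollout function); it is not a description of any specific flag product.

```go
package main

import "hash/fnv"

// InRollout reports whether userID falls inside the first `percent`
// buckets (out of 100) for the given flag. The same inputs always give
// the same answer, with no stored state and no randomness.
func InRollout(flagName, userID string, percent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(flagName))
	h.Write([]byte{0}) // separator so ("ab", "c") and ("a", "bc") hash differently
	h.Write([]byte(userID))
	return h.Sum32()%100 < percent
}
```

Because the bucket is fixed per user and flag, raising the percentage from 5 to 50 only adds users and never removes anyone who was already in. Including the flag name in the hash keeps the same users from always being first in line for every rollout.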

The worst setups are the ones with complex targeting rules, flags that depend on other flags, and evaluation logic spread across three different layers. If a single flag requires a whiteboard to explain, break it into smaller flags or put the complexity in application code where it can be tested properly.

Feature flags are a powerful tool. They’re also a maintenance commitment. Treat them like production configuration – because that’s exactly what they are.