I keep seeing the same pattern. A VP reads the DORA report, gets excited, mandates dashboards for all teams, and within two months the metrics are either gamed, ignored, or weaponized in performance reviews. A genuinely useful framework gets turned into a surveillance tool. It’s frustrating to watch.
DORA metrics – deployment frequency, lead time for changes, change failure rate, time to restore – are good because they measure system outcomes. Not individual output. Not lines of code. Not story points. System outcomes. The moment you point them at individual engineers, you’ve already lost the plot.
The four metrics, briefly
Deployment frequency: how often code reaches production. Higher is generally better because it implies smaller batches and faster feedback. But a team deploying empty feature flags ten times a day isn’t actually faster.
Lead time for changes: the time from commit to production. This one is revealing. When lead time is high, the bottleneck is almost never “engineers are slow.” It’s reviews, environments, approval gates, flaky tests. Infrastructure problems disguised as people problems.
Change failure rate: the percentage of deployments that cause incidents or need rollback. The quality counterbalance to speed. If you’re deploying constantly but breaking things, you aren’t winning.
Time to restore: how quickly you recover after a failure. I care about this one the most. Your system will break. The question is whether you can fix it in minutes or hours.
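All four metrics fall out of three event streams: commits, deployments, and incidents. Here's a minimal sketch of the arithmetic in Python — the record shapes are illustrative, not a standard schema; your CI/CD and incident tools define the real fields:

```python
from datetime import datetime, timedelta
from statistics import median

# Illustrative records -- in practice these come from your CI/CD system
# and incident tracker, with whatever field names those tools use.
deploys = [
    {"sha": "a1f9", "committed": datetime(2024, 3, 1, 9, 0),
     "deployed": datetime(2024, 3, 1, 14, 0), "failed": False},
    {"sha": "b2c8", "committed": datetime(2024, 3, 2, 10, 0),
     "deployed": datetime(2024, 3, 3, 10, 0), "failed": True},
    {"sha": "c3d7", "committed": datetime(2024, 3, 4, 11, 0),
     "deployed": datetime(2024, 3, 4, 12, 30), "failed": False},
]
incidents = [
    {"opened": datetime(2024, 3, 3, 10, 5),
     "resolved": datetime(2024, 3, 3, 10, 50)},
]

window_days = 30
# Deployment frequency: deploys per day over the window.
deployment_frequency = len(deploys) / window_days
# Lead time for changes: commit to production, median across deploys.
lead_time = median(d["deployed"] - d["committed"] for d in deploys)
# Change failure rate: fraction of deploys that caused trouble.
change_failure_rate = sum(d["failed"] for d in deploys) / len(deploys)
# Time to restore: how long incidents stay open, median.
time_to_restore = median(i["resolved"] - i["opened"] for i in incidents)
```

Nothing here is clever, and that's deliberate: the hard part is agreeing on what goes into `deploys` and `incidents`, not the math.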
The measurement trap
Here is where teams go wrong. They start measuring before they agree on definitions.
What counts as a deployment? Does a config change count? A feature flag flip? What about a deployment to staging that auto-promotes? Is a deployment “to production” when it hits the first canary or when it reaches 100% of traffic?
What counts as a failure? Only P1 incidents? Any rollback? A hotfix that ships within an hour?
If you don’t write these definitions down and get agreement, every team will interpret the metrics differently. Then someone will compare teams, the numbers will be meaningless, and trust in the whole system will collapse.
I watched exactly this happen at one company. Two teams reported wildly different deployment frequencies. Turned out one team counted every Kubernetes pod restart as a deployment. The other only counted manual releases. Same dashboard, completely different realities.
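One way to force the agreement is to make the definition executable, so there's exactly one place the rules live. A sketch — the event fields and environment names are hypothetical, and the flag-flip rule is a choice each org has to make explicitly:

```python
# An executable, written-down definition of "deployment".
# Field names and env names are illustrative; adapt to your tooling.
PRODUCTION_ENVS = {"prod", "prod-eu"}

def counts_as_deployment(event: dict) -> bool:
    """Agreed definition: a change-carrying release to a production env.

    Pod restarts (same artifact, no new code) don't count.
    Flag flips don't count here -- your team may decide otherwise,
    but decide once, in writing, in this function.
    """
    if event.get("environment") not in PRODUCTION_ENVS:
        return False
    if event.get("kind") == "pod_restart":
        return False
    if event.get("kind") == "flag_flip":
        return False
    return True
```

With this in place, the pod-restart team and the manual-release team would have produced comparable numbers — or at least would have had the definitional argument up front, where it belongs.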
What actually works
Use rolling 30-day medians, not averages. Averages lie when you have one outlier incident that takes a week to resolve. Medians are boring and that’s the point.
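To see how badly one outlier distorts an average, take six routine recoveries and one week-long outage:

```python
from statistics import mean, median

# Time-to-restore samples in minutes: six routine fixes, one week-long outage.
restore_minutes = [12, 8, 15, 10, 9, 14, 7 * 24 * 60]  # last entry: 10,080 min

print(round(mean(restore_minutes)))  # 1450 -- dragged up by the single outage
print(median(restore_minutes))       # 12 -- what a typical recovery looks like
```

The mean says recovery takes a day; the median says twelve minutes. The outage still matters — it just belongs in an incident review, not smeared across your trend line.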
Keep the metrics at the team or service level. Never at the individual level. The second you rank individuals by lead time, someone will start gaming commit timestamps. I’ve seen it happen.
Track trends against a team’s own baseline. Comparing Team A to Team B is almost always misleading. Teams have different codebases, different customer profiles, different risk tolerances. A payments team deploying twice a week with zero failures is doing better than a marketing team deploying daily with a 15% failure rate, even though the dashboard would suggest otherwise.
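In practice this means reporting each team's current window against its own prior baseline, not against another team's number. A sketch, with made-up lead times in hours:

```python
from statistics import median

def trend_vs_baseline(current_window: list[float],
                      baseline_window: list[float]) -> float:
    """Percent change of the current median against the team's own
    earlier baseline median. Negative means improvement for lead time."""
    base = median(baseline_window)
    return 100 * (median(current_window) - base) / base

# Lead times in hours: this team is improving against itself,
# regardless of what any other team's dashboard shows.
baseline = [30, 28, 35, 40, 26]  # prior quarter, median 30
current = [12, 9, 15, 10, 14]    # this month, median 12
print(trend_vs_baseline(current, baseline))  # -60.0
```

A minus-60% trend is a story worth telling in a retro; an absolute "12 hours" means nothing without the team's own history behind it.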
Automate collection. If someone has to manually tag deployments or log incidents for the metrics to work, the data will be inconsistent within a month. Pull from your CI/CD system, your version control, and your incident tracker. If those systems don’t have the events you need, fix that first.
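The collection layer itself can be small: normalize whatever your tools export into one uniform event stream and compute from that. The export format and field names below are hypothetical — the real source is whatever API your CI system actually exposes:

```python
import json
from datetime import datetime

def normalize_deploys(jsonl: str) -> list[dict]:
    """Turn a CI export (one JSON object per line, hypothetical fields)
    into uniform deploy events the metrics code can consume."""
    events = []
    for line in jsonl.strip().splitlines():
        raw = json.loads(line)
        events.append({
            "kind": "deploy",
            "sha": raw["commit_sha"],
            "at": datetime.fromisoformat(raw["finished_at"]),
        })
    return events

sample = '{"commit_sha": "a1f9", "finished_at": "2024-03-01T14:00:00"}'
print(normalize_deploys(sample)[0]["sha"])  # a1f9
```

The point is the shape of the pipeline, not this particular parser: events flow from systems of record, nobody hand-tags anything, and a `KeyError` here tells you your tooling is missing data you need — which, per the advice above, is the thing to fix first.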
The traps to avoid
Don’t use DORA for performance reviews. I can’t say this enough. The moment engineers know their deployment frequency affects their bonus, they will split PRs into meaningless fragments. You’ll get higher numbers and worse software.
Don’t optimize one metric in isolation. Deployment frequency without change failure rate is just recklessness. Low lead time without quality is just speed for its own sake. They work as a set or not at all.
Don’t hide incidents. If the change failure rate shows up in some leadership scorecard, teams will reclassify incidents as “expected behavior” or “not caused by the deployment.” I’ve seen teams debate for twenty minutes whether something was “really an incident” just to protect their number.
Don’t expect precision. These are directional signals. Trends matter. A lead time that drops from days to hours over a quarter is meaningful. The difference between 2.3 hours and 2.7 hours is noise. Treat it as noise.
The honest approach
Pair DORA metrics with qualitative feedback. Ask teams: “What’s slowing you down this sprint?” The metrics should confirm what teams already feel. If the numbers say everything is fine but engineers are miserable, the numbers are wrong – or more likely, they’re measuring the wrong thing.
Use the metrics to fund improvements. Lead time is high because the test suite takes 40 minutes? That’s a concrete investment case. Change failure rate is climbing? Maybe the team needs better staging environments, not a talking-to.
DORA is a compass, not a GPS. It tells you roughly which direction you’re heading. It doesn’t tell you where every pothole is. Use it that way and it’s genuinely valuable. Overfit to the numbers and you’ll optimize for the dashboard while the actual engineering culture rots underneath.