At Decloud, our first deployment pipeline was a shell script called deploy.sh. It was 140 lines of bash with three sleep statements in it, and it worked fine until it didn’t.
The day it didn’t was a Tuesday. Someone pushed a config change that looked innocent in the PR – an environment variable rename. The script deployed it to all pods simultaneously. The new variable name didn’t match what the application expected at startup. Every pod crashed. Recovery took 45 minutes because the script had no rollback logic. It just pushed forward.
That was the week I decided to actually invest in deployment infrastructure.
Quick take
GitOps (Git as the source of truth for cluster state) combined with progressive delivery (canary rollouts with automated analysis) gives you deployments that are auditable, reversible, and safe by default. We set this up at Decloud with Argo CD and Argo Rollouts. It took about three weeks to get right. It has prevented more incidents than I can count.
GitOps: the boring part that matters most
GitOps is an operating model. The desired state of your system lives in a Git repository. A controller running in the cluster watches that repo and continuously reconciles the actual state toward the declared state.
That’s it. No magic. Just four properties:
- Declarative. You describe what you want, not how to get there.
- Versioned. Every change has a commit, a diff, and a reviewer.
- Pull-based. The cluster pulls state from Git. No CI system needs credentials to push into production.
- Self-healing. Someone kubectl-edits a deployment? The controller reverts it. Drift is detected and corrected automatically.
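On the Argo CD side, these properties map onto a single Application resource. A rough sketch, with automated sync, pruning, and self-healing enabled — the repo URL, paths, and names here are hypothetical, not our actual config:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-gateway
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/config-repo.git   # hypothetical config repo
    targetRevision: main
    path: services/api-gateway/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: api-gateway
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual drift back to the declared state
```

`selfHeal: true` is what makes the kubectl-edit scenario above a non-event: the controller notices the drift and converges back to what Git declares.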
We use Argo CD for this. Flux is also solid. The tool matters less than the discipline of treating Git as the single source of truth for everything running in the cluster.
The immediate win was audit trail. Before GitOps, answering “what changed and when” during an incident meant digging through CI logs, Slack messages, and someone’s memory. After GitOps, it was git log. Every deployment was a commit. Every rollback was a revert.
Progressive delivery: the part that keeps you safe
GitOps tells you what should be running. Progressive delivery controls how new versions are introduced. Instead of flipping all traffic at once – which is what our bash script did – you advance in stages and check real signals at each step.
At Decloud we use canary deployments. A new version gets 20% of traffic. We wait three minutes. If error rates and latency look normal, we go to 60%. Another five minutes. Then full rollout.
Here is roughly what the Argo Rollout spec looks like:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-gateway
spec:
  replicas: 6
  strategy:
    canary:
      steps:
        - setWeight: 20
        - pause: { duration: 3m }
        - analysis:
            templates:
              - templateName: error-rate-check
        - setWeight: 60
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: latency-check
```
The analysis steps query Prometheus. If error rate exceeds the threshold, the rollout automatically aborts and rolls back. No human intervention. No 2am pages.
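An analysis template like `error-rate-check` might look roughly like this sketch — the metric name, threshold, and Prometheus address are illustrative assumptions, not our exact production values:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  metrics:
    - name: error-rate
      interval: 30s        # re-evaluate every 30 seconds
      count: 5             # take five measurements
      failureLimit: 1      # a second failure aborts the rollout
      successCondition: result[0] < 0.01   # under 1% 5xx responses
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090   # assumed in-cluster address
          query: |
            sum(rate(http_requests_total{job="api-gateway",code=~"5.."}[2m]))
            /
            sum(rate(http_requests_total{job="api-gateway"}[2m]))
```

When `successCondition` fails past the `failureLimit`, Argo Rollouts aborts the canary and shifts traffic back to the stable version.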
The key insight: rollout strategy, analysis thresholds, and promotion rules are all in Git. They are versioned, reviewed, and auditable. Nobody is hand-tuning production.
How the pieces fit together
The full flow:
- CI builds the image, pushes to the registry, and opens a PR against the config repo with the new image tag.
- Someone reviews and merges the PR.
- Argo CD detects the new commit and syncs the cluster.
- Argo Rollouts executes the canary steps.
- Automated analysis gates decide promote or rollback.
- A Slack notification tells the team what happened.
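The Slack step in that flow can be wired through Argo CD Notifications. A rough sketch of the relevant ConfigMap — the trigger condition and message template here are illustrative, and the Slack token lives in a separate secret, not in Git:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  service.slack: |
    token: $slack-token          # resolved from the notifications secret
  trigger.on-deployed: |
    - when: app.status.operationState.phase in ['Succeeded'] and app.status.health.status == 'Healthy'
      send: [app-deployed]
  template.app-deployed: |
    message: "{{.app.metadata.name}} synced to {{.app.status.sync.revision}}"
```

Individual Applications then opt in with a subscription annotation such as `notifications.argoproj.io/subscribe.on-deployed.slack: <channel>`.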
Deployments are deterministic. If you run the same commit twice, you get the same result. If something goes wrong, you revert the commit and the cluster converges to the previous state.
Repo structure – opinions I’ve earned
We tried monorepo (app code and manifests together) first. It worked for three services. By service ten, merge conflicts on shared infrastructure files were constant.
We switched to a split model: application repos hold code and build configs, a separate config repo holds Kubernetes manifests organized by environment. CI bridges the two by updating the config repo when a new image is built.
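Concretely, the file CI touches can be a Kustomize overlay whose image tag is the only thing that changes per release. A sketch, with hypothetical paths and image names:

```yaml
# services/api-gateway/overlays/production/kustomization.yaml (illustrative path)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: example/api-gateway                      # name referenced in the base manifests
    newName: registry.example.com/api-gateway      # hypothetical registry
    newTag: "a1b2c3d"                              # CI rewrites only this tag each build
```

CI can update the tag with `kustomize edit set image` and open the PR; the diff reviewers see is a single-line tag bump.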
Some teams use “app of apps” patterns with Argo CD. That works if you have a platform team managing the hierarchy. For smaller teams, keep it simple. One repo, one folder per service, one overlay per environment.
Whatever you choose, optimize for clear ownership. If two teams are constantly editing the same files, your structure is wrong.
The guardrails you actually need
After running this for over a year, here are the guardrails that mattered in practice:
- Meaningful success metrics. Error rate and latency are baseline. Add business-level signals if you can. We check order completion rates during canary for our commerce services.
- Tested rollback thresholds. If you have never seen an automated rollback trigger, your thresholds are either too loose or your analysis is broken. We intentionally deploy known-bad versions to staging to verify the rollback path works.
- A manual override. Automated analysis is great until the metrics are noisy from an unrelated issue. You need a way to pause, skip, or manually promote. Argo Rollouts supports all of these.
- No secrets in Git. We use Sealed Secrets for Kubernetes secrets. External Secrets Operator is another option. The point is that credentials never appear in the config repo, not even encrypted in a way that a casual contributor can decrypt.
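What actually lands in the config repo with Sealed Secrets is a SealedSecret resource, whose ciphertext only the in-cluster controller can decrypt. A sketch — the names are hypothetical and the ciphertext is a placeholder, not real output:

```yaml
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: api-gateway-db
  namespace: api-gateway
spec:
  encryptedData:
    DATABASE_URL: AgB4...   # placeholder; real ciphertext is produced by kubeseal
```

The controller decrypts this into a regular Kubernetes Secret inside the cluster; committing the sealed form is safe because the private key never leaves the controller.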
- Blast radius limits. Namespace isolation. Resource quotas. A bad deployment in one service shouldn’t cascade.
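The blast-radius guardrail is mostly plain Kubernetes. A minimal sketch of a per-namespace quota, with illustrative limits:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: api-gateway-quota
  namespace: api-gateway
spec:
  hard:
    pods: "20"            # a runaway rollout can't flood the cluster
    requests.cpu: "8"
    requests.memory: 16Gi
```

With one namespace and one quota per service, a bad deployment exhausts its own budget before it can starve its neighbors.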
What still goes wrong
I would be lying if I said this setup is bulletproof.
The most common problem: metrics are too noisy. If your baseline error rate fluctuates by 2% normally, setting a rollback threshold at 1% means every deployment gets aborted. Tuning thresholds is an ongoing process, not a one-time setup.
Second most common: someone makes a manual change in the cluster. Argo CD reverts it. The person who made the change gets confused. This is actually the system working correctly, but it requires educating the team that kubectl edit is no longer a valid workflow.
Third: configuration sprawl. As the number of services grows, the config repo gets messy. Kustomize overlays help, but you need discipline about what belongs in base vs. overlay and when to refactor.
Was it worth it?
Our deployment frequency went from twice a week to multiple times a day. Our incident rate from bad deployments dropped to near zero. The 45-minute outage from that config variable rename? It would be caught in the first canary step now, rolled back automatically in under a minute, and nobody would lose sleep over it.
Three weeks of setup for that kind of confidence. I would make that trade again every time.