Quick take
If your deployment process involves someone running kubectl on a laptop, you don’t have a deployment process. You have a person. GitOps replaces that person with a Git repo and a controller that actually remembers what it did.
I have a confession. At the fintech startup, our “deployment pipeline” for the first few months was me running kubectl apply from my MacBook at 11pm. I had a checklist in Notes.app. It worked until I forgot step 4 one night and took down the price feed for 45 minutes.
That incident was the push I needed. Not because the outage was catastrophic – it wasn’t. But because I realized that if I got hit by a bus, nobody else could deploy. The entire process lived in my head and my shell history.
GitOps fixed that. Not overnight, but permanently.
What GitOps actually is (and isn’t)
GitOps isn’t a product you install. It’s a principle: Git is the single source of truth for what your infrastructure should look like, and a controller continuously reconciles reality to match.
That’s it. Everything else is implementation detail.
The word gets thrown around a lot in 2019, mostly by vendors trying to sell dashboards. Strip away the marketing and you get three rules:
- Declare your desired state in files checked into Git
- Run a controller that watches those files and applies changes automatically
- Never touch the cluster directly
If a change isn’t in Git, it didn’t happen. Full stop.
Why pull-based beats push-based
Most CI/CD pipelines are push-based. Jenkins builds your image, then runs kubectl apply against the cluster. The problem: Jenkins now needs cluster credentials, network access, and the ability to mutate production state. Your CI system becomes a god-mode attack surface.
Pull-based GitOps flips this. The controller runs inside the cluster. It watches a Git repo and pulls changes when it sees a new commit. The cluster reaches out to Git, not the other way around.
Push model: CI --> kubectl apply --> Cluster
Pull model: Cluster <-- controller watches <-- Git repo
The pull model wins on security, stability, and performance simultaneously. No cluster credentials in CI. No network path from build servers to production. And the controller is smart enough to apply only diffs, so it’s faster than re-applying everything.
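The core of the pull model is a loop: fetch the repo, ask Git what changed since the last sync, apply only that. A toy illustration of the detection half, using a scratch repo in place of the real infra repo (nothing here touches a cluster; a real controller like Flux adds proper state tracking, backoff, and pruning):

```shell
# Toy sketch of pull-based change detection -- not a real controller.
# A scratch Git repo stands in for the infra repo.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "[email protected]"
git config user.name "demo"

# Commit 1: the currently-synced state.
mkdir -p apps/api-gateway/overlays/production
echo "newTag: v0.8.3" > apps/api-gateway/overlays/production/kustomization.yaml
git add -A && git commit -qm "deploy: v0.8.3"

# Commit 2: someone merges a version bump.
echo "newTag: v0.8.4" > apps/api-gateway/overlays/production/kustomization.yaml
git add -A && git commit -qm "deploy: v0.8.4"

# The controller's core question: which manifest paths changed since the
# last-synced commit? Only those need a re-apply.
changed=$(git diff --name-only HEAD~1 HEAD)
echo "$changed"
# In a real controller this would feed into:
#   kustomize build <path> | kubectl apply -f -
```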
Setting it up: a real example
Here is the repo layout I settled on at Decloud after trying three different approaches. We run Go microservices on Kubernetes, so this is Kustomize-flavored.
infra/
apps/
api-gateway/
base/
deployment.yaml
service.yaml
kustomization.yaml
overlays/
staging/
kustomization.yaml
replicas.yaml
production/
kustomization.yaml
replicas.yaml
hpa.yaml
auth-service/
base/
...
overlays/
...
platform/
namespaces/
network-policies/
rbac/
The apps/ directory is owned by service teams. The platform/ directory is owned by me (or whoever is on infra that week). This separation matters more than any specific tool choice.
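For completeness, the base kustomization.yaml referenced in the tree above is nothing more than a resource list (a sketch, assuming the file names shown in the layout):

```yaml
# apps/api-gateway/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
```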
The base deployment
Nothing exotic here. A standard Kubernetes deployment with explicit resource requests, health checks, and a pinned image tag. No latest. Ever.
# apps/api-gateway/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-gateway
spec:
replicas: 2
selector:
matchLabels:
app: api-gateway
template:
metadata:
labels:
app: api-gateway
spec:
containers:
- name: api-gateway
image: registry.decloud.io/api-gateway:v0.8.3
ports:
- containerPort: 8080
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "500m"
memory: "256Mi"
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 10
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
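The companion service.yaml from the tree is equally plain. A sketch matching the labels and container port above (the external port 80 is an assumption, not something from the original layout):

```yaml
# apps/api-gateway/base/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: api-gateway
spec:
  selector:
    app: api-gateway
  ports:
    - port: 80
      targetPort: 8080
```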
The production overlay
Production gets more replicas, an HPA, and a specific image tag. The overlay touches only what differs.
# apps/api-gateway/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: production
resources:
- ../../base
- hpa.yaml
patchesStrategicMerge:
- replicas.yaml
images:
- name: registry.decloud.io/api-gateway
newTag: v0.8.3
This is the file that changes on every deploy. A version bump is a one-line diff in a pull request. Reviewable. Revertable. Boring.
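The overlay references two files that aren't shown above. A sketch of what they might contain under this layout (the replica counts and autoscaling thresholds are illustrative, not from the original repo):

```yaml
# apps/api-gateway/overlays/production/replicas.yaml
# Strategic-merge patch: override only the replica count from base.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
spec:
  replicas: 4

# apps/api-gateway/overlays/production/hpa.yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 4
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
```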
Choosing a controller: Flux vs Argo CD
In early 2019, the two serious options are Flux and Argo CD. I’ve run both.
Flux is simple. It watches a Git repo, syncs to a cluster, done. Installing it is one command:
fluxctl install \
--git-url=[email protected]:decloud/infra.git \
--git-branch=master \
--git-path=apps/ \
--namespace=flux \
| kubectl apply -f -
Flux feels like a Unix tool. It does one thing. If you want a UI or multi-cluster visibility, look elsewhere.
Argo CD is heavier but more featureful. You define Applications as custom resources:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: api-gateway
namespace: argocd
spec:
source:
repoURL: https://github.com/decloud/infra
targetRevision: master
path: apps/api-gateway/overlays/production
destination:
server: https://kubernetes.default.svc
namespace: production
syncPolicy:
automated:
prune: true
selfHeal: true
The selfHeal: true flag is the killer feature. Someone runs a manual kubectl edit in production? Argo CD reverts it within seconds. That alone is worth the setup cost.
My recommendation: start with Flux if you have one cluster and want simplicity. Move to Argo CD when you need multi-cluster, a UI for the team, or you’re tired of explaining to people what state the cluster is in.
We started with Flux. Moved to Argo CD three months later. No regrets on either decision.
The deployment workflow
Here is how a deploy actually works once this is set up:
# Developer bumps the image tag after CI builds a new version
git checkout -b deploy/api-gateway-v0.8.4
vim apps/api-gateway/overlays/production/kustomization.yaml
# Change newTag: v0.8.3 -> v0.8.4
git add -A && git commit -m "deploy: api-gateway v0.8.4"
git push origin deploy/api-gateway-v0.8.4
# Open PR, get review, merge
After merge, the controller picks up the change within 60 seconds (configurable). The rollout happens. Monitoring confirms. Done.
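The manual vim edit in the workflow above is easy to script. A hypothetical helper that does the tag bump with sed, demonstrated against a scratch copy of the overlay file (kustomize's own `kustomize edit set image` does this more robustly; this is just a sketch of the idea):

```shell
# Hypothetical helper: rewrite the newTag line in a kustomization.yaml.
# bump_tag <path-to-kustomization.yaml> <new-tag>
bump_tag() {
  local file=$1 new_tag=$2
  # Keep leading indentation, replace everything after "newTag:".
  sed -i.bak "s/^\([[:space:]]*newTag:\).*/\1 ${new_tag}/" "$file" \
    && rm -f "${file}.bak"
}

# Demo against a scratch copy of the production overlay:
tmp=$(mktemp -d)
cat > "$tmp/kustomization.yaml" <<'EOF'
images:
  - name: registry.decloud.io/api-gateway
    newTag: v0.8.3
EOF

bump_tag "$tmp/kustomization.yaml" v0.8.4
grep newTag "$tmp/kustomization.yaml"
```

After running bump_tag, you would commit the change on a branch and open a PR exactly as in the workflow above; the script only replaces the vim step.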
Compare that to the old world:
# The old way -- do not do this
kubectl set image deployment/api-gateway \
api-gateway=registry.decloud.io/api-gateway:v0.8.4 \
-n production
# Hope you remember what you just did
# Hope the next person knows too
The first approach gives you an author, a reviewer, a timestamp, a diff, and a one-command rollback (git revert). The second gives you nothing.
Secrets: the hard part
Secrets are where GitOps gets uncomfortable. You want everything in Git, but you can’t commit plaintext credentials. In 2019, the options are:
Mozilla SOPS – encrypts values in-place using KMS or PGP. The file structure stays readable, only the values are encrypted. This is what I use.
# secrets.enc.yaml (encrypted with SOPS)
apiVersion: v1
kind: Secret
metadata:
name: api-gateway-secrets
data:
DATABASE_URL: ENC[AES256_GCM,data:abc123...,type:str]
API_KEY: ENC[AES256_GCM,data:def456...,type:str]
sops:
kms:
- arn: arn:aws:kms:eu-west-1:123:key/abc-def
Bitnami Sealed Secrets – encrypts secrets with a cluster-side key. Safe to commit the encrypted form. Decryption only happens inside the cluster.
External secret managers – Vault, AWS Secrets Manager, etc. The secret reference lives in Git, the actual value lives elsewhere. More moving parts, but better for large orgs.
Pick one. Don’t commit plaintext. I’ve seen production database passwords in public GitHub repos more times than I want to admit.
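For comparison, the Sealed Secrets form of the same secret looks like this (the ciphertext values are placeholders; in practice they are produced by piping a plain Secret through the kubeseal CLI):

```yaml
# Produced by: kubeseal < secret.yaml > sealed-secret.yaml
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: api-gateway-secrets
  namespace: production
spec:
  encryptedData:
    DATABASE_URL: AgB4x...   # placeholder ciphertext
    API_KEY: AgC9z...        # placeholder ciphertext
```

Only the controller inside the cluster holds the private key, so the encrypted form is safe to commit.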
Handling emergencies
Sometimes you need to bypass the process. A P1 incident at 3am isn’t the time to open a pull request.
The rule: fix it now, commit it immediately after.
# 1. Fix the fire
kubectl rollout undo deployment/api-gateway -n production
# 2. Immediately capture in Git (before you forget or go back to sleep)
git checkout -b hotfix/api-gateway-rollback
# Update manifests to match the rolled-back state
git commit -m "hotfix: rollback api-gateway to v0.8.3 (incident #47)"
git push  # then open a PR
# 3. Write a note in the incident channel explaining why you bypassed GitOps
If you skip step 2, the controller will eventually reconcile and re-deploy the broken version. I learned this the hard way during my EF batch – we had a demo the next morning and I rolled back manually without committing. Woke up to the broken version redeployed by Flux at 6am. Fun times.
Drift detection is a feature, not a bug
When you first enable GitOps, the controller will fight you. It will revert manual changes, flag resources you forgot about, and generally make noise.
Good. That noise is telling you things:
- A developer ran kubectl edit because the PR process felt too slow. Fix the process.
- A resource exists in the cluster that isn’t in Git. Either add it to Git or delete it.
- Someone created a secret manually during an incident. Capture it properly.
Drift detection is your audit system running in real time. Treat every drift event as a process improvement opportunity, not an annoyance to suppress.
Repo structure: mono vs split
I tried three approaches before settling:
Mono-repo (app + infra together) – Simple for tiny teams. Falls apart when you have more than ~5 services because every deploy touches the same repo and branch protection gets weird.
Split repos (one for app, one for infra) – Clean ownership boundaries. The infra repo becomes the system of record. Cross-cutting changes require two PRs, which is annoying but forces you to think about coupling. This is what I recommend for most teams.
Per-app manifests in app repos – Each service owns its deployment config. Great for autonomy, terrible for platform-wide changes like “update the sidecar version on every service.” You will end up writing a bot to open 30 PRs.
For a team of 3-10 engineers, split repos with Kustomize overlays is the sweet spot. You can always migrate later – your manifests are just files in Git, after all.
What I got wrong
A few mistakes worth sharing:
I over-automated image updates. I set up Flux’s auto-image-update feature, which watches a registry and commits new tags automatically. Sounds great. In practice, it deployed a broken build at 2am because CI had a flaky test that passed on a bad commit. Explicit tag bumps via PRs are slower but safer. I turned off auto-updates within a week.
I didn’t enforce branch protection early enough. For the first month, anyone could push directly to master on the infra repo. One accidental force-push later, I added required reviews and status checks. Should have been day one.
I skipped the README. New engineers had no idea how to deploy. The process was obvious to me because I built it. It wasn’t obvious to anyone else. A one-page doc explaining “how to deploy a new version” saved hours of Slack questions.
Getting started
If you’re reading this and still deploying with kubectl from a laptop, here is what I would do tomorrow:
- Create an infra repo. Move your manifests there.
- Install Flux with fluxctl install. Point it at the repo.
- Enable branch protection on master. Require one review.
- Deploy something small through a PR. Watch it sync.
- Delete kubectl access for your CI system. If CI can’t touch the cluster, it won’t.
You can do steps 1-4 in an afternoon. Step 5 takes courage, but it’s the one that makes the whole thing stick.
GitOps isn’t exciting. It’s a Git repo, a controller, and a team agreement to stop running commands against production. The boring part is the point.