We moved our first workload to Kubernetes about a year ago at the fintech startup where I work. The pitch was simple: declarative deployments, self-healing services, no more SSH-into-prod-and-pray. A year later, I can confirm all of that’s true. I can also confirm that the path there involved a DNS outage at 2am, a deployment that silently ate 4GB of RAM, and me reading the kube-proxy source code on a Saturday.
Worth it? Yes. Painless? Not even close.
The stuff that pays off immediately
The declarative model is the real win. You describe what you want, Kubernetes converges toward it. No more runbooks for “what if the service dies on box 3.” It just comes back. Deployments get boring, which is exactly what you want deployments to be.
Resource requests and limits are the other quick win, but only once you actually set them properly. We ran for weeks without limits on one of our data ingestion services. It worked great until it didn’t, and then it took down two neighbors on the same node.
Don’t skip this part:
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
Those numbers are a starting point. Profile under real load. Revisit quarterly. The defaults are lies.
Rolling updates deserve a mention too. Going from “schedule maintenance window, hold breath, deploy, pray” to incremental rollouts with automatic rollback changed how the whole team thinks about shipping. People deploy more often because it feels safe. That feedback loop alone justified the migration.
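The rollout behavior itself is tunable per Deployment. A minimal sketch of the strategy we lean on (names, image, and numbers are illustrative, not our production config):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api              # illustrative name
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                 # allow one extra pod during the rollout
      maxUnavailable: 0           # never drop below desired capacity
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: api
          image: registry.example.com/payments-api:1.4.2   # illustrative
          readinessProbe:         # gates traffic so a bad pod never serves
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
```

With maxUnavailable set to 0, a rollout that fails its readiness probe simply stalls instead of degrading capacity, and kubectl rollout undo walks you back to the previous revision.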
Where it bites you
Networking. Every single time.
Service discovery looks dead simple in the docs. Then you hit DNS caching issues, or a network policy blocks something you didn’t expect, or your ingress controller does something creative with header forwarding. Kubernetes hides the network behind a nice abstraction until that abstraction leaks, and then you’re reading iptables rules at midnight.
We had a bug at the fintech startup where intermittent 503s on one service turned out to be a kube-dns cache TTL mismatch with our upstream provider. Took three days to find. The fix was two lines of config. That’s Kubernetes networking in a nutshell.
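For what it’s worth, Kubernetes does expose pod-level DNS knobs, so resolver behavior can be declared instead of inherited. A hedged sketch (values illustrative, not our actual fix):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ingest-worker             # illustrative
spec:
  dnsPolicy: ClusterFirst
  dnsConfig:
    options:
      - name: ndots               # dots required before a name is tried as absolute
        value: "2"                # the default of 5 triggers extra search-domain lookups
      - name: timeout             # seconds before retrying the next nameserver
        value: "2"
  containers:
    - name: worker
      image: registry.example.com/ingest:latest   # illustrative
```

The cache TTL itself lives on the cluster DNS side (CoreDNS’s cache plugin, for example), not in the pod spec, which is part of why these bugs take three days to find.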
Resource tuning is the slow burn. Set requests too low and the scheduler packs your nodes like a clown car — then starts evicting pods when things get tight. Set limits too low and your service gets CPU-throttled under load, which looks exactly like an application bug from the outside. The only way through is profiling under production traffic patterns and adjusting. There’s no shortcut here.
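One lever worth knowing while you tune: eviction order tracks the QoS class, which falls out of how requests and limits relate. Setting requests equal to limits for every resource puts a pod in the Guaranteed class, last in line for eviction. A sketch (numbers illustrative):

```yaml
# Guaranteed QoS: requests == limits for both CPU and memory.
# BestEffort pods (no requests) are evicted first, Burstable next,
# Guaranteed last.
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
```

Some teams go the other way and drop CPU limits entirely to avoid throttling while keeping a memory limit as the hard backstop. Either way, it’s a deliberate choice, not a default.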
Stateful workloads are possible but honestly, we kept Postgres and Redis outside the cluster for most of the year. Storage classes, PV lifecycle, backup and restore — it all works, but the operational surface area is large. If you’re just getting started, keep your databases where they are. Move them in later when you actually understand what you’re signing up for.
Debugging gets weird. Something fails and the cause could be the application code, the container image, the scheduler, the network plugin, the node itself, or some combination. I’ve seen a deployment fail because a node had a full disk from container image garbage collection not running. Good luck finding that without decent logging and tracing. kubectl describe is your best friend, but sometimes your best friend doesn’t know the answer either.
The hard lessons
Treating Kubernetes like a black box works until you need to upgrade the cluster. Or recover from an etcd failure. Or figure out why the scheduler won’t place a pod. Someone on your team needs to actually understand the control plane. Not “watched a conference talk” understand. “Can read the API server logs and make sense of them” understand.
Running your own control plane is expensive. Not in money — in attention. If your cloud provider offers managed Kubernetes, take the deal. We spent months on control plane operations that could have been spent on product work. The managed offerings are solid enough now. Pay the tax, move on.
YAML sprawl is the death of a thousand paper cuts. You start with clean manifests for three services. Six months later you have forty files that are 80% identical and nobody remembers which copy is canonical. Helm or Kustomize help, but be careful — I’ve seen teams replace YAML sprawl with template sprawl, which is somehow worse because now you need to understand Go templates to deploy a config change.
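If you go the Kustomize route, the payoff is that overlays stay plain YAML instead of templates: one canonical base, small per-environment patches. A minimal sketch (paths, names, and tags illustrative):

```yaml
# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                    # the canonical manifests live once, in base/
patches:
  - path: replica-count.yaml      # production-only tweaks as small patch files
images:
  - name: registry.example.com/payments-api
    newTag: "1.4.2"               # pin the image per environment
```

The base answers “which copy is canonical” by construction; everything else is an explicit diff against it.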
Secrets handling is genuinely bad out of the box. Kubernetes Secrets are base64-encoded. That’s encoding, not encryption. Anyone with API access can read them. We ended up pulling secrets from Vault at deploy time and treating the built-in Secrets as a transport layer. If you’re storing database passwords as Kubernetes Secrets and calling it done, stop.
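If “encoding, not encryption” sounds abstract, here is the entire “protection” a Secret value gets, reproduced in a few lines of Python (the password is made up):

```python
import base64

# What the API server stores for a Secret value: plain base64,
# reversible by anyone who can read the object.
stored = base64.b64encode(b"s3cr3t-db-password").decode()
print(stored)                               # what an API client sees

# "Decrypting" it is one standard-library call.
print(base64.b64decode(stored).decode())    # back to the plaintext password
```

No key, no passphrase, no secret material involved. That round trip is why RBAC on Secret objects, encryption at rest for etcd, or an external store like Vault is the actual security boundary.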
Local development is the thing nobody budgets time for. Your cluster isn’t a laptop. Developers need fast feedback without blowing up shared environments. Minikube and isolated namespaces help, but they need investment. We lost a solid week of productivity before we built a decent local dev story.
Habits that actually stuck
GitOps, before we called it that. Every manifest in version control. Deployments triggered by merges. No kubectl apply from someone’s laptop in production. This gives you an audit trail, makes rollbacks trivial, and kills snowflake configurations dead.
Health checks as production code. A bad liveness probe will kill your service faster than a bad deploy. We had a probe that checked a database connection with a 1-second timeout. Database had a slow moment, probe failed, Kubernetes restarted the pod, pod came up and hit the still-slow database, probe failed again. Restart loop. Cascading failure from a health check. Write your probes carefully.
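The shape we landed on, roughly: liveness only asks “is this process wedged,” never touches dependencies, and gets generous thresholds; dependency checks belong in readiness, where a failure just sheds traffic instead of killing the pod. A sketch (endpoint names and numbers are illustrative):

```yaml
livenessProbe:
  httpGet:
    path: /livez          # process-only check, no database call
    port: 8080
  timeoutSeconds: 5       # not 1 second - a slow moment isn't death
  periodSeconds: 10
  failureThreshold: 3     # three consecutive misses before a restart
readinessProbe:
  httpGet:
    path: /readyz         # may check dependencies; failure removes the
    port: 8080            # pod from the Service, but doesn't restart it
  timeoutSeconds: 3
  periodSeconds: 5
```

The probe in our incident was effectively a liveness check on the database. Splitting the two is what breaks the restart loop.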
Kubernetes-aware monitoring. Watching application metrics is necessary but not sufficient. You need to see what the scheduler is doing, how the nodes look, what the control plane health is. Without that, incidents are guesswork. With it, incidents become analysis.
Should you adopt it?
If you have multiple services, deploy frequently, and need real scheduling and resilience — yes. The investment pays off.
If you have a small team, a monolith, or two services behind a load balancer — probably not yet. Kubernetes doesn’t remove operational complexity. It trades one kind for another, and the new kind requires specific expertise. A team that’s struggling with basic deployments won’t be saved by Kubernetes. They’ll just struggle with more abstraction layers.
A year later
Kubernetes feels less like magic now and more like a system I can reason about. That’s the real milestone. Not “it works” but “I understand why it works, and I understand why it breaks.”
The teams I’ve seen succeed treat Kubernetes as a product they operate — with on-call, with runbooks, with regular investment. The teams that struggle treat it like something they installed once and forgot about. Same as any other infrastructure, really. The tool doesn’t care about your intentions. It cares about your configuration.