Two Years of Kubernetes in Production — The Boring Parts Are the Hard Parts

7 min read

kubernetes · containers · devops · infrastructure

Year two of running Kubernetes at the fintech startup. The panic is gone. Now it's networking, resource tuning, and all the operational grunt work nobody blogs about.

Last Tuesday at 3 AM, I got paged because a node drain took out our entire API tier. All three replicas. Same node. No PodDisruptionBudget. We’d been running Kubernetes in production for two years and I still hadn’t set one up for our most critical service. That’s the thing about year two. You think you know the platform. You don’t. You just know a different set of things than you did in year one.

I wrote about our first year with Kubernetes a year ago. The tone was “we survived and it was worth it.” That’s still true. But year two taught me that survival and maturity are very different things.

The networking tax

I said networking was hard last year. Understatement.

At the fintech startup we started with a basic CNI and no network policies. Fine for a while. Then we added more services, more teams started deploying, and suddenly every pod could talk to every other pod. No segmentation. No audit trail of who talks to what. Fixing this retroactively was brutal. You can’t just flip on network policies without mapping every legitimate connection first. We spent two weeks doing traffic analysis before we could write a single policy.
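Once the traffic map existed, the policies themselves were short. A minimal starting point looks something like this — default-deny ingress in a namespace, then explicitly allow each mapped connection (namespace, labels, and port here are illustrative, not our actual topology):

```yaml
# Default-deny all ingress for every pod in the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: payments        # illustrative namespace
spec:
  podSelector: {}            # empty selector = every pod in the namespace
  policyTypes:
    - Ingress
---
# Then allow only the connections you mapped during traffic analysis.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-payments
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payments
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api
      ports:
        - protocol: TCP
          port: 8080         # illustrative service port
```

The default-deny policy is what makes the allow rules meaningful — without it, every unmatched pod stays wide open.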

DNS was the other trap. Kube-dns worked great at our initial scale. Ten services, light traffic, no complaints. Then we added a data pipeline that hammered DNS with lookups on every request. Intermittent failures. Slow responses that looked like application bugs. Took us days to trace it back to DNS resolution under load. The fix was trivial — ndots configuration and a local DNS cache. The debugging wasn’t.
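The ndots part of the fix is a one-line pod spec change. Kubernetes defaults ndots to 5, which means a lookup for an external name fans out through every search domain before trying the name as-is. Lowering it per workload looks roughly like this (the workload name and exact value are illustrative):

```yaml
# Lower ndots so external lookups don't fan out through every search domain.
apiVersion: v1
kind: Pod
metadata:
  name: pipeline-worker          # hypothetical name for the data pipeline pod
spec:
  containers:
    - name: worker
      image: pipeline-worker:latest
  dnsConfig:
    options:
      - name: ndots
        value: "2"               # default is 5; tune per workload
```

The local DNS cache half of the fix is typically NodeLocal DNSCache or equivalent, deployed cluster-wide rather than per pod.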

Treat networking as its own project. First-class. With a budget. Not something you’ll “figure out later.”

Resource requests stopped being guesses

Year one, our resource requests were vibes. Somebody would eyeball a service, pick a number, ship it. We got away with it because we had plenty of headroom.

Year two, the cluster got denser. Services started competing for resources. OOMKills on one service, wasted capacity on another. The scheduler was doing exactly what we told it to — the problem was that we told it garbage.

We started profiling every service under realistic load in staging. Measured actual memory and CPU usage over a week. Then set requests based on the P95, not the average, not a guess. Kept memory limits close to requests. Dropped CPU limits entirely unless a service had a known runaway pattern.

```yaml
resources:
  requests:
    memory: "256Mi"   # P95 from a week of profiling under realistic load
    cpu: "100m"
  limits:
    memory: "512Mi"   # kept close to the request; no CPU limit on purpose
```
That YAML block looks boring. It’s also the difference between a stable cluster and a 3 AM page. The numbers matter. Profile them.

Stateful workloads: still nope (mostly)

We tried running Elasticsearch inside the cluster for about a month. Storage behavior was unpredictable. Backup and restore required custom tooling. Recovery after a node failure was slow and manual. We moved it back to managed infrastructure and haven’t looked back.

Postgres and Redis stayed outside the cluster all of year two. I know people run stateful workloads on Kubernetes successfully. I also know it requires an operational investment we weren’t ready to make. Managed services aren’t free — they just shift the cost from your on-call to your invoice. For us, that trade was obviously correct.

PodDisruptionBudgets: the lesson I learned the hard way

Back to that 3 AM page. A single node drain shouldn’t be able to take down a production service. That’s the whole point of running multiple replicas. But without a PodDisruptionBudget, Kubernetes will happily drain every pod from a node regardless of what service they belong to. If all your replicas happen to land on the same node — and the scheduler will do this if your anti-affinity rules are weak — one drain takes you to zero.

```yaml
apiVersion: policy/v1   # PDBs graduated from policy/v1beta1 in 1.21
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api
```

We now treat PDBs as part of the deployment contract. You don’t get to ship a production service without one. Period.
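The PDB stops a drain from evicting below minAvailable, but it doesn't fix the placement problem that put all three replicas on one node in the first place. For that we lean on topology spread constraints. A sketch of the relevant deployment fragment, using the same app: api label as the PDB (image and replica count are illustrative):

```yaml
# Spread replicas across nodes so no single drain can reach all of them.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                          # at most 1 replica imbalance
          topologyKey: kubernetes.io/hostname # spread across nodes
          whenUnsatisfiable: DoNotSchedule    # hard requirement, not a hint
          labelSelector:
            matchLabels:
              app: api
      containers:
        - name: api
          image: api:latest                   # illustrative image
```

DoNotSchedule makes the spread a hard constraint; ScheduleAnyway softens it to a preference if your cluster is too small to always satisfy it.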

Upgrades aren’t routine

Upgraded the cluster twice in year two. Both times were more work than expected.

Deprecations broke manifests. API versions we depended on were removed after what felt like a single minor release of warning. Node draining surfaced resource issues we’d been ignoring. Control plane changes rippled into unexpected places: one upgrade changed default admission controller behavior, and three deploys failed before we figured out why.

The process that worked: read every release note cover to cover. Rehearse in staging with production-like load. Not “run the tests” — actually push traffic through it. Schedule a real maintenance window. Don’t YOLO a cluster upgrade on a Friday afternoon. (We did this once. Never again.)

Multiple clusters helped. We moved to a setup where we could upgrade one cluster while the other kept serving. Longer to provision, but the blast radius dropped to near zero.

Git as the source of truth

This was already working in year one, but by year two it became non-negotiable culture. Every manifest in version control. Every deployment triggered by a merge. No kubectl apply from someone’s laptop. No exceptions.

The payoff isn’t just auditability. It’s rollbacks. Something breaks, you revert a commit. No guessing what changed. No “who ran what command in production.” The answer is always in the git log. This single practice eliminated more incidents than any other operational improvement we made.

RBAC and secrets: the wake-up call

Year one, everyone had cluster-admin. Year two, we finally locked it down. Created service accounts per application, scoped access narrowly, and spent an uncomfortable afternoon discovering how many things broke when we removed broad permissions. Every one of those breakages was a security hole we’d been ignoring.
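The shape of “scoped narrowly” is a per-app service account bound to a namespaced Role that lists only what the app actually touches. A sketch, with names, namespace, and verbs all illustrative:

```yaml
# One service account per application, in its own namespace.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: api
  namespace: production
---
# Grant only the resources and verbs the app actually uses at runtime.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: api-runtime
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: api-runtime
  namespace: production
subjects:
  - kind: ServiceAccount
    name: api
    namespace: production
roleRef:
  kind: Role
  name: api-runtime
  apiGroup: rbac.authorization.k8s.io
```

Starting from an empty rules list and adding verbs as things break is uncomfortable, but every addition is now a deliberate decision instead of an inherited default.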

Secrets were the other wake-up call. Base64 isn’t encryption. It’s encoding. Anyone with API read access could decode every secret in the cluster. We moved to pulling secrets from Vault at deploy time and encrypting etcd at rest. Should have done it in month one. Didn’t.
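For the etcd half, Kubernetes supports encryption at rest via an EncryptionConfiguration file passed to the API server. On managed offerings this is usually a platform setting rather than a file you manage, but the underlying shape looks like this (the key is a placeholder — generate a real 32-byte key, never commit one):

```yaml
# Passed to kube-apiserver via --encryption-provider-config.
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded-32-byte-key>   # placeholder, do not commit real keys
      - identity: {}   # fallback so secrets written before encryption stay readable
```

Provider order matters: the first provider encrypts new writes, and the identity fallback only lets the server read legacy plaintext secrets until they’re rewritten.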

Java, DNS, and autoscaling

Some workloads need special treatment. Our Java services ignored container memory limits by default and allocated memory like they owned the whole node. Took a few OOMKills to learn that the JVM needs explicit container-aware flags.
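On modern JVMs (8u191+ and 10+) container support is on by default; what we actually had to set was how much of the container’s memory the heap may claim. A container-spec fragment, with the service name and percentage illustrative:

```yaml
# Fragment of a pod spec for a Java service.
containers:
  - name: billing                  # hypothetical Java service
    image: billing:latest
    env:
      - name: JAVA_TOOL_OPTIONS    # picked up automatically by the JVM
        value: "-XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0"
    resources:
      requests:
        memory: "1Gi"
      limits:
        memory: "1Gi"              # the JVM now sizes its heap from this
```

Leaving headroom below 100% matters because the JVM needs memory beyond the heap: metaspace, thread stacks, and native buffers all count against the container limit.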

DNS bit us again here. Services making too many lookups created a bottleneck that looked like network latency. Horizontal pod autoscaling only worked correctly once our base resource requests were accurate — garbage in, garbage scaling decisions out.
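The HPA coupling is direct: utilization targets are computed relative to the pod’s resource request, so a garbage request skews every scaling decision. A minimal autoscaler for the api deployment, with target and bounds illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the CPU *request*, not the node
```

If the request is half what the service really uses, 70% utilization fires at twice the intended load, and the autoscaler spends its time chasing the wrong signal.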

What I’d tell someone starting today

Use managed Kubernetes. Don’t run your own control plane unless you have a very specific reason. The managed offerings are mature enough now and the time you save goes directly into product work.

Invest in developer experience from day one. The platform can be rock solid and still feel terrible if developers can’t get a fast local feedback loop. Minikube, namespaced dev environments, whatever works — budget time for it early.

Don’t touch service mesh until the basics are boring. I’ve seen teams adopt Istio before they had working health checks. Get networking, observability, and resource management right first. Then decide if you need the complexity.

Year three

The cluster is boring now. Boring in the good way. Deploys happen multiple times a day without drama. Incidents are rarer and smaller. The platform does what we need it to do.

But it took two full years of grinding — profiling resources, debugging DNS, writing network policies, learning the hard way about PDBs, locking down RBAC, building upgrade processes. None of that’s glamorous. None of it makes for a good conference talk. It’s just the work.

Kubernetes is a platform that rewards the teams willing to operate it seriously. Two years in, I’m convinced it’s the right foundation. I’m also convinced that most of the value comes from the boring operational work that nobody wants to do.