Quick take
Stop chasing the shiny stuff. Resource limits, probes, network policies, RBAC, and boring upgrades will prevent 90% of your Kubernetes pain. I learned this the hard way so you don’t have to.
Kubernetes won. The debate is over. But winning adoption doesn’t mean most teams are running it well. I’ve operated clusters across three startups now – a fintech startup, a mobility platform at Dropbyke, and currently building Decloud through EF – and the failure pattern is always the same: teams skip the fundamentals, bolt on complexity, then blame Kubernetes when things break.
This isn’t a “getting started” guide. It’s the checklist I actually use. Everything here exists because I’ve been paged at 3am for the opposite.
Upgrades Should Be Boring
If your Kubernetes upgrade process feels heroic, something is wrong. Stay on a supported minor version. Have a rollback plan. Run it on a schedule, not as a fire drill.
I keep a dead-simple rule: if we’re more than one minor version behind, it’s a P1. Not because something is broken, but because something will break and we’ll be debugging with stale docs and missing patches.
Resource Requests and Limits. On Everything.
This is the single most impactful thing you can do. No exceptions.
```yaml
resources:
  requests:
    cpu: "250m"
    memory: "512Mi"
  limits:
    cpu: "1000m"
    memory: "1Gi"
```
Requests are your scheduling guarantee. Limits are your safety net. Without both, you’re one misbehaving pod away from cascading node pressure. I’ve seen a single Go service with no memory limit take down a shared node at the fintech startup because someone forgot to close a channel and goroutines piled up. That was a fun Friday evening.
Size these from actual metrics, not guesswork. The metrics server and Prometheus give you enough signal. Start conservative, then adjust.
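For example (the metric names are the standard cAdvisor ones exposed via Prometheus; the namespace is illustrative):

```shell
# Current usage per container (requires metrics-server):
kubectl top pod --containers -n production

# PromQL starting points for sizing:
# p95 memory over a week:
#   quantile_over_time(0.95, container_memory_working_set_bytes{namespace="production"}[7d])
# CPU usage averaged over 5m windows:
#   rate(container_cpu_usage_seconds_total{namespace="production"}[5m])
```

Set requests near the p95 and limits with enough headroom that a legitimate spike doesn't get throttled or OOM-killed.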
Namespaces With Actual Guardrails
Namespaces without resource quotas are just labels. Enforce boundaries at the namespace level so one team’s experiment can’t starve another team’s production workload.
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
```
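One gotcha: once a quota covers requests and limits, pods that omit them are rejected outright. A LimitRange in the same namespace supplies defaults so that doesn't become a deploy blocker (the numbers below are illustrative):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: production-defaults
  namespace: production
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 100m
      memory: 128Mi
    default:
      cpu: 500m
      memory: 512Mi
```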
Probes Are Mandatory
Not optional. Not “we’ll add them later.” Mandatory.
Liveness probes restart broken processes. Readiness probes keep bad pods out of your Service endpoints so they stop receiving traffic. Without them, you’re routing requests to containers that are technically running but functionally dead.
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
```
One thing I always add: make /healthz and /ready do different things. Your liveness check should be “is the process alive.” Your readiness check should be “can this instance actually serve traffic right now” – database connected, caches warm, dependencies reachable.
Handle Shutdown Properly
Your container will get SIGTERM’d. Handle it. Stop accepting new work, drain in-flight requests, exit clean. Set terminationGracePeriodSeconds to match your actual drain time, not the default 30 seconds.
In Go (which is what I write most services in), this means catching the signal and giving your HTTP server a context-bounded shutdown. It’s ten lines of code. There’s no excuse.
Secrets Deserve Respect
Two rules:
- Mount secrets as files, not environment variables. Env vars leak into crash dumps, child processes, and logging. Files don’t.
- Enable encryption at rest in etcd. If you’re on a managed provider, verify this is actually on. Don’t assume.
For anything beyond toy projects, use an external secret manager. Kubernetes native secrets are base64-encoded, not encrypted. Big difference.
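As a pod-spec sketch of the file-mount approach (the names are illustrative):

```yaml
volumes:
- name: api-credentials
  secret:
    secretName: api-credentials
containers:
- name: app
  image: registry.example.com/app:1.2.3   # hypothetical image
  volumeMounts:
  - name: api-credentials
    mountPath: /etc/secrets
    readOnly: true
```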
Default-Deny Networking
Without network policies, every pod can talk to every other pod. That’s not a cluster, it’s a flat network with extra steps.
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
```
Start with deny-all, then explicitly allow what’s needed. Yes, it’s more work upfront. But “every service can reach every other service plus the internet” isn’t a security posture I’m comfortable with, and you shouldn’t be either.
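An example of an allow rule layered on top of deny-all: let only the API pods reach Postgres (the labels and port are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-db
spec:
  podSelector:
    matchLabels:
      app: db
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: api
    ports:
    - protocol: TCP
      port: 5432
```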
My order of priorities is always the same: security, stability, performance. In that order. Network policies fall squarely in bucket one.
RBAC and Pod Security
Run as non-root. Drop capabilities. Use read-only root filesystems where you can. Enforce these with Pod Security Admission (PodSecurityPolicy was deprecated and removed in 1.25) so it’s not just a suggestion.
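In pod-spec form, assuming an image that can run as an arbitrary non-root UID (the UID and image are illustrative):

```yaml
securityContext:
  runAsNonRoot: true
  runAsUser: 10001            # any non-zero UID your image supports
containers:
- name: app
  image: registry.example.com/app:1.2.3   # hypothetical image
  securityContext:
    readOnlyRootFilesystem: true
    allowPrivilegeEscalation: false
    capabilities:
      drop: ["ALL"]
```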
For RBAC: if any service account has cluster-admin, fix that first. Service accounts get the verbs and resources they need. Nothing more.
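A minimal least-privilege sketch, granting one service account read access to ConfigMaps in one namespace (all names are illustrative):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-reader
  namespace: production
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-reader-binding
  namespace: production
subjects:
- kind: ServiceAccount
  name: app
  namespace: production
roleRef:
  kind: Role
  name: app-reader
  apiGroup: rbac.authorization.k8s.io
```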
Roll Out Safely
Use rolling updates with sensible maxSurge and maxUnavailable settings. Add Pod Disruption Budgets so node drains and maintenance don’t accidentally kill your service. Spread replicas across nodes and zones with anti-affinity or topology spread rules.
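A sketch of the disruption-budget half, plus a pod-spec fragment for zone spreading (labels and counts are illustrative; topologySpreadConstraints covers the same ground as anti-affinity for this purpose):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2          # node drains must leave at least 2 replicas running
  selector:
    matchLabels:
      app: app
---
# In the Deployment's pod template spec:
# topologySpreadConstraints:
# - maxSkew: 1
#   topologyKey: topology.kubernetes.io/zone
#   whenUnsatisfiable: ScheduleAnyway
#   labelSelector:
#     matchLabels:
#       app: app
```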
None of this is glamorous. All of it prevents 2am pages.
Observability That Actually Helps
Three things:
Structured logs. JSON, consistent fields, correlation IDs. If your logs are unstructured strings, your incident response is grep and prayer.
Application metrics tied to SLOs. Not vanity dashboards – actual service-level objectives that tell you whether users are happy. Expose them with Prometheus, build dashboards that answer questions, not decorate walls.
Consistent labels. Every resource gets app, team, version, environment. This isn’t bureaucracy. It’s what makes cost allocation, debugging, and incident response possible at any scale beyond one person.
Back Up etcd. Test the Restore.
A backup you’ve never restored is a hope, not a backup. Schedule restores. Verify them. The five minutes this takes will save you from the worst day of your career.
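With etcdctl, the save-and-rehearse loop looks roughly like this (the cert paths shown are the usual kubeadm locations; adjust for your setup):

```shell
# Take a snapshot:
ETCDCTL_API=3 etcdctl snapshot save /backups/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Rehearse the restore into a scratch data dir (etcdutl, etcd >= 3.5),
# then point a test etcd at it and verify the data is actually there:
etcdutl snapshot restore /backups/etcd-snapshot.db --data-dir /var/lib/etcd-restore
```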
The theme here is boring. Boring upgrades, boring resource limits, boring network policies, boring RBAC. The teams I’ve seen run Kubernetes well aren’t doing anything clever. They’re doing the basics consistently and spending their creativity on the product instead of firefighting infrastructure.
That’s the whole point.