Operators are having a moment. Every Kubernetes talk, every blog post, every Slack channel — someone’s pitching operators as the answer to all your stateful workload problems. And look, they’re a genuinely clever pattern. But I think the community is overselling them, and most teams jumping on the bandwagon are setting themselves up for pain.
At the fintech startup, we run our infrastructure on Kubernetes. We’ve evaluated operators, built internal tooling around them, and dealt with the sharp edges firsthand. So here’s my honest take.
What Operators Actually Are
Dead simple concept. An operator is a controller that watches a custom resource, compares desired state to actual state, and reconciles the difference. Same pattern as built-in K8s controllers, just applied to your specific application.
You define a custom resource (an instance of your CRD) like this:
apiVersion: example.com/v1
kind: Database
metadata:
  name: orders-db
spec:
  engine: postgres
  version: "10"
  replicas: 3
The operator sees that, creates the StatefulSets, Services, ConfigMaps, backup jobs — whatever the system needs. Intent-driven. Declarative. Very Kubernetes-native.
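To make the reconcile idea concrete, here’s a minimal sketch in plain Python. It is not a real operator: the "cluster" is just a dict, and `DatabaseSpec`, `desired_children`, `reconcile`, and `apply` are names I made up for illustration. Real operators are usually written in Go against the Kubernetes API, and would only delete children they own. The shape of the loop is the point: compute desired state from the spec, diff against actual, emit actions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseSpec:
    """Mirrors the spec of the Database resource above (hypothetical type)."""
    name: str
    engine: str
    version: str
    replicas: int


def desired_children(spec: DatabaseSpec) -> dict:
    """Compute the full set of child objects from the spec, every time."""
    return {
        ("StatefulSet", spec.name): {
            "image": f"{spec.engine}:{spec.version}",
            "replicas": spec.replicas,
        },
        # 5432 is the default PostgreSQL port.
        ("Service", spec.name): {"port": 5432},
    }


def reconcile(spec: DatabaseSpec, actual: dict) -> list:
    """Return the actions needed to move `actual` toward the desired state."""
    desired = desired_children(spec)
    actions = []
    for key, want in desired.items():
        if key not in actual:
            actions.append(("create", key, want))
        elif actual[key] != want:
            actions.append(("update", key, want))
    for key in actual:
        if key not in desired:
            actions.append(("delete", key, None))
    return actions


def apply(actions: list, actual: dict) -> None:
    """Apply actions to the in-memory stand-in for the cluster."""
    for verb, key, want in actions:
        if verb == "delete":
            actual.pop(key)
        else:
            actual[key] = want


if __name__ == "__main__":
    cluster = {}
    spec = DatabaseSpec("orders-db", "postgres", "10", 3)
    first_pass = reconcile(spec, cluster)
    apply(first_pass, cluster)
    print(first_pass)                # two create actions
    print(reconcile(spec, cluster))  # empty list: nothing left to do
```

Running `reconcile` a second time against an unchanged cluster returns no actions, and hand-editing the replica count in the dict produces a single update on the next pass. That is drift correction in miniature.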
The real value? Day 2 operations. Scaling, rolling upgrades, backups, failover, drift correction. The stuff that happens after you deploy. That’s where operators shine and where manual runbooks fall apart at 3am.
Where the Hype Breaks Down
Here’s what nobody giving those conference talks says: writing a good operator is hard. Writing a bad one is easy. And a bad operator running with cluster-level RBAC is a liability.
I’ve seen teams spend weeks building custom operators for problems that a well-written Helm chart and a cron job would have solved. The pattern is seductive — “we’ll encode all our operational knowledge into code!” — but the implementation details are brutal.
Your reconciliation loop has to be idempotent. Truly idempotent. Not “works most of the time” idempotent. You need proper owner references for garbage collection. You need finalizers if you touch anything outside the cluster. You need status conditions so someone debugging at 2am can actually see what’s going on. Most first-attempt operators skip half of this.
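Here’s roughly what the finalizer and status-condition parts look like, again as a plain-Python sketch over dicts rather than a real API client. The finalizer name, the `external` set (standing in for something outside the cluster, say a backup bucket), and the `Provisioned` reason are all assumptions for illustration. Real conditions also carry fields like `lastTransitionTime`, which I’ve left out to keep this short.

```python
FINALIZER = "example.com/external-cleanup"  # hypothetical finalizer name


def set_condition(obj: dict, ctype: str, status: str, reason: str, message: str) -> None:
    """Upsert a condition by type so repeated calls never create duplicates."""
    conditions = obj.setdefault("status", {}).setdefault("conditions", [])
    for cond in conditions:
        if cond["type"] == ctype:
            cond.update(status=status, reason=reason, message=message)
            return
    conditions.append(
        {"type": ctype, "status": status, "reason": reason, "message": message}
    )


def reconcile(obj: dict, external: set) -> None:
    """One reconcile pass; safe to run any number of times."""
    meta = obj["metadata"]
    finalizers = meta.setdefault("finalizers", [])
    name = meta["name"]

    if meta.get("deletionTimestamp"):
        # Object is being deleted: clean up outside the cluster first,
        # then remove the finalizer so deletion can complete.
        external.discard(name)  # no error if it is already gone
        if FINALIZER in finalizers:
            finalizers.remove(FINALIZER)
        return

    # Register the finalizer before creating anything external,
    # so a delete can never strand the external resource.
    if FINALIZER not in finalizers:
        finalizers.append(FINALIZER)

    external.add(name)  # "create if missing" semantics
    set_condition(obj, "Ready", "True", "Provisioned", "external resource exists")
```

Call `reconcile` twice in a row and you still get exactly one finalizer, one external resource, and one `Ready` condition. Set `deletionTimestamp` and the next pass cleans up and releases the object. That "twice in a row changes nothing" property is what I mean by truly idempotent.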
Build vs. Adopt (Usually Adopt)
At the fintech startup, we evaluated building custom operators early on. For most of our use cases, adopting existing ones was the right call. The etcd operator and the Prometheus operator have hundreds of contributors and plenty of production hours behind them. Our custom needs weren’t special enough to justify the maintenance burden.
Build your own only if your system is genuinely proprietary, your operational rules can’t be expressed with existing operators, or you need tighter control than third-party code allows. Otherwise? Use what exists. Seriously.
The Stuff That Actually Matters
If you do write an operator, a few things will save you:
- Idempotent reconciliation. I can’t stress this enough. Compute desired state from the CR every single time. No “run once” actions without guards.
- Owner references. Let K8s garbage collection do its job.
- Status fields. If kubectl describe can’t tell me what’s broken, your operator is incomplete.
- Least privilege RBAC. Operators often end up with way too many permissions. Start minimal.
- Conservative defaults. Easier to loosen limits than to recover from an operator that went aggressive on scaling.
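A few of the items above fit in a handful of lines each. This is a hedged sketch, not a recipe: the function names, the `initialBackupDone` status field, and the one-replica-per-pass limit are all hypothetical. The `ownerReferences` field names are the ones Kubernetes uses.

```python
def owner_reference(owner: dict) -> dict:
    """Build an ownerReferences entry so garbage collection removes the child
    when the owning custom resource is deleted."""
    return {
        "apiVersion": owner["apiVersion"],
        "kind": owner["kind"],
        "name": owner["metadata"]["name"],
        "uid": owner["metadata"]["uid"],
        "controller": True,
        "blockOwnerDeletion": True,
    }


MAX_STEP = 1  # hypothetical conservative default: at most one replica per pass


def next_replicas(current: int, wanted: int, max_step: int = MAX_STEP) -> int:
    """Move toward `wanted`, but never by more than `max_step` in one pass."""
    delta = wanted - current
    step = max(-max_step, min(max_step, delta))
    return current + step


def ensure_initial_backup(cr: dict, run_backup) -> bool:
    """Guard a one-off action: run it only if status does not already record it.

    Returns True if the backup ran during this call.
    """
    status = cr.setdefault("status", {})
    if status.get("initialBackupDone"):  # hypothetical status field
        return False
    run_backup()
    status["initialBackupDone"] = True
    return True
```

With these, a jump from 3 to 10 replicas takes several reconcile passes instead of one, and the initial backup runs once no matter how often the loop fires. Loosening `MAX_STEP` later is a one-line change. Recovering from an operator that scaled too hard is not.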
Test against a real cluster. Unit tests alone won’t catch the weird timing issues that only show up when the API server is under load.
My Honest Assessment
Operators are a powerful pattern. They’re the right abstraction for complex stateful workloads on Kubernetes. But they’re not magic, and they’re not always the right tool. Most teams I talk to would be better served by adopting a mature community operator than by building one from scratch.
The pattern will mature. The tooling (Operator SDK, Kubebuilder) is getting better. But right now, in 2018, approach with clear eyes about what you’re signing up for.