You Probably Don't Need a Service Mesh

5 min read
service-mesh istio linkerd kubernetes

Service meshes solve real problems at real scale. But most teams adopt them before the problems exist. Here's how to decide honestly.

Quick take

A service mesh is worth it when you have dozens of services, multiple teams, and genuine mTLS/observability requirements. For everyone else, it’s operational overhead disguised as progress. Use simpler tools until the pain is real.

I spent part of last year at a large consumer platform, working on platform infrastructure. They had the scale where a service mesh made sense – hundreds of services, multiple teams shipping independently, regulatory requirements around service-to-service encryption. The mesh was earning its keep.

But here is the thing: the operational burden was substantial. Upgrading the mesh was a coordinated effort. Debugging failures meant understanding whether the issue was in the application, the sidecar, or the control plane. The platform team had dedicated people just for mesh operations. A company at that scale could afford that investment because the alternative – inconsistent security and observability across hundreds of services – was worse.

Most teams I’ve worked with aren’t at that scale. They have five to fifteen services, one or two teams, and a single language stack. When they ask me about service meshes, the honest answer is usually: not yet.

What a mesh actually gives you

A service mesh sits between your services and handles the cross-cutting concerns that are painful to implement consistently in application code:

  • Mutual TLS between services, with automatic certificate rotation and identity.
  • Traffic management – retries, timeouts, circuit breakers, canary deployments, traffic splitting.
  • Observability – consistent metrics, traces, and access logs for every service-to-service call without application changes.
  • Policy enforcement – authorization rules applied at the infrastructure level.

The mechanism is a sidecar proxy (usually Envoy) injected next to every service instance, managed by a control plane that distributes configuration and certificates.

What a mesh actually costs you

This is the part the sales pitch skips.

Operational complexity. You’re running a distributed system (the mesh) to manage your distributed system (the application). The control plane needs monitoring, upgrades, and capacity planning. The data plane proxies add latency and memory overhead to every pod. When something breaks, you need to determine whether the failure is in your code, the proxy, or the control plane configuration.

Resource overhead. Each sidecar consumes CPU and memory. Multiply that by every pod in your cluster. At the consumer platform I mentioned, sidecar overhead was a visible line item in the infrastructure budget. For smaller teams, it can be a surprising cost increase for marginal benefit.

Debugging complexity. I’ve watched engineers spend hours debugging connection failures that turned out to be a misconfigured VirtualService in Istio. The mesh adds a layer of indirection between your service and the network. When that layer misbehaves, the symptoms look like application bugs.

Upgrade coordination. Data plane and control plane versions need to be compatible. Upgrading means rolling out new sidecars across the cluster while keeping traffic flowing. It’s doable, but it isn’t trivial.

When it’s the right call

A service mesh earns its complexity when:

  • You have dozens of services owned by multiple teams who ship independently. Consistent behavior across all those services is genuinely hard to achieve with libraries alone.
  • You need mTLS everywhere with centralized certificate management and auditability. Doing this per-service is painful and error-prone.
  • You need advanced traffic management – canary rollouts, traffic mirroring, fault injection for testing. These are hard to build from scratch.
  • You have a polyglot stack. If services are written in Go, Java, Python, and Node, a mesh gives you consistent behavior regardless of language.
  • Regulatory compliance requires uniform controls and audit trails across all service communication.

When it’s the wrong call

  • You have fewer than 20 services with simple traffic patterns. The mesh overhead isn’t justified.
  • Your stack is one or two languages. Well-maintained libraries for retries, timeouts, and circuit breaking (timeouts come largely free with Go's standard library and context; retries and circuit breaking from a shared middleware) are simpler and cheaper.
  • Your team is small and already struggling to ship features. A mesh will slow them down further.
  • Your architecture is still evolving. If service boundaries aren’t stable, adding infrastructure that assumes stable boundaries will create friction.
  • You can meet your requirements with simpler tools: ingress controllers for external traffic, OpenTelemetry for observability, cert-manager for TLS.

The honest decision framework

Ask these five questions:

  1. Do I have a specific problem that a mesh solves, or am I attracted to the concept?
  2. Can I solve this problem with a library, a middleware, or an existing tool?
  3. Can my team support another control plane in production?
  4. Is the cost (CPU, memory, operational toil) justified by the benefit?
  5. Are my service boundaries stable enough to build infrastructure around?

If the answer to question 1 is “attracted to the concept,” stop there. If the answer to question 3 is “no,” stop there.

The options if you do proceed

Istio: Feature-rich, widely adopted, complex to operate. The default choice for teams that want everything. Be prepared for a learning curve.

Linkerd: Lighter, simpler, Rust-based data plane. Fewer features than Istio, but significantly easier to operate. I generally recommend this for teams adopting their first mesh.

Consul Connect: Integrates with HashiCorp Consul for service discovery. Works outside Kubernetes, which is useful if you have hybrid infrastructure.

AWS App Mesh / GCP Traffic Director: Managed options that reduce some operational burden at the cost of cloud lock-in.

What to do instead

If you decided against a mesh, you can still solve most of the problems it addresses:

  • TLS: cert-manager with Let’s Encrypt, or SPIFFE/SPIRE for service identity.
  • Observability: OpenTelemetry instrumentation with Prometheus and Jaeger. You write more code, but you own the pipeline.
  • Traffic control: Ingress controllers handle north-south traffic. For east-west, library-level retries and timeouts work fine at modest scale.
  • Circuit breaking: Language-specific libraries. Go has solid options (sony/gobreaker is widely used); Java has resilience4j.

The pragmatic path is to start with these simpler tools and adopt a mesh when the pain of inconsistency genuinely exceeds the cost of the mesh. For most teams, that day is further away than the conference talks suggest.