Once you have more than a handful of services, SSH-and-grep stops working. A single user request at Dropbyke touches the mobile API, the fleet service, the payment layer, and at least two background workers. When something breaks, I need to search one place for all the related events. That isn’t optional. That’s the baseline.
So we set up ELK. Elasticsearch, Logstash, Kibana. The pitch was compelling: open source, flexible, great full-text search, a plugin for everything. We stood up a three-node cluster, pointed Logstash at it, gave the team Kibana dashboards. For the first few weeks, it felt like a superpower.
Then the cluster started misbehaving. And I spent the next several months learning a painful lesson about the gap between “powerful” and “worth the operational cost.”
If you don’t have someone who wants to babysit Elasticsearch full time, don’t run ELK yourself. Hosted Elasticsearch, Graylog, or even Splunk will save you more engineering hours than they cost.
The Operational Tax Nobody Warns You About
Elasticsearch isn’t a database you deploy and forget. It’s a distributed system that demands constant attention. Shard allocation, index lifecycle, JVM heap tuning, split-brain prevention, disk watermarks. Every one of these will bite you, and they will bite you at 3 AM.
At Dropbyke, our log volume was moderate. Maybe a few gigabytes a day. Nothing that should stress a three-node cluster. But Elasticsearch doesn’t care about your expectations. It cares about index design, merge policies, and whether you remembered to set bootstrap.memory_lock. We spent more time keeping the logging infrastructure healthy than we spent on some of our actual product services.
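To make that concrete, here is a sketch of the handful of elasticsearch.yml settings from that list that actually caused us pain. The values are illustrative, not a recommendation for your cluster:

```yaml
# elasticsearch.yml — illustrative values for a three-node cluster
bootstrap.memory_lock: true                      # keep the JVM heap out of swap
discovery.zen.minimum_master_nodes: 2            # (nodes / 2) + 1, guards against split brain
cluster.routing.allocation.disk.watermark.low: "85%"   # stop allocating new shards to a filling disk
cluster.routing.allocation.disk.watermark.high: "90%"  # start relocating shards away from it
```

Each of these defaults to something reasonable-looking that fails under pressure, which is exactly the point: you only learn the right values by getting paged.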
The worst part is that when your logging system goes down, you lose visibility into everything else at the same time. Your safety net disappears exactly when you need it most.
What ELK Actually Gets Right
I’m not going to pretend it’s all bad. Elasticsearch’s search is genuinely excellent. When the cluster is healthy, the ability to run arbitrary queries across millions of log lines with sub-second response times is hard to match. Logstash can parse almost any log format into structured fields. Kibana dashboards gave our product team visibility they never had before.
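As an example of that parsing flexibility, here is a sketch of a Logstash filter block for plain text lines that start with an ISO timestamp and a log level. The pattern names are standard grok patterns; the field names are my own:

```
filter {
  grok {
    # e.g. "2016-09-05T10:23:45Z ERROR Payment processing failed"
    match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:msg}" }
  }
  date {
    # use the parsed timestamp as the event time instead of ingestion time
    match => [ "timestamp", "ISO8601" ]
  }
}
```

A couple of grok patterns like this turn an unstructured text stream into fields you can filter and aggregate on in Kibana.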
The ecosystem is real. Beats shippers are lightweight, community plugins cover most integrations, and the documentation is solid. If you have the operational muscle, ELK is the most flexible open source logging stack available.
What I’d Do Differently
If I were starting over, I wouldn’t self-host Elasticsearch for logging. Full stop.
Hosted Elasticsearch removes the worst of the operational burden. You keep the same query model, the same Kibana dashboards, the same integrations. You lose some control over cluster configuration and you pay more per gigabyte, but you gain back the engineering hours you were burning on cluster babysitting. For most teams, that tradeoff is obvious.
Graylog is worth a look if you want Elasticsearch search without the full DIY build. It wraps Elasticsearch in a more opinionated log management experience with built-in alerting and stream routing. Less flexible than raw ELK, but faster to get running and easier to keep running.
Splunk is the enterprise answer. Powerful, mature, battle-tested. Also expensive enough to make your finance team flinch. If budget isn’t the constraint, Splunk is a safe bet. For a startup, it rarely makes sense.
Cloud provider logging is the lowest-effort option. AWS CloudWatch Logs, for example, integrates deeply with everything else in AWS and requires zero operational overhead. The query capabilities are basic compared to Elasticsearch, but basic is often enough. You can always export to something more powerful later.
Structured Logging Is the Real Win
Regardless of which aggregation tool you pick, the single best investment is structured logging. A JSON log line with consistent fields turns debugging from archaeology into search.
```json
{
  "timestamp": "2016-09-05T10:23:45Z",
  "level": "error",
  "service": "payment-api",
  "trace_id": "abc123",
  "message": "Payment processing failed",
  "error": "Connection timeout",
  "customer_id": "cust_456"
}
```
Carry a trace ID through every service. Use consistent field names. Emit JSON instead of free-form text. These decisions pay off no matter what sits behind the ingestion pipeline. If you do nothing else, do this.
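A minimal sketch of what this looks like in practice, using Python's standard logging module. The service name and field set are assumptions for illustration; the point is that every field name is fixed and every line is machine-parseable:

```python
import json
import logging
import uuid
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line with consistent field names."""
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "service": "payment-api",  # hypothetical service name
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)

logger = logging.getLogger("payment-api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Generate the trace ID once at the edge, then pass it to every
# downstream service so a single search finds the whole request.
trace_id = uuid.uuid4().hex
logger.error("Payment processing failed", extra={"trace_id": trace_id})
```

The `extra` mechanism is how stdlib logging attaches custom fields like `trace_id` to a record; a real setup would pull the ID from an incoming request header rather than generating it locally.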
Put a Buffer in Front of Ingestion
One more thing I learned the hard way: put a queue between your log shippers and your aggregation layer. Kafka, Redis, even a simple file buffer. Traffic spikes will happen. Deploys will happen. If your pipeline has no buffer, you drop logs during the exact moments you need them most.
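The idea can be sketched in a few dozen lines: a bounded in-memory queue between the application and the aggregator, flushed in batches by a background thread. This is an illustration of the pattern, not production code; `send_batch` stands in for whatever actually delivers logs downstream:

```python
import queue
import threading
import time

class BufferedShipper:
    """Buffer log lines in a bounded queue and flush them in batches,
    so a slow or briefly-down aggregator never blocks the app."""

    def __init__(self, send_batch, maxsize=10000, batch_size=100, interval=1.0):
        self.q = queue.Queue(maxsize=maxsize)
        self.send_batch = send_batch  # callable that delivers a list of lines
        self.batch_size = batch_size
        self.interval = interval
        self.dropped = 0
        threading.Thread(target=self._run, daemon=True).start()

    def emit(self, line):
        try:
            self.q.put_nowait(line)  # never block the request path
        except queue.Full:
            self.dropped += 1        # count drops instead of stalling callers

    def _run(self):
        while True:
            batch = []
            deadline = time.monotonic() + self.interval
            # collect up to batch_size lines, or whatever arrives before the deadline
            while len(batch) < self.batch_size and time.monotonic() < deadline:
                try:
                    batch.append(self.q.get(timeout=0.05))
                except queue.Empty:
                    pass
            if batch:
                try:
                    self.send_batch(batch)
                except Exception:
                    # aggregator is down: re-queue what fits, drop the rest
                    for line in batch:
                        self.emit(line)
                    time.sleep(self.interval)
```

Kafka or Redis gives you the same shape with durability across restarts; the in-process version above only smooths spikes and brief outages, which is still far better than nothing.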
Pick Your Logging Battles
ELK is powerful software with brutal operational costs. Most teams underestimate how much work Elasticsearch is to run, and they find out at the worst possible time. If you can afford someone who genuinely enjoys tuning JVM garbage collection and shard allocation, go for it. Otherwise, pay for a hosted solution or pick a simpler tool. Your on-call engineers will thank you.