At the fintech startup we had this price feed pipeline. Financial data from multiple exchanges, normalised, enriched, and pushed to users who had watchlists. The original design was request-response all the way down. User service calls price service, price service calls enrichment service, enrichment service calls the notification service. Neat diagram on a whiteboard. Terrible in production.
One Tuesday afternoon the enrichment service got slow. Not down. Slow. Response times climbed from 40ms to 1200ms. Because every upstream service was waiting synchronously, the entire chain backed up. Users stopped getting price alerts. The dashboard froze. We had a four-service outage caused by one service being a bit sluggish.
That was the week I started ripping out synchronous calls and replacing them with events.
Events vs. commands – this matters
An event says “this happened.” Past tense. Immutable. PriceUpdated, BikeUnlocked, UserRegistered. A command says “please do this.” It can be rejected. It can fail.
The difference isn’t pedantic. At Dropbyke, when a user unlocked a bike, we emitted a BikeUnlocked event. The billing service, the map service, the analytics pipeline – they all consumed that event independently. None of them could say “no, don’t unlock it.” That decision was already made. They just reacted.
Commands flow differently. ChargeCreditCard can fail. ReserveBike can be rejected if inventory is zero. Mixing these up causes real bugs. I’ve seen teams emit a PaymentProcessed event before the payment actually went through. Don’t do that.
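For what it's worth, here's the distinction in code – a minimal Python sketch with made-up names, not anything from either codebase. The point is structural: an event handler has no return channel for refusal, a command handler does.

```python
from dataclasses import dataclass

# An event is an immutable fact. Past tense. Nobody downstream can veto it.
@dataclass(frozen=True)
class BikeUnlocked:
    bike_id: str
    user_id: str

# A command is a request. The handler is allowed to say no.
@dataclass(frozen=True)
class ReserveBike:
    bike_id: str
    user_id: str

def handle_reserve_bike(cmd: ReserveBike, available: set) -> bool:
    """Returns False when the reservation is rejected (e.g. no inventory)."""
    if cmd.bike_id not in available:
        return False  # commands can fail; events cannot
    available.discard(cmd.bike_id)
    return True
```

Consumers of `BikeUnlocked` just react; `handle_reserve_bike` can push back before anything is published.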
The broker in the middle
The whole thing works because of a message broker sitting between producers and consumers. We used Kafka at the fintech startup – those price feeds generated serious volume and we needed ordering guarantees per instrument. At Dropbyke it was RabbitMQ. Smaller scale, simpler ops, good enough.
The broker decouples everything. The price feed doesn’t know or care that six different services consume its events. It publishes and moves on. Each consumer reads at its own pace. One goes down? The others keep running. You deploy a new analytics consumer next month? Just subscribe. The producer never changes.
This is the core win. Services stop depending on each other directly. They deploy independently. They fail independently. They scale independently.
Patterns I’ve actually used
Pub-sub is the bread and butter. At the fintech startup, a single PriceUpdated event triggered watchlist checks, portfolio recalculations, alert evaluations, and analytics writes. Four systems, zero direct calls between them.
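A toy in-memory broker makes the shape obvious. This is a sketch, not a real Kafka or RabbitMQ client, but the decoupling is the same: the publisher has no idea who's listening.

```python
from collections import defaultdict
from typing import Callable

# Minimal in-memory pub-sub: a stand-in for the broker, names invented.
class Broker:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        # The producer publishes and moves on; consumers react independently.
        for handler in self._subscribers[topic]:
            handler(event)

broker = Broker()
hits = []
broker.subscribe("prices", lambda e: hits.append(("watchlist", e["instrument_id"])))
broker.subscribe("prices", lambda e: hits.append(("analytics", e["instrument_id"])))
broker.publish("prices", {"event": "PriceUpdated", "instrument_id": "AAPL", "price": 143.50})
# hits now holds one entry per subscriber; the publisher never knew either existed
```

Adding a fifth consumer next month is one more `subscribe` call. The producer never changes.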
Event sourcing we used for audit-critical flows. Instead of storing “current balance is X,” you store every event that led to X. You can replay the whole history. You can answer “what was the state at 3pm yesterday?” without building a time-travel feature. The downside: schema evolution is painful, and once your event streams get long you need snapshots or replay takes forever.
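The core mechanic is just a fold over the event log. A minimal sketch with invented account events, not our actual schema – note that "what was the state at 3pm?" is replay with a cutoff:

```python
# Instead of storing "current balance is X", store the events that led to X.
events = [
    {"type": "Deposited", "amount": 100, "at": "2017-04-09T10:00:00Z"},
    {"type": "Withdrew",  "amount": 30,  "at": "2017-04-09T12:00:00Z"},
    {"type": "Deposited", "amount": 50,  "at": "2017-04-09T16:00:00Z"},
]

def balance_at(events: list, cutoff: str) -> int:
    """Replay the log up to a point in time to recover the state back then."""
    total = 0
    for e in events:  # the log is append-only and ordered, so we can stop early
        if e["at"] > cutoff:
            break
        total += e["amount"] if e["type"] == "Deposited" else -e["amount"]
    return total

balance_at(events, "2017-04-09T15:00:00Z")  # 70: the 16:00 deposit hasn't happened yet
```

The snapshot problem is visible even here: a real stream has millions of events, so you periodically persist a `(balance, position)` checkpoint and replay only the tail.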
CQRS we adopted for the watchlist feature. Writes went through a command model – add stock, remove stock, set alert threshold. Reads came from a denormalised projection optimised for fast lookups. Different data shapes, different scaling needs. But you live with eventual consistency. A user adds a stock and for a few hundred milliseconds the read model hasn’t caught up. You handle that in the UI or you accept it.
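In miniature, with invented names, the split looks like this. The key detail is that the write path does not touch the read model – a projector catches it up later, and until it runs, reads are stale:

```python
# CQRS sketch: writes go through the command model, reads hit a separate
# denormalised projection. Names are illustrative, not the real system.
command_log = []          # write side: append-only
watchlist_view = {}       # read side: denormalised per-user view

def add_stock(user_id: str, symbol: str) -> None:
    command_log.append({"type": "StockAdded", "user": user_id, "symbol": symbol})
    # NOTE: the read model is NOT updated here; a projector does it later.

def project_pending() -> None:
    """Runs asynchronously in a real system; the gap is eventual consistency."""
    for event in command_log:
        symbols = watchlist_view.setdefault(event["user"], [])
        if event["symbol"] not in symbols:
            symbols.append(event["symbol"])

add_stock("u1", "AAPL")
watchlist_view.get("u1")   # still None: the read model hasn't caught up
project_pending()
watchlist_view.get("u1")   # ['AAPL']
```

That `None` in the middle is the few-hundred-millisecond window the UI has to paper over.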
Sagas we used at Dropbyke for the ride lifecycle. A BikeUnlocked event kicks it off. If the ride ends normally, RideCompleted triggers billing. If something goes wrong – GPS lost, bike reported damaged – compensating events fire to reverse or adjust charges. No distributed transactions. Each step is its own event, each failure has a defined recovery path.
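Stripped to its bones, the compensating-event idea looks like this. Handlers and the per-minute rate are made up, not Dropbyke's billing logic – the point is that the failure path appends a correcting event rather than rolling anything back:

```python
# Saga sketch: each step reacts to an event; a failure triggers a
# compensating event instead of a distributed transaction.
charges = []

def on_ride_completed(event: dict) -> None:
    charges.append({"user": event["user"], "amount": event["minutes"] * 0.15})

def on_bike_damaged(event: dict) -> None:
    # Compensating step: adjust the charge already made, don't undo history.
    charges.append({"user": event["user"], "amount": -charges[-1]["amount"]})

ride = {"user": "u1", "minutes": 20}
on_ride_completed(ride)           # normal path: RideCompleted -> billing
on_bike_damaged({"user": "u1"})   # failure path: compensating refund
sum(c["amount"] for c in charges) # nets to zero; both events stay in the log
```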
Where the real work lives
Event design is everything. A bad event is one that forces the consumer to call back to the producer for context. At the fintech startup we learned this the hard way. Our first PriceUpdated event had just the instrument ID and the new price. The notification service needed the user’s alert thresholds to decide whether to fire. So it called the user service. Synchronously. Defeating the entire point.
We fixed it by enriching the event. Not with everything, but with enough:
```json
{
  "event": "PriceUpdated",
  "version": 3,
  "instrument_id": "AAPL",
  "price": 143.50,
  "exchange": "NASDAQ",
  "timestamp": "2017-04-09T14:30:00Z",
  "previous_close": 142.80
}
```
Version the events from day one. You will change the schema. You will add fields, rename things, realise you forgot something critical. If you don’t version, you break every consumer on every change.
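On the consumer side, versioning means branching on the version field instead of assuming the latest schema. A sketch against the PriceUpdated event above – the assumption that `previous_close` arrived in v3 is mine, for illustration:

```python
# Tolerant consumer: handle old and new schema versions side by side.
def handle_price_updated(event: dict) -> dict:
    version = event.get("version", 1)  # pre-versioning events count as v1
    # previous_close only exists from v3 onward (hypothetical history)
    previous_close = event["previous_close"] if version >= 3 else None
    return {
        "instrument_id": event["instrument_id"],
        "price": event["price"],
        "previous_close": previous_close,
    }
```

Old events already sitting in the log never get rewritten, so the consumer has to understand every version it might replay.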
Idempotency is non-negotiable. Kafka's standard delivery guarantee is at-least-once. That means duplicates. Your consumers must handle processing the same event twice without corrupting state. Idempotency keys, deduplication windows, whatever works – but you can’t skip this.
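The simplest form is a set of processed event IDs. In production you'd back this with a store shared across consumer instances and expire old keys, but the shape is the same:

```python
# Idempotent consumer sketch: remember processed event IDs so a
# redelivered event doesn't double-apply.
processed_ids = set()
balance = {"value": 0}

def apply_deposit(event: dict) -> None:
    if event["event_id"] in processed_ids:
        return  # duplicate delivery: at-least-once means this WILL happen
    processed_ids.add(event["event_id"])
    balance["value"] += event["amount"]

evt = {"event_id": "abc-123", "amount": 50}
apply_deposit(evt)
apply_deposit(evt)  # redelivered duplicate: a no-op, balance unchanged
```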
Dead letters. When a consumer fails to process an event, it has to go somewhere visible. Not silently dropped. Not retried infinitely until it blows up the queue. A dead letter topic with alerting. We caught a serialisation bug at Dropbyke within minutes because the dead letter queue spiked. Without it, we would have lost ride events silently.
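A bounded-retry consumer with a dead-letter list, sketched below. In a real setup the dead letters land on a broker topic with alerting on its depth – here it's just a Python list, and the handler names are invented:

```python
# Dead-letter sketch: after N failed attempts, park the event somewhere
# visible instead of retrying forever or dropping it silently.
dead_letters = []

def consume(event: dict, handler, max_attempts: int = 3) -> bool:
    for _ in range(max_attempts):
        try:
            handler(event)
            return True
        except Exception:
            continue  # transient failure: retry, but a bounded number of times
    dead_letters.append(event)  # visible, alertable, replayable after the fix
    return False

def broken_handler(event: dict) -> None:
    raise ValueError("serialisation bug")

consume({"event": "RideCompleted", "ride_id": "r-42"}, broken_handler)
# dead_letters now holds the failed event instead of it vanishing
```

A spike in that list (or topic) is exactly the signal that caught our serialisation bug within minutes.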
When to use it and when not to
Go event-driven when multiple services need to react to the same change, when you need audit trails, when you want services to evolve independently, when eventual consistency is acceptable. Financial data pipelines? Perfect fit. IoT-style location updates from a fleet of bikes? Great.
Don’t use it for simple CRUD apps. Don’t use it if your team can’t handle the operational overhead – you’re running a broker now, monitoring consumer lag, managing schemas. And definitely don’t use it if you need strict synchronous guarantees everywhere. A checkout flow where the user needs an immediate “payment confirmed” is a bad candidate for fire-and-forget events.
What I’d tell myself before starting
Pick the right broker for your volume and ordering needs. Kafka if you need partitioned ordering and high throughput. RabbitMQ if you want simpler operations and flexible routing. Don’t overthink it early – you can migrate later, and you probably will.
Start with one bounded context. Don’t try to event-drive the entire system in a quarter. We started with the price feed pipeline at the fintech startup. Got it stable. Understood the operational model. Then expanded.
The complexity doesn’t disappear. It moves. Instead of debugging synchronous call chains, you’re debugging event flows, consumer lag, and ordering anomalies. Different problems, not fewer problems. But the system bends instead of breaking, and at 2am that difference matters a lot.