Quick take
Event sourcing is powerful but unforgiving. Get the aggregate boundaries wrong and you’ll spend months cleaning up. Get them right and you have auditability, replay, and decoupled integrations almost for free.
I’ve been building event-sourced systems at the fintech startup for a while now. We process financial news and market data from hundreds of sources, score relevance in real time, and deliver personalized feeds to users. That pipeline is a natural fit for event sourcing. Every price tick, every news article ingestion, every user interaction – they’re all events that happened at a specific moment, and we need to know exactly what happened and when.
But “natural fit” doesn’t mean “easy.” I’ve made most of the mistakes on this list. This post is what I wish someone had handed me before I started.
The Shift That Changes Everything
In a traditional system you store current state. A row in a table says “Order #789 has status SHIPPED.” You overwrite the old status. History is gone unless you bolted on an audit log.
Event sourcing flips that. You store the facts: OrderPlaced, PaymentReceived, OrderShipped. Current state is derived by replaying those facts. The event log is the source of truth. Everything else – your read models, your dashboards, your search indexes – is a projection you can tear down and rebuild from scratch.
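The "derive state by replaying facts" idea is just a left fold over the log. Here's a minimal sketch using the order events above; the event shapes and field names are illustrative, not from any real system:

```python
# Hypothetical sketch: current state is a fold over the event log.
# Event shapes are illustrative only.

def apply(state, event):
    """Derive the next state from one recorded fact."""
    kind = event["event_type"]
    if kind == "OrderPlaced":
        return {"status": "PLACED", "items": event["items"]}
    if kind == "PaymentReceived":
        return {**state, "status": "PAID"}
    if kind == "OrderShipped":
        return {**state, "status": "SHIPPED"}
    return state  # unknown event types are skipped, not errors

def replay(events):
    """Rebuild current state from scratch by replaying every fact."""
    state = {}
    for e in events:
        state = apply(state, e)
    return state

log = [
    {"event_type": "OrderPlaced", "items": ["book"]},
    {"event_type": "PaymentReceived"},
    {"event_type": "OrderShipped"},
]
# replay(log) -> {"status": "SHIPPED", "items": ["book"]}
```

Because `replay` starts from nothing every time, any projection built this way can be torn down and rebuilt from the log alone.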
At the fintech startup, this wasn’t a philosophical choice. Regulators wanted to know the exact sequence of data that produced a specific score for a specific user at a specific time. With event sourcing we could answer that question by replaying the stream. With CRUD we would have been guessing.
The Anatomy of an Event
Events are immutable records of things that already happened. Past tense. Specific. Self-contained.
{
  "event_type": "ArticleScored",
  "event_id": "evt_cf_98231",
  "timestamp": "2018-03-19T10:15:30Z",
  "aggregate_id": "article_44712",
  "data": {
    "source": "reuters",
    "relevance_score": 0.87,
    "matched_topics": ["AAPL", "earnings"],
    "scoring_model_version": "v3.2"
  }
}
A few things I learned the hard way about event design:
Include the model version. We score articles with ML models that change. Without the model version baked into the event, you can’t tell whether a score difference came from new data or a new model. We didn’t do this initially. Debugging was miserable.
Don’t store derived data as the event. The event is the fact. “ArticleScored” with a score of 0.87 is a fact. “ArticleIsHighlyRelevant” is a conclusion you draw from the fact. Store the fact. Let projections draw conclusions.
Carry enough context. A projection should never need to reach back into another stream to understand what an event means. If your ArticleScored event requires a lookup to know which source it came from, your event is too thin.
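The three lessons above can be captured in the event type itself. A sketch, assuming a frozen dataclass as the event record (the class and field names mirror the JSON example, but this is illustrative, not our actual code):

```python
# Sketch of a self-contained, immutable event record.
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: events are facts and never change
class ArticleScored:
    event_id: str
    aggregate_id: str
    timestamp: str
    source: str                   # context carried in the event itself
    relevance_score: float        # the fact, not a derived conclusion
    matched_topics: tuple
    scoring_model_version: str    # bake the model version in from day one

evt = ArticleScored(
    event_id="evt_cf_98231",
    aggregate_id="article_44712",
    timestamp="2018-03-19T10:15:30Z",
    source="reuters",
    relevance_score=0.87,
    matched_topics=("AAPL", "earnings"),
    scoring_model_version="v3.2",
)

# "Highly relevant" is a conclusion a projection draws, not a stored event:
is_highly_relevant = evt.relevance_score >= 0.8
```

Note that there is no `ArticleIsHighlyRelevant` type here; the threshold lives in the projection, where it can change without rewriting history.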
Aggregates: The Hardest Part
Aggregates define your consistency boundary. They validate commands, enforce invariants, and emit events. Their internal state is rebuilt by replaying the event stream.
Getting aggregate boundaries right is the single most consequential design decision in an event-sourced system. Too big and you get contention, slow replays, and serialization bottlenecks. Too small and you can’t enforce invariants that span related data.
We started with a giant “UserFeed” aggregate that tracked everything – subscribed topics, read articles, relevance preferences, notification settings. It grew to thousands of events per active user within weeks. Replaying it on every command was brutal.
We broke it apart. Subscriptions became their own aggregate. Notification preferences became their own aggregate. The UserFeed aggregate shrank to just the core feed interaction logic. Replay times dropped from seconds to milliseconds. Contention disappeared.
The rule I follow now: an aggregate should be the smallest unit that can enforce its invariants independently.
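Here's what that rule looks like in miniature: an aggregate that rebuilds its state by replay, validates a command against one invariant it can check alone, and emits an event. The topic cap and event shapes are made up for illustration:

```python
# Minimal aggregate sketch: state rebuilt by replay, one invariant enforced.
# The 50-topic cap and event shapes are illustrative.

class Subscriptions:
    MAX_TOPICS = 50

    def __init__(self, history=()):
        self.topics = set()
        for event in history:        # rebuild state by replaying the stream
            self._apply(event)

    def _apply(self, event):
        if event["type"] == "TopicSubscribed":
            self.topics.add(event["topic"])
        elif event["type"] == "TopicUnsubscribed":
            self.topics.discard(event["topic"])

    def subscribe(self, topic):
        """Handle a command: validate the invariant, then emit the event."""
        if len(self.topics) >= self.MAX_TOPICS:
            raise ValueError("topic limit reached")
        event = {"type": "TopicSubscribed", "topic": topic}
        self._apply(event)
        return event                 # caller appends this to the event store

agg = Subscriptions(history=[{"type": "TopicSubscribed", "topic": "AAPL"}])
new_event = agg.subscribe("earnings")
# agg.topics -> {"AAPL", "earnings"}
```

Everything this aggregate needs to validate `subscribe` lives in its own stream, which is exactly why it could be split out of the giant UserFeed aggregate.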
CQRS: Separating the Write and Read Paths
Event sourcing doesn’t require CQRS, but you’ll almost certainly end up there. The write model and the read model have fundamentally different jobs.
Command -> Aggregate -> Event Store -> Projection -> Read Model
The write side validates and enforces rules. The read side is shaped for queries. At the fintech startup our write side emits events like ArticleScored and TopicSubscribed. Our read projections build denormalized views for the API: a user’s personalized feed, trending topics, source reliability dashboards.
The catch is eventual consistency. Your read model will lag behind writes. For us that means a user might subscribe to a topic and not see it reflected in their feed for a few hundred milliseconds. We made that explicit in the product. No pretending the system is synchronous when it’s not.
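A toy, in-memory version of that flow makes the lag concrete. This is a sketch, not our production wiring; in a real system the projection runs asynchronously off the event store, which is where the few hundred milliseconds come from:

```python
# Toy end-to-end sketch of command -> event store -> projection -> read model.
# Everything is in-memory and illustrative.

event_store = []   # write side: append-only log of facts
feed_view = {}     # read side: denormalized per-user view

def handle_subscribe(user_id, topic):
    """Write path: validate and append a fact. Never touches the read model."""
    event = {"type": "TopicSubscribed", "user_id": user_id, "topic": topic}
    event_store.append(event)
    return event

def project(event):
    """Read path: in a real system this runs asynchronously, so it lags."""
    if event["type"] == "TopicSubscribed":
        feed_view.setdefault(event["user_id"], []).append(event["topic"])

handle_subscribe("user_1", "AAPL")
# The write has committed, but the read model hasn't caught up yet:
assert feed_view == {}
for e in event_store:
    project(e)
# feed_view -> {"user_1": ["AAPL"]}
```

The gap between the two asserts is the eventual consistency window the product has to be honest about.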
Projections and Snapshots
Projections
Projections consume events and write to read models. The cardinal rule: they must be idempotent.
def handle(event):
    # Delivery is at-least-once, so a replayed event must be a no-op.
    if already_processed(event.event_id):
        return
    update_read_model(event)
    mark_processed(event.event_id)
We’ve been bitten by non-idempotent projections exactly once. A network blip caused a batch of ArticleScored events to be delivered twice. Our read model double-counted relevance scores. Users saw garbage rankings. Took us three hours to figure out what happened and another two to rebuild the projection. Idempotency checks would have made it a non-event.
Snapshots
When an aggregate has a long event history, replaying from event zero on every command gets expensive. Snapshots are a cache – a serialized version of aggregate state at a known position. You replay from the snapshot forward instead of from the beginning.
Key point: snapshots aren’t the source of truth. They’re disposable. If a snapshot is corrupt or stale, delete it and rebuild from events. We snapshot our most active aggregates every 500 events and it keeps command processing fast.
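A sketch of the load path, assuming a snapshot is just a (position, state) pair; the state shape here is a trivial counter so the replay cost is visible:

```python
# Snapshot sketch: replay from the last snapshot forward. The snapshot is a
# disposable cache, never the source of truth. Shapes are illustrative.

def load_aggregate(snapshot, events):
    """snapshot: (position, state) or None; events: full stream, oldest first."""
    if snapshot is not None:
        position, state = snapshot
    else:
        position, state = 0, {"count": 0}   # cold start: replay everything
    for event in events[position:]:          # only the tail after the snapshot
        state = {"count": state["count"] + 1}
        position += 1
    return position, state

events = [{"n": i} for i in range(1200)]
snapshot = (1000, {"count": 1000})           # taken at event 1000

position, state = load_aggregate(snapshot, events)
# Only 200 events replayed; position == 1200, state == {"count": 1200}

# If the snapshot is corrupt, throw it away and replay from zero:
position2, state2 = load_aggregate(None, events)
```

Both paths converge on the same state, which is the whole point: deleting a snapshot costs time, never correctness.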
Process Managers
Some workflows span multiple aggregates. A user signs up, we create their profile, set default subscriptions, and kick off an initial scoring run. No single aggregate owns all of that. A process manager listens to events from each aggregate, issues commands to others, and tracks progress. Think of it as a long-running coordinator that reacts to facts rather than orchestrating a transaction.
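A stripped-down version of that signup coordinator might look like this. Event and command names are illustrative, and a real process manager would persist its progress so it survives restarts:

```python
# Process manager sketch for the signup workflow described above.
# Event/command names are illustrative.

class SignupProcess:
    """Reacts to facts from several aggregates and issues the next command."""

    def __init__(self):
        self.profile_created = False
        self.defaults_set = False

    def handle(self, event):
        """Return the next command to dispatch, or None if still waiting."""
        if event["type"] == "UserSignedUp":
            return {"command": "CreateProfile", "user_id": event["user_id"]}
        if event["type"] == "ProfileCreated":
            self.profile_created = True
            return {"command": "SetDefaultSubscriptions",
                    "user_id": event["user_id"]}
        if event["type"] == "DefaultSubscriptionsSet":
            self.defaults_set = True
            return {"command": "RunInitialScoring",
                    "user_id": event["user_id"]}
        return None

pm = SignupProcess()
cmd = pm.handle({"type": "UserSignedUp", "user_id": "u1"})
# cmd -> {"command": "CreateProfile", "user_id": "u1"}
```

Note that it only ever reacts to events and emits commands; it holds no business invariants of its own, so the aggregates stay the sole authority on their data.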
Schema Evolution: Plan Before You Ship
Events are forever. You can’t go back and change them. So you need a schema evolution strategy before you write your first event.
version: 2
payload:
  source: reuters
  relevance_score: 0.87
  matched_topics: ["AAPL", "earnings"]
  scoring_model_version: "v3.2"
What works for us:
- Additive changes with defaults. New field? Add it. Old events that lack the field get a sensible default when read.
- New event types for semantic shifts. If the meaning of an event changes fundamentally, introduce a new event type. Don’t twist the old one.
- Upcasters on read. When loading events, transform old versions into the current shape. The event store stays untouched. The application code only deals with the latest schema.
We’ve gone through four versions of our scoring events. The upcaster chain is a bit ugly but it works and we’ve never had to touch the event store itself.
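An upcaster chain is simple in shape. A sketch, assuming v1 events predate model versioning so the upcaster defaults the missing field; the field names follow the versioned payload above, but the chain itself is illustrative:

```python
# Upcaster sketch: old event versions are transformed into the current shape
# at read time; the stored events are never modified.

def upcast_v1_to_v2(event):
    """v1 events predate model versioning; default the missing field."""
    payload = dict(event["payload"], scoring_model_version="unknown")
    return {"version": 2, "payload": payload}

UPCASTERS = {1: upcast_v1_to_v2}   # chain: 1 -> 2 -> ... -> current

def upcast(event, current_version=2):
    """Step through the chain until the event is at the current version."""
    while event["version"] < current_version:
        event = UPCASTERS[event["version"]](event)
    return event

old = {"version": 1,
       "payload": {"source": "reuters", "relevance_score": 0.87}}
new = upcast(old)
# new["version"] == 2; the missing field is defaulted, the store untouched
```

Each new schema version adds one entry to `UPCASTERS`, so the application only ever sees the latest shape no matter how old the stored event is.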
The Pitfalls That Actually Hurt
Most event sourcing failures come from modeling mistakes, not tooling problems.
Modeling commands as events. “ScoreArticle” is a command. “ArticleScored” is an event. If you store commands in your event log you will confuse every projection that reads them. I’ve seen this mistake more times than I want to admit, including in our own early prototypes.
Giant aggregates. Already covered this. Keep them small. If replay takes more than a few milliseconds, something is wrong.
Synchronous projections on the write path. The moment you make a write wait for a projection to finish, you’ve coupled your write throughput to your read model’s performance. Don’t do this.
Skipping idempotency. Events will be delivered more than once. Network partitions, retries, rebalances. Your projections must handle duplicates gracefully or your read models will drift into nonsense.
Chatty technical events. Events should represent business facts. “DatabaseRowUpdated” isn’t a business fact. “ArticleScored” is. If your events read like a database changelog, you’ve modeled the wrong thing.
When Not to Bother
Event sourcing is a deliberate tradeoff, not a default architecture.
Skip it when CRUD is genuinely sufficient and audit history isn’t a real requirement. Skip it when the domain is simple and the modeling overhead isn’t justified. Skip it when your team isn’t ready to think in terms of eventual consistency.
At the fintech startup it was the right call because of the regulatory requirements, the temporal query needs, and the natural event-driven shape of financial data. For our marketing site? We use a database and a CMS. Not everything needs to be an event.
The Short Version
Model events as business facts in past tense. Keep aggregates small and focused. Build read models asynchronously and accept eventual consistency. Make projections idempotent. Plan schema evolution before you ship. And decide early whether the complexity is worth it for your specific domain – because once you commit to event sourcing, unwinding it is far harder than adopting it.