Event-Driven Architecture: What I Got Wrong and What Survived

10 min read
Tags: architecture, events, golang, messaging

Lessons from building event-driven systems at the fintech startup and Decloud. What actually works, what silently corrupts your data, and Go patterns for handling events without losing your mind.

It was a Tuesday at the fintech startup. Our financial news pipeline processed market events from about forty data sources – earnings announcements, price movements, regulatory filings – and fanned them out to user watchlists. Event-driven. Clean architecture diagrams. The works.

Then Reuters published a correction to an earnings report. The correction event arrived, got processed, updated the record. Except three other consumers had already read the original event, cached it, and built derivative data on top of it. The correction propagated to the main feed within seconds. The analytics service picked it up four minutes later. The notification service? It had already sent push notifications to eight thousand users with the wrong number.

Nobody’s data was technically “lost.” Everything was eventually consistent. But “eventually” turned out to be forty minutes, and by then a bunch of users had made trading decisions based on stale information. The postmortem wasn’t fun.

That incident taught me more about event-driven architecture than any conference talk. Events are easy to publish. The hard part is everything that happens after.

Quick take

Event-driven architecture decouples your services beautifully and then introduces a whole new category of bugs you’ve never debugged before. It’s worth it – sometimes. The key is idempotent consumers, an outbox pattern for reliable publishing, and accepting that you’re trading request-response simplicity for operational complexity. Go into it with open eyes or don’t go at all.


Events vs. commands

This distinction matters more than people think. Get it wrong and your “loosely coupled” system is actually tightly coupled with extra steps.

Events are facts. Past tense. “OrderPlaced.” “UserRegistered.” “PriceCorrected.” The producer publishes what happened and doesn’t care who’s listening. Could be zero consumers. Could be fifty.

Commands are directives. “SendEmail.” “ChargeCard.” “UpdateCache.” Targeted at a specific service. The sender expects something to happen.

At Decloud, I caught us mixing these constantly in the early days. A service would publish “ProcessPayment” as an “event.” That’s not an event. That’s a command wearing an event’s clothes. And it created an invisible dependency – if the payment service went down, the publisher had to care, which defeated the whole point.

Rule of thumb: if removing all consumers would break the producer’s logic, it’s a command, not an event.

The event contract is your public API

This is where the fintech startup’s correction bug really originated. We treated events like internal implementation details. Field names leaked database columns. Payloads included internal IDs that only made sense inside one service.

Your event schema is a contract. Treat it like a public API. Version it. Document it. Don’t shove your ORM struct into JSON and call it a day.

Here’s the envelope pattern we settled on at Decloud:

{
  "id": "7f2b5a1e-6c0a-4c2c-9a7a-9c2c6d49f9a2",
  "type": "pricing.QuoteUpdated",
  "source": "pricing-service",
  "time": "2020-07-06T10:30:00Z",
  "version": 2,
  "correlationId": "req-8f43",
  "data": {
    "instrumentId": "AAPL",
    "price": 364.11,
    "currency": "USD"
  }
}

The correlationId saved us countless debugging hours. When a user reports something wrong, you grep for that ID across every service’s logs and see the full journey. Without it, debugging event-driven systems is archaeology.

Idempotent consumers or die

Every broker I’ve worked with – Kafka, RabbitMQ, NATS – delivers at-least-once. Duplicates will happen. Network blip, consumer restart, rebalance. Your consumer sees the same event twice. Or five times.

If your handler isn’t idempotent, you charge a credit card twice. Or send the same notification to eight thousand people. Again.

Here’s the pattern we use in Go at Decloud:

type EventHandler struct {
    db    *sql.DB
    topic string
}

func (h *EventHandler) Handle(ctx context.Context, event Event) error {
    tx, err := h.db.BeginTx(ctx, nil)
    if err != nil {
        return fmt.Errorf("begin tx: %w", err)
    }
    defer tx.Rollback()

    // Check if we've already processed this event
    var exists bool
    err = tx.QueryRowContext(ctx,
        "SELECT EXISTS(SELECT 1 FROM processed_events WHERE event_id = $1)",
        event.ID,
    ).Scan(&exists)
    if err != nil {
        return fmt.Errorf("check idempotency: %w", err)
    }
    if exists {
        return nil // Already processed. Done.
    }

    // Do the actual work
    if err := h.processEvent(ctx, tx, event); err != nil {
        return fmt.Errorf("process event: %w", err)
    }

    // Record that we've processed this event -- same transaction
    _, err = tx.ExecContext(ctx,
        "INSERT INTO processed_events (event_id, topic, processed_at) VALUES ($1, $2, NOW())",
        event.ID, h.topic,
    )
    if err != nil {
        return fmt.Errorf("record processed event: %w", err)
    }

    return tx.Commit()
}

The important bit: the idempotency check and the business logic live in the same database transaction. If the process crashes after doing the work but before recording the event ID, the transaction rolls back and the event gets reprocessed safely. If it crashes after commit, the duplicate delivery gets caught by the EXISTS check.

This pattern handles 99% of our idempotency needs. The remaining 1% involves external side effects (sending emails, calling third-party APIs) which need a different approach.

The outbox pattern

This one took me too long to learn. The naive approach: save to database, then publish event.

// DON'T DO THIS
func (s *OrderService) PlaceOrder(ctx context.Context, order Order) error {
    if err := s.db.SaveOrder(ctx, order); err != nil {
        return err
    }
    // What if we crash right here?
    return s.publisher.Publish(ctx, "orders.OrderPlaced", order)
}

If you crash between the database write and the publish, the order exists but the event never fires. Downstream services never learn about it. Data silently drifts.

The fix is the outbox pattern. Write the event to an outbox table in the same transaction as your business data. A separate process polls the outbox and publishes.

func (s *OrderService) PlaceOrder(ctx context.Context, order Order) error {
    tx, err := s.db.BeginTx(ctx, nil)
    if err != nil {
        return fmt.Errorf("begin tx: %w", err)
    }
    defer tx.Rollback()

    if err := s.saveOrder(ctx, tx, order); err != nil {
        return fmt.Errorf("save order: %w", err)
    }

    evt := OutboxEvent{
        ID:        uuid.New().String(),
        Type:      "orders.OrderPlaced",
        Payload:   marshal(order),
        CreatedAt: time.Now(),
    }
    if err := s.saveOutboxEvent(ctx, tx, evt); err != nil {
        return fmt.Errorf("save outbox event: %w", err)
    }

    return tx.Commit()
}

And the publisher, running separately:

func (p *OutboxPublisher) Poll(ctx context.Context) error {
    events, err := p.db.FetchUnpublished(ctx, 100)
    if err != nil {
        return fmt.Errorf("fetch unpublished: %w", err)
    }

    for _, evt := range events {
        if err := p.broker.Publish(ctx, evt.Type, evt.Payload); err != nil {
            // Log and continue. We'll retry on next poll.
            log.Printf("publish failed for event %s: %v", evt.ID, err)
            continue
        }
        if err := p.db.MarkPublished(ctx, evt.ID); err != nil {
            // Event published but not marked. Will be re-published.
            // Consumer idempotency handles the duplicate.
            log.Printf("mark published failed for event %s: %v", evt.ID, err)
        }
    }
    return nil
}

Is it more code? Yes. Does it guarantee that every committed business change produces an event? Also yes. The outbox publisher can crash, restart, and pick up where it left off. Consumers handle the duplicates. The system converges.

At Decloud, we run the outbox poller on a 500ms tick. Good enough for our latency requirements. If you need sub-100ms, look into CDC (change data capture) with something like Debezium reading the WAL directly. Same principle, lower latency, more operational overhead.

Picking a broker

I’m not going to do a feature comparison chart. Those are outdated the moment you publish them. Here’s how I think about it instead.

Kafka if you need ordered replay. We use it at Decloud for anything where “play back the last 7 days of events” is a real requirement. The retention model is its killer feature. The operational cost is its tax. Running Kafka well requires dedicated attention.

RabbitMQ if you need flexible routing and work queues. Fan-out, topic exchanges, priority queues. If your pattern is “distribute work across competing consumers,” Rabbit is easier to reason about than Kafka consumer groups.

NATS if you want something lightweight. We use NATS for internal service communication that doesn’t need persistence. Fire-and-forget telemetry, cache invalidation signals, that kind of thing. JetStream adds persistence if you need it, but at that point evaluate whether Kafka serves you better.

Managed services (SQS, EventBridge, Pub/Sub) if operational simplicity matters more than configurability. At Decloud we run on AWS, and for non-critical event flows, SQS with a dead-letter queue is hard to beat for simplicity.

Ordering is a lie (mostly)

Kafka guarantees ordering per partition. RabbitMQ doesn’t guarantee ordering at all with competing consumers. NATS JetStream gives you ordering per stream but not per subject with multiple consumers.

In practice, design for out-of-order delivery. If you can’t, partition by the entity that needs ordering (e.g., all events for order ID “abc” go to the same partition).

// Partition key ensures all events for the same order
// land on the same Kafka partition
func partitionKey(event Event) string {
    switch e := event.Data.(type) {
    case OrderEvent:
        return e.OrderID
    case UserEvent:
        return e.UserID
    default:
        return event.ID // fallback: event-level ordering only
    }
}

At the fintech startup, we partitioned by instrument ID. All events for AAPL went to the same partition. Ordering within an instrument was guaranteed. Ordering across instruments didn’t matter.

Dead letters and retries

When a consumer fails, retry with exponential backoff. After N retries, send the event to a dead-letter queue. Don’t retry forever. I’ve seen a single poison event take down a consumer group for hours because it kept crashing and restarting in a loop.

func (h *RetryHandler) HandleWithRetry(ctx context.Context, event Event) error {
    var lastErr error
    for attempt := 0; attempt <= h.maxRetries; attempt++ {
        if attempt > 0 {
            // Exponential backoff: 200ms, 400ms, 800ms, ... capped at 10s.
            backoff := 100 * time.Millisecond << attempt
            if backoff > 10*time.Second {
                backoff = 10 * time.Second
            }
            // Wait, but bail out if the context is cancelled.
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-time.After(backoff):
            }
        }

        lastErr = h.inner.Handle(ctx, event)
        if lastErr == nil {
            return nil
        }
        log.Printf("attempt %d failed for event %s: %v", attempt+1, event.ID, lastErr)
    }

    // Max retries exceeded. Dead-letter it.
    if err := h.deadLetter.Send(ctx, event, lastErr); err != nil {
        return fmt.Errorf("dead letter failed for event %s: %w", event.ID, err)
    }
    return nil // Event is safely in the DLQ. Consumer can move on.
}

The dead-letter queue isn’t a trash can. Set up alerts. Review it weekly. At Decloud we have a Slack alert for any event hitting the DLQ and a weekly review where we categorize failures: transient (infra blip), poison (bad data), or bug (our code).

When events are the wrong answer

I’ve seen teams go all-in on event-driven architecture for a CRUD app with three services. Don’t.

Events make sense when:

  • Multiple consumers need to react to the same change independently
  • Temporal decoupling matters – the producer shouldn’t wait for or even know about consumers
  • Data pipelines need a replay-friendly stream of changes
  • Cross-team boundaries where synchronous coupling would create deployment dependencies

Events are overhead when:

  • You have one producer and one consumer. That’s a function call with extra steps.
  • You need synchronous request-response. An HTTP call is simpler and easier to debug.
  • The system is small enough that a monolith or simple service-to-service calls work fine.

At Decloud, roughly 40% of our inter-service communication is event-driven. The rest is gRPC. That ratio feels right. Event-driven for fan-out and cross-domain integration. Direct calls for everything else.

Observability or it didn’t happen

You can’t debug an event-driven system without distributed tracing. Full stop.

Every event carries a correlation ID. Every consumer logs that ID. When something goes wrong, you search by correlation ID and see the full chain: which service published, who consumed, what happened next.

We also track:

  • Consumer lag – how far behind each consumer group is. If it’s growing, you’re either underprovisioned or have a bug.
  • Processing time per event type – catches performance regressions early.
  • Dead-letter rate – our canary. A spike means something broke.
  • End-to-end latency – time from event publication to final consumer processing. This is the number users actually feel.

Event-driven architecture is a trade. You give up the simplicity of “service A calls service B and gets a response.” In return you get decoupling, independent scaling, and resilience to individual service failures. Whether that trade is worth it depends on your system, your team, and how much operational maturity you have.

If you can’t answer “how do I debug a failed event end-to-end?” before you start building, you’re not ready. Get observability in place first. The events can wait.