I’ve spent the better part of two years at the fintech startup building pipelines that ingest financial data from dozens of sources. Price feeds, news articles, filings, social sentiment. All of it messy. All of it arriving at unpredictable rates. And all of it feeding dashboards that investors actually make decisions from.
Every single one of those pipelines has broken. The difference between a bad pipeline and a good one isn’t whether it breaks. It’s what happens next.
The Ways They Break
Schema drift is the silent killer. We pull news data from multiple aggregation APIs. One morning, a provider renamed a field from published_date to publishedAt. No changelog. No warning. Our parser kept running, just with null timestamps on every article. We didn’t catch it for six hours because the pipeline was “healthy” – all green, zero errors, completely wrong output.
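The fix we landed on is to fail loudly at ingest when expected fields go missing, instead of letting the parser emit nulls. A minimal sketch of that check; `REQUIRED_FIELDS`, `SchemaDriftError`, and the extra field names are illustrative, not any real provider's schema:

```python
# Required-field check at the ingest boundary. A renamed field (e.g.
# published_date -> publishedAt) now raises instead of producing nulls.
REQUIRED_FIELDS = {"published_date", "headline", "source"}

class SchemaDriftError(Exception):
    pass

def check_schema(record: dict) -> dict:
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        # Fail the record loudly so the pipeline stops looking "healthy".
        raise SchemaDriftError(f"missing expected fields: {sorted(missing)}")
    return record
```

The point is that "all green, zero errors" should be impossible when a provider silently renames a field.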
Then there’s the volume problem. Financial markets don’t generate data at a steady rate. Earnings season hits, or there’s a flash crash, and suddenly we’re getting 20x our normal message volume through Kafka. Pipelines tuned for the average case choke. Batch windows overflow. Consumers fall behind and start dropping events.
Dependency failures are just reality. Data providers go down. Rate limits kick in at the worst possible moment. We had one source that would silently return stale data instead of erroring when their backend was degraded. That one burned us badly.
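For the stale-data case, the defense is to treat staleness as an error yourself rather than trusting the provider to signal it. A minimal sketch, assuming a 15-minute freshness budget; `MAX_AGE` and `assert_fresh` are illustrative names, not our actual config:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Freshness budget for a source; tune per feed. Illustrative value.
MAX_AGE = timedelta(minutes=15)

def assert_fresh(latest_ts: datetime, now: Optional[datetime] = None) -> None:
    """Raise if the newest record from a source is older than MAX_AGE."""
    now = now or datetime.now(timezone.utc)
    age = now - latest_ts
    if age > MAX_AGE:
        # A degraded backend returning old data now fails like an outage.
        raise RuntimeError(f"source data is stale: newest record is {age} old")
```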
The worst category: the pipeline runs, the jobs succeed, and the numbers are wrong. We shipped bad price data to a dashboard once because a currency conversion step was using a cached exchange rate from three days prior. Nobody noticed until a user complained. That’s the kind of failure that erodes trust.
Designing For The Inevitable
Store everything raw. Before you parse, before you transform, before you do anything clever – dump the raw payload somewhere durable. At the fintech startup we write every incoming message to S3 before processing. When (not if) a transformation bug corrupts data downstream, you can replay from the raw store. This has saved us more times than I can count.
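The pattern is simple enough to sketch. Here a local directory stands in for S3, and the source/date/content-hash layout is illustrative rather than our exact bucket scheme:

```python
import hashlib
import pathlib
from datetime import datetime, timezone

# Stand-in for the S3 bucket; the real thing would be an s3:// prefix.
RAW_ROOT = pathlib.Path("raw-store")

def store_raw(source: str, payload: bytes) -> pathlib.Path:
    """Write the untouched payload to durable storage before any parsing."""
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    digest = hashlib.sha256(payload).hexdigest()[:16]  # content-addressed name
    path = RAW_ROOT / source / day / f"{digest}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(payload)  # durable copy exists before processing starts
    return path
```

Because the name is derived from the content, re-ingesting the same message just overwrites an identical file, which keeps the raw store replay-safe too.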
Make every write idempotent. If a step can’t be safely retried, your pipeline is a house of cards. Upserts are your friend:
INSERT INTO price_feeds (symbol, timestamp, price, source)
VALUES (:symbol, :ts, :price, :source)
ON CONFLICT (symbol, timestamp, source) DO UPDATE
SET price = EXCLUDED.price,
    updated_at = NOW();
We use composite natural keys – symbol plus timestamp plus source – so replaying a batch doesn’t create duplicates. It just overwrites with the same values. Boring. Exactly what you want.
Validate at the boundaries, not in the middle. Input validation happens the moment data enters your system. Output validation happens before anything gets promoted to the “trusted” layer. Everything in between assumes the data is clean because you already checked.
from datetime import datetime, timedelta

def validate_price_record(record):
    # ValidationError and KNOWN_SYMBOLS are defined elsewhere in the module.
    if record['price'] <= 0:
        raise ValidationError(f"Invalid price {record['price']} for {record['symbol']}")
    if record['timestamp'] > datetime.utcnow() + timedelta(hours=1):
        raise ValidationError(f"Future timestamp: {record['timestamp']}")
    if record['symbol'] not in KNOWN_SYMBOLS:
        raise ValidationError(f"Unknown symbol: {record['symbol']}")
    return record
Failed records go to a dead letter queue. The pipeline keeps moving. Bad data gets quarantined. Someone reviews the DLQ daily. This is critical – you need the pressure release valve, but you also need someone actually looking at what lands there.
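The quarantine loop itself is small. A sketch of the pattern, where a plain list stands in for the real dead letter queue and `ValidationError` mirrors the validator above:

```python
class ValidationError(Exception):
    pass

def process_batch(records, validate, dead_letters):
    """Validate each record; quarantine failures instead of halting the batch."""
    clean = []
    for record in records:
        try:
            clean.append(validate(record))
        except ValidationError as exc:
            # Capture the record and the reason so the daily DLQ review
            # has something actionable to look at.
            dead_letters.append({"record": record, "error": str(exc)})
    return clean
```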
Checkpoint aggressively. Our Kafka consumers commit offsets after every successfully processed batch, not at the end of some long-running job. A crash at minute 45 of a 60-minute job shouldn’t mean reprocessing from minute zero. Save your position. Make that position represent consistent state.
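The shape of per-batch checkpointing, with a JSON file standing in for a Kafka consumer's committed offset (the file and names are illustrative, not our actual consumer code):

```python
import json
import pathlib

# Stand-in for the consumer group's committed offset.
CHECKPOINT = pathlib.Path("checkpoint.json")

def load_offset() -> int:
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())["offset"]
    return 0

def process_from_checkpoint(messages, handle_batch, batch_size=100):
    """Process in batches, committing position only after each batch succeeds."""
    offset = load_offset()  # a restart resumes here, not from zero
    while offset < len(messages):
        batch = messages[offset:offset + batch_size]
        handle_batch(batch)
        offset += len(batch)
        # Commit AFTER the batch fully succeeded, never before.
        CHECKPOINT.write_text(json.dumps({"offset": offset}))
    return offset
```

A crash between batches costs you at most one batch of reprocessing, and because the writes are idempotent, even that reprocessing is harmless.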
Monitoring That Actually Tells You Something
“Pipeline is running” isn’t a useful health signal. I care about:
- Record throughput – is it within expected range for this time of day?
- End-to-end latency – how old is the freshest record in the output table?
- Error rate by stage – which step is producing failures?
- Output distribution – did the number of distinct symbols in today’s price feed drop by 30%?
That last one is the one most people skip. We alert on output shape, not just output existence. If our news pipeline normally produces articles tagged with 500+ different companies per hour and suddenly that drops to 50, something is wrong upstream even if every job reports success.
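A shape check like that can be one comparison against a baseline. A sketch; the function name and default threshold are illustrative, and the baseline would come from historical data:

```python
def shape_alert(distinct_count: int, baseline: int, max_drop: float = 0.30) -> bool:
    """Return True when output cardinality has fallen more than max_drop
    below its historical baseline, even if every job reported success."""
    return distinct_count < baseline * (1 - max_drop)
```

The same comparison works for row counts, distinct symbols per hour, or any other cardinality your output is supposed to have.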
Page on things that need action. A sustained error rate above threshold. Data freshness exceeding the SLA. A complete pipeline stop. Don’t page on a single transient retry. Your on-call engineer will start ignoring alerts within a week.
Testing Against Reality
Unit tests cover your transformation logic. Great. Necessary. Not sufficient.
You need integration tests that use production-shaped data. Not three handcrafted JSON objects – actual samples from your real sources, warts and all. Include the record with the unicode character in the company name that broke your CSV parser last month. Include the price feed entry with a value of zero that your system interpreted as “free stock.”
Simulate failures deliberately. Kill a source mid-batch. Inject malformed records. Throttle the database connection. Watch what happens. Does the pipeline degrade gracefully or does it fall over and require manual intervention to restart? If it’s the latter, you have work to do.
Backfills Aren’t Special
If you can’t reprocess historical data without taking down your live pipeline, your architecture has a problem. Partition by date. Keep your raw store accessible. Make sure your compute can handle backfill load alongside normal traffic.
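With date-partitioned raw storage, a backfill reduces to enumerating partitions and replaying each one. A sketch assuming the source/date prefix layout (the layout is illustrative):

```python
from datetime import date, timedelta

def partitions(source: str, start: date, end: date):
    """Yield one raw-store prefix per day in [start, end], inclusive."""
    d = start
    while d <= end:
        yield f"raw-store/{source}/{d.isoformat()}/"
        d += timedelta(days=1)
```

Each partition can then be replayed independently, in parallel with live traffic, without the live consumers ever noticing.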
We backfill constantly at the fintech startup. A new data enrichment step means reprocessing months of articles. A bug fix in price normalization means replaying weeks of feeds. This is normal. Design for it from day one.
Pick Boring Tools
I’ve seen teams adopt streaming frameworks they don’t understand because “real-time” sounds impressive on a slide. If your data arrives in hourly batches from an SFTP server, you don’t need Kafka Streams. A cron job and a Python script will outperform a distributed streaming platform that nobody on the team can debug at 3am.
Pick the tool your team can operate under pressure. The best pipeline architecture is the one where your worst engineer can diagnose a failure in the middle of the night, not the one that requires your best engineer to be awake.
The Real Lesson
Every reliability improvement we’ve made at the fintech startup came from a production incident. Raw storage because we lost transformed data. Idempotent writes because a retry created duplicate price entries. Dead letter queues because a single bad record used to halt the entire pipeline.
You can learn from our mistakes or you can make your own. Either way, the approach is the same: assume everything will break, make recovery cheap, and instrument your system so you find out before your users do.