Quick take
Most teams default to batch ingestion, and that’s usually correct. CDC gives you near-real-time without the operational overhead of full streaming. True streaming is powerful but expensive to run and easy to get wrong. Pick based on your actual latency requirement, not what sounds impressive. This post compares all three, drawing on experience building financial data pipelines at a fintech startup.
At the fintech startup we processed financial news and market data from hundreds of sources. Some of it needed to be available in seconds – breaking news that moves stock prices. Some of it needed to be correct and complete for daily analytics – aggregated sentiment scores, content performance metrics. And some of it sat in between.
We ended up running all three ingestion patterns: batch, CDC, and streaming. Not because we planned it that way, but because different problems demanded different tradeoffs. After living with all three in production, I’ve got strong opinions about when each one belongs.
The comparison
Here’s the honest breakdown:
| | Batch | CDC | Streaming |
|---|---|---|---|
| Latency | Hours | Minutes | Seconds |
| Complexity | Low | Medium | High |
| Cost | Low | Medium | High |
| Schema evolution | Easy (reload) | Moderate | Hard |
| Backfill | Trivial | Moderate | Painful |
| Ordering guarantees | N/A (snapshot) | Per-table | Requires design |
| Failure recovery | Rerun the job | Replay from WAL | Depends on offset management |
| Good for | Reporting, analytics, finance | Replication, audit trails, near-RT analytics | Event-driven products, real-time features |
| Operational burden | Minimal | Moderate (connector management) | Significant (cluster, schemas, monitoring) |
That table is the result of learning things the hard way.
Batch: the boring default that works
At the fintech startup, our analytics pipeline ran on batch. Every night, we pulled the day’s content, computed sentiment aggregates, calculated engagement metrics, and loaded the results into our warehouse for the analytics team.
Batch ingestion is a scheduled extract. Query the source, pull the data, load it somewhere. The tools are mature – Airflow for orchestration, simple scripts or managed connectors for extraction, a cloud warehouse for storage.
Why batch works so well:
- Idempotency is natural. If a job fails, rerun it. You get the same result. There’s no state to recover, no offsets to manage.
- Backfills are trivial. Need to reprocess last month? Change the date parameter and run the job again.
- Schema changes are manageable. Source adds a column? Your next batch picks it up. Source renames a column? You fix the mapping and reprocess.
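The idempotency point is worth making concrete. Here’s a minimal sketch of a date-parameterized batch job where rerunning is safe because each run overwrites its own partition. All the names (`extract_day`, the dict standing in for a warehouse) are illustrative, not from any specific tool:

```python
from datetime import date

def extract_day(source_rows, run_date):
    """Pull only the rows belonging to one day from the source."""
    return [r for r in source_rows if r["published_date"] == run_date]

def load_partition(warehouse, run_date, rows):
    """Overwrite the partition for run_date -- reruns replace, never append."""
    warehouse[run_date] = rows

def run_batch(source_rows, warehouse, run_date):
    load_partition(warehouse, run_date, extract_day(source_rows, run_date))

source = [
    {"published_date": date(2024, 1, 1), "content_id": 1},
    {"published_date": date(2024, 1, 2), "content_id": 2},
]
wh = {}
run_batch(source, wh, date(2024, 1, 1))
run_batch(source, wh, date(2024, 1, 1))  # rerun after a failure: same result
```

A backfill is just a loop over this function with different `run_date` values — no special machinery.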
The latency tradeoff is real. Batch means your data is hours old at best. For dashboards and reports, that’s fine. For anything user-facing and real-time, it’s not.
My rule: default to batch unless someone can articulate a specific business reason for lower latency. “It would be nice to have real-time data” isn’t a business reason. “Users leave the product if news is more than 30 seconds old” is.
CDC: the middle ground most people skip
Change Data Capture reads the database’s write-ahead log (WAL) and streams row-level changes to a target system. Debezium is the standard open-source tool. Most managed data platforms offer CDC connectors now.
At the fintech startup, we used CDC to replicate our content database to the analytics warehouse. Instead of running nightly full extracts that hammered the production database, we streamed changes continuously. The warehouse stayed within minutes of production, and the load on the source database dropped dramatically.
CDC hits a sweet spot:
- Near-real-time without the streaming overhead. You get minute-level freshness without running a Kafka cluster or designing event schemas.
- Auditable by default. Every change is captured. You get a complete history of what changed and when.
- Lower source load. Reading the WAL is far cheaper than running full table scans.
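To make the mechanics concrete, here’s a sketch of what a CDC consumer does with the changes it reads. The event shape (an `op` of `"c"`/`"u"`/`"d"` with before/after row images) follows Debezium’s envelope convention; the in-memory dict stands in for the target warehouse table:

```python
def apply_change(replica, event):
    """Apply one Debezium-style change event to a replica keyed by primary key."""
    op = event["op"]
    if op in ("c", "u"):              # create / update: upsert the after-image
        row = event["after"]
        replica[row["id"]] = row
    elif op == "d":                   # delete: remove by the before-image key
        del replica[event["before"]["id"]]

replica = {}
changes = [
    {"op": "c", "after": {"id": 1, "title": "Fed raises rates"}},
    {"op": "u", "after": {"id": 1, "title": "Fed raises rates 25bp"}},
    {"op": "c", "after": {"id": 2, "title": "Earnings beat"}},
    {"op": "d", "before": {"id": 2}},
]
for ev in changes:
    apply_change(replica, ev)
# replica now mirrors the source: one row, with the updated title
```

Applying changes in WAL order is what gives you the per-table ordering guarantee from the comparison table — the replica converges to the source state without ever scanning the source.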
The downsides are real but manageable. CDC connectors need monitoring – they can fall behind, hit WAL retention limits, or break on DDL changes. Schema evolution requires more care than batch because the connector is continuously running. And initial snapshots for large tables can be slow.
For most teams that want better-than-daily freshness, CDC is the right answer. It’s the pattern I recommend most often because it solves the latency problem without introducing streaming complexity.
Streaming: powerful and expensive
True event streaming – Kafka, Kinesis, Pulsar – is for when your product needs to react to events in real time. At the fintech startup, our breaking news pipeline was streaming. Content arrived from news APIs, got classified and enriched, and appeared in user feeds within seconds.
Streaming is the right tool when:
- User experience depends on seconds-level latency.
- Events need to trigger immediate downstream actions.
- You need to join multiple event streams in real time.
Streaming is the wrong tool when people want it because it sounds impressive on an architecture diagram.
Here’s what streaming actually costs:
- Cluster management. Kafka isn’t a fire-and-forget service. Broker configuration, partition management, replication, consumer group coordination. Even managed Kafka (Confluent Cloud, MSK) requires operational attention.
- Schema discipline. When producers and consumers are decoupled and running continuously, a schema change can break downstream consumers silently. You need a schema registry. You need compatibility policies. You need someone who owns them.
- Ordering and exactly-once semantics. These sound like solved problems until you hit partition rebalancing, consumer crashes, or cross-partition joins. The guarantees are achievable but require careful design.
- Backfills are painful. Need to reprocess a week of events? You need to replay from stored offsets, hope retention was sufficient, and deal with the downstream impact of reprocessing alongside live traffic.
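In practice, “exactly-once” usually means at-least-once delivery plus an idempotent sink. This sketch simulates that with an in-memory log standing in for a Kafka partition: the consumer crashes before committing its offset, replays on restart, and the dedupe-by-key sink absorbs the duplicates. The commit-at-end-of-batch behavior is a deliberate simplification for illustration:

```python
def consume(log, sink, seen_ids, start_offset, crash=False):
    """Process events from start_offset; commit the offset only after the
    whole batch succeeds (a common at-least-once pattern)."""
    for offset in range(start_offset, len(log)):
        event = log[offset]
        if event["id"] not in seen_ids:   # idempotent sink: skip duplicates
            seen_ids.add(event["id"])
            sink.append(event)
        if crash and offset == 2:
            return start_offset           # crashed before committing
    return len(log)                       # committed: next read starts here

log = [{"id": i, "payload": f"event-{i}"} for i in range(5)]
sink, seen = [], set()
committed = consume(log, sink, seen, 0, crash=True)  # dies mid-batch, commit lost
committed = consume(log, sink, seen, committed)      # restart: replays 0-2, dedupes
```

Every real streaming pipeline ends up containing some version of this logic; the question is whether your team has thought it through before the first crash or after.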
At the fintech startup we had two engineers who understood the streaming pipeline deeply. When they were unavailable, incidents on the streaming side took significantly longer to resolve than batch or CDC issues. That knowledge concentration is a real operational risk.
The transformation layer
Regardless of how data gets in, it needs to be transformed for consumption. ELT – load raw, transform in the warehouse – has become the standard pattern, and dbt is the tool that made it practical.
```sql
-- models/content_daily_metrics.sql
select
    content_id,
    source_name,
    published_date,
    count(*) as view_count,
    avg(sentiment_score) as avg_sentiment
from {{ ref('stg_content_events') }}
group by 1, 2, 3
```
Version-controlled SQL. Testable. Documented. This is one of the few areas where the tooling’s genuinely gotten better in the last two years. At the fintech startup we adopted dbt for our analytics models and the improvement in confidence and velocity was immediate.
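The “testable” part comes from dbt’s schema tests. As a rough sketch, a `schema.yml` for the model above might look like this, using dbt’s built-in generic tests (the descriptions are illustrative):

```yaml
# models/schema.yml -- illustrative sketch using dbt's built-in tests
version: 2

models:
  - name: content_daily_metrics
    description: "Daily engagement and sentiment aggregates per content item."
    columns:
      - name: content_id
        tests:
          - not_null
      - name: avg_sentiment
        description: "Mean sentiment score over the day's events."
```

`dbt test` then fails the build when an assumption breaks, which is exactly the kind of cheap guardrail batch pipelines reward.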
What I tell teams now
Start with batch. Seriously. Get your data warehouse set up, your orchestration running, your dbt models tested. Build the operational muscle for data quality – freshness checks, row count monitoring, schema validation.
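The two cheapest quality checks to automate first are freshness and row-count drift. A hedged sketch, where the thresholds and table metadata are illustrative rather than recommendations:

```python
from datetime import datetime, timedelta

def check_freshness(last_loaded_at, now, max_age=timedelta(hours=26)):
    """Fail if the newest load is older than the SLA (26h leaves slack
    for a daily job that runs a little late)."""
    return now - last_loaded_at <= max_age

def check_row_count(todays_count, recent_counts, tolerance=0.5):
    """Fail if today's volume deviates more than 50% from the recent mean."""
    baseline = sum(recent_counts) / len(recent_counts)
    return abs(todays_count - baseline) <= tolerance * baseline

now = datetime(2024, 1, 10, 8, 0)
fresh = check_freshness(datetime(2024, 1, 10, 2, 0), now)   # loaded 6h ago
volume_ok = check_row_count(9_500, [10_000, 10_400, 9_800])
```

Wire checks like these into your orchestrator so a stale or suspiciously small load fails loudly instead of quietly feeding wrong numbers to a dashboard.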
When batch latency becomes a proven problem for a specific use case, add CDC for that source. Don’t migrate everything. Just the tables where minutes matter.
Reserve streaming for the use cases where seconds matter and the business impact justifies the operational cost. Have at least two people who understand the streaming infrastructure deeply. Monitor consumer lag obsessively.
The most common mistake I see: teams adopt streaming first because it’s exciting, then spend months building operational maturity that batch would have given them for free. The boring choice is usually the right choice. Optimize for reliability and then add speed where the business actually needs it.