Quick take
SFUs beat MCUs for 90% of video use cases. Simulcast is non-negotiable. Your TURN bill will surprise you. And if you’re not measuring time-to-first-media, you’re flying blind.
Two weeks ago, someone called me in a panic. Their video platform – built for 10,000 concurrent users – was handling 80,000. SFUs at 94% CPU. TURN servers hemorrhaging bandwidth. The whole thing held together by one engineer who hadn’t slept in three days.
This is March 2020. I’ve had four conversations like this in the last ten days. The pattern is always the same: the architecture was fine for the old world, and the old world ended overnight.
Here’s what I keep telling these teams. Not a textbook overview. The actual decisions that matter when your video infra is on fire.
Video Isn’t HTTP
I keep meeting backend engineers who treat video like a REST API with bigger payloads. It’s not.
A video call is a continuous bidirectional stream that has to feel natural to a human brain. That brain notices latency above ~150ms. It notices audio glitches before video glitches. It definitely notices when your echo cancellation is broken and everyone hears themselves with a half-second delay.
- Latency under 150ms or the conversation feels wrong. People start talking over each other.
- Audio is sacred. Users tolerate blurry video for minutes. They hang up after five seconds of choppy audio.
- NAT traversal is the norm. Corporate firewalls, hotel wifi, mobile networks – assume the worst.
- Packet loss kills you silently. Bandwidth can look fine while jitter makes the whole thing unwatchable.
P2P, SFU, MCU: Pick the Right One
I see teams agonize over this choice like it’s a religious decision. It shouldn’t be.
P2P works for 2-4 people. Direct connection, lowest latency, almost no server cost. The catch: every participant uploads their stream to every other participant. Past 4-5 people, uplinks die.
SFU (Selective Forwarding Unit) is the right answer for almost everything else. It receives streams and forwards them selectively. No mixing, no transcoding on the server. This is what Zoom, Google Meet, and basically every serious platform runs.
MCU (Multipoint Control Unit) mixes everything into a single composed stream per participant. Sounds nice. In practice, the server-side CPU and GPU cost is brutal. Only for very low-capability clients or massive broadcast rooms.
Most production systems are hybrid. At Dropbyke we ran P2P for 1:1 rider-driver calls and SFU for everything larger. Simple rule, huge savings.
The rough guidelines I give every team:
- 2-4 people: P2P. Don’t overthink it.
- 5-20 people: SFU with simulcast.
- 20+ people: SFU with aggressive active-speaker optimization.
- Webinars / broadcasts: SFU for the audience, dedicated pipeline for presenters.
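The P2P cutoff falls straight out of uplink arithmetic. A quick sanity-check sketch, assuming a single fixed bitrate per stream (the numbers are illustrative, not a capacity plan):

```python
def uplink_kbps(participants: int, topology: str, stream_kbps: int = 1500) -> int:
    """Per-participant uplink bandwidth for a call.

    P2P: each participant uploads one copy of their stream to every
    other participant, so uplink grows linearly with room size.
    SFU: each participant uploads exactly one copy to the server,
    regardless of room size.
    """
    if topology == "p2p":
        return (participants - 1) * stream_kbps
    if topology == "sfu":
        return stream_kbps
    raise ValueError(f"unknown topology: {topology}")

# A 4-person P2P call needs 3 x 1500 = 4500 kbps of uplink per person --
# already tight on many residential connections. At 8 people it's 10500.
assert uplink_kbps(4, "p2p") == 4500
assert uplink_kbps(50, "sfu") == 1500
```

That constant SFU uplink is the whole argument: the server pays the fan-out cost instead of the weakest participant's connection.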
Scaling SFUs Without Losing Your Mind
SFUs scale horizontally in theory. In practice, calls are stateful. Every participant in a room has to land on the same SFU instance. This isn’t a stateless web tier you can throw behind a load balancer.
Session stickiness first. Your load balancer needs to understand rooms, not just connections. Get this wrong and you’ll spend days debugging “ghost participants” and one-way audio.
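The simplest form of room stickiness is a deterministic room-to-instance mapping. A minimal sketch (instance names are hypothetical; health checks are assumed away):

```python
import hashlib

def pick_sfu(room_id: str, instances: list[str]) -> str:
    """Deterministically map a room to one SFU instance.

    Every join for the same room hashes to the same instance, which is
    the property that prevents ghost participants and one-way audio:
    nobody ends up on a server the rest of the room isn't on.
    """
    digest = hashlib.sha256(room_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(instances)
    return instances[index]

sfus = ["sfu-1", "sfu-2", "sfu-3"]
# Two participants joining "standup-42" land on the same instance.
assert pick_sfu("standup-42", sfus) == pick_sfu("standup-42", sfus)
```

One caveat worth knowing before you ship this: plain modulo hashing remaps most rooms whenever the instance list changes, and live calls can't migrate. Consistent hashing limits the reshuffle to a small fraction of rooms, which is why production room routers use it.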
Room size caps are a safety valve. One team removed their 50-person cap “because sales wanted it” and promptly discovered a single 200-person room could saturate an SFU. Cap it. Enforce it. Explain to sales later.
Cascading SFUs across regions is how you handle distributed calls without routing Sydney’s traffic through Virginia. Each region gets its own SFU. They talk over backbone links. Users connect locally.
Sydney user <-> Sydney SFU <--backbone--> London SFU <-> London user
This keeps last-mile latency low and contains failure domains. When the Sydney SFU has issues, London keeps running.
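Region selection for a joining user can be as simple as picking the lowest-RTT healthy region. A sketch, assuming the client measures RTT to each region with a pre-join probe (the probe and region names are hypothetical):

```python
def nearest_region(user_rtts_ms: dict[str, float], healthy: set[str]) -> str:
    """Pick the closest healthy region for a joining user.

    user_rtts_ms maps region name -> RTT measured from the client.
    Unhealthy regions are simply excluded, so regional failover falls
    out of the same code path: new joins route around a dead region.
    """
    candidates = {r: rtt for r, rtt in user_rtts_ms.items() if r in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)

rtts = {"sydney": 12.0, "london": 280.0, "virginia": 210.0}
assert nearest_region(rtts, {"sydney", "london", "virginia"}) == "sydney"
# Sydney goes down: the same user fails over to the next-closest region.
assert nearest_region(rtts, {"london", "virginia"}) == "virginia"
```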
Simulcast Is Non-Negotiable
If you’re running an SFU without simulcast, stop reading and go implement it. I’m serious.
Clients send multiple quality layers simultaneously:
simulcast:
  high: 1280x720 @ 30fps, ~1500 kbps
  mid:  640x360 @ 30fps, ~700 kbps
  low:  320x180 @ 15fps, ~200 kbps
The SFU forwards the appropriate layer per receiver based on bandwidth and screen size. Active speaker gets high quality. Thumbnail tiles get low. Someone on hotel wifi gets mid or low for everything.
Without simulcast, you’re either crushing constrained clients with full-resolution streams or wasting bandwidth by downscaling everything. No middle ground.
Pair simulcast with adaptive bitrate control. When the SFU detects rising packet loss or sustained jitter, it signals the sender to drop resolution. Fastest lever you have to stabilize a degrading call.
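The per-receiver forwarding decision can be sketched as a function of two inputs: the receiver's estimated downlink and the size the stream is actually rendered at. Thresholds and headroom factor here are illustrative, not a spec:

```python
LAYERS = {  # bitrate budget per layer, mirroring the config above
    "high": 1500,  # 1280x720 @ 30fps
    "mid": 700,    # 640x360 @ 30fps
    "low": 200,    # 320x180 @ 15fps
}

def choose_layer(available_kbps: int, tile_height_px: int) -> str:
    """Pick the simulcast layer to forward to one receiver.

    Capped by render size first -- there's no point sending 720p to a
    thumbnail -- then stepped down until it fits the bandwidth budget.
    """
    if tile_height_px <= 180:
        wanted = "low"
    elif tile_height_px <= 360:
        wanted = "mid"
    else:
        wanted = "high"
    order = ["high", "mid", "low"]
    # Step down until the layer fits, with ~20% headroom for overhead.
    for layer in order[order.index(wanted):]:
        if LAYERS[layer] * 1.2 <= available_kbps:
            return layer
    return "low"  # last resort: always forward something

assert choose_layer(available_kbps=5000, tile_height_px=720) == "high"
assert choose_layer(available_kbps=1000, tile_height_px=720) == "mid"  # bandwidth-capped
assert choose_layer(available_kbps=5000, tile_height_px=120) == "low"  # thumbnail
```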
The TURN Server Problem
TURN is the thing nobody budgets for and everybody needs.
STUN lets peers discover their public IP and try a direct connection. When that fails – corporate firewall, symmetric NAT, hotel wifi blocking UDP – you need TURN. TURN relays all media through your server. Every stream hits your infra twice (in and out). At scale, it becomes your single largest cost.
I worked with a team whose TURN relay ratio was 40%. Their bandwidth bill was obscene. We got it down to 15% by improving ICE candidate gathering, adding more TURN regions, and implementing better network detection on the client side.
Monitor your TURN relay ratio religiously. Anything above 20% means something is wrong with your connectivity story. More TURN regions with latency-aware routing helps. Anycast or geo-DNS for region selection is table stakes.
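To see why the relay ratio dominates the bill, run the arithmetic: relayed media crosses your edge twice. A rough cost sketch with invented traffic numbers (this is illustration, not a pricing model):

```python
def turn_egress_gb(sessions: int, relay_ratio: float,
                   avg_minutes: float, avg_kbps: int) -> float:
    """Rough monthly TURN-attributable transfer in GB.

    Relayed media hits your infrastructure twice (in and out), so the
    billable bytes for relayed sessions are doubled.
    """
    relayed_sessions = sessions * relay_ratio
    bytes_per_session = avg_kbps * 1000 / 8 * avg_minutes * 60
    return relayed_sessions * bytes_per_session * 2 / 1e9

# 100k sessions/month, 30 min each, 1 Mbps average:
# at a 40% relay ratio that's ~18 TB through TURN; at 15% it's ~6.75 TB.
heavy = turn_egress_gb(100_000, 0.40, 30, 1000)
fixed = turn_egress_gb(100_000, 0.15, 30, 1000)
assert abs(heavy - 18_000) < 1
assert heavy > 2.5 * fixed
```

Cutting the ratio from 40% to 15% cuts that line item by the same factor, which is why client-side connectivity work pays for itself quickly.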
What to Actually Monitor
I’ve walked into too many video platforms where the dashboard shows CPU, memory, and “active calls.” That tells you nothing about whether calls are good.
The metrics I insist on:
Time to first media. How long from clicking “join” until the user sees/hears someone? This is the single most important UX metric. If it’s over 3 seconds, people think it’s broken.
Join success rate. What percentage of join attempts actually result in a connected call? I’ve seen rates as low as 85%. That’s 15% of users clicking join and getting nothing.
Packet loss and jitter per stream. Not aggregated across the SFU. Per stream. You need to know which participants are suffering.
TURN relay ratio. Track it. Trend it. Freak out if it’s climbing.
SFU CPU and egress bandwidth. Per instance, not averaged. One overloaded SFU ruins everyone on it while the average looks fine.
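The "not averaged" point in the last two bullets is cheap to demonstrate. A sketch of the tail-vs-mean trap for time to first media (join times invented):

```python
import math

def p95_ms(samples: list[float]) -> float:
    """Nearest-rank 95th percentile. Track the tail, not the mean:
    the mean hides the users for whom joining feels broken."""
    ordered = sorted(samples)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]

# 90 joins connect in 1.2s, 10 take 4s.
joins_ms = [1200.0] * 90 + [4000.0] * 10
assert sum(joins_ms) / len(joins_ms) < 1500   # mean looks healthy
assert p95_ms(joins_ms) > 3000                # but the tail is broken
```

The same logic applies to SFU CPU: a fleet average of 57% can coexist with one instance at 97%, and every call pinned to that instance is suffering.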
The Part Nobody Wants to Hear
Here’s what I keep telling teams: the architecture is the easy part. The hard part is operating it under pressure.
You need admission control. When an SFU is at 80% CPU, stop putting new rooms on it. Almost nobody does this until after their first outage.
You need graceful degradation. When things get bad, drop to audio-only automatically. A stable voice call beats a stuttering video call every time.
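The degradation ladder itself is a few comparisons. A sketch with illustrative thresholds (real systems tune these against observed call quality, and hysteresis is omitted here):

```python
def media_mode(loss_pct: float, jitter_ms: float) -> str:
    """Pick a media mode from current network conditions.

    The ordering encodes the rule from the text: protect audio at all
    costs, sacrifice video resolution first, then video entirely.
    """
    if loss_pct > 15 or jitter_ms > 120:
        return "audio-only"   # keep the conversation alive
    if loss_pct > 5 or jitter_ms > 60:
        return "video-low"    # drop to the low simulcast layer
    return "video-full"

assert media_mode(loss_pct=0.5, jitter_ms=10) == "video-full"
assert media_mode(loss_pct=8.0, jitter_ms=10) == "video-low"
assert media_mode(loss_pct=20.0, jitter_ms=10) == "audio-only"
```

In practice you also need hysteresis (don't flap between modes every second) and a user-visible indicator so people understand why their video disappeared.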
You need fast regional failover. When a region goes down, redirect new joins to the nearest healthy region. Don’t wait for someone to notice at 2 AM.
And you need someone who owns this. Not “the platform team” collectively. A specific person who gets paged when quality drops. The team I mentioned at the start? Their biggest problem wasn’t the architecture. It was that nobody owned the system end-to-end until it was already on fire.
Where This Goes
We’re two weeks into what I think is a permanent shift. The companies that figure out SFU scaling, simulcast, and TURN optimization now will have a massive advantage.
The ones still treating video as a feature rather than infrastructure will keep calling consultants like me at 11 PM. Which is fine for my business. But I’d rather see the industry get this right.