Quick take
Learn how DNS, TCP, and load balancers actually work, or keep debugging the wrong layer at 2 AM.
The network isn’t magic
I keep having the same conversation at Dropbyke. A developer opens a ticket: “the API is randomly slow.” I ask them to run ss -tan on the box. Blank stare. They have never heard of socket states.
This isn’t a knowledge gap I can ignore. If you write software that talks over a network – and in 2016, that’s almost all software – then networking is part of your job. Not optional. Not “nice to have.” Part of the job.
Three bugs that keep showing up
DNS caching gone wrong. We had a service that cached DNS lookups forever because the HTTP client’s default resolver never respected TTL. We rotated an IP behind a load balancer, and that one service kept hammering the old address for hours. The fix was two lines of config. The outage was forty-five minutes of confusion because nobody thought to check name resolution.
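The fix is conceptually simple: a cached address has to expire when its TTL does. Here is a minimal sketch of that idea in Python — the class name, hostnames, and addresses are illustrative, not our actual config or any real library’s API:

```python
import time

class TTLCache:
    """A DNS-style cache where each entry expires after its TTL,
    instead of living forever like the buggy resolver did."""

    def __init__(self):
        self._entries = {}  # hostname -> (address, expires_at)

    def put(self, host, address, ttl):
        # Record when this answer stops being trustworthy.
        self._entries[host] = (address, time.monotonic() + ttl)

    def get(self, host):
        entry = self._entries.get(host)
        if entry is None:
            return None
        address, expires_at = entry
        if time.monotonic() >= expires_at:
            # Stale entry: drop it so the caller does a fresh lookup.
            del self._entries[host]
            return None
        return address

# Hypothetical usage with a documentation-range IP:
cache = TTLCache()
cache.put("api.example.com", "203.0.113.10", ttl=30)
print(cache.get("api.example.com"))  # → 203.0.113.10 while the TTL holds
```

Once the TTL lapses, get returns None and the service resolves again — which is exactly the behavior that would have picked up the rotated IP.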
TIME_WAIT exhaustion. A microservice opened a new TCP connection for every request to our mapping provider. Under load, the box ran out of ephemeral ports. Connections backed up, timeouts cascaded, and the bike unlock flow broke. The developer who wrote it had no idea that TCP connections linger in TIME_WAIT after close. Connection pooling fixed it. Understanding why connection pooling matters would have prevented it.
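The mechanics behind the fix: the side that closes a TCP connection keeps the socket in TIME_WAIT (typically 60 seconds on Linux), so one-connection-per-request burns a fresh ephemeral port every time. A pool caps that by recycling a fixed set of connections. A minimal sketch of the idea — in production you’d reach for your HTTP client’s built-in pooling rather than roll this yourself:

```python
import queue

class ConnectionPool:
    """Hand out a fixed set of reusable connections instead of opening
    a new one per request, which leaves each closed socket parked in
    TIME_WAIT and eventually exhausts ephemeral ports."""

    def __init__(self, factory, size=4):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())

    def acquire(self, timeout=5):
        # Block for a free connection rather than opening another socket.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        # Return the connection for reuse; it is never closed between requests.
        self._pool.put(conn)

# Hypothetical usage: the factory would be something like
#   lambda: http.client.HTTPConnection("maps.example.com")
pool = ConnectionPool(factory=object, size=2)
conn = pool.acquire()
pool.release(conn)  # same object comes back on the next acquire
```

The design choice that matters: acquire blocks when the pool is empty, so load shows up as queueing you can measure, not as a box silently running out of ports.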
Load balancer health checks. We added a new backend behind an Nginx upstream. The health check was hitting a path that returned 200 even when the database was unreachable. Traffic routed to a box that couldn’t serve real requests. Fifteen minutes of partial outage because nobody understood what the load balancer was actually checking.
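The principle is that a health endpoint should exercise the dependencies a real request needs, not just prove the process is up. A sketch of that shape — db_ping is a stand-in for whatever dependency probe your service actually has, not our real handler:

```python
def health_check(db_ping):
    """Return (status_code, body) for the load balancer's health endpoint.
    db_ping is any callable that raises if the database is unreachable.
    An unconditional 200 here is exactly what routed traffic to a box
    that couldn't serve real requests."""
    try:
        db_ping()
    except Exception as exc:
        # Fail the check so the load balancer pulls this backend.
        return 503, f"unhealthy: {exc}"
    return 200, "ok"

status, body = health_check(lambda: None)  # dependency reachable
print(status)  # → 200
```

With this shape, Nginx’s health check fails the moment the database does, and the broken backend drops out of rotation instead of taking fifteen minutes of your traffic.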
None of these required deep packet analysis. None required a networking degree. They required knowing the basics well enough to ask the right question.
Why developers avoid it
Networking feels like someone else’s problem. There’s an infrastructure team, or a cloud provider, or an abstraction layer that’s supposed to handle it. And those things help. But abstractions leak. They always leak. When they do, the developer staring at the logs needs to know which layer is broken.
“It works on my machine” is almost always a networking statement. Different DNS, different routes, different timeouts, different connection behavior. If you can’t reason about the network, you can’t debug the gap between your laptop and production.
Where to start
You don’t need to read RFCs. You need a working mental model. These resources got our team there:
- “TCP/IP Illustrated, Volume 1” by Stevens. Dense but precise. Read the chapters on TCP connection states and you will understand half the production issues you have ever seen.
- Julia Evans’ networking zines. Approachable, visual, and surprisingly deep. Start with the one on DNS.
- ss, dig, curl -v, tcpdump. Run them. Break things on purpose. Watch what happens when you kill a connection mid-handshake or poison a DNS cache.
The network isn’t going to become simpler. The developers who understand it will debug faster, design better systems, and stop filing tickets that say “the API is randomly slow.”