I’m writing this on December 27 from my couch, still half-watching my phone for Log4j-related alerts. That feels like a fitting way to end 2021. A year that started with optimism about “returning to normal” and ended with the entire Java ecosystem scrambling to patch a logging library.
This was the busiest year yet. More projects than any year since I started. More interesting problems. More travel (finally). And more moments of “we should have fixed this two years ago” than I can count.
The Work Grew
I took on more enterprise work this year than I expected. Large telecoms and several mid-market companies working through cloud migrations, platform engineering, and reliability improvements. The common thread: organizations that moved fast in 2020 to survive the pandemic were now dealing with the technical debt they accumulated in the process.
A lot of my work was cleaning up. Infrastructure that was “temporary” and became permanent. Monitoring that was good enough for a small team but fell apart at scale. Deployment pipelines held together with bash scripts and prayers.
The work was good. Hard, but good. I like the puzzle of walking into an organization, understanding how things actually work (not how the architecture diagrams say they work), and finding the highest-leverage improvements.
Log4j: The December From Hell
I wrote about Log4j separately, but it deserves mention here because it defined my December. I was helping three teams respond simultaneously. The experience reinforced something I’ve been saying for years: most organizations can’t answer “where is this library used?” quickly.
The companies that responded fastest were the ones that had invested in dependency inventory and software bills of materials. Not fancy tools – just the discipline of knowing what runs where and what it depends on.
The companies that struggled were the ones where infrastructure was a collection of snowflakes built by people who no longer worked there. Nobody knew what Java versions were running. Nobody knew which vendor products bundled Log4j. Every discovery was a surprise.
Log4j wasn’t just a security incident. It was an inventory test. Most organizations failed it.
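To make the inventory point concrete, here’s a minimal sketch of what that discipline buys you. It assumes a CycloneDX-style SBOM in JSON (the structure below is a simplified stand-in, not a full CycloneDX parser), and flags any bundled log4j-core older than 2.17.0 — the version that closed out the December CVEs. With an SBOM per service, answering “where is this library used?” is a loop, not a fire drill.

```python
# Sketch: scan a CycloneDX-style SBOM (JSON) for vulnerable Log4j versions.
# The SBOM structure here is a simplified assumption, not a full CycloneDX parser.
import json

# Log4j 2.x versions below 2.17.0 were affected by the December 2021 CVEs.
FIXED = (2, 17, 0)

def parse_version(v):
    """Turn '2.14.1' into (2, 14, 1) for comparison; non-numeric parts become 0."""
    return tuple(int(p) if p.isdigit() else 0 for p in v.split("."))

def vulnerable_components(sbom):
    """Yield (name, version) for any log4j-core component older than 2.17.0."""
    for comp in sbom.get("components", []):
        if comp.get("name") == "log4j-core":
            if parse_version(comp.get("version", "0")) < FIXED:
                yield comp["name"], comp["version"]

# A tiny example SBOM (contents are illustrative, not from a real service).
sbom_json = """
{
  "components": [
    {"name": "log4j-core", "version": "2.14.1"},
    {"name": "log4j-core", "version": "2.17.1"},
    {"name": "jackson-databind", "version": "2.12.3"}
  ]
}
"""

findings = list(vulnerable_components(json.loads(sbom_json)))
print(findings)  # [('log4j-core', '2.14.1')]
```

The hard part was never the script. It was having the SBOMs to point it at.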
The AWS Outage Reminded Us (Again)
December 7. us-east-1 goes down. Half the internet breaks. My phone lights up.
I wrote about this too, but the meta lesson is what sticks with me. We keep having the same conversation after every major cloud outage. “We should do multi-region.” “We should test our failover.” “We shouldn’t depend on a single control plane.” And then three weeks later the urgency fades and the feature backlog wins.
I proposed a multi-region resilience plan to one company in March. They deprioritized it. When the December outage hit, they were scrambling. We’re now implementing that plan. Nine months and one painful outage later.
Platform Engineering Matured
This was the year platform engineering stopped being a buzzword and started being a real discipline at the enterprises I work with. Teams built internal developer platforms. Some of them were good. Many of them weren’t.
The good ones treated the platform as a product. They had user feedback loops, golden paths for common service types, and self-service provisioning that actually worked without filing a ticket.
The bad ones built a Backstage instance, called it “developer experience,” and wondered why nobody used it.
The gap between good and bad platform engineering isn’t technology. It’s product thinking. Understanding what your developers actually need and building that, instead of what the platform team thinks is cool.
OpenTelemetry Became Real
I migrated two organizations to OpenTelemetry this year. Both had been locked into vendor-specific instrumentation that made switching backends painful and expensive. The migration wasn’t trivial, but the result was worth it: unified tracing, vendor-neutral instrumentation, and the freedom to change backends without rewriting application code.
OpenTelemetry tracing hit 1.0 this year. That matters. Metrics are still maturing. Logs are early. But the trajectory is clear – this is going to be the standard instrumentation layer for the next decade.
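The “change backends without rewriting application code” claim is worth making concrete. OpenTelemetry’s SDKs read spec-defined environment variables, so retargeting an instrumented service is a deployment change, not a code change. A sketch (service name and endpoint are placeholders, not a real vendor’s):

```shell
# Point an OTel-instrumented service at a different backend without code changes.
# These env vars are defined by the OpenTelemetry spec; values are placeholders.
export OTEL_SERVICE_NAME="checkout-service"
export OTEL_TRACES_EXPORTER="otlp"
export OTEL_EXPORTER_OTLP_ENDPOINT="https://collector.internal:4317"
# Swapping backends later means changing this endpoint (or the collector's
# export pipeline), not re-instrumenting the application.
```

In practice I route everything through an OpenTelemetry Collector, so even this change happens in one place rather than per service.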
The Talent Market Went Bonkers
Every organization I worked with was struggling to hire. Senior infrastructure engineers, SREs, platform engineers – impossible to find, expensive to keep. Remote work went from pandemic necessity to baseline expectation. Companies that insisted on full-time office work lost people.
I watched two companies lose key infrastructure engineers to remote-first companies offering 30% more compensation. In both cases, the departing engineers were the ones who knew where the bodies were buried. The knowledge loss hurt more than the headcount loss.
The lesson should be obvious by now: document your systems, cross-train your teams, and don’t build an organization where critical knowledge lives in one person’s head.
What I Got Wrong This Year
I underestimated how hard hybrid work would be for the teams I consulted with. I thought the tooling problems were mostly solved after 2020. They were. The people problems weren’t. Meeting equity, information flow, decision visibility – these are genuinely hard when half the team is remote and half is in a room.
I also underestimated the supply chain security risk. I knew it mattered conceptually. SolarWinds happened in late 2020. But I didn’t push hard enough on dependency inventory and SBOM practices in early 2021. When Log4j hit, that gap became very real. That’s on me.
Looking at 2022
I expect more of the same themes with more urgency. Supply chain security will move from “we should do this” to “we have to do this.” Platform engineering will continue maturing. Multi-region resilience will get funded at organizations that felt the December pain.
The talent market won’t cool down. If anything, the demand for people who can operate complex distributed systems will increase faster than the supply.
I’m heading into 2022 with a full plate, a long list of blog posts I want to write, and a renewed commitment to pushing on the boring fundamentals – inventory, documentation, testing, and ownership – that make the difference when things catch fire.
Because things will catch fire. They always do. The question is whether you’re ready.