// Topic
Infrastructure
Definition
Infrastructure coverage in this archive spans 42 posts from Feb 2016 to Mar 2026 and focuses on reliability, delivery speed, and cost discipline as one system, not three separate concerns. The strongest adjacent threads are devops, cloud, and kubernetes. Recurring title motifs include kubernetes, infrastructure, production, and need.
Working claims
- Most posts prioritize predictable operations over feature breadth or stack novelty.
- Early posts lean on production and kubernetes, while newer posts lean on infrastructure and engineering as constraints shifted.
- This topic repeatedly intersects with devops, cloud, and kubernetes, so design choices here rarely stand alone.
How to apply this
- Set SLOs first, then choose tooling that keeps deploy, observability, and rollback simple.
- Start with the newest post to calibrate current constraints, then backtrack to older entries for first principles.
- When boundary questions appear, cross-read devops and cloud before committing implementation details.
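One way to make "set SLOs first" concrete is to turn an availability target into an error budget before comparing tools. The sketch below is illustrative only: the function names, the 30-day window, and the availability-style SLO are assumptions, not something taken from the posts in this archive.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for an availability SLO.

    slo_target is a fraction, e.g. 0.999 for "three nines" (assumed convention).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent; negative means the SLO is blown."""
    budget = error_budget_minutes(slo_target, window_days)
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% target with 21.6 minutes of downtime so far.
    print(round(error_budget_minutes(0.999), 1))
    print(round(budget_remaining(0.999, 21.6), 2))
```

Under these assumptions a 99.9% target over 30 days leaves roughly 43 minutes of budget, which gives a rough yardstick for judging whether a deploy or rollback path is simple enough.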
Where teams get burned
- Adding platform layers faster than the team can operate and debug them.
- Chasing throughput gains without proving they improve end-user reliability.
- Applying 2016-era guidance in 2026 without revisiting the assumptions that changed in between.
Suggested reading path
- Start here (current state): Beyond Cloud-Heavy Architecture: Why Agentic Systems Need Local-First, Hardware-Aware Design
- Then read (operating middle): Comparing Infrastructure Testing Approaches: What Actually Catches Bugs
- Finish with (foundational context): Docker in Production: What We Learned Running Containers at Dropbyke
Related posts
- Beyond Cloud-Heavy Architecture: Why Agentic Systems Need Local-First, Hardware-Aware Design
- Your AI Pipeline Is Just ETL With Extra Steps (And That’s Fine)
- Your AI Infrastructure Is Not Special
- Your AI Infrastructure Is Not Ready for Scale. Neither Is Mine.
- Vector Databases: What They Actually Are and When You Need One
- Your Cloud Bill Is Not a Mystery
- Platform Engineering: DevOps Grew Up
- You Do Not Need a FinOps Team
References
42 posts
The 2026 AI Build vs. Buy Calculus (It’s Just Operational Cost)
By mid-2026, AI build vs buy has nothing to do with novelty. It is a ruthless mathematical calculation of telemetry, context freshness, and infrastructure lock-in.
Beyond Cloud-Heavy Architecture: Why Agentic Systems Need Local-First, Hardware-Aware Design
Local-first, hardware-aware architecture is becoming the default for high-reliability AI systems. The cloud-heavy pattern costs too much and fails too unpredictably for agentic workloads.
Your AI Pipeline Is Just ETL With Extra Steps (And That's Fine)
AI data pipelines aren't some new paradigm. They're ETL with a retrieval layer bolted on. The discipline that makes them work is the same discipline that has always made pipelines work: detect change, chunk intelligently, keep indexes fresh.
Your AI Infrastructure Is Not Special
AI infrastructure at scale is just infrastructure. The same boring patterns -- gateways, caching, circuit breakers, budget enforcement -- solve the same boring problems.
Your AI Infrastructure Is Not Ready for Scale. Neither Is Mine.
The GPU shortage is real, rate limits are a production constraint, and your AI demo is going to collapse under real traffic. Some annoyed thoughts on infrastructure realism.
Vector Databases: What They Actually Are and When You Need One
A practical guide to vector databases -- what they store, how similarity search works, and the architectural decisions that matter in production.
Your Cloud Bill Is Not a Mystery
Most cloud cost problems are visibility problems. Fix tagging, kill idle resources, right-size what remains, and make cost a regular engineering conversation.
Platform Engineering: DevOps Grew Up
Platform engineering is what happens when you realize 'you build it, you run it' does not scale past a handful of teams.
You Do Not Need a FinOps Team
Cloud cost management is not a discipline. It is basic engineering hygiene dressed up with a consulting-friendly name.
Most Platform Teams Are Building the Wrong Thing
After assessing platform maturity at a dozen enterprises, the pattern is clear: most platform teams build tools nobody asked for while developers wait in ticket queues.
Your Kubernetes Bill Is Lying to You
Most Kubernetes clusters are 40-60% over-provisioned. Here's how I help teams cut their bills without sacrificing reliability.
Database Reliability Engineering: What I've Learned the Hard Way
Practical database reliability from running Postgres at the fintech startup and at large enterprises. Includes config examples, migration patterns, and the operational habits that actually prevent outages.
Data Engineering Patterns: Batch vs. CDC vs. Streaming
A comparison of data ingestion patterns from building the fintech startup's financial data pipelines, plus when each one actually makes sense.
Multi-Cloud Is Mostly a Marketing Strategy
Multi-cloud sounds great in vendor pitches. In practice, it doubles your operational burden for benefits most teams will never need.
Apple Silicon Won't Replace Your Servers (Yet)
The M1 is impressive hardware. The 'ARM everywhere in the data center' takes are not. Here's what actually matters for server infrastructure.
Platform Engineering Is Just DevOps With a Rebrand
The industry loves renaming things. Platform engineering is DevOps done properly — and most companies still won't do it right.
I Wrote Six Kubernetes Operators. Here's What Actually Matters.
Lessons from building production operators at Decloud: the reconciliation loop, controller-runtime patterns, and the mistakes that cost us sleep.
Stop Guessing Your Kubernetes Resource Limits
Most K8s clusters I audit are either wildly overprovisioned or one bad deploy away from eviction storms. Here's how I set requests, limits, and guardrails.
Your VPN Was Never a Security Architecture
COVID broke everyone's VPN. Good. It was a terrible security model to begin with. The answer isn't scaling your VPN — it's replacing the mental model entirely.
Your Cloud Security Is Falling Apart Right Now
Everyone's scrambling to scale cloud infrastructure overnight. I've seen what happens when security gets deprioritized under pressure — at NATO exercises, at Decloud, at the fintech startup. Here's how to not become a headline.
Your Video Infrastructure Isn't Ready for What's Coming
Most companies building video calling right now are making the same three architecture mistakes. Here's what I keep seeing and how to fix it before your SFUs fall over.
Comparing Infrastructure Testing Approaches: What Actually Catches Bugs
I tested Terraform modules with unit checks, policy engines, and full integration runs side by side. Here's what each approach actually catches and what it misses.
Your Terraform Monolith Will Break. Here's How to Fix It Before It Does.
Lessons from splitting a 4000-resource Terraform state into something teams can actually work with -- state layout, module boundaries, and the workflow discipline nobody wants to do until they have to.
Kubernetes Ships Insecure by Default. Here's What to Do About It.
Kubernetes defaults optimize for fast adoption, not safety. A hardening checklist drawn from running clusters at the fintech startup, Dropbyke, and early Decloud work.
Your Cloud Bill Is Lying to You: A Cost Optimization Comparison
A direct comparison of cloud cost optimization strategies -- what actually moves the needle vs. what just makes finance feel better.
GitOps: Stop SSHing Into Production
How I moved three teams off ad-hoc kubectl deployments and onto Git-driven infrastructure -- with code examples, repo layouts, and the mistakes I made along the way.
The Boring Kubernetes Checklist That Actually Keeps Production Alive
Most Kubernetes outages come from skipping the basics. Here's the checklist I use after running clusters at the fintech startup and now at Decloud.
Istio: Powerful, Painful, and Probably More Than You Need
My honest take on evaluating Istio at the fintech startup — what it actually gives you, what it costs you, and why most teams should think twice before adopting it.
IaC Patterns That Actually Work
Opinionated Infrastructure as Code patterns from running Terraform at the fintech startup. Repo layout, modules, state management, and the stuff that burns you if you ignore it.
Kubernetes Operators: Powerful, but Overhyped
Operators are the hot thing in the Kubernetes world right now. They're genuinely useful — but the hype is outpacing the reality for most teams.
Zero Trust Is Not a Product. Here's How We Actually Built It.
Perimeter security is dead. At the fintech startup, I ripped out the castle-and-moat model and replaced it with zero trust — identity-first, micro-segmented, no implicit trust anywhere. Here's what that actually looked like.
Two Years of Kubernetes in Production — The Boring Parts Are the Hard Parts
Year two of running Kubernetes at the fintech startup. The panic is gone. Now it's networking, resource tuning, and all the operational grunt work nobody blogs about.
Spectre and Meltdown Broke My Weekend
Five days after the Spectre/Meltdown disclosure, a CTO's raw take on what happened, what we patched, and why this changes the game for anyone running shared infrastructure.
Your Containers Aren't Secure. Here's What to Actually Do About It.
Containers give you process isolation, not a security boundary. I break down how we hardened images, locked down runtimes, and segmented networks at the fintech startup — plus the stuff nobody warns you about.
Multi-Region Architecture: What I Wish Someone Had Told Me
We serve financial data to users across Europe at the fintech startup. Here's what I've learned about going multi-region -- the patterns that work, the ones that burn you, and when you should even bother.
Pitching Infrastructure to People Who Don't Care About Infrastructure
Your board doesn't care about Kubernetes. They care about money, risk, and speed. Here's how I learned to pitch infra investment at the fintech startup.
Your Cloud Bill Is Lying to You
That clean AWS pricing page has almost nothing to do with your actual invoice. I learned this the hard way at the fintech startup.
A Year Running Kubernetes in Production — What Actually Happened
After a year of running Kubernetes in production, the wins are real but the sharp edges drew blood first. Here's what paid off, what bit us, and what I'd do differently.
Log Aggregation at Scale: ELK vs Alternatives
ELK is powerful. It's also a second full-time job. Here's what I learned running it at Dropbyke, and what I'd consider instead.
The Real Cost of Running Your Own Servers in 2016
Most startups have no business running their own servers. The math is not close.
Ansible Won Because It's the Simplest
I used all three. Ansible required the least ceremony. That's the whole argument.
Docker in Production: What We Learned Running Containers at Dropbyke
Running Docker in production at Dropbyke forced us to get serious about image builds, container networking, log aggregation, and security. Here is what actually worked.