Devops

Definition

Devops coverage in this archive spans 49 posts from Feb 2016 to Nov 2022 and focuses on reliability, delivery speed, and cost discipline as one system, not three separate concerns. The strongest adjacent threads are infrastructure, kubernetes, and security. Recurring title motifs include kubernetes, production, platform, and scale.

Key claims

Most posts prioritize predictable operations over feature breadth or stack novelty.
Early posts lean on kubernetes and production, while newer posts lean on kubernetes and platform as constraints shifted.
This topic repeatedly intersects with infrastructure, kubernetes, and security, so design choices here rarely stand alone.

Practical checklist

Set SLOs first, then choose tooling that keeps deploy, observability, and rollback simple.
Start with the newest post to calibrate current constraints, then backtrack to older entries for first principles.
When boundary questions appear, cross-read infrastructure and kubernetes before committing implementation details.

Failure modes

Adding platform layers faster than the team can operate and debug them.
Chasing throughput gains without proving they improve end-user reliability.
Applying guidance from 2016 to 2022 without revisiting assumptions as context changed.

References

Infrastructure as Code Patterns That Actually Scale

Nov 2022

Practical Terraform patterns for teams that have outgrown the tutorial stage: module design, state management, environment promotion, and policy enforcement.

Platform Engineering: DevOps Grew Up

Nov 2022

Platform engineering is what happens when you realize 'you build it, you run it' does not scale past a handful of teams.

Monorepo vs. Polyrepo: A Practical Decision Guide

Oct 2022

Monorepo or polyrepo depends on coupling, team shape, and your appetite for build tooling. Here is how to decide without getting religious about it.

Kubernetes Requests and Limits: Lessons From Getting It Wrong

Sep 2022

CPU is compressible. Memory is not. That one sentence explains 80% of Kubernetes resource problems.

Hardening Kubernetes: The Stuff That Actually Matters

Feb 2022

Kubernetes defaults are built for getting things running, not for keeping attackers out. A layered hardening walkthrough covering pods, RBAC, network policies, secrets, and the control plane.

DORA Metrics: Stop Ruining a Good Idea

Jan 2022

DORA metrics are useful exactly until someone puts them on a performance review. Here's how to use them without destroying your engineering culture.

Terraform at Scale: What Changed Since 2019

Dec 2021

Two years ago I wrote about Terraform patterns for growing teams. Here's what held up, what broke, and what I do differently now.

Most Platform Teams Are Building the Wrong Thing

Nov 2021

After assessing platform maturity at a dozen enterprises, the pattern is clear: most platform teams build tools nobody asked for while developers wait in ticket queues.

Feature Flags at Scale: What Nobody Warns You About

Sep 2021

Feature flags are great until you have 847 of them and nobody knows which ones are safe to remove. Practical lessons from Decloud and enterprise teams.

Observability-Driven Development Is Just Instrumenting Your Code

Jun 2021

ODD sounds fancy. It's not. It means writing logs, metrics, and traces before you ship, not after your first outage.

DevSecOps in Practice: What I Actually Implement

Apr 2021

The concrete pipeline configs, policy-as-code patterns, and runtime controls I set up to bake security into delivery.

Most Teams Are Not Ready for MLOps

Mar 2021

MLOps is real, but most teams buying MLOps tooling cannot even version their training data. Fix the basics first.

Platform Engineering Is Just DevOps With a Rebrand

Oct 2020

The industry loves renaming things. Platform engineering is DevOps done properly — and most companies still won't do it right.

Observability for Small Distributed Teams (What Actually Works)

Sep 2020

Most observability advice is written for 500-engineer orgs. Here's what actually matters when you're a small distributed team trying not to drown in dashboards.

The GitHub Actions Patterns I Actually Use in Production

Jul 2020

Matrix builds, dependency caching, gated deploys, and the security gotchas I hit building Decloud's CI/CD pipeline on GitHub Actions.

Stop Guessing Your Kubernetes Resource Limits

Jun 2020

Most K8s clusters I audit are either wildly overprovisioned or one bad deploy away from eviction storms. Here's how I set requests, limits, and guardrails.

Comparing Infrastructure Testing Approaches: What Actually Catches Bugs

Feb 2020

I tested Terraform modules with unit checks, policy engines, and full integration runs side by side. Here's what each approach actually catches and what it misses.

My Kubernetes Predictions for 2020 (Most of Yours Are Wrong)

Jan 2020

The adoption debate is over. 2020 is about operating Kubernetes well -- managed control planes, GitOps by default, policy enforcement, and being honest about what's overhyped.

Zero Downtime Deploys Are a Team Habit, Not a Tool

Oct 2019

Every team says they want zero downtime. Few want to do the boring work that actually gets them there. Here's what that boring work looks like.

Your Terraform Monolith Will Break. Here's How to Fix It Before It Does.

Sep 2019

Lessons from splitting a 4000-resource Terraform state into something teams can actually work with -- state layout, module boundaries, and the workflow discipline nobody wants to do until they have to.

Internal Platforms vs. Ad-Hoc Tooling: Which Developer Experience Actually Wins

Aug 2019

A comparison of two approaches to developer experience -- purpose-built internal platforms versus the organic tooling that teams build for themselves -- and when each one actually delivers.

Your Incident Response Plan Is Useless Until Someone Bleeds

Jul 2019

Most incident response plans are shelf-ware. Here's what actually matters when your infrastructure is on fire -- drawn from real breaches, NATO cyber exercises, and startup chaos.

Your Internal Platform Is Probably a Liability

Mar 2019

Most internal developer platforms fail not because they're technically bad, but because nobody treated them like a product. Thoughts from building (and scrapping) platform tooling across three startups.

GitOps: Stop SSHing Into Production

Feb 2019

How I moved three teams off ad-hoc kubectl deployments and onto Git-driven infrastructure -- with code examples, repo layouts, and the mistakes I made along the way.

The Boring Kubernetes Checklist That Actually Keeps Production Alive

Jan 2019

Most Kubernetes outages come from skipping the basics. Here's the checklist I use after running clusters at the fintech startup and now at Decloud.

IaC Patterns That Actually Work

Oct 2018

Opinionated Infrastructure as Code patterns from running Terraform at the fintech startup. Repo layout, modules, state management, and the stuff that burns you if you ignore it.

Container Security in 2018: What Actually Changed

Aug 2018

Eight months after my first container security post, an update on what moved at the fintech startup and in the ecosystem — PodSecurityPolicy, image signing, and the shift from scratch to real.

Why Monitoring Wasn't Enough and How We Built Observability at a Fintech Startup

Jul 2018

After a mystery outage that our dashboards couldn't explain, I rebuilt the fintech startup's telemetry stack around metrics, logs, and traces. Here's what I learned.

SRE Principles Are Great. The Cargo-Culting Is Not.

Apr 2018

The SRE hype train has everyone copying Google's playbook without asking whether it fits. Here's what actually matters when you're not running planet-scale infrastructure.

Kubernetes Operators: Powerful, but Overhyped

Apr 2018

Operators are the hot thing in the Kubernetes world right now. They're genuinely useful — but the hype is outpacing the reality for most teams.

Two Years of Kubernetes in Production — The Boring Parts Are the Hard Parts

Jan 2018

Year two of running Kubernetes at the fintech startup. The panic is gone. Now it's networking, resource tuning, and all the operational grunt work nobody blogs about.

What I Learned Building Our Platform Team This Year

Dec 2017

Reflections on standing up the fintech startup's platform team in 2017 — what worked, what didn't, and why treating infra like a product changed everything.

Your Containers Aren't Secure. Here's What to Actually Do About It.

Dec 2017

Containers give you process isolation, not a security boundary. I break down how we hardened images, locked down runtimes, and segmented networks at the fintech startup — plus the stuff nobody warns you about.

Your Incident Process Will Break at 15 People. Here's What to Do.

Oct 2017

What I learned building incident management at the fintech startup — from five people shouting across a room to actual structured response.

You Don't Need to Be Netflix to Break Things on Purpose

Aug 2017

Chaos engineering isn't just for the big players. Here's how a small team can start breaking things deliberately and actually learn from it.

Stop Doing Security Reviews by Hand

Jul 2017

Your manual security gate is a bottleneck pretending to be a process. Here's how I moved security checks into the pipeline at the fintech startup so we could ship fast without shipping stupid.

Monitoring Is Not Enough

Mar 2017

Your dashboards look green. Your users say the site is broken. That gap is the whole problem.

A Year Running Kubernetes in Production — What Actually Happened

Jan 2017

After a year of running Kubernetes in production, the wins are real but the sharp edges drew blood first. Here's what paid off, what bit us, and what I'd do differently.

Why We Deleted 42 Grafana Panels

Dec 2016

Most teams monitor too much and alert on the wrong things. Five metrics are enough to run a startup backend.

Container Orchestration: Docker Swarm vs Kubernetes vs Mesos

Oct 2016

A side-by-side comparison of Swarm, Kubernetes, and Mesos based on running all three in evaluation at Dropbyke. Kubernetes is going to win, but the operational tax is real.

Building a Security-First Engineering Culture

Oct 2016

Security culture is not a training program or a tool purchase. It is a set of habits that leadership enforces through consistency, not speeches.

Log Aggregation at Scale: ELK vs Alternatives

Sep 2016

ELK is powerful. It's also a second full-time job. Here's what I learned running it at Dropbyke, and what I'd consider instead.

Database Migrations Without Downtime

Aug 2016

A practical guide to evolving schemas without maintenance windows by keeping old and new code compatible at every step.

Why I Moved Our Infrastructure to Terraform

Jun 2016

We moved from console-driven, script-heavy infrastructure to Terraform so changes are reviewed, reproducible, and recoverable from code.

Continuous Deployment Without the Chaos

Jun 2016

Continuous deployment is not a tooling problem. It is a discipline problem. We deploy the Dropbyke backend dozens of times a day because we built habits first and automation second.

Security Incident Response for Startups

May 2016

A practical incident response playbook for small teams: define incidents, assign owners, contain fast, investigate calmly, and recover with clear communication.

Ansible Won Because It's the Simplest

Apr 2016

I used all three. Ansible required the least ceremony. That's the whole argument.

Building a DevOps Culture from Scratch

Mar 2016

DevOps is a cultural shift, not a job title. This post lays out a practical, 2016-era path to shared responsibility, fast feedback, and resilient delivery without hand-wavy promises.

Docker in Production: What We Learned Running Containers at Dropbyke

Feb 2016

Running Docker in production at Dropbyke forced us to get serious about image builds, container networking, log aggregation, and security. Here is what actually worked.

Devops

Definition

Key claims

Practical checklist

Failure modes

Suggested reading path

Related posts

References

Infrastructure as Code Patterns That Actually Scale

Platform Engineering: DevOps Grew Up

Monorepo vs. Polyrepo: A Practical Decision Guide

Kubernetes Requests and Limits: Lessons From Getting It Wrong

Hardening Kubernetes: The Stuff That Actually Matters

DORA Metrics: Stop Ruining a Good Idea

Terraform at Scale: What Changed Since 2019

Most Platform Teams Are Building the Wrong Thing

Feature Flags at Scale: What Nobody Warns You About

Observability-Driven Development Is Just Instrumenting Your Code

DevSecOps in Practice: What I Actually Implement

Most Teams Are Not Ready for MLOps

Platform Engineering Is Just DevOps With a Rebrand

Observability for Small Distributed Teams (What Actually Works)

The GitHub Actions Patterns I Actually Use in Production

Stop Guessing Your Kubernetes Resource Limits

Comparing Infrastructure Testing Approaches: What Actually Catches Bugs

My Kubernetes Predictions for 2020 (Most of Yours Are Wrong)

Zero Downtime Deploys Are a Team Habit, Not a Tool

Your Terraform Monolith Will Break. Here's How to Fix It Before It Does.

Internal Platforms vs. Ad-Hoc Tooling: Which Developer Experience Actually Wins

Your Incident Response Plan Is Useless Until Someone Bleeds

Your Internal Platform Is Probably a Liability

GitOps: Stop SSHing Into Production

The Boring Kubernetes Checklist That Actually Keeps Production Alive

IaC Patterns That Actually Work

Container Security in 2018: What Actually Changed

Why Monitoring Wasn't Enough and How We Built Observability at a Fintech Startup

SRE Principles Are Great. The Cargo-Culting Is Not.

Kubernetes Operators: Powerful, but Overhyped

Two Years of Kubernetes in Production — The Boring Parts Are the Hard Parts

What I Learned Building Our Platform Team This Year

Your Containers Aren't Secure. Here's What to Actually Do About It.

Your Incident Process Will Break at 15 People. Here's What to Do.

You Don't Need to Be Netflix to Break Things on Purpose

Stop Doing Security Reviews by Hand

Monitoring Is Not Enough

A Year Running Kubernetes in Production — What Actually Happened

Why We Deleted 42 Grafana Panels

Container Orchestration: Docker Swarm vs Kubernetes vs Mesos

Building a Security-First Engineering Culture

Log Aggregation at Scale: ELK vs Alternatives

Database Migrations Without Downtime

Why I Moved Our Infrastructure to Terraform

Continuous Deployment Without the Chaos

Security Incident Response for Startups

Ansible Won Because It's the Simplest

Building a DevOps Culture from Scratch

Docker in Production: What We Learned Running Containers at Dropbyke