Why I Moved Our Infrastructure to Terraform

6 min read
terraform infrastructure-as-code devops cloud

We moved from console-driven, script-heavy infrastructure to Terraform so changes are reviewed, reproducible, and recoverable from code.

Quick Take

After a botched production recovery at Dropbyke exposed how little we actually knew about our own infrastructure, I moved everything to Terraform. It wasn’t fast and it wasn’t glamorous, but now our AWS environments live in version control and I sleep better.

The Night That Changed My Mind

Three months into running Dropbyke’s infrastructure, I got paged at 1 AM because an EC2 instance hosting our bike-availability API had died. No big deal, I thought. Spin up a new one. Except nobody could tell me the exact instance type, the security group rules, or the IAM role attached to it. The wiki said one thing. The console said another. The CloudFormation stack that supposedly managed it had drifted so far from reality that applying it would have torn down half the VPC.

We recovered in about four hours. Four hours for one instance. That was the moment I decided our infrastructure couldn’t live in people’s heads anymore.

What We Were Running Before

Dropbyke’s AWS setup was a patchwork. Some resources lived in a CloudFormation stack written by a contractor who had since left. Other pieces were created by hand in the console during late-night debugging sessions and never documented. A handful of bash scripts in a repo called infra-scripts handled things like spinning up staging environments, but they were brittle and nobody trusted them enough to run without reading every line first.

CloudFormation itself wasn’t the problem exactly. The problem was that we treated it as a one-time setup tool instead of a living definition. We would write a template, deploy it, then immediately start making manual changes on top. Within weeks the template was fiction.

Why Terraform Won

I evaluated three options: doubling down on CloudFormation, adopting Ansible for infrastructure provisioning, or trying Terraform. Terraform was still young in mid-2016, somewhere around version 0.6, but it had two properties that mattered to me.

First, the declarative model. You describe what you want, not the steps to get there. After years of imperative scripts where ordering and idempotency were my problem, this was a relief. I could define a VPC, subnets, security groups, and instances in HCL and let Terraform figure out the dependency graph.

Second, the plan step. Before Terraform touches anything, it shows you exactly what it intends to do. Create this, modify that, destroy the other. At Dropbyke we were a small team and we couldn’t afford surprises. The plan became our pre-flight checklist.
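In practice the pre-flight looks something like this (standard Terraform CLI flags; the plan filename is just a placeholder):

```shell
# Render the execution plan and save it to a file for review
terraform plan -out=bike-api.tfplan

# After review, apply exactly the plan that was approved --
# nothing that changed in the meantime can sneak in
terraform apply bike-api.tfplan
```

Saving the plan to a file and applying that file, rather than running a bare apply, guarantees the reviewed plan and the applied plan are the same thing.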

resource "aws_instance" "bike_api" {
  ami           = "ami-0a2e1043"
  instance_type = "t2.small"
  subnet_id     = aws_subnet.private.id

  vpc_security_group_ids = [
    aws_security_group.api.id,
  ]

  iam_instance_profile = aws_iam_instance_profile.bike_api.name

  tags = {
    Name        = "bike-api-${var.environment}"
    Service     = "bike-availability"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

That snippet is close to our actual bike-availability API definition (shown here in current HCL syntax; in 2016 every reference was wrapped in "${...}" interpolation). Nothing clever. Readable. Reviewable. If someone asks how production is configured, the answer is in Git.

The Migration

I didn’t attempt a big bang. That would have been reckless with a two-person engineering team and live users on the platform.

Week one: all new infrastructure goes into Terraform. No exceptions. If you need a new security group, you write HCL.

Weeks two through four: I started importing existing resources. terraform import is tedious. You import a resource, write the corresponding HCL, run plan, and iterate until plan shows no changes. For our setup that meant importing roughly forty resources across the VPC, subnets, route tables, security groups, instances, RDS, and an Elastic Load Balancer.

terraform import aws_security_group.api sg-4a8b1c2f
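The loop for each resource looked roughly like this: write a minimal stub so Terraform knows what the imported state entry maps to, then keep filling in attributes until plan comes back empty. The names below are illustrative, not our exact config:

```hcl
# Minimal stub paired with the import above. Attributes get copied over
# from the console iteratively until `terraform plan` shows no changes.
resource "aws_security_group" "api" {
  name   = "bike-api-sg"    # placeholder; taken from the live resource
  vpc_id = aws_vpc.main.id  # assumes a VPC resource named "main" exists in the config
}
```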

I did this environment by environment. Development first, where mistakes were cheap. Then staging. Production came last, after I had already made every dumb mistake in a place where it didn’t matter. By the time I imported the production VPC, the process was mechanical.

Modules helped keep things sane. I built a small VPC module and a service module. Every environment became a composition of those two building blocks with different variables. The upfront effort paid back immediately in consistency. Staging actually looked like production for the first time.
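A sketch of what an environment looked like as a composition of those two modules, in today's HCL syntax. Module paths, variable names, and outputs here are illustrative, not our exact layout:

```hcl
# One environment = one VPC module plus one service module per service.
module "vpc" {
  source      = "../modules/vpc"
  environment = "staging"
  cidr_block  = "10.1.0.0/16"
}

module "bike_api" {
  source        = "../modules/service"
  environment   = "staging"
  subnet_id     = module.vpc.private_subnet_id  # assumed module output
  instance_type = "t2.small"
}
```

Because staging and production were the same modules with different variables, a config drift between them became a diff you could read instead of a scavenger hunt.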

State: The Sharp Edge

Terraform’s state file is where it tracks the mapping between your HCL and real AWS resources. It’s powerful and it’s dangerous. Early on I kept state in a local file and committed it to Git. That worked fine for a solo operator but fell apart as soon as a second engineer touched the repo.

We moved state to an S3 backend with DynamoDB locking within the first month. Locking matters. Without it, two concurrent applies can corrupt state and leave you in a worse position than having no IaC at all. The state file also contains secrets in plain text, things like RDS passwords and API keys. Encryption at rest on the S3 bucket and tight IAM policies aren’t optional. We treat state like production credentials because that’s what it contains.
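For reference, here is what that setup looks like in today's syntax. The backend block itself landed in Terraform 0.9; in 2016 the equivalent was the terraform remote config command, but the shape was the same. Bucket and table names are placeholders:

```hcl
terraform {
  backend "s3" {
    bucket         = "dropbyke-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true                      # server-side encryption at rest
    dynamodb_table = "terraform-state-locks"   # one lock acquired per plan/apply
  }
}
```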

What Changed Operationally

The biggest shift was cultural, not technical. Before Terraform, infrastructure changes happened whenever someone had console access and a reason. After Terraform, every change went through a pull request. Run terraform plan, paste the output into the PR, get a review, merge, apply. Simple process. Hard to skip.

We caught real problems this way. A plan once showed that changing a subnet CIDR would destroy and recreate every instance in the subnet. That would have been a fifteen-minute outage discovered at 2 AM instead of during a calm code review on a Tuesday afternoon.

Recovery got faster too. When that same bike-availability API instance died again two months later, rebuilding it was a terraform apply away. Five minutes instead of four hours. That alone justified the migration.
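The rebuild itself was nothing exotic; with the dead instance still in state, it was roughly this (resource address illustrative):

```shell
# Mark the dead instance for recreation, then let Terraform
# rebuild it from the same definition it was created from
terraform taint aws_instance.bike_api
terraform apply
```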

Honest Trade-offs

Terraform isn’t free. The learning curve is real, especially around state management and provider quirks. HCL is readable but limited. Complex logic feels awkward compared to a real programming language. And in mid-2016, the provider ecosystem still had gaps. We hit bugs. We read source code. We worked around things.

But the alternative was worse. The alternative was guessing at 1 AM.

Where We Stand

Every piece of Dropbyke’s AWS infrastructure is now defined in Terraform and stored in version control. Environments are reproducible. Changes are reviewed. Recovery is predictable. The migration took about six weeks of focused work alongside normal feature development.

If your infrastructure still lives in consoles and half-remembered scripts, make the move. Not because Terraform is perfect, but because the cost of not knowing what you’re running is higher than you think. You will find out exactly how high at 1 AM on a Tuesday.