Your Terraform Monolith Will Break. Here's How to Fix It Before It Does.

| 6 min read |
terraform infrastructure devops cloud

Lessons from splitting a 4000-resource Terraform state into something teams can actually work with -- state layout, module boundaries, and the workflow discipline nobody wants to do until they have to.

Quick take

Stop dumping everything into one state file. Split by domain, pin everything, automate plan-and-apply, and treat -target like a fire extinguisher – not a daily tool.


I’ve managed Terraform across three companies now. At the fintech startup we started the way everyone starts: one state file, a handful of resources, terraform apply from someone’s laptop. It was fine. Then it wasn’t.

By the time I was deep into building Decloud during EF, I’d already learned this lesson the hard way. The problem is never Terraform itself. The problem is that infrastructure grows quietly. You add a few resources here, a new environment there, and one morning you run terraform plan and it takes eleven minutes and touches things you forgot existed. That’s the moment you realize your Terraform setup doesn’t scale. And by then, untangling it is painful.

This post is what I wish I’d known before the first time a plan returned 200+ changes because someone refactored a module without understanding what depended on it.

The breaking point

Small Terraform setups are a joy. You describe what you want, you apply it, it exists. Beautiful.

At scale, the problems are predictable:

  • Plans take forever. API rate limits turn a 30-second plan into a 10-minute gamble.
  • State becomes a bottleneck. Two engineers run apply at the same time. One wins, one gets a lock error and has to wait.
  • Blast radius grows silently. A change to a security group touches the same state as your database config. One bad merge and you’re having a very bad afternoon.
  • Drift accumulates. Resources nobody has looked at in months quietly diverge from what Terraform thinks they are.

None of these kill you immediately. They just make everything slower and scarier until one day someone is afraid to run apply at all.

Split state by domain, not by convenience

The single most impactful thing you can do is break your state apart. Not by team, not by project – by domain.

terraform/
|-- network/
|-- security/
|-- data/
|-- kubernetes/
`-- services/
    |-- api/
    `-- web/

Each directory gets its own state backend. Each one can be planned and applied independently. A change to your VPC config doesn’t require Terraform to also reconcile your Kubernetes cluster and all your application services.

For environments, use separate directories and backends. I know workspaces exist. I’ve used them. They work fine when it’s just you. They get confusing fast when three engineers are working across staging and production and nobody remembers which workspace they’re in. Separate directories are boring. Boring is good.
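In practice that means one more level of nesting. A sketch, with hypothetical environment names – the important part is that every leaf directory points at its own backend key:

```text
terraform/
|-- network/
|   |-- staging/      # backend key: network/staging/terraform.tfstate
|   `-- production/   # backend key: network/production/terraform.tfstate
`-- services/
    `-- api/
        |-- staging/
        `-- production/
```

The duplication between staging and production is deliberate: which environment you’re touching is always visible in your shell prompt, not hidden in workspace state.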

Remote state with locking. Always.

If you’re running Terraform at scale and your state is local, stop reading this and go fix that first.

terraform {
  backend "s3" {
    bucket         = "terraform-state"
    key            = "network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

DynamoDB for locking. Encryption on. Versioning on the S3 bucket so you can recover when (not if) someone corrupts state. This is infrastructure for your infrastructure. Treat it seriously.
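For completeness, here’s a sketch of provisioning that backend itself, using the bucket and table names from the config above and the inline versioning block that matches the AWS provider version pinned later in this post:

```hcl
resource "aws_s3_bucket" "state" {
  bucket = "terraform-state"

  versioning {
    enabled = true # lets you roll back a corrupted state file
  }
}

resource "aws_dynamodb_table" "locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # the attribute name the S3 backend expects

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Bootstrap these in their own tiny stack – the backend has to exist before anything else can use it.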

Cross-state references

When stacks need to talk to each other, use remote state data sources. Keep the outputs stable – changing an output name in your network stack shouldn’t break your services stack.

data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "terraform-state"
    key    = "network/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_subnet" "app" {
  vpc_id     = data.terraform_remote_state.network.outputs.vpc_id
  cidr_block = "10.0.20.0/24"
}

Think of outputs as your stack’s API. You wouldn’t change an API contract without versioning. Same principle.
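Concretely, the network stack’s outputs file is that contract. A sketch – the aws_vpc resource name here is an assumption:

```hcl
# network/outputs.tf – the contract other stacks depend on
output "vpc_id" {
  value       = aws_vpc.main.id
  description = "Consumed by the services and kubernetes stacks. Don't rename."
}
```

The description field is cheap documentation of who depends on what, and it shows up in terraform output.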

Modules: small, versioned, boring

I’ve seen Terraform modules that try to do everything. A “vpc” module that also creates subnets, NAT gateways, route tables, security groups, and somehow a Lambda function. Don’t do this.

Keep modules focused on one thing.

module "vpc" {
  source = "./modules/vpc"
  cidr   = "10.0.0.0/16"
}

module "public_subnet" {
  source = "./modules/subnet"
  vpc_id = module.vpc.vpc_id
  cidr   = "10.0.1.0/24"
}

Pin versions. Pin everything. Terraform version, provider versions, module versions.

terraform {
  required_version = "~> 0.12"
}

provider "aws" {
  version = "~> 2.40"
  region  = "us-east-1"
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "2.21.0"
}

I learned this one when an unpinned provider upgrade changed the default behavior of an AWS resource and our plan suddenly wanted to destroy and recreate a production database. Nobody was hurt. My heart rate was.

The workflow that actually works

At the fintech startup I watched us go from “just apply it” to a proper review flow, and the number of infrastructure incidents dropped to near zero. The process isn’t exciting:

  1. Branch for the change.
  2. terraform fmt and terraform validate locally.
  3. CI runs plan and posts the output to the PR.
  4. The team that owns that stack reviews the plan.
  5. Merge triggers apply. Only after approval.

We used Atlantis for this and it worked well:

# atlantis.yaml
version: 3
projects:
- name: network
  dir: terraform/network
  workflow: default

workflows:
  default:
    plan:
      steps:
      - init
      - plan
    apply:
      steps:
      - apply

The key insight is that the plan output is the thing being reviewed, not just the code diff. A four-line HCL change can produce a 200-line plan. The plan is what matters.

The stuff that will save you at 2 AM

Plan files. Always save the plan and apply from the saved file. What was reviewed should be exactly what gets applied.

terraform plan -out=plan.tfplan
terraform apply plan.tfplan

-target is for emergencies only. It’s tempting to use -target to apply just one resource when you’re in a hurry. Don’t make it a habit. It skips dependency resolution and creates drift. Use it when something is on fire, then follow up with a full plan to clean up.
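The emergency pattern, with a hypothetical resource address:

```shell
# something is on fire: patch just the one resource
terraform apply -target=aws_security_group.edge

# once the fire is out: run a full plan and reconcile whatever -target skipped
terraform plan
```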

Drift detection. Schedule terraform plan -detailed-exitcode to run regularly. Surface drift before it surprises you during an incident. Exit code 2 means there’s drift. Pipe that into an alert.

terraform plan -detailed-exitcode
# Exit 0: no changes
# Exit 1: error
# Exit 2: changes detected (drift)
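That exit code is easy to wire into a scheduled job. A minimal sketch of the branching – the alerting hook is left as a comment, and the function name is mine:

```shell
#!/bin/sh
# Sketch: interpret terraform's -detailed-exitcode result in a cron job.
# The real invocation would be:
#   terraform plan -detailed-exitcode >/dev/null 2>&1; classify_plan_exit $?

classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;  # no changes – nothing to do
    2) echo "drift" ;;  # changes detected – fire an alert here
    *) echo "error" ;;  # the plan itself failed – also worth an alert
  esac
}

classify_plan_exit 2   # prints "drift"
```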

Parallelism tuning. Default is 10. You can bump it to 20 for faster applies, but watch for API rate limit errors. More parallelism isn’t always faster.
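The flag goes on the apply (or plan) itself – here reusing the saved plan file from earlier:

```shell
terraform apply -parallelism=20 plan.tfplan
```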

Guard the gates

Treat Terraform like application code. Lint it. Format it. Run policy checks before it reaches production.

# .pre-commit-config.yaml
repos:
- repo: https://github.com/antonbabenko/pre-commit-terraform
  rev: v1.77.0 # example tag – pin the hooks too, same rule as everything else
  hooks:
  - id: terraform_fmt
  - id: terraform_validate
  - id: terraform_tflint

Pre-commit hooks catch the easy mistakes. For the hard ones – like someone trying to open port 22 to the world or creating an unencrypted S3 bucket – use Sentinel or OPA. Policy as code is one of those things that feels like overhead until the first time it stops a bad change from reaching production.
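For the OPA route, a minimal Rego rule evaluated against the JSON plan (terraform show -json plan.tfplan) might look like this. It’s a sketch – the package name and message format are arbitrary, and a real policy would also handle port ranges and IPv6:

```rego
package terraform.policies

# deny any security group rule that opens SSH to the world
deny[msg] {
  rc := input.resource_changes[_]
  rc.type == "aws_security_group_rule"
  rc.change.after.cidr_blocks[_] == "0.0.0.0/0"
  rc.change.after.to_port == 22
  msg := sprintf("%s opens port 22 to the world", [rc.address])
}
```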

Own your stacks

Define CODEOWNERS by directory. The network team reviews network changes. The platform team reviews Kubernetes changes. Nobody should be able to merge infrastructure changes without review from someone who understands the blast radius.
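The mapping is one small file at the repo root. Team handles here are examples:

```text
# CODEOWNERS
/terraform/network/     @org/network-team
/terraform/kubernetes/  @org/platform-team
/terraform/services/    @org/app-teams
```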

Keep documentation minimal and close to the code. A README in each stack directory with the inputs, outputs, and any non-obvious decisions. Not a wiki that nobody updates.


Terraform scales fine. The part that doesn’t scale is humans making ad hoc decisions about where things go and how changes flow. Get the state layout right early, keep modules small and versioned, and make the review process non-negotiable. Everything else follows from that discipline.