Quick take
Two years into scaling Terraform across enterprise teams, the core advice holds: split state by ownership, version your modules, run everything in CI. What changed: workspaces are still a trap, Terraform Cloud is worth considering, OPA/Conftest policy enforcement is now table stakes, and your module registry is your most important internal product. If you’re still running terraform apply from your laptop, we need to talk.
I wrote about Terraform patterns back in 2019 when I was mostly dealing with teams of 5-15 engineers sharing a handful of state files. Since then I’ve helped scale Terraform at organizations with hundreds of engineers, dozens of AWS accounts, and thousands of managed resources.
Some of what I wrote in 2019 held up. Some of it was naive. Here is the updated version.
What Held Up
Split state by ownership and blast radius. This is still the single most important Terraform pattern. One state file per component, scoped by team ownership and failure domain. Networking is separate from application infrastructure. Shared data stores are separate from service-specific resources.
The reason hasn’t changed: a terraform plan that touches resources owned by three different teams is a plan nobody wants to review and nobody wants to approve. Small states, clear ownership, independent deploys.
Remote state with locking is non-negotiable. S3 + DynamoDB for AWS. Still the standard baseline in 2021. Encrypt state at rest, restrict access to the minimum set of IAM roles, and never store state locally.
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "prod/networking/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
Explicit cross-component dependencies. Publish outputs. Consume them through terraform_remote_state data sources or (better) through SSM parameters or a service registry. Don’t assume. Don’t hardcode. If many stacks need the same value, treat it as a versioned contract.
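As a sketch of the SSM approach (parameter names and stack layout are illustrative, not a prescribed convention):

```hcl
# In the networking stack: publish the VPC ID as a versioned contract
# instead of forcing consumers to read your state file.
resource "aws_ssm_parameter" "vpc_id" {
  name  = "/infra/prod/networking/vpc_id"
  type  = "String"
  value = aws_vpc.main.id
}

# In a consuming stack: read the published value.
data "aws_ssm_parameter" "vpc_id" {
  name = "/infra/prod/networking/vpc_id"
}

# Reference it as data.aws_ssm_parameter.vpc_id.value
```

The advantage over terraform_remote_state is that consumers never need read access to another team's state, which keeps state ACLs tight.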
What I Do Differently Now
Modules Are a Product, Not a Library
In 2019 I treated modules as code reuse. Wrong framing. Modules at scale are a contract between the platform team and everyone else. The interface matters more than the implementation.
A good module in 2021:
module "service" {
  source  = "company/ecs-service/aws"
  version = "~> 2.1"

  name        = "orders-api"
  environment = "production"
  # everything else has sane defaults
}
Narrow input surface. Strong defaults. Documented outputs. Pinned versions. If a product team has to read the module source to use it, the module has failed.
I now recommend running an internal module registry. Terraform Cloud has one built in. If you’re self-hosting, a simple Git-based registry with tagged releases works. The point is discoverability and versioning. Without it, module consumption is copy-paste and you lose the ability to push improvements to consumers.
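A minimal sketch of the Git-based approach (repository path and tag are illustrative):

```hcl
# Consuming a module pinned to a tagged release in a shared modules repo.
# The ?ref= tag is the "version"; upgrades are an explicit diff.
module "ecs_service" {
  source = "git::https://github.com/company/terraform-modules.git//ecs-service?ref=v2.1.0"

  name        = "orders-api"
  environment = "production"
}
```

Note that Git sources only support exact refs, not version constraints like `~> 2.1` – one reason a real registry is worth the investment as the module count grows.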
Workspaces Are Still a Trap
I was cautious about workspaces in 2019. I’m more forceful now: don’t use them for environment separation. Directories per environment are uglier but dramatically safer.
With directories, each environment has its own backend config, its own variable files, and its own state. You can review a production change without worrying about accidentally picking the wrong workspace. You can apply to staging without touching production. The blast radius is obvious from the file path.
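A layout along these lines makes the separation visible from the file path alone (structure illustrative):

```
environments/
  staging/
    backend.tf        # points at the staging state key
    main.tf
    terraform.tfvars
  production/
    backend.tf        # points at the production state key
    main.tf
    terraform.tfvars
```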
Workspaces hide this separation behind a CLI flag. I’ve seen production outages caused by terraform workspace select picking the wrong environment. Twice at different companies. Both times the engineer was experienced and careful. The tool made a catastrophic mistake easy.
Use workspaces for ephemeral environments (PR preview environments, load test clusters) where the lifecycle is short and the blast radius is low.
CI Is the Only Operator
In 2019 I said Terraform should run in CI. In 2021 I say nobody should have terraform apply credentials on their laptop. Period.
The workflow:
- Developer opens a PR with infrastructure changes
- CI runs terraform plan and posts the output as a PR comment
- A reviewer reads the plan (not just the code diff, the actual plan output)
- Merge to main triggers terraform apply
- Short-lived credentials via OIDC federation; no long-lived AWS keys anywhere
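The OIDC federation piece can itself live in Terraform. A sketch assuming GitHub Actions as the CI system (repo name, role name, and the certificate thumbprint are placeholders to verify for your setup):

```hcl
# Trust GitHub's OIDC issuer so CI can assume a role without stored keys.
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"] # verify current value
}

data "aws_iam_policy_document" "ci_assume" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    # Only the main branch of the infrastructure repo can assume this role.
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:company/infrastructure:ref:refs/heads/main"]
    }
  }
}

resource "aws_iam_role" "terraform_ci" {
  name               = "terraform-ci"
  assume_role_policy = data.aws_iam_policy_document.ci_assume.json
}
```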
This gives you an audit trail, consistent environments, peer review of actual changes, and no “I ran apply from my laptop and I’m not sure what happened” incidents.
Policy Enforcement Is Table Stakes
In 2019, policy enforcement felt optional. In 2021, after watching too many “someone opened 0.0.0.0/0 on a production security group” incidents, it’s mandatory.
OPA with Conftest is my go-to for self-hosted setups. Terraform Cloud has Sentinel. Either works. The goal is the same: block known-bad patterns before they reach production.
Policies I enforce everywhere:
- No public ingress on security groups without explicit approval
- All resources must have required tags (team, environment, service)
- No inline IAM policies – use managed policies
- S3 buckets must have encryption and versioning enabled
- RDS instances must not be publicly accessible
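Conftest policies are written in Rego. A minimal sketch of the first guardrail, assuming the plan is exported with terraform show -json and checked with conftest test (the exact plan JSON shape depends on your provider versions):

```rego
package main

# Deny security group rules that open ingress to the world.
# Assumes: terraform show -json tfplan > tfplan.json && conftest test tfplan.json
deny[msg] {
  rc := input.resource_changes[_]
  rc.type == "aws_security_group_rule"
  rc.change.after.type == "ingress"
  rc.change.after.cidr_blocks[_] == "0.0.0.0/0"
  msg := sprintf("%s allows public ingress from 0.0.0.0/0", [rc.address])
}
```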
These aren’t business logic. They’re guardrails that prevent the easy mistakes.
Secrets Management Got Better
State files contain sensitive data. This was true in 2019 and it’s still true in 2021. The difference is that the tooling around it improved.
Use a secrets manager (AWS Secrets Manager, HashiCorp Vault) and pass sensitive values at runtime. Mark outputs as sensitive so they don’t leak in plan output or CI logs. Never put credentials, API keys, or certificates in .tfvars files.
output "database_password" {
  value     = aws_db_instance.main.password
  sensitive = true
}
Drift Detection Isn’t Optional
Terraform assumes it’s the only thing changing your infrastructure. It isn’t. Console clicks, auto-scaling events, AWS service updates – drift happens.
Run terraform plan on a schedule (daily for production, weekly for staging) and alert on unexpected changes. This catches two things: manual changes that should have been codified, and resource changes that indicate a problem.
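A scheduled CI job can use plan's exit codes directly; a sketch (the alerting hook is a placeholder):

```shell
# Nightly drift check. With -detailed-exitcode, terraform plan exits
# 0 when there are no changes, 2 when drift exists, 1 on error.
terraform plan -detailed-exitcode -input=false -lock=false
status=$?
if [ "$status" -eq 2 ]; then
  echo "drift detected"   # send to your alerting channel here
elif [ "$status" -ne 0 ]; then
  echo "plan failed" >&2
  exit 1
fi
```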
The Operational Routine
Terraform isn’t deploy-and-forget. It needs maintenance.
- Weekly: Review plan output for drift in production stacks
- Monthly: Update provider versions (pin them, but don’t let them go stale for months)
- Quarterly: Audit state access, review module usage, clean up orphaned resources
- Every PR: Review the plan output, not just the HCL diff
What I Would Tell 2019 Me
Invest in the module registry earlier. It’s the highest-leverage thing you can build for Terraform at scale. Every team that consumes a well-designed module is a team that isn’t reinventing infrastructure patterns.
Stop fighting workspaces. Just use directories. The extra files are worth the safety.
Get apply off laptops on day one, not “when we have time.” The incident that happens before you migrate to CI will be more expensive than the migration itself.
Terraform scales fine. The syntax isn’t the hard part. Coordination, ownership, and safety are the hard parts. Get those right and the HCL takes care of itself.