Quick take
Writing Terraform is easy. Keeping it maintainable across fifty services, four environments, and a dozen contributors is where things fall apart. Structure your modules tightly, treat state as a product feature, promote changes through environments instead of rebuilding, and enforce policy in CI. The rest is discipline.
Everyone can write a Terraform file. The problem starts around month six, when you have fifteen modules nobody remembers writing, state files that three people are afraid to touch, and a staging environment that has drifted so far from production that testing there is theater.
I’ve seen this pattern at multiple organizations. The IaC starts clean. Then it grows. Then it becomes the thing everyone is scared of. The patterns below are what I’ve seen work to prevent that decay.
Module Design: One Job Per Module
The single most important structural decision is module scope. A module should do one thing. Not “provision a service and its database and its DNS and its monitoring.” One thing.
Here is what a focused module looks like for an application service:
# modules/ecs-service/main.tf

variable "service_name" {
  type        = string
  description = "Name of the ECS service"
}

variable "container_image" {
  type        = string
  description = "Docker image URI"
}

variable "cluster_arn" {
  type        = string
  description = "ARN of the ECS cluster to deploy into"
}

variable "cpu" {
  type    = number
  default = 256
}

variable "memory" {
  type    = number
  default = 512
}

variable "desired_count" {
  type    = number
  default = 2
}

variable "subnet_ids" {
  type = list(string)
}

variable "security_group_ids" {
  type = list(string)
}

resource "aws_ecs_service" "this" {
  name            = var.service_name
  cluster         = var.cluster_arn
  task_definition = aws_ecs_task_definition.this.arn
  desired_count   = var.desired_count
  launch_type     = "FARGATE"

  network_configuration {
    subnets         = var.subnet_ids
    security_groups = var.security_group_ids
  }
}

resource "aws_ecs_task_definition" "this" {
  family                   = var.service_name
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = var.cpu
  memory                   = var.memory

  container_definitions = jsonencode([{
    name  = var.service_name
    image = var.container_image
    portMappings = [{
      containerPort = 8080
      protocol      = "tcp"
    }]
  }])
}
Clean inputs, clean outputs, one responsibility. The database, the DNS record, the monitoring – those are separate modules composed at a higher level. When the ECS module changes, you know exactly what’s affected.
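A matching outputs file makes the "clean outputs" half concrete. This is a sketch, not part of the module above; the output names are my own choices:

```hcl
# modules/ecs-service/outputs.tf (illustrative; output names are assumptions)

output "service_name" {
  description = "Name of the ECS service, for composing DNS or monitoring modules"
  value       = aws_ecs_service.this.name
}

output "task_definition_arn" {
  description = "ARN of the current task definition, useful for deployment tooling"
  value       = aws_ecs_task_definition.this.arn
}
```

The point is that consumers depend on a small, named interface, not on the module's internals.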
The moment a module becomes a grab bag, it stops being reusable. I’ve seen “service” modules with 40 input variables and 200 lines of conditional logic. Nobody wants to touch them. Nobody trusts them. They become legacy infrastructure inside your infrastructure code.
Repository Layout
Your repo structure should reflect ownership, not technology. A layout I’ve used successfully:
infrastructure/
  modules/
    ecs-service/
    rds-instance/
    cloudfront-distribution/
  environments/
    production/
      main.tf
      terraform.tfvars
    staging/
      main.tf
      terraform.tfvars
    dev/
      main.tf
      terraform.tfvars
Each environment directory composes the same modules with different variables. This makes differences between environments explicit and small instead of hidden and surprising.
State Management Isn’t a Detail
State management is where IaC projects actually fail. Not in the HCL. In the state.
Rules that have saved me from pain:
Remote state with locking. Always. No exceptions. Two people running terraform apply at the same time against the same state file is how you get a 3am incident.
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/api-service/terraform.tfstate"
    region         = "eu-west-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
Segment state by blast radius. Don’t put your entire production environment in one state file. When you run terraform plan and it wants to touch 200 resources, you aren’t going to review that carefully. Nobody does. Split by service or component so each plan is small and reviewable.
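Segmented state raises an obvious question: how does one component reach another's outputs? A `terraform_remote_state` data source keeps that dependency explicit instead of hardcoding IDs. The bucket, key, and output names below are placeholders for illustration:

```hcl
# Read the network component's outputs from its own state file.
# Bucket, key, and output names here are illustrative placeholders.
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "mycompany-terraform-state"
    key    = "production/network/terraform.tfstate"
    region = "eu-west-1"
  }
}

# Consume the other state's outputs rather than duplicating values.
module "api_service" {
  source = "../../modules/ecs-service"
  # ...
  subnet_ids = data.terraform_remote_state.network.outputs.private_subnet_ids
}
```

The cross-state read is also documentation: it tells a reviewer exactly which other component this one depends on.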
Encrypt everything. State files contain secrets whether you want them to or not. Database passwords, API keys, connection strings – they end up in state. Encryption at rest is the minimum.
At a real-time messaging company, we segmented state per service per environment. A change to the messaging service’s production infrastructure touched exactly that service’s state. Nothing else. This made reviews fast, plans small, and rollbacks contained.
Environment Promotion Over Rebuilds
The safest deployment model is promotion. A change that passed in dev should be the exact same change in staging, which should be the exact same change in production. Not “similar.” Identical.
# environments/production/main.tf
module "api_service" {
  source = "../../modules/ecs-service"

  service_name       = "api-service"
  container_image    = var.api_image # promoted from staging
  cluster_arn        = var.cluster_arn
  cpu                = 512
  memory             = 1024
  desired_count      = 4
  subnet_ids         = var.private_subnet_ids
  security_group_ids = [aws_security_group.api.id]
}
The module source is the same across environments. Only the variables change. When you promote, you update the image tag and any environment-specific sizing, and the rest stays locked.
Avoid long-lived branches per environment. I’ve seen teams with a staging branch and a production branch that diverged months ago. Nobody knows what the differences are. Nobody wants to find out. Use one branch, different variable files.
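With one branch and per-environment variable files, the entire staging/production difference should be readable as a diff of two small tfvars files. The values here are hypothetical:

```hcl
# environments/staging/terraform.tfvars (hypothetical values)
api_image     = "123456789012.dkr.ecr.eu-west-1.amazonaws.com/api:1.4.2"
cpu           = 256
memory        = 512
desired_count = 2

# environments/production/terraform.tfvars (hypothetical values)
api_image     = "123456789012.dkr.ecr.eu-west-1.amazonaws.com/api:1.4.2" # same tag, promoted
cpu           = 512
memory        = 1024
desired_count = 4
```

If the diff between these files ever grows beyond sizing and the image tag, that's the signal your environments are quietly forking.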
Testing: Layers, Not Afterthoughts
IaC testing is closer to software testing than most teams realize. Three layers that work:
Static analysis. Run terraform validate, tflint, and a policy tool like OPA or Sentinel in CI. These catch syntax errors, deprecated patterns, and policy violations before anyone runs a plan. Fast and cheap.
# policy/require_encryption.rego
package terraform

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_s3_bucket"
  not resource.change.after.server_side_encryption_configuration
  msg := sprintf("S3 bucket %s must have encryption enabled", [resource.address])
}
Plan review. Every change gets a terraform plan output attached to the PR. Reviewers look at the plan, not just the code. The code tells you what you intended. The plan tells you what will actually happen. These aren’t always the same thing.
Integration tests for critical modules. For modules that provision core infrastructure, use something like Terratest to actually create and destroy resources in a test account. This is slow and expensive, so reserve it for the modules where a mistake costs real money.
Policy as Code
Policy documents nobody reads aren’t policy. Automated checks that block bad changes are policy.
Encode your security baselines, encryption requirements, and network boundaries as code. Run them in CI before any apply. When a developer opens a PR that creates an unencrypted S3 bucket, the pipeline catches it, not a security audit six months later.
Managing Drift
Drift is inevitable. Someone will fix something in the console during an incident. An auto-scaling event will leave resources in a state Terraform didn’t expect. External automation will modify tags.
The answer isn’t “never touch the console.” The answer is detect drift early and reconcile it regularly. Run terraform plan on a schedule. Alert on unexpected changes. Treat drift like tech debt: small amounts are fine, accumulated drift is dangerous.
Secrets: Keep Them Out
Secrets don’t belong in .tfvars files, in version control, or in state files if you can avoid it. Use a secrets manager and reference values at runtime:
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "production/api-service/db-password"
}

resource "aws_db_instance" "main" {
  # ...
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}
Mark outputs as sensitive. Review plan output for accidental exposure. This isn’t paranoia – I’ve seen database credentials committed to state files in shared S3 buckets with overly broad access policies.
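Marking an output as sensitive redacts it from human-readable plan and apply output; it still lands in the state file, which is why the encryption-at-rest rule above matters. A minimal sketch, with an illustrative connection-string format:

```hcl
output "db_connection_string" {
  # Redacted in `terraform plan` / `terraform apply` output,
  # but still stored in plaintext in state.
  value     = "postgres://${aws_db_instance.main.username}@${aws_db_instance.main.address}/app"
  sensitive = true
}
```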
The Payoff
None of this is glamorous. Module boundaries, state segmentation, promotion pipelines, policy checks – it’s all plumbing. But it’s the plumbing that determines whether your IaC is an asset or a liability.
The teams I’ve seen manage infrastructure well at scale all share the same trait: they treat their Terraform code with the same rigor they treat their application code. Reviews, tests, clear ownership, and a healthy fear of clever abstractions. Boring infrastructure is reliable infrastructure.