Quick take
Writing Terraform is easy. Keeping it maintainable across fifty services, four environments, and a dozen contributors is where things fall apart. Structure your modules tightly, treat state as a product feature, promote changes through environments instead of rebuilding, and enforce policy in CI. The rest is discipline.
Everyone can write a Terraform file. The problem starts around month six, when you have fifteen modules nobody remembers writing, state files that three people are afraid to touch, and a staging environment that has drifted so far from production that testing there is theater.
I’ve seen this pattern at multiple organizations. The IaC starts clean. Then it grows. Then it becomes the thing everyone is scared of. The patterns below are what I’ve seen work to prevent that decay.
Module Design: One Job Per Module
The single most important structural decision is module scope. A module should do one thing. Not “provision a service and its database and its DNS and its monitoring.” One thing.
Here is what a focused module looks like for an application service:
# modules/ecs-service/main.tf

variable "service_name" {
  type        = string
  description = "Name of the ECS service"
}

variable "container_image" {
  type        = string
  description = "Docker image URI"
}

variable "cluster_arn" {
  type        = string
  description = "ARN of the ECS cluster to deploy into"
}

variable "cpu" {
  type    = number
  default = 256
}

variable "memory" {
  type    = number
  default = 512
}

variable "desired_count" {
  type    = number
  default = 2
}

variable "subnet_ids" {
  type = list(string)
}

variable "security_group_ids" {
  type = list(string)
}

resource "aws_ecs_service" "this" {
  name            = var.service_name
  cluster         = var.cluster_arn
  task_definition = aws_ecs_task_definition.this.arn
  desired_count   = var.desired_count
  launch_type     = "FARGATE"

  network_configuration {
    subnets         = var.subnet_ids
    security_groups = var.security_group_ids
  }
}

resource "aws_ecs_task_definition" "this" {
  family                   = var.service_name
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = var.cpu
  memory                   = var.memory

  container_definitions = jsonencode([{
    name  = var.service_name
    image = var.container_image
    portMappings = [{
      containerPort = 8080
      protocol      = "tcp"
    }]
  }])
}
Clean inputs, clean outputs, one responsibility. The database, the DNS record, the monitoring – those are separate modules composed at a higher level. When the ECS module changes, you know exactly what’s affected.
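A matching outputs file makes the "clean outputs" half concrete. This is a sketch, not part of the module above; the output names are my own choices:

```hcl
# modules/ecs-service/outputs.tf (illustrative; output names are assumptions)

output "service_name" {
  description = "Name of the ECS service, for composing DNS or monitoring modules"
  value       = aws_ecs_service.this.name
}

output "task_definition_arn" {
  description = "ARN of the current task definition, useful for deployment tooling"
  value       = aws_ecs_task_definition.this.arn
}
```

The point is that consumers depend on a small, named interface, not on the module's internals.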
The moment a module becomes a grab bag, it stops being reusable. I’ve seen “service” modules with 40 input variables and 200 lines of conditional logic. Nobody wants to touch them. Nobody trusts them. They become legacy infrastructure inside your infrastructure code.
Repository Layout
Your repo structure should reflect ownership, not technology. A layout I’ve used successfully:
infrastructure/
  modules/
    ecs-service/
    rds-instance/
    cloudfront-distribution/
  environments/
    production/
      main.tf
      terraform.tfvars
    staging/
      main.tf
      terraform.tfvars
    dev/
      main.tf
      terraform.tfvars
Each environment directory composes the same modules with different variables. This makes differences between environments explicit and small instead of hidden and surprising.
State Management Isn’t a Detail
State management is where IaC projects actually fail. Not in the HCL. In the state.
Rules that have saved me from pain:
Remote state with locking. Always. No exceptions. Two people running terraform apply at the same time against the same state file is how you get a 3am incident.
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/api-service/terraform.tfstate"
    region         = "eu-west-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
Segment state by blast radius. Don’t put your entire production environment in one state file. When you run terraform plan and it wants to touch 200 resources, you aren’t going to review that carefully. Nobody does. Split by service or component so each plan is small and reviewable.
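Segmented state raises an obvious question: how does one component reach another's outputs? A `terraform_remote_state` data source keeps that dependency explicit instead of hardcoding IDs. The bucket, key, and output names below are placeholders for illustration:

```hcl
# Read the network component's outputs from its own state file.
# Bucket, key, and output names here are illustrative placeholders.
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "mycompany-terraform-state"
    key    = "production/network/terraform.tfstate"
    region = "eu-west-1"
  }
}

# Consume the other state's outputs rather than duplicating values.
module "api_service" {
  source = "../../modules/ecs-service"
  # ...
  subnet_ids = data.terraform_remote_state.network.outputs.private_subnet_ids
}
```

The cross-state read is also documentation: it tells a reviewer exactly which other component this one depends on.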
Encrypt everything. State files contain secrets whether you want them to or not. Database passwords, API keys, connection strings – they end up in state. Encryption at rest is the minimum.
At a real-time messaging company, we segmented state per service per environment. A change to the messaging service’s production infrastructure touched exactly that service’s state. Nothing else. This made reviews fast, plans small, and rollbacks contained.
Environment Promotion Over Rebuilds
The safest deployment model is promotion. A change that passed in dev should be the exact same change in staging, which should be the exact same change in production. Not “similar.” Identical.
# environments/production/main.tf
module "api_service" {
  source = "../../modules/ecs-service"

  service_name       = "api-service"
  container_image    = var.api_image # promoted from staging
  cluster_arn        = var.cluster_arn
  cpu                = 512
  memory             = 1024
  desired_count      = 4
  subnet_ids         = var.private_subnet_ids
  security_group_ids = [aws_security_group.api.id]
}
The module source is the same across environments. Only the variables change. When you promote, you update the image tag and any environment-specific sizing, and the rest stays locked.
Avoid long-lived branches per environment. I’ve seen teams with a staging branch and a production branch that diverged months ago. Nobody knows what the differences are. Nobody wants to find out. Use one branch, different variable files.
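With one branch and per-environment variable files, the entire staging/production difference should be readable as a diff of two small tfvars files. The values here are hypothetical:

```hcl
# environments/staging/terraform.tfvars (hypothetical values)
api_image     = "123456789012.dkr.ecr.eu-west-1.amazonaws.com/api:1.4.2"
cpu           = 256
memory        = 512
desired_count = 2

# environments/production/terraform.tfvars (hypothetical values)
api_image     = "123456789012.dkr.ecr.eu-west-1.amazonaws.com/api:1.4.2" # same tag, promoted
cpu           = 512
memory        = 1024
desired_count = 4
```

If the diff between these files ever grows beyond sizing and the image tag, that's the signal your environments are quietly forking.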
Testing: Layers, Not Afterthoughts
IaC testing is closer to software testing than most teams realize. Three layers that work:
Static analysis. Run terraform validate, tflint, and a policy tool like OPA or Sentinel in CI. These catch syntax errors, deprecated patterns, and policy violations before anyone runs a plan. Fast and cheap.
# policy/require_encryption.rego
package terraform

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_s3_bucket"
  not resource.change.after.server_side_encryption_configuration
  msg := sprintf("S3 bucket %s must have encryption enabled", [resource.address])
}
Plan review. Every change gets a terraform plan output attached to the PR. Reviewers look at the plan, not just the code. The code tells you what you intended. The plan tells you what will actually happen. These aren’t always the same thing.
Integration tests for critical modules. For modules that provision core infrastructure, use something like Terratest to actually create and destroy resources in a test account. This is slow and expensive, so reserve it for the modules where a mistake costs real money.
Policy as Code
Policy documents nobody reads aren’t policy. Automated checks that block bad changes are policy.
Encode your security baselines, encryption requirements, and network boundaries as code. Run them in CI before any apply. When a developer opens a PR that creates an unencrypted S3 bucket, the pipeline catches it, not a security audit six months later.
Managing Drift
Drift is inevitable. Someone will fix something in the console during an incident. An auto-scaling event will leave resources in a state Terraform didn’t expect. External automation will modify tags.
The answer isn’t “never touch the console.” The answer is detect drift early and reconcile it regularly. Run terraform plan on a schedule. Alert on unexpected changes. Treat drift like tech debt: small amounts are fine, accumulated drift is dangerous.
Secrets: Keep Them Out
Secrets don’t belong in .tfvars files, in version control, or in state files if you can avoid it. Use a secrets manager and reference values at runtime:
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "production/api-service/db-password"
}

resource "aws_db_instance" "main" {
  # ...
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}
Mark outputs as sensitive. Review plan output for accidental exposure. This isn’t paranoia – I’ve seen database credentials committed to state files in shared S3 buckets with overly broad access policies.
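Marking an output as sensitive redacts it from human-readable plan and apply output; it still lands in the state file, which is why the encryption-at-rest rule above matters. A minimal sketch, with an illustrative connection-string format:

```hcl
output "db_connection_string" {
  # Redacted in `terraform plan` / `terraform apply` output,
  # but still stored in plaintext in state.
  value     = "postgres://${aws_db_instance.main.username}@${aws_db_instance.main.address}/app"
  sensitive = true
}
```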
The Payoff
None of this is glamorous. Module boundaries, state segmentation, promotion pipelines, policy checks – it’s all plumbing. But it’s the plumbing that determines whether your IaC is an asset or a liability.
The teams I’ve seen manage infrastructure well at scale all share the same trait: they treat their Terraform code with the same rigor they treat their application code. Reviews, tests, clear ownership, and a healthy fear of clever abstractions. Boring infrastructure is reliable infrastructure.