Ops Notes

10 Terraform AWS VPC Best Practices I Learned the Hard Way


Let me be straight with you — I’ve broken more production VPCs with Terraform than I care to admit. Last month alone, we had three incidents. One took down our entire staging environment for 2 hours because someone forgot to check the terraform plan output.

Here’s what I actually learned from those failures.

Default CIDR Blocks Are a Trap

I see so many people use 10.0.0.0/16 for everything. Then they try to peer two VPCs and — surprise — IP overlap. We spent 4 hours one Friday night fixing that mess.

Fix: Use environment-specific CIDR blocks. We use 10.${env_id}.0.0/16, where env_id is 10 for dev, 20 for staging, and 30 for prod. It’s simple and predictable, and the three ranges never overlap.
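Here is a minimal sketch of how that convention could be enforced in code rather than by memory. The variable name env_id matches the snippet in the next section; the validation rule and its message are my own assumptions, not something Terraform requires.

```hcl
# Sketch only: restricts env_id to the three values used in this post.
variable "env_id" {
  type        = number
  description = "Second octet of the VPC CIDR: 10 = dev, 20 = staging, 30 = prod"

  validation {
    condition     = contains([10, 20, 30], var.env_id)
    error_message = "The env_id value must be 10 (dev), 20 (staging), or 30 (prod)."
  }
}
```

With this in place, a typo such as env_id = 11 fails at plan time instead of quietly creating a fourth address range.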

Don’t Over-Modularize

Reddit’s r/devops had a thread that nailed it: “Modules should earn their keep.” A VPC with three subnets doesn’t need 12 nested modules.

# Keep it simple until complexity justifies itself
resource "aws_vpc" "main" {
  cidr_block           = "10.${var.env_id}.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
}

I once inherited a codebase with 47 modules for a single VPC. It took 3 minutes just to run terraform init. Don’t be that person.

Subnet Sizing: Don’t Be Cheap

Someone on Hacker News mentioned their /24 subnet ran out of IPs during a scaling event. I felt that pain.

Subnet Type  | Recommended CIDR | Usable IPs | Use Case
Public       | /24              | 251        | ALB, NAT Gateway, Bastion
Private App  | /20              | 4091       | ECS, EKS worker nodes
Private Data | /22              | 1019       | RDS, ElastiCache, Aurora

Why /22 for data subnets? Multi-AZ RDS with read replicas eats IPs fast. We learned this when our Aurora cluster couldn’t scale because the /24 was full.
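The sizes in the table can be carved out of a /16 with Terraform's built-in cidrsubnet() function. This is a sketch under assumptions: the dev range 10.10.0.0/16, three AZs, and the netnum offsets (12 and 64) are values I picked so the ranges don't overlap, not a layout from our real setup.

```hcl
locals {
  vpc_cidr = "10.10.0.0/16" # assumed dev range
  az_count = 3              # assumed number of AZs

  # /16 + 4 new bits = /20: 10.10.0.0/20, 10.10.16.0/20, 10.10.32.0/20
  private_app_cidrs = [for i in range(local.az_count) : cidrsubnet(local.vpc_cidr, 4, i)]

  # /16 + 6 new bits = /22, starting after the app ranges:
  # 10.10.48.0/22, 10.10.52.0/22, 10.10.56.0/22
  private_data_cidrs = [for i in range(local.az_count) : cidrsubnet(local.vpc_cidr, 6, 12 + i)]

  # /16 + 8 new bits = /24: 10.10.64.0/24, 10.10.65.0/24, 10.10.66.0/24
  public_cidrs = [for i in range(local.az_count) : cidrsubnet(local.vpc_cidr, 8, 64 + i)]
}
```

The usable counts in the table are the block size minus the 5 addresses AWS reserves in every subnet, so a /22 gives 1024 - 5 = 1019.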

Explicit Route Tables or Go Home

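aws_main_route_table_association is a trap. Any subnet without an explicit association silently falls back to the VPC’s main route table. Someone on our team accidentally changed the main route table association, and all private subnets got internet access. In production.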

# Always explicit. Always.
resource "aws_route_table_association" "private_app" {
  count          = length(var.private_app_subnet_cidrs)
  subnet_id      = aws_subnet.private_app[count.index].id
  route_table_id = aws_route_table.private.id
}

Security Groups vs NACLs: The Real Answer

Security groups are stateful. NACLs are stateless. Use SGs 99% of the time. NACLs only when you need explicit IP denials at the subnet level.

We got DDoSed once. Security groups only support allow rules, so we had no way to block the source IPs with them. NACLs saved us. But for day-to-day? Debugging stateless rules is a nightmare. Stick with SGs.
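For the rare case where a subnet-level deny is needed, a rule could look roughly like this. It assumes a network ACL named aws_network_acl.public already exists in your config; that name, the rule number, and the address (from the documentation range 203.0.113.0/24) are placeholders.

```hcl
# Sketch only: blocks all inbound traffic from one source address.
resource "aws_network_acl_rule" "block_attacker" {
  network_acl_id = aws_network_acl.public.id # assumed to exist elsewhere
  rule_number    = 50                        # lower numbers are evaluated first
  egress         = false                     # inbound rule
  protocol       = "-1"                      # all protocols
  rule_action    = "deny"
  cidr_block     = "203.0.113.7/32" # placeholder address
}
```

Because NACL rules are evaluated in ascending order, the deny needs a lower rule number than whatever allow rule would otherwise match.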

VPC Flow Logs Are Not Optional

I saw a Reddit thread asking “Why are my packets being dropped?” with no Flow Logs enabled. That’s like driving without a dashboard.

resource "aws_flow_log" "main" {
  iam_role_arn    = aws_iam_role.flow_log.arn
  log_destination = aws_cloudwatch_log_group.flow_log.arn
  traffic_type    = "ALL"
  vpc_id          = aws_vpc.main.id
}

Use traffic_type = "ALL". We once found a cross-VPC access issue by looking at REJECT logs. Would’ve taken days without them.
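The flow log resource above references a log group and an IAM role without showing them. A minimal sketch of those supporting resources might look like this; the names, the 30-day retention, and the wildcard Resource are assumptions you would tighten for your own account.

```hcl
resource "aws_cloudwatch_log_group" "flow_log" {
  name              = "/vpc/flow-logs/main" # hypothetical name
  retention_in_days = 30                    # assumed retention
}

# Role that the VPC Flow Logs service assumes to write log events.
resource "aws_iam_role" "flow_log" {
  name = "vpc-flow-log" # hypothetical name

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "vpc-flow-logs.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "flow_log" {
  name = "vpc-flow-log-write" # hypothetical name
  role = aws_iam_role.flow_log.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams",
      ]
      Resource = "*" # broad on purpose for the sketch; scope it down in real use
    }]
  })
}
```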

NAT Gateways: Expensive but Worth It

A single NAT Gateway costs about $32/month in hourly charges, plus per-GB data processing. We tried NAT instances to save money. The ops overhead was triple. One instance died, private subnets lost internet, EKS couldn’t pull images, everything went down.

Rule: Deploy NAT gateways in at least two AZs for production. The cost of downtime is way higher than the NAT gateway bill.
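One way to sketch the per-AZ layout: one Elastic IP, one NAT gateway, and one private route table per public subnet. This assumes aws_subnet.public is created with count from a var.public_subnet_cidrs list (one entry per AZ); both names, and the private_per_az route table name, are hypothetical.

```hcl
resource "aws_eip" "nat" {
  count  = length(var.public_subnet_cidrs)
  domain = "vpc" # AWS provider v5+; older versions used vpc = true
}

# One NAT gateway in each public subnet, so each AZ has its own egress path.
resource "aws_nat_gateway" "main" {
  count         = length(var.public_subnet_cidrs)
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
}

# One private route table per AZ, each pointing at the NAT gateway in that AZ.
resource "aws_route_table" "private_per_az" {
  count  = length(var.public_subnet_cidrs)
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main[count.index].id
  }
}
```

Each private subnet would then be associated with the route table for its own AZ, so losing one NAT gateway only affects that AZ.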

Remote State or Don’t Bother

“Remote state or go home” isn’t just a meme. We use S3 with DynamoDB locking.

terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "vpc/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}

Without DynamoDB locking, two people can apply simultaneously and corrupt the state file. We learned this the hard way — one full day recovering a corrupted state.
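The backend block assumes the lock table and bucket already exist. A rough sketch of creating them in a separate bootstrap configuration follows; the resource names and on-demand billing are my assumptions, while the LockID string hash key is what the S3 backend expects from a DynamoDB lock table.

```hcl
resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST" # assumed; lock traffic is tiny
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}

# Versioning keeps older copies of the state file around for recovery.
resource "aws_s3_bucket_versioning" "state" {
  bucket = "mycompany-terraform-state"

  versioning_configuration {
    status = "Enabled"
  }
}
```

Bucket versioning isn't required for locking, but it is the kind of safety net that would have shortened our day of state recovery.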

Environments: Skip Workspaces

Terraform workspaces seem great until you accidentally apply prod config to dev. The state files are too easy to mix up.

Better approach: Directory isolation.

terraform/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   └── prod/
└── modules/
    ├── vpc/
    └── security-groups/

Five extra minutes of setup saves five hours of debugging environment confusion.
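As a sketch, environments/dev/main.tf in that layout could be little more than a module call. The input names env_id and environment are hypothetical; they depend on what your modules/vpc actually declares.

```hcl
# terraform/environments/dev/main.tf (sketch)
module "vpc" {
  source = "../../modules/vpc"

  env_id      = 10    # dev, per the CIDR convention above
  environment = "dev" # hypothetical module input
}
```

Each environment directory gets its own backend key and its own terraform init, so running apply in dev cannot touch prod state.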

Tags: Not Optional

AWS Cost Explorer can group spend by tag once the tags are activated as cost allocation tags in the Billing console. Without them, you can’t track costs per team or project.

Our mandatory tags:

Environment: dev/staging/prod
Team: platform/data/ml
CostCenter: xyz-123
Owner: user@company.com
ManagedBy: terraform

Real story: We had a $2,000 surprise bill from an orphaned NAT Gateway. If we’d had an Owner tag, we’d have found the responsible person in 5 minutes instead of 3 hours.
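One way to apply the mandatory tags everywhere without repeating them is the AWS provider's default_tags block. The values below are the placeholders from the list above, and the region is an assumption.

```hcl
provider "aws" {
  region = "us-east-1" # assumed region

  # Applied to every taggable resource this provider creates.
  default_tags {
    tags = {
      Environment = "dev"
      Team        = "platform"
      CostCenter  = "xyz-123"
      Owner       = "user@company.com"
      ManagedBy   = "terraform"
    }
  }
}
```

Tags set on an individual resource still win over the defaults when the keys collide, so per-resource overrides keep working.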

FAQ

Q: VPC Peering or Transit Gateway? A: For up to about 5 VPCs, use peering. Beyond that, use Transit Gateway. Peering doesn’t support transitive routing, and that’s the biggest gotcha.

Q: Should each environment have its own VPC? A: Absolutely. Shared VPCs are too risky — a dev mistake can affect prod. AWS RAM sharing exists, but for production, full isolation is safer.

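Q: How to manage Terraform versions? A: Lock with required_version. A range like >= 1.0, < 2.0 blocks major upgrades; a pessimistic constraint like ~> 1.4.0 also holds the minor version until you choose to move. Never float on latest. An upgrade to Terraform 1.5 broke our provider config.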

Q: How to plan VPC CIDR? A: Leave room to grow. We used three /16 blocks for a medium project and regret not going bigger. Use RFC 1918 space — 10.0.0.0/8 will last your career.

Q: Private subnet access to S3? A: VPC Endpoint (Gateway type). It’s free and doesn’t go through the internet. Using NAT Gateway for S3 is just burning money.
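A gateway endpoint for S3 is only a few lines. This sketch assumes the us-east-1 region and reuses the aws_vpc.main and aws_route_table.private names from the earlier snippets.

```hcl
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.s3" # region is an assumption
  vpc_endpoint_type = "Gateway"

  # Adds an S3 prefix-list route to each listed route table.
  route_table_ids = [aws_route_table.private.id]
}
```

Once the route is in place, S3 traffic from the private subnets bypasses the NAT gateway and its data processing charges.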

Best Practices Summary

Practice                | Recommended         | Avoid
State Storage           | S3 + DynamoDB       | Local files
Environment Isolation   | Directory isolation | Workspaces
Private App Subnet Size | /20 or larger       | /24 or smaller
NAT Strategy            | NAT Gateway         | NAT Instance
Traffic Logging         | Flow Logs           | None
Route Association       | Explicit resources  | Main route table
Tagging                 | Mandatory tags      | No tags

One last thing: Terraform won’t think for you. Every terraform apply is a potential incident. We now require PR reviews with terraform plan output attached. Our incident rate dropped from 30% to under 5%.

Don’t let your VPC be the next post-mortem headline.