Let me be straight with you — I’ve broken more production VPCs with Terraform than I care to admit. Last month alone, we had three incidents. One took down our entire staging environment for 2 hours because someone forgot to check the terraform plan output.
Here’s what I actually learned from those failures.
Default CIDR Blocks Are a Trap
I see so many people use 10.0.0.0/16 for everything. Then they try to peer two VPCs and, surprise, the CIDRs overlap, and AWS won’t peer VPCs with overlapping ranges. We spent 4 hours one Friday night fixing that mess.
Fix: Use environment-specific CIDR blocks. We use 10.${var.env_id}.0.0/16, where env_id is 10 for dev, 20 for staging, and 30 for prod. Simple, predictable, and our own environments never overlap each other.
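Here is a minimal sketch of one way to wire that scheme up so nobody types the octet by hand. The names `var.environment`, `local.env_ids`, and `local.vpc_cidr` are my own placeholders, not something the scheme requires.

```hcl
# Hypothetical input: the environment name, restricted to the three we use.
variable "environment" {
  type        = string
  description = "Deployment environment name"

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "The environment must be one of dev, staging, or prod."
  }
}

locals {
  # Second octet per environment, matching dev=10, staging=20, prod=30.
  env_ids = {
    dev     = 10
    staging = 20
    prod    = 30
  }

  # For example, "10.20.0.0/16" when var.environment is "staging".
  vpc_cidr = "10.${local.env_ids[var.environment]}.0.0/16"
}
```

Keeping the mapping in one place means a typo in the environment name fails at plan time instead of producing a surprise CIDR.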
Don’t Over-Modularize
Reddit’s r/devops had a thread that nailed it: “Modules should earn their keep.” A VPC with three subnets doesn’t need 12 nested modules.
# Keep it simple until complexity justifies itself
resource "aws_vpc" "main" {
  cidr_block           = "10.${var.env_id}.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
}
I once inherited a codebase with 47 modules for a single VPC. It took 3 minutes just to run terraform init. Don’t be that person.
Subnet Sizing: Don’t Be Cheap
Someone on Hacker News mentioned their /24 subnet ran out of IPs during a scaling event. I felt that pain.
| Subnet Type | Recommended CIDR | Usable IPs | Use Case |
|---|---|---|---|
| Public | /24 | 251 | ALB, NAT Gateway, Bastion |
| Private App | /20 | 4091 | ECS, EKS worker nodes |
| Private Data | /22 | 1019 | RDS, ElastiCache, Aurora |
Why /22 for data subnets? Multi-AZ RDS with read replicas eats IPs fast. We learned this when our Aurora cluster couldn’t scale because the /24 was full.
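The sizes in the table can be carved out of the VPC block with Terraform's built-in `cidrsubnet()` function rather than hand-written CIDRs. This is a sketch under assumptions: three AZs, the staging block as the default, and local names I made up for illustration.

```hcl
variable "vpc_cidr" {
  type    = string
  default = "10.20.0.0/16" # staging in the 10.<env_id>.0.0/16 scheme
}

locals {
  az_count = 3 # assumption: one subnet of each type per AZ

  # /16 + 4 bits = /20: 10.20.0.0/20, 10.20.16.0/20, 10.20.32.0/20
  private_app_subnet_cidrs = [
    for i in range(local.az_count) : cidrsubnet(var.vpc_cidr, 4, i)
  ]

  # /16 + 6 bits = /22: 10.20.64.0/22, 10.20.68.0/22, 10.20.72.0/22
  private_data_subnet_cidrs = [
    for i in range(local.az_count) : cidrsubnet(var.vpc_cidr, 6, 16 + i)
  ]

  # /16 + 8 bits = /24: 10.20.128.0/24, 10.20.129.0/24, 10.20.130.0/24
  public_subnet_cidrs = [
    for i in range(local.az_count) : cidrsubnet(var.vpc_cidr, 8, 128 + i)
  ]
}
```

The offsets (16 and 128) are chosen so the three tiers sit in separate parts of the /16 and leave gaps for growth. Any non-overlapping layout works.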
Explicit Route Tables or Go Home
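aws_main_route_table_association is a trap. Any subnet without an explicit association silently follows the VPC’s main route table. Someone on our team accidentally changed the main route table association, and all private subnets got internet access. In production.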
# Always explicit. Always.
resource "aws_route_table_association" "private_app" {
  count          = length(var.private_app_subnet_cidrs)
  subnet_id      = aws_subnet.private_app[count.index].id
  route_table_id = aws_route_table.private.id
}
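A complementary guard, sketched here as a suggestion rather than something we described above, is to adopt the VPC's default (main) route table and keep it empty. A subnet that ever loses its explicit association then falls back to a table with only the local route.

```hcl
# Adopt the VPC's default route table so Terraform manages its contents.
resource "aws_default_route_table" "main" {
  default_route_table_id = aws_vpc.main.default_route_table_id

  # An explicit empty list removes every route except the implicit local one.
  route = []

  tags = {
    Name = "main-do-not-use" # hypothetical name
  }
}
```

With this in place, a forgotten association means no connectivity instead of accidental internet access. That failure is loud and safe.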
Security Groups vs NACLs: The Real Answer
Security groups are stateful. NACLs are stateless. Use SGs 99% of the time. NACLs only when you need explicit IP denials at the subnet level.
We got DDoSed once. Security groups only have allow rules, so there was no way to deny the source IP with them. A NACL deny rule saved us. But for day-to-day? Debugging stateless rules is a nightmare. Stick with SGs.
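For the rare case where a subnet-level deny is needed, a sketch could look like the following. The NACL itself, `aws_subnet.public`, and `var.blocked_cidr` are hypothetical names. The allow-all rules are there so security groups remain the primary control.

```hcl
variable "blocked_cidr" {
  type        = string
  description = "Source range to deny at the subnet boundary (hypothetical input)"
}

resource "aws_network_acl" "public" {
  vpc_id     = aws_vpc.main.id
  subnet_ids = aws_subnet.public[*].id # assumes public subnets defined with count
}

# Rules are evaluated in ascending order, so the deny must come first.
resource "aws_network_acl_rule" "deny_source" {
  network_acl_id = aws_network_acl.public.id
  rule_number    = 50
  egress         = false
  protocol       = "-1"
  rule_action    = "deny"
  cidr_block     = var.blocked_cidr
}

# Custom NACLs deny everything by default, so allow the rest explicitly.
resource "aws_network_acl_rule" "allow_inbound" {
  network_acl_id = aws_network_acl.public.id
  rule_number    = 100
  egress         = false
  protocol       = "-1"
  rule_action    = "allow"
  cidr_block     = "0.0.0.0/0"
}

resource "aws_network_acl_rule" "allow_outbound" {
  network_acl_id = aws_network_acl.public.id
  rule_number    = 100
  egress         = true
  protocol       = "-1"
  rule_action    = "allow"
  cidr_block     = "0.0.0.0/0"
}
```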
VPC Flow Logs Are Not Optional
I saw a Reddit thread asking “Why are my packets being dropped?” with no Flow Logs enabled. That’s like driving without a dashboard.
resource "aws_flow_log" "main" {
  iam_role_arn    = aws_iam_role.flow_log.arn
  log_destination = aws_cloudwatch_log_group.flow_log.arn
  traffic_type    = "ALL"
  vpc_id          = aws_vpc.main.id
}
Use traffic_type = "ALL". We once found a cross-VPC access issue by looking at REJECT logs. Would’ve taken days without them.
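The flow log resource above references a log group and an IAM role it doesn't define. A sketch of those supporting pieces might look like this. The log group name, retention period, and role names are assumptions, and the wildcard resource follows the permissive style of AWS's documented example policy, so you may want to tighten it.

```hcl
resource "aws_cloudwatch_log_group" "flow_log" {
  name              = "/vpc/flow-logs/main" # hypothetical name
  retention_in_days = 30                    # assumption: adjust to your retention policy
}

# Trust policy: let the VPC Flow Logs service assume the role.
data "aws_iam_policy_document" "flow_log_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["vpc-flow-logs.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "flow_log" {
  name               = "vpc-flow-log" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.flow_log_assume.json
}

# Permissions the service needs to write into CloudWatch Logs.
data "aws_iam_policy_document" "flow_log_write" {
  statement {
    actions = [
      "logs:CreateLogGroup",
      "logs:CreateLogStream",
      "logs:PutLogEvents",
      "logs:DescribeLogGroups",
      "logs:DescribeLogStreams",
    ]
    resources = ["*"]
  }
}

resource "aws_iam_role_policy" "flow_log" {
  name   = "vpc-flow-log-write"
  role   = aws_iam_role.flow_log.id
  policy = data.aws_iam_policy_document.flow_log_write.json
}
```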
NAT Gateways: Expensive but Worth It
A single NAT Gateway costs ~$32/month plus per-GB data processing charges. We tried NAT instances to save money. The ops overhead was triple. One instance died, private subnets lost internet, EKS couldn’t pull images, everything went down.
Rule: Deploy NAT gateways in at least two AZs for production. The cost of downtime is way higher than the NAT gateway bill.
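A sketch of the one-gateway-per-AZ layout, assuming `aws_subnet.public` is defined with `count` and holds one public subnet per AZ. The per-AZ route tables are named `private_az` here to keep them distinct from a single shared private route table. With this layout, each private subnet would be associated with the table for its own AZ.

```hcl
# One Elastic IP per NAT gateway.
resource "aws_eip" "nat" {
  count  = length(aws_subnet.public)
  domain = "vpc" # AWS provider v5+; older provider versions use `vpc = true`
}

# One NAT gateway in each public subnet (one per AZ).
resource "aws_nat_gateway" "per_az" {
  count         = length(aws_subnet.public)
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
}

# One private route table per AZ so traffic stays inside its own AZ.
resource "aws_route_table" "private_az" {
  count  = length(aws_subnet.public)
  vpc_id = aws_vpc.main.id
}

resource "aws_route" "private_default" {
  count                  = length(aws_subnet.public)
  route_table_id         = aws_route_table.private_az[count.index].id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.per_az[count.index].id
}
```

If one AZ's gateway fails, only that AZ's private subnets lose egress, and you avoid cross-AZ data charges on the happy path.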
Remote State or Don’t Bother
“Remote state or go home” isn’t just a meme. We use S3 with DynamoDB locking.
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "vpc/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}
Without DynamoDB locking, two people can apply simultaneously and corrupt the state file. We learned this the hard way — one full day recovering a corrupted state.
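The backend assumes the bucket and lock table already exist. A sketch of bootstrapping them, typically in a separate one-off configuration, could look like this. Enabling bucket versioning is my addition, on the assumption that you want to be able to roll back a bad state file. The `LockID` string hash key is what the S3 backend expects for its lock table.

```hcl
resource "aws_s3_bucket" "tf_state" {
  bucket = "mycompany-terraform-state"
}

# Versioning keeps earlier state files recoverable (assumed to be desired).
resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Lock table: the S3 backend requires a string partition key named "LockID".
resource "aws_dynamodb_table" "tf_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```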
Environments: Skip Workspaces
Terraform workspaces seem great until you accidentally apply prod config to dev. The state files are too easy to mix up.
Better approach: Directory isolation.
terraform/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   └── prod/
└── modules/
    ├── vpc/
    └── security-groups/
Five extra minutes of setup saves five hours of debugging environment confusion.
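To make the layout concrete, here is a sketch of what environments/dev/main.tf might contain. The per-environment state key and the module's `env_id` input are assumptions about how the shared module is written.

```hcl
# environments/dev/main.tf
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "dev/vpc/terraform.tfstate" # assumption: one state key per environment
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}

module "vpc" {
  source = "../../modules/vpc"

  env_id = 10 # dev; hypothetical module input
}
```

Each environment directory gets its own backend key, so running apply from the wrong directory can't touch another environment's state.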
Tags: Not Optional
AWS cost reports can group spend by tags once they’re activated as cost allocation tags. Without them, you can’t track costs per team or project.
Our mandatory tags:
Environment: dev/staging/prod
Team: platform/data/ml
CostCenter: xyz-123
Owner: user@company.com
ManagedBy: terraform
Real story: We had a $2,000 surprise bill from an orphaned NAT Gateway. If we’d had an Owner tag, we’d have found the responsible person in 5 minutes instead of 3 hours.
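One way to make those tags hard to forget is the AWS provider's `default_tags` block, which applies tags to every taggable resource the provider creates. The values below are the placeholder examples from the list above, not real ones.

```hcl
provider "aws" {
  region = "us-east-1"

  # Applied to every taggable resource managed through this provider.
  default_tags {
    tags = {
      Environment = "dev"
      Team        = "platform"
      CostCenter  = "xyz-123"
      Owner       = "user@company.com"
      ManagedBy   = "terraform"
    }
  }
}
```

Resource-level `tags` still work and override the defaults for matching keys.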
FAQ
Q: VPC Peering or Transit Gateway? A: With 5 or fewer VPCs, use peering. With more, use Transit Gateway. Peering doesn’t support transitive routing, so a full mesh of n VPCs needs n(n-1)/2 connections. That’s the biggest gotcha.
Q: Should each environment have its own VPC? A: Absolutely. Shared VPCs are too risky — a dev mistake can affect prod. AWS RAM sharing exists, but for production, full isolation is safer.
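Q: How to manage Terraform versions? A: Lock with required_version, for example >= 1.0, < 2.0. Never float on latest. A Terraform 1.5 upgrade broke our provider config, and a range that wide would not have stopped it, so tighten the constraint to a single minor series (such as ~> 1.5.0) if minor releases bite you too.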
Q: How to plan VPC CIDR? A: Leave room to grow. We used three /16 blocks for a medium project and regret not going bigger. Use RFC 1918 space — 10.0.0.0/8 will last your career.
Q: Private subnet access to S3? A: VPC Endpoint (Gateway type). It’s free and doesn’t go through the internet. Using NAT Gateway for S3 is just burning money.
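Two of the answers above translate directly into Terraform. The sketch below shows a version constraint and an S3 gateway endpoint. The provider version range is an assumption, and the endpoint reuses the `aws_vpc.main` and `aws_route_table.private` names from earlier snippets.

```hcl
terraform {
  # The range from the answer above; "~> 1.5.0" would restrict to 1.5.x only.
  required_version = ">= 1.0, < 2.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # assumption: pick the major version you have tested
    }
  }
}

# Gateway endpoint: adds S3 routes to the listed route tables, no NAT involved.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}
```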
Best Practices Summary
| Practice | Recommended | Avoid |
|---|---|---|
| State Storage | S3 + DynamoDB | Local files |
| Environment Isolation | Directory isolation | Workspaces |
| Subnet Size | /20 for app tiers, /22 for data | /24 for anything that scales |
| NAT Strategy | NAT Gateway in each AZ | NAT Instance |
| Traffic Logging | Flow Logs | None |
| Route Association | Explicit resources | Main route table |
| Tagging | Mandatory tags | No tags |
One last thing: Terraform won’t think for you. Every terraform apply is a potential incident. We now require PR reviews with terraform plan output attached. Our incident rate dropped from 30% to under 5%.
Don’t let your VPC be the next post-mortem headline.