Updated for 2026 Architectures

Modern Guide to Learning Terraform & OpenTofu

In August 2023, the IaC landscape shifted when HashiCorp relicensed Terraform from the MPL 2.0 to the Business Source License (BSL), prompting a community fork that the Linux Foundation adopted as OpenTofu. This guide bridges the gap between writing HCL and architecting production-ready, AI-enabled enterprise environments.


1. The Modern Toolchain (Beyond the CLI)

Writing IaC in a modern environment involves much more than just the terraform or tofu binaries. A robust ecosystem is required to ensure security, quality, and predictability before code ever reaches the cloud.

Beginner Mistake: Global Installs

Installing via brew install terraform or apt-get is dangerous. Different projects require different versions. If your CLI version is newer than the remote state file's version, you can accidentally upgrade the state, locking out teammates on older versions.

Best Practice: Use `tenv`

tenv (the modern successor to tfenv) manages versions for Terraform, OpenTofu, and Terragrunt side by side. Place a .terraform-version or .opentofu-version file in your repository root, and tenv will download and use the exact matching binary.

Terminal Setup

# Install tenv, then let it manage your binaries:
tenv tofu install latest
tenv tofu use latest

# Or specific versions for legacy projects:
tenv tf install 1.5.7
tenv tf use 1.5.7
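
As a complement to the version file, the configuration itself can refuse to run under an unexpected CLI version. A minimal sketch, assuming a project pinned to the 1.8.x line (the version number is only an illustration, not a recommendation):

```hcl
terraform {
  # Reject any CLI outside the 1.8.x patch series (illustrative pin).
  required_version = "~> 1.8.0"
}
```

With this in place, a teammate on a different minor version gets an error at init or plan time instead of silently upgrading the state.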

The Essential Ecosystem Tools

1. tflint: Catches provider-specific errors (e.g., an invalid EC2 instance type) that a plain syntax check does not detect.
2. Trivy: Static analysis security scanner. Flags misconfigurations such as public S3 buckets or unencrypted databases before they are deployed.
3. infracost: Queries cloud pricing data and posts a PR comment estimating how much a change will alter your monthly bill.
4. pre-commit-terraform: A collection of hooks for the pre-commit framework that runs fmt, tflint, and docs generation before you commit.
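
tflint reads its own HCL configuration file. The following is a sketch of a .tflint.hcl that enables the AWS ruleset; the plugin version shown is a placeholder assumption, so pin whichever release your team has vetted:

```hcl
# .tflint.hcl
plugin "aws" {
  enabled = true
  version = "0.30.0" # placeholder version, replace with a release you have tested
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}
```

Running tflint --init downloads the plugin declared here before the first lint run.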

2. GitOps & DevOps Integration

Running tofu apply from a developer's laptop is an absolute anti-pattern. Modern DevOps relies on GitOps: Git is the single source of truth, and infrastructure is updated via automated pipelines.

Push Model (Traditional CI/CD): GitHub Actions or GitLab CI runs plan on PR creation, and runs apply upon merge to the main branch.

Gotcha: Plan Asymmetry. If you run a plan, wait two days for approval, merge, and then run apply, the infrastructure may have changed in the meantime, so a fresh plan at apply time could differ from the one that was reviewed. Fix: Write the plan to an artifact (-out=tfplan) and have the CD pipeline apply that exact file. If the state has changed since the plan was created, apply rejects the saved plan as stale rather than doing something unexpected.

Pull Model (True GitOps): Tools like FluxCD (with the TF Controller) or ArgoCD run inside your Kubernetes cluster and continuously monitor your Git repository. When the repo changes, the cluster pulls the changes and applies the code from the inside out, so no external CI runner needs network access to your environment.

Beginner Mistake: Pasting AWS IAM Access Keys into GitHub Actions secrets. These keys never expire and compromise your account if leaked.

Best Practice: Use OIDC (OpenID Connect). GitHub Actions securely identifies itself to the cloud provider, which issues temporary, short-lived (e.g., 1 hour) STS tokens. There are no static credentials to rotate.
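
The AWS side of this trust relationship can itself be managed as code. The sketch below assumes a GitHub repository named example-org/infra and a role named ci-terraform-apply; both are hypothetical placeholders, and the role still needs permission policies attached before it can do anything:

```hcl
resource "aws_iam_openid_connect_provider" "github" {
  url            = "https://token.actions.githubusercontent.com"
  client_id_list = ["sts.amazonaws.com"]
  # Assumption: a recent AWS provider version. Older versions also require thumbprint_list.
}

data "aws_iam_policy_document" "github_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Only workflows on the main branch of the placeholder repository may assume the role.
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:example-org/infra:ref:refs/heads/main"]
    }
  }
}

resource "aws_iam_role" "ci_apply" {
  name                 = "ci-terraform-apply" # hypothetical role name
  assume_role_policy   = data.aws_iam_policy_document.github_trust.json
  max_session_duration = 3600 # one hour, in seconds
}
```

Scoping the sub condition to a single repository and branch keeps other repositories in the same GitHub organization from assuming the role.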

3. Architecture & Blast Radius

The "God State" Anti-Pattern

Putting your entire company's infrastructure into one main.tf and one state file is a recipe for disaster. A single typo could destroy your production database while updating a dev Lambda. Furthermore, every plan has to refresh each resource in the state against the cloud APIs and walk the full dependency graph (DAG), so plan times grow with the size of the state and can reach 15 minutes or more.

Best Practice: Directory-Separated Layers

Do not use CLI Workspaces for environment separation. They hide which environment you are in. Use explicit directories to split your state by Environment and Layer.

.
├── modules/
│   ├── network/         # Reusable VPC code
│   └── eks_cluster/     # Reusable K8s code
└── environments/
    ├── dev/
    │   ├── 01-network/  # Changes rarely (State A)
    │   ├── 02-data/     # Databases (State B)
    │   └── 03-app/      # App compute (State C)
    └── prod/
        ├── 01-network/  
        ├── 02-data/     
        └── 03-app/
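
Each leaf directory in this layout carries its own backend configuration, which is what gives it a separate state file. A sketch for the prod network layer, assuming an S3 backend; the bucket name and region are placeholders:

```hcl
# environments/prod/01-network/backend.tf
terraform {
  backend "s3" {
    bucket  = "example-company-tfstate"           # placeholder bucket
    key     = "prod/01-network/terraform.tfstate" # one key per environment and layer
    region  = "us-east-1"
    encrypt = true
  }
}
```

The other layers would differ only in the key, for example prod/02-data/terraform.tfstate.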

Nuance: Decoupling Layers. Use data sources to pass the VPC ID from Layer 1 to Layer 3. Relying on terraform_remote_state tightly couples state files, meaning a corruption in the network state file could break your app deployment pipeline.
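
One way to sketch this decoupling is to look the VPC up by tag from the app layer. The tag names and values below are assumptions about what the network layer applies:

```hcl
# environments/prod/03-app/network_lookup.tf
data "aws_vpc" "main" {
  tags = {
    Name = "prod-main" # hypothetical tag set by the network layer
  }
}

data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.main.id]
  }

  tags = {
    Tier = "private" # hypothetical tag
  }
}

output "app_subnet_ids" {
  value = data.aws_subnets.private.ids
}
```

The app layer now depends on what actually exists in the cloud account rather than on the contents of another state file.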

4. Secrets Management Best Practices

Terraform is inherently dangerous with secrets because, historically, everything passed through it was stored in the .tfstate file in plain text. If an attacker gains read access to your S3 bucket, they own your database passwords.

1. Generating Dynamic Secrets

Never hardcode passwords. Let the tool generate the secret and push it to a vault (like AWS Secrets Manager), so it never appears in your source code or variable files.

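resource "random_password" "db" {
  length  = 16
  special = true
}

# Container for the secret; the name is an illustrative placeholder
resource "aws_secretsmanager_secret" "db" {
  name = "prod/db/password"
}

# Stores the generated password as the secret's current version
resource "aws_secretsmanager_secret_version" "db" {
  secret_id     = aws_secretsmanager_secret.db.id
  secret_string = random_password.db.result
}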

2. OpenTofu Exclusive: State Encryption v1.7+

Even with dynamic secrets, random_password still saves the result in the state file. OpenTofu 1.7+ introduced native, client-side state encryption to fix this fundamental flaw, encrypting the JSON before it is transmitted over the network.

terraform {
  encryption {
    # The AWS KMS key provider type is named "aws_kms"
    key_provider "aws_kms" "my_kms" {
      kms_key_id = "arn:aws:kms:us-east-1:123:key/xyz"
      region     = "us-east-1"
      key_spec   = "AES_256"
    }
    method "aes_gcm" "my_method" {
      keys = key_provider.aws_kms.my_kms
    }
    state { method = method.aes_gcm.my_method }
  }
}

5. Terraform vs. Ansible (The Handoff)

A massive source of confusion is where Terraform ends and Ansible begins.

Terraform/OpenTofu: Provisioning (Hardware/Cloud APIs). Creates the VPC, the EC2 instance, the load balancer.
Ansible: Configuration Management (Software/OS). SSHing into the EC2 instance, installing Nginx, configuring the firewall.

Gotcha: Never use local-exec to run Ansible.

Beginners often use provisioners to trigger playbooks inside Terraform. This breaks idempotency. By default, provisioners run only when the resource is created. If you update your Ansible playbook, running tofu apply will do nothing because the resource itself has not changed.

Best Practice: Dynamic Inventory or Cloud-Init

Instead of linking them directly, keep them decoupled. Let Terraform build the servers and tag them (Role = "WebServer"). When Terraform exits, your CI pipeline triggers Ansible. Ansible uses a Dynamic Inventory plugin to query the AWS/GCP API, finds all instances with that tag, and configures them independently.
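
On the Terraform side, the handoff amounts to nothing more than a tag. A minimal sketch, with the AMI and subnet passed in as hypothetical input variables:

```hcl
variable "ami_id" { type = string }    # hypothetical input
variable "subnet_id" { type = string } # hypothetical input

resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = "t3.micro"
  subnet_id     = var.subnet_id

  tags = {
    Role = "WebServer" # the tag Ansible's dynamic inventory filters on
  }
}
```

An inventory plugin such as amazon.aws.aws_ec2 can then group hosts by this tag, so Ansible never needs to read Terraform's state or outputs.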

6. Deploying Enterprise Agentic AI Automation

AI is no longer just a chatbot; it involves Agentic AI—autonomous agents that can access tools, query databases, and take actions. Deploying infrastructure for these agents is a massive new use case for IaC.

1. GPU Compute

Provisioning Amazon EKS clusters with dedicated node groups on NVIDIA GPU instances (e.g., p4d or g5) to run local LLMs or agent reasoning engines.
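
A GPU node group might be sketched as follows. The cluster name, node role, and subnets are assumed to come from elsewhere, the sizes are illustrative, and the appropriate ami_type depends on your cluster version:

```hcl
variable "cluster_name" { type = string }
variable "node_role_arn" { type = string }
variable "private_subnet_ids" { type = list(string) }

resource "aws_eks_node_group" "gpu" {
  cluster_name    = var.cluster_name
  node_group_name = "gpu-inference" # hypothetical name
  node_role_arn   = var.node_role_arn
  subnet_ids      = var.private_subnet_ids

  instance_types = ["g5.xlarge"]
  ami_type       = "AL2_x86_64_GPU" # GPU-enabled AMI family; newer clusters may need a different type

  scaling_config {
    desired_size = 1
    min_size     = 0
    max_size     = 2
  }

  # Keep non-GPU workloads off these expensive nodes.
  taint {
    key    = "nvidia.com/gpu"
    value  = "true"
    effect = "NO_SCHEDULE"
  }
}
```

Pods that need a GPU then declare a matching toleration, while everything else schedules onto cheaper node groups.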

2. Vector Memory (RAG)

Using providers for Pinecone, Qdrant, or OpenSearch Serverless to provision the vector database where the agent stores embeddings and memory.
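
For the OpenSearch Serverless option, a vector collection could be sketched like this. The collection name is a placeholder, and a real deployment would also need network and data access policies, which are omitted here:

```hcl
# An encryption policy must exist before the collection can be created.
resource "aws_opensearchserverless_security_policy" "encryption" {
  name = "agent-memory-enc" # hypothetical policy name
  type = "encryption"
  policy = jsonencode({
    Rules = [{
      Resource     = ["collection/agent-memory"]
      ResourceType = "collection"
    }]
    AWSOwnedKey = true
  })
}

resource "aws_opensearchserverless_collection" "memory" {
  name = "agent-memory" # hypothetical collection name
  type = "VECTORSEARCH"

  depends_on = [aws_opensearchserverless_security_policy.encryption]
}
```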

3. Strict IAM Boundaries

Because agents act autonomously, an over-permissioned agent is dangerous. Terraform must provision least-privilege IAM roles, typically via IAM Roles for Service Accounts (IRSA) on EKS, restricting the agent to read only specific context buckets.
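
Such a boundary might look like the sketch below. The bucket name, namespace, and service account are hypothetical, and the cluster's OIDC provider is assumed to exist already:

```hcl
variable "oidc_provider_arn" { type = string } # the EKS cluster's IAM OIDC provider
variable "oidc_provider_url" { type = string } # issuer URL without the https:// prefix

# Trust policy: only the "reviewer" service account in the "agents" namespace may assume the role.
data "aws_iam_policy_document" "agent_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.oidc_provider_arn]
    }

    condition {
      test     = "StringEquals"
      variable = "${var.oidc_provider_url}:sub"
      values   = ["system:serviceaccount:agents:reviewer"]
    }
  }
}

# Permission policy: read-only access to a single context bucket.
data "aws_iam_policy_document" "agent_read" {
  statement {
    actions   = ["s3:GetObject"]
    resources = ["arn:aws:s3:::example-agent-context/*"]
  }

  statement {
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::example-agent-context"]
  }
}

resource "aws_iam_role" "agent" {
  name               = "agent-context-reader"
  assume_role_policy = data.aws_iam_policy_document.agent_trust.json
}

resource "aws_iam_role_policy" "agent_read" {
  name   = "read-context-bucket"
  role   = aws_iam_role.agent.id
  policy = data.aws_iam_policy_document.agent_read.json
}
```

The role grants no write, delete, or IAM actions, so a misbehaving agent can at worst read the context it was already meant to see.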

Use Case: AI Agents Reviewing Infrastructure

Enterprises are deploying read-only agents to review IaC pull requests. Use Terraform to provision a webhook that triggers a Lambda running an agent. When a developer opens a PR, the agent reads the tofu plan JSON output, cross-references it with Confluence security docs, and leaves a plain-English review.

7. HCL Mastery & Coding Standards

The Deadliest Gotcha: `count` vs `for_each`

The Index Shifting Bug: If you use count with a list ["public", "private", "db"], and you delete "public", the array indexes shift. Terraform thinks you changed item 0, changed item 1, and deleted item 2. It will destroy and recreate your private and database subnets, causing an outage.

Best Practice: Always use for_each for collections. It uses map keys (strings) instead of integer indexes.

variable "subnets" {
  type    = set(string)
  default = ["public", "private", "database"]
}

resource "aws_subnet" "main" {
  for_each = var.subnets
  # Safely creates aws_subnet.main["public"], etc.
  # Required arguments (vpc_id, cidr_block) are omitted here for brevity.
}

Complex Variables with Optional Attributes

Historically, variables were loosely typed. Today (Terraform 1.3+ and OpenTofu), use strongly typed objects with optional() fields and defaults to create robust, self-documenting modules.

variable "database_config" {
  type = object({
    engine_version       = string
    instance_class       = string
    multi_az             = optional(bool, false)
    deletion_protection  = optional(bool, true)
  })
}
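
A caller then only has to supply the required attributes. A usage sketch, assuming a local module at ./modules/database that declares the variable above:

```hcl
module "orders_db" {
  source = "./modules/database" # hypothetical module path

  database_config = {
    engine_version = "15"
    instance_class = "db.t3.micro"
    # multi_az and deletion_protection are omitted, so they default to false and true.
  }
}
```

Misspelling a required attribute or passing the wrong type fails at plan time rather than surfacing later as a provider error.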

8. State Management & Refactoring

The `moved` Block

If you rename a resource (e.g., aws_instance.web to aws_instance.frontend), the engine will destroy the old instance and build a new one. Prevent this using the moved block, updating the state file without touching the cloud.

moved {
  from = aws_instance.web
  to   = aws_instance.frontend
}

Configuration-Driven Import

Bringing click-ops infrastructure into IaC used to require tedious CLI commands. Now, define an import block and run tofu plan -generate-config-out=generated.tf to autogenerate the HCL.

import {
  to = aws_s3_bucket.legacy_data
  id = "my-manually-created-bucket"
}

9. Multi-Cloud Rosetta Stone

The syntax logic of HCL remains identical across clouds, but the provider implementations and required resources differ dramatically. Below are side-by-side examples of the most common provisioning tasks across AWS, GCP, and Azure.

Scenario A: Provision a Web Server (Ubuntu + Git Clone + Ports 80/443)

Best Practice: Instead of using local-exec to configure the server, pass a bash script into the instance's user data (Cloud-init). It will run automatically on boot.

AWS

# 1. Fetch latest Ubuntu AMI
data "aws_ami" "ubuntu" {
  most_recent = true
  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
  owners = ["099720109477"] # Canonical
}

# 2. Provision EC2 Instance
resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.micro"
  vpc_security_group_ids = [aws_security_group.web_sg.id]

  user_data = <<-EOF
              #!/bin/bash
              apt-get update
              apt-get install -y nginx git
              git clone https://github.com/example/repo.git /var/www/html/repo
              systemctl start nginx
              EOF
}

# 3. Security Group (Open 80 & 443)
resource "aws_security_group" "web_sg" {
  name = "web-sg"
  dynamic "ingress" {
    for_each = [80, 443]
    content {
      from_port   = ingress.value
      to_port     = ingress.value
      protocol    = "tcp"
      cidr_blocks = ["0.0.0.0/0"]
    }
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

GCP

# 1. Fetch latest Ubuntu Image
data "google_compute_image" "ubuntu" {
  family  = "ubuntu-2204-lts"
  project = "ubuntu-os-cloud"
}

# 2. Provision Compute Engine Instance
resource "google_compute_instance" "web" {
  name         = "web-server"
  machine_type = "e2-micro"
  zone         = "us-central1-a"

  # GCP uses network tags to attach firewall rules
  tags = ["http-server", "https-server"]

  boot_disk {
    initialize_params { image = data.google_compute_image.ubuntu.self_link }
  }

  network_interface {
    network = "default"
    access_config {} # Assigns Ephemeral Public IP
  }

  metadata_startup_script = <<-EOF
                            #!/bin/bash
                            apt-get update
                            apt-get install -y nginx git
                            git clone https://github.com/example/repo.git /var/www/html/repo
                            systemctl start nginx
                            EOF
}

# 3. Firewall Rule (Open 80 & 443)
resource "google_compute_firewall" "web_firewall" {
  name    = "allow-web"
  network = "default"

  allow {
    protocol = "tcp"
    ports    = ["80", "443"]
  }

  source_ranges = ["0.0.0.0/0"]
  target_tags   = ["http-server", "https-server"]
}

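Azure

# Assumes azurerm_resource_group.rg, azurerm_subnet.sub, and azurerm_public_ip.pub are defined elsewhere
# Azure requires explicit creation of Network Interfaces (NICs)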
resource "azurerm_network_interface" "web_nic" {
  name                = "web-nic"
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name
  ip_configuration {
    name                          = "internal"
    subnet_id                     = azurerm_subnet.sub.id
    private_ip_address_allocation = "Dynamic"
    public_ip_address_id          = azurerm_public_ip.pub.id
  }
}

# 1. Provision Virtual Machine
resource "azurerm_linux_virtual_machine" "web" {
  name                = "web-server"
  resource_group_name = azurerm_resource_group.rg.name
  location            = azurerm_resource_group.rg.location
  size                = "Standard_B1s"
  admin_username      = "adminuser"
  network_interface_ids = [azurerm_network_interface.web_nic.id]

  admin_ssh_key {
    username   = "adminuser"
    public_key = file("~/.ssh/id_rsa.pub")
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-jammy"
    sku       = "22_04-lts-gen2"
    version   = "latest"
  }

  # Gotcha: Azure requires base64 encoding for custom_data (cloud-init)
  # The heredoc terminator must sit on its own line, so the closing parenthesis goes on the next line
  custom_data = base64encode(<<-EOF
    #!/bin/bash
    apt-get update
    apt-get install -y nginx git
    git clone https://github.com/example/repo.git /var/www/html/repo
    systemctl start nginx
    EOF
  )
}

# 2. Network Security Group (Open 80 & 443)
resource "azurerm_network_security_group" "web_nsg" {
  name                = "web-nsg"
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name

  security_rule {
    name                       = "AllowWeb"
    priority                   = 100
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_ranges    = ["80", "443"]
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }
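}

# 3. Attach the NSG to the NIC; without an association the rule above has no effect
resource "azurerm_network_interface_security_group_association" "web" {
  network_interface_id      = azurerm_network_interface.web_nic.id
  network_security_group_id = azurerm_network_security_group.web_nsg.id
}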

Scenario B: Provision Storage Buckets from a List

Best Practice: Always use for_each with a set(string) so that removing an item from the middle of the list doesn't trigger destruction of subsequent buckets due to index shifting.

AWS

variable "bucket_names" {
  type    = set(string)
  default = ["assets", "logs", "backups"]
}

resource "aws_s3_bucket" "buckets" {
  for_each      = var.bucket_names
  bucket        = "my-company-${each.key}-2026"
  force_destroy = false

  tags = {
    Environment = "Production"
    Purpose     = each.key
  }
}

GCP

variable "bucket_names" {
  type    = set(string)
  default = ["assets", "logs", "backups"]
}

resource "google_storage_bucket" "buckets" {
  for_each      = var.bucket_names
  name          = "my-company-${each.key}-2026"
  location      = "US"
  force_destroy = false

  # Uniform bucket-level access is a modern GCP best practice
  uniform_bucket_level_access = true

  labels = {
    environment = "production"
    purpose     = each.key
  }
}

Azure

# Azure groups "buckets" (containers) inside a Storage Account
resource "azurerm_storage_account" "main" {
  name                     = "mycompanystorage2026"
  resource_group_name      = azurerm_resource_group.rg.name
  location                 = azurerm_resource_group.rg.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
}

variable "container_names" {
  type    = set(string)
  default = ["assets", "logs", "backups"]
}

resource "azurerm_storage_container" "containers" {
  for_each              = var.container_names
  name                  = each.key
  storage_account_name  = azurerm_storage_account.main.name
  container_access_type = "private"
}

Scenario C: Provision PostgreSQL & Network Whitelist

Best Practice: Databases should almost never have public IPs. Configure them to only accept traffic on Port 5432 from within your private network/VPC.

AWS

# 1. DB Security Group (Allow 5432 from VPC)
resource "aws_security_group" "db_sg" {
  name   = "postgres-sg"
  vpc_id = aws_vpc.main.id

  ingress {
    from_port   = 5432
    to_port     = 5432
    protocol    = "tcp"
    # Only allow traffic from the VPC CIDR (Private access)
    cidr_blocks = [aws_vpc.main.cidr_block] 
  }
}

# 2. Provision RDS Postgres
resource "aws_db_instance" "postgres" {
  identifier           = "prod-db"
  engine               = "postgres"
  engine_version       = "15.4"
  instance_class       = "db.t3.micro"
  allocated_storage    = 20
  
  db_name              = "myappdb"
  username             = "dbadmin"
  password             = data.aws_secretsmanager_secret_version.db_pass.secret_string
  
  vpc_security_group_ids = [aws_security_group.db_sg.id]
  db_subnet_group_name   = aws_db_subnet_group.main.name

  publicly_accessible  = false # Crucial for security
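  skip_final_snapshot       = false           # Keep a final snapshot if this instance is ever destroyed
  final_snapshot_identifier = "prod-db-final" # Illustrative snapshot name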
}

GCP

# Gotcha: GCP requires VPC peering for private Cloud SQL IPs
resource "google_compute_global_address" "private_ip_address" {
  name          = "private-ip-db"
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  prefix_length = 16
  network       = google_compute_network.main.id
}

resource "google_service_networking_connection" "private_vpc_connection" {
  network                 = google_compute_network.main.id
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.private_ip_address.name]
}

# Provision Cloud SQL Postgres
resource "google_sql_database_instance" "postgres" {
  name             = "prod-db-instance"
  database_version = "POSTGRES_15"
  region           = "us-central1"

  depends_on = [google_service_networking_connection.private_vpc_connection]

  settings {
    tier = "db-f1-micro"
    ip_configuration {
      ipv4_enabled    = false # Disables public IP
      private_network = google_compute_network.main.id
    }
  }
}

Azure

# Provision Postgres Flexible Server
resource "azurerm_postgresql_flexible_server" "postgres" {
  name                   = "proddbflexserver"
  resource_group_name    = azurerm_resource_group.rg.name
  location               = azurerm_resource_group.rg.location
  version                = "15"
  administrator_login    = "dbadmin"
  administrator_password = data.azurerm_key_vault_secret.db_pass.value

  sku_name   = "B_Standard_B1ms"
  storage_mb = 32768

  # For strict private access, Azure integrates via delegated subnet
  delegated_subnet_id    = azurerm_subnet.db_subnet.id
  private_dns_zone_id    = azurerm_private_dns_zone.db_zone.id
}

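# Alternative: a server created with public access (no delegated_subnet_id) and strict firewall rules.
# Firewall rules match public source addresses, so list your known egress IPs rather than a private VNet range.
resource "azurerm_postgresql_flexible_server_firewall_rule" "allow_egress" {
  name             = "allow-egress-ips"
  server_id        = azurerm_postgresql_flexible_server.postgres.id
  start_ip_address = "203.0.113.0"   # Placeholder range start (documentation address block)
  end_ip_address   = "203.0.113.255" # Placeholder range end
}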

10. The Ultimate Production Checklist

Before you merge any IaC to production, run it through this final validation layer.

Architecture & Code

Blast Radius Separation

State is explicitly separated by environment (Dev/Prod) and layer (Network/App) into separate directories.

Strict Loops

Collections are created with for_each instead of count to prevent index-shifting destruction bugs.

Ansible Handoff

Configuration management is handed off via Dynamic Inventory or Cloud-init rather than fragile local-exec triggers.

Security & GitOps

OIDC Authentication

CI/CD pipelines authenticate to cloud providers with short-lived STS credentials obtained via OIDC instead of static access keys.

Plan Asymmetry Addressed

The CD pipeline runs apply against the exact plan file (-out=tfplan) generated during the PR phase.

AI Agent IAM Boundaries

Any infrastructure provisioned for autonomous AI agents uses strictly scoped, least-privilege IAM roles (IRSA).
