Terraform for Data Infrastructure: A Practical Guide

Key Takeaways

  • Infrastructure as Code is not optional for modern data teams. If your Snowflake warehouses, Airflow environments, or Kafka clusters are provisioned through ClickOps, you are accumulating invisible technical debt.
  • Terraform has mature providers for Snowflake, Databricks, AWS Redshift, Confluent Kafka, and all three major clouds — making it the de facto choice for data infrastructure automation.
  • State management is the hardest part. Remote backends with locking (S3 + DynamoDB, GCS, Terraform Cloud) are non-negotiable for team workflows.
  • Reusable modules let you stamp out consistent data environments across dev, staging, and production without copy-pasting HCL files.
  • The biggest wins come from boring things: lifecycle policies on storage buckets, IAM roles for service accounts, and tagging conventions that actually stick because they're enforced in code.

Why I Started Terraform-ing Everything After One Bad Weekend

Let me tell you about the weekend that changed my mind about infrastructure as code for data platforms. It was late 2023, and I was a platform engineer at a mid-sized fintech. Our data team had a Snowflake account with about 40 warehouses, a dozen databases, and somewhere north of 200 schemas. All provisioned manually through the Snowflake UI by various data engineers over two years.

Someone accidentally dropped a production database. Not a table — the entire database. The UNDROP command saved us, but when we tried to figure out who had access to what, and which warehouses belonged to which team, we realized nobody had a clear picture. There was no documentation, no audit trail beyond Snowflake's query history, and no way to recreate the environment from scratch.

That Monday, I started writing Terraform for our Snowflake setup. Within three weeks, we had every warehouse, database, schema, role, and grant defined in HCL files. When a new data engineer joined, their access was a pull request. When we needed a staging copy of production, it was terraform workspace new staging. The difference was night and day.

If you work in data engineering and your infrastructure is still managed through web consoles and ad-hoc CLI commands, this guide is for you. I'll walk through the practical side: how to set up Terraform for data-specific tools, real code you can adapt, and the mistakes I've made so you don't have to.

Terraform Basics for Data Engineers

If you're already comfortable with Terraform, skip ahead. For data engineers who've heard the name but never written HCL, here's the minimum you need to know.

Terraform is a declarative infrastructure tool by HashiCorp. You describe the desired state of your infrastructure in .tf files, and Terraform figures out how to get there. It talks to cloud providers, SaaS platforms, and databases through providers — plugins that translate your HCL into API calls.

The workflow is always the same:

  1. terraform init — downloads providers and sets up your backend.
  2. terraform plan — shows what Terraform would change without touching anything.
  3. terraform apply — makes the changes for real.
  4. terraform destroy — tears everything down (careful with this one).

Here's the simplest possible example — creating an S3 bucket:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

resource "aws_s3_bucket" "data_lake" {
  bucket = "mycompany-data-lake-prod"

  tags = {
    Team        = "data-engineering"
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

That's it. Run terraform apply, and you have a bucket. More importantly, that bucket's existence and configuration are now tracked in a state file, and any changes go through code review.

Provisioning Snowflake with Terraform

The Snowflake Terraform provider is one of the most complete SaaS providers I've worked with. It covers warehouses, databases, schemas, roles, grants, pipes, stages, tasks — basically everything you'd do in the Snowflake UI.

First, configure the provider. I keep credentials in environment variables, never in the HCL:

terraform {
  required_providers {
    snowflake = {
      source  = "Snowflake-Labs/snowflake"
      version = "~> 0.96"
    }
  }
}

provider "snowflake" {
  organization_name = var.snowflake_org
  account_name      = var.snowflake_account
  user              = var.snowflake_user
  password          = var.snowflake_password
  role              = "ACCOUNTADMIN"
}

Now let's provision a real data engineering setup — a warehouse, database, schema, and role-based access:

resource "snowflake_warehouse" "etl" {
  name           = "ETL_WH"
  warehouse_size = "MEDIUM"
  auto_suspend   = 120
  auto_resume    = true

  initially_suspended = true
  min_cluster_count   = 1
  max_cluster_count   = 3
  scaling_policy      = "ECONOMY"

  comment = "Warehouse for ETL/ELT workloads"
}

resource "snowflake_warehouse" "analytics" {
  name           = "ANALYTICS_WH"
  warehouse_size = "SMALL"
  auto_suspend   = 60
  auto_resume    = true

  initially_suspended = true
  comment             = "Warehouse for BI tools and ad-hoc queries"
}

resource "snowflake_database" "raw" {
  name    = "RAW"
  comment = "Landing zone for raw ingested data"
}

resource "snowflake_database" "transformed" {
  name    = "TRANSFORMED"
  comment = "Cleaned and modeled data (dbt output)"
}

resource "snowflake_schema" "raw_stripe" {
  database = snowflake_database.raw.name
  name     = "STRIPE"
  comment  = "Stripe payment events"
}

resource "snowflake_schema" "raw_segment" {
  database = snowflake_database.raw.name
  name     = "SEGMENT"
  comment  = "Segment clickstream data"
}

# Role-based access control
resource "snowflake_account_role" "data_analyst" {
  name    = "DATA_ANALYST"
  comment = "Read-only access to transformed data"
}

resource "snowflake_grant_privileges_to_account_role" "analyst_warehouse" {
  privileges        = ["USAGE"]
  account_role_name = snowflake_account_role.data_analyst.name
  on_account_object {
    object_type = "WAREHOUSE"
    object_name = snowflake_warehouse.analytics.name
  }
}

resource "snowflake_grant_privileges_to_account_role" "analyst_db" {
  privileges        = ["USAGE"]
  account_role_name = snowflake_account_role.data_analyst.name
  on_account_object {
    object_type = "DATABASE"
    object_name = snowflake_database.transformed.name
  }
}

The real power here isn't any single resource — it's that your entire Snowflake RBAC model is now version-controlled. When the security team asks "who has access to what?", the answer is a Git repository, not a spreadsheet.
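To make "access is a pull request" concrete, granting the role to a person is itself a resource. A minimal sketch — the user name JANE_DOE is a placeholder for whoever is onboarding:

resource "snowflake_grant_account_role" "analyst_jane" {
  role_name = snowflake_account_role.data_analyst.name
  user_name = "JANE_DOE"  # hypothetical user; one resource per grant
}

When Jane leaves the team, removing access is deleting this block and merging — which also leaves a clean audit trail in Git.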

Databricks Workspaces and Clusters

Databricks has an official Terraform provider that handles workspace setup, cluster policies, jobs, and Unity Catalog resources. In my experience, the most valuable things to Terraform are cluster policies and job definitions, because those are the resources that drift the most when people create them through the UI.

terraform {
  required_providers {
    databricks = {
      source  = "databricks/databricks"
      version = "~> 1.50"
    }
  }
}

provider "databricks" {
  host  = var.databricks_host
  token = var.databricks_token
}

# Cluster policy to control costs
resource "databricks_cluster_policy" "data_eng" {
  name = "Data Engineering Standard"

  definition = jsonencode({
    "autotermination_minutes" : { "type" : "range", "maxValue" : 60, "defaultValue" : 20 },
    "spark_version" : { "type" : "regex", "pattern" : "1[4-5]\\.[0-9]+\\.x-scala.*" },
    "node_type_id" : { "type" : "allowlist", "values" : ["i3.xlarge", "i3.2xlarge"] },
    "num_workers" : { "type" : "range", "minValue" : 1, "maxValue" : 8 },
    "custom_tags.Team" : { "type" : "fixed", "value" : "data-engineering" }
  })
}

# Job cluster for a production pipeline
resource "databricks_job" "daily_etl" {
  name = "daily-etl-pipeline"

  task {
    task_key = "ingest"

    new_cluster {
      num_workers   = 4
      spark_version = "14.3.x-scala2.12"
      node_type_id  = "i3.xlarge"

      aws_attributes {
        instance_profile_arn = var.instance_profile_arn
      }
    }

    notebook_task {
      notebook_path = "/Repos/production/pipelines/ingest"
    }
  }

  task {
    task_key = "transform"
    depends_on {
      task_key = "ingest"
    }

    new_cluster {
      num_workers   = 2
      spark_version = "14.3.x-scala2.12"
      node_type_id  = "i3.xlarge"
    }

    notebook_task {
      notebook_path = "/Repos/production/pipelines/transform"
    }
  }

  schedule {
    quartz_cron_expression = "0 0 6 * * ?"
    timezone_id            = "America/New_York"
  }

  email_notifications {
    on_failure = ["data-eng-alerts@company.com"]
  }
}

One thing I learned the hard way: always Terraform your cluster policies before individual clusters. If engineers are creating clusters through the UI without policies, you'll burn through your cloud budget faster than you think.
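Once the policy exists, clusters can be pinned to it so its limits are enforced whether the cluster is created in HCL or in the UI. A sketch of a shared cluster referencing the policy above (the name and sizing are illustrative):

resource "databricks_cluster" "shared_dev" {
  cluster_name            = "shared-dev"
  policy_id               = databricks_cluster_policy.data_eng.id
  spark_version           = "14.3.x-scala2.12"
  node_type_id            = "i3.xlarge"
  num_workers             = 2
  autotermination_minutes = 20  # must fall within the policy's range
}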

Redshift Clusters and Serverless

For teams on AWS that use Redshift, Terraform handles both provisioned clusters and the newer Serverless option. Here's a Serverless setup that I've used in production:

resource "aws_redshiftserverless_namespace" "analytics" {
  namespace_name      = "analytics"
  db_name             = "analytics_db"
  admin_username      = "admin"
  admin_user_password = var.redshift_admin_password

  iam_roles = [aws_iam_role.redshift_loader.arn]

  tags = {
    Team        = "data-engineering"
    Environment = var.environment
  }
}

resource "aws_redshiftserverless_workgroup" "analytics" {
  namespace_name = aws_redshiftserverless_namespace.analytics.namespace_name
  workgroup_name = "analytics-workgroup"
  base_capacity  = 32  # RPUs

  publicly_accessible = false
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.redshift.id]
}

resource "aws_iam_role" "redshift_loader" {
  name = "redshift-data-loader"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "redshift.amazonaws.com"
      }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "redshift_s3_read" {
  role       = aws_iam_role.redshift_loader.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"
}

The Serverless model is particularly nice for Terraform because you don't have to worry about node types and cluster sizing — just set the base capacity in RPUs and let AWS handle scaling.
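Automatic scaling cuts both ways, though — a runaway query can scale your bill too. You can cap spend in the same configuration with a usage limit; a sketch, where the 500 RPU-hour cap is an arbitrary example you'd tune to your budget:

resource "aws_redshiftserverless_usage_limit" "monthly_cap" {
  resource_arn  = aws_redshiftserverless_workgroup.analytics.arn
  usage_type    = "serverless-compute"
  amount        = 500        # RPU-hours per month; illustrative value
  period        = "monthly"
  breach_action = "deactivate"  # or "log" if you'd rather alert than stop
}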

Managing Airflow and Dagster Infrastructure

Orchestrators are the glue of any data platform, and their infrastructure tends to be the messiest. I've managed both MWAA (Managed Airflow on AWS) and self-hosted Dagster through Terraform.

AWS MWAA (Managed Airflow)

resource "aws_s3_bucket" "airflow_dags" {
  bucket = "${var.project}-airflow-dags-${var.environment}"
}

resource "aws_s3_bucket_versioning" "airflow_dags" {
  bucket = aws_s3_bucket.airflow_dags.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_mwaa_environment" "main" {
  name              = "${var.project}-${var.environment}"
  airflow_version   = "2.9.2"
  environment_class = "mw1.medium"

  dag_s3_path              = "dags/"
  requirements_s3_path     = "requirements.txt"
  source_bucket_arn        = aws_s3_bucket.airflow_dags.arn
  execution_role_arn       = aws_iam_role.mwaa_execution.arn
  webserver_access_mode    = "PRIVATE_ONLY"

  min_workers = 1
  max_workers = 10

  network_configuration {
    security_group_ids = [aws_security_group.mwaa.id]
    subnet_ids         = var.private_subnet_ids
  }

  airflow_configuration_options = {
    "core.default_timezone"          = "America/New_York"
    "celery.worker_autoscale"        = "10,1"
    "webserver.default_ui_timezone"  = "America/New_York"
    "smtp.smtp_mail_from"            = "airflow@company.com"
  }

  logging_configuration {
    dag_processing_logs {
      enabled   = true
      log_level = "WARNING"
    }
    scheduler_logs {
      enabled   = true
      log_level = "INFO"
    }
    task_logs {
      enabled   = true
      log_level = "INFO"
    }
  }
}

Dagster on Kubernetes (EKS)

For Dagster, I typically use the Helm provider inside Terraform to deploy the Dagster Helm chart onto EKS. It's a pattern that feels a bit odd at first — Terraform managing Helm releases — but it keeps everything in one place:

resource "helm_release" "dagster" {
  name       = "dagster"
  repository = "https://dagster-io.github.io/helm"
  chart      = "dagster"
  version    = "1.7.5"
  namespace  = "dagster"

  create_namespace = true

  values = [templatefile("${path.module}/values/dagster.yaml", {
    postgres_host     = aws_db_instance.dagster_metadata.address
    postgres_password = var.dagster_db_password
    s3_bucket         = aws_s3_bucket.dagster_io.bucket
    service_account   = kubernetes_service_account.dagster.metadata[0].name
  })]

  depends_on = [
    aws_db_instance.dagster_metadata,
    aws_s3_bucket.dagster_io,
    kubernetes_service_account.dagster,
  ]
}

The key insight with orchestrator infrastructure is that the orchestrator itself is only part of the story. You also need to Terraform the surrounding resources: the metadata database, the S3 buckets for logs and IO, the IAM roles for task execution, and the networking rules that let your DAGs or assets talk to data sources.
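As a sketch of those surrounding resources, here's roughly what sits behind the aws_db_instance.dagster_metadata and aws_s3_bucket.dagster_io references in the Helm release above — sizes and identifiers are illustrative, and aws_security_group.dagster_db is assumed to be defined elsewhere:

# Metadata database for Dagster's run and event storage
resource "aws_db_instance" "dagster_metadata" {
  identifier     = "dagster-metadata"
  engine         = "postgres"
  engine_version = "15"
  instance_class = "db.t4g.small"  # illustrative; size for your workload

  allocated_storage = 20
  db_name           = "dagster"
  username          = "dagster"
  password          = var.dagster_db_password

  vpc_security_group_ids = [aws_security_group.dagster_db.id]
  skip_final_snapshot    = false
}

# Bucket for IO manager outputs and compute logs
resource "aws_s3_bucket" "dagster_io" {
  bucket = "${var.project}-dagster-io-${var.environment}"
}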

S3 and GCS Buckets with Lifecycle Policies

Storage buckets are the bread and butter of data infrastructure, and lifecycle policies are where Terraform really shines. I've seen data lakes where raw event data sits in standard storage forever because nobody set up transitions. That's money on fire.

resource "aws_s3_bucket" "data_lake" {
  bucket = "${var.project}-data-lake-${var.environment}"

  tags = {
    Team        = "data-engineering"
    Environment = var.environment
    DataClass   = "internal"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id

  # Raw events — hot for 30 days, then IA, then Glacier
  rule {
    id     = "raw-events-tiering"
    status = "Enabled"

    filter {
      prefix = "raw/events/"
    }

    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }

    transition {
      days          = 90
      storage_class = "GLACIER"
    }

    expiration {
      days = 730  # 2 years retention
    }
  }

  # Temporary staging files — clean up after 7 days
  rule {
    id     = "staging-cleanup"
    status = "Enabled"

    filter {
      prefix = "staging/"
    }

    expiration {
      days = 7
    }

    abort_incomplete_multipart_upload {
      days_after_initiation = 1
    }
  }

  # Incomplete multipart uploads everywhere
  rule {
    id     = "abort-incomplete-uploads"
    status = "Enabled"

    filter {}

    abort_incomplete_multipart_upload {
      days_after_initiation = 3
    }
  }
}

For GCS, the equivalent looks like this:

resource "google_storage_bucket" "data_lake" {
  name          = "${var.project}-data-lake-${var.environment}"
  location      = "US"
  storage_class = "STANDARD"

  uniform_bucket_level_access = true

  lifecycle_rule {
    condition {
      age                   = 30
      matches_storage_class = ["STANDARD"]
      matches_prefix        = ["raw/events/"]
    }
    action {
      type          = "SetStorageClass"
      storage_class = "NEARLINE"
    }
  }

  lifecycle_rule {
    condition {
      age                   = 90
      matches_storage_class = ["NEARLINE"]
      matches_prefix        = ["raw/events/"]
    }
    action {
      type          = "SetStorageClass"
      storage_class = "COLDLINE"
    }
  }

  lifecycle_rule {
    condition {
      age            = 7
      matches_prefix = ["staging/"]
    }
    action {
      type = "Delete"
    }
  }
}

The incomplete multipart upload rule is one that people always forget. Without it, failed large uploads leave partial objects that you pay for indefinitely. I've seen S3 bills drop by 15% just from adding that rule.

Kafka Clusters with Confluent

If your data platform includes event streaming, the Confluent Terraform provider manages Kafka clusters, topics, schemas, ACLs, and connectors. Here's a pattern for a production Kafka setup:

terraform {
  required_providers {
    confluent = {
      source  = "confluentinc/confluent"
      version = "~> 2.0"
    }
  }
}

resource "confluent_kafka_cluster" "production" {
  display_name = "data-platform-prod"
  availability = "MULTI_ZONE"
  cloud        = "AWS"
  region       = "us-east-1"

  dedicated {
    cku = 2
  }

  environment {
    id = confluent_environment.production.id
  }
}

resource "confluent_kafka_topic" "user_events" {
  kafka_cluster {
    id = confluent_kafka_cluster.production.id
  }

  topic_name       = "user-events"
  partitions_count = 12
  rest_endpoint    = confluent_kafka_cluster.production.rest_endpoint

  config = {
    "retention.ms"      = "604800000"  # 7 days
    "cleanup.policy"    = "delete"
    "compression.type"  = "zstd"
    "max.message.bytes" = "2097152"  # 2MB
  }

  credentials {
    key    = confluent_api_key.admin.id
    secret = confluent_api_key.admin.secret
  }
}

resource "confluent_kafka_topic" "user_events_dlq" {
  kafka_cluster {
    id = confluent_kafka_cluster.production.id
  }

  topic_name       = "user-events.dlq"
  partitions_count = 6
  rest_endpoint    = confluent_kafka_cluster.production.rest_endpoint

  config = {
    "retention.ms"   = "2592000000"  # 30 days
    "cleanup.policy" = "delete"
  }

  credentials {
    key    = confluent_api_key.admin.id
    secret = confluent_api_key.admin.secret
  }
}

Always create your dead letter queue topics alongside the main topics. I've been on too many incident calls where events were failing silently because nobody set up a DLQ, and by the time we noticed, the original events had aged out of the source topic.
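The provider also covers ACLs, so consumer access lives in the same repo as the topics themselves. A sketch granting a consumer service account read access to both topics via a prefixed pattern — confluent_service_account.consumer is assumed to be defined elsewhere:

resource "confluent_kafka_acl" "consumer_read" {
  kafka_cluster {
    id = confluent_kafka_cluster.production.id
  }

  resource_type = "TOPIC"
  resource_name = "user-events"
  pattern_type  = "PREFIXED"  # covers user-events and user-events.dlq
  principal     = "User:${confluent_service_account.consumer.id}"
  host          = "*"
  operation     = "READ"
  permission    = "ALLOW"

  rest_endpoint = confluent_kafka_cluster.production.rest_endpoint
  credentials {
    key    = confluent_api_key.admin.id
    secret = confluent_api_key.admin.secret
  }
}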

State Management: The Part Nobody Warns You About

Terraform state is where things get tricky, especially on data teams where multiple engineers might be provisioning resources. The state file tracks the mapping between your HCL and real-world resources. If two people run terraform apply at the same time with local state, you will corrupt it. Guaranteed.

Use a remote backend with state locking from day one:

terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "data-platform/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

A few state management practices that have saved me:

  • Separate state files per domain. Don't put your Snowflake config, your Kafka cluster, and your Airflow environment in the same state. If one state gets corrupted or a plan takes too long, it shouldn't affect everything else.
  • Use terraform state mv when refactoring. If you rename a resource or move it into a module, use the state command to move it rather than letting Terraform destroy and recreate.
  • Never edit state files manually. I know it's tempting when something gets stuck. Use terraform state rm and terraform import instead.
  • Enable state file versioning on S3. When (not if) something goes wrong, you can roll back to a previous state version.
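The backend bucket itself is the chicken-and-egg exception: it has to exist before terraform init can use it, so it lives in a small bootstrap configuration, usually applied once with local state. A sketch covering the versioning and locking pieces from the list above:

resource "aws_s3_bucket" "terraform_state" {
  bucket = "mycompany-terraform-state"
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"  # the S3 backend expects exactly this key

  attribute {
    name = "LockID"
    type = "S"
  }
}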

Reusable Modules for Data Infrastructure

Once you've written Terraform for one environment, you'll want to stamp out dev, staging, and production consistently. Modules are the way to do this. Here's how I structure a data platform module:

# modules/data-lake/main.tf

variable "project" {
  type        = string
  description = "Project name for resource naming"
}

variable "environment" {
  type        = string
  description = "Environment (dev, staging, prod)"
}

variable "raw_retention_days" {
  type        = number
  default     = 730
  description = "Days to retain raw data before expiration"
}

variable "enable_glacier_transition" {
  type        = bool
  default     = true
  description = "Enable transition to Glacier for raw data"
}

resource "aws_s3_bucket" "raw" {
  bucket = "${var.project}-raw-${var.environment}"
  tags   = local.tags
}

resource "aws_s3_bucket" "transformed" {
  bucket = "${var.project}-transformed-${var.environment}"
  tags   = local.tags
}

resource "aws_s3_bucket" "staging" {
  bucket = "${var.project}-staging-${var.environment}"
  tags   = local.tags
}

locals {
  tags = {
    Project     = var.project
    Environment = var.environment
    ManagedBy   = "terraform"
    Module      = "data-lake"
  }
}

output "raw_bucket_arn" {
  value = aws_s3_bucket.raw.arn
}

output "transformed_bucket_arn" {
  value = aws_s3_bucket.transformed.arn
}

Then call it per environment:

# environments/production/main.tf

module "data_lake" {
  source = "../../modules/data-lake"

  project                   = "analytics"
  environment               = "production"
  raw_retention_days        = 730
  enable_glacier_transition = true
}

# environments/dev/main.tf

module "data_lake" {
  source = "../../modules/data-lake"

  project                   = "analytics"
  environment               = "dev"
  raw_retention_days        = 30
  enable_glacier_transition = false
}

The directory structure I recommend for a data platform:

terraform/
├── modules/
│   ├── data-lake/          # S3/GCS buckets, lifecycle policies, IAM
│   ├── snowflake-env/      # Databases, schemas, warehouses, roles
│   ├── kafka-cluster/      # Cluster, topics, ACLs, connectors
│   ├── orchestrator/       # MWAA or Dagster + supporting infra
│   └── networking/         # VPC, subnets, security groups
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   └── production/
└── global/                 # Shared resources (IAM, DNS, state bucket)
    ├── main.tf
    └── variables.tf

Common Mistakes (and How I Know About Them)

I've made every single one of these, usually on a Friday afternoon.

1. Hardcoding credentials in HCL files

This sounds obvious, but I've seen Snowflake passwords in committed .tf files more times than I'd like to admit. Use variables with environment variables or a secrets manager:

variable "snowflake_password" {
  type      = string
  sensitive = true
}

# Set via: export TF_VAR_snowflake_password="..."
# Or use a data source for AWS Secrets Manager / Vault
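The Secrets Manager route looks roughly like this — the secret name and JSON key are placeholders for whatever convention your team uses:

data "aws_secretsmanager_secret_version" "snowflake" {
  secret_id = "data-platform/snowflake"  # hypothetical secret name
}

provider "snowflake" {
  # ...other provider settings...
  password = jsondecode(data.aws_secretsmanager_secret_version.snowflake.secret_string)["password"]
}

Either way, the credential never lands in version control, and rotating it doesn't require touching the HCL.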

2. Not using prevent_destroy on critical resources

A careless terraform destroy or a refactor that removes a resource block will delete your production database. Add lifecycle protection:

resource "snowflake_database" "production" {
  name = "PRODUCTION"

  lifecycle {
    prevent_destroy = true
  }
}

3. Putting everything in one state file

I've watched a terraform plan take 15 minutes because the state file had 800+ resources spanning Snowflake, AWS, and Databricks. Split by domain. Your future self will thank you.

4. Ignoring provider version constraints

The Snowflake provider, in particular, has had breaking changes between minor versions. Always pin your provider versions with ~> constraints and test upgrades explicitly.

5. Not importing existing resources

If you're adding Terraform to an existing platform, don't just write the resource blocks and hit apply — Terraform will try to create new resources instead of managing existing ones. Use terraform import for each existing resource:

# Import an existing Snowflake warehouse
terraform import snowflake_warehouse.etl '"ETL_WH"'

# Import an existing S3 bucket
terraform import aws_s3_bucket.data_lake mycompany-data-lake-prod

This is tedious for a large existing environment, but it's a one-time cost. Tools like terraformer can generate HCL and state from existing infrastructure, though I've had mixed results with them for Snowflake resources specifically.
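If you're on Terraform 1.5 or later, declarative import blocks make this reviewable: the imports go through plan like any other change instead of mutating state from the CLI. A sketch mirroring the commands above:

import {
  to = snowflake_warehouse.etl
  id = "\"ETL_WH\""
}

import {
  to = aws_s3_bucket.data_lake
  id = "mycompany-data-lake-prod"
}

Terraform can also draft the resource blocks for you with terraform plan -generate-config-out=generated.tf, though I'd treat the generated HCL as a starting point to clean up, not a finished product.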

6. Forgetting about data residency and region constraints

When you Terraform a multi-region data platform, it's easy to accidentally create resources in the wrong region. Always parameterize regions and add validation:

variable "aws_region" {
  type = string

  validation {
    condition     = contains(["us-east-1", "eu-west-1", "eu-central-1"], var.aws_region)
    error_message = "Only approved regions are allowed for data infrastructure."
  }
}

Putting It All Together

Here's the sequence I follow when bringing Terraform to a data platform for the first time:

  1. Set up the backend first. Create the S3 bucket and DynamoDB table for state management — ironically, I do this manually or with a tiny bootstrap script.
  2. Start with storage. S3/GCS buckets with lifecycle policies are low-risk and high-value. You'll see cost savings within the first month.
  3. Move to the warehouse. Import your existing Snowflake/Databricks/Redshift setup. This is the most tedious step but the most impactful for governance.
  4. Add networking and IAM. VPCs, security groups, service accounts, roles. This is the unsexy work that prevents 3 AM incidents.
  5. Terraform your orchestrator last. Airflow and Dagster environments have the most dependencies on everything else, so they go last.

The end state you're working toward is simple: a new engineer joins the team, clones the repo, and can understand exactly what infrastructure exists and why. A new environment takes a terraform apply, not a three-day ticket. And when something breaks at 2 AM, you can look at the Git log to see exactly what changed.

Infrastructure as code for data platforms isn't glamorous. Nobody's going to write a blog post about how your S3 lifecycle policies saved the company $4,000 a month (well, except me apparently). But it's the foundation that makes everything else — the pipelines, the models, the dashboards — reliable and reproducible. And once you've experienced managing infrastructure through code, going back to clicking around in web consoles feels like editing production data in phpMyAdmin. Technically possible. Absolutely terrifying.

Further reading: The Snowflake provider docs, Databricks provider docs, and Confluent provider docs are all excellent. HashiCorp's Terraform tutorials are the best starting point if you're completely new.
