Mastering Slowly Changing Dimensions

Mastering Slowly Changing Dimensions

Mastering Slowly Changing Dimensions: A Complete Guide for Data Engineers

Introduction

Slowly Changing Dimensions (SCDs) represent one of the most fundamental yet critical concepts in data warehousing that every data engineer must master. As business data evolves over time, maintaining historical accuracy while ensuring current information remains accessible becomes a complex challenge that can make or break your analytics capabilities.

Whether you’re building customer analytics dashboards, tracking product changes, or maintaining regulatory compliance, understanding how to properly implement SCDs will determine the reliability and usefulness of your data warehouse. This comprehensive guide explores the various SCD types, implementation strategies, and modern approaches using today’s leading data platforms.

What Are Slowly Changing Dimensions?

Slowly Changing Dimensions are dimension tables in a data warehouse where attribute values change over time, but at a relatively slow pace compared to fact table updates. Unlike rapidly changing transactional data, SCDs typically involve changes to customer information, product details, organizational structures, or geographical data that evolve gradually.

The challenge lies in deciding how to handle these changes while preserving data integrity and historical context. Do you overwrite the old value, keep both versions, or create a new record? The answer depends on your business requirements and compliance needs.

SCD Type 1: Overwrite (No History Preservation)

When to Use: When historical values aren’t important for analysis, or when corrections need to be applied retroactively.

Implementation: Simply update the existing record with new values.

-- Example: Updating customer email address
UPDATE dim_customer 
SET email = 'newemail@company.com',
    last_updated = CURRENT_TIMESTAMP
WHERE customer_id = 12345;

Advantages:

  • Simple implementation
  • Minimal storage requirements
  • No complexity in joins

Disadvantages:

  • Complete loss of historical data
  • Cannot perform trend analysis on changed attributes
  • Potential compliance issues

Real-world Example: Correcting data entry errors, updating contact information where history isn’t business-critical.

SCD Type 2: Create New Records (Full History Preservation)

When to Use: When you need complete historical tracking and the ability to analyze trends over time.

Implementation: Create a new record for each change while preserving the original.

-- Create new record for changed customer
INSERT INTO dim_customer (
    customer_id,
    customer_key, -- Business key remains same
    first_name,
    last_name,
    email,
    address,
    effective_date,
    expiration_date,
    is_current
)
SELECT 
    NEXT_VALUE_FOR(customer_id_seq),
    customer_key,
    first_name,
    last_name,
    'newemail@company.com', -- New email
    'New Address, City, State', -- New address
    CURRENT_DATE,
    '9999-12-31',
    TRUE
FROM dim_customer 
WHERE customer_key = 'CUST_12345' AND is_current = TRUE;

-- Update previous record
UPDATE dim_customer 
SET expiration_date = CURRENT_DATE - 1,
    is_current = FALSE
WHERE customer_key = 'CUST_12345' AND is_current = TRUE;

Advanced SCD Type 2 with dbt:

-- models/dim_customer_scd2.sql
{{ config(
    materialized='incremental',
    unique_key='customer_key',
    on_schema_change='sync_all_columns'
) }}

WITH source_data AS (
    SELECT * FROM {{ ref('stg_customers') }}
    {% if is_incremental() %}
        WHERE updated_at > (SELECT MAX(effective_date) FROM {{ this }})
    {% endif %}
),

scd2_logic AS (
    SELECT 
        *,
        LAG(customer_key) OVER (PARTITION BY customer_key ORDER BY updated_at) as prev_record,
        LEAD(updated_at, 1, '9999-12-31'::date) OVER (PARTITION BY customer_key ORDER BY updated_at) as expiration_date
    FROM source_data
)

SELECT 
    {{ dbt_utils.generate_surrogate_key(['customer_key', 'updated_at']) }} as customer_id,
    customer_key,
    first_name,
    last_name,
    email,
    address,
    updated_at as effective_date,
    expiration_date,
    CASE WHEN expiration_date = '9999-12-31'::date THEN TRUE ELSE FALSE END as is_current
FROM scd2_logic

Advantages:

  • Complete historical preservation
  • Enables trend analysis and point-in-time reporting
  • Audit trail capabilities
  • Regulatory compliance support

Disadvantages:

  • Increased storage requirements
  • More complex queries and joins
  • Potential performance impacts

SCD Type 3: Add New Columns (Limited History)

When to Use: When you need to track only the previous value alongside the current value.

Implementation: Add columns to store both current and previous values.

-- Add columns for previous values
ALTER TABLE dim_customer 
ADD COLUMN previous_address VARCHAR(255),
ADD COLUMN previous_address_date DATE;

-- Update with new address while preserving previous
UPDATE dim_customer 
SET previous_address = address,
    previous_address_date = address_change_date,
    address = 'New Address, City, State',
    address_change_date = CURRENT_DATE
WHERE customer_id = 12345;

Advantages:

  • Simple queries for before/after comparisons
  • Moderate storage increase
  • Easy to understand and implement

Disadvantages:

  • Limited to one previous value
  • Schema changes required for new tracked attributes
  • Not scalable for multiple changes

SCD Type 4: History Tables (Separate Current and Historical)

When to Use: When you need to optimize current data queries while maintaining complete history.

Implementation: Maintain separate tables for current and historical data.

-- Current dimension table
CREATE TABLE dim_customer_current (
    customer_id INT PRIMARY KEY,
    customer_key VARCHAR(50),
    first_name VARCHAR(100),
    last_name VARCHAR(100),
    email VARCHAR(255),
    address VARCHAR(500),
    last_updated TIMESTAMP
);

-- Historical dimension table
CREATE TABLE dim_customer_history (
    customer_id INT,
    customer_key VARCHAR(50),
    first_name VARCHAR(100),
    last_name VARCHAR(100),
    email VARCHAR(255),
    address VARCHAR(500),
    effective_date DATE,
    expiration_date DATE,
    change_reason VARCHAR(255)
);

Modern Implementation with Databricks Delta Lake:

from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Read source data
new_customer_data = spark.read.table("bronze.customers")

# Current table operations
current_table = DeltaTable.forName(spark, "silver.dim_customer_current")

# Merge new data into current table
current_table.alias("current").merge(
    new_customer_data.alias("updates"),
    "current.customer_key = updates.customer_key"
).whenMatchedUpdate(
    set = {
        "first_name": "updates.first_name",
        "last_name": "updates.last_name",
        "email": "updates.email",
        "address": "updates.address",
        "last_updated": F.current_timestamp()
    }
).whenNotMatchedInsert(
    values = {
        "customer_key": "updates.customer_key",
        "first_name": "updates.first_name",
        "last_name": "updates.last_name",
        "email": "updates.email",
        "address": "updates.address",
        "last_updated": F.current_timestamp()
    }
).execute()

# Archive changed records to history table
changed_records = current_table.toDF().join(
    new_customer_data, ["customer_key"], "inner"
).where("current.last_updated = new.update_timestamp")

changed_records.write.mode("append").saveAsTable("silver.dim_customer_history")

SCD Type 6: Hybrid Approach (Combination of Types 1, 2, and 3)

When to Use: When you need the benefits of multiple SCD types for different attributes.

Implementation: Combines versioning, current value updates, and previous value tracking.

CREATE TABLE dim_customer_hybrid (
    customer_id INT IDENTITY(1,1) PRIMARY KEY,
    customer_key VARCHAR(50), -- Business key
    version_number INT,
    first_name VARCHAR(100),
    last_name VARCHAR(100),
    email VARCHAR(255), -- Type 1: Always current
    current_address VARCHAR(500), -- Type 3: Current value
    previous_address VARCHAR(500), -- Type 3: Previous value
    original_address VARCHAR(500), -- Type 6: Original value
    effective_date DATE,
    expiration_date DATE,
    is_current BIT,
    created_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Modern SCD Implementation Strategies

Using Snowflake Streams and Tasks

-- Create stream to capture changes
CREATE STREAM customer_changes_stream ON TABLE source.customers;

-- Create task for automated SCD processing
CREATE TASK process_customer_scd
    WAREHOUSE = 'SCD_WH'
    SCHEDULE = '5 MINUTE'
AS
INSERT INTO dim_customer (
    customer_key, first_name, last_name, email, address,
    effective_date, expiration_date, is_current
)
SELECT 
    customer_key,
    first_name,
    last_name,
    email,
    address,
    CURRENT_DATE,
    '9999-12-31'::DATE,
    TRUE
FROM customer_changes_stream
WHERE METADATA$ACTION = 'INSERT' OR METADATA$ACTION = 'UPDATE';

Using Apache Airflow for SCD Orchestration

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta

def process_scd_type2(**context):
    """
    Process SCD Type 2 updates using pandas and SQLAlchemy
    """
    import pandas as pd
    from sqlalchemy import create_engine
    
    engine = create_engine('postgresql://user:pass@host:port/db')
    
    # Read source data
    source_df = pd.read_sql("""
        SELECT customer_key, first_name, last_name, email, address, updated_at
        FROM source.customers 
        WHERE updated_at >= %s
    """, engine, params=[context['ds']])
    
    # Read current dimension data
    current_df = pd.read_sql("""
        SELECT customer_key, first_name, last_name, email, address
        FROM dim_customer 
        WHERE is_current = true
    """, engine)
    
    # Identify changes
    merged_df = source_df.merge(
        current_df, 
        on='customer_key', 
        how='left', 
        suffixes=('_new', '_current')
    )
    
    changes_df = merged_df[
        (merged_df['email_new'] != merged_df['email_current']) |
        (merged_df['address_new'] != merged_df['address_current'])
    ]
    
    if not changes_df.empty:
        # Expire current records
        expire_query = """
            UPDATE dim_customer 
            SET expiration_date = %s, is_current = false
            WHERE customer_key IN %s AND is_current = true
        """
        
        # Insert new records
        new_records = changes_df[[
            'customer_key', 'first_name_new', 'last_name_new', 
            'email_new', 'address_new'
        ]].copy()
        
        new_records.columns = [
            'customer_key', 'first_name', 'last_name', 'email', 'address'
        ]
        new_records['effective_date'] = context['ds']
        new_records['expiration_date'] = '9999-12-31'
        new_records['is_current'] = True
        
        new_records.to_sql('dim_customer', engine, if_exists='append', index=False)

dag = DAG(
    'scd_processing',
    default_args={
        'owner': 'data-team',
        'depends_on_past': False,
        'start_date': datetime(2024, 1, 1),
        'email_on_failure': True,
        'email_on_retry': False,
        'retries': 2,
        'retry_delay': timedelta(minutes=5)
    },
    description='Process slowly changing dimensions',
    schedule_interval='@daily',
    catchup=False
)

process_scd = PythonOperator(
    task_id='process_customer_scd',
    python_callable=process_scd_type2,
    dag=dag
)

Performance Optimization Strategies

Indexing for SCD Tables

-- Composite index for SCD Type 2 queries
CREATE INDEX idx_customer_business_effective 
ON dim_customer (customer_key, effective_date, expiration_date);

-- Index for current record queries
CREATE INDEX idx_customer_current 
ON dim_customer (customer_key) 
WHERE is_current = true;

-- Partitioning by date for large SCD tables
CREATE TABLE dim_customer_partitioned (
    customer_id BIGINT,
    customer_key VARCHAR(50),
    first_name VARCHAR(100),
    last_name VARCHAR(100),
    email VARCHAR(255),
    address VARCHAR(500),
    effective_date DATE,
    expiration_date DATE,
    is_current BOOLEAN
) PARTITION BY RANGE (effective_date);

Query Optimization Techniques

-- Efficient point-in-time query
WITH customer_snapshot AS (
    SELECT 
        customer_key,
        first_name,
        last_name,
        email,
        address,
        ROW_NUMBER() OVER (
            PARTITION BY customer_key 
            ORDER BY effective_date DESC
        ) as rn
    FROM dim_customer
    WHERE effective_date <= '2024-06-01'
      AND expiration_date > '2024-06-01'
)
SELECT * FROM customer_snapshot WHERE rn = 1;

-- Materialized view for current records
CREATE MATERIALIZED VIEW dim_customer_current AS
SELECT * FROM dim_customer WHERE is_current = true;

Testing and Data Quality for SCDs

dbt Tests for SCD Implementation

# models/schema.yml
version: 2

models:
  - name: dim_customer_scd2
    tests:
      - dbt_utils.expression_is_true:
          expression: "effective_date <= expiration_date"
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - customer_key
            - effective_date
    columns:
      - name: customer_key
        tests:
          - not_null
      - name: effective_date
        tests:
          - not_null
      - name: is_current
        tests:
          - not_null

Data Quality Monitoring

def validate_scd_integrity(connection):
    """
    Validate SCD Type 2 implementation integrity
    """
    checks = {
        'overlapping_dates': """
            SELECT customer_key, COUNT(*) as overlaps
            FROM (
                SELECT d1.customer_key
                FROM dim_customer d1
                JOIN dim_customer d2 ON d1.customer_key = d2.customer_key
                WHERE d1.customer_id != d2.customer_id
                  AND d1.effective_date < d2.expiration_date
                  AND d1.expiration_date > d2.effective_date
            ) overlaps
            GROUP BY customer_key
            HAVING COUNT(*) > 0
        """,
        
        'multiple_current_records': """
            SELECT customer_key, COUNT(*) as current_count
            FROM dim_customer
            WHERE is_current = true
            GROUP BY customer_key
            HAVING COUNT(*) > 1
        """,
        
        'gaps_in_timeline': """
            SELECT 
                customer_key,
                effective_date,
                LAG(expiration_date) OVER (
                    PARTITION BY customer_key 
                    ORDER BY effective_date
                ) as prev_expiration
            FROM dim_customer
            WHERE effective_date > LAG(expiration_date) OVER (
                    PARTITION BY customer_key 
                    ORDER BY effective_date
                )
        """
    }
    
    results = {}
    for check_name, query in checks.items():
        result = pd.read_sql(query, connection)
        results[check_name] = len(result) == 0  # True if no issues found
    
    return results

Choosing the Right SCD Type

CriteriaType 1Type 2Type 3Type 4Type 6
Storage RequirementsLowHighMediumHighHigh
Query ComplexitySimpleComplexSimpleMediumComplex
Historical AnalysisNoneCompleteLimitedCompleteComplete
Performance ImpactMinimalModerateMinimalLowModerate
Compliance SupportPoorExcellentPoorGoodExcellent
Implementation EffortLowHighMediumHighVery High

Decision Framework

Choose SCD Type 1 when:

  • Historical data isn’t required for analysis
  • Storage and performance are primary concerns
  • Data corrections need to be applied retroactively
  • Regulatory requirements don’t mandate history preservation

Choose SCD Type 2 when:

  • Complete historical tracking is essential
  • Trend analysis and point-in-time reporting are required
  • Regulatory compliance demands audit trails
  • Storage costs are not a primary constraint

Choose SCD Type 3 when:

  • Only recent changes need tracking
  • Before/after comparisons are sufficient
  • Schema flexibility exists for new columns
  • Storage optimization is important

Choose SCD Type 4 when:

  • Current data queries need optimization
  • Complete history must be maintained separately
  • Mixed access patterns exist (frequent current, occasional historical)
  • System architecture supports multiple tables

Key Takeaways

Strategic Considerations:

  • SCD implementation should align with business requirements and compliance needs
  • Consider storage costs, query performance, and maintenance overhead in your decision
  • Modern cloud data platforms offer built-in features that simplify SCD implementation
  • Always implement proper testing and monitoring for SCD processes

Technical Best Practices:

  • Use appropriate indexing strategies for your chosen SCD type
  • Implement automated data quality checks to ensure SCD integrity
  • Consider partitioning large SCD tables by date ranges
  • Leverage modern tools like dbt, Airflow, and cloud-native features for automation

Implementation Guidelines:

  • Start with clear business requirements for historical data retention
  • Design your SCD strategy before implementation, not after
  • Plan for data migration when changing SCD types
  • Document your SCD approach thoroughly for team knowledge sharing

Mastering SCDs is essential for building robust, compliant, and analytically powerful data warehouses. The key lies in understanding your specific use case requirements and choosing the appropriate implementation strategy that balances functionality, performance, and maintainability.

Further Reading

Leave a Reply

Your email address will not be published. Required fields are marked *