Mastering Slowly Changing Dimensions: A Complete Guide for Data Engineers
Introduction
Slowly Changing Dimensions (SCDs) represent one of the most fundamental yet critical concepts in data warehousing that every data engineer must master. As business data evolves over time, maintaining historical accuracy while ensuring current information remains accessible becomes a complex challenge that can make or break your analytics capabilities.
Whether you’re building customer analytics dashboards, tracking product changes, or maintaining regulatory compliance, understanding how to properly implement SCDs will determine the reliability and usefulness of your data warehouse. This comprehensive guide explores the various SCD types, implementation strategies, and modern approaches using today’s leading data platforms.
What Are Slowly Changing Dimensions?
Slowly Changing Dimensions are dimension tables in a data warehouse where attribute values change over time, but at a relatively slow pace compared to fact table updates. Unlike rapidly changing transactional data, SCDs typically involve changes to customer information, product details, organizational structures, or geographical data that evolve gradually.
The challenge lies in deciding how to handle these changes while preserving data integrity and historical context. Do you overwrite the old value, keep both versions, or create a new record? The answer depends on your business requirements and compliance needs.

SCD Type 1: Overwrite (No History Preservation)
When to Use: When historical values aren’t important for analysis, or when corrections need to be applied retroactively.
Implementation: Simply update the existing record with new values.
-- Example: Updating customer email address
UPDATE dim_customer
SET email = 'newemail@company.com',
last_updated = CURRENT_TIMESTAMP
WHERE customer_id = 12345;
Advantages:
- Simple implementation
- Minimal storage requirements
- No complexity in joins
Disadvantages:
- Complete loss of historical data
- Cannot perform trend analysis on changed attributes
- Potential compliance issues
Real-world Example: Correcting data entry errors, updating contact information where history isn’t business-critical.
SCD Type 2: Create New Records (Full History Preservation)
When to Use: When you need complete historical tracking and the ability to analyze trends over time.
Implementation: Create a new record for each change while preserving the original.
-- Create new record for changed customer
INSERT INTO dim_customer (
customer_id,
customer_key, -- Business key remains same
first_name,
last_name,
email,
address,
effective_date,
expiration_date,
is_current
)
SELECT
NEXT_VALUE_FOR(customer_id_seq),
customer_key,
first_name,
last_name,
'newemail@company.com', -- New email
'New Address, City, State', -- New address
CURRENT_DATE,
'9999-12-31',
TRUE
FROM dim_customer
WHERE customer_key = 'CUST_12345' AND is_current = TRUE;
-- Update previous record
UPDATE dim_customer
SET expiration_date = CURRENT_DATE - 1,
is_current = FALSE
WHERE customer_key = 'CUST_12345' AND is_current = TRUE;
Advanced SCD Type 2 with dbt:
-- models/dim_customer_scd2.sql
{{ config(
materialized='incremental',
unique_key='customer_key',
on_schema_change='sync_all_columns'
) }}
WITH source_data AS (
SELECT * FROM {{ ref('stg_customers') }}
{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(effective_date) FROM {{ this }})
{% endif %}
),
scd2_logic AS (
SELECT
*,
LAG(customer_key) OVER (PARTITION BY customer_key ORDER BY updated_at) as prev_record,
LEAD(updated_at, 1, '9999-12-31'::date) OVER (PARTITION BY customer_key ORDER BY updated_at) as expiration_date
FROM source_data
)
SELECT
{{ dbt_utils.generate_surrogate_key(['customer_key', 'updated_at']) }} as customer_id,
customer_key,
first_name,
last_name,
email,
address,
updated_at as effective_date,
expiration_date,
CASE WHEN expiration_date = '9999-12-31'::date THEN TRUE ELSE FALSE END as is_current
FROM scd2_logic
Advantages:
- Complete historical preservation
- Enables trend analysis and point-in-time reporting
- Audit trail capabilities
- Regulatory compliance support
Disadvantages:
- Increased storage requirements
- More complex queries and joins
- Potential performance impacts
SCD Type 3: Add New Columns (Limited History)
When to Use: When you need to track only the previous value alongside the current value.
Implementation: Add columns to store both current and previous values.
-- Add columns for previous values
ALTER TABLE dim_customer
ADD COLUMN previous_address VARCHAR(255),
ADD COLUMN previous_address_date DATE;
-- Update with new address while preserving previous
UPDATE dim_customer
SET previous_address = address,
previous_address_date = address_change_date,
address = 'New Address, City, State',
address_change_date = CURRENT_DATE
WHERE customer_id = 12345;
Advantages:
- Simple queries for before/after comparisons
- Moderate storage increase
- Easy to understand and implement
Disadvantages:
- Limited to one previous value
- Schema changes required for new tracked attributes
- Not scalable for multiple changes
SCD Type 4: History Tables (Separate Current and Historical)
When to Use: When you need to optimize current data queries while maintaining complete history.
Implementation: Maintain separate tables for current and historical data.
-- Current dimension table
CREATE TABLE dim_customer_current (
customer_id INT PRIMARY KEY,
customer_key VARCHAR(50),
first_name VARCHAR(100),
last_name VARCHAR(100),
email VARCHAR(255),
address VARCHAR(500),
last_updated TIMESTAMP
);
-- Historical dimension table
CREATE TABLE dim_customer_history (
customer_id INT,
customer_key VARCHAR(50),
first_name VARCHAR(100),
last_name VARCHAR(100),
email VARCHAR(255),
address VARCHAR(500),
effective_date DATE,
expiration_date DATE,
change_reason VARCHAR(255)
);
Modern Implementation with Databricks Delta Lake:
from delta.tables import DeltaTable
from pyspark.sql import functions as F
# Read source data
new_customer_data = spark.read.table("bronze.customers")
# Current table operations
current_table = DeltaTable.forName(spark, "silver.dim_customer_current")
# Merge new data into current table
current_table.alias("current").merge(
new_customer_data.alias("updates"),
"current.customer_key = updates.customer_key"
).whenMatchedUpdate(
set = {
"first_name": "updates.first_name",
"last_name": "updates.last_name",
"email": "updates.email",
"address": "updates.address",
"last_updated": F.current_timestamp()
}
).whenNotMatchedInsert(
values = {
"customer_key": "updates.customer_key",
"first_name": "updates.first_name",
"last_name": "updates.last_name",
"email": "updates.email",
"address": "updates.address",
"last_updated": F.current_timestamp()
}
).execute()
# Archive changed records to history table
changed_records = current_table.toDF().join(
new_customer_data, ["customer_key"], "inner"
).where("current.last_updated = new.update_timestamp")
changed_records.write.mode("append").saveAsTable("silver.dim_customer_history")
SCD Type 6: Hybrid Approach (Combination of Types 1, 2, and 3)
When to Use: When you need the benefits of multiple SCD types for different attributes.
Implementation: Combines versioning, current value updates, and previous value tracking.
CREATE TABLE dim_customer_hybrid (
customer_id INT IDENTITY(1,1) PRIMARY KEY,
customer_key VARCHAR(50), -- Business key
version_number INT,
first_name VARCHAR(100),
last_name VARCHAR(100),
email VARCHAR(255), -- Type 1: Always current
current_address VARCHAR(500), -- Type 3: Current value
previous_address VARCHAR(500), -- Type 3: Previous value
original_address VARCHAR(500), -- Type 6: Original value
effective_date DATE,
expiration_date DATE,
is_current BIT,
created_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Modern SCD Implementation Strategies
Using Snowflake Streams and Tasks
-- Create stream to capture changes
CREATE STREAM customer_changes_stream ON TABLE source.customers;
-- Create task for automated SCD processing
CREATE TASK process_customer_scd
WAREHOUSE = 'SCD_WH'
SCHEDULE = '5 MINUTE'
AS
INSERT INTO dim_customer (
customer_key, first_name, last_name, email, address,
effective_date, expiration_date, is_current
)
SELECT
customer_key,
first_name,
last_name,
email,
address,
CURRENT_DATE,
'9999-12-31'::DATE,
TRUE
FROM customer_changes_stream
WHERE METADATA$ACTION = 'INSERT' OR METADATA$ACTION = 'UPDATE';
Using Apache Airflow for SCD Orchestration
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
def process_scd_type2(**context):
"""
Process SCD Type 2 updates using pandas and SQLAlchemy
"""
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('postgresql://user:pass@host:port/db')
# Read source data
source_df = pd.read_sql("""
SELECT customer_key, first_name, last_name, email, address, updated_at
FROM source.customers
WHERE updated_at >= %s
""", engine, params=[context['ds']])
# Read current dimension data
current_df = pd.read_sql("""
SELECT customer_key, first_name, last_name, email, address
FROM dim_customer
WHERE is_current = true
""", engine)
# Identify changes
merged_df = source_df.merge(
current_df,
on='customer_key',
how='left',
suffixes=('_new', '_current')
)
changes_df = merged_df[
(merged_df['email_new'] != merged_df['email_current']) |
(merged_df['address_new'] != merged_df['address_current'])
]
if not changes_df.empty:
# Expire current records
expire_query = """
UPDATE dim_customer
SET expiration_date = %s, is_current = false
WHERE customer_key IN %s AND is_current = true
"""
# Insert new records
new_records = changes_df[[
'customer_key', 'first_name_new', 'last_name_new',
'email_new', 'address_new'
]].copy()
new_records.columns = [
'customer_key', 'first_name', 'last_name', 'email', 'address'
]
new_records['effective_date'] = context['ds']
new_records['expiration_date'] = '9999-12-31'
new_records['is_current'] = True
new_records.to_sql('dim_customer', engine, if_exists='append', index=False)
dag = DAG(
'scd_processing',
default_args={
'owner': 'data-team',
'depends_on_past': False,
'start_date': datetime(2024, 1, 1),
'email_on_failure': True,
'email_on_retry': False,
'retries': 2,
'retry_delay': timedelta(minutes=5)
},
description='Process slowly changing dimensions',
schedule_interval='@daily',
catchup=False
)
process_scd = PythonOperator(
task_id='process_customer_scd',
python_callable=process_scd_type2,
dag=dag
)
Performance Optimization Strategies
Indexing for SCD Tables
-- Composite index for SCD Type 2 queries
CREATE INDEX idx_customer_business_effective
ON dim_customer (customer_key, effective_date, expiration_date);
-- Index for current record queries
CREATE INDEX idx_customer_current
ON dim_customer (customer_key)
WHERE is_current = true;
-- Partitioning by date for large SCD tables
CREATE TABLE dim_customer_partitioned (
customer_id BIGINT,
customer_key VARCHAR(50),
first_name VARCHAR(100),
last_name VARCHAR(100),
email VARCHAR(255),
address VARCHAR(500),
effective_date DATE,
expiration_date DATE,
is_current BOOLEAN
) PARTITION BY RANGE (effective_date);
Query Optimization Techniques
-- Efficient point-in-time query
WITH customer_snapshot AS (
SELECT
customer_key,
first_name,
last_name,
email,
address,
ROW_NUMBER() OVER (
PARTITION BY customer_key
ORDER BY effective_date DESC
) as rn
FROM dim_customer
WHERE effective_date <= '2024-06-01'
AND expiration_date > '2024-06-01'
)
SELECT * FROM customer_snapshot WHERE rn = 1;
-- Materialized view for current records
CREATE MATERIALIZED VIEW dim_customer_current AS
SELECT * FROM dim_customer WHERE is_current = true;
Testing and Data Quality for SCDs
dbt Tests for SCD Implementation
# models/schema.yml
version: 2
models:
- name: dim_customer_scd2
tests:
- dbt_utils.expression_is_true:
expression: "effective_date <= expiration_date"
- dbt_utils.unique_combination_of_columns:
combination_of_columns:
- customer_key
- effective_date
columns:
- name: customer_key
tests:
- not_null
- name: effective_date
tests:
- not_null
- name: is_current
tests:
- not_null
Data Quality Monitoring
def validate_scd_integrity(connection):
"""
Validate SCD Type 2 implementation integrity
"""
checks = {
'overlapping_dates': """
SELECT customer_key, COUNT(*) as overlaps
FROM (
SELECT d1.customer_key
FROM dim_customer d1
JOIN dim_customer d2 ON d1.customer_key = d2.customer_key
WHERE d1.customer_id != d2.customer_id
AND d1.effective_date < d2.expiration_date
AND d1.expiration_date > d2.effective_date
) overlaps
GROUP BY customer_key
HAVING COUNT(*) > 0
""",
'multiple_current_records': """
SELECT customer_key, COUNT(*) as current_count
FROM dim_customer
WHERE is_current = true
GROUP BY customer_key
HAVING COUNT(*) > 1
""",
'gaps_in_timeline': """
SELECT
customer_key,
effective_date,
LAG(expiration_date) OVER (
PARTITION BY customer_key
ORDER BY effective_date
) as prev_expiration
FROM dim_customer
WHERE effective_date > LAG(expiration_date) OVER (
PARTITION BY customer_key
ORDER BY effective_date
)
"""
}
results = {}
for check_name, query in checks.items():
result = pd.read_sql(query, connection)
results[check_name] = len(result) == 0 # True if no issues found
return results
Choosing the Right SCD Type
Criteria | Type 1 | Type 2 | Type 3 | Type 4 | Type 6 |
---|---|---|---|---|---|
Storage Requirements | Low | High | Medium | High | High |
Query Complexity | Simple | Complex | Simple | Medium | Complex |
Historical Analysis | None | Complete | Limited | Complete | Complete |
Performance Impact | Minimal | Moderate | Minimal | Low | Moderate |
Compliance Support | Poor | Excellent | Poor | Good | Excellent |
Implementation Effort | Low | High | Medium | High | Very High |
Decision Framework
Choose SCD Type 1 when:
- Historical data isn’t required for analysis
- Storage and performance are primary concerns
- Data corrections need to be applied retroactively
- Regulatory requirements don’t mandate history preservation
Choose SCD Type 2 when:
- Complete historical tracking is essential
- Trend analysis and point-in-time reporting are required
- Regulatory compliance demands audit trails
- Storage costs are not a primary constraint
Choose SCD Type 3 when:
- Only recent changes need tracking
- Before/after comparisons are sufficient
- Schema flexibility exists for new columns
- Storage optimization is important
Choose SCD Type 4 when:
- Current data queries need optimization
- Complete history must be maintained separately
- Mixed access patterns exist (frequent current, occasional historical)
- System architecture supports multiple tables
Key Takeaways
Strategic Considerations:
- SCD implementation should align with business requirements and compliance needs
- Consider storage costs, query performance, and maintenance overhead in your decision
- Modern cloud data platforms offer built-in features that simplify SCD implementation
- Always implement proper testing and monitoring for SCD processes
Technical Best Practices:
- Use appropriate indexing strategies for your chosen SCD type
- Implement automated data quality checks to ensure SCD integrity
- Consider partitioning large SCD tables by date ranges
- Leverage modern tools like dbt, Airflow, and cloud-native features for automation
Implementation Guidelines:
- Start with clear business requirements for historical data retention
- Design your SCD strategy before implementation, not after
- Plan for data migration when changing SCD types
- Document your SCD approach thoroughly for team knowledge sharing
Mastering SCDs is essential for building robust, compliant, and analytically powerful data warehouses. The key lies in understanding your specific use case requirements and choosing the appropriate implementation strategy that balances functionality, performance, and maintainability.
Leave a Reply