Adaptive Performance Engineering in Microsoft SQL Server: Columnstore, Partitioning, and CDC for Modern Data Pipelines
Why this matters (a real-world hook)
You’ve got a nightly pipeline that once finished in 20 minutes. Now it drifts past the business day, starving dashboards and downstream jobs. Hardware’s fine; the problem is architectural: wide fact tables, skewed loads, and queries that scan more history than they need. SQL Server has the tools—columnstore, partitioning, and CDC—but the wins only show up when you combine them correctly.
This guide shows you how to use those features together, with pragmatic T-SQL and production-minded guardrails.
Concepts & Architecture (the mental model)
Goal: Keep hot data hot and cold data cheap, while making ETL/ELT incremental and queries IO-lean.
- Clustered Columnstore Index (CCI): Compresses and columnizes large tables, turning scans into cache-friendly reads. Best for analytics-style workloads; supports batch mode execution.
- Table Partitioning: Physically segments a large table by a key (often date). Enables partition elimination and partition switching for near-instant data loads/archival.
- Change Data Capture (CDC): Streams committed row changes into change tables. Great for incremental extracts, SCD handling, and downstream lake/warehouse sync.
- Query Store: Captures query plans & regressions; your telemetry to validate tuning.
- Resource Governor: Optional; controls workload groups to protect critical jobs from noisy neighbors.
Pattern:
CDC ⇒ Land incrementals ⇒ Upsert into a partitioned staging table ⇒ Partition switch into CCI fact ⇒ Analytical queries hit compressed, partition-eliminated segments.
Quick comparison: when to use what
| Challenge | Columnstore (CCI) | Partitioning | CDC |
|---|---|---|---|
| Slow analytics scans | ✅ Huge win via compression & batch mode | ➖ Helps via elimination | ➖ Indirectly |
| Painful nightly loads | ✅ Fewer pages, faster writes with rowgroup strategy | ✅ Switch-in is near-instant | ✅ Build true incrementals |
| Hot/cold data mgmt | ➖ Not alone | ✅ Natural boundary by date | ➖ Not for storage tiering |
| Downstream sync | ➖ | ➖ | ✅ Canonical change feed |
| Backfill historical ranges | ➖ | ✅ Switch old partitions fast | ➖ |
Real example: build a partitioned, columnstore fact with incremental loads
Scenario: Daily sales fact at ~2B rows, partitioned by `SalesDate`. You ingest deltas every hour via CDC.
1) Create partition objects (monthly by date)
```sql
-- 1) Partition function: monthly by SalesDate
CREATE PARTITION FUNCTION pf_SalesDate (date)
AS RANGE RIGHT FOR VALUES (
    '2024-01-01', '2024-02-01', '2024-03-01', '2024-04-01',
    '2024-05-01', '2024-06-01', '2024-07-01', '2024-08-01',
    '2024-09-01', '2024-10-01', '2024-11-01', '2024-12-01'
);

-- 2) Partition scheme: place all on PRIMARY (use filegroups in prod)
CREATE PARTITION SCHEME ps_SalesDate
AS PARTITION pf_SalesDate
ALL TO ([PRIMARY]);
```
Production tip: Map older partitions to cheaper filegroups/storage; keep the current + previous month on fast storage.
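To confirm the boundaries landed where you expect, you can inspect the partition metadata with the standard catalog views (a quick sanity-check sketch; the function name matches the one created above):

```sql
-- List boundary values for pf_SalesDate
SELECT pf.name AS FunctionName,
       prv.boundary_id,
       prv.value AS BoundaryValue
FROM sys.partition_functions AS pf
JOIN sys.partition_range_values AS prv
  ON prv.function_id = pf.function_id
WHERE pf.name = 'pf_SalesDate'
ORDER BY prv.boundary_id;
```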
2) Create the fact table with clustered columnstore
```sql
CREATE TABLE dbo.FactSales
(
    SalesDate       date      NOT NULL,
    StoreId         int       NOT NULL,
    ProductId       int       NOT NULL,
    Quantity        int       NOT NULL,
    NetAmount       money     NOT NULL,
    LastModifiedUtc datetime2 NOT NULL
)
ON ps_SalesDate (SalesDate);

-- Clustered columnstore for analytics
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales
ON dbo.FactSales;
```
Why CCI on a partitioned table? Each partition maintains its own rowgroups. Queries that hit one month only decompress that month.
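For example, a month-scoped aggregate like this sketch should touch only one partition's rowgroups (verify via the actual plan's "Actual Partition Count"):

```sql
-- Filters on the partitioning column, so only the
-- March 2024 partition is read and decompressed
SELECT StoreId,
       SUM(NetAmount) AS NetAmount,
       SUM(Quantity)  AS Quantity
FROM dbo.FactSales
WHERE SalesDate >= '2024-03-01'
  AND SalesDate <  '2024-04-01'
GROUP BY StoreId;
```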
3) Create a like-for-like staging table for switch-in
```sql
CREATE TABLE dbo.FactSales_Stage
(
    SalesDate       date      NOT NULL,
    StoreId         int       NOT NULL,
    ProductId       int       NOT NULL,
    Quantity        int       NOT NULL,
    NetAmount       money     NOT NULL,
    LastModifiedUtc datetime2 NOT NULL
)
ON ps_SalesDate (SalesDate);

-- The stage needs a matching clustered columnstore index,
-- or the SWITCH into the CCI fact will fail
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales_Stage
ON dbo.FactSales_Stage;
```
Important: Schemas, indexes, constraints (including NULLability), and partition alignment must match exactly between `_Stage` and the fact table to use `ALTER TABLE … SWITCH`.
4) Load incrementals from CDC
Assume CDC is enabled on the OLTP source and changes are landed to stg.Sales_CDC:
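Enabling CDC on the source looks roughly like this (a sketch: `dbo.Sales` and the process that lands changes into `stg.Sales_CDC` are assumptions, not shown here):

```sql
-- Enable CDC at the database level (requires sysadmin)
EXEC sys.sp_cdc_enable_db;

-- Track changes on the source table; CDC creates the
-- capture instance and change table automatically
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Sales',
    @role_name     = NULL;   -- NULL = no gating role for readers
```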
```sql
-- Upsert into stage for the current month only
WITH this_month AS (
    SELECT DATEFROMPARTS(YEAR(SYSDATETIME()), MONTH(SYSDATETIME()), 1) AS StartDate
)
MERGE dbo.FactSales_Stage AS T
USING (
    SELECT s.SalesDate, s.StoreId, s.ProductId, s.Quantity, s.NetAmount, s.LastModifiedUtc
    FROM stg.Sales_CDC AS s
    CROSS JOIN this_month AS m
    WHERE s.SalesDate >= m.StartDate
) AS S
    ON  T.SalesDate = S.SalesDate
    AND T.StoreId   = S.StoreId
    AND T.ProductId = S.ProductId
WHEN MATCHED AND T.LastModifiedUtc < S.LastModifiedUtc THEN
    UPDATE SET Quantity        = S.Quantity,
               NetAmount       = S.NetAmount,
               LastModifiedUtc = S.LastModifiedUtc
WHEN NOT MATCHED THEN
    INSERT (SalesDate, StoreId, ProductId, Quantity, NetAmount, LastModifiedUtc)
    VALUES (S.SalesDate, S.StoreId, S.ProductId, S.Quantity, S.NetAmount, S.LastModifiedUtc);
```
Alternative: If MERGE scares you (it should), prefer an idempotent `UPDATE` + `INSERT … WHERE NOT EXISTS` pattern, or use `MERGE` with the documented best-practice safeguards (unique keys on the join columns, `HOLDLOCK`).
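A minimal sketch of that two-statement upsert, wrapped in a transaction so it stays atomic (same assumed `stg.Sales_CDC` source as above):

```sql
BEGIN TRANSACTION;

-- 1) Update rows that exist and are older than the incoming change
UPDATE T
SET    T.Quantity        = S.Quantity,
       T.NetAmount       = S.NetAmount,
       T.LastModifiedUtc = S.LastModifiedUtc
FROM dbo.FactSales_Stage AS T
JOIN stg.Sales_CDC AS S
  ON  T.SalesDate = S.SalesDate
  AND T.StoreId   = S.StoreId
  AND T.ProductId = S.ProductId
WHERE T.LastModifiedUtc < S.LastModifiedUtc;

-- 2) Insert rows that do not exist yet
INSERT INTO dbo.FactSales_Stage
    (SalesDate, StoreId, ProductId, Quantity, NetAmount, LastModifiedUtc)
SELECT S.SalesDate, S.StoreId, S.ProductId, S.Quantity, S.NetAmount, S.LastModifiedUtc
FROM stg.Sales_CDC AS S
WHERE NOT EXISTS (
    SELECT 1
    FROM dbo.FactSales_Stage AS T
    WHERE T.SalesDate = S.SalesDate
      AND T.StoreId   = S.StoreId
      AND T.ProductId = S.ProductId
);

COMMIT TRANSACTION;
```

Rerunning this batch against the same deltas is a no-op, which is exactly what you want when an hourly job retries.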
5) Switch stage into the live fact for just the hot partition
```sql
-- Identify the current partition number dynamically; SWITCH
-- accepts a variable for the partition number expression
DECLARE @p int = $PARTITION.pf_SalesDate(CAST(SYSDATETIME() AS date));

-- Switch only the current (hot) partition; the target partition must be empty
ALTER TABLE dbo.FactSales_Stage
SWITCH PARTITION @p TO dbo.FactSales PARTITION @p;
```
This is metadata-only (near-instant) as long as the schemas align and the target partition is empty. Your hourlies now update the current month without touching history.
Best practices (that actually move the needle)
- Partition by date, not surrogate keys. Align boundaries with your query filters (`WHERE SalesDate BETWEEN …`).
- Keep partitions chunky. Aim for millions of rows per partition so rowgroups can fill toward their 1,048,576-row maximum; too-small partitions kill columnstore compression.
- Control rowgroup quality. Bulk load in batches of at least 102,400 rows so inserts compress directly into rowgroups; avoid accumulating lots of delta stores.
- Use filtered nonclustered rowstore indexes on hot, highly selective columns if you must support point lookups; drop them before heavy loads if they slow ingestion.
- Leverage batch mode on rowstore (the `BATCH_MODE_ON_ROWSTORE` database-scoped configuration, version-dependent) for queries that can't move to columnstore.
- Enable Query Store in read-write mode and baseline critical queries before/after changes.
- Automate partition maintenance: `SPLIT` the upcoming boundary before the month starts, `MERGE` old tiny partitions, and `SWITCH` out to archive filegroups before purging.
- Watch for implicit conversions (e.g., `datetime` vs `datetime2`), which block index seeks and partition elimination.
- Avoid `SELECT *` in ETL and analytics. Project only needed columns to maximize columnstore scan efficiency.
- Plan for memory & tempdb. Columnstore compression and sorts chew RAM; size `MAXDOP`, memory grants, and tempdb appropriately.
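The rowgroup-quality advice above mostly comes down to loading with `TABLOCK` so large inserts bypass the delta store (a sketch; `stg.Sales_Batch` is an assumed source table):

```sql
-- A bulk insert of >= 102,400 rows with TABLOCK compresses
-- straight into columnstore rowgroups and can run in parallel
INSERT INTO dbo.FactSales WITH (TABLOCK)
    (SalesDate, StoreId, ProductId, Quantity, NetAmount, LastModifiedUtc)
SELECT SalesDate, StoreId, ProductId, Quantity, NetAmount, LastModifiedUtc
FROM stg.Sales_Batch;
```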
Common pitfalls (and how to dodge them)
- “Columnstore made my OLTP slower.” True—don’t put CCI on write-heavy OLTP tables. Use CCI on analytics facts; leave OLTP rowstore, or consider a nonclustered columnstore index on read replicas.
- “Partitioning didn’t help.” Check that queries filter on the partitioning column and that stats are current. Without elimination, you’re still scanning everything.
- “Switch failed at midnight.” You forgot to pre-create the next month’s boundary (`SPLIT`), or the schemas diverged (NULLability/DEFAULT mismatch).
- “CDC ballooned.” Prune capture instances you don’t need and establish retention windows; downstream consumers must keep up or you’ll chase ghosts in cleanup.
- “MERGE produced duplicates.” Classic. Add unique keys, and use `WHEN NOT MATCHED BY SOURCE THEN DELETE` only if truly desired; otherwise, stick to proven upsert patterns.
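On the CDC retention point: the window is controlled by the CDC cleanup job. A sketch of tightening it to three days (retention is expressed in minutes):

```sql
-- Keep only 3 days (4,320 minutes) of change data;
-- the cleanup job prunes older rows from change tables
EXEC sys.sp_cdc_change_job
    @job_type  = N'cleanup',
    @retention = 4320;
```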
Short code cookbook
Create and monitor columnstore rowgroups
```sql
SELECT OBJECT_NAME(object_id) AS TableName,
       state_desc,
       total_rows,
       size_in_bytes
FROM sys.column_store_row_groups
WHERE object_id = OBJECT_ID('dbo.FactSales');
```
Split next month’s boundary (automate monthly)
```sql
-- First day of next month; DATEADD is safe in December,
-- unlike MONTH(...) + 1 arithmetic, which overflows to month 13
DECLARE @next date =
    DATEADD(MONTH, 1, DATEFROMPARTS(YEAR(SYSDATETIME()), MONTH(SYSDATETIME()), 1));

ALTER PARTITION SCHEME ps_SalesDate NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pf_SalesDate() SPLIT RANGE (@next);
```
Enable Query Store (baseline before changes)
```sql
ALTER DATABASE YourDb
SET QUERY_STORE = ON (
    OPERATION_MODE = READ_WRITE,
    CLEANUP_POLICY = (STALE_QUERY_THRESHOLD_DAYS = 30)
);
```
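Once Query Store has a baseline, a query like this sketch surfaces the heaviest statements to watch across your tuning changes (standard Query Store catalog views; the TOP threshold is arbitrary):

```sql
-- Top 10 queries by total captured duration (microseconds)
SELECT TOP (10)
       q.query_id,
       qt.query_sql_text,
       SUM(rs.count_executions)                   AS executions,
       SUM(rs.avg_duration * rs.count_executions) AS total_duration_us
FROM sys.query_store_query AS q
JOIN sys.query_store_query_text AS qt
  ON qt.query_text_id = q.query_text_id
JOIN sys.query_store_plan AS p
  ON p.query_id = q.query_id
JOIN sys.query_store_runtime_stats AS rs
  ON rs.plan_id = p.plan_id
GROUP BY q.query_id, qt.query_sql_text
ORDER BY total_duration_us DESC;
```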
Internal link ideas (official only)
Link these keywords to Microsoft’s official docs:
- Clustered columnstore index (SQL Server docs)
- Table partitioning (partition functions/schemes, switch)
- Change Data Capture (CDC) in SQL Server
- Query Store configuration and best practices
- Resource Governor
- sys.column_store_row_groups DMV
- ALTER TABLE … SWITCH reference
Summary & CTA
By combining CDC for precise incrementals, partitioning for surgical data movement, and columnstore for compressed, batch-mode scans, you turn a sluggish nightly into a resilient, near-real-time pipeline. Start small: partition your largest fact by date, enable Query Store, and benchmark a single report before and after CCI. Once the pattern works, automate monthly boundary management and switch-in loads.
Call to action: Want a versioned runbook (T-SQL + PowerShell) to automate partition SPLIT/MERGE and switch-in? Say “send the runbook” and I’ll include templates and a validation checklist.
Image prompt (for DALL·E/Midjourney)
“A clean, modern data architecture diagram of a Microsoft SQL Server analytics stack showing CDC → staging → partitioned clustered columnstore fact with partition switch-in, plus Query Store feedback loop — minimalistic, high contrast, isometric 3D, blue/steel palette.”
Tags
#SQLServer #DataEngineering #Columnstore #Partitioning #CDC #ETL #PerformanceTuning #QueryStore #DatabaseDesign
Pitch ideas (next articles)
Incremental Dimensional Modeling — SCD2 with CDC, temporal tables, and metadata-driven frameworks.
SQL Server CDC Deep Dive — retention tuning, high-volume patterns, and downstream sync to data lakes/Delta.
Partition Switching Runbook — step-by-step automation with retry logic and integrity checks.
Columnstore Performance Tuning — rowgroup health, segment elimination, dictionary pressure, and batch mode operators.
Hybrid Architectures — SQL Server + Azure Synapse/Azure Data Lake with PolyBase and external tables.
Operational Hardening — Query Store baselining, plan forcing ethics, and automated regression detection.
Hot Paths vs Cold Paths — blending In-Memory OLTP for writes with CCI for reads, and how to route traffic.
Resource Isolation — Resource Governor recipes for isolating ELT, ad-hoc, and reporting groups.