From Bronze to Gold: A Medallion Architecture You Can Actually Operate (with Delta Lake)
Meta description (158 chars):
Practical guide to Medallion Architecture on Delta Lake—modeling, CDC, quality gates, and ops playbook—built for mid-level data engineers who run pipelines.
Why this matters (intro)
You don’t get paged because your diagrams are pretty—you get paged because last night’s job rebuilt 30 TB and still shipped bad records to finance. The Medallion Architecture (Bronze → Silver → Gold) promises order in a chaotic lake, but most teams stall at “we drew it.” This article shows how to operate it for real: Delta Lake modeling patterns, quality gates, CDC upserts, compaction/Z-ORDER, and a pragmatic SLO mindset that fits day-2 operations.
The Medallion architecture at a glance
| Layer | Purpose | Write Pattern | Read Pattern | Retention | Typical Consumers |
|---|---|---|---|---|---|
| Bronze | Raw, immutable(ish) history; auditability | Append-only (stream or batch) | Rarely queried directly | Long | Data engineers, replay/forensics |
| Silver | Cleaned, conformed entities; deduped + typed | MERGE/Upsert; CDC applied | BI/ML feature building | Medium/Long | Analysts, feature engineers |
| Gold | Business-ready marts (star schemas, KPIs) | Full or incremental rebuild of modeled slices | Low-latency reads; serving | Short/Medium | BI dashboards, execs, apps |
Key idea: Cost of change rises up the stack. Keep Bronze cheap to write; push business semantics to Gold; keep Silver as the stable “truthy” contract between raw chaos and business logic.
Delta Lake building blocks you’ll actually use
- ACID transactions on the lake (no partial writes).
- Schema enforcement & evolution (mergeSchema, constraints).
- Time travel for rollback and reproducibility (quick sketch after this list).
- MERGE INTO for reliable upserts/CDC.
- OPTIMIZE + ZORDER to reduce small files and speed selective queries.
- VACUUM to prune old files safely.
- Table constraints and Delta expectations for data quality.
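Most of these get exercised in the walkthrough below; time travel deserves a ten-second demo up front. A minimal PySpark sketch, reusing the lake.bronze_orders table built in the next section (the version number is a placeholder):
# Inspect the commit history: versions, timestamps, operations, operation metrics
spark.sql("DESCRIBE HISTORY lake.bronze_orders").show(5, truncate=False)
# Re-read the table exactly as it looked at an earlier version (TIMESTAMP AS OF also works)
as_of = spark.sql("SELECT * FROM lake.bronze_orders VERSION AS OF 42")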
Reference data model (end-to-end)
1) Bronze: raw ingestion (append-only)
Goals: land everything, preserve lineage, never block on upstream quality.
from pyspark.sql.functions import input_file_name, current_timestamp
raw = (spark.readStream
    .format("cloudFiles")  # Databricks Auto Loader, or use a standard file source
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "s3://datalake/_schemas/orders_bronze")  # required for Auto Loader schema inference
    .load("s3://datalake/landing/orders/"))
bronze = (raw
    .withColumn("_source_file", input_file_name())
    .withColumn("_ingest_ts", current_timestamp()))
(bronze.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://datalake/_chk/orders_bronze")
    .outputMode("append")
    .toTable("lake.bronze_orders"))
Table properties (recommended):
ALTER TABLE lake.bronze_orders SET TBLPROPERTIES (
delta.appendOnly = true,
delta.minReaderVersion = 2,
delta.minWriterVersion = 7
);
Retention & cleanup:
-- Compact small files
OPTIMIZE lake.bronze_orders;
-- Prune files outside the retention window; keep it long enough for replay (adjust as needed)
VACUUM lake.bronze_orders RETAIN 168 HOURS; -- 7 days
2) Silver: dedup, typing, and CDC application
Goals: one-row-per-business-key with latest values; enforce basic contracts (types, nullability).
Example: apply CDC from upstream (OpType I/U/D)
from pyspark.sql.functions import col, expr
bronze = spark.readStream.table("lake.bronze_orders")
# Cast + filter malformed
typed = (bronze
    .selectExpr(
        "cast(order_id as string) as order_id",
        "cast(customer_id as string) as customer_id",
        "cast(order_ts as timestamp) as order_ts",
        "cast(status as string) as status",
        "cast(total_amount as decimal(18,2)) as total_amount",
        "cast(op as string) as op"  # 'I','U','D'
    )
    .where("order_id is not null and order_ts is not null"))
def upsert_to_silver(batch_df, batch_id):
    # Expose the micro-batch as a view and merge it into Silver in one transaction.
    # If a single batch can carry several changes for one order_id, dedupe to the
    # latest change first (see "CDC duplicates" under common pitfalls).
    batch_df.createOrReplaceTempView("changes")
    batch_df.sparkSession.sql("""
        MERGE INTO lake.silver_orders AS t
        USING changes AS s
        ON t.order_id = s.order_id
        WHEN MATCHED AND s.op = 'D' THEN DELETE
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED AND s.op IN ('I','U') THEN INSERT *
    """)

(typed.writeStream
    .trigger(processingTime="1 minute")
    .foreachBatch(upsert_to_silver)  # runs the MERGE once per micro-batch
    .option("checkpointLocation", "s3://datalake/_chk/orders_silver_merge")
    .start())
Hardening the contract:
ALTER TABLE lake.silver_orders
ADD CONSTRAINT ck_amount CHECK (total_amount >= 0);
-- Expectation-style checks (fail or quarantine) live in the quality gates below;
-- this property widens data-skipping statistics to cover more of your filtered columns
ALTER TABLE lake.silver_orders SET TBLPROPERTIES (
  delta.dataSkippingNumIndexedCols = 32
);
Performance hygiene:
OPTIMIZE lake.silver_orders ZORDER BY (customer_id, order_ts);
VACUUM lake.silver_orders RETAIN 168 HOURS;
3) Gold: business marts you can serve
Goal: deliver dimensional or wide tables with stable SLAs, not just SQL that “usually works.”
Example: daily order KPI fact
CREATE OR REPLACE TABLE mart.gold_orders_daily AS
SELECT
date_trunc('DAY', order_ts) AS order_date,
count(*) AS orders,
sum(total_amount) AS gross_sales,
approx_percentile(total_amount, 0.5) AS p50_order_value
FROM lake.silver_orders
WHERE status IN ('paid','shipped')
GROUP BY 1;
-- Optional serving optimizations
OPTIMIZE mart.gold_orders_daily ZORDER BY (order_date);
Gold contract (publish it):
- Grain: day
- Dimensions: date
- Measures: orders, gross_sales, p50_order_value
- Filters baked in: status IN ('paid','shipped')
- SLA: ready by 06:30 ET; backfill allowed; schema changes via versioned RFC
Query patterns that scale
- Selective reads: Z-ORDER on most-filtered columns (e.g., customer_id, order_ts).
- Time travel for “as of” reports: SELECT * FROM mart.gold_orders_daily VERSION AS OF 123;
- Incremental rebuilds: partition Gold by date; rebuild only the last N days (see the sketch after this list).
- Late-arriving data: run a rolling merge window in Silver (e.g., last 7 days).
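A minimal sketch of the incremental rebuild, assuming mart.gold_orders_daily is partitioned by order_date and your Delta version supports replaceWhere on that predicate; the 7-day window and cutoff handling are illustrative:
from datetime import date, timedelta
# Recompute only the trailing window and surgically overwrite those partitions
cutoff = (date.today() - timedelta(days=7)).isoformat()
recent = spark.sql(f"""
    SELECT date_trunc('DAY', order_ts)           AS order_date,
           count(*)                              AS orders,
           sum(total_amount)                     AS gross_sales,
           approx_percentile(total_amount, 0.5)  AS p50_order_value
    FROM lake.silver_orders
    WHERE status IN ('paid','shipped') AND order_ts >= '{cutoff}'
    GROUP BY 1
""")
(recent.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", f"order_date >= '{cutoff}'")  # replaces only the matching slice
    .saveAsTable("mart.gold_orders_daily"))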
Data quality: gate it, don’t chase it
Bronze → Silver checks
- Row-level: required fields (order_id, order_ts)
- Type coercion success rates
- Duplicate keys % per micro-batch
Silver → Gold checks
- Referential consistency (dimensions exist)
- Metric drift vs 7-day baseline
- Null ratio on business-critical fields
Operational pattern: quarantine, don’t fail. Route bad records to lake.quarantine_orders with a reason; alert on thresholds.
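A minimal sketch of that routing, written as a helper you could call inside the Silver foreachBatch; the rule, reason label, and lake.quarantine_orders schema are illustrative, not a fixed contract:
from pyspark.sql.functions import coalesce, col, current_timestamp, expr, lit
def split_and_quarantine(batch_df, rule, reason, quarantine_table="lake.quarantine_orders"):
    # Evaluate the rule once; NULL results (e.g. NULL amounts) count as failures
    flagged = batch_df.withColumn("_is_valid", coalesce(expr(rule), lit(False)))
    good = flagged.where(col("_is_valid")).drop("_is_valid")
    bad = (flagged.where(~col("_is_valid")).drop("_is_valid")
        .withColumn("_reason", lit(reason))
        .withColumn("_quarantined_at", current_timestamp()))
    # Park bad rows for review and threshold-based alerting instead of failing the batch
    bad.write.format("delta").mode("append").saveAsTable(quarantine_table)
    return good
# good = split_and_quarantine(batch_df, "order_id IS NOT NULL AND total_amount >= 0", "bronze_to_silver_checks")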
CDC, late data, and SCDs without pain
- CDC upserts belong in Silver using MERGE INTO.
- Late data: re-merge a rolling window (WHERE order_ts >= now() - interval 7 days).
- SCD Type 2 in Delta (a minimal merge sketch follows this list):
  - Keep is_current, valid_from, valid_to
  - Use MERGE with conditional updates and inserts
  - Z-ORDER by business key and valid_from
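A minimal Type 2 merge sketch, assuming a hypothetical lake.silver_customers table (customer_id, address, is_current, valid_from, valid_to) and an updates view holding one incoming change per customer with its effective_ts:
# 'updates' is a temp view of incoming changes: customer_id, address, effective_ts
spark.sql("""
    MERGE INTO lake.silver_customers AS t
    USING (
      -- keyed rows close out the current version when the tracked attribute changed...
      SELECT u.customer_id AS merge_key, u.* FROM updates u
      UNION ALL
      -- ...and the same changed rows, re-sent with a NULL key, fall through to the INSERT branch
      SELECT CAST(NULL AS STRING) AS merge_key, u.*
      FROM updates u
      JOIN lake.silver_customers c
        ON u.customer_id = c.customer_id
      WHERE c.is_current = true AND u.address <> c.address
    ) AS s
    ON t.customer_id = s.merge_key AND t.is_current = true
    WHEN MATCHED AND t.address <> s.address THEN
      UPDATE SET is_current = false, valid_to = s.effective_ts
    WHEN NOT MATCHED THEN
      INSERT (customer_id, address, is_current, valid_from, valid_to)
      VALUES (s.customer_id, s.address, true, s.effective_ts, NULL)
""")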
Cost & performance guardrails
- File sizes: aim for 128–512 MB after OPTIMIZE.
- Partitioning: coarse by date (order_date), not by high-cardinality IDs.
- Caching: cache small dimensions at the compute layer if reused heavily.
- OPTIMIZE cadence: daily for hot Silver/Gold; weekly for Bronze.
- VACUUM safety: retain >= 7 days if you rely on time travel/backfills.
- Z-ORDER columns: choose 1–3 most selective; more isn’t always better.
- Concurrency: prefer Structured Streaming or idempotent batch with checkpoints.
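A minimal maintenance-job sketch that ties several of these guardrails together; the table list, Z-ORDER columns, cadence, and retention are assumptions to tune per workload:
# Run daily for hot Silver/Gold tables; a weekly variant without Z-ORDER can cover Bronze
hot_tables = {
    "lake.silver_orders": "customer_id, order_ts",   # Z-ORDER columns per table
    "mart.gold_orders_daily": "order_date",
}
for table, zorder_cols in hot_tables.items():
    spark.sql(f"OPTIMIZE {table} ZORDER BY ({zorder_cols})")   # compact + cluster
    spark.sql(f"VACUUM {table} RETAIN 168 HOURS")              # >= 7 days keeps time travel/backfills possible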
What to automate from day one (runbook)
- Freshness SLOs per layer (e.g., Bronze ≤ 10 min lag; Silver ≤ 25; Gold ≤ 60).
- Observability: ingestion throughput, MERGE latency, small-file ratio, expectation violations.
- Backfills: parameterized jobs that accept a date range; never hand-edit paths.
- Schema changes: require PR + environment promotion; enable mergeSchema only in Silver.
- Access model: read-only to Bronze; curated read to Silver; governed access to Gold.
- Disaster drills: restore a table at VERSION AS OF X; replay N days from Bronze.
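A minimal drill sketch; the version number is a placeholder you would pick from the history output, and the replay window is illustrative:
# Find a known-good version to roll back to
spark.sql("DESCRIBE HISTORY mart.gold_orders_daily").show(10, truncate=False)
# Restore in place; the restore itself lands as a new, auditable commit
spark.sql("RESTORE TABLE mart.gold_orders_daily TO VERSION AS OF 123")
# Replay recent history from Bronze by re-running the Silver merge over a window
replay = spark.table("lake.bronze_orders").where("_ingest_ts >= current_date() - INTERVAL 3 DAYS")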
Common pitfalls (and blunt fixes)
- Millions of tiny files: you forgot compaction; schedule OPTIMIZE.
- Gold rebuilds take all night: partition smarter; rebuild sliding windows; precompute heavy joins in Silver.
- CDC duplicates: wrong join key in MERGE; add a natural key + event-time ordering (see the dedup sketch after this list).
- “It worked locally” configs in prod: lock table properties and enforce via CI checks.
- BI joins are flaky: your Silver isn’t conformed; invest in entity resolution and keys.
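A minimal dedup sketch for the CDC-duplicates pitfall, using the column names from the Silver example; changes_df stands in for the micro-batch of CDC rows:
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number
# Natural key + event-time ordering: keep only the latest change per order_id before merging
latest_per_key = Window.partitionBy("order_id").orderBy(col("order_ts").desc())
deduped_changes = (changes_df
    .withColumn("_rn", row_number().over(latest_per_key))
    .where("_rn = 1")
    .drop("_rn"))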
Conclusion & takeaways
- Operate, don’t just architect. Treat Bronze/Silver/Gold as SLO-bound services.
- Delta Lake gives you the primitives (ACID, MERGE, OPTIMIZE, VACUUM, constraints) to make it real.
- Model for change: push semantics up, keep ingestion simple, automate compaction and quality gates.
- Document contracts (grain, keys, SLAs) where consumers can see them.
Call to action:
Want a copy-paste starter? Ask me for a Medallion quickstart repo (jobs, checks, and sample data) tuned for your cloud + scheduler.
Internal link ideas (add on your site)
- “Delta Lake Basics: ACID on Object Storage”
- “CDC with Spark MERGE: Patterns & Anti-Patterns”
- “Z-ORDER vs Partitioning: When to Use Which”
- “Data Quality Gates with Expectations”
- “Designing SLAs/SLOs for Analytics Pipelines”
- “Choosing Between Delta, Hudi, and Iceberg”
(If you cite externals, stick to official docs: Delta Lake, Apache Spark, Databricks, AWS/GCP/Azure storage, etc.)
Image prompt
“A clean, modern data architecture diagram of a Medallion (Bronze→Silver→Gold) Delta Lake on cloud object storage, showing ingestion, CDC merges, quality gates, and OPTIMIZE/Z-ORDER—minimalistic, high contrast, 3D isometric style.”
Tags
#DeltaLake #MedallionArchitecture #DataEngineering #ApacheSpark #Lakehouse #CDC #DataQuality #Scalability #ETL #Analytics