From Bronze to Gold: A Medallion Architecture You Can Actually Operate (with Delta Lake)
Meta description (158 chars):
Practical guide to Medallion Architecture on Delta Lake—modeling, CDC, quality gates, and ops playbook—built for mid-level data engineers who run pipelines.
Why this matters (intro)
You don’t get paged because your diagrams are pretty—you get paged because last night’s job rebuilt 30 TB and still shipped bad records to finance. The Medallion Architecture (Bronze → Silver → Gold) promises order in a chaotic lake, but most teams stall at “we drew it.” This article shows how to operate it for real: Delta Lake modeling patterns, quality gates, CDC upserts, compaction/Z-ORDER, and a pragmatic SLO mindset that fits day-2 operations.
The Medallion architecture at a glance
| Layer | Purpose | Write Pattern | Read Pattern | Retention | Typical Consumers |
|---|---|---|---|---|---|
| Bronze | Raw, immutable(ish) history; auditability | Append-only (stream or batch) | Rarely queried directly | Long | Data engineers, replay/forensics |
| Silver | Cleaned, conformed entities; deduped + typed | MERGE/Upsert; CDC applied | BI/ML feature building | Medium/Long | Analysts, feature engineers |
| Gold | Business-ready marts (star schemas, KPIs) | Full or incremental rebuild of modeled slices | Low-latency reads; serving | Short/Medium | BI dashboards, execs, apps |
Key idea: Cost of change rises up the stack. Keep Bronze cheap to write; push business semantics to Gold; keep Silver as the stable “truthy” contract between raw chaos and business logic.
Delta Lake building blocks you’ll actually use
- ACID transactions on the lake (no partial writes).
- Schema enforcement & evolution (mergeSchema, constraints).
- Time travel for rollback and reproducibility (quick sketch after this list).
- MERGE INTO for reliable upserts/CDC.
- OPTIMIZE + ZORDER to reduce small files and speed selective queries.
- VACUUM to prune old files safely.
- Table constraints and Delta expectations for data quality.
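Most of these get exercised in the walkthrough below; time travel deserves a ten-second demo up front. A minimal PySpark sketch, reusing the lake.bronze_orders table built in the next section (the version number is a placeholder):
# Inspect the commit history: versions, timestamps, operations, operation metrics
spark.sql("DESCRIBE HISTORY lake.bronze_orders").show(5, truncate=False)
# Re-read the table exactly as it looked at an earlier version (TIMESTAMP AS OF also works)
as_of = spark.sql("SELECT * FROM lake.bronze_orders VERSION AS OF 42")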
Reference data model (end-to-end)
1) Bronze: raw ingestion (append-only)
Goals: land everything, preserve lineage, never block on upstream quality.
from pyspark.sql.functions import input_file_name, current_timestamp
raw = (spark.readStream
    .format("cloudFiles")  # Databricks Auto Loader, or use a standard file source
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "s3://datalake/_schemas/orders_bronze")  # required for Auto Loader schema inference
    .load("s3://datalake/landing/orders/"))
bronze = (raw
    .withColumn("_source_file", input_file_name())
    .withColumn("_ingest_ts", current_timestamp()))
(bronze.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://datalake/_chk/orders_bronze")
    .outputMode("append")
    .toTable("lake.bronze_orders"))
Table properties (recommended):
ALTER TABLE lake.bronze_orders SET TBLPROPERTIES (
delta.appendOnly = true,
delta.minReaderVersion = 2,
delta.minWriterVersion = 7
);
Retention & cleanup:
-- Compact small files
OPTIMIZE lake.bronze_orders;
-- Prune files outside the retention window; keep it long enough for replay (adjust as needed)
VACUUM lake.bronze_orders RETAIN 168 HOURS; -- 7 days
2) Silver: dedup, typing, and CDC application
Goals: one-row-per-business-key with latest values; enforce basic contracts (types, nullability).
Example: apply CDC from upstream (OpType I/U/D)
from pyspark.sql.functions import col, expr
bronze = spark.readStream.table("lake.bronze_orders")
# Cast + filter malformed
typed = (bronze
    .selectExpr(
        "cast(order_id as string) as order_id",
        "cast(customer_id as string) as customer_id",
        "cast(order_ts as timestamp) as order_ts",
        "cast(status as string) as status",
        "cast(total_amount as decimal(18,2)) as total_amount",
        "cast(op as string) as op"  # 'I','U','D'
    )
    .where("order_id is not null and order_ts is not null"))
def upsert_to_silver(batch_df, batch_id):
    # Expose the micro-batch as a view and merge it into Silver in one transaction.
    # If a single batch can carry several changes for one order_id, dedupe to the
    # latest change first (see "CDC duplicates" under common pitfalls).
    batch_df.createOrReplaceTempView("changes")
    batch_df.sparkSession.sql("""
        MERGE INTO lake.silver_orders AS t
        USING changes AS s
        ON t.order_id = s.order_id
        WHEN MATCHED AND s.op = 'D' THEN DELETE
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED AND s.op IN ('I','U') THEN INSERT *
    """)

(typed.writeStream
    .trigger(processingTime="1 minute")
    .foreachBatch(upsert_to_silver)  # runs the MERGE once per micro-batch
    .option("checkpointLocation", "s3://datalake/_chk/orders_silver_merge")
    .start())
Hardening the contract:
ALTER TABLE lake.silver_orders
ADD CONSTRAINT ck_amount CHECK (total_amount >= 0);
-- Expectation-style checks (fail or quarantine) live in the quality gates below;
-- this property widens data-skipping statistics to cover more of your filtered columns
ALTER TABLE lake.silver_orders SET TBLPROPERTIES (
  delta.dataSkippingNumIndexedCols = 32
);
Performance hygiene:
OPTIMIZE lake.silver_orders ZORDER BY (customer_id, order_ts);
VACUUM lake.silver_orders RETAIN 168 HOURS;
3) Gold: business marts you can serve
Goal: deliver dimensional or wide tables with stable SLAs, not just SQL that “usually works.”
Example: daily order KPI fact
CREATE OR REPLACE TABLE mart.gold_orders_daily AS
SELECT
date_trunc('DAY', order_ts) AS order_date,
count(*) AS orders,
sum(total_amount) AS gross_sales,
approx_percentile(total_amount, 0.5) AS p50_order_value
FROM lake.silver_orders
WHERE status IN ('paid','shipped')
GROUP BY 1;
-- Optional serving optimizations
OPTIMIZE mart.gold_orders_daily ZORDER BY (order_date);
Gold contract (publish it):
- Grain: day
- Dimensions: date
- Measures: orders, gross_sales, p50_order_value
- Filters baked in: status IN ('paid','shipped')
- SLA: ready by 06:30 ET; backfill allowed; schema changes via versioned RFC
Query patterns that scale
- Selective reads: Z-ORDER on most-filtered columns (e.g., customer_id, order_ts).
- Time travel for “as of” reports: SELECT * FROM mart.gold_orders_daily VERSION AS OF 123;
- Incremental rebuilds: partition Gold by date; rebuild only the last N days (see the sketch after this list).
- Late-arriving data: run a rolling merge window in Silver (e.g., last 7 days).
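A minimal sketch of the incremental rebuild, assuming mart.gold_orders_daily is partitioned by order_date and your Delta version supports replaceWhere on that predicate; the 7-day window and cutoff handling are illustrative:
from datetime import date, timedelta
# Recompute only the trailing window and surgically overwrite those partitions
cutoff = (date.today() - timedelta(days=7)).isoformat()
recent = spark.sql(f"""
    SELECT date_trunc('DAY', order_ts)           AS order_date,
           count(*)                              AS orders,
           sum(total_amount)                     AS gross_sales,
           approx_percentile(total_amount, 0.5)  AS p50_order_value
    FROM lake.silver_orders
    WHERE status IN ('paid','shipped') AND order_ts >= '{cutoff}'
    GROUP BY 1
""")
(recent.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", f"order_date >= '{cutoff}'")  # replaces only the matching slice
    .saveAsTable("mart.gold_orders_daily"))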
Data quality: gate it, don’t chase it
Bronze → Silver checks
- Row-level: required fields (order_id, order_ts)
- Type coercion success rates
- Duplicate keys % per micro-batch
Silver → Gold checks
- Referential consistency (dimensions exist)
- Metric drift vs 7-day baseline
- Null ratio on business-critical fields
Operational pattern: quarantine, don’t fail. Route bad records to lake.quarantine_orders with a reason; alert on thresholds.
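A minimal sketch of that routing, written as a helper you could call inside the Silver foreachBatch; the rule, reason label, and lake.quarantine_orders schema are illustrative, not a fixed contract:
from pyspark.sql.functions import coalesce, col, current_timestamp, expr, lit
def split_and_quarantine(batch_df, rule, reason, quarantine_table="lake.quarantine_orders"):
    # Evaluate the rule once; NULL results (e.g. NULL amounts) count as failures
    flagged = batch_df.withColumn("_is_valid", coalesce(expr(rule), lit(False)))
    good = flagged.where(col("_is_valid")).drop("_is_valid")
    bad = (flagged.where(~col("_is_valid")).drop("_is_valid")
        .withColumn("_reason", lit(reason))
        .withColumn("_quarantined_at", current_timestamp()))
    # Park bad rows for review and threshold-based alerting instead of failing the batch
    bad.write.format("delta").mode("append").saveAsTable(quarantine_table)
    return good
# good = split_and_quarantine(batch_df, "order_id IS NOT NULL AND total_amount >= 0", "bronze_to_silver_checks")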
CDC, late data, and SCDs without pain
- CDC upserts belong in Silver using MERGE INTO.
- Late data: re-merge a rolling window (WHERE order_ts >= now() - interval 7 days).
- SCD Type 2 in Delta (a minimal merge sketch follows this list):
  - Keep is_current, valid_from, valid_to
  - Use MERGE with conditional updates and inserts
  - Z-ORDER by business key and valid_from
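A minimal Type 2 merge sketch, assuming a hypothetical lake.silver_customers table (customer_id, address, is_current, valid_from, valid_to) and an updates view holding one incoming change per customer with its effective_ts:
# 'updates' is a temp view of incoming changes: customer_id, address, effective_ts
spark.sql("""
    MERGE INTO lake.silver_customers AS t
    USING (
      -- keyed rows close out the current version when the tracked attribute changed...
      SELECT u.customer_id AS merge_key, u.* FROM updates u
      UNION ALL
      -- ...and the same changed rows, re-sent with a NULL key, fall through to the INSERT branch
      SELECT CAST(NULL AS STRING) AS merge_key, u.*
      FROM updates u
      JOIN lake.silver_customers c
        ON u.customer_id = c.customer_id
      WHERE c.is_current = true AND u.address <> c.address
    ) AS s
    ON t.customer_id = s.merge_key AND t.is_current = true
    WHEN MATCHED AND t.address <> s.address THEN
      UPDATE SET is_current = false, valid_to = s.effective_ts
    WHEN NOT MATCHED THEN
      INSERT (customer_id, address, is_current, valid_from, valid_to)
      VALUES (s.customer_id, s.address, true, s.effective_ts, NULL)
""")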
Cost & performance guardrails
- File sizes: aim for 128–512 MB after OPTIMIZE.
- Partitioning: coarse by date (order_date), not by high-cardinality IDs.
- Caching: cache small dimensions at the compute layer if reused heavily.
- OPTIMIZE cadence: daily for hot Silver/Gold; weekly for Bronze.
- VACUUM safety: retain >= 7 days if you rely on time travel/backfills.
- Z-ORDER columns: choose 1–3 most selective; more isn’t always better.
- Concurrency: prefer Structured Streaming or idempotent batch with checkpoints.
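A minimal maintenance-job sketch that ties several of these guardrails together; the table list, Z-ORDER columns, cadence, and retention are assumptions to tune per workload:
# Run daily for hot Silver/Gold tables; a weekly variant without Z-ORDER can cover Bronze
hot_tables = {
    "lake.silver_orders": "customer_id, order_ts",   # Z-ORDER columns per table
    "mart.gold_orders_daily": "order_date",
}
for table, zorder_cols in hot_tables.items():
    spark.sql(f"OPTIMIZE {table} ZORDER BY ({zorder_cols})")   # compact + cluster
    spark.sql(f"VACUUM {table} RETAIN 168 HOURS")              # >= 7 days keeps time travel/backfills possible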
What to automate from day one (runbook)
- Freshness SLOs per layer (e.g., Bronze ≤ 10 min lag; Silver ≤ 25; Gold ≤ 60).
- Observability: ingestion throughput, MERGE latency, small-file ratio, expectation violations.
- Backfills: parameterized jobs that accept a date range; never hand-edit paths.
- Schema changes: require PR + environment promotion; enable mergeSchema only in Silver.
- Access model: read-only to Bronze; curated read to Silver; governed access to Gold.
- Disaster drills: restore a table at VERSION AS OF X; replay N days from Bronze.
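A minimal drill sketch; the version number is a placeholder you would pick from the history output, and the replay window is illustrative:
# Find a known-good version to roll back to
spark.sql("DESCRIBE HISTORY mart.gold_orders_daily").show(10, truncate=False)
# Restore in place; the restore itself lands as a new, auditable commit
spark.sql("RESTORE TABLE mart.gold_orders_daily TO VERSION AS OF 123")
# Replay recent history from Bronze by re-running the Silver merge over a window
replay = spark.table("lake.bronze_orders").where("_ingest_ts >= current_date() - INTERVAL 3 DAYS")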
Common pitfalls (and blunt fixes)
- Millions of tiny files: you forgot compaction; schedule OPTIMIZE.
- Gold rebuilds take all night: partition smarter; rebuild sliding windows; precompute heavy joins in Silver.
- CDC duplicates: wrong join key in MERGE; add a natural key + event-time ordering (see the dedup sketch after this list).
- “It worked locally” configs in prod: lock table properties and enforce via CI checks.
- BI joins are flaky: your Silver isn’t conformed; invest in entity resolution and keys.
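A minimal dedup sketch for the CDC-duplicates pitfall, using the column names from the Silver example; changes_df stands in for the micro-batch of CDC rows:
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number
# Natural key + event-time ordering: keep only the latest change per order_id before merging
latest_per_key = Window.partitionBy("order_id").orderBy(col("order_ts").desc())
deduped_changes = (changes_df
    .withColumn("_rn", row_number().over(latest_per_key))
    .where("_rn = 1")
    .drop("_rn"))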
Conclusion & takeaways
- Operate, don’t just architect. Treat Bronze/Silver/Gold as SLO-bound services.
- Delta Lake gives you the primitives (ACID, MERGE, OPTIMIZE, VACUUM, constraints) to make it real.
- Model for change: push semantics up, keep ingestion simple, automate compaction and quality gates.
- Document contracts (grain, keys, SLAs) where consumers can see them.
Call to action:
Want a copy-paste starter? Ask me for a Medallion quickstart repo (jobs, checks, and sample data) tuned for your cloud + scheduler.
Internal link ideas (add on your site)
- “Delta Lake Basics: ACID on Object Storage”
- “CDC with Spark MERGE: Patterns & Anti-Patterns”
- “Z-ORDER vs Partitioning: When to Use Which”
- “Data Quality Gates with Expectations”
- “Designing SLAs/SLOs for Analytics Pipelines”
- “Choosing Between Delta, Hudi, and Iceberg”
(If you cite externals, stick to official docs: Delta Lake, Apache Spark, Databricks, AWS/GCP/Azure storage, etc.)
Image prompt
“A clean, modern data architecture diagram of a Medallion (Bronze→Silver→Gold) Delta Lake on cloud object storage, showing ingestion, CDC merges, quality gates, and OPTIMIZE/Z-ORDER—minimalistic, high contrast, 3D isometric style.”
Tags
#DeltaLake #MedallionArchitecture #DataEngineering #ApacheSpark #Lakehouse #CDC #DataQuality #Scalability #ETL #Analytics