Hybrid Data Architectures: Oracle as System of Record with Kafka, Spark & Snowflake for Analytics Offloading
Hook: Your Oracle OLTP keeps the business alive—but every new dashboard, ML feature, and ad-hoc query drags it down. You can’t let analysts hammer the production database, and nightly dumps are already missing SLAs. The fix isn’t “move everything.” It’s a hybrid where Oracle stays the system of record and Kafka/Spark/Snowflake do the heavy lifting for analytics—without breaking consistency or your budget.
Why this matters
- Protect OLTP throughput. Keep transactions fast by offloading reads and heavy joins.
- Deliver near-real-time insights. Minutes—not days—between a row hitting Oracle and a metric updating in Snowflake.
- Enable flexible modeling. Serve BI, data science, and APIs without contorting your OLTP schema.
- Decouple change. App teams evolve Oracle; data teams ship independently on Kafka/Spark/Snowflake.
Reference architecture (concept)
- Oracle (SoR): ACID writes, normalized schemas.
- CDC Capture: GoldenGate or Debezium reads the redo logs and publishes change events to Kafka (a connector registration sketch follows below).
- Kafka: Durable, ordered change streams by table/topic.
- Spark / Snowpark / Connectors: Transform, compact, enrich.
- Snowflake: Curated analytics models (Bronze/Silver/Gold).
- BI/ML/Reverse ETL: Dashboards, features, operational outputs.
Latency target: < 5 minutes end-to-end for most use cases. Aim for exactly-once semantics in your pipelines.
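To make the CDC capture step concrete, here is a minimal sketch of registering a Debezium Oracle source connector through the Kafka Connect REST API. Treat it as an assumption-laden example: host names, credentials, and the Connect endpoint are placeholders, and property names follow Debezium 2.x (older releases use database.server.name and database.history.* instead). GoldenGate users would configure the equivalent Extract/Replicat instead.
import requests

# Hypothetical Connect endpoint; adjust for your environment.
connect_url = "http://connect:8083/connectors"

debezium_oracle = {
    "name": "oracle-hr-cdc",
    "config": {
        "connector.class": "io.debezium.connector.oracle.OracleConnector",
        "database.hostname": "oracle-prod.internal",   # placeholder host
        "database.port": "1521",
        "database.user": "c##dbzuser",
        "database.password": "********",               # use a Connect secrets provider in practice
        "database.dbname": "ORCLCDB",
        "topic.prefix": "oracle",                      # topic names follow <prefix>.<schema>.<table>
        "table.include.list": "HR.EMPLOYEE,HR.DEPARTMENT",
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": "schema-history.oracle",
        "tasks.max": "1",
    },
}

resp = requests.post(connect_url, json=debezium_oracle, timeout=30)
resp.raise_for_status()
print(resp.json())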
Four practical patterns (and when to use them)
| Pattern | What it does | Pros | Cons | Use when |
|---|---|---|---|---|
| A. “Raw CDC → Snowflake” | Land Oracle row-level changes as-is (Bronze). | Fast to implement; minimal logic. | Lots of small files; hard to query directly. | Quick wins, audit trails. |
| B. “Spark Merge (UPSERT) → Silver” | Batch micro-batches from Kafka; MERGE to current tables. | Query-ready; deduped; idempotent. | Requires careful keys & watermarks. | BI dashboards, API reads. |
| C. “SCD Type 2 in Snowflake” | Track history with effective date ranges. | Full temporal analysis. | Storage cost; more joins. | Compliance, trend analysis. |
| D. “Snowpipe Streaming / Kafka Connect → Snowflake” | Low-latency ingestion directly to Snowflake. | Operationally simple; minutes latency. | Transform after load or with Streams/Tasks. | Near-real-time metrics with light transforms. |
Data modeling & consistency choices
- Keys that survive: Choose a stable business key (e.g., CUSTOMER_ID) and keep Oracle ROWID/SCN for lineage.
- Deletes: Prefer soft delete flags in Oracle. If physical deletes happen, ensure CDC tool emits tombstones and you MERGE them.
- Ordering & idempotency: Use (table, pk, scn) or (pk, op_ts) as a composite identity to dedupe (see the dedupe sketch below).
- CAP trade-off: The Snowflake side is usually AP/eventually consistent. Aim for read-your-writes only where the business demands it.
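To make the ordering and idempotency bullet concrete, here is a minimal PySpark sketch that keeps only the newest change per key. The names cdc_df, emp_id, scn, and op_ts are illustrative assumptions, and this runs on the micro-batch DataFrame inside foreachBatch, since window functions are not supported directly on a streaming DataFrame.
from pyspark.sql import Window
from pyspark.sql import functions as F

# Latest change wins: order by SCN, falling back to the operation timestamp.
w = Window.partitionBy("emp_id").orderBy(F.col("scn").desc(), F.col("op_ts").desc())

latest_per_key = (cdc_df
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn"))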
Stream processing design (Spark + Kafka → Snowflake)
Kafka topic conventions
- oracle.<schema>.<table> (key = primary key JSON, value = row image + op type).
- Partitions: hash by PK to keep order within an entity.
- Retention: ≥ 7 days for replay; compaction on topics that carry upserts (a topic-creation sketch follows).
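A minimal sketch of creating such a per-table topic with the confluent-kafka AdminClient; broker address, partition count, replication factor, and retention values are placeholders to adapt, not recommendations.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "kafka:9092"})  # placeholder brokers

topic = NewTopic(
    "oracle.hr.employee",
    num_partitions=12,            # the PK-based record key hashes to a partition, preserving per-entity order
    replication_factor=3,
    config={
        "cleanup.policy": "compact,delete",             # compaction for upsert-style topics
        "retention.ms": str(7 * 24 * 60 * 60 * 1000),   # >= 7 days for replay
    },
)

futures = admin.create_topics([topic])
for name, fut in futures.items():
    fut.result()  # raises if creation failed
    print(f"created {name}")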
Bronze → Silver with PySpark (micro-batch foreachBatch)
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType, DoubleType

# Debezium-style change-event schema; define it explicitly rather than inferring it.
# Adjust field types to whatever your Debezium converters actually emit.
row_schema = StructType([
    StructField("emp_id", LongType()), StructField("name", StringType()),
    StructField("dept", StringType()), StructField("salary", DoubleType()),
    StructField("updated_at", StringType()),
])
payload_schema = StructType([
    StructField("op", StringType()), StructField("before", row_schema),
    StructField("after", row_schema),
    StructField("source", StructType([StructField("ts_ms", LongType())])),
])
# 1) Read CDC from Kafka
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", KAFKA_BROKERS)
      .option("subscribe", "oracle.hr.employee")
      .option("startingOffsets", "latest")
      .load())
# 2) Parse payload (Debezium envelope: op, before, after, source.ts_ms)
payload = (df.select(F.col("key").cast("string").alias("k"),
                     F.col("value").cast("string").alias("v"))
           .withColumn("j", F.from_json("v", payload_schema))
           .selectExpr("j.op as op", "j.after as after", "j.before as before",
                       "j.source.ts_ms as ts_ms"))
silver_cols = ["emp_id","name","dept","salary","updated_at"]
current = (payload
.filter("op in ('c','u') and after is not null")
.select([F.col(f"after.{c}").alias(c) for c in silver_cols]))
deletes = (payload
.filter("op = 'd' and before is not null")
.select(F.col("before.emp_id").alias("emp_id"))
.withColumn("is_deleted", F.lit(True)))
# Snowflake connection options for the Spark-Snowflake connector (placeholder values)
sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "ETL_USER",
    "sfPassword": "********",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "ANALYTICS_SILVER",
    "sfWarehouse": "WH_ETL",
}

def upsert_to_snowflake(batch_df, batch_id):
    # Dedupe to the latest row per key first (see the earlier sketch) so the MERGE stays deterministic.
    # 1) Land the micro-batch in a staging table via the Spark-Snowflake connector.
    (batch_df.write
        .format("net.snowflake.spark.snowflake")
        .options(**sf_options)
        .option("dbtable", "STG_HR_EMPLOYEE_BATCH")
        .mode("overwrite")
        .save())
    # 2) Run the MERGE on Snowflake; Utils.runQuery executes SQL through the connector.
    merge_sql = """
        MERGE INTO ANALYTICS_SILVER.HR_EMPLOYEE t
        USING ANALYTICS_SILVER.STG_HR_EMPLOYEE_BATCH s
        ON t.emp_id = s.emp_id
        WHEN MATCHED THEN UPDATE SET
          name = s.name, dept = s.dept, salary = s.salary, updated_at = s.updated_at
        WHEN NOT MATCHED THEN INSERT (emp_id, name, dept, salary, updated_at)
          VALUES (s.emp_id, s.name, s.dept, s.salary, s.updated_at)
    """
    spark._jvm.net.snowflake.spark.snowflake.Utils.runQuery(sf_options, merge_sql)
    # Handle deletes similarly with a second MERGE or a soft-delete flag (sketch below the Tip)
(current.writeStream
.outputMode("update")
.foreachBatch(upsert_to_snowflake)
.option("checkpointLocation", "s3://.../chk/hr_employee")
.trigger(processingTime="1 minute")
.start())
Tip: keep checkpointLocation stable and monitored. That directory is your state & exactly-once guarantee.
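If you handle hard deletes as soft deletes in Silver, the tombstone stream can reuse the same staging-plus-MERGE pattern. A minimal sketch, assuming the sf_options, table, and column names from the example above (all illustrative):
def apply_deletes_to_snowflake(batch_df, batch_id):
    # Stage the tombstone keys, then flag the matching Silver rows as deleted.
    (batch_df.write
        .format("net.snowflake.spark.snowflake")
        .options(**sf_options)
        .option("dbtable", "STG_HR_EMPLOYEE_DELETES")
        .mode("overwrite")
        .save())
    spark._jvm.net.snowflake.spark.snowflake.Utils.runQuery(sf_options, """
        MERGE INTO ANALYTICS_SILVER.HR_EMPLOYEE t
        USING ANALYTICS_SILVER.STG_HR_EMPLOYEE_DELETES s
        ON t.emp_id = s.emp_id
        WHEN MATCHED THEN UPDATE SET is_deleted = TRUE
    """)

(deletes.writeStream
    .foreachBatch(apply_deletes_to_snowflake)
    .option("checkpointLocation", "s3://.../chk/hr_employee_deletes")
    .trigger(processingTime="1 minute")
    .start())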
Snowflake ingestion options (choose one)
1) Kafka Connect → Snowflake (fastest path)
- Pros: Minimal code; handles exactly-once with offsets; creates files efficiently.
- Cons: Transformations happen after load; you manage the Connect infrastructure (a sink-connector config sketch follows the SQL below).
- Snowflake side: Use Streams + Tasks to transform into Silver/Gold.
-- Raw landing (the Snowflake Kafka connector writes RECORD_METADATA / RECORD_CONTENT VARIANT columns)
CREATE TABLE RAW.HR_EMPLOYEE_INGEST (RECORD_METADATA VARIANT, RECORD_CONTENT VARIANT);
-- Stream & Task to upsert
CREATE OR REPLACE STREAM STRM_EMPLOYEE ON TABLE RAW.HR_EMPLOYEE_INGEST;
CREATE OR REPLACE TASK T_UPSERT_EMPLOYEE
  WAREHOUSE = WH_T
  SCHEDULE = '1 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('STRM_EMPLOYEE')
AS
MERGE INTO SILVER.HR_EMPLOYEE t
USING (
  SELECT
    COALESCE(v:after.emp_id, v:before.emp_id)::number AS emp_id,   -- deletes only carry "before"
    v:after.name::string                              AS name,
    v:after.dept::string                              AS dept,
    v:after.salary::number                            AS salary,
    TO_TIMESTAMP_NTZ(v:source.ts_ms::number, 3)       AS updated_at,
    v:op::string                                      AS op
  FROM (SELECT RECORD_CONTENT AS v FROM STRM_EMPLOYEE) r
  QUALIFY ROW_NUMBER() OVER (PARTITION BY COALESCE(v:after.emp_id, v:before.emp_id)
                             ORDER BY v:source.ts_ms DESC) = 1     -- latest change per key
) s
ON t.emp_id = s.emp_id
WHEN MATCHED AND s.op = 'd' THEN UPDATE SET is_deleted = TRUE, updated_at = s.updated_at
WHEN MATCHED THEN UPDATE SET name = s.name, dept = s.dept, salary = s.salary, updated_at = s.updated_at
WHEN NOT MATCHED AND s.op IN ('c','u') THEN INSERT (emp_id, name, dept, salary, updated_at)
  VALUES (s.emp_id, s.name, s.dept, s.salary, s.updated_at);
-- Tasks are created suspended; enable the task explicitly
ALTER TASK T_UPSERT_EMPLOYEE RESUME;
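On the Connect side, the Snowflake sink is mostly configuration. A minimal sketch of registering it via the Connect REST API follows; property names come from the Snowflake Kafka connector, but the account, credentials, buffer values, and Connect URL are placeholders to verify against your connector version.
import requests

snowflake_sink = {
    "name": "snowflake-sink-hr",
    "config": {
        "connector.class": "com.snowflake.kafka.connector.SnowflakeSinkConnector",
        "topics": "oracle.hr.employee",
        "snowflake.topic2table.map": "oracle.hr.employee:HR_EMPLOYEE_INGEST",
        "snowflake.url.name": "myaccount.snowflakecomputing.com:443",
        "snowflake.user.name": "KAFKA_LOADER",
        "snowflake.private.key": "********",          # key-pair auth; keep it in a secrets provider
        "snowflake.database.name": "ANALYTICS",
        "snowflake.schema.name": "RAW",
        "buffer.flush.time": "60",                    # seconds; trades latency against file size
        "buffer.count.records": "10000",
        # "snowflake.ingestion.method": "SNOWPIPE_STREAMING",  # optional lower-latency mode, if your version supports it
        "tasks.max": "4",
    },
}

resp = requests.post("http://connect:8083/connectors", json=snowflake_sink, timeout=30)
resp.raise_for_status()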
2) Snowpipe Streaming (lowest latency managed ingest)
- Push from an app/connector directly into a table with seconds-level latency.
- Great for metrics/operational analytics; keep transforms in Streams/Tasks.
3) Spark → Snowflake MERGE (control & complex transforms)
- Use foreachBatch to run MERGE with business logic, joins, and SCDs.
- Slightly higher ops complexity; excellent for curated Silver/Gold.
SCD Type 2 in Snowflake (compact & fast)
CREATE TABLE MARTS.HR_EMPLOYEE_SCD2 (
EMP_ID NUMBER,
NAME STRING,
DEPT STRING,
SALARY NUMBER,
VALID_FROM TIMESTAMP_NTZ,
VALID_TO TIMESTAMP_NTZ,
IS_CURRENT BOOLEAN,
HASH_KEY BINARY(16) -- md5 of business attributes
);
-- SCD2 in two steps: close out changed current rows, then insert new current versions
MERGE INTO MARTS.HR_EMPLOYEE_SCD2 tgt
USING STAGE.SILVER_EMPLOYEE src
  ON tgt.EMP_ID = src.EMP_ID AND tgt.IS_CURRENT = TRUE
WHEN MATCHED AND MD5_BINARY(CONCAT_WS('|', src.NAME, src.DEPT, src.SALARY::string)) <> tgt.HASH_KEY THEN
  UPDATE SET VALID_TO = CURRENT_TIMESTAMP(), IS_CURRENT = FALSE;

INSERT INTO MARTS.HR_EMPLOYEE_SCD2
  (EMP_ID, NAME, DEPT, SALARY, VALID_FROM, VALID_TO, IS_CURRENT, HASH_KEY)
SELECT src.EMP_ID, src.NAME, src.DEPT, src.SALARY,
       CURRENT_TIMESTAMP(), NULL, TRUE,
       MD5_BINARY(CONCAT_WS('|', src.NAME, src.DEPT, src.SALARY::string))
FROM STAGE.SILVER_EMPLOYEE src
LEFT JOIN MARTS.HR_EMPLOYEE_SCD2 tgt
  ON tgt.EMP_ID = src.EMP_ID AND tgt.IS_CURRENT = TRUE
WHERE tgt.EMP_ID IS NULL;  -- new keys, plus keys whose current row was just closed
Use Streams on SILVER to only process changed rows. Consider Hybrid Tables or Search Optimization for current-row lookups.
Observability & governance you cannot skip
- Data contracts: For each Oracle table, declare schema, PK, delete behavior, latency SLO, nullability, semantics. Enforce with Great Expectations or Soda on Bronze/Silver.
- Lineage: Capture (topic → job → table) with OpenLineage/Marquez or Snowflake's Access History; store SCN/LSN in columns for audit.
- Replay: Keep Kafka retention ≥ 7 days and maintain idempotent MERGE logic to reprocess safely.
- Schema evolution: Require supplemental logging in Oracle; block breaking changes; auto-add nullable columns downstream.
- Cost controls:
- Micro-batch 1–5 min; auto-suspend warehouses.
- File sizing: 128–512 MB for staged loads.
  - Prune with clustering on BUSINESS_DATE / PK_HASH.
- Security: Row/column masking in Snowflake (see the masking-policy sketch below); network policies; rotate credentials for connectors.
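As one concrete instance of the security item, here is a minimal sketch of a column masking policy applied from Python via the Snowflake connector. The account, roles, warehouse, and object names are illustrative assumptions.
import snowflake.connector

# Placeholder connection details; prefer key-pair auth or SSO in practice.
conn = snowflake.connector.connect(
    account="myaccount", user="SECURITY_ADMIN", password="********",
    warehouse="WH_ADMIN", database="ANALYTICS", schema="SILVER",
)
cur = conn.cursor()
# Only privileged roles see raw salaries; everyone else gets NULL.
cur.execute("""
    CREATE OR REPLACE MASKING POLICY SALARY_MASK AS (val NUMBER) RETURNS NUMBER ->
      CASE WHEN CURRENT_ROLE() IN ('HR_ADMIN', 'FINANCE_ANALYST') THEN val ELSE NULL END
""")
cur.execute("ALTER TABLE SILVER.HR_EMPLOYEE MODIFY COLUMN SALARY SET MASKING POLICY SALARY_MASK")
cur.close()
conn.close()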
Common pitfalls (blunt & practical)
- Hammering Oracle with SELECT polling instead of redo-based CDC. Fix: use GoldenGate/Debezium; enable supplemental logging.
- Out-of-order upserts causing “zombie” records. Fix: order by SCN/op_ts and use last-write-wins logic in MERGE.
- Tiny files tanking Snowflake performance. Fix: stage compaction or Spark coalesce before load.
- Ignoring deletes. If the business cares, model them. Emit tombstones and MERGE as soft deletes.
- One giant topic for everything. Fix: per-table topics, hash partition by PK, set compaction/retention correctly.
- Undefined SLAs. "Near real time" is meaningless. Set numbers (e.g., 95% of rows < 5 min) and monitor E2E lag (a lag-check sketch follows).
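To make the SLA measurable, a minimal sketch of an end-to-end lag check against the Silver table; it assumes the source-commit timestamp landed as updated_at, and the connection details, table name, and threshold are illustrative.
import snowflake.connector

LAG_SLO_SECONDS = 300  # target: changes visible within 5 minutes

conn = snowflake.connector.connect(
    account="myaccount", user="MONITOR", password="********",
    warehouse="WH_MONITOR", database="ANALYTICS", schema="SILVER",
)
cur = conn.cursor()
# Approximate E2E lag: now minus the newest source-commit timestamp we have loaded.
cur.execute("SELECT DATEDIFF('second', MAX(updated_at), CURRENT_TIMESTAMP()) FROM HR_EMPLOYEE")
lag_seconds = cur.fetchone()[0]
if lag_seconds is None or lag_seconds > LAG_SLO_SECONDS:
    print(f"ALERT: pipeline lag {lag_seconds}s exceeds SLO of {LAG_SLO_SECONDS}s")
cur.close()
conn.close()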
Minimal runbook: from zero to value
- Enable CDC on Oracle (supplemental logging; see the sketch after this runbook) → stand up GoldenGate/Debezium.
- Create Kafka topics per table; partition by PK; set retention & compaction.
- Land Bronze (raw JSON/VARIANT) to Snowflake via Connect or Snowpipe Streaming.
- Build Silver with Spark/Snowpark MERGEs (dedupe, columns typed, deletes handled).
- Publish Gold marts for BI/ML; add Streams & Tasks for incremental refresh.
- Wire up tests, lineage, alerts (schema drift, null explosions, late data, lag).
- Document contracts and RPO/RTO. Rehearse replay and backfill.
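For step one, a minimal sketch of enabling the Oracle-side prerequisites with python-oracledb. The credentials and DSN are placeholders, the statements require elevated privileges, and the exact scope (database-level vs table-level, ALL vs PRIMARY KEY columns) should follow your CDC tool's documentation.
import oracledb

# Placeholder DBA credentials and DSN; SYSDBA (or equivalent) is typically required for ALTER DATABASE.
conn = oracledb.connect(user="C##DBADMIN", password="********",
                        dsn="oracle-prod.internal:1521/ORCLCDB",
                        mode=oracledb.AUTH_MODE_SYSDBA)
cur = conn.cursor()
# Minimal database-level supplemental logging, plus per-table logging for captured tables.
cur.execute("ALTER DATABASE ADD SUPPLEMENTAL LOG DATA")
cur.execute("ALTER TABLE HR.EMPLOYEE ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS")
cur.close()
conn.close()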
Conclusion & takeaways
- Keep Oracle focused on transactions; move analytics to Kafka/Spark/Snowflake.
- Choose ingestion based on latency vs complexity: Connect/Snowpipe (simpler), Spark MERGE (control).
- Build with idempotency, ordering, and deletes in mind from day one.
- Govern with contracts, lineage, and tests, or your “hybrid” becomes a guessing game.
- Define hard SLAs and observe them. If you can’t measure it, you don’t have it.
Call to action: Want a tailored blueprint for your Oracle schema and SLAs? Tell me your top 5 tables, latency target, and current CDC tool—I’ll draft a concrete plan you can implement this week.
Internal link ideas
- “CDC 101: GoldenGate vs Debezium—latency, cost, ops”
- “Designing SCD2 in Snowflake with Streams & Tasks”
- “Kafka partitioning for exactly-once upserts”
- “Snowflake cost tuning: file sizing, clustering, and warehouse right-sizing”
- “Schema evolution playbook for NoSQL + RDBMS hybrids”
Image prompt (for DALL·E/Midjourney)
“A clean, modern hybrid data architecture diagram showing Oracle as system of record with redo-based CDC into Kafka, Spark processing into Snowflake Bronze/Silver/Gold layers, and BI/ML consumers; minimalistic, high contrast, 3D isometric style; labeled flows for upserts, deletes, and SCD2.”
Tags
#HybridArchitecture #Oracle #Kafka #Spark #Snowflake #CDC #DataEngineering #RealTimeAnalytics #SCD #DataModeling