From Oracle to Azure Lakehouse: A Practical Medallion (Bronze/Silver/Gold) Blueprint
A senior-architect walkthrough of how to ingest transactional Oracle data via CDC and semi-structured CSVs into Azure using a modern Lakehouse approach—ending in a BI-ready Gold layer and a scalable data warehouse.
While this blueprint uses Oracle as the primary example for Change Data Capture (CDC), the same architecture applies equally well to SQL Server. Both platforms support CDC and transaction-log-based replication, and Azure Data Factory, Databricks, and the Bronze → Silver → Gold Lakehouse pattern operate the same way regardless of whether the source is Oracle or SQL Server. The only difference is the CDC connector technology (e.g., Oracle LogMiner or GoldenGate vs. SQL Server CDC or the Debezium SQL Server connector); the overall ingestion, standardization, and merging into Delta tables remain identical.
TL;DR
- Bronze (Raw): land data fast & safely (Oracle CDC + files) with minimal transformation.
- Silver (Clean): enforce schema, clean, de-duplicate, conform dimensions, standardize formats.
- Gold (Curated): publish analytics-ready data products (star schemas, aggregates) into a warehouse.
- Why Lakehouse: cheap scalable storage + ACID tables + decoupled compute for performance and cost control.
1) Data Ingestion: Source → Bronze (Raw)
The ingestion layer has one job: reliably move data from sources into cloud storage with strong observability and repeatability. In a Lakehouse approach, the Bronze layer is your “landing zone” for raw data.
Sources
- Oracle transactional data using CDC (Change Data Capture)
- CSV / semi-structured files from SFTP, SharePoint, on-prem drops, vendor feeds, etc.
Recommended Azure Tooling
- Azure Data Factory (ADF): orchestration, scheduling, retries, metadata-driven pipelines
- CDC tooling: depending on your stack (e.g., Oracle LogMiner-based, GoldenGate, Debezium, etc.)
- Landing storage: ADLS Gen2 containers (e.g., `/bronze/oracle/...`, `/bronze/csv/...`)
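A deterministic landing-path convention pays off later in replays and audits. The sketch below shows one way to build such paths; the `ingest_date=` / `run_id=` partition names are illustrative, not a fixed standard.

```python
from datetime import date

def bronze_path(source_system: str, entity: str, run_date: date, run_id: str) -> str:
    """Build a deterministic ADLS landing path for a Bronze drop.

    Partitioning by date and run_id keeps every ingestion run individually
    addressable, which makes replays and audits straightforward.
    """
    return (
        f"/bronze/{source_system}/{entity}/"
        f"ingest_date={run_date.isoformat()}/run_id={run_id}/"
    )

# Example: a CDC batch for the Oracle "orders" entity
path = bronze_path("oracle", "orders_cdc", date(2026, 2, 3), "run-0042")
```

Because the path is a pure function of (source, entity, date, run), any pipeline stage can reconstruct exactly where a given run landed.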
Why we don’t query production Oracle directly
For analytics pipelines, repeatedly querying production OLTP databases tends to create operational risk:
- Performance & locking risk: analytics scans can spike CPU/IO and impact business transactions.
- Unbounded read patterns: BI asks for "the last 2 years," but OLTP tables aren't tuned for scans of that size.
- Operational fragility: every new report becomes a new dependency on production.
- CDC is purpose-built: it captures changes incrementally and reduces load dramatically.
Bronze Data Contracts (what to store)
Bronze should preserve fidelity (raw shape) and add operational metadata:
- Ingestion timestamp (`ingested_at`)
- Source system and entity (`source_system`, `entity`)
- CDC operation (`op` = I/U/D) and transaction ID/SCN if available
- Batch/run ID for traceability
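One way to keep this contract honest is to wrap every raw payload in a small metadata envelope at landing time. The sketch below uses a plain Python dataclass; the field names mirror the contract above, but the exact shape is an assumption to adapt to your conventions.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)
class BronzeEnvelope:
    """Operational metadata attached to every raw record at landing."""
    ingested_at: str      # UTC ISO-8601 landing timestamp
    source_system: str    # e.g. "oracle"
    entity: str           # e.g. "orders"
    op: str               # CDC operation: I / U / D
    scn: int              # Oracle system change number (ordering key)
    run_id: str           # pipeline run for traceability

def wrap(record: dict, meta: BronzeEnvelope) -> dict:
    """Preserve the raw payload untouched; add metadata alongside it."""
    return {"payload": record, **asdict(meta)}

meta = BronzeEnvelope(
    ingested_at=datetime.now(timezone.utc).isoformat(),
    source_system="oracle", entity="orders", op="U",
    scn=884211, run_id="run-0042",
)
row = wrap({"order_id": 17, "amount": 99.5}, meta)
```

The key point: the raw payload is never modified in Bronze; metadata travels alongside it so Silver can order, de-duplicate, and trace every change.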
2) Processing Layer: Bronze → Silver (Clean)
Silver is where you convert raw ingestion into trusted, queryable, standardized tables. It’s the layer optimized for data quality and reusability across consumers.
Recommended tooling
- Apache Spark / Azure Databricks for scalable transformations
- Delta Lake tables for ACID guarantees, schema evolution, and time travel
- Schema-on-read approach with controlled enforcement in Silver
Typical Silver transformations
- Schema enforcement: types, nullability rules, field standardization
- De-duplication: handle CDC duplicates or file replays
- Conformance: align codes, currencies, time zones
- PII handling: masking, tokenization, or isolation into restricted zones
- Incremental upserts: apply CDC changes efficiently (MERGE into Delta)
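The de-duplication step deserves emphasis: at-least-once delivery and file replays mean the same change can arrive twice, and one key can change several times within a batch. A minimal pure-Python sketch of the "keep the latest change per key" logic (column names `order_id` and `scn` are illustrative):

```python
def latest_changes(events: list[dict], key: str = "order_id",
                   order: str = "scn") -> list[dict]:
    """Collapse a CDC batch to the newest change per business key.

    Keeping only the highest-SCN event per key makes the downstream
    MERGE safe (a MERGE fails if one source key matches twice).
    """
    newest: dict = {}
    for e in events:
        k = e[key]
        if k not in newest or e[order] > newest[k][order]:
            newest[k] = e
    return list(newest.values())

batch = [
    {"order_id": 1, "scn": 100, "op": "I"},
    {"order_id": 1, "scn": 100, "op": "I"},   # replayed duplicate
    {"order_id": 1, "scn": 105, "op": "U"},   # later change wins
    {"order_id": 2, "scn": 101, "op": "I"},
]
deduped = latest_changes(batch)
```

In Spark the same idea is typically expressed with a `row_number()` window partitioned by the key and ordered by SCN descending.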
File formats: Delta vs Parquet
Parquet is a great columnar file format for analytics. Delta is Parquet plus a transaction log that enables ACID, upserts, deletes, and time travel. In Silver, where you typically need incremental change application (MERGE) and reliable replay, Delta is often the pragmatic choice.
A sketch of the CDC-apply step in PySpark (on Databricks, `spark` is the provided session; the paths, column names, and `silver.orders` table are illustrative):

```python
# Sketch: apply a CDC batch from Bronze to a Silver Delta table
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# 1) Read the new CDC batch from Bronze
cdc = (spark.read
       .format("json")
       .load("/bronze/oracle/orders_cdc/run_id=2026-02-03"))

# 2) Standardize and clean; keep only the latest change per order,
#    otherwise MERGE fails when one key matches multiple source rows
latest = Window.partitionBy("order_id").orderBy(F.col("scn").desc())
staged = (cdc
          .withColumn("event_time", F.to_timestamp("event_time"))
          .withColumn("op", F.upper(F.col("op")))
          .withColumn("rn", F.row_number().over(latest))
          .filter("rn = 1")
          .drop("rn"))

# 3) Merge into the Silver Delta table
staged.createOrReplaceTempView("staged_orders")
spark.sql("""
    MERGE INTO silver.orders AS tgt
    USING staged_orders AS src
    ON tgt.order_id = src.order_id
    WHEN MATCHED AND src.op = 'D' THEN DELETE
    WHEN MATCHED AND src.op IN ('U', 'I') THEN UPDATE SET *
    WHEN NOT MATCHED AND src.op IN ('I', 'U') THEN INSERT *
""")
```
3) Gold Layer (Data Warehouse): Silver → Gold
Gold is where you publish analytics-ready data products: curated dimensions, facts, aggregates, and semantic-friendly models. It’s the layer designed to answer business questions efficiently and consistently.
Gold outputs
- Star schema: Facts (transactions) + Dimensions (customer, product, date)
- Aggregates: daily revenue, cohort retention, operational KPIs
- Business definitions: standardized metrics used by BI tools
- Serving layers: materialized views or warehouse tables for high concurrency
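To make the aggregate idea concrete, here is a minimal roll-up of a sales fact to a Gold-style daily-revenue table, expressed in plain Python (the `order_date` / `amount` columns are hypothetical; in practice this would be a GROUP BY over fact and date-dimension tables):

```python
from collections import defaultdict

def daily_revenue(fact_sales: list[dict]) -> dict[str, float]:
    """Roll a sales fact up to a daily revenue aggregate.

    Each fact row carries a date key and a measure -- the same shape a
    fact table joined to a date dimension would produce.
    """
    totals: dict[str, float] = defaultdict(float)
    for row in fact_sales:
        totals[row["order_date"]] += row["amount"]
    return dict(totals)

facts = [
    {"order_date": "2026-02-01", "amount": 120.0},
    {"order_date": "2026-02-01", "amount": 80.0},
    {"order_date": "2026-02-02", "amount": 50.0},
]
revenue = daily_revenue(facts)
```

Publishing this as a Gold table (rather than recomputing it in every dashboard) is what keeps metric definitions consistent across consumers.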
Warehouse choice: Synapse vs Snowflake (conceptually)
At Gold, many organizations choose a warehouse engine (e.g., Azure Synapse or Snowflake) for predictable BI performance and concurrency. The key idea: Gold is optimized for serving queries.
Why a warehouse is optimized for BI vs a standard RDBMS
- Columnar storage: faster scans and compression for analytics workloads
- MPP execution: distributed query processing across nodes
- Workload isolation: separate compute pools / virtual warehouses
- Elasticity: scale up/down for peak dashboard usage
4) Technology Justification: Columnar Storage + Decoupled Compute/Storage
The Lakehouse approach succeeds largely because it aligns compute and storage with the way analytics works: lots of scans, aggregations, and joins over large datasets.
Columnar storage
- Reads only the needed columns (huge win for wide tables)
- Better compression due to similar values in a column
- Vectorized execution in modern engines for faster compute
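A toy illustration of the row-vs-column trade-off (data and column names invented for the example):

```python
# Row-oriented: each record stored together; reading one column
# still touches every whole record.
rows = [
    {"id": 1, "name": "alpha", "revenue": 10.0},
    {"id": 2, "name": "beta",  "revenue": 20.0},
    {"id": 3, "name": "beta",  "revenue": 30.0},
]

# Column-oriented: each column stored contiguously; a scan of `revenue`
# reads only that list, and runs of repeated values (e.g. "beta")
# compress well with run-length or dictionary encoding.
columns = {
    "id":      [r["id"] for r in rows],
    "name":    [r["name"] for r in rows],
    "revenue": [r["revenue"] for r in rows],
}

total = sum(columns["revenue"])   # touches 3 floats, not 3 whole records
```

Parquet and warehouse column stores apply exactly this layout at file and page granularity, which is why wide analytical scans are so much cheaper than on a row-store OLTP engine.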
Decoupled compute & storage
- Scale compute independently: scale up clusters only during heavy ETL or BI peaks
- Lower cost: store everything cheaply in ADLS, pay compute on demand
- Workload isolation: separate ETL compute from BI compute to reduce contention
5) Post-Warehouse Actions: What Happens After Gold?
Once Gold is established, value accelerates when you plug in downstream consumers and governance. Here are three common next steps:
1) BI / Semantic Layer
- Publish curated models for Power BI (or your BI tool), with certified datasets
- Define KPI measures and shared calculations consistently
- Enable row-level security (RLS) aligned with business roles
2) ML / Feature Store
- Create reusable features from Silver/Gold (customer lifetime value, churn signals, etc.)
- Support offline training + online serving patterns
- Track feature lineage and drift
3) Reverse ETL / Operational Analytics
- Push curated insights back into operational tools (CRM, marketing, support)
- Examples: prioritized leads, churn risk flags, fraud risk scores
- Ensure idempotent sync and auditing for operational systems
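Idempotency is the property worth sketching: a reverse-ETL sync keyed on a business identifier can be re-run after any failure without creating duplicates. A minimal illustration (the in-memory `crm` dict and field names stand in for a real operational API):

```python
def sync_scores(crm: dict[int, dict], scores: list[dict]) -> int:
    """Idempotently push churn-risk scores into an operational store.

    Keyed upserts mean re-running the sync converges to the same state.
    Returns the number of records actually changed, for auditing.
    """
    changed = 0
    for s in scores:
        key = s["customer_id"]
        record = {"churn_risk": s["churn_risk"], "scored_at": s["scored_at"]}
        if crm.get(key) != record:
            crm[key] = record   # upsert by business key
            changed += 1
    return changed

crm: dict[int, dict] = {}
batch = [{"customer_id": 7, "churn_risk": 0.82, "scored_at": "2026-02-03"}]
first = sync_scores(crm, batch)    # first run writes
second = sync_scores(crm, batch)   # re-run is a no-op
```

The changed-record count doubles as the audit trail the operational side needs.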
Technical Blueprint Table (Steps, Inputs, Outputs, and Quality Gates)
| Stage | Primary Goal | Inputs | Tools | Output (Storage) | Quality Gates / Observability |
|---|---|---|---|---|---|
| Ingestion (Bronze) | Land data reliably, preserve raw fidelity | Oracle CDC, CSV drops | ADF, CDC tool, ADLS | Raw files / raw Delta tables in `/bronze/...` | Run IDs, counts, schema drift alerts, latency SLAs |
| Processing (Silver) | Clean, standardize, enforce schema, upsert changes | Bronze raw | Databricks/Spark, Delta | Conformed Delta tables in `/silver/...` | DQ tests (null checks, uniqueness), dedupe rates, merge metrics |
| Curate (Gold) | Analytics-ready models (facts/dims, aggregates) | Silver tables | Databricks/Spark + warehouse engine | Warehouse tables / Gold Delta in `/gold/...` | Metric reconciliation, SLA dashboards, semantic model validation |
| Consume | Serve BI/ML/Reverse ETL | Gold products | Power BI, ML platform, Reverse ETL | Dashboards, models, features, sync jobs | Usage telemetry, access audits, freshness monitoring |
Logical Flow Diagram
[Inline SVG diagram omitted: Oracle CDC + CSV sources → Bronze (ADLS, raw) → Silver (Delta, cleaned) → Gold (warehouse) → BI / ML / Reverse ETL]
Final Notes: Operational Excellence (What Makes This Work in Real Life)
- Metadata-driven pipelines: avoid hardcoding 100 tables—use configs.
- Data quality as code: checks at Silver and reconciliation at Gold.
- Lineage + auditing: from dashboard back to source event/file.
- Idempotency: every run can safely reprocess without duplication.
- Access controls: isolate PII and apply least privilege end-to-end.
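The first bullet, metadata-driven pipelines, is worth making concrete: one generic pipeline reads a table-level config and fans out, so onboarding a new table is a config change rather than new pipeline code. A sketch (field names and entries are illustrative):

```python
# Each entry drives one generic ingestion pipeline run.
TABLES = [
    {"source_system": "oracle", "entity": "orders",
     "load": "cdc",  "keys": ["order_id"],          "watermark": "scn"},
    {"source_system": "oracle", "entity": "customers",
     "load": "cdc",  "keys": ["customer_id"],       "watermark": "scn"},
    {"source_system": "sftp",   "entity": "vendor_prices",
     "load": "full", "keys": ["sku", "price_date"], "watermark": None},
]

def plan_runs(tables: list[dict]) -> list[str]:
    """Expand the config into concrete pipeline invocations."""
    return [f"ingest:{t['source_system']}.{t['entity']}:{t['load']}"
            for t in tables]

runs = plan_runs(TABLES)
```

In ADF this pattern is typically a ForEach over a lookup of this config table; in Databricks, a job that parameterizes one notebook per entry.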