From Oracle to Azure Lakehouse: A Practical Medallion (Bronze/Silver/Gold) Blueprint
A senior-architect walkthrough of how to ingest transactional Oracle data via CDC and semi-structured CSVs into Azure using a modern Lakehouse approach—ending in a BI-ready Gold layer and a scalable data warehouse.
While this blueprint uses Oracle as the primary example for Change Data Capture (CDC), the same architecture applies equally well to SQL Server. Both platforms support CDC and transaction-log-based replication, and Azure Data Factory, Databricks, and the Bronze → Silver → Gold Lakehouse pattern operate the same way regardless of whether the source is Oracle or SQL Server. The only difference is the CDC connector technology (e.g., Oracle LogMiner or GoldenGate vs. SQL Server CDC or the Debezium SQL Server connector); the overall ingestion, standardization, and merging into Delta tables remain identical.
TL;DR
- Bronze (Raw): land data fast & safely (Oracle CDC + files) with minimal transformation.
- Silver (Clean): enforce schema, clean, de-duplicate, conform dimensions, standardize formats.
- Gold (Curated): publish analytics-ready data products (star schemas, aggregates) into a warehouse.
- Why Lakehouse: cheap scalable storage + ACID tables + decoupled compute for performance and cost control.
1) Data Ingestion: Source → Bronze (Raw)
The ingestion layer has one job: reliably move data from sources into cloud storage with strong observability and repeatability. In a Lakehouse approach, the Bronze layer is your “landing zone” for raw data.
Sources
- Oracle transactional data using CDC (Change Data Capture)
- CSV / semi-structured files from SFTP, SharePoint, on-prem drops, vendor feeds, etc.
Recommended Azure Tooling
- Azure Data Factory (ADF): orchestration, scheduling, retries, metadata-driven pipelines
- CDC tooling: depending on your stack (e.g., Oracle LogMiner-based, GoldenGate, Debezium, etc.)
- Landing storage: ADLS Gen2 containers (e.g., `/bronze/oracle/...`, `/bronze/csv/...`)
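A deterministic landing-path convention pays off later in replays and audits. The sketch below shows one way to build such paths; the `ingest_date=` / `run_id=` partition names are illustrative, not a fixed standard.

```python
from datetime import date

def bronze_path(source_system: str, entity: str, run_date: date, run_id: str) -> str:
    """Build a deterministic ADLS landing path for a Bronze drop.

    Partitioning by date and run_id keeps every ingestion run individually
    addressable, which makes replays and audits straightforward.
    """
    return (
        f"/bronze/{source_system}/{entity}/"
        f"ingest_date={run_date.isoformat()}/run_id={run_id}/"
    )

# Example: a CDC batch for the Oracle "orders" entity
path = bronze_path("oracle", "orders_cdc", date(2026, 2, 3), "run-0042")
```

Because the path is a pure function of (source, entity, date, run), any pipeline stage can reconstruct exactly where a given run landed.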
Why we don’t query production Oracle directly
For analytics pipelines, repeatedly querying production OLTP databases tends to create operational risk:
- Performance & locking risk: analytics scans can spike CPU/IO and impact business transactions.
- Unbounded read patterns: BI asks for "the last 2 years," but OLTP tables aren't tuned for scans of that size.
- Operational fragility: every new report becomes a new dependency on production.
- CDC is purpose-built: it captures changes incrementally and reduces load dramatically.
Bronze Data Contracts (what to store)
Bronze should preserve fidelity (raw shape) and add operational metadata:
- Ingestion timestamp (`ingested_at`)
- Source system and entity (`source_system`, `entity`)
- CDC operation (`op` = I/U/D) and transaction ID/SCN if available
- Batch/run ID for traceability
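One way to keep this contract honest is to wrap every raw payload in a small metadata envelope at landing time. The sketch below uses a plain Python dataclass; the field names mirror the contract above, but the exact shape is an assumption to adapt to your conventions.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)
class BronzeEnvelope:
    """Operational metadata attached to every raw record at landing."""
    ingested_at: str      # UTC ISO-8601 landing timestamp
    source_system: str    # e.g. "oracle"
    entity: str           # e.g. "orders"
    op: str               # CDC operation: I / U / D
    scn: int              # Oracle system change number (ordering key)
    run_id: str           # pipeline run for traceability

def wrap(record: dict, meta: BronzeEnvelope) -> dict:
    """Preserve the raw payload untouched; add metadata alongside it."""
    return {"payload": record, **asdict(meta)}

meta = BronzeEnvelope(
    ingested_at=datetime.now(timezone.utc).isoformat(),
    source_system="oracle", entity="orders", op="U",
    scn=884211, run_id="run-0042",
)
row = wrap({"order_id": 17, "amount": 99.5}, meta)
```

The key point: the raw payload is never modified in Bronze; metadata travels alongside it so Silver can order, de-duplicate, and trace every change.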
2) Processing Layer: Bronze → Silver (Clean)
Silver is where you convert raw ingestion into trusted, queryable, standardized tables. It’s the layer optimized for data quality and reusability across consumers.
Recommended tooling
- Apache Spark / Azure Databricks for scalable transformations
- Delta Lake tables for ACID guarantees, schema evolution, and time travel
- Schema-on-read approach with controlled enforcement in Silver
Typical Silver transformations
- Schema enforcement: types, nullability rules, field standardization
- De-duplication: handle CDC duplicates or file replays
- Conformance: align codes, currencies, time zones
- PII handling: masking, tokenization, or isolation into restricted zones
- Incremental upserts: apply CDC changes efficiently (MERGE into Delta)
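The de-duplication step deserves emphasis: at-least-once delivery and file replays mean the same change can arrive twice, and one key can change several times within a batch. A minimal pure-Python sketch of the "keep the latest change per key" logic (column names `order_id` and `scn` are illustrative):

```python
def latest_changes(events: list[dict], key: str = "order_id",
                   order: str = "scn") -> list[dict]:
    """Collapse a CDC batch to the newest change per business key.

    Keeping only the highest-SCN event per key makes the downstream
    MERGE safe (a MERGE fails if one source key matches twice).
    """
    newest: dict = {}
    for e in events:
        k = e[key]
        if k not in newest or e[order] > newest[k][order]:
            newest[k] = e
    return list(newest.values())

batch = [
    {"order_id": 1, "scn": 100, "op": "I"},
    {"order_id": 1, "scn": 100, "op": "I"},   # replayed duplicate
    {"order_id": 1, "scn": 105, "op": "U"},   # later change wins
    {"order_id": 2, "scn": 101, "op": "I"},
]
deduped = latest_changes(batch)
```

In Spark the same idea is typically expressed with a `row_number()` window partitioned by the key and ordered by SCN descending.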
File formats: Delta vs Parquet
Parquet is a great columnar file format for analytics. Delta is Parquet plus a transaction log that enables ACID, upserts, deletes, and time travel. In Silver, where you typically need incremental change application (MERGE) and reliable replay, Delta is often the pragmatic choice.
A sketch of the CDC-apply step in PySpark (on Databricks, `spark` is the provided session; the paths, column names, and `silver.orders` table are illustrative):

```python
# Sketch: apply a CDC batch from Bronze to a Silver Delta table
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# 1) Read the new CDC batch from Bronze
cdc = (spark.read
       .format("json")
       .load("/bronze/oracle/orders_cdc/run_id=2026-02-03"))

# 2) Standardize and clean; keep only the latest change per order,
#    otherwise MERGE fails when one key matches multiple source rows
latest = Window.partitionBy("order_id").orderBy(F.col("scn").desc())
staged = (cdc
          .withColumn("event_time", F.to_timestamp("event_time"))
          .withColumn("op", F.upper(F.col("op")))
          .withColumn("rn", F.row_number().over(latest))
          .filter("rn = 1")
          .drop("rn"))

# 3) Merge into the Silver Delta table
staged.createOrReplaceTempView("staged_orders")
spark.sql("""
    MERGE INTO silver.orders AS tgt
    USING staged_orders AS src
    ON tgt.order_id = src.order_id
    WHEN MATCHED AND src.op = 'D' THEN DELETE
    WHEN MATCHED AND src.op IN ('U', 'I') THEN UPDATE SET *
    WHEN NOT MATCHED AND src.op IN ('I', 'U') THEN INSERT *
""")
```
3) Gold Layer (Data Warehouse): Silver → Gold
Gold is where you publish analytics-ready data products: curated dimensions, facts, aggregates, and semantic-friendly models. It’s the layer designed to answer business questions efficiently and consistently.
Gold outputs
- Star schema: Facts (transactions) + Dimensions (customer, product, date)
- Aggregates: daily revenue, cohort retention, operational KPIs
- Business definitions: standardized metrics used by BI tools
- Serving layers: materialized views or warehouse tables for high concurrency
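To make the aggregate idea concrete, here is a minimal roll-up of a sales fact to a Gold-style daily-revenue table, expressed in plain Python (the `order_date` / `amount` columns are hypothetical; in practice this would be a GROUP BY over fact and date-dimension tables):

```python
from collections import defaultdict

def daily_revenue(fact_sales: list[dict]) -> dict[str, float]:
    """Roll a sales fact up to a daily revenue aggregate.

    Each fact row carries a date key and a measure -- the same shape a
    fact table joined to a date dimension would produce.
    """
    totals: dict[str, float] = defaultdict(float)
    for row in fact_sales:
        totals[row["order_date"]] += row["amount"]
    return dict(totals)

facts = [
    {"order_date": "2026-02-01", "amount": 120.0},
    {"order_date": "2026-02-01", "amount": 80.0},
    {"order_date": "2026-02-02", "amount": 50.0},
]
revenue = daily_revenue(facts)
```

Publishing this as a Gold table (rather than recomputing it in every dashboard) is what keeps metric definitions consistent across consumers.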
Warehouse choice: Synapse vs Snowflake (conceptually)
At Gold, many organizations choose a warehouse engine (e.g., Azure Synapse or Snowflake) for predictable BI performance and concurrency. The key idea: Gold is optimized for serving queries.
Why a warehouse is optimized for BI vs a standard RDBMS
- Columnar storage: faster scans and compression for analytics workloads
- MPP execution: distributed query processing across nodes
- Workload isolation: separate compute pools / virtual warehouses
- Elasticity: scale up/down for peak dashboard usage
4) Technology Justification: Columnar Storage + Decoupled Compute/Storage
The Lakehouse approach succeeds largely because it aligns compute and storage with the way analytics works: lots of scans, aggregations, and joins over large datasets.
Columnar storage
- Reads only the needed columns (huge win for wide tables)
- Better compression due to similar values in a column
- Vectorized execution in modern engines for faster compute
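A toy illustration of the row-vs-column trade-off (data and column names invented for the example):

```python
# Row-oriented: each record stored together; reading one column
# still touches every whole record.
rows = [
    {"id": 1, "name": "alpha", "revenue": 10.0},
    {"id": 2, "name": "beta",  "revenue": 20.0},
    {"id": 3, "name": "beta",  "revenue": 30.0},
]

# Column-oriented: each column stored contiguously; a scan of `revenue`
# reads only that list, and runs of repeated values (e.g. "beta")
# compress well with run-length or dictionary encoding.
columns = {
    "id":      [r["id"] for r in rows],
    "name":    [r["name"] for r in rows],
    "revenue": [r["revenue"] for r in rows],
}

total = sum(columns["revenue"])   # touches 3 floats, not 3 whole records
```

Parquet and warehouse column stores apply exactly this layout at file and page granularity, which is why wide analytical scans are so much cheaper than on a row-store OLTP engine.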
Decoupled compute & storage
- Scale compute independently: scale up clusters only during heavy ETL or BI peaks
- Lower cost: store everything cheaply in ADLS, pay compute on demand
- Workload isolation: separate ETL compute from BI compute to reduce contention
5) Post-Warehouse Actions: What Happens After Gold?
Once Gold is established, value accelerates when you plug in downstream consumers and governance. Here are three common next steps:
1) BI / Semantic Layer
- Publish curated models for Power BI (or your BI tool), with certified datasets
- Define KPI measures and shared calculations consistently
- Enable row-level security (RLS) aligned with business roles
2) ML / Feature Store
- Create reusable features from Silver/Gold (customer lifetime value, churn signals, etc.)
- Support offline training + online serving patterns
- Track feature lineage and drift
3) Reverse ETL / Operational Analytics
- Push curated insights back into operational tools (CRM, marketing, support)
- Examples: prioritized leads, churn risk flags, fraud risk scores
- Ensure idempotent sync and auditing for operational systems
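Idempotency is the property worth sketching: a reverse-ETL sync keyed on a business identifier can be re-run after any failure without creating duplicates. A minimal illustration (the in-memory `crm` dict and field names stand in for a real operational API):

```python
def sync_scores(crm: dict[int, dict], scores: list[dict]) -> int:
    """Idempotently push churn-risk scores into an operational store.

    Keyed upserts mean re-running the sync converges to the same state.
    Returns the number of records actually changed, for auditing.
    """
    changed = 0
    for s in scores:
        key = s["customer_id"]
        record = {"churn_risk": s["churn_risk"], "scored_at": s["scored_at"]}
        if crm.get(key) != record:
            crm[key] = record   # upsert by business key
            changed += 1
    return changed

crm: dict[int, dict] = {}
batch = [{"customer_id": 7, "churn_risk": 0.82, "scored_at": "2026-02-03"}]
first = sync_scores(crm, batch)    # first run writes
second = sync_scores(crm, batch)   # re-run is a no-op
```

The changed-record count doubles as the audit trail the operational side needs.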
Technical Blueprint Table (Steps, Inputs, Outputs, and Quality Gates)
| Stage | Primary Goal | Inputs | Tools | Output (Storage) | Quality Gates / Observability |
|---|---|---|---|---|---|
| Ingestion (Bronze) | Land data reliably, preserve raw fidelity | Oracle CDC, CSV drops | ADF, CDC tool, ADLS | Raw files / raw Delta tables in `/bronze/...` | Run IDs, counts, schema drift alerts, latency SLAs |
| Processing (Silver) | Clean, standardize, enforce schema, upsert changes | Bronze raw | Databricks/Spark, Delta | Conformed Delta tables in `/silver/...` | DQ tests (null checks, uniqueness), dedupe rates, merge metrics |
| Curate (Gold) | Analytics-ready models (facts/dims, aggregates) | Silver tables | Databricks/Spark + warehouse engine | Warehouse tables / Gold Delta in `/gold/...` | Metric reconciliation, SLA dashboards, semantic model validation |
| Consume | Serve BI/ML/Reverse ETL | Gold products | Power BI, ML platform, Reverse ETL | Dashboards, models, features, sync jobs | Usage telemetry, access audits, freshness monitoring |
Logical Flow Diagram
[Inline SVG diagram omitted: Oracle CDC + CSV sources → Bronze (ADLS, raw) → Silver (Delta, cleaned) → Gold (warehouse) → BI / ML / Reverse ETL]
Final Notes: Operational Excellence (What Makes This Work in Real Life)
- Metadata-driven pipelines: avoid hardcoding 100 tables—use configs.
- Data quality as code: checks at Silver and reconciliation at Gold.
- Lineage + auditing: from dashboard back to source event/file.
- Idempotency: every run can safely reprocess without duplication.
- Access controls: isolate PII and apply least privilege end-to-end.
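The first bullet, metadata-driven pipelines, is worth making concrete: one generic pipeline reads a table-level config and fans out, so onboarding a new table is a config change rather than new pipeline code. A sketch (field names and entries are illustrative):

```python
# Each entry drives one generic ingestion pipeline run.
TABLES = [
    {"source_system": "oracle", "entity": "orders",
     "load": "cdc",  "keys": ["order_id"],          "watermark": "scn"},
    {"source_system": "oracle", "entity": "customers",
     "load": "cdc",  "keys": ["customer_id"],       "watermark": "scn"},
    {"source_system": "sftp",   "entity": "vendor_prices",
     "load": "full", "keys": ["sku", "price_date"], "watermark": None},
]

def plan_runs(tables: list[dict]) -> list[str]:
    """Expand the config into concrete pipeline invocations."""
    return [f"ingest:{t['source_system']}.{t['entity']}:{t['load']}"
            for t in tables]

runs = plan_runs(TABLES)
```

In ADF this pattern is typically a ForEach over a lookup of this config table; in Databricks, a job that parameterizes one notebook per entry.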