Data Models & Query Languages — A System Design Deep Dive

Data Models & Query Languages — A System Design Deep Dive (Inspired by DDIA Chapter 2)

Book Notes System Design

Inspired by Designing Data-Intensive Applications, Chapter 2 — distilled for interviews and real-world engineering.

1) Why Data Models Matter

Data models aren’t just storage choices; they shape how we think, design APIs, enforce invariants, and scale. The “right” model aligns with access patterns, keeps hot paths simple, and minimizes accidental complexity.

What Models Decide

Queryability: What’s easy/hard/expensive to ask.
Integrity: Where constraints live (DB vs app).
Locality: How much IO per request.
Evolution: Cost of change over time.

Rule of Thumb

Choose models by dominant access patterns, relationship complexity, and consistency requirements—then validate with prototypes.

2) Relational Model — Deep Dive

Normalization & Integrity

1NF: Atomic values; no repeating groups.
2NF/3NF: No partial/transitive dependency on keys—prevents update anomalies.
BCNF: Every determinant is a key—stricter than 3NF.
Keys: Prefer surrogate (e.g., BIGINT) when natural keys are wide/unstable.
Constraints: Primary/foreign keys, unique, check; push invariants into the database when possible.

Join Semantics

Inner (intersection), left/right (preserve one side), full, cross, semi/anti (via EXISTS/NOT EXISTS).
Join selectivity and index coverage dominate latency in OLTP.

Query Planning (Mental Model)

Indexes reduce scanned rows; statistics inform join order; cost-based optimizers pick algorithms (hash join, sort-merge, nested loop).
Composite indexes: order by equality → range → order-by columns. “Leftmost prefix” matters.

Modeling Patterns

Many-to-many: junction table with composite unique constraint.
Polymorphic associations: either a type+id pair (with app-enforced FK) or separate tables + views.
Soft deletes: deleted_at plus partial indexes to keep uniqueness for active rows.
EAV (Entity-Attribute-Value): flexible but often a query/constraint anti-pattern; prefer JSON column + targeted indexes if you must.

Example: Orders, Items, Indexing

-- Core tables
CREATE TABLE orders (
  id BIGSERIAL PRIMARY KEY,
  user_id BIGINT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_orders_user_created ON orders(user_id, created_at DESC);

CREATE TABLE order_items (
  id BIGSERIAL PRIMARY KEY,
  order_id BIGINT NOT NULL REFERENCES orders(id),
  sku TEXT NOT NULL,
  qty INT NOT NULL CHECK (qty > 0),
  price_cents INT NOT NULL CHECK (price_cents >= 0)
);

CREATE INDEX idx_items_order ON order_items(order_id);

-- Top query: recent orders with totals for a user
SELECT o.id, o.created_at, SUM(oi.qty * oi.price_cents) AS total_cents
FROM orders o
JOIN order_items oi ON oi.order_id = o.id
WHERE o.user_id = $1
GROUP BY o.id
ORDER BY o.created_at DESC
LIMIT 50;

Interview tip: Normalize by default for integrity and flexibility; denormalize selectively for read-heavy hot paths. Call out specific materializations (e.g., “order_total_cents” column refreshed via trigger or job).

3) Document Model — Deep Dive

Aggregate-Centric Design

Store an aggregate (document) that you usually read/write as a whole.
Embed when child lifecycle is bound to parent and access is co-located.
Reference when many-to-many, large unbounded arrays, or cross-aggregate reuse.

Indexing & Access

Support equality/range on top-level & nested fields (items.sku), and compound indexes for multi-field filters.
Use partial or TTL indexes for lifecycle management.
Avoid “array-of-huge-subs” that frequently append beyond page size—can cause doc moves and write amplification.

Atomicity & Concurrency

Most document stores guarantee atomic updates within a single document. Cross-document transactions may be limited or costlier.
Prefer idempotent updates and upserts for retried writes.

Aggregation Pipeline Example

// Total spend per user over last 30 days
db.orders.aggregate([
  { $match: { createdAt: { $gte: ISODate("2026-01-01") } } },
  { $unwind: "$items" },
  { $group: {
      _id: "$userId",
      totalCents: { $sum: { $multiply: [ "$items.qty", "$items.priceCents" ] } }
  }},
  { $sort: { totalCents: -1 } },
  { $limit: 50 }
]);

Gotcha: “Embed everything” fails with growing, unbounded arrays (e.g., activity feeds). Prefer references + fan-out-on-write/read patterns, or a separate collection.

4) Graph Models — Deep Dive (Property Graphs & RDF)

When Graphs Win

Multi-hop traversals (friends-of-friends, shortest path, role hierarchies).
Query patterns are relationship-centric more than attribute-centric.

Property Graph vs RDF

Aspect	Property Graph (e.g., Neo4j/Cypher)	RDF (SPARQL)
Data Shape	Nodes/edges with key-value properties	Triples (subject, predicate, object)
Query Style	Pattern matching with path syntax	Triple patterns with optional reasoning
Strength	Operational traversals, intuitive modeling	Semantic web, vocabularies, inference

Cypher Example

// Potential recommendations: friends-of-friends who liked the same SKU
MATCH (u:User {id: 42})-[:FRIEND]-(f)-[:FRIEND]-(fof),
      (u)-[:PURCHASED]-(:Item {sku: "ABC-123"}),
      (fof)-[:PURCHASED]-(:Item {sku: "ABC-123"})
RETURN DISTINCT fof
ORDER BY fof.lastActive DESC
LIMIT 20;

SPARQL Example

# Get people who manage someone who manages Alice
PREFIX ex: <http://example.org/schema#>
SELECT DISTINCT ?person WHERE {
  ?person ex:manages ?x .
  ?x ex:manages ex:Alice .
}

Interview tip: For depth-first/variable-length path queries, graphs avoid N self-joins and complex recursive CTEs. Always start with a selective index-bound node to keep traversals tight.

5) Query Languages: SQL, Aggregation Pipelines, Graph

Declarative vs Imperative

SQL / SPARQL / Cypher: Declare what you want; engine plans how.
MapReduce/Agg Pipelines: Imperative or staged; great for ETL and batch analytics.

SQL Essentials for Interviews

JOINs, GROUP BY, HAVING, window functions (e.g., ROW_NUMBER, LAG).
Use EXPLAIN/EXPLAIN ANALYZE to surface plan, rows, and index usage.

Aggregation Pipelines

Think of each stage as a relational transform (filter, project, group, join-like via $lookup), but executed close to document structure.

Graph Queries

Pattern-first: match nodes and relationships, bind variables, constrain with predicates, then RETURN. Consider traversal bounds (depth, labels) and path uniqueness.

6) Schema Flexibility & Evolution Patterns

Schema-on-Write vs Schema-on-Read

Schema-on-write (relational): strong guarantees upfront; migrations needed.
Schema-on-read (document): tolerant storage; contracts enforced in app and readers.

Compatibility & Versioning

Additive changes are easiest (new nullable/optional fields).
Breaking changes: use dual-read (read v1 or v2) and/or dual-write during backfills.
Keep a data migration playbook: cutover plan, backfill jobs, metrics, rollback.

Operational Playbook (Example)

Introduce new column/field with defaults.
Write dual (old + new) or derive on read.
Backfill in batches with idempotent jobs.
Flip reads to new schema; monitor errors/latency.
Remove old paths once stable; add constraints.

7) Joins, Locality & Denormalization

Trade-offs

Joins: powerful, but expensive across shards and large tables without selective indexes.
Locality: embedding (docs) or clustering (RDBMS) reduces random IO for common reads.
Denormalization: duplicate/derive data to speed reads (materialized views, counters)—manage staleness.

Patterns

Materialized aggregations: e.g., order_total_cents updated via trigger/job.
Fan-out on write (eager) vs fan-out on read (lazy) for feeds.
Read models tailored for screens (CQRS-lite).

8) Polyglot Persistence & Decision Frameworks

Choose by Access Pattern

Ad-hoc, cross-entity analytics → Relational.
Aggregate reads/writes, hierarchical JSON → Document.
Multi-hop relationship traversals → Graph.

Secondary Criteria

Consistency needs (constraints, transactions).
Sharding strategy and growth projections.
Team expertise and operational maturity.

Decision Matrix (Quick Start)

Requirement	Relational	Document	Graph
Complex joins and ad-hoc SQL	★★★★★	★★	★★
Nested, variable schema aggregates	★★	★★★★★	★★
Multi-hop traversals (variable length)	★	★	★★★★★
Strict integrity (FKs, constraints)	★★★★★	★★	★★★
Horizontal scalability (ops simplicity)	★★★	★★★★	★★★

9) Performance Fundamentals: Indexing, Joins & Plans

Index Heuristics

High-selectivity predicates deserve indexes; avoid low-cardinality leading columns.
Composite index order: equality columns → range columns → order by/group by.
Covering indexes (include columns) reduce lookups on hot queries.
Beware write amplification: every index must be maintained on insert/update/delete.

Join Algorithms (Mental Cost)

Nested loop: great when inner side is indexed and selective.
Hash join: good for large, unsorted inputs; memory trade-offs.
Sort-merge: efficient when both inputs are pre-sorted.

Sharding Impact

Cross-shard joins get expensive; prefer collocation or pre-join denormalized views.
Graph traversals across shards require careful start-node routing and hop limits.

10) Case Studies

E‑commerce Catalog & Orders

Catalog: Document store for product pages (variable attributes), plus search index. Keep global constraints (SKU uniqueness) in a relational sidecar or enforce at app with careful upserts.
Orders: Relational core (joins, accounting integrity) + materialized totals; optionally emit denormalized read models per user in a doc store.

// Document: product
{
  "sku": "ABC-123",
  "title": "Widget",
  "attrs": { "color": "red", "size": "M", "material": "cotton" },
  "media": [{ "type": "image", "url": "..." }],
  "pricing": { "list": 1999, "sale": 1499 },
  "availability": { "region": "US", "stock": 42 }
}

Social Graph & Feed

Graph DB for relationships, suggestions, ACLs.
Fan-out strategies:
- Small creators: fan-out-on-write to followers’ inboxes (doc store).
- Large creators: fan-out-on-read with caching and rankers.

Access Control (RBAC/ABAC)

Graph model for SUBJECT -[HAS_ROLE]-> ROLE -[GRANTS]-> PERMISSION and resource hierarchies.
Materialize effective permissions per user/resource pair for hot reads.

11) Interview Playbook: Prompts, Answers & Trade‑offs

Common Prompts & How to Attack

“Would you use relational or document for a user profile with settings and recent activity?”
Start with document for the profile aggregate (flexible fields, one-shot reads). Keep activity in a separate collection (append-only). If you need global integrity (unique email), enforce with unique index or a relational registry.
“Model and query a friends-of-friends recommendation.”
If traversals dominate: property graph + Cypher. If data volume is small/moderate, relational with recursive CTEs is fine; ensure start node selectivity and hop limits.
“We need ad-hoc analytics across many entities.”
Relational warehouse (columnar MPP). ETL from operational stores; star schemas; materialized aggregates; partitioning by date.
“Schema evolves weekly—how do we keep velocity?”
Document store for fast-changing aggregates; enforce contracts at service boundary; additive changes; dual-read for breaking changes; telemetry on field usage and null rates.
“Cross-shard joins are slow—what now?”
Co-locate related partitions (same shard key), pre-join via denormalized views, or move the query to a warehouse/OLAP job. Consider changing shard key to match top access pattern.
“Design product search + filter.”
Primary store (doc or relational), plus a search index (e.g., inverted index). Keep source of truth authoritative; changes flow via CDC/streams. Use incremental reindex and dual reads during rebuilds.

Sound-Bite Trade‑offs

“Normalize for integrity; denormalize for speed—measure and materialize hot paths.”
“Graphs shine when the relationship is the query.”
“Pick your model by query shape, not by fashion.”
“Schema changes are constant—opt for additive changes and staged cutovers.”

Mini Whiteboard Checklist

List entities + cardinalities (1‑1, 1‑N, N‑M).
Write top 5 queries (with filters/sorts) and read/write ratios.
Choose model(s) and justify with access pattern fit.
Define indexes (composite order) and sharding if needed.
Plan schema evolution and denormalized read models.
Call out failure modes, staleness, and backfill/rollbacks.

12) Red Flags & Anti‑Patterns

ORM-first design that hides poor data modeling decisions.
Massive unbounded arrays embedded in documents.
Global uniqueness enforced only in app code under concurrency.
EAV everywhere leading to unreadable queries and no constraints.
Composite indexes in the wrong order; low-selectivity leading columns.
Cross-shard joins without collocation strategy.

13) Cheat Sheets & Heuristics

Composite Index Order

Equality filters first (highest selectivity).
Then range filter columns.
Then columns used in ORDER BY/GROUP BY.

Embed vs Reference (Docs)

Embed if co-read, bounded size, same lifecycle.
Reference if N‑M, unbounded, or shared across aggregates.

Graph Traversal Hygiene

Always anchor on selective start nodes.
Bound path length; filter by labels early.
Materialize frequently traversed neighborhoods.

14) Summary

Relational excels at integrity, joins, and ad-hoc analytics.
Document excels at aggregate locality, flexible schemas, and agile iteration.
Graph excels at multi-hop, relationship-centric queries.
Choose by access patterns; evolve with additive changes; materialize hot paths; measure and iterate.

Data Models & Query Languages — A System Design Deep Dive

1) Why Data Models Matter

What Models Decide

Rule of Thumb

2) Relational Model — Deep Dive

Normalization & Integrity

Join Semantics

Query Planning (Mental Model)

Modeling Patterns

Example: Orders, Items, Indexing

3) Document Model — Deep Dive

Aggregate-Centric Design

Indexing & Access

Atomicity & Concurrency

Aggregation Pipeline Example

4) Graph Models — Deep Dive (Property Graphs & RDF)

When Graphs Win

Property Graph vs RDF

Cypher Example

SPARQL Example

5) Query Languages: SQL, Aggregation Pipelines, Graph

Declarative vs Imperative

SQL Essentials for Interviews

Aggregation Pipelines

Graph Queries

6) Schema Flexibility & Evolution Patterns

Schema-on-Write vs Schema-on-Read

Compatibility & Versioning

Operational Playbook (Example)

7) Joins, Locality & Denormalization

Trade-offs

Patterns

8) Polyglot Persistence & Decision Frameworks

Choose by Access Pattern

Secondary Criteria

Decision Matrix (Quick Start)

9) Performance Fundamentals: Indexing, Joins & Plans

Index Heuristics

Join Algorithms (Mental Cost)

Sharding Impact

10) Case Studies

E‑commerce Catalog & Orders

Social Graph & Feed

Access Control (RBAC/ABAC)

11) Interview Playbook: Prompts, Answers & Trade‑offs

Common Prompts & How to Attack

Sound-Bite Trade‑offs

Mini Whiteboard Checklist

12) Red Flags & Anti‑Patterns

13) Cheat Sheets & Heuristics

Composite Index Order

Embed vs Reference (Docs)

Graph Traversal Hygiene

14) Summary

Posted by Surendra Rayapati

You may like these posts

Post a Comment

0 Comments

About Me

Archive

Most Popular

Search This Blog

Tags

Recent Post

Popular Posts

Footer Menu Widget

Contact form