Data Models & Query Languages — A System Design Deep Dive

Data Models & Query Languages — A System Design Deep Dive (Inspired by DDIA Chapter 2)
Book Notes System Design

Inspired by Designing Data-Intensive Applications, Chapter 2 — distilled for interviews and real-world engineering.

1) Why Data Models Matter

Data models aren’t just storage choices; they shape how we think, design APIs, enforce invariants, and scale. The “right” model aligns with access patterns, keeps hot paths simple, and minimizes accidental complexity.

What Models Decide

  • Queryability: What’s easy/hard/expensive to ask.
  • Integrity: Where constraints live (DB vs app).
  • Locality: How much IO per request.
  • Evolution: Cost of change over time.

Rule of Thumb

Choose models by dominant access patterns, relationship complexity, and consistency requirements—then validate with prototypes.

2) Relational Model — Deep Dive

Normalization & Integrity

  • 1NF: Atomic values; no repeating groups.
  • 2NF/3NF: No partial/transitive dependency on keys—prevents update anomalies.
  • BCNF: Every determinant is a key—stricter than 3NF.
  • Keys: Prefer surrogate (e.g., BIGINT) when natural keys are wide/unstable.
  • Constraints: Primary/foreign keys, unique, check; push invariants into the database when possible.

Join Semantics

  • Inner (intersection), left/right (preserve one side), full, cross, semi/anti (via EXISTS/NOT EXISTS).
  • Join selectivity and index coverage dominate latency in OLTP.

Query Planning (Mental Model)

  • Indexes reduce scanned rows; statistics inform join order; cost-based optimizers pick algorithms (hash join, sort-merge, nested loop).
  • Composite indexes: order by equality → range → order-by columns. “Leftmost prefix” matters.

Modeling Patterns

  • Many-to-many: junction table with composite unique constraint.
  • Polymorphic associations: either a type+id pair (with app-enforced FK) or separate tables + views.
  • Soft deletes: deleted_at plus partial indexes to keep uniqueness for active rows.
  • EAV (Entity-Attribute-Value): flexible but often a query/constraint anti-pattern; prefer JSON column + targeted indexes if you must.

Example: Orders, Items, Indexing

-- Core tables
CREATE TABLE orders (
  id BIGSERIAL PRIMARY KEY,
  user_id BIGINT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_orders_user_created ON orders(user_id, created_at DESC);

CREATE TABLE order_items (
  id BIGSERIAL PRIMARY KEY,
  order_id BIGINT NOT NULL REFERENCES orders(id),
  sku TEXT NOT NULL,
  qty INT NOT NULL CHECK (qty > 0),
  price_cents INT NOT NULL CHECK (price_cents >= 0)
);

CREATE INDEX idx_items_order ON order_items(order_id);

-- Top query: recent orders with totals for a user
SELECT o.id, o.created_at, SUM(oi.qty * oi.price_cents) AS total_cents
FROM orders o
JOIN order_items oi ON oi.order_id = o.id
WHERE o.user_id = $1
GROUP BY o.id
ORDER BY o.created_at DESC
LIMIT 50;
Interview tip: Normalize by default for integrity and flexibility; denormalize selectively for read-heavy hot paths. Call out specific materializations (e.g., “order_total_cents” column refreshed via trigger or job).

3) Document Model — Deep Dive

Aggregate-Centric Design

  • Store an aggregate (document) that you usually read/write as a whole.
  • Embed when child lifecycle is bound to parent and access is co-located.
  • Reference when many-to-many, large unbounded arrays, or cross-aggregate reuse.

Indexing & Access

  • Support equality/range on top-level & nested fields (items.sku), and compound indexes for multi-field filters.
  • Use partial or TTL indexes for lifecycle management.
  • Avoid “array-of-huge-subs” that frequently append beyond page size—can cause doc moves and write amplification.

Atomicity & Concurrency

  • Most document stores guarantee atomic updates within a single document. Cross-document transactions may be limited or costlier.
  • Prefer idempotent updates and upserts for retried writes.

Aggregation Pipeline Example

// Total spend per user over last 30 days
db.orders.aggregate([
  { $match: { createdAt: { $gte: ISODate("2026-01-01") } } },
  { $unwind: "$items" },
  { $group: {
      _id: "$userId",
      totalCents: { $sum: { $multiply: [ "$items.qty", "$items.priceCents" ] } }
  }},
  { $sort: { totalCents: -1 } },
  { $limit: 50 }
]);
Gotcha: “Embed everything” fails with growing, unbounded arrays (e.g., activity feeds). Prefer references + fan-out-on-write/read patterns, or a separate collection.

4) Graph Models — Deep Dive (Property Graphs & RDF)

When Graphs Win

  • Multi-hop traversals (friends-of-friends, shortest path, role hierarchies).
  • Query patterns are relationship-centric more than attribute-centric.

Property Graph vs RDF

AspectProperty Graph (e.g., Neo4j/Cypher)RDF (SPARQL)
Data ShapeNodes/edges with key-value propertiesTriples (subject, predicate, object)
Query StylePattern matching with path syntaxTriple patterns with optional reasoning
StrengthOperational traversals, intuitive modelingSemantic web, vocabularies, inference

Cypher Example

// Potential recommendations: friends-of-friends who liked the same SKU
MATCH (u:User {id: 42})-[:FRIEND]-(f)-[:FRIEND]-(fof),
      (u)-[:PURCHASED]-(:Item {sku: "ABC-123"}),
      (fof)-[:PURCHASED]-(:Item {sku: "ABC-123"})
RETURN DISTINCT fof
ORDER BY fof.lastActive DESC
LIMIT 20;

SPARQL Example

# Get people who manage someone who manages Alice
PREFIX ex: <http://example.org/schema#>
SELECT DISTINCT ?person WHERE {
  ?person ex:manages ?x .
  ?x ex:manages ex:Alice .
}
Interview tip: For depth-first/variable-length path queries, graphs avoid N self-joins and complex recursive CTEs. Always start with a selective index-bound node to keep traversals tight.

5) Query Languages: SQL, Aggregation Pipelines, Graph

Declarative vs Imperative

  • SQL / SPARQL / Cypher: Declare what you want; engine plans how.
  • MapReduce/Agg Pipelines: Imperative or staged; great for ETL and batch analytics.

SQL Essentials for Interviews

  • JOINs, GROUP BY, HAVING, window functions (e.g., ROW_NUMBER, LAG).
  • Use EXPLAIN/EXPLAIN ANALYZE to surface plan, rows, and index usage.

Aggregation Pipelines

Think of each stage as a relational transform (filter, project, group, join-like via $lookup), but executed close to document structure.

Graph Queries

Pattern-first: match nodes and relationships, bind variables, constrain with predicates, then RETURN. Consider traversal bounds (depth, labels) and path uniqueness.

6) Schema Flexibility & Evolution Patterns

Schema-on-Write vs Schema-on-Read

  • Schema-on-write (relational): strong guarantees upfront; migrations needed.
  • Schema-on-read (document): tolerant storage; contracts enforced in app and readers.

Compatibility & Versioning

  • Additive changes are easiest (new nullable/optional fields).
  • Breaking changes: use dual-read (read v1 or v2) and/or dual-write during backfills.
  • Keep a data migration playbook: cutover plan, backfill jobs, metrics, rollback.

Operational Playbook (Example)

  1. Introduce new column/field with defaults.
  2. Write dual (old + new) or derive on read.
  3. Backfill in batches with idempotent jobs.
  4. Flip reads to new schema; monitor errors/latency.
  5. Remove old paths once stable; add constraints.

7) Joins, Locality & Denormalization

Trade-offs

  • Joins: powerful, but expensive across shards and large tables without selective indexes.
  • Locality: embedding (docs) or clustering (RDBMS) reduces random IO for common reads.
  • Denormalization: duplicate/derive data to speed reads (materialized views, counters)—manage staleness.

Patterns

  • Materialized aggregations: e.g., order_total_cents updated via trigger/job.
  • Fan-out on write (eager) vs fan-out on read (lazy) for feeds.
  • Read models tailored for screens (CQRS-lite).

8) Polyglot Persistence & Decision Frameworks

Choose by Access Pattern

  • Ad-hoc, cross-entity analytics → Relational.
  • Aggregate reads/writes, hierarchical JSON → Document.
  • Multi-hop relationship traversals → Graph.

Secondary Criteria

  • Consistency needs (constraints, transactions).
  • Sharding strategy and growth projections.
  • Team expertise and operational maturity.

Decision Matrix (Quick Start)

RequirementRelationalDocumentGraph
Complex joins and ad-hoc SQL★★★★★★★★★
Nested, variable schema aggregates★★★★★★★★★
Multi-hop traversals (variable length)★★★★★
Strict integrity (FKs, constraints)★★★★★★★★★★
Horizontal scalability (ops simplicity)★★★★★★★★★★

9) Performance Fundamentals: Indexing, Joins & Plans

Index Heuristics

  • High-selectivity predicates deserve indexes; avoid low-cardinality leading columns.
  • Composite index order: equality columns → range columns → order by/group by.
  • Covering indexes (include columns) reduce lookups on hot queries.
  • Beware write amplification: every index must be maintained on insert/update/delete.

Join Algorithms (Mental Cost)

  • Nested loop: great when inner side is indexed and selective.
  • Hash join: good for large, unsorted inputs; memory trade-offs.
  • Sort-merge: efficient when both inputs are pre-sorted.

Sharding Impact

  • Cross-shard joins get expensive; prefer collocation or pre-join denormalized views.
  • Graph traversals across shards require careful start-node routing and hop limits.

10) Case Studies

E‑commerce Catalog & Orders

  • Catalog: Document store for product pages (variable attributes), plus search index. Keep global constraints (SKU uniqueness) in a relational sidecar or enforce at app with careful upserts.
  • Orders: Relational core (joins, accounting integrity) + materialized totals; optionally emit denormalized read models per user in a doc store.
// Document: product
{
  "sku": "ABC-123",
  "title": "Widget",
  "attrs": { "color": "red", "size": "M", "material": "cotton" },
  "media": [{ "type": "image", "url": "..." }],
  "pricing": { "list": 1999, "sale": 1499 },
  "availability": { "region": "US", "stock": 42 }
}

Social Graph & Feed

  • Graph DB for relationships, suggestions, ACLs.
  • Fan-out strategies:
    • Small creators: fan-out-on-write to followers’ inboxes (doc store).
    • Large creators: fan-out-on-read with caching and rankers.

Access Control (RBAC/ABAC)

  • Graph model for SUBJECT -[HAS_ROLE]-> ROLE -[GRANTS]-> PERMISSION and resource hierarchies.
  • Materialize effective permissions per user/resource pair for hot reads.

11) Interview Playbook: Prompts, Answers & Trade‑offs

Common Prompts & How to Attack

  1. “Would you use relational or document for a user profile with settings and recent activity?”
    Start with document for the profile aggregate (flexible fields, one-shot reads). Keep activity in a separate collection (append-only). If you need global integrity (unique email), enforce with unique index or a relational registry.
  2. “Model and query a friends-of-friends recommendation.”
    If traversals dominate: property graph + Cypher. If data volume is small/moderate, relational with recursive CTEs is fine; ensure start node selectivity and hop limits.
  3. “We need ad-hoc analytics across many entities.”
    Relational warehouse (columnar MPP). ETL from operational stores; star schemas; materialized aggregates; partitioning by date.
  4. “Schema evolves weekly—how do we keep velocity?”
    Document store for fast-changing aggregates; enforce contracts at service boundary; additive changes; dual-read for breaking changes; telemetry on field usage and null rates.
  5. “Cross-shard joins are slow—what now?”
    Co-locate related partitions (same shard key), pre-join via denormalized views, or move the query to a warehouse/OLAP job. Consider changing shard key to match top access pattern.
  6. “Design product search + filter.”
    Primary store (doc or relational), plus a search index (e.g., inverted index). Keep source of truth authoritative; changes flow via CDC/streams. Use incremental reindex and dual reads during rebuilds.

Sound-Bite Trade‑offs

  • “Normalize for integrity; denormalize for speed—measure and materialize hot paths.”
  • “Graphs shine when the relationship is the query.”
  • “Pick your model by query shape, not by fashion.”
  • “Schema changes are constant—opt for additive changes and staged cutovers.”

Mini Whiteboard Checklist

  1. List entities + cardinalities (1‑1, 1‑N, N‑M).
  2. Write top 5 queries (with filters/sorts) and read/write ratios.
  3. Choose model(s) and justify with access pattern fit.
  4. Define indexes (composite order) and sharding if needed.
  5. Plan schema evolution and denormalized read models.
  6. Call out failure modes, staleness, and backfill/rollbacks.

12) Red Flags & Anti‑Patterns

  • ORM-first design that hides poor data modeling decisions.
  • Massive unbounded arrays embedded in documents.
  • Global uniqueness enforced only in app code under concurrency.
  • EAV everywhere leading to unreadable queries and no constraints.
  • Composite indexes in the wrong order; low-selectivity leading columns.
  • Cross-shard joins without collocation strategy.

13) Cheat Sheets & Heuristics

Composite Index Order

  1. Equality filters first (highest selectivity).
  2. Then range filter columns.
  3. Then columns used in ORDER BY/GROUP BY.

Embed vs Reference (Docs)

  • Embed if co-read, bounded size, same lifecycle.
  • Reference if N‑M, unbounded, or shared across aggregates.

Graph Traversal Hygiene

  • Always anchor on selective start nodes.
  • Bound path length; filter by labels early.
  • Materialize frequently traversed neighborhoods.

14) Summary

  • Relational excels at integrity, joins, and ad-hoc analytics.
  • Document excels at aggregate locality, flexible schemas, and agile iteration.
  • Graph excels at multi-hop, relationship-centric queries.
  • Choose by access patterns; evolve with additive changes; materialize hot paths; measure and iterate.

Credits: This post is an original, interview-focused synthesis inspired by concepts from Designing Data‑Intensive Applications (Martin Kleppmann), Chapter 2, expressed in my own words.

Next step: Draft your top queries, pick a model using the decision matrix, then benchmark both normalized and denormalized designs for your real workload.

Post a Comment

0 Comments