Inspired by Designing Data-Intensive Applications, Chapter 2 — distilled for interviews and real-world engineering.
1) Why Data Models Matter
Data models aren’t just storage choices; they shape how we think, design APIs, enforce invariants, and scale. The “right” model aligns with access patterns, keeps hot paths simple, and minimizes accidental complexity.
What Models Decide
- Queryability: What’s easy/hard/expensive to ask.
- Integrity: Where constraints live (DB vs app).
- Locality: How much IO per request.
- Evolution: Cost of change over time.
Rule of Thumb
Choose models by dominant access patterns, relationship complexity, and consistency requirements—then validate with prototypes.
2) Relational Model — Deep Dive
Normalization & Integrity
- 1NF: Atomic values; no repeating groups.
- 2NF/3NF: No partial/transitive dependency on keys—prevents update anomalies.
- BCNF: Every determinant is a key—stricter than 3NF.
- Keys: Prefer surrogate (e.g., BIGINT) when natural keys are wide/unstable.
- Constraints: Primary/foreign keys, unique, check; push invariants into the database when possible.
Join Semantics
- Inner (intersection), left/right (preserve one side), full, cross, semi/anti (via
EXISTS/NOT EXISTS). - Join selectivity and index coverage dominate latency in OLTP.
Query Planning (Mental Model)
- Indexes reduce scanned rows; statistics inform join order; cost-based optimizers pick algorithms (hash join, sort-merge, nested loop).
- Composite indexes: order by equality → range → order-by columns. “Leftmost prefix” matters.
Modeling Patterns
- Many-to-many: junction table with composite unique constraint.
- Polymorphic associations: either a type+id pair (with app-enforced FK) or separate tables + views.
- Soft deletes:
deleted_atplus partial indexes to keep uniqueness for active rows. - EAV (Entity-Attribute-Value): flexible but often a query/constraint anti-pattern; prefer JSON column + targeted indexes if you must.
Example: Orders, Items, Indexing
-- Core tables
CREATE TABLE orders (
id BIGSERIAL PRIMARY KEY,
user_id BIGINT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at DESC);
CREATE TABLE order_items (
id BIGSERIAL PRIMARY KEY,
order_id BIGINT NOT NULL REFERENCES orders(id),
sku TEXT NOT NULL,
qty INT NOT NULL CHECK (qty > 0),
price_cents INT NOT NULL CHECK (price_cents >= 0)
);
CREATE INDEX idx_items_order ON order_items(order_id);
-- Top query: recent orders with totals for a user
SELECT o.id, o.created_at, SUM(oi.qty * oi.price_cents) AS total_cents
FROM orders o
JOIN order_items oi ON oi.order_id = o.id
WHERE o.user_id = $1
GROUP BY o.id
ORDER BY o.created_at DESC
LIMIT 50;
3) Document Model — Deep Dive
Aggregate-Centric Design
- Store an aggregate (document) that you usually read/write as a whole.
- Embed when child lifecycle is bound to parent and access is co-located.
- Reference when many-to-many, large unbounded arrays, or cross-aggregate reuse.
Indexing & Access
- Support equality/range on top-level & nested fields (
items.sku), and compound indexes for multi-field filters. - Use partial or TTL indexes for lifecycle management.
- Avoid “array-of-huge-subs” that frequently append beyond page size—can cause doc moves and write amplification.
Atomicity & Concurrency
- Most document stores guarantee atomic updates within a single document. Cross-document transactions may be limited or costlier.
- Prefer idempotent updates and upserts for retried writes.
Aggregation Pipeline Example
// Total spend per user over last 30 days
db.orders.aggregate([
{ $match: { createdAt: { $gte: ISODate("2026-01-01") } } },
{ $unwind: "$items" },
{ $group: {
_id: "$userId",
totalCents: { $sum: { $multiply: [ "$items.qty", "$items.priceCents" ] } }
}},
{ $sort: { totalCents: -1 } },
{ $limit: 50 }
]);
4) Graph Models — Deep Dive (Property Graphs & RDF)
When Graphs Win
- Multi-hop traversals (friends-of-friends, shortest path, role hierarchies).
- Query patterns are relationship-centric more than attribute-centric.
Property Graph vs RDF
| Aspect | Property Graph (e.g., Neo4j/Cypher) | RDF (SPARQL) |
|---|---|---|
| Data Shape | Nodes/edges with key-value properties | Triples (subject, predicate, object) |
| Query Style | Pattern matching with path syntax | Triple patterns with optional reasoning |
| Strength | Operational traversals, intuitive modeling | Semantic web, vocabularies, inference |
Cypher Example
// Potential recommendations: friends-of-friends who liked the same SKU
MATCH (u:User {id: 42})-[:FRIEND]-(f)-[:FRIEND]-(fof),
(u)-[:PURCHASED]-(:Item {sku: "ABC-123"}),
(fof)-[:PURCHASED]-(:Item {sku: "ABC-123"})
RETURN DISTINCT fof
ORDER BY fof.lastActive DESC
LIMIT 20;
SPARQL Example
# Get people who manage someone who manages Alice
PREFIX ex: <http://example.org/schema#>
SELECT DISTINCT ?person WHERE {
?person ex:manages ?x .
?x ex:manages ex:Alice .
}
5) Query Languages: SQL, Aggregation Pipelines, Graph
Declarative vs Imperative
- SQL / SPARQL / Cypher: Declare what you want; engine plans how.
- MapReduce/Agg Pipelines: Imperative or staged; great for ETL and batch analytics.
SQL Essentials for Interviews
- JOINs, GROUP BY, HAVING, window functions (e.g.,
ROW_NUMBER,LAG). - Use EXPLAIN/EXPLAIN ANALYZE to surface plan, rows, and index usage.
Aggregation Pipelines
Think of each stage as a relational transform (filter, project, group, join-like via $lookup), but executed close to document structure.
Graph Queries
Pattern-first: match nodes and relationships, bind variables, constrain with predicates, then RETURN. Consider traversal bounds (depth, labels) and path uniqueness.
6) Schema Flexibility & Evolution Patterns
Schema-on-Write vs Schema-on-Read
- Schema-on-write (relational): strong guarantees upfront; migrations needed.
- Schema-on-read (document): tolerant storage; contracts enforced in app and readers.
Compatibility & Versioning
- Additive changes are easiest (new nullable/optional fields).
- Breaking changes: use dual-read (read v1 or v2) and/or dual-write during backfills.
- Keep a data migration playbook: cutover plan, backfill jobs, metrics, rollback.
Operational Playbook (Example)
- Introduce new column/field with defaults.
- Write dual (old + new) or derive on read.
- Backfill in batches with idempotent jobs.
- Flip reads to new schema; monitor errors/latency.
- Remove old paths once stable; add constraints.
7) Joins, Locality & Denormalization
Trade-offs
- Joins: powerful, but expensive across shards and large tables without selective indexes.
- Locality: embedding (docs) or clustering (RDBMS) reduces random IO for common reads.
- Denormalization: duplicate/derive data to speed reads (materialized views, counters)—manage staleness.
Patterns
- Materialized aggregations: e.g.,
order_total_centsupdated via trigger/job. - Fan-out on write (eager) vs fan-out on read (lazy) for feeds.
- Read models tailored for screens (CQRS-lite).
8) Polyglot Persistence & Decision Frameworks
Choose by Access Pattern
- Ad-hoc, cross-entity analytics → Relational.
- Aggregate reads/writes, hierarchical JSON → Document.
- Multi-hop relationship traversals → Graph.
Secondary Criteria
- Consistency needs (constraints, transactions).
- Sharding strategy and growth projections.
- Team expertise and operational maturity.
Decision Matrix (Quick Start)
| Requirement | Relational | Document | Graph |
|---|---|---|---|
| Complex joins and ad-hoc SQL | ★★★★★ | ★★ | ★★ |
| Nested, variable schema aggregates | ★★ | ★★★★★ | ★★ |
| Multi-hop traversals (variable length) | ★ | ★ | ★★★★★ |
| Strict integrity (FKs, constraints) | ★★★★★ | ★★ | ★★★ |
| Horizontal scalability (ops simplicity) | ★★★ | ★★★★ | ★★★ |
9) Performance Fundamentals: Indexing, Joins & Plans
Index Heuristics
- High-selectivity predicates deserve indexes; avoid low-cardinality leading columns.
- Composite index order: equality columns → range columns → order by/group by.
- Covering indexes (include columns) reduce lookups on hot queries.
- Beware write amplification: every index must be maintained on insert/update/delete.
Join Algorithms (Mental Cost)
- Nested loop: great when inner side is indexed and selective.
- Hash join: good for large, unsorted inputs; memory trade-offs.
- Sort-merge: efficient when both inputs are pre-sorted.
Sharding Impact
- Cross-shard joins get expensive; prefer collocation or pre-join denormalized views.
- Graph traversals across shards require careful start-node routing and hop limits.
10) Case Studies
E‑commerce Catalog & Orders
- Catalog: Document store for product pages (variable attributes), plus search index. Keep global constraints (SKU uniqueness) in a relational sidecar or enforce at app with careful upserts.
- Orders: Relational core (joins, accounting integrity) + materialized totals; optionally emit denormalized read models per user in a doc store.
// Document: product
{
"sku": "ABC-123",
"title": "Widget",
"attrs": { "color": "red", "size": "M", "material": "cotton" },
"media": [{ "type": "image", "url": "..." }],
"pricing": { "list": 1999, "sale": 1499 },
"availability": { "region": "US", "stock": 42 }
}
Social Graph & Feed
- Graph DB for relationships, suggestions, ACLs.
- Fan-out strategies:
- Small creators: fan-out-on-write to followers’ inboxes (doc store).
- Large creators: fan-out-on-read with caching and rankers.
Access Control (RBAC/ABAC)
- Graph model for
SUBJECT -[HAS_ROLE]-> ROLE -[GRANTS]-> PERMISSIONand resource hierarchies. - Materialize effective permissions per user/resource pair for hot reads.
11) Interview Playbook: Prompts, Answers & Trade‑offs
Common Prompts & How to Attack
- “Would you use relational or document for a user profile with settings and recent activity?”
Start with document for the profile aggregate (flexible fields, one-shot reads). Keep activity in a separate collection (append-only). If you need global integrity (unique email), enforce with unique index or a relational registry. - “Model and query a friends-of-friends recommendation.”
If traversals dominate: property graph + Cypher. If data volume is small/moderate, relational with recursive CTEs is fine; ensure start node selectivity and hop limits. - “We need ad-hoc analytics across many entities.”
Relational warehouse (columnar MPP). ETL from operational stores; star schemas; materialized aggregates; partitioning by date. - “Schema evolves weekly—how do we keep velocity?”
Document store for fast-changing aggregates; enforce contracts at service boundary; additive changes; dual-read for breaking changes; telemetry on field usage and null rates. - “Cross-shard joins are slow—what now?”
Co-locate related partitions (same shard key), pre-join via denormalized views, or move the query to a warehouse/OLAP job. Consider changing shard key to match top access pattern. - “Design product search + filter.”
Primary store (doc or relational), plus a search index (e.g., inverted index). Keep source of truth authoritative; changes flow via CDC/streams. Use incremental reindex and dual reads during rebuilds.
Sound-Bite Trade‑offs
- “Normalize for integrity; denormalize for speed—measure and materialize hot paths.”
- “Graphs shine when the relationship is the query.”
- “Pick your model by query shape, not by fashion.”
- “Schema changes are constant—opt for additive changes and staged cutovers.”
Mini Whiteboard Checklist
- List entities + cardinalities (1‑1, 1‑N, N‑M).
- Write top 5 queries (with filters/sorts) and read/write ratios.
- Choose model(s) and justify with access pattern fit.
- Define indexes (composite order) and sharding if needed.
- Plan schema evolution and denormalized read models.
- Call out failure modes, staleness, and backfill/rollbacks.
12) Red Flags & Anti‑Patterns
- ORM-first design that hides poor data modeling decisions.
- Massive unbounded arrays embedded in documents.
- Global uniqueness enforced only in app code under concurrency.
- EAV everywhere leading to unreadable queries and no constraints.
- Composite indexes in the wrong order; low-selectivity leading columns.
- Cross-shard joins without collocation strategy.
13) Cheat Sheets & Heuristics
Composite Index Order
- Equality filters first (highest selectivity).
- Then range filter columns.
- Then columns used in ORDER BY/GROUP BY.
Embed vs Reference (Docs)
- Embed if co-read, bounded size, same lifecycle.
- Reference if N‑M, unbounded, or shared across aggregates.
Graph Traversal Hygiene
- Always anchor on selective start nodes.
- Bound path length; filter by labels early.
- Materialize frequently traversed neighborhoods.
14) Summary
- Relational excels at integrity, joins, and ad-hoc analytics.
- Document excels at aggregate locality, flexible schemas, and agile iteration.
- Graph excels at multi-hop, relationship-centric queries.
- Choose by access patterns; evolve with additive changes; materialize hot paths; measure and iterate.

0 Comments