In-depth, interview-ready patterns with trade‑offs, failure modes, and metrics. Written for whiteboard discussions and production reality.
How should I use this post?
We begin with correctness under failure, move through scaling patterns, and finish with operability & evolution. Each section opens with a crisp Thesis, then a pragmatic design, trade‑offs, failure modes, and metrics, so you can speak fluently at the whiteboard and reason about production reality.
How do you achieve exactly‑once processing for payments and orders?
Transition: Reliability starts with correctness. Rather than chasing mythical “exactly‑once”, we compose guarantees we can actually implement.
Thesis: True exactly‑once is infeasible in distributed systems; we achieve it with at‑least‑once delivery + idempotency and/or a transactional outbox with deduplication.
Design
- Generate an idempotency_key per request (client or gateway).
- Begin a DB transaction: upsert a PENDING payment row with a unique constraint on the idempotency key.
- In the same transaction, append to an outbox table (topic + payload + event_id).
- A background consumer (or CDC) reads outbox events and calls the external gateway.
- Consumer is idempotent: maintains a processed set (table) keyed by event_id.
- Reconciliation job queries the gateway by request_id to resolve ambiguous states.
-- Producer (RDBMS)
BEGIN;
INSERT INTO payments(request_id, user_id, amount, status)
VALUES($rid,$uid,$amt,'PENDING')
ON CONFLICT (request_id) DO NOTHING;
INSERT INTO outbox(event_id, topic, payload)
VALUES(gen_event_id(), 'ChargeRequested', jsonb_build_object('request_id',$rid,'user_id',$uid,'amount',$amt));
COMMIT;
// Consumer (pseudo)
for event in stream('ChargeRequested'):
    if processed.exists(event.id): continue
    res = gateway.charge(event.payload.request_id, event.payload.user_id, event.payload.amount)
    upsert(payments, request_id=event.payload.request_id, status=res.status, txn_id=res.txn_id)
    processed.add(event.id)
Failure Modes & Fixes
- Gateway charged, DB write failed → reconciliation queries gateway by request_id, upserts final status.
- Consumer crash mid‑charge → safe to retry; idempotency key prevents double charge.
- Duplicate client requests → unique constraint on request_id collapses duplicates.
Metrics
- Outbox lag, duplicate rate, retry counts, reconciliation backlog, success rate, P95/P99 end‑to‑end latency.
How do you choose a shard key, and when should you change it?
Transition: With correctness guarded, we scale state by partitioning it intelligently.
Thesis: Choose a shard key that aligns with dominant access patterns and spreads load; change it when hotspots or scatter‑gather dominate.
Design & Migration
- Start: hash(user_id) for uniform load in user‑centric apps; co‑locate related entities by the same key.
- Scans/time‑series: range by date with “today” bucketed/salted to avoid hotspots.
- Change key when skew is high or cross‑shard queries dominate.
- Migration: introduce a router supporting v1/v2 → dual‑write → idempotent backfill → flip reads gradually → retire v1.
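The dual‑write migration above can be sketched as a router that writes under both key schemes while reads stay on v1 until the flip. A minimal Python sketch: the shard counts, shard names, and `Router` shape are illustrative assumptions, not a fixed API.

```python
import hashlib

def shard_v1(user_id: str, n: int = 4) -> str:
    """Old scheme: hash(user_id) over 4 shards (assumed)."""
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return f"v1-{h % n}"

def shard_v2(user_id: str, n: int = 16) -> str:
    """New scheme: same hash, more shards (assumed)."""
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return f"v2-{h % n}"

class Router:
    """Dual-writes to v1 and v2; reads flip per flag (in practice, gradually by percentage)."""
    def __init__(self):
        self.read_from_v2 = False
        self.stores = {}  # shard name -> {key: value}; stands in for real shards

    def write(self, user_id, value):
        # Dual-write: v1 stays authoritative until the read flip completes.
        for shard in (shard_v1(user_id), shard_v2(user_id)):
            self.stores.setdefault(shard, {})[user_id] = value

    def read(self, user_id):
        shard = shard_v2(user_id) if self.read_from_v2 else shard_v1(user_id)
        return self.stores.get(shard, {}).get(user_id)
```

The backfill step would replay existing v1 rows through `write` idempotently before flipping `read_from_v2`.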
Metrics
- Shard skew (top‑1% QPS vs median), cross‑shard query ratio, rebalancing time, hottest‑key QPS.
How would you scale reads from 10k RPS to 1M RPS?
Transition: Partitioning is one axis; caching and read replicas multiply capacity while preserving correctness signals.
Thesis: Layer edge caching, request coalescing, lag‑aware replicas, and materialized views with read‑your‑writes via session stickiness or versioning.
Playbook
- CDN/edge: cache JSON/HTML with ETag or signed URLs; vary by auth if needed.
- App cache: soft TTL + background refresh; single‑flight to stop stampedes.
- DB replicas: lag‑aware routing; post‑write flows stick to leader for N seconds.
- Precompute hot aggregates via materialized views; maintain by triggers/CDC/jobs.
- Salt extremely hot keys; consistent hashing for cache cluster.
// Stampede guard (sketch)
value = cache.get(key)
if (!value) {
  if (lock.try_acquire(key, 2000 /*ms*/)) {
    try {
      value = recompute()
      cache.set(key, value, /*soft_ttl*/ 60)
    } finally {
      lock.release(key)   // always release, or the key stays locked until TTL
    }
  } else {
    value = cache.get_stale_or_placeholder(key)
  }
}
Metrics
- CDN hit rate, coalesced request ratio, cache fill latency, replica lag distribution, P95/P99.
How do you prevent cascading failures with backpressure and graceful degradation?
Transition: At high load, stability is about bounding everything and degrading gracefully.
Thesis: Combine timeouts, bounded queues, circuit breakers, rate limits, and load shedding with fallbacks.
Design
- Jittered exponential backoff; budget retries to stay within client SLA.
- Open breaker on sustained errors/latency; fast‑fail with cached/approx results.
- Bounded worker pools and queues; prioritize critical traffic.
- Adaptive concurrency per dependency; queue depth‑based backpressure.
// Retry with backoff + jitter (pseudo)
for (i = 0; i < max; i++) {
  try { return call() }
  catch (e) {
    if (i == max - 1) throw e  // budget exhausted: surface the error
    sleep(base * (2 ** i) + rand(0, jitter))
  }
}
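The breaker mentioned above can be sketched in a few lines: open after sustained errors, fast‑fail with a fallback, probe again after a cooldown. Python sketch with `threshold` and `cooldown_s` as assumed tunables; production breakers also track latency and budget half‑open probes.

```python
import time

class CircuitBreaker:
    """Fast-fails after `threshold` consecutive errors; probes again after `cooldown_s`."""
    def __init__(self, threshold=5, cooldown_s=30.0, clock=time.monotonic):
        self.threshold, self.cooldown_s, self.clock = threshold, cooldown_s, clock
        self.failures, self.opened_at = 0, None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                return fallback()      # open: shed load, serve cached/approx result
            self.opened_at = None      # half-open: let one probe through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()   # trip the breaker
            return fallback()
        self.failures = 0              # success closes the breaker
        return result
```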
Metrics
- Retry rate, breaker open time, queue depth, tail latency, dropped/served‑stale counts.
How do you design a multi‑region active‑active system and pick consistency?
Transition: Geography adds partitions and latency; choose consistency per operation.
Thesis: For read latency and availability, use multi‑leader/leaderless with conflict resolution. For strict invariants, enforce a single writer per entity or a globally serializable region.
Patterns
- Per‑entity owner region for payments/unique constraints; reads local, writes coordinated.
- CRDTs or app‑merge for collaborative docs, counts, likes.
- Failover drills; write admission control during recovery.
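As one concrete CRDT for the counts/likes case above, a grow‑only counter keeps one slot per region and merges by element‑wise max, so replicas converge regardless of delivery order or duplication. Minimal Python sketch; region names are illustrative.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per region; merge = element-wise max."""
    def __init__(self, region: str):
        self.region = region
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1):
        self.counts[self.region] = self.counts.get(self.region, 0) + n

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter"):
        # Commutative, associative, idempotent: safe under any gossip order.
        for region, n in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), n)
```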
Metrics
- RPO/RTO, cross‑region RTT budget, conflict rate, reconciliation backlog.
How do you run zero‑downtime schema migrations?
Transition: Systems live and change; evolve without breaking readers or writes.
Thesis: Apply expand → migrate → contract with dual‑read/write and idempotent backfills.
Steps
- Expand: add new columns/tables/indexes (nullable).
- Dual‑write: populate new path on all new writes.
- Backfill in batches (WHERE new_col IS NULL), throttled to keep replica lag healthy.
- Dual‑read until confident; then flip reads.
- Contract: remove old code/columns when safe.
ALTER TABLE orders ADD COLUMN total_cents BIGINT;
-- Batched backfill (sketch):
UPDATE orders o
SET total_cents = s.sum_cents
FROM (
SELECT order_id, SUM(qty*price_cents) AS sum_cents
FROM order_items
WHERE order_id BETWEEN $lo AND $hi
GROUP BY order_id
) s
WHERE o.id = s.order_id AND o.total_cents IS NULL;
Metrics
- Null‑rate over time, backfill throughput, replica lag, error deltas pre/post flip.
What does good day‑one observability look like?
Transition: You can’t run what you can’t see—observability underpins reliability.
Thesis: Ship traces, metrics, and structured logs with correlation IDs; define SLOs and runbooks.
Baseline
- Tracing via OpenTelemetry; propagate trace_id/span_id and safe baggage.
- Metrics: golden signals (latency, traffic, errors, saturation) per endpoint and tenant.
- Structured JSON logs with request_id, hashed user/tenant, status, latency.
- SLOs tied to UX; alerts on error‑budget burn rate; synthetic checks.
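A structured log line per the baseline above might look like this Python sketch; the field names and the `REDACTED_FIELDS` set are assumptions to adapt to your schema.

```python
import hashlib
import json
import time

REDACTED_FIELDS = {"email", "card_number"}  # assumption: your PII list will differ

def log_event(request_id: str, user_id: str, status: int, latency_ms: float, **fields):
    """One JSON object per line: greppable, parseable, with PII hashed or dropped."""
    record = {
        "ts": time.time(),
        "request_id": request_id,
        # Hash the user id so logs can be correlated without storing raw PII.
        "user_hash": hashlib.sha256(user_id.encode()).hexdigest()[:16],
        "status": status,
        "latency_ms": latency_ms,
    }
    for k, v in fields.items():
        record[k] = "[REDACTED]" if k in REDACTED_FIELDS else v
    line = json.dumps(record)
    print(line)
    return line
```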
Anti‑Patterns
- PII in logs, high‑cardinality label explosions, alert fatigue, dashboards without runbooks.
Should product catalog and orders live in the same database?
Transition: Data modeling choices affect reliability and scale.
Answer: Usually no. Catalog (variable attributes) → document store + search index. Orders (financial correctness) → relational with constraints and materialized totals. Sync via CDC to denormalized read models.
Risks if Combined
- OLTP and search/analytics contention; long scans harming write latency; rigidity for catalog evolution.
When should you join vs denormalize/materialize?
Transition: Performance and integrity trade‑offs meet access patterns.
Heuristic: Normalize when writes are hot and constraints matter; denormalize/materialize when reads dominate and aggregates are stable, accepting bounded staleness.
Safe Denormalization
- Materialized views (DB‑native or job‑maintained).
- Derived columns (e.g., order_total_cents) maintained with triggers/CDC jobs.
- Version derivations to avoid circular dependencies.
Pitfalls
- Silent drift, multi‑source truth, cross‑service coupling.
When do you pick a graph database over SQL?
Transition: Sometimes, relationships are the query.
Choose Graph if: Variable‑length traversals, path constraints, role hierarchies, FOAF/recommendations—and you can anchor traversals on selective nodes.
Otherwise
- For fixed‑depth joins or analytics, relational with recursive CTEs or precomputed adjacency lists is often faster and simpler operationally.
-- Recursive CTE (FoF ≤ 2)
WITH RECURSIVE fof(id, depth) AS (
SELECT friend_id, 1 FROM friends WHERE user_id = $1
UNION ALL
SELECT f.friend_id, fof.depth + 1
FROM friends f JOIN fof ON f.user_id = fof.id
WHERE fof.depth < 2
)
SELECT DISTINCT id FROM fof;
How do you design the right index for a specific query?
Transition: Scalability depends on making the planner’s life easy.
Example Query
SELECT id, created_at, total_cents
FROM orders
WHERE user_id = ?
AND created_at BETWEEN ? AND ?
ORDER BY created_at DESC
LIMIT 50;
Rule
Equality → Range → Order (and make it covering).
CREATE INDEX idx_orders_user_created
ON orders(user_id, created_at DESC)
INCLUDE (total_cents);
Watchouts
- Low‑cardinality leading columns, too many indexes (write amplification), stale statistics.
How do you implement fair multi‑tenant rate limiting?
Transition: Protect the platform and ensure fairness.
Thesis: Use hierarchical token buckets—global → tenant → endpoint—with bursts and sustained rates, plus Retry-After hints.
Design
- Global ceiling to protect shared resources.
- Per‑tenant bucket sized by plan; per‑endpoint bucket for expensive operations.
- Distributed enforcement with sharded counters or sliding‑window approximations.
// Token bucket (multi-tenant sketch)
allow(tenant, path, cost=1):
    now = monotonic()
    refill_all(now)
    if bucket[tenant, path].tokens >= cost:
        bucket[tenant, path].tokens -= cost
        return ALLOW
    else:
        return DENY with Retry-After
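The sketch above leaves `refill_all` abstract; a runnable version of the hierarchy (global AND tenant buckets must both have tokens) could look like this in Python. Rates, burst sizes, and class names are illustrative assumptions; note this sketch still consumes a tenant token when the global check then denies, which production code would refund.

```python
import time

class Bucket:
    """Classic token bucket: refill at `rate` tokens/sec up to `burst`."""
    def __init__(self, rate: float, burst: float, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        self.tokens, self.last = burst, clock()

    def take(self, cost: float = 1.0) -> bool:
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

class HierarchicalLimiter:
    """A request passes only if the tenant bucket AND the global bucket allow it."""
    def __init__(self, global_bucket: Bucket, tenant_buckets: dict[str, Bucket]):
        self.global_bucket, self.tenants = global_bucket, tenant_buckets

    def allow(self, tenant: str, cost: float = 1.0) -> bool:
        b = self.tenants.get(tenant)
        return b is not None and b.take(cost) and self.global_bucket.take(cost)
```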
How would you design product search with filters and ranking?
Transition: Combining data modeling, indexing, and latency budgets.
Architecture
- Source of truth (relational/doc) → indexer (CDC/batch) → inverted index (denormalized, tokenized fields).
- Facets via precomputed counts/doc values; BM25 baseline ranking + freshness/availability boosts; optional learning‑to‑rank.
- Stale‑tolerant UX while index catches up; backpressure for large reindex jobs.
Metrics
- Index freshness lag, P95 search latency, facet accuracy, zero‑result rate.
What’s your disaster recovery plan if the primary region is down?
Transition: Reliability at org scale demands practiced recovery.
Thesis: Define RPO/RTO, replicate data cross‑region, practice failovers, and control write admission during recovery.
Plan
- Data: async cross‑region replicas (binlog/WAL shipping), multi‑AZ.
- Compute: infra‑as‑code templates; warm standby capacity.
- Failover: DNS/control‑plane switchover; shard ownership reassignment; block conflicting writes until caught up.
- Replay: durable logs for queues/outboxes; postmortems and improvements.
Metrics
- RPO achieved (bytes/seconds), RTO, recovery test frequency, config drift rate.
How do you protect user data while staying observable?
Transition: Maintainability includes responsible operations.
Thesis: Enforce least privilege and encryption; keep logs useful but sanitized; rotate secrets; test restores.
Controls
- Per‑service identities; scoped IAM; short‑lived tokens (OIDC).
- TLS in transit; envelope encryption at rest; key rotation.
- Structured logs with redacted/tokenized PII; preserve trace_id/request_id.
- Retention policies; GDPR/CCPA delete/export; encrypted, access‑logged backups with restore drills.
Pitfalls
- PII in traces, shared root creds, over‑permissive roles, unencrypted backups.
What user‑visible consistency guarantees will you provide?
Transition: Beyond database semantics, we promise session guarantees that users feel.
Answer
- Read‑your‑writes, monotonic reads, monotonic writes, and consistent prefix.
- Implement via sticky sessions to leader/region post‑write or session tokens carrying last‑seen version/timestamp; route reads accordingly.
- Use ETag/If-None-Match; include version in cache keys.
Metrics
- Leader‑read ratio post‑write, replica‑lag distribution, stale‑read rate.
Which isolation level prevents which anomalies, and when is SERIALIZABLE required?
Transition: Correctness also means preventing cross‑request anomalies.
Quick map
- Read Committed: avoids dirty reads; allows non‑repeatable reads & phantoms.
- Repeatable Read / Snapshot: prevents non‑repeatable reads; can allow write skew.
- Serializable: prevents all anomalies (2PL or SSI).
Example — Write Skew
-- Doctor-on-call: each txn checks count >= 2 on duty, then sets itself off duty.
-- Under snapshot isolation both read the same snapshot and both commit → zero on duty.
-- The writes touch different rows, so no write-write conflict is ever detected.
-- Fix: SERIALIZABLE (SSI aborts one txn), or SELECT ... FOR UPDATE to lock the checked rows.
When to use SERIALIZABLE
- Financial transfers, inventory guarantees, scheduling constraints, cross-row invariants.
Metrics
- Serialization failure (retry) rate, lock waits, long-running transactions.
When do you use Two‑Phase Commit vs a Saga?
Transition: Cross‑service changes require coordination or compensation.
2PC (distributed ACID)
- Coordinator prepares then commits; strong consistency but blocking risk on failures.
- Use in tight trust boundaries with low latency and strict SLAs.
Saga
- Sequence of local transactions with compensating actions.
- Orchestrated (central controller) or choreographed (events/outbox).
// Orchestrated saga (pseudo)
reserveInventory(); chargePayment(); confirmOrder();
compensate on failure: refundPayment(); releaseInventory();
Rule of thumb
- Prefer Saga when availability trumps global atomicity; design idempotent steps and compensations.
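An orchestrated saga reduces to: run the steps in order, and on failure run the completed steps' compensations in reverse. A minimal Python sketch under the stated assumption that compensations are idempotent and cannot themselves fail.

```python
def run_saga(steps):
    """Each step is an (action, compensation) pair of callables.
    On any failure, compensate the already-completed steps in reverse order."""
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()  # must be idempotent; the failed step itself is not compensated
        return False
    return True
```

`run_saga([(reserveInventory, releaseInventory), (chargePayment, refundPayment), (confirmOrder, cancelOrder)])` mirrors the pseudo above.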
How do you handle leader election and prevent split‑brain?
Transition: High availability hinges on safe failover.
Answer
- Use consensus (Raft/Paxos) to elect one leader.
- Protect shared resources with fencing tokens (monotonic lease terms) so old leaders can’t write after losing leadership.
- Readiness vs liveness health checks; conservative timers to avoid flapping.
// Fencing example with DB
UPDATE shard SET leader_id=$me, term=term+1
WHERE id=$id AND term=$observed_term;
How do you generate IDs at scale (UUID v7/ULID/Snowflake)?
Transition: IDs affect ordering, locality, and hot shards.
Options
- UUID v4: random, no order; index‑unfriendly.
- UUID v7/ULID: time‑ordered; better locality and pagination.
- Snowflake: 64‑bit (time + region + worker + seq); monotonic per worker, good sharding properties.
// Snowflake bits: 41 time | 5 region | 5 worker | 12 sequence
Guidance
- Prefer v7/ULID or Snowflake for ordered inserts and stable keyset pagination; ensure clock safety & sequence rollover handling.
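A Snowflake‑style generator with the 41|5|5|12 bit layout shown above, including the clock‑safety and sequence‑rollover handling the guidance calls for. Python sketch; the class name and error handling are assumptions.

```python
import threading
import time

class SnowflakeGen:
    """64-bit id: 41 bits ms-timestamp | 5 region | 5 worker | 12 sequence."""
    def __init__(self, region: int, worker: int,
                 clock=lambda: int(time.time() * 1000)):
        assert 0 <= region < 32 and 0 <= worker < 32
        self.region, self.worker, self.clock = region, worker, clock
        self.lock, self.last_ms, self.seq = threading.Lock(), -1, 0

    def next_id(self) -> int:
        with self.lock:
            now = self.clock()
            if now < self.last_ms:
                raise RuntimeError("clock moved backwards")  # clock safety
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF
                if self.seq == 0:  # sequence rollover: wait for the next millisecond
                    while now <= self.last_ms:
                        now = self.clock()
            else:
                self.seq = 0
            self.last_ms = now
            return (now << 22) | (self.region << 17) | (self.worker << 12) | self.seq
```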
How do you reason about time, clocks, and ordering?
Transition: Correct ordering without trusting wall‑clock.
Answer
- Wall clocks skew (NTP); don’t assume strict order.
- Use Lamport timestamps for causality; vector clocks to detect concurrency and resolve with app rules.
- Persist server‑assigned times; present user times with correction windows.
// Lamport: each event increments counter; on receive: counter = max(local, remote) + 1
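The Lamport rule in the comment above, as a small Python class (names are illustrative):

```python
class LamportClock:
    """Logical clock: local events tick; receives jump past the sender's time."""
    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """Local event or send: increment and stamp."""
        self.time += 1
        return self.time

    def recv(self, remote_time: int) -> int:
        """On receive: counter = max(local, remote) + 1, preserving causality."""
        self.time = max(self.time, remote_time) + 1
        return self.time
```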
How do you keep caches correct—invalidate or update?
Transition: Scale is easy; cache correctness is hard.
Patterns
- Cache‑aside: app populates; invalidate on write; race risk without versioning.
- Read‑through/Write‑through: cache mediates; simpler, higher write latency.
- Write‑behind: async to DB; needs durable queue and replay.
Keep it correct
- Use versioned keys (e.g., obj:{id}:{version}) so stale values don’t overwrite fresh.
- Soft TTL + background refresh; request coalescing under a lock.
// Safe invalidate (versioned)
db.update(...); cache.set("obj:42:v3", data); cache.del("obj:42:v2")
Which pagination strategy should you use: OFFSET/LIMIT, keyset, or tokens?
Transition: Pagination impacts performance and user consistency.
Guidance
- OFFSET/LIMIT: simple but slow/unstable on big offsets.
- Keyset (seek): stable & fast—use the last seen sort key.
- Continuation tokens: encode cursor + filters for complex sorts or sharded scans.
-- Keyset example
SELECT * FROM orders
WHERE user_id = $1 AND created_at < $cursor
ORDER BY created_at DESC
LIMIT 50;
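A continuation token, as mentioned above, can encode the cursor plus active filters as opaque base64 JSON. Python sketch; in production you would sign or encrypt the token so clients cannot forge or tamper with it.

```python
import base64
import json

def encode_cursor(created_at: str, order_id: int, filters: dict) -> str:
    """Opaque token: base64(JSON). The order_id tiebreaker keeps pages stable
    when many rows share one created_at."""
    payload = {"created_at": created_at, "order_id": order_id, "filters": filters}
    return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()

def decode_cursor(token: str) -> dict:
    """Inverse: decode the client-supplied token back into seek parameters."""
    return json.loads(base64.urlsafe_b64decode(token.encode()))
```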
What queue semantics do you guarantee (ordering, at‑least‑once, DLQ)?
Transition: Async pipelines must define delivery and order per key.
Answer
- Guarantee ordering per partition key (e.g., userId) by routing all events to the same partition.
- Accept at‑least‑once and design idempotent consumers.
- Use DLQ with bounded retries and replay tooling.
// Idempotent consumer sketch
if (!seen(event.id)) { apply(event); mark_seen(event.id) }
How do you defend against hot keys and noisy neighbors?
Transition: Even perfect sharding suffers under skew.
Playbook
- Detect via per‑key QPS histograms; apply per‑key rate limits and request coalescing.
- Hot‑key sharding: salt the key into N buckets for cache/DB; merge on read.
- Priority queues and tenant isolation (separate threadpools/quotas).
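Hot‑key salting from the playbook above: spread writes for one hot counter across N sub‑keys so each bucket absorbs roughly 1/N of the load, then merge on read. Python sketch against a plain dict standing in for the cache/DB; `N_BUCKETS` is an assumed tunable.

```python
import random

N_BUCKETS = 8  # assumption: tune per observed hot-key QPS

def salted_write(store: dict, key: str, delta: int):
    """Write to a random sub-key; any one bucket sees ~1/N of the write traffic."""
    salted = f"{key}:{random.randrange(N_BUCKETS)}"
    store[salted] = store.get(salted, 0) + delta

def salted_read(store: dict, key: str) -> int:
    """Merge on read: sum across all buckets."""
    return sum(store.get(f"{key}:{b}", 0) for b in range(N_BUCKETS))
```

The trade: reads fan out to N keys, so salt only keys whose write skew actually hurts.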
When should you use blue/green, rolling, or canary deployments?
Transition: Maintainability requires safe change delivery.
Answer
- Blue/Green: instant switch + easy rollback; needs double capacity.
- Rolling: gradual replacement; watch partial incompatibility and state transitions.
- Canary: small % of traffic + automated SLO gates; pair with feature flags to decouple deploy vs release.
How do you manage data lifecycle: retention, tiering, privacy deletes?
Transition: Reliability includes data safety over time—and cost control.
Answer
- Retention: hot (OLTP) → warm (object store) → cold/archive; TTL indexes for ephemeral data.
- Privacy deletes: hard delete + tombstones for logs/streams; crypto‑erasure for immutable backups (destroy keys).
- Lineage: track derived datasets to propagate deletes correctly.
How do you approach load testing and capacity planning?
Transition: We close the loop by proving the design under load.
Plan
- Define SLOs (P95/P99, error budget). Model arrival (spiky vs steady); include background jobs.
- Use production‑like data; ramp to 1×, 2×, and burst (e.g., 10×) with think‑time.
- Apply Little’s Law to reason about concurrency: L = λ × W.
- Measure saturation signals (CPU, heap, queue depth), shed load, verify graceful degradation.
// Capacity sanity (example)
peak_qps = 50_000; p95_latency = 0.050 // 50ms
concurrency ~= peak_qps * p95_latency // ≈ 2,500 concurrent in-flight ops
How do you present a coherent interview narrative?
Lead with correctness (idempotency, outbox, invariants). Then scale the hot paths with cache, replicas, sharding, and backpressure. Finally, prove it’s operable and safe to evolve: observability, migrations, DR, security. Close by tying back to data modeling (normalize vs denormalize, graph vs SQL) and indexing choices that make the plan real.