In-depth, interview-ready patterns with trade‑offs, failure modes, and metrics. Written for whiteboard discussions and production reality.
How should I use this post?
We begin with correctness under failure, move through scaling patterns, and finish with operability & evolution. Each section opens with a crisp Thesis, then a pragmatic design, trade‑offs, failure modes, and metrics, so you can speak fluently at the whiteboard and reason about production reality.
How do you achieve exactly‑once processing for payments and orders?
Transition: Reliability starts with correctness. Rather than chasing mythical “exactly‑once”, we compose guarantees we can actually implement.
Thesis: True exactly‑once is infeasible in distributed systems; we achieve it with at‑least‑once delivery + idempotency and/or a transactional outbox with deduplication.
Design
- Generate an idempotency_key per request (client or gateway).
- Begin a DB transaction: upsert a PENDING payment row with a unique constraint on the idempotency key.
- In the same transaction, append to an outbox table (topic + payload + event_id).
- A background consumer (or CDC) reads outbox events and calls the external gateway.
- Consumer is idempotent: maintains a processed set (table) keyed by event_id.
- Reconciliation job queries the gateway by request_id to resolve ambiguous states.
-- Producer (RDBMS)
BEGIN;
INSERT INTO payments(request_id, user_id, amount, status)
VALUES($rid,$uid,$amt,'PENDING')
ON CONFLICT (request_id) DO NOTHING;
INSERT INTO outbox(event_id, topic, payload)
VALUES(gen_event_id(), 'ChargeRequested', jsonb_build_object('request_id',$rid,'user_id',$uid,'amount',$amt));
COMMIT;
// Consumer (pseudo)
for event in stream('ChargeRequested'):
    if processed.exists(event.id): continue
    res = gateway.charge(event.payload.request_id, event.payload.user_id, event.payload.amount)
    upsert(payments, request_id=event.payload.request_id, status=res.status, txn_id=res.txn_id)
    processed.add(event.id)
Failure Modes & Fixes
- Gateway charged, DB write failed → reconciliation queries gateway by request_id, upserts final status.
- Consumer crash mid‑charge → safe to retry; idempotency key prevents double charge.
- Duplicate client requests → unique constraint on request_id collapses duplicates.
Metrics
- Outbox lag, duplicate rate, retry counts, reconciliation backlog, success rate, P95/P99 end‑to‑end latency.
How do you choose a shard key, and when should you change it?
Transition: With correctness guarded, we scale state by partitioning it intelligently.
Thesis: Choose a shard key that aligns with dominant access patterns and spreads load; change it when hotspots or scatter‑gather dominate.
Design & Migration
- Start: hash(user_id) for uniform load in user‑centric apps; co‑locate related entities by the same key.
- Scans/time‑series: range by date with “today” bucketed/salted to avoid hotspots.
- Change key when skew is high or cross‑shard queries dominate.
- Migration: introduce a router supporting v1/v2 → dual‑write → idempotent backfill → flip reads gradually → retire v1.
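The dual‑write migration above can be sketched as a router that writes under both key schemes while reads stay on v1 until the flip. A minimal Python sketch: the shard counts, shard names, and `Router` shape are illustrative assumptions, not a fixed API.

```python
import hashlib

def shard_v1(user_id: str, n: int = 4) -> str:
    """Old scheme: hash(user_id) over 4 shards (assumed)."""
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return f"v1-{h % n}"

def shard_v2(user_id: str, n: int = 16) -> str:
    """New scheme: same hash, more shards (assumed)."""
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return f"v2-{h % n}"

class Router:
    """Dual-writes to v1 and v2; reads flip per flag (in practice, gradually by percentage)."""
    def __init__(self):
        self.read_from_v2 = False
        self.stores = {}  # shard name -> {key: value}; stands in for real shards

    def write(self, user_id, value):
        # Dual-write: v1 stays authoritative until the read flip completes.
        for shard in (shard_v1(user_id), shard_v2(user_id)):
            self.stores.setdefault(shard, {})[user_id] = value

    def read(self, user_id):
        shard = shard_v2(user_id) if self.read_from_v2 else shard_v1(user_id)
        return self.stores.get(shard, {}).get(user_id)
```

The backfill step would replay existing v1 rows through `write` idempotently before flipping `read_from_v2`.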
Metrics
- Shard skew (top‑1% QPS vs median), cross‑shard query ratio, rebalancing time, hottest‑key QPS.
How would you scale reads from 10k RPS to 1M RPS?
Transition: Partitioning is one axis; caching and read replicas multiply capacity while preserving correctness signals.
Thesis: Layer edge caching, request coalescing, lag‑aware replicas, and materialized views with read‑your‑writes via session stickiness or versioning.
Playbook
- CDN/edge: cache JSON/HTML with ETag or signed URLs; vary by auth if needed.
- App cache: soft TTL + background refresh; single‑flight to stop stampedes.
- DB replicas: lag‑aware routing; post‑write flows stick to leader for N seconds.
- Precompute hot aggregates via materialized views; maintain by triggers/CDC/jobs.
- Salt extremely hot keys; consistent hashing for cache cluster.
// Stampede guard (sketch)
value = cache.get(key)
if (!value) {
  if (lock.try_acquire(key, 2000 /*ms*/)) {
    try {
      value = recompute()
      cache.set(key, value, /*soft_ttl*/ 60)
    } finally {
      lock.release(key)   // always release, or the key stays locked until TTL
    }
  } else {
    value = cache.get_stale_or_placeholder(key)
  }
}
Metrics
- CDN hit rate, coalesced request ratio, cache fill latency, replica lag distribution, P95/P99.
How do you prevent cascading failures with backpressure and graceful degradation?
Transition: At high load, stability is about bounding everything and degrading gracefully.
Thesis: Combine timeouts, bounded queues, circuit breakers, rate limits, and load shedding with fallbacks.
Design
- Jittered exponential backoff; budget retries to stay within client SLA.
- Open breaker on sustained errors/latency; fast‑fail with cached/approx results.
- Bounded worker pools and queues; prioritize critical traffic.
- Adaptive concurrency per dependency; queue depth‑based backpressure.
// Retry with backoff + jitter (pseudo)
for (i = 0; i < max; i++) {
  try { return call() }
  catch (e) {
    if (i == max - 1) throw e  // budget exhausted: surface the error
    sleep(base * (2 ** i) + rand(0, jitter))
  }
}
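The breaker mentioned above can be sketched in a few lines: open after sustained errors, fast‑fail with a fallback, probe again after a cooldown. Python sketch with `threshold` and `cooldown_s` as assumed tunables; production breakers also track latency and budget half‑open probes.

```python
import time

class CircuitBreaker:
    """Fast-fails after `threshold` consecutive errors; probes again after `cooldown_s`."""
    def __init__(self, threshold=5, cooldown_s=30.0, clock=time.monotonic):
        self.threshold, self.cooldown_s, self.clock = threshold, cooldown_s, clock
        self.failures, self.opened_at = 0, None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                return fallback()      # open: shed load, serve cached/approx result
            self.opened_at = None      # half-open: let one probe through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()   # trip the breaker
            return fallback()
        self.failures = 0              # success closes the breaker
        return result
```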
Metrics
- Retry rate, breaker open time, queue depth, tail latency, dropped/served‑stale counts.
How do you design a multi‑region active‑active system and pick consistency?
Transition: Geography adds partitions and latency; choose consistency per operation.
Thesis: For read latency and availability, use multi‑leader/leaderless with conflict resolution. For strict invariants, enforce a single writer per entity or a globally serializable region.
Patterns
- Per‑entity owner region for payments/unique constraints; reads local, writes coordinated.
- CRDTs or app‑merge for collaborative docs, counts, likes.
- Failover drills; write admission control during recovery.
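As one concrete CRDT for the counts/likes case above, a grow‑only counter keeps one slot per region and merges by element‑wise max, so replicas converge regardless of delivery order or duplication. Minimal Python sketch; region names are illustrative.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per region; merge = element-wise max."""
    def __init__(self, region: str):
        self.region = region
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1):
        self.counts[self.region] = self.counts.get(self.region, 0) + n

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter"):
        # Commutative, associative, idempotent: safe under any gossip order.
        for region, n in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), n)
```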
Metrics
- RPO/RTO, cross‑region RTT budget, conflict rate, reconciliation backlog.
How do you run zero‑downtime schema migrations?
Transition: Systems live and change; evolve without breaking readers or writes.
Thesis: Apply expand → migrate → contract with dual‑read/write and idempotent backfills.
Steps
- Expand: add new columns/tables/indexes (nullable).
- Dual‑write: populate new path on all new writes.
- Backfill in batches (WHERE new_col IS NULL), throttled to keep replica lag healthy.
- Dual‑read until confident; then flip reads.
- Contract: remove old code/columns when safe.
ALTER TABLE orders ADD COLUMN total_cents BIGINT;
-- Batched backfill (sketch):
UPDATE orders o
SET total_cents = s.sum_cents
FROM (
SELECT order_id, SUM(qty*price_cents) AS sum_cents
FROM order_items
WHERE order_id BETWEEN $lo AND $hi
GROUP BY order_id
) s
WHERE o.id = s.order_id AND o.total_cents IS NULL;
Metrics
- Null‑rate over time, backfill throughput, replica lag, error deltas pre/post flip.
What does good day‑one observability look like?
Transition: You can’t run what you can’t see—observability underpins reliability.
Thesis: Ship traces, metrics, and structured logs with correlation IDs; define SLOs and runbooks.
Baseline
- Tracing via OpenTelemetry; propagate trace_id/span_id and safe baggage.
- Metrics: golden signals (latency, traffic, errors, saturation) per endpoint and tenant.
- Structured JSON logs with request_id, hashed user/tenant, status, latency.
- SLOs tied to UX; alerts on error‑budget burn rate; synthetic checks.
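A structured log line per the baseline above might look like this Python sketch; the field names and the `REDACTED_FIELDS` set are assumptions to adapt to your schema.

```python
import hashlib
import json
import time

REDACTED_FIELDS = {"email", "card_number"}  # assumption: your PII list will differ

def log_event(request_id: str, user_id: str, status: int, latency_ms: float, **fields):
    """One JSON object per line: greppable, parseable, with PII hashed or dropped."""
    record = {
        "ts": time.time(),
        "request_id": request_id,
        # Hash the user id so logs can be correlated without storing raw PII.
        "user_hash": hashlib.sha256(user_id.encode()).hexdigest()[:16],
        "status": status,
        "latency_ms": latency_ms,
    }
    for k, v in fields.items():
        record[k] = "[REDACTED]" if k in REDACTED_FIELDS else v
    line = json.dumps(record)
    print(line)
    return line
```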
Anti‑Patterns
- PII in logs, high‑cardinality label explosions, alert fatigue, dashboards without runbooks.
Should product catalog and orders live in the same database?
Transition: Data modeling choices affect reliability and scale.
Answer: Usually no. Catalog (variable attributes) → document store + search index. Orders (financial correctness) → relational with constraints and materialized totals. Sync via CDC to denormalized read models.
Risks if Combined
- OLTP and search/analytics contention; long scans harming write latency; rigidity for catalog evolution.
When should you join vs denormalize/materialize?
Transition: Performance and integrity trade‑offs meet access patterns.
Heuristic: Normalize when writes are hot and constraints matter; denormalize/materialize when reads dominate and aggregates are stable, accepting bounded staleness.
Safe Denormalization
- Materialized views (DB‑native or job‑maintained).
- Derived columns (e.g., order_total_cents) maintained with triggers/CDC jobs.
- Version derivations to avoid circular dependencies.
Pitfalls
- Silent drift, multi‑source truth, cross‑service coupling.
When do you pick a graph database over SQL?
Transition: Sometimes, relationships are the query.
Choose Graph if: Variable‑length traversals, path constraints, role hierarchies, FOAF/recommendations—and you can anchor traversals on selective nodes.
Otherwise
- For fixed‑depth joins or analytics, relational with recursive CTEs or precomputed adjacency lists is often faster and simpler operationally.
-- Recursive CTE (FoF ≤ 2)
WITH RECURSIVE fof(id, depth) AS (
SELECT friend_id, 1 FROM friends WHERE user_id = $1
UNION ALL
SELECT f.friend_id, fof.depth + 1
FROM friends f JOIN fof ON f.user_id = fof.id
WHERE fof.depth < 2
)
SELECT DISTINCT id FROM fof;
How do you design the right index for a specific query?
Transition: Scalability depends on making the planner’s life easy.
Example Query
SELECT id, created_at, total_cents
FROM orders
WHERE user_id = ?
AND created_at BETWEEN ? AND ?
ORDER BY created_at DESC
LIMIT 50;
Rule
Equality → Range → Order (and make it covering).
CREATE INDEX idx_orders_user_created
ON orders(user_id, created_at DESC)
INCLUDE (total_cents);
Watchouts
- Low‑cardinality leading columns, too many indexes (write amplification), stale statistics.
How do you implement fair multi‑tenant rate limiting?
Transition: Protect the platform and ensure fairness.
Thesis: Use hierarchical token buckets—global → tenant → endpoint—with bursts and sustained rates, plus Retry-After hints.
Design
- Global ceiling to protect shared resources.
- Per‑tenant bucket sized by plan; per‑endpoint bucket for expensive operations.
- Distributed enforcement with sharded counters or sliding‑window approximations.
// Token bucket (multi-tenant sketch)
allow(tenant, path, cost=1):
    now = monotonic()
    refill_all(now)
    if bucket[tenant, path].tokens >= cost:
        bucket[tenant, path].tokens -= cost
        return ALLOW
    else:
        return DENY with Retry-After
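The sketch above leaves `refill_all` abstract; a runnable version of the hierarchy (global AND tenant buckets must both have tokens) could look like this in Python. Rates, burst sizes, and class names are illustrative assumptions; note this sketch still consumes a tenant token when the global check then denies, which production code would refund.

```python
import time

class Bucket:
    """Classic token bucket: refill at `rate` tokens/sec up to `burst`."""
    def __init__(self, rate: float, burst: float, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        self.tokens, self.last = burst, clock()

    def take(self, cost: float = 1.0) -> bool:
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

class HierarchicalLimiter:
    """A request passes only if the tenant bucket AND the global bucket allow it."""
    def __init__(self, global_bucket: Bucket, tenant_buckets: dict[str, Bucket]):
        self.global_bucket, self.tenants = global_bucket, tenant_buckets

    def allow(self, tenant: str, cost: float = 1.0) -> bool:
        b = self.tenants.get(tenant)
        return b is not None and b.take(cost) and self.global_bucket.take(cost)
```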
How would you design product search with filters and ranking?
Transition: Combining data modeling, indexing, and latency budgets.
Architecture
- Source of truth (relational/doc) → indexer (CDC/batch) → inverted index (denormalized, tokenized fields).
- Facets via precomputed counts/doc values; BM25 baseline ranking + freshness/availability boosts; optional learning‑to‑rank.
- Stale‑tolerant UX while index catches up; backpressure for large reindex jobs.
Metrics
- Index freshness lag, P95 search latency, facet accuracy, zero‑result rate.
What’s your disaster recovery plan if the primary region is down?
Transition: Reliability at org scale demands practiced recovery.
Thesis: Define RPO/RTO, replicate data cross‑region, practice failovers, and control write admission during recovery.
Plan
- Data: async cross‑region replicas (binlog/WAL shipping), multi‑AZ.
- Compute: infra‑as‑code templates; warm standby capacity.
- Failover: DNS/control‑plane switchover; shard ownership reassignment; block conflicting writes until caught up.
- Replay: durable logs for queues/outboxes; postmortems and improvements.
Metrics
- RPO achieved (bytes/seconds), RTO, recovery test frequency, config drift rate.
How do you protect user data while staying observable?
Transition: Maintainability includes responsible operations.
Thesis: Enforce least privilege and encryption; keep logs useful but sanitized; rotate secrets; test restores.
Controls
- Per‑service identities; scoped IAM; short‑lived tokens (OIDC).
- TLS in transit; envelope encryption at rest; key rotation.
- Structured logs with redacted/tokenized PII; preserve trace_id/request_id.
- Retention policies; GDPR/CCPA delete/export; encrypted, access‑logged backups with restore drills.
Pitfalls
- PII in traces, shared root creds, over‑permissive roles, unencrypted backups.
What user‑visible consistency guarantees will you provide?
Transition: Beyond database semantics, we promise session guarantees that users feel.
Answer
- Read‑your‑writes, monotonic reads, monotonic writes, and consistent prefix.
- Implement via sticky sessions to leader/region post‑write or session tokens carrying last‑seen version/timestamp; route reads accordingly.
- Use ETag/If-None-Match; include version in cache keys.
Metrics
- Leader‑read ratio post‑write, replica‑lag distribution, stale‑read rate.
Which isolation level prevents which anomalies, and when is SERIALIZABLE required?
Transition: Correctness also means preventing cross‑request anomalies.
Quick map
- Read Committed: avoids dirty reads; allows non‑repeatable reads & phantoms.
- Repeatable Read / Snapshot: prevents non‑repeatable reads; can allow write skew.
- Serializable: prevents all anomalies (2PL or SSI).
Example — Write Skew
-- Doctor-on-call: each txn checks count >= 2 on duty, then sets itself off duty.
-- Under snapshot isolation both read the same snapshot and both commit → zero on duty.
-- The writes touch different rows, so no write-write conflict is ever detected.
-- Fix: SERIALIZABLE (SSI aborts one txn), or SELECT ... FOR UPDATE to lock the checked rows.
When to use SERIALIZABLE
- Financial transfers, inventory guarantees, scheduling constraints, cross-row invariants.
Metrics
- Serialization failure (retry) rate, lock waits, long-running transactions.
When do you use Two‑Phase Commit vs a Saga?
Transition: Cross‑service changes require coordination or compensation.
2PC (distributed ACID)
- Coordinator prepares then commits; strong consistency but blocking risk on failures.
- Use in tight trust boundaries with low latency and strict SLAs.
Saga
- Sequence of local transactions with compensating actions.
- Orchestrated (central controller) or choreographed (events/outbox).
// Orchestrated saga (pseudo)
reserveInventory(); chargePayment(); confirmOrder();
compensate on failure: refundPayment(); releaseInventory();
Rule of thumb
- Prefer Saga when availability trumps global atomicity; design idempotent steps and compensations.
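An orchestrated saga reduces to: run the steps in order, and on failure run the completed steps' compensations in reverse. A minimal Python sketch under the stated assumption that compensations are idempotent and cannot themselves fail.

```python
def run_saga(steps):
    """Each step is an (action, compensation) pair of callables.
    On any failure, compensate the already-completed steps in reverse order."""
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()  # must be idempotent; the failed step itself is not compensated
        return False
    return True
```

`run_saga([(reserveInventory, releaseInventory), (chargePayment, refundPayment), (confirmOrder, cancelOrder)])` mirrors the pseudo above.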
How do you handle leader election and prevent split‑brain?
Transition: High availability hinges on safe failover.
Answer
- Use consensus (Raft/Paxos) to elect one leader.
- Protect shared resources with fencing tokens (monotonic lease terms) so old leaders can’t write after losing leadership.
- Readiness vs liveness health checks; conservative timers to avoid flapping.
// Fencing example with DB
UPDATE shard SET leader_id=$me, term=term+1
WHERE id=$id AND term=$observed_term;
How do you generate IDs at scale (UUID v7/ULID/Snowflake)?
Transition: IDs affect ordering, locality, and hot shards.
Options
- UUID v4: random, no order; index‑unfriendly.
- UUID v7/ULID: time‑ordered; better locality and pagination.
- Snowflake: 64‑bit (time + region + worker + seq); monotonic per worker, good sharding properties.
// Snowflake bits: 41 time | 5 region | 5 worker | 12 sequence
Guidance
- Prefer v7/ULID or Snowflake for ordered inserts and stable keyset pagination; ensure clock safety & sequence rollover handling.
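A Snowflake‑style generator with the 41|5|5|12 bit layout shown above, including the clock‑safety and sequence‑rollover handling the guidance calls for. Python sketch; the class name and error handling are assumptions.

```python
import threading
import time

class SnowflakeGen:
    """64-bit id: 41 bits ms-timestamp | 5 region | 5 worker | 12 sequence."""
    def __init__(self, region: int, worker: int,
                 clock=lambda: int(time.time() * 1000)):
        assert 0 <= region < 32 and 0 <= worker < 32
        self.region, self.worker, self.clock = region, worker, clock
        self.lock, self.last_ms, self.seq = threading.Lock(), -1, 0

    def next_id(self) -> int:
        with self.lock:
            now = self.clock()
            if now < self.last_ms:
                raise RuntimeError("clock moved backwards")  # clock safety
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF
                if self.seq == 0:  # sequence rollover: wait for the next millisecond
                    while now <= self.last_ms:
                        now = self.clock()
            else:
                self.seq = 0
            self.last_ms = now
            return (now << 22) | (self.region << 17) | (self.worker << 12) | self.seq
```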
How do you reason about time, clocks, and ordering?
Transition: Correct ordering without trusting wall‑clock.
Answer
- Wall clocks skew (NTP); don’t assume strict order.
- Use Lamport timestamps for causality; vector clocks to detect concurrency and resolve with app rules.
- Persist server‑assigned times; present user times with correction windows.
// Lamport: each event increments counter; on receive: counter = max(local, remote) + 1
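The Lamport rule in the comment above, as a small Python class (names are illustrative):

```python
class LamportClock:
    """Logical clock: local events tick; receives jump past the sender's time."""
    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """Local event or send: increment and stamp."""
        self.time += 1
        return self.time

    def recv(self, remote_time: int) -> int:
        """On receive: counter = max(local, remote) + 1, preserving causality."""
        self.time = max(self.time, remote_time) + 1
        return self.time
```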
How do you keep caches correct—invalidate or update?
Transition: Scale is easy; cache correctness is hard.
Patterns
- Cache‑aside: app populates; invalidate on write; race risk without versioning.
- Read‑through/Write‑through: cache mediates; simpler, higher write latency.
- Write‑behind: async to DB; needs durable queue and replay.
Keep it correct
- Use versioned keys (e.g., obj:{id}:{version}) so stale values don’t overwrite fresh.
- Soft TTL + background refresh; request coalescing under a lock.
// Safe invalidate (versioned)
db.update(...); cache.set("obj:42:v3", data); cache.del("obj:42:v2")
Which pagination strategy should you use: OFFSET/LIMIT, keyset, or tokens?
Transition: Pagination impacts performance and user consistency.
Guidance
- OFFSET/LIMIT: simple but slow/unstable on big offsets.
- Keyset (seek): stable & fast—use the last seen sort key.
- Continuation tokens: encode cursor + filters for complex sorts or sharded scans.
-- Keyset example
SELECT * FROM orders
WHERE user_id = $1 AND created_at < $cursor
ORDER BY created_at DESC
LIMIT 50;
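A continuation token, as mentioned above, can encode the cursor plus active filters as opaque base64 JSON. Python sketch; in production you would sign or encrypt the token so clients cannot forge or tamper with it.

```python
import base64
import json

def encode_cursor(created_at: str, order_id: int, filters: dict) -> str:
    """Opaque token: base64(JSON). The order_id tiebreaker keeps pages stable
    when many rows share one created_at."""
    payload = {"created_at": created_at, "order_id": order_id, "filters": filters}
    return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()

def decode_cursor(token: str) -> dict:
    """Inverse: decode the client-supplied token back into seek parameters."""
    return json.loads(base64.urlsafe_b64decode(token.encode()))
```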
What queue semantics do you guarantee (ordering, at‑least‑once, DLQ)?
Transition: Async pipelines must define delivery and order per key.
Answer
- Guarantee ordering per partition key (e.g., userId) by routing all events to the same partition.
- Accept at‑least‑once and design idempotent consumers.
- Use DLQ with bounded retries and replay tooling.
// Idempotent consumer sketch
if (!seen(event.id)) { apply(event); mark_seen(event.id) }
How do you defend against hot keys and noisy neighbors?
Transition: Even perfect sharding suffers under skew.
Playbook
- Detect via per‑key QPS histograms; apply per‑key rate limits and request coalescing.
- Hot‑key sharding: salt the key into N buckets for cache/DB; merge on read.
- Priority queues and tenant isolation (separate threadpools/quotas).
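Hot‑key salting from the playbook above: spread writes for one hot counter across N sub‑keys so each bucket absorbs roughly 1/N of the load, then merge on read. Python sketch against a plain dict standing in for the cache/DB; `N_BUCKETS` is an assumed tunable.

```python
import random

N_BUCKETS = 8  # assumption: tune per observed hot-key QPS

def salted_write(store: dict, key: str, delta: int):
    """Write to a random sub-key; any one bucket sees ~1/N of the write traffic."""
    salted = f"{key}:{random.randrange(N_BUCKETS)}"
    store[salted] = store.get(salted, 0) + delta

def salted_read(store: dict, key: str) -> int:
    """Merge on read: sum across all buckets."""
    return sum(store.get(f"{key}:{b}", 0) for b in range(N_BUCKETS))
```

The trade: reads fan out to N keys, so salt only keys whose write skew actually hurts.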
When should you use blue/green, rolling, or canary deployments?
Transition: Maintainability requires safe change delivery.
Answer
- Blue/Green: instant switch + easy rollback; needs double capacity.
- Rolling: gradual replacement; watch partial incompatibility and state transitions.
- Canary: small % of traffic + automated SLO gates; pair with feature flags to decouple deploy vs release.
How do you manage data lifecycle: retention, tiering, privacy deletes?
Transition: Reliability includes data safety over time—and cost control.
Answer
- Retention: hot (OLTP) → warm (object store) → cold/archive; TTL indexes for ephemeral data.
- Privacy deletes: hard delete + tombstones for logs/streams; crypto‑erasure for immutable backups (destroy keys).
- Lineage: track derived datasets to propagate deletes correctly.
How do you approach load testing and capacity planning?
Transition: We close the loop by proving the design under load.
Plan
- Define SLOs (P95/P99, error budget). Model arrival (spiky vs steady); include background jobs.
- Use production‑like data; ramp to 1×, 2×, and burst (e.g., 10×) with think‑time.
- Apply Little’s Law to reason about concurrency: L = λ × W.
- Measure saturation signals (CPU, heap, queue depth), shed load, verify graceful degradation.
// Capacity sanity (example)
peak_qps = 50_000; p95_latency = 0.050 // 50ms
concurrency ~= peak_qps * p95_latency // ≈ 2,500 concurrent in-flight ops
How do you present a coherent interview narrative?
Lead with correctness (idempotency, outbox, invariants). Then scale the hot paths with cache, replicas, sharding, and backpressure. Finally, prove it’s operable and safe to evolve: observability, migrations, DR, security. Close by tying back to data modeling (normalize vs denormalize, graph vs SQL) and indexing choices that make the plan real.