Reliable, Scalable, and Maintainable Systems — A Deep Dive


Inspired by Designing Data‑Intensive Applications, Chapter 1 — distilled for interviews and practical architecture.

1) The Three Pillars

Reliability

System continues to work correctly (as per spec) even in the face of faults. Users see correct, timely, and durable behavior.

  • Fault tolerance: handle machine/network failures, bugs, human error.
  • Correctness: idempotency, exactly‑once semantics (often via at‑least‑once + dedupe).
  • Durability: write‑ahead logs (WAL), replication, backups + restore drills.

Scalability

System can handle increased load by adding resources (or smartly rearchitecting) while meeting performance targets.

  • Define workload & growth: QPS, data size, concurrency, P95/P99 latency.
  • Scale up vs scale out; caching; sharding; async pipelines.

Maintainability

System remains easy to operate and evolve without undue risk or toil.

  • Operability: runbooks, automation, safe deploys.
  • Simplicity: reduce accidental complexity; clear boundaries.
  • Evolvability: schema & API versioning, migration playbooks.

2) Reliability — Deep Dive

Common Failure Modes

  • Hardware: disk, NIC, power, CPU thermal throttling.
  • Network: partitions, packet loss, jitter; retries creating thundering herds.
  • Software: memory leaks, deadlocks, data races, mishandled timeouts.
  • Human error: bad deploys, misconfig, destructive admin commands.

Reliability Techniques

  • Redundancy: N+1 instances, multi‑AZ/region, quorum replication.
  • Isolation: bulkheads, circuit breakers, per‑tenant throttles.
  • Timeouts & Retries: with jittered exponential backoff; enforce idempotency.
  • Durability: WAL + fsync policy, multi‑replica commit, periodic verified backups.
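
The retry bullet above can be sketched as code. This is a minimal illustration of "full jitter" backoff; the function name, attempt budget, and delay parameters are my own choices, not from the book:

```javascript
// Sketch: retry with full-jitter exponential backoff.
// Each retry waits a random delay in [0, min(cap, base * 2^attempt)),
// which spreads out retries and avoids synchronized thundering herds.
async function retryWithBackoff(fn, { maxAttempts = 5, baseMs = 100, capMs = 5000 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err; // budget exhausted: surface the error
      const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
      const delayMs = Math.random() * ceiling; // "full jitter"
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

Note the pairing with idempotency: retrying is only safe if the retried operation can be applied more than once without harm.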

Idempotency & Exactly‑Once

// Pseudocode: idempotent charge keyed by requestId
function charge(requestId, userId, amount) {
  if (Payments.exists(requestId)) return Payments.get(requestId);  // replay → return prior result
  let result = gateway.charge(userId, amount);
  // Note: a unique constraint on requestId is needed to close the race
  // where two concurrent requests both pass the exists() check.
  Payments.insert({requestId, userId, amount, status: result.status});
  return result;
}
Interview tip: “Exactly‑once” delivery is usually achieved by at‑least‑once + deduplication or transactional outbox/CDC, not by magical transports.
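
The dedupe half of "at-least-once + deduplication" can be sketched like this; the in-memory Set is a stand-in for a durable dedupe table, and the names are illustrative:

```javascript
// Sketch: an at-least-once consumer made effectively exactly-once by
// deduplicating on a message id. In a real system, the side effect and the
// dedupe record must commit in the same transaction (or via an outbox).
const processed = new Set(); // stand-in for a durable dedupe table

function handle(message, applySideEffect) {
  if (processed.has(message.id)) return "duplicate-skipped"; // redelivery: no-op
  applySideEffect(message);
  processed.add(message.id);
  return "applied";
}
```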

3) Scalability — Deep Dive

Define the Workload

  • Traffic: reads vs writes, peak QPS, burstiness, fan‑out.
  • Data: size on disk, hot set size, growth per day.
  • Performance goals: P50/P95/P99 latency, error budgets, throughput.

Scale Patterns

  • Scale up: bigger boxes for stateful cores; simpler but bounded.
  • Scale out: stateless fleet behind load balancers.
  • Caching: CDN, request cache, read‑through/write‑through, cache stampede protection.
  • Sharding: by hash, range, or directory (consistent hashing, sticky routing).
  • Asynchrony: job queues, stream processors; smoothing bursts and isolating hot paths.
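
Cache stampede protection from the list above often means request coalescing: concurrent misses for the same key share one backend fetch. A minimal sketch, with an assumed `fetchFn` and a Map-backed cache standing in for a real cache layer:

```javascript
// Sketch: request coalescing to prevent a cache stampede. Concurrent misses
// for the same key await a single in-flight fetch instead of dogpiling
// the backend.
const cache = new Map();     // key -> value
const inflight = new Map();  // key -> Promise of the one ongoing fetch

async function getCoalesced(key, fetchFn) {
  if (cache.has(key)) return cache.get(key);
  if (!inflight.has(key)) {
    const p = fetchFn(key)
      .then((value) => { cache.set(key, value); return value; })
      .finally(() => inflight.delete(key)); // clean up on success or failure
    inflight.set(key, p);
  }
  return inflight.get(key); // followers await the leader's fetch
}
```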

Capacity Planning (Back‑of‑the‑Envelope)

// Example: can we handle peak write load?
peak_writes_qps = 20_000
single_node_capacity = 2_500   // measured
replication_factor = 3
safety_factor = 0.6            // headroom for failover
nodes_needed = ceil( peak_writes_qps / (single_node_capacity * safety_factor) )
// => ceil(20000 / 1500) = 14 nodes for primary role (before replicas)
Gotcha: Sharding solves throughput but complicates cross‑shard queries, transactions, and resharding. Add routing layers, avoid scatter‑gather, and colocate hot relationships.

4) Maintainability — Deep Dive

Operability

  • Automate: infra as code, one‑button rollbacks, database migration scripts.
  • Runbooks: clear diagnostics & escalation; chaos drills; disaster recovery run‑throughs.
  • Safe deploys: canary, feature flags, slow rollouts with health gates.

Simplicity

  • Prefer boring tech for critical paths; isolate experiments.
  • Minimize global transactions & cross‑service chatter.
  • Document invariants close to code; enforce in DB where possible.

Evolvability

  • Versioned APIs; backward‑compatible changes first.
  • Data migrations: dual‑write/read, backfills, idempotent jobs.
  • Use feature flags for progressive delivery.

5) Consistency vs Availability (and Latency)

CAP (Intuition)

Under a network partition, a system chooses to be consistent (reject some requests) or available (serve possibly stale/divergent data). Outside partitions, you can often have both.

PACELC (Pragmatic)

If Partition, choose Availability or Consistency; Else, trade Latency vs Consistency. Tune per operation: e.g., reads at R=1 (fast) and writes at W=quorum.

SLA/SLO Hygiene

  • SLA: external commitment (e.g., 99.9% uptime).
  • SLO: internal target (e.g., P99 < 300ms, error rate < 0.1%).
  • Error budget: allowable failure window to balance reliability & release velocity.
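
The error budget implied by an availability SLO is simple arithmetic; a quick sketch (30-day month assumed):

```javascript
// Sketch: monthly downtime budget implied by an availability SLO.
// 99.9% over a 30-day month leaves 43.2 minutes of allowed downtime.
function errorBudgetMinutes(sloAvailability, daysInMonth = 30) {
  const totalMinutes = daysInMonth * 24 * 60; // 43,200 for a 30-day month
  return totalMinutes * (1 - sloAvailability);
}
```

Each extra "nine" cuts the budget by 10x, which is why 99.99% and beyond demand automation rather than human-in-the-loop recovery.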

6) Replication & Partitioning Patterns

Replication

  • Leader–Follower: all writes go through the leader (single write order); async followers serve reads. Requires lag awareness and read‑your‑writes via session stickiness.
  • Multi‑Leader: better for geo‑write or offline clients; conflict resolution needed (CRDTs, last‑write‑wins, app‑specific merge).
  • Leaderless/Quorum: read/write quorums (R + W > N) for availability; tune for latency vs consistency.
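
The R + W > N condition guarantees that every read quorum overlaps every write quorum, so at least one replica in any read set has the latest write. As a tiny check (parameter names are mine):

```javascript
// Sketch: does an (N, R, W) configuration guarantee quorum overlap?
// If R + W > N, any set of R read replicas must intersect any set of
// W write replicas, so reads see the most recent committed write.
function quorumOverlaps(n, r, w) {
  return r + w > n;
}
// Typical tunings for N = 3:
//   R=2, W=2 -> overlapping quorums (consistent reads)
//   R=1, W=1 -> fast, but a read may miss the latest write
```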

Partitioning (Sharding)

  • Hash: uniform load, bad for range scans; add secondary indexes or scatter‑gather with limits.
  • Range: great for time‑series & scans; watch hotspots at “now”.
  • Directory/Lookup: flexible but adds an extra hop; keep cached and highly available.

Rebalancing

  • Consistent hashing reduces churn when adding/removing nodes.
  • Move shard boundaries gradually; keep dual‑writers or request routers during migration.
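
The low-churn property of consistent hashing can be shown with a toy ring; the hash function and node names here are illustrative (a real ring would also use virtual nodes for balance):

```javascript
// Sketch: a minimal consistent-hash ring. Each key is owned by the first
// node clockwise from its hash, so adding a node only moves the keys on
// one arc rather than reshuffling the whole keyspace.
function toyHash(s) {
  let h = 2166136261; // FNV-1a, good enough for a sketch
  for (const ch of s) {
    h = Math.imul(h ^ ch.charCodeAt(0), 16777619) >>> 0;
  }
  return h >>> 0;
}

function ownerOf(key, nodes) {
  const ring = nodes.map((n) => [toyHash(n), n]).sort((a, b) => a[0] - b[0]);
  const k = toyHash(key);
  for (const [pos, node] of ring) {
    if (k <= pos) return node; // first node clockwise from the key
  }
  return ring[0][1]; // wrap around the ring
}
```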

7) Load Shedding, Backpressure & Queueing

  • Rate limiting: token bucket/leaky bucket per user/IP/tenant.
  • Backpressure: propagate queue fullness to callers; fail fast with useful errors.
  • Load shedding: drop non‑critical traffic; graceful degradation (serve cached/approx data).
  • Queueing: async jobs to absorb bursts; prioritize critical queues.
// Example: token-bucket sketch (one bucket per user/IP/tenant; state shown inline)
let tokens = capacity, last_refill = time()

function allow(request) {
  now = time()
  tokens = min(capacity, tokens + (now - last_refill) * rate)   // refill since last call
  last_refill = now
  if (tokens >= cost(request)) { tokens -= cost(request); return true }
  return false
}

8) Observability & Operability Playbook

Golden Signals

  • Latency (P50/P95/P99)
  • Traffic (QPS, RPS, concurrency)
  • Errors (rate & codes)
  • Saturation (CPU, heap, queue length)
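
The latency percentiles above can be computed from a raw sample with the nearest-rank method; a simple sketch (real systems use streaming sketches like HDRHistogram or t-digest instead of sorting raw samples):

```javascript
// Sketch: nearest-rank percentile over a latency sample, for P50/P95/P99
// style golden-signal reporting.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // nearest-rank definition
  return sorted[Math.max(0, rank - 1)];
}
```

Averages hide tail pain: one slow dependency can leave the mean healthy while P99 blows past the SLO, so alert on percentiles.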

Tracing & Logging

  • Distributed traces with baggage/ids across services.
  • Structured, sampled logs; correlation IDs.
  • Redaction for PII; retention policies.

Runbooks & Automation

  • Health checks: liveness vs readiness; fail closed on dependencies.
  • Autoscaling tied to saturation (not just CPU), with cool‑downs.
  • Incident response: paging thresholds, severity levels, postmortems.

9) Security & Data Safety (Essentials)

  • Principle of least privilege; vault secrets; short‑lived tokens.
  • Encryption in transit (TLS) and at rest; key rotation.
  • Input validation, prepared statements, CSRF/SSRF mitigation.
  • Backups + restore tests; retention + legal holds; GDPR/CCPA basics (delete/export).

10) Case Studies & Reference Architectures

A) Read‑Heavy Product Catalog

  • Primary: document store for product pages (flexible), with denormalized read model.
  • Search: inverted index service (synced via CDC).
  • Caching: CDN + edge keys by category/sku; soft TTL with background refresh.
  • Writes: go through a single writer (leader) to keep invariants for SKU uniqueness.

B) Write‑Heavy Event Ingestion (Analytics)

  • Front: stateless collectors with local disk buffer.
  • Queue: durable log (stream) with partitioning by userId/tenant.
  • Processing: stream jobs for real‑time metrics; batch jobs for backfills.
  • Storage: columnar warehouse partitioned by date; late‑arriving data handling.

C) Social Graph & Feed

  • Graph DB or adjacency lists in KV store; selective traversals.
  • Fan‑out strategies: on write (small creators) vs on read (large creators).
  • Ranking pipeline: features in stream, models in feature store, cache hot feeds.

11) Interview Playbook: Prompts, Patterns & Sound Bites

Prompts & How to Respond

  1. “Design a highly available key‑value store.”
    Leader–follower with quorum reads, hinted handoff, read repair; consistent hashing for shards; tunable R/W; anti‑entropy jobs; backpressure + admission control. Call out failure handling and rebalancing.
  2. “Scale a read‑heavy API from 10k to 1M RPS.”
    CDN + edge caching, cache stampede protection, request coalescing, async refresh; stateless app tier; DB read replicas + query caching; profile P95/P99; precompute hot data.
  3. “Guarantee users never see a double charge.”
    Idempotency keys, transactional outbox, exactly‑once consumer via dedupe table; retries with backoff; compensations for rare inconsistencies.
  4. “Handle bursty writes without dropping data.”
    Durable queue, backpressure to callers, load shedding for noncritical endpoints; autoscaling consumers; dead‑letter queues and replay.
  5. “Multi‑region active‑active?”
    Multi‑leader or leaderless; conflict resolution (CRDT or app merge), data residency, latency budget; per‑operation consistency levels; geo‑routing and failover drills.

Sound Bites

  • “Design for graceful degradation, not just success paths.”
  • “Prefer at‑least‑once + idempotency over brittle exactly‑once transports.”
  • “Sharding is a scaling tool that increases complexity—optimize routing and avoid cross‑shard joins.”
  • “Observability isn’t dashboards; it’s the ability to ask new questions without redeploying.”

12) Red Flags & Anti‑Patterns

  • Global transactions across many services (tight coupling, failure cascades).
  • No timeouts/retries or unbounded retries without jitter.
  • Single‑region everything (no isolation or recovery story).
  • Cache without stampede protection (dogpile on expiry).
  • No backpressure → queue growth → OOM → cascading failures.
  • One big database for reads + writes + analytics (resource contention).

13) Cheat Sheets & Checklists

Reliability Checklist

  • Timeouts, retries (jittered), circuit breakers.
  • Idempotency keys for external side effects.
  • Redundancy: multi‑AZ, replicas, failover tested.
  • Backups verified via restore drills.

Scalability Checklist

  • Define P95/P99, hot path, and top queries.
  • Cache hierarchy (edge, app, DB); stampede protection.
  • Shard keys aligned to access patterns.
  • Async pipelines for heavy work.

Maintainability Checklist

  • Infra as code; safe deploys; feature flags.
  • Tracing, metrics, structured logs, alerts.
  • Schema/API versioning; dual‑read/write playbook.
  • Clear SLOs and error budgets.

14) Summary

  • Reliability: tolerate faults, ensure durability, and embrace idempotency.
  • Scalability: define load, choose the right axes (cache, shard, async), and plan capacity.
  • Maintainability: automate operations, simplify designs, and evolve safely.
  • Balance consistency, availability, and latency per operation; validate with measurements.

Credits: This is an original, interview‑focused synthesis inspired by concepts from Designing Data‑Intensive Applications (Martin Kleppmann), Chapter 1. Errors or interpretations are my own.

Next: Pair this with the Chapter 2 deep dive to choose data models and query patterns that satisfy your Chapter 1 pillars.
