A talk by Hope

Scalable
Software Systems

why we build the way we build
Context
Microservices Kafka Redis
Docker Kubernetes Kong API Gateway
WebSocket Event-driven SPA Cloud / AWS
JWT OAuth2 / SSO Rate Limiter CDN Horizontal Scaling
Each of these was once someone's answer to a real problem.
Know the problem — and you'll know when to reach for it.
2004
"Happiness is hearing something
you weren't supposed to.
And to find happiness, we gossip about others' lives."
Baby Hope was 4. He decided to build a place where people could gossip together.
2004 · foundation
Gossip.com
went live.
One server. Everything in it.
Monolith MVC MySQL VPS · $5/mo
Routing, auth, templates, business logic, sessions — one process. Simple. And it worked.
Browser client HTTP VPS · $5 / mo PHP App Routing · Auth · Sessions · Models Views · Business Logic · Everything MySQL users · posts · sessions · all of it
2007
gossip.com · production · nginx/app · 2007-08-14
──────────────────────────────────────────────────────
[02:44:01] traffic spike detected - concurrent users: 847
[02:45:18] WARNING db connections: 480 / 500
[02:46:02] WARNING CPU: 91% · response p99: 4.2s
[02:47:00] ERROR db connections exhausted - pool full
[02:47:03] ERROR CPU: 100% · workers unresponsive
[02:47:10] FATAL nginx: upstream timed out - 502 Bad Gateway
[02:47:10] FATAL server unreachable
──────────────────────────────────────────────────────
Baby Hope learned what a production incident is.
2007 · scaling
Go bigger
and stronger.
Vertical Scaling
$5 → $40 → $320 → $640/mo. Then nothing. Hardware stops scaling.
buys you time
hard ceiling — not a real solution
Browser SERVER $640 / month $5 → $40 → $320 → $640 upgraded with every spike CEILING — hardware limit MySQL
2007 · horizontal scaling
Go out,
not up.
Load Balancer Horizontal Scaling Imbalanced Least Connections Session Mismatch
Server 1 remembers you. Server 2 doesn't. You're logged out.
Browser client Server 1 app process Server 2 app process MySQL data ? Load Balancer round-robin req 1 → req 3 → req 2 → looks great on the diagram 20 slow req · 2s each drowning → heavy light req · 5ms each bored → light round-robin doesn't know. it just rotates. checks load ↓ active: 20 conn avg resp: 2.1s ✗ active: 3 conn avg resp: 12ms ✓ routes where it'll be handled faster — not blind rotation session: user42 ✓ in S1 memory session: ??? ✗ who are you? → back to login page
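The gap between blind rotation and load-aware routing fits in a few lines. This is a minimal sketch, not how any particular load balancer works internally; the server names and the `active` connection counts are hypothetical and simply mirror the numbers on the slide.

```python
from itertools import cycle


def make_round_robin(servers):
    """Return a picker that rotates through servers, ignoring their load."""
    rotation = cycle(servers)
    return lambda: next(rotation)


def pick_least_connections(active):
    """Pick the server that currently has the fewest active connections."""
    return min(active, key=active.get)


# Hypothetical snapshot: server1 is drowning in slow requests, server2 is bored.
active = {"server1": 20, "server2": 3}

next_round_robin = make_round_robin(list(active))
round_robin_picks = [next_round_robin() for _ in range(4)]  # rotates regardless of load
least_conn_pick = pick_least_connections(active)            # goes where there is room
```

Round-robin keeps sending every second request to the overloaded server; least-connections needs the balancer to track live connection counts, which is the price of routing smarter.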
2007 · horizontal scaling
You log in on
Server 1. Refresh.
Server 2 has never
met you.
Same user,
same server.
Until that server dies.
Sticky Sessions False Fix
IP hash routes users to one server. Server dies. All their sessions die with it.
Browser client Load Balancer IP hash user A user B Server 1 app process Server 2 app process MySQL data problem "solved"... right? Server 1 all users stuck to S1 lose their session not solving the problem — hiding it
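Why sticky sessions only hide the problem can be shown with a small sketch. It assumes a simple hash-modulo scheme; real balancers may hash differently, and the IP address, server names, and `local_sessions` map are illustrative only.

```python
import hashlib


def pick_server(client_ip, servers):
    """Map a client IP to one server; stable while the server list is unchanged."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


# Each server keeps its own in-memory session map (the root of the problem).
local_sessions = {"server1": {}, "server2": {}}
servers = ["server1", "server2"]

ip = "203.0.113.7"  # documentation-range address, purely illustrative
home = pick_server(ip, servers)
local_sessions[home][ip] = {"user": "user42"}

# Same IP, same server list: the user keeps landing on the same server.
still_home = pick_server(ip, servers)

# The home server dies. The user is re-routed to a server that never met them.
survivors = [s for s in servers if s != home]
new_home = pick_server(ip, survivors)
session_after_failure = local_sessions[new_home].get(ip)  # None: back to login
```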
2007 · stateless architecture
Move state out.
Servers forget.
Stateless Shared Sessions Scale freely
Sessions lived inside each server. Swapping servers meant losing you.
any server handles any request
add / remove servers freely
DB = single source of truth
Browser Load Balancer Server 1 app process Server 2 app process MySQL session: user42 (local) session: user99 (local) each server owns its own session map sessions user42 ✓ user99 ✓ sessions moved to shared store stateless ✓ stateless ✓ user A user B any server → any user → works ✓
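A minimal sketch of the same idea in code, assuming a single shared store that every server talks to. `SessionStore` here is an in-memory stand-in (in practice a database table or Redis), and all class and method names are hypothetical.

```python
class SessionStore:
    """Stand-in for a shared store (a database table or Redis in practice)."""

    def __init__(self):
        self._sessions = {}

    def save(self, session_id, data):
        self._sessions[session_id] = data

    def load(self, session_id):
        return self._sessions.get(session_id)


class AppServer:
    """Stateless server: holds a reference to the shared store, nothing else."""

    def __init__(self, name, store):
        self.name = name
        self._store = store

    def login(self, session_id, user):
        self._store.save(session_id, {"user": user})

    def whoami(self, session_id):
        session = self._store.load(session_id)
        return session["user"] if session else None


store = SessionStore()
server1 = AppServer("server1", store)
server2 = AppServer("server2", store)

server1.login("abc123", "user42")            # log in on Server 1
seen_by_server2 = server2.whoami("abc123")   # refresh lands on Server 2
```

Because neither server keeps anything of its own, either one can be removed or replaced without logging anyone out.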
2009
40×
same five queries · per second · never cached
2009 · caching
Compute once.
Remember the answer.
Caching Redis TTL Cache Invalidation
40× fewer DB queries
stale data up to TTL window
invalidate on write for accuracy
App Servers ×10 instances · MySQL 10M rows
40× / second · same query · same result · every time
Client browser
Server 1 · mem cache · feed: 3 min old
Server 2 · mem cache · feed: 47 sec old
Server 3 · no cache yet → hits DB again
3 servers · 3 separate caches · different data · never agree · worse than no cache at all
App Servers ×10 instances · Shared Cache
one place · all servers ask once · keep the result · all servers share it
App Servers ×10 instances · Redis · in-memory · shared · <1ms reads
MISS → query the DB, store the result
HIT → return immediately
TTL: 60s → auto-expire
or: delete key when data changes
"There are only two hard things in computer science: cache invalidation and naming things."
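The MISS / HIT / TTL / delete-on-write flow can be sketched as a cache-aside read. This is an illustration under stated assumptions: `TTLCache` is an in-process stand-in for a shared cache such as Redis, and `get_feed`, `load_from_db`, and the `feed:<id>` key format are hypothetical names.

```python
import time


class TTLCache:
    """In-process stand-in for a shared cache with per-key expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:  # TTL elapsed: treat as a miss
            del self._data[key]
            return None
        return value

    def set(self, key, value):
        self._data[key] = (value, self._clock() + self._ttl)

    def delete(self, key):
        self._data.pop(key, None)


def get_feed(cache, load_from_db, user_id):
    """Cache-aside read: return a HIT immediately, fill the cache on a MISS."""
    key = f"feed:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_from_db(user_id)  # the expensive query runs only on a miss
    cache.set(key, value)
    return value
```

On a write, calling `cache.delete("feed:42")` trades one extra database read for accuracy; relying on the TTL alone accepts data that may be stale for up to the TTL window.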
2009 · query performance
10M rows read.
To find
847.
EXPLAIN B-Tree Index
reads: seconds → milliseconds
every write must update the index
index what you actually query on
MySQL · 10,000,000 rows
SELECT * FROM posts WHERE user_id = 42 · 1,400ms
FULL TABLE SCAN · reads every row one by one · like a phone book with no order
EXPLAIN SELECT * FROM posts WHERE user_id = 42;
type │ possible_keys │ rows examined │ Extra
ALL  │ NULL          │ 10,000,000    │ Using where   ← no index used
matched: 847 rows
→ can we give it a shortcut?
B-Tree Index · user_id → row ptr
id: 1 → 0x00a1
id: 18 → 0x03c2
id: 42 → 0x09f4 ←
id: 71 → 0x12a8
id: 99 → 0x1fe0
sorted · O(log n) lookup
jump · skips 10M rows · jumps straight to id:42 · 3ms
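The before-and-after of adding an index can be reproduced locally. This sketch uses Python's built-in SQLite as a stand-in for MySQL, so the plan text differs from MySQL's EXPLAIN output; the table layout, row counts, and index name `idx_posts_user_id` are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, content TEXT)")
conn.executemany(
    "INSERT INTO posts (user_id, content) VALUES (?, ?)",
    [(i % 100, f"post {i}") for i in range(1000)],
)

QUERY = "SELECT * FROM posts WHERE user_id = 42"


def query_plan(connection, sql):
    """Return SQLite's plan description for a query as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the detail text


plan_before = query_plan(conn, QUERY)   # a full scan of the posts table
conn.execute("CREATE INDEX idx_posts_user_id ON posts (user_id)")
plan_after = query_plan(conn, QUERY)    # a search that uses the new index
matches = conn.execute(QUERY).fetchall()
```

Printing `plan_before` and `plan_after` shows the planner switching from a scan to an index search. The result set is identical either way; only the amount of work changes.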
2009 · global users
Gossip is not local.
It's a
fundamental human need.
You can't refactor
your way out of
geography.
Latency CDN Edge Nodes
400ms → 12ms for static files
origin servers free for logic
static only — dynamic still goes to origin
Origin Server Singapore 🇯🇵 Tokyo "site is so slow wtf" 🇩🇪 Frankfurt "images take forever to load" 🇧🇷 São Paulo "5 seconds for a profile photo?!" Hope checks: servers ✓ · DB ✓ · response times ✓ · logs ✓ everything looks fine. but users are still mad. Tokyo user Frankfurt user São Paulo user 400ms avg latency physics. cables. distance. Tokyo Frankfurt São Paulo CDN · Tokyo edge node CDN · Frankfurt edge node CDN · SP edge node Tokyo Frankfurt São Paulo CDN · Tokyo edge node CDN · Frankfurt edge node CDN · SP edge node 12ms from nearest edge
2011
gossip.com/post
9.2s
to publish one post
2011 · async
User waited 9.2s.
Server was
sending emails.
Synchronous Blocking Side Effects
post is saved in 200ms
user blocked by work they don't need
notifications are side effects — decouple them
Browser POST /post POST /post endpoint Save post 200ms Notify 3,200ms Email 2,400ms Update feeds 3,100ms TOTAL ≈ 9.2s user waiting... for all of this Notify 3,200ms Email 2,400ms Update feeds 3,100ms TOTAL ≈ 9.2s Save post 200ms ✓ user only needs this rest are side effects
2011 · job queue
The post is saved.
The user is
still waiting.
Write it down.
Walk away.
Someone else handles it.
Job Queue Atomicity Background Worker
user sees response in 200ms
post + job in same DB transaction
jobs may run more than once — be idempotent
Browser POST /post Post Service MySQL posts id · user_id · content jobs type · status · payload same DB transaction jobs status: pending 200ms ✓ Worker polls every 1s notify · email · feed processed in background user is long gone
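"Post + job in the same DB transaction" can be sketched as follows. SQLite stands in for MySQL here, the column layout follows the slide, and the function names (`create_post`, `run_pending_jobs`, `idempotent_handler`) are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, content TEXT)")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, type TEXT, status TEXT, payload TEXT)"
)


def create_post(connection, user_id, content):
    """Save the post and enqueue its side effects in ONE transaction."""
    with connection:  # commits on success, rolls back if anything raises
        cur = connection.execute(
            "INSERT INTO posts (user_id, content) VALUES (?, ?)", (user_id, content)
        )
        post_id = cur.lastrowid
        for job_type in ("notify", "email", "feed"):
            connection.execute(
                "INSERT INTO jobs (type, status, payload) VALUES (?, 'pending', ?)",
                (job_type, json.dumps({"post_id": post_id})),
            )
    return post_id


def run_pending_jobs(connection, handler):
    """One polling pass of a background worker. Returns the number of jobs run."""
    pending = connection.execute(
        "SELECT id, type, payload FROM jobs WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    for job_id, job_type, payload in pending:
        handler(job_id, job_type, json.loads(payload))
        # A crash between the handler and this update re-runs the job later,
        # which is why the handler has to be idempotent.
        with connection:
            connection.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
    return len(pending)


handled = set()


def idempotent_handler(job_id, job_type, payload):
    """Toy handler: remembers job ids so a repeated run does nothing."""
    if job_id in handled:
        return
    handled.add(job_id)
```

The request handler returns as soon as `create_post` commits; a worker calling `run_pending_jobs` on a timer picks up the rest.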
2011 · real-time
In love and in feeds —
everyone wants
to be first.
Stop asking.
Let the server
speak first.
Polling WebSocket Pub/Sub
server pushes instantly on event
WS connections are stateful — need coordination
Redis pub/sub: any server can push to any client
Browser Server ● new post ? still waiting… server knows. user doesn't. GET /feed?since=… 304 nothing new every 3s per user wasted, laggy WebSocket persistent · full-duplex new post? push now. no lag · no poll Browser B Server 2 Server 1 can't reach Browser B's connection Redis pub/sub channel publish subscribe
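How a pub/sub channel lets Server 1 reach a connection held by Server 2 can be sketched without a network. `Broker` is an in-memory stand-in for something like Redis pub/sub, a plain list stands in for each WebSocket, and every name here is hypothetical.

```python
from collections import defaultdict


class Broker:
    """In-memory stand-in for a pub/sub channel service."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self._subscribers[channel].append(callback)

    def publish(self, channel, message):
        for callback in self._subscribers[channel]:
            callback(message)


class SocketServer:
    """Holds its own open connections; learns about events through the broker."""

    def __init__(self, name, broker):
        self.name = name
        self._broker = broker
        self.connections = {}  # client id -> list standing in for a socket
        broker.subscribe("new_post", self._push_to_local_clients)

    def connect(self, client_id):
        self.connections[client_id] = []

    def handle_new_post(self, post):
        # Publish instead of pushing directly: this server cannot see
        # connections that live on other servers.
        self._broker.publish("new_post", post)

    def _push_to_local_clients(self, post):
        for inbox in self.connections.values():
            inbox.append(post)


broker = Broker()
server1 = SocketServer("server1", broker)
server2 = SocketServer("server2", broker)
server2.connect("browser_b")  # Browser B's connection lives on Server 2

server1.handle_new_post({"id": 1, "content": "did you hear?"})  # arrives on Server 1
```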
2014
Friday deploy.
20 engineers · 1 codebase · notification team broke payments
2014 · microservices
Payments crashed.
It took Search,
Feeds, Auth —
everything.
Mind your own
business.
One service.
One job.
Microservices Kafka Eventual Consistency
deploy, fail, scale independently
distributed complexity: no shared DB
choose availability over perfect consistency
gossip.com / app Users · Posts · Comments Notifications · Feeds · Auth Analytics · Search ⚠ Payments crashed one deploy → everything redeployed Auth fails · Search fails · Feeds fail MySQL (shared) User Svc Post Svc Notif Svc Pay Svc Feed Svc db db db db db HTTP / REST sync REST: Post calls Notif calls Feed one service slow → whole chain blocks still tightly coupled Message Queue async event bus · post.created topic publish consume
2014 · api gateway
The client memorized
your entire org chart.
And their phone numbers.
One door.
Everything else:
not your problem.
API Gateway Routing Rate Limiting Auth
client only knows one address
auth, rate limit, logging in one place
single point of failure unless it runs redundantly
Mobile Browser User Svc Post Svc Notif Svc Pay Svc Feed Svc client knows every address each svc handles its own auth API Gateway rate limit auth / JWT verify route → service logging Kong / Nginx / custom
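Rate limiting at the gateway is often done with a token bucket. The sketch below is one common approach, not what Kong or Nginx do internally; `TokenBucket`, `gateway_status`, and the 100 req/s figure are illustrative assumptions.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self._rate = rate_per_sec
        self._capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self):
        now = self._clock()
        # Refill for the time that has passed, never beyond capacity.
        self._tokens = min(self._capacity, self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


def gateway_status(bucket):
    """Return the HTTP status a gateway would answer with for one request."""
    return 200 if bucket.allow() else 429  # 429 = Too Many Requests
```

In a real gateway there would typically be one bucket per client or API key, so a single noisy caller cannot use up everyone's budget.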
2014 · identity
Identity that
travels without asking.
JWT Token Stateless Auth
any service verifies locally — no auth call
signed — tamper-evident
can't revoke before expiry without extra work
Auth Service issues tokens Client Post Svc Feed Svc Pay Svc login JWT token every request → auth call auth becomes bottleneck + SPOF JWT structure header payload signature alg: HS256 userId: 42 role: premium exp: 1720000000 HMAC-signed tamper-evident Authorization: Bearer <jwt> travels with every request verify locally ✓ verify locally ✓ verify locally ✓
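"Verify locally" means recomputing the signature with a key the service already holds. The sketch below hand-rolls HS256 with the standard library purely to show the mechanics; production code should use a vetted JWT library, and `sign_token`, `verify_token`, and the shared secret are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data):
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _b64url_decode(text):
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _signature(signing_input, secret):
    return _b64url(hmac.new(secret, signing_input.encode(), hashlib.sha256).digest())


def sign_token(payload, secret):
    """Auth service side: build header.payload.signature."""
    header = {"alg": "HS256", "typ": "JWT"}
    parts = [
        _b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, payload)
    ]
    signing_input = ".".join(parts)
    return signing_input + "." + _signature(signing_input, secret)


def verify_token(token, secret, now=None):
    """Any service: return the payload if the token is valid, else None."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
    except ValueError:
        return None
    expected = _signature(f"{header_b64}.{payload_b64}", secret)
    if not hmac.compare_digest(expected.encode(), sig_b64.encode()):
        return None  # tampered, or signed with a different key
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        return None
    payload = json.loads(_b64url_decode(payload_b64))
    now = time.time() if now is None else now
    if payload.get("exp", 0) <= now:
        return None  # expired
    return payload
```

No call to the auth service happens during verification, which is also why a token cannot be revoked before `exp` without extra machinery. With HS256 every verifying service must hold the shared secret; asymmetric algorithms avoid that, at the cost of key distribution.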
2014 · identity federation
Don't own
identity.
Delegate it.
OAuth2 OIDC SSO
never store or touch user credentials
one integration pattern — any provider
enterprise SSO with existing company accounts
User Gossip.com your system login must build in-house: password hashing reset flows brute force protection 2FA / lockout …and users still want "Login with Google" none of this is your product Identity Provider Google · GitHub · Okta · Azure AD OIDC — same token shape, every provider prove identity token issued "I vouch for this person" — Gossip.com never sees the password SSO — log in once with company account, access everything
2014 · containers
Works on
my machine.
→ Ship the machine.
Docker Container Reproducible Builds
identical env: local → staging → prod
no "works on my machine"
Dockerfile = reproducible single source of truth
source code post-service v2.4.1 Laptop Node 18.x ✓ libssl 3.0 ✓ "works fine" Staging Node 16.x ✗ libssl 1.1 ✗ crashes Production Node 14.x ⚠ libssl 3.0 ✓ maybe ok? "but it works on my machine" Docker Container code · Node 18 · libssl · dependencies Laptop runs ✓ Staging runs ✓ Prod runs ✓ same container · same env · everywhere

Why don't we
ship the machine?

2017
support ticket · 2017-03-14 · 09:41am "I was charged, but I don't have premium." and then another one. and five more.
2017 · distributed transactions
That support ticket
is the receipt.
Four services.
Four databases.
No shared undo button.
Partial Failure No Atomicity
payment committed · subscription never ran
no automatic rollback across services
this is the fundamental problem of distributed txns
Order creates order Payment charges card Subscription grants access Email confirms orders db payments db subs db email db ✓ committed ✓ committed ✓ committed ✓ committed crashed User charged. No access. No automatic rollback. four separate databases · nothing coordinating them
2017 · saga pattern
We can't roll back.
Can we at least
apologize?
Compensate.
Don't pretend
it didn't happen.
Saga Compensation Local Transactions
no shared transaction needed
each step: forward action + compensating action
more code — every path must be planned
Order T1: create Payment T2: charge Subscr. T3: grant Email T4: confirm ✗ fails C1: cancel order ✗ crashes C2: refund forward: create compensation: cancel forward: charge compensation: refund forward: grant compensation: revoke forward: send (best effort) not one transaction — a choreographed sequence
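The forward/compensation pairing can be sketched with a small runner. Note one swap: for brevity this uses a central coordinator (orchestration), whereas the slide describes a choreographed sequence driven by events. `run_saga` and the step names are hypothetical.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps in order.

    If a step fails, run the compensations of the steps that already
    completed, newest first, and report failure.
    """
    completed = []
    for name, action, compensation in steps:
        try:
            action()
        except Exception:
            for _, undo in reversed(completed):
                if undo is not None:
                    undo()
            return False
        completed.append((name, compensation))
    return True


log = []


def grant_access():
    raise RuntimeError("subscription service is down")


steps = [
    ("order", lambda: log.append("create order"), lambda: log.append("cancel order")),
    ("payment", lambda: log.append("charge card"), lambda: log.append("refund")),
    ("subscription", grant_access, lambda: log.append("revoke access")),
    ("email", lambda: log.append("send email"), None),  # best effort, nothing to undo
]

succeeded = run_saga(steps)
```

The sketch assumes compensations always succeed. In practice they can crash too, so they need retries and should be idempotent.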
2017 · outbox pattern
The payment went in.
The event
was in the air.
If it's not
in the database,
it didn't happen.
Outbox Pattern At-least-once Idempotency
event always published — even if Kafka was down
at-least-once: event may arrive twice
idempotency: duplicate = safe to ignore
Order Service DB payments order_id · status Kafka payment.completed topic publish event Kafka temporarily unreachable ✗ DB write ✓ event lost ✗ outbox event · status: pending same transaction Relay Process polls outbox publish retry if down Subscr. Svc idempotency check already activated for order 991? → skip
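The outbox, the relay, and the idempotency check can be sketched together. SQLite stands in for the service database, a plain callable stands in for the Kafka producer, and `record_payment`, `relay_once`, and `SubscriptionService` are hypothetical names.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (order_id INTEGER PRIMARY KEY, status TEXT)")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY, event TEXT, payload TEXT, status TEXT)"
)


def record_payment(connection, order_id):
    """Write the payment and its event together, in the same transaction."""
    with connection:
        connection.execute(
            "INSERT INTO payments (order_id, status) VALUES (?, 'completed')", (order_id,)
        )
        connection.execute(
            "INSERT INTO outbox (event, payload, status) VALUES (?, ?, 'pending')",
            ("payment.completed", json.dumps({"order_id": order_id})),
        )


def relay_once(connection, publish):
    """One polling pass: publish pending events, mark them sent on success."""
    rows = connection.execute(
        "SELECT id, event, payload FROM outbox WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, event, payload in rows:
        try:
            publish(event, json.loads(payload))
        except ConnectionError:
            break  # broker unreachable: leave the row pending, retry next pass
        # A crash right here means the event is published again later:
        # that is the at-least-once guarantee.
        with connection:
            connection.execute("UPDATE outbox SET status = 'sent' WHERE id = ?", (row_id,))
        sent += 1
    return sent


class SubscriptionService:
    """Consumer that tolerates duplicates by remembering handled order ids."""

    def __init__(self):
        self.activated = set()
        self.activations = 0

    def on_payment_completed(self, event, payload):
        order_id = payload["order_id"]
        if order_id in self.activated:
            return  # duplicate delivery: safe to ignore
        self.activated.add(order_id)
        self.activations += 1
```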
2017 · resilience
One slow service.
Fifty threads.
Nobody moves.
Don't wait.
Cut the line.
Cascade Failure Circuit Breaker Backoff + Jitter Rate Limiting
sick service doesn't take down healthy ones
exponential backoff + jitter on retry
rate limit at gateway: X req/s → 429
Feed Svc · Rec Svc · response slow...
50 threads waiting · 10s timeout · backing up too
Other Services · also timing out · cascading failure
Circuit Breaker OPEN · fail fast · return fallback immediately
pressure relieved · can recover
retry: wait 1s → 2s → 4s → 8s + jitter
rate limit @ gateway · 100 req/s → 429 Too Many Requests
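Failing fast and retrying politely can both be sketched briefly. The breaker below is a simplified closed / open / half-open state machine, and the backoff uses the "full jitter" variant (a random delay up to the exponential ceiling), which is one of several jitter strategies. Thresholds, timings, and names are assumptions.

```python
import random
import time


class CircuitBreaker:
    """Stop calling a failing dependency for a while and return a fallback."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                return fallback()  # open: fail fast, do not touch the sick service
            # Reset window elapsed: let one trial call through (half-open).
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result


def backoff_delays(attempts, base=1.0, cap=60.0, rng=random.random):
    """Exponential backoff with full jitter: random delay in [0, base * 2**n]."""
    return [rng() * min(cap, base * 2 ** n) for n in range(attempts)]
```

Without jitter, every client that failed at the same moment retries at the same moment; randomizing the delay spreads that load out.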
2004 — 2017 · one gossip platform
2004 · Monolith · PHP · MySQL · $5/month · start simple
2007 · Horizontal Scale · Load Balancer · Stateless · state → DB
2009 · Performance · Redis · Indexes · CDN · cache once serve many
2011 · Async · Job Queue · WebSocket · decouple side effects
2014 · Microservices · Kafka · Docker · API Gateway · JWT · eventual consistency
2017 · Distributed · Saga · Outbox · Circuit Breaker · plan for failure
Every line on this diagram was earned.
something broke · something was slow · something cost too much
Thank you
for watching
<3 <3 <3
Ask any questions!
But I don't have any answers!
GET /api/v1/hope/health
🍜 Eating: 200
💻 Coding: 200
💔 Loving: 404
😴 Sleeping: 503
{ null } days of no incidents