A talk by Hope

Scalable
Software Systems

why we build the way we build
Press → or Space to continue  ·  M for chapters
Context
SPA WebSocket
Docker Kubernetes CI/CD Pipeline
Microservices API Gateway Load Balancer
Kafka Message Queue Event-driven Event Sourcing
Redis Multi-instance Horizontal Scaling
JWT OAuth2 Rate Limiter Distributed Tracing
Each of these used to be someone's breakthrough.
Know the problem it solved — and you'll know when to reach for it.
2004
gossip.com went live.
Gossip.com · 2 developers · 1 VPS ($5/mo) · 1 weekend
2004 · foundation
One server.
Everything in it.
Monolith MVC HTTP endpoints MySQL VPS · $5/mo
Hope was 21. The plan was "ship before Tuesday."
No architecture meeting. No ticket. Just vibes.
Browser client → HTTP → VPS ($5/mo): App (MVC) · PHP templates · CRUD · HTTP forms → MySQL, same machine · same disk
2007 · 02:47am
[02:44:01] traffic spike — 8,400 concurrent users
[02:45:13] WARNING CPU 94% · RAM 97%
[02:46:02] WARNING DB connections exhausted (500/500)
[02:47:09] FATAL Cannot allocate memory
[02:47:09] FATAL Process killed (OOM)
[02:47:10] ERROR nginx: upstream timed out (110)
[02:47:11] ──────── server unreachable ────────
2007 · scaling
Traffic spikes.
First instinct:
make it bigger.
Vertical Scaling $5/mo → $640/mo → ceiling
3am. The scandal dropped. Everyone opened Gossip.com.
We upgraded the server four times in one night. Then: silence.
VPS · $5/mo: CPU and RAM comfortable · 50 req/min · happy ✓
VPS (upgraded ×4): CPU 100% · RAM 92% · 10,000 req/min · at the ceiling. No more headroom. What now?
2007 · scaling
Add more servers.
But they each have
their own memory.
Load Balancer Horizontal Scaling Stateless
Users kept getting logged out on refresh. Support tickets called it a "glitch."
It was not a glitch. It was architecture.
Before: Client → App 1 (session: user123) · App 2 (session: ???). ⚠ User routed to App 2 → logged out; state lives in memory on one machine.
After: Load Balancer → App 1 (stateless ✓) · App 2 (stateless ✓); session → external store (Redis / DB / JWT token).
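The fix in a minimal sketch (all names are made up, and a plain dict stands in for Redis): the session lives outside every app instance, so the load balancer can route a request anywhere.

```python
import uuid

# External session store. In production this would be Redis or a DB;
# a plain dict stands in for it here.
SESSION_STORE = {}

class AppInstance:
    """A stateless app server: it keeps no session data in local memory."""

    def __init__(self, name, store):
        self.name = name
        self.store = store  # shared and external — the whole point

    def login(self, username):
        session_id = str(uuid.uuid4())
        self.store[session_id] = {"user": username}  # write to the shared store
        return session_id                            # would be set as a cookie

    def whoami(self, session_id):
        session = self.store.get(session_id)
        return session["user"] if session else None  # None → logged out

# The load balancer can now route each request to any instance:
app1 = AppInstance("app1", SESSION_STORE)
app2 = AppInstance("app2", SESSION_STORE)

sid = app1.login("user123")   # logged in via App 1
print(app2.whoami(sid))       # App 2 still knows you → user123
```

No more "glitch": the refresh can land on App 2 and the session is still there.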
2009
40
times per second · same 5 queries · never cached
2009 · performance
Data doesn't change
every millisecond.
Why ask every time?
Redis Distributed Cache TTL / invalidation
EXPLAIN SELECT on the feed query: full table scan. 800ms.
Same query, 40 times/sec. The DB was quietly weeping.
Before: App servers run SELECT * on every request against MySQL. 40 req/sec ⚠ the same 5 queries. Local in-memory cache? ⚠ different values per server.
After: App → Redis (shared · in-memory). Hit ✓ served from cache; miss → MySQL (source of truth) only. Cache invalidation: TTL · invalidate on write ("2 hard problems in CS").
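The cache-aside pattern as a toy sketch (dicts stand in for Redis and MySQL; the names are hypothetical): reads try the cache first, misses fall through to the DB, writes invalidate.

```python
import time

class CacheAside:
    """Cache-aside with TTL: reads hit the cache first and fall through
    to the DB on a miss; writes invalidate so readers don't see stale data."""

    def __init__(self, db, ttl_seconds=60):
        self.db = db          # source of truth (a dict stands in for MySQL)
        self.ttl = ttl_seconds
        self.cache = {}       # key -> (value, expires_at); stands in for Redis
        self.db_hits = 0      # count how often we actually touch the DB

    def get(self, key):
        entry = self.cache.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]                              # cache hit
        value = self.db[key]                             # cache miss → DB
        self.db_hits += 1
        self.cache[key] = (value, time.monotonic() + self.ttl)
        return value

    def put(self, key, value):
        self.db[key] = value
        self.cache.pop(key, None)                        # invalidate on write

feed = CacheAside({"top5": ["post1", "post2"]})
for _ in range(40):          # the "40 times/sec" feed query
    feed.get("top5")
print(feed.db_hits)          # 1 — the other 39 reads never touched the DB
```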
2009 · performance
Images go through
your server.
They don't have to.
CDN DB Indexes
A user uploaded a 4 MB profile photo. The app server served it on every view. Tokyo users got 6-second loads. Singapore was fine.
Before: LB + apps serve everything: user images, JS, CSS, plus Redis and MySQL traffic.
After: CDN serves static assets (edge · closest to user); the apps serve the API only. MySQL: INDEX(user_id) · INDEX(created_at) → full scan becomes an indexed lookup.
2011
gossip.com/post/new
9.2s
to publish one post
2011 · async
A post saves to DB
in 200ms.
What else happens?
Sync chain: ~9s Async
Product team said "just add a push notification."
Hope said nothing. The spinner said everything.
POST /post, the sync chain: Save to DB 200ms ✓ → Notify followers 3,200ms → Send emails 2,400ms → Update feeds 3,100ms ≈ 9s while the user stares at a spinner.
Does the user need to wait for the notification to send? No.
2011 · async
Post service fires
an event. Done.
Others catch it later.
Kafka Pub/Sub DLQ At-least-once
The restaurant analogy helped zero frontend engineers.
It helped every backend engineer. Make of that what you will.
Before: Post Service saves the post to DB, then blocks until all the follow-up work is done.
After: Post Service emits post.created to Kafka (durable · ordered) and returns immediately. Notification, Email, and Feed Update consumers catch it later; anything that still fails after retries lands in the Dead Letter Queue.
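The shape of it, as a toy in-process bus (real Kafka adds durability, ordering, and consumer groups; every name here is invented): the publisher returns immediately, consumers run on their own, and a handler that fails after retries goes to the DLQ.

```python
from collections import defaultdict

class EventBus:
    """A toy in-process pub/sub bus with a dead letter queue."""

    MAX_RETRIES = 3

    def __init__(self):
        self.subscribers = defaultdict(list)
        self.dead_letter_queue = []

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        # The publisher's work ends here; each consumer is retried
        # independently, and persistent failures go to the DLQ.
        for handler in self.subscribers[topic]:
            for _attempt in range(self.MAX_RETRIES):
                try:
                    handler(event)
                    break
                except Exception:
                    continue
            else:
                self.dead_letter_queue.append((topic, event))

bus = EventBus()
notified, fed = [], []
bus.subscribe("post.created", notified.append)   # notification consumer
bus.subscribe("post.created", fed.append)        # feed update consumer

def flaky_email(event):                          # always fails → DLQ
    raise ConnectionError("smtp down")
bus.subscribe("post.created", flaky_email)

bus.publish("post.created", {"post_id": 42})     # post service is already done
print(len(notified), len(fed), len(bus.dead_letter_queue))  # 1 1 1
```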
2014
App · Auth · Notif ⚠ · Posts · Pay · DB · FRIDAY DEPLOY
2014 · microservices
One repo. One deploy.
One bug takes down
everything.
Service Decomposition Team autonomy Fault isolation
Notification service had a bad regex. The deploy took down payments.
Two hours. A Friday. Hope cried at a laptop for the first time.
Before: Monolith (User · Post · Notification · Payment · Feed · Auth). One team breaks it → all down.
After: Auth · Post · Feed · Notification · Payment · User services. ✓ Deploy independently · ✓ notification crash ≠ payment crash.
2014 · microservices
10 services.
Clients need all
their addresses.
API Gateway JWT / OAuth2 Docker
"Works on my machine." Docker killed this phrase.
Not everyone forgave it for the learning curve.
Before: the client must know every service address (Post, Payment, Notification, + 3 more).
After: API Gateway as the single entry point: routing · rate limit. A JWT carries identity with the request. Each service ships in its own 🐳 Docker container.
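The token idea in miniature, using stdlib HMAC rather than a real JWT library (the secret and the claims are made up): any service can verify identity locally, with no call back to Auth.

```python
import base64, hashlib, hmac, json

SECRET = b"gateway-signing-key"   # hypothetical; in production, a managed secret

def issue_token(claims: dict) -> str:
    """Sign the claims the way a JWT does (HMAC-SHA256), minus the header."""
    payload = base64.urlsafe_b64encode(json.dumps(claims).encode())
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return f"{payload.decode()}.{sig}"

def verify_token(token: str):
    """Recompute the signature; reject anything that doesn't match."""
    payload, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None                    # tampered or forged → no identity
    return json.loads(base64.urlsafe_b64decode(payload))

token = issue_token({"sub": "user123", "role": "member"})
print(verify_token(token)["sub"])      # user123 — identity travels with the request
print(verify_token(token[:-1] + "x")) # None — signature mismatch
```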
2017
847
support tickets · charged but no post · no rollback · money gone
2017 · distributed
Two services.
Two databases.
What if one crashes?
No ACID across services · inconsistency
847 tickets in one night. "I paid but got nothing."
The mental model of "just roll back" died that night.
User pays → Payment charges ✓ ($9.99 saved to payments_db) → Post Service creates the post in posts_db. Two services · two databases · no shared transaction.
Post Service 💥 crashes → the post is never created: the row exists ✓ in payments_db but is missing from posts_db. No shared transaction boundary to roll back.
2017 · distributed
Coordinate steps.
If one fails,
roll back the others.
Saga Pattern Outbox Pattern Idempotency
The saga pattern has a much cooler name than it deserves.
It's just: do a step; if it breaks, undo the steps that came before.
1. Payment charges ✓. How do we tell the post service reliably? An outbox table, written in the same DB transaction → a relay guarantees delivery to Kafka (payment.charged).
2. Post Service consumes the event. Done ✓, or ⚠ fails and emits a compensate event → Payment refunds.
Idempotency key: retry safe · same key = same result.
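The idempotency-key half, sketched (all names are hypothetical; a dict stands in for the processed-keys table): retrying the same request with the same key replays the stored result instead of charging twice.

```python
class PaymentService:
    """Charge with an idempotency key: same key = same result, no double charge."""

    def __init__(self):
        self.processed = {}   # idempotency_key -> stored result
        self.charges = []     # what actually hit the card

    def charge(self, idempotency_key, user, amount):
        if idempotency_key in self.processed:
            return self.processed[idempotency_key]   # replay, don't re-charge
        self.charges.append((user, amount))          # the real side effect
        result = {"status": "charged", "user": user, "amount": amount}
        self.processed[idempotency_key] = result
        return result

payments = PaymentService()
# The client times out and retries with the same key (e.g. one key per checkout):
for _ in range(3):
    payments.charge("checkout-7f3a", "user123", 9.99)
print(len(payments.charges))   # 1 — retried three times, charged once
```

This is what makes the saga's retries and compensations safe to repeat.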
2017 · distributed
A flaky service
takes down callers.
Don't cascade.
Circuit Breaker Rate Limiting CAP Theorem
One flaky service held connections open for 30 seconds.
3 other services queued behind it. It was a Tuesday. It always is.
Before: Service A → Service B (flaky / slow) · Service C also timing out. Cascading failure: one slow service hangs all callers.
Circuit breaker states: CLOSED (normal) → errors → OPEN (fail fast) → timeout → HALF-OPEN (probe).
CAP Theorem: Consistency · Availability · Partition tolerance · pick two → we chose AP.
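One possible shape of a circuit breaker, as a sketch (thresholds and names are invented): after enough failures the breaker opens and callers fail fast instead of hanging; after a timeout it lets one probe through.

```python
import time

class CircuitBreaker:
    """CLOSED → (failures) → OPEN (fail fast) → (timeout) → HALF_OPEN (probe)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock            # injectable, so tests can fake time
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "HALF_OPEN"          # allow one probe through
            else:
                raise RuntimeError("circuit open — failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"               # trip: stop hammering it
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.state = "CLOSED"                     # success closes the circuit
        return result

breaker = CircuitBreaker(failure_threshold=2, reset_timeout=30.0)

def flaky():                                      # stands in for slow Service B
    raise TimeoutError("service B is slow")

for _ in range(2):
    try:
        breaker.call(flaky)
    except TimeoutError:
        pass
print(breaker.state)   # OPEN — further calls fail fast, no 30-second hangs
```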
Gossip.com · 2026
Same weekend project. Now a scalable platform.
every box is the answer to a specific pain
Client → CDN → API Gateway (load balancing · rate limiting · routing · auth) → Auth · Post · Payment services.
Redis (read cache) · Outbox · Circuit Breaker · Kafka event bus → Feed, Notif, Email consumers · DLQ.
DB Primary: all writes · source of truth. S3 / Storage: media · objects.
Infrastructure: Docker · Kubernetes · CI/CD pipeline · Auto Scaling · Multi-region.
Thanks for
watching
<3 <3 <3
Ask any questions!
But I don't have any answers!
The Hope Status
Eating: 200
Coding: 200
Loving: 404
Sleeping: 503