A talk by Hope
Scalable
Software Systems
why we build the way we build
Context
Microservices
Kafka
Redis
Docker
Kubernetes
Kong API Gateway
WebSocket
Event-driven
SPA
Cloud / AWS
JWT
OAuth2 / SSO
Rate Limiter
CDN
Horizontal Scaling
Each of these was once someone's breakthrough on a real problem.
Know the problem — and you'll know when to reach for it.
2004
"Happiness is hearing something
you weren't supposed to.
And to find happiness, we gossip about others' lives."
2004 · foundation
Gossip.com
went live.
One server. Everything in it.
Monolith
MVC
MySQL
VPS · $5/mo
Routing, auth, templates, business logic, sessions — one process. Simple. And it worked.
2007
gossip.com · production · nginx/app · 2007-08-14
──────────────────────────────────────────────────────
[02:44:01] traffic spike detected — concurrent users: 847
[02:45:18] WARNING db connections: 480 / 500
[02:46:02] WARNING CPU: 91% · response p99: 4.2s
[02:47:00] ERROR db connections exhausted — pool full
[02:47:03] ERROR CPU: 100% · workers unresponsive
[02:47:10] FATAL nginx: upstream timed out — 502 Bad Gateway
[02:47:10] FATAL server unreachable
──────────────────────────────────────────────────────
Baby Hope learned what a production incident is.
2007 · scaling
Go bigger
and stronger.
Vertical Scaling
$5 → $40 → $320 → $640/mo. Then nothing. Hardware stops scaling.
✓ buys you time
✗ hard ceiling, not a real solution
2007 · horizontal scaling
Go out,
not up.
Load Balancer
Horizontal Scaling
Imbalanced
Least Connections
Session Mismatch
Server 1 remembers you. Server 2 doesn't. You're logged out.
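The "Least Connections" rule named on this slide can be sketched in a few lines. This is a minimal illustration, not how any particular load balancer implements it; the `Server` class and the `app-1` style names are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    active: int = 0  # connections currently open on this server


def pick_least_connections(servers: list[Server]) -> Server:
    """Return the server with the fewest open connections (ties go to the first)."""
    return min(servers, key=lambda s: s.active)


# Hypothetical pool: app-2 is the least busy, so it gets the next request.
servers = [Server("app-1", active=12), Server("app-2", active=3), Server("app-3", active=7)]
chosen = pick_least_connections(servers)
chosen.active += 1  # the balancer records the new connection
print(chosen.name)
```

Round-robin would ignore the `active` counts entirely, which is one way a pool ends up imbalanced when some requests are much slower than others.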
2007 · horizontal scaling
You log in on
Server 1. Refresh.
Server 2 has never
met you.
Same user,
same server.
Until that server dies.
Sticky Sessions
False Fix
IP hash routes users to one server. Server dies. All their sessions die with it.
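Why the fix is false can be shown with a small simulation. The sketch below assumes a simple hash-modulo router and in-memory per-server sessions; the function name `route_by_ip`, the server names, and the IP address are all hypothetical.

```python
import hashlib


def route_by_ip(client_ip: str, servers: list[str]) -> str:
    """Map a client IP to one server by hashing it (sticky routing)."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


servers = ["app-1", "app-2", "app-3"]
sessions = {name: {} for name in servers}  # each server's private session memory

ip = "203.0.113.7"
home = route_by_ip(ip, servers)
sessions[home][ip] = {"user": "hope"}  # login is stored only on the home server

# While the pool is unchanged, the same IP always lands on the same server.
same_server_again = route_by_ip(ip, servers) == home

# The home server dies: its memory, including the session, is gone.
del sessions[home]
survivors = [s for s in servers if s != home]
new_home = route_by_ip(ip, survivors)
logged_in = ip in sessions[new_home]
print(same_server_again, new_home != home, logged_in)
```

The routing itself works as promised; the problem is that the session still lives in exactly one place.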
2007 · stateless architecture
Move state out.
Servers forget.
Stateless
Shared Sessions
Scale freely
Sessions lived inside each server. Swapping servers meant losing you.
✓ any server handles any request
✓ add / remove servers freely
→ DB = single source of truth
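A minimal sketch of the idea, assuming sessions move into one shared store that every app server talks to. `SessionStore` here is an in-memory stand-in for a database or cache, and both class names are invented for the example.

```python
import secrets


class SessionStore:
    """Stand-in for a shared store (a database or cache) that all servers can reach."""

    def __init__(self):
        self._data = {}

    def create(self, user: str) -> str:
        sid = secrets.token_hex(16)
        self._data[sid] = user
        return sid

    def get(self, sid: str):
        return self._data.get(sid)


class AppServer:
    """Holds no session state of its own; every lookup goes to the shared store."""

    def __init__(self, name: str, store: SessionStore):
        self.name = name
        self.store = store

    def login(self, user: str) -> str:
        return self.store.create(user)

    def whoami(self, sid: str) -> str:
        user = self.store.get(sid)
        return f"{self.name}: hello {user}" if user else f"{self.name}: please log in"


store = SessionStore()
s1, s2 = AppServer("server-1", store), AppServer("server-2", store)
sid = s1.login("hope")   # log in on server 1
print(s2.whoami(sid))    # refresh lands on server 2, which still knows the user
```

Because neither server keeps anything between requests, either one can be removed or replaced without logging anyone out.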
2009
40×
same five queries · per second · never cached
2009 · caching
Compute once.
Remember the answer.
Caching
Redis
TTL
Cache Invalidation
✓ 40× fewer DB queries
✗ stale data up to TTL window
→ invalidate on write for accuracy
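The trade-off above can be sketched with a tiny in-process cache. This stands in for Redis purely for illustration; `TTLCache`, `top_posts_query`, and the 60-second TTL are assumptions, not values from the talk.

```python
import time


class TTLCache:
    """Minimal cache: entries expire after ttl_seconds, or when invalidated."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # cache hit: no database work
        value = compute()            # cache miss: run the expensive query once
        self._entries[key] = (now + self.ttl, value)
        return value

    def invalidate(self, key):
        """Call this on write so readers do not wait out the TTL for fresh data."""
        self._entries.pop(key, None)


stats = {"db_hits": 0}


def top_posts_query():
    stats["db_hits"] += 1            # pretend this is a slow SQL query
    return ["post-1", "post-2"]


cache = TTLCache(ttl_seconds=60)
for _ in range(40):
    cache.get_or_compute("top_posts", top_posts_query)
hits_after_reads = stats["db_hits"]  # 40 reads cost one query

cache.invalidate("top_posts")        # a new post was written
cache.get_or_compute("top_posts", top_posts_query)
print(hits_after_reads, stats["db_hits"])
```

Without the `invalidate` call, a reader could see the old list for up to the full TTL after a write.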
2009 · query performance
10M rows read.
To find
847.
EXPLAIN
B-Tree Index
✓ reads: seconds → milliseconds
✗ every write must update the index
→ index what you actually query on
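The effect of an index on the query plan can be seen with SQLite, used here only because it ships with Python; the talk's stack was MySQL, whose `EXPLAIN` output looks different. The `posts` table, its columns, and the index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, body TEXT)")
conn.executemany(
    "INSERT INTO posts (author_id, body) VALUES (?, ?)",
    [(i % 1000, "gossip") for i in range(10_000)],
)


def plan(sql: str) -> str:
    """Return SQLite's query plan as text (the 'detail' column of each row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


query = "SELECT body FROM posts WHERE author_id = 42"

before = plan(query)  # no index on author_id: the whole table is scanned
conn.execute("CREATE INDEX idx_posts_author ON posts (author_id)")
after = plan(query)   # the planner can now search the index instead

print(before)
print(after)
```

The same habit applies on any database: read the plan first, then index the columns the slow query actually filters on.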
2009 · global users
Gossip is not local.
It's a
fundamental human need.
You can't refactor
your way out of
geography.
Latency
CDN
Edge Nodes
✓ 400ms → 12ms for static files
✓ origin servers free for logic
→ static only; dynamic requests still go to origin
2011
9.2s
to publish one post
2011 · async
User waited 9.2s.
Server was
sending emails.
Synchronous
Blocking
Side Effects
→ post is saved in 200ms
✗ user blocked by work they don't need
→ notifications are side effects: decouple them
2011 · job queue
The post is saved.
The user is
still waiting.
Write it down.
Walk away.
Someone else handles it.
Job Queue
Atomicity
Background Worker
✓ user sees response in 200ms
✓ post + job in same DB transaction
→ jobs may run more than once, so be idempotent
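The two guarantees on this slide, atomic enqueue and idempotent execution, can be sketched with SQLite as the database. The table names, `publish`, and `run_job` are assumptions for illustration, and "sending an email" is reduced to inserting a row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT);
CREATE TABLE jobs (id INTEGER PRIMARY KEY, post_id INTEGER, kind TEXT, done INTEGER DEFAULT 0);
CREATE TABLE sent_emails (post_id INTEGER PRIMARY KEY);
""")


def publish(body: str) -> int:
    """Save the post and enqueue its job in ONE transaction: both commit, or neither."""
    with conn:
        post_id = conn.execute("INSERT INTO posts (body) VALUES (?)", (body,)).lastrowid
        conn.execute("INSERT INTO jobs (post_id, kind) VALUES (?, 'notify')", (post_id,))
    return post_id


def run_job(job_id: int) -> None:
    """Background worker step. Safe to run twice: the second send is ignored."""
    (post_id,) = conn.execute("SELECT post_id FROM jobs WHERE id = ?", (job_id,)).fetchone()
    with conn:
        # The primary key on sent_emails makes a repeated send a no-op.
        conn.execute("INSERT OR IGNORE INTO sent_emails (post_id) VALUES (?)", (post_id,))
        conn.execute("UPDATE jobs SET done = 1 WHERE id = ?", (job_id,))


post_id = publish("did you hear?")   # the user gets their response here
job_id = conn.execute("SELECT id FROM jobs WHERE post_id = ?", (post_id,)).fetchone()[0]

run_job(job_id)
run_job(job_id)                      # simulated retry after a worker crash

emails_sent = conn.execute("SELECT COUNT(*) FROM sent_emails").fetchone()[0]
print(emails_sent)
```

If the job lived in a separate queue system instead of the same database, a crash between the two writes could leave a post with no notification job, or a job for a post that was never saved.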
2011 · real-time
In love and in feeds —
everyone wants
to be first.
Stop asking.
Let the server
speak first.
Polling
WebSocket
Pub/Sub
✓ server pushes instantly on event
✗ WS connections are stateful, so servers need coordination
→ Redis pub/sub: any server can push to any client
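The coordination problem and its fix can be modelled without a network. In this sketch `Broker` is an in-process stand-in for a pub/sub channel such as Redis pub/sub, and each `WsServer` keeps a list per connected client in place of a real socket; all names are hypothetical.

```python
from collections import defaultdict


class Broker:
    """In-process stand-in for a pub/sub channel shared by all servers."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, channel: str, callback) -> None:
        self._subscribers[channel].append(callback)

    def publish(self, channel: str, message: str) -> None:
        for callback in self._subscribers[channel]:
            callback(message)


class WsServer:
    """Each server only holds the sockets of clients connected to it."""

    def __init__(self, name: str, broker: Broker):
        self.name = name
        self.broker = broker
        self.inboxes = {}  # client id -> messages pushed over that client's socket
        broker.subscribe("feed", self._fan_out)

    def connect(self, client_id: str) -> None:
        self.inboxes[client_id] = []

    def _fan_out(self, message: str) -> None:
        for inbox in self.inboxes.values():
            inbox.append(message)

    def post(self, message: str) -> None:
        # Publish to the broker instead of pushing only to local sockets.
        self.broker.publish("feed", message)


broker = Broker()
ws1, ws2 = WsServer("ws-1", broker), WsServer("ws-2", broker)
ws1.connect("alice")
ws2.connect("bob")

ws1.post("new gossip")  # the event arrives on ws-1, yet bob on ws-2 hears it too
print(ws1.inboxes, ws2.inboxes)
```

Without the broker, `post` could only reach clients connected to the same server that received the event.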
2014
Friday deploy.
20 engineers
1 codebase
notification team broke payments
2014 · microservices
Payments crashed.
It took Search,
Feeds, Auth —
everything.
Mind your own
business.
One service.
One job.
Microservices
Kafka
Eventual Consistency
✓ deploy, fail, scale independently
✗ distributed complexity: no shared DB
→ choose availability over perfect consistency
2014 · api gateway
The client memorized
your entire org chart.
And their phone numbers.
One door.
Everything else:
not your problem.
API Gateway
Routing
Rate Limiting
Auth
✓ client only knows one address
✓ auth, rate limit, logging in one place
✗ single point of failure if misconfigured or run without redundancy
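What "one door" does on each request can be sketched as a single function. This is an illustration only, not Kong's behaviour or configuration; the route table, service names, the limit of 5 requests per second, and the fixed one-second window are all assumptions.

```python
import time
from collections import defaultdict

# Hypothetical route table: path prefix -> internal service name.
ROUTES = {"/payments": "payments-svc", "/search": "search-svc", "/feed": "feed-svc"}
LIMIT_PER_SECOND = 5          # assumed limit, per client
_hits = defaultdict(int)      # (client, whole second) -> request count


def handle(client: str, path: str, headers: dict, now=None):
    """Gateway pipeline: authenticate, rate limit, then route."""
    now = time.time() if now is None else now

    if "Authorization" not in headers:
        return 401, "missing credentials"

    window = (client, int(now))           # fixed one-second window
    _hits[window] += 1
    if _hits[window] > LIMIT_PER_SECOND:
        return 429, "too many requests"

    for prefix, service in ROUTES.items():
        if path == prefix or path.startswith(prefix + "/"):
            return 200, f"forwarded to {service}"
    return 404, "no such route"


auth = {"Authorization": "Bearer demo"}
responses = [handle("hope", "/feed/latest", auth, now=100.0) for _ in range(6)]
print(responses[0], responses[-1])
```

The client only ever calls `handle`; which service sits behind `/feed` can change without the client noticing.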
2014 · identity
Identity that
travels without asking.
JWT
Token
Stateless Auth
✓ any service verifies locally, no auth call
✓ signed, so tamper-evident
✗ can't revoke before expiry without extra work
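The three points above can be demonstrated with an HS256-style token built from the standard library. This is a teaching sketch under assumptions (the `sign` and `verify` helpers, the secret, and the claim values are invented, and the header's `alg` field is not checked); a production system would normally use a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time


def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign(claims: dict, secret: bytes) -> str:
    """Build header.payload.signature, signed with HMAC-SHA256."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url(sig)}"


def verify(token: str, secret: bytes, now=None):
    """Return the claims if the signature is valid and unexpired, else None.
    No network call: any service holding the secret can do this locally."""
    try:
        header, payload, sig = token.split(".")
        expected = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
        if not hmac.compare_digest(expected, b64url_decode(sig)):
            return None
        claims = json.loads(b64url_decode(payload))
    except ValueError:
        return None  # malformed token
    if claims.get("exp", 0) <= (time.time() if now is None else now):
        return None  # expired
    return claims


secret = b"demo-secret"
token = sign({"sub": "hope", "exp": 2_000_000_000}, secret)

# Tampering: swap in a different payload but keep the old signature.
header, _, sig = token.split(".")
forged_payload = b64url(json.dumps({"sub": "admin", "exp": 2_000_000_000}).encode())
forged = f"{header}.{forged_payload}.{sig}"

print(verify(token, secret, now=1_700_000_000))   # valid until exp
print(verify(forged, secret, now=1_700_000_000))  # rejected
```

Note what `verify` never consults: a list of revoked tokens. Until `exp` passes, the token is accepted, which is the revocation cost named on the slide.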
2014 · identity federation
Don't own
identity.
Delegate it.
OAuth2
OIDC
SSO
✓ never store or touch user credentials
✓ one integration pattern, any provider
✓ enterprise SSO with existing company accounts
2014 · containers
Works on
my machine.
→ Ship the machine.
Docker
Container
Reproducible Builds
✓ identical env: local → staging → prod
✓ no "works on my machine"
→ Dockerfile = reproducible single source of truth
Why don't we
ship the machine?
2017
support ticket · 2017-03-14 · 09:41am
"I was charged, but I don't have premium."
and then another one. and five more.
2017 · distributed transactions
That support ticket
is the receipt.
Four services.
Four databases.
No shared undo button.
Partial Failure
No Atomicity
✗ payment committed · subscription never ran
✗ no automatic rollback across services
→ this is the fundamental problem of distributed transactions
2017 · saga pattern
We can't roll back.
Can we at least
apologize?
Compensate.
Don't pretend
it didn't happen.
Saga
Compensation
Local Transactions
✓ no shared transaction needed
→ each step: forward action + compensating action
✗ more code: every path must be planned
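The "forward action + compensating action" pairing can be sketched as a small runner. This is one simple, in-process way to express a saga; the `run_saga` helper and the charge/refund steps are hypothetical and stand in for calls to separate services.

```python
def run_saga(steps):
    """steps: list of (name, action, compensation).
    Runs actions in order; on failure, compensates completed steps in reverse."""
    log = []
    completed = []
    for name, action, compensate in steps:
        try:
            action()
            log.append(f"did {name}")
            completed.append((name, compensate))
        except Exception:
            log.append(f"failed {name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated {done_name}")
            break
    return log


state = {"charged": False, "premium": False}


def charge():
    state["charged"] = True


def refund():
    state["charged"] = False  # the apology: undo the charge with a new action


def activate_premium():
    raise RuntimeError("subscription service is down")


def deactivate_premium():
    state["premium"] = False


log = run_saga([
    ("charge", charge, refund),
    ("activate premium", activate_premium, deactivate_premium),
])
print(log)
print(state)
```

The charge really happened and the refund really happened; nothing is rolled back in the database sense. A durable saga would also persist its progress so that compensation survives a crash of the runner itself.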
2017 · outbox pattern
The payment went in.
The event
was in the air.
If it's not
in the database,
it didn't happen.
Outbox Pattern
At-least-once
Idempotency
✓ event always published eventually, even if Kafka was down
→ at-least-once: event may arrive twice
→ idempotency: duplicate = safe to ignore
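The pattern can be sketched end to end with SQLite as the service database and plain functions in place of a broker. The table names, `record_payment`, `relay`, and the event format are assumptions for illustration; a real relay would publish to Kafka or a similar broker.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE payments (id INTEGER PRIMARY KEY, user_id TEXT, amount INTEGER);
CREATE TABLE outbox (id INTEGER PRIMARY KEY, event TEXT, published INTEGER DEFAULT 0);
""")


def record_payment(user_id: str, amount: int) -> None:
    """Write the payment and its event in one local transaction."""
    with conn:
        conn.execute("INSERT INTO payments (user_id, amount) VALUES (?, ?)", (user_id, amount))
        conn.execute("INSERT INTO outbox (event) VALUES (?)", (f"payment_succeeded:{user_id}",))


def relay(publish) -> None:
    """Publish pending outbox rows; mark a row only after publish() succeeds."""
    rows = conn.execute("SELECT id, event FROM outbox WHERE published = 0 ORDER BY id").fetchall()
    for row_id, event in rows:
        publish(row_id, event)  # raises if the broker is down; the row stays pending
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))


grants = []             # every time premium is actually granted
seen_event_ids = set()  # consumer-side dedupe


def subscription_consumer(event_id: int, event: str) -> None:
    if event_id in seen_event_ids:
        return          # duplicate delivery: safe to ignore
    seen_event_ids.add(event_id)
    grants.append(event.split(":", 1)[1])


def broker_down(event_id: int, event: str) -> None:
    raise ConnectionError("broker unavailable")


record_payment("hope", 5)
try:
    relay(broker_down)           # first attempt fails; nothing is lost
except ConnectionError:
    pass
relay(subscription_consumer)     # later attempt delivers the event
subscription_consumer(1, "payment_succeeded:hope")  # simulated duplicate delivery

pending = conn.execute("SELECT COUNT(*) FROM outbox WHERE published = 0").fetchone()[0]
print(grants, pending)
```

If the relay crashes after publishing but before marking the row, the event goes out again on the next run, which is why the consumer has to tolerate duplicates.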
2017 · resilience
One slow service.
Fifty threads.
Nobody moves.
Don't wait.
Cut the line.
Cascade Failure
Circuit Breaker
Backoff + Jitter
Rate Limiting
✓ sick service doesn't take down healthy ones
→ exponential backoff + jitter on retry
→ rate limit at gateway: X req/s → 429
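Two of the tools on this slide, the circuit breaker and backoff with jitter, can be sketched as follows. The thresholds, the "full jitter" variant of backoff, and every name in the block are assumptions chosen for the example rather than values from the talk.

```python
import random
import time


def backoff_delays(attempts: int, base: float = 0.1, cap: float = 5.0, rng=random.random):
    """Exponential backoff with full jitter: attempt n waits a random
    fraction of min(cap, base * 2**n) seconds."""
    return [rng() * min(cap, base * 2 ** n) for n in range(attempts)]


class CircuitBreaker:
    """After failure_threshold consecutive failures, fail fast for reset_after seconds."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


calls = {"n": 0}


def slow_service():
    calls["n"] += 1
    raise TimeoutError("upstream timed out")


breaker = CircuitBreaker(failure_threshold=3, reset_after=30.0)
outcomes = []
for _ in range(5):
    try:
        breaker.call(slow_service)
    except TimeoutError:
        outcomes.append("timeout")    # the call reached the sick service
    except RuntimeError:
        outcomes.append("fast-fail")  # the breaker refused without calling

print(outcomes, calls["n"])
print(backoff_delays(4, rng=lambda: 1.0))  # upper bounds of each delay
```

After the third timeout the breaker stops forwarding calls, so threads return immediately instead of queueing behind the slow service. The jitter spreads retries out so that many clients do not all retry at the same instant.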
2004 — 2017 · one gossip platform
Thank you
for watching
<3 <3 <3
Ask any questions!
But I don't have any answers!
GET /api/v1/hope/health
🍜 Eating: 200
💻 Coding: 200
💔 Loving: 404
😴 Sleeping: 503
{ null } days of no incidents