A talk by Hope

Scalable
Software Systems

why we build the way we build


Context

Microservices Kafka Redis

Docker Kubernetes Kong API Gateway

WebSocket Event-driven SPA Cloud / AWS

JWT OAuth2 / SSO Rate Limiter CDN Horizontal Scaling

Each of these was once someone's breakthrough solution to a real problem.
Know the problem — and you'll know when to reach for it.

2004

"Happiness is hearing something
you weren't supposed to.
And to find happiness, we gossip about others' lives."

Baby Hope was 4. He decided to build a place where people could gossip together.

2004 · foundation

Gossip.com
went live.
One server. Everything in it.

Monolith MVC MySQL VPS · $5/mo

Routing, auth, templates, business logic, sessions — one process. Simple. And it worked.

2007

gossip.com · production · nginx/app · 2007-08-14
──────────────────────────────────────────────────────
[02:44:01] traffic spike detected — concurrent users: 847
[02:45:18] WARNING db connections: 480 / 500
[02:46:02] WARNING CPU: 91% · response p99: 4.2s
[02:47:00] ERROR db connections exhausted — pool full
[02:47:03] ERROR CPU: 100% · workers unresponsive
[02:47:10] FATAL nginx: upstream timed out — 502 Bad Gateway
[02:47:10] FATAL server unreachable
──────────────────────────────────────────────────────

Baby Hope learned what a production incident is.

2007 · scaling

Go bigger
and stronger.

Vertical Scaling

$5 → $40 → $320 → $640/mo. Then nothing. Hardware stops scaling.

✓buys you time

✗hard ceiling — not a real solution

2007 · horizontal scaling

Go out,
not up.

Load Balancer Horizontal Scaling Imbalanced Least Connections Session Mismatch

Server 1 remembers you. Server 2 doesn't. You're logged out.

2007 · horizontal scaling

You log in on
Server 1. Refresh.
Server 2 has never
met you.

Same user,
same server.
Until that server dies.

Sticky Sessions False Fix

IP hash routes users to one server. Server dies. All their sessions die with it.
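
The sticky-session failure can be sketched in a few lines. This is a toy model, not the talk's actual setup: the `Server` class, the `route` function, and the SHA-256 hash choice are all assumptions made for illustration.

```python
import hashlib

class Server:
    """Hypothetical app server that keeps sessions in its own memory."""
    def __init__(self, name):
        self.name = name
        self.sessions = {}  # session data lives and dies with this process

def route(client_ip, servers):
    """IP-hash routing: the same IP maps to the same server while the pool is unchanged."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]

servers = [Server("s1"), Server("s2"), Server("s3")]
ip = "203.0.113.7"

# Log in: the session is stored only on the server the hash picked.
home = route(ip, servers)
home.sessions[ip] = {"user": "hope"}
still_sticky = route(ip, servers) is home

# The home server dies and leaves the pool; its sessions go with it.
survivors = [s for s in servers if s is not home]
fallback = route(ip, survivors)
logged_in = ip in fallback.sessions
```

Routing stays consistent only while the pool is intact. Once `home` is gone, `fallback` has no record of the user, which is the "false fix" on the slide.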

2007 · stateless architecture

Move state out.
Servers forget.

Stateless Shared Sessions Scale freely

Sessions lived inside each server. Swapping servers meant losing you.

✓any server handles any request

✓add / remove servers freely

→DB = single source of truth
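
A minimal sketch of "move state out", assuming an in-memory `SessionStore` as a stand-in for the shared database (or Redis) that every server talks to. The class and method names are hypothetical.

```python
import secrets

class SessionStore:
    """Stand-in for a shared store; in production this would be a database or Redis."""
    def __init__(self):
        self._data = {}

    def create(self, user):
        sid = secrets.token_hex(16)  # unguessable session id handed to the client
        self._data[sid] = {"user": user}
        return sid

    def get(self, sid):
        return self._data.get(sid)

class StatelessServer:
    """Holds no session state of its own; every lookup goes to the shared store."""
    def __init__(self, name, store):
        self.name = name
        self.store = store

    def login(self, user):
        return self.store.create(user)

    def whoami(self, sid):
        session = self.store.get(sid)
        return session["user"] if session else None

store = SessionStore()
s1 = StatelessServer("s1", store)
s2 = StatelessServer("s2", store)

sid = s1.login("hope")     # log in on server 1
seen_by_s2 = s2.whoami(sid)  # refresh lands on server 2
```

Because neither server owns the session, either one can be removed or replaced without logging anyone out.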

2009

40×

same five queries · per second · never cached

2009 · caching

Compute once.
Remember the answer.

Caching Redis TTL Cache Invalidation

✓40× fewer DB queries

✗stale data up to TTL window

→invalidate on write for accuracy
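
The three bullets above can be sketched with a tiny in-process TTL cache. This is an illustration under assumptions: `TTLCache`, the fake `db` dict, and the injectable clock are hypothetical, and a real deployment would use a shared cache such as Redis.

```python
import time

class TTLCache:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key, compute):
        """Return the cached value, or call compute() and remember the answer."""
        hit = self._entries.get(key)
        if hit is not None and hit[0] > self.clock():
            return hit[1]
        value = compute()
        self._entries[key] = (self.clock() + self.ttl, value)
        return value

    def invalidate(self, key):
        """Drop an entry so the next read goes back to the source."""
        self._entries.pop(key, None)

# Fake database plus a counter of how often it is actually queried.
db = {"top_posts": ["a", "b"]}
stats = {"db_reads": 0}

def query_top_posts():
    stats["db_reads"] += 1
    return list(db["top_posts"])

now = [0.0]  # fake clock so the example needs no sleeping
cache = TTLCache(ttl_seconds=30, clock=lambda: now[0])

for _ in range(40):
    cache.get("top_posts", query_top_posts)
reads_after_loop = stats["db_reads"]  # 40 requests, one DB query

# A write invalidates the entry, so readers do not wait out the TTL.
db["top_posts"].append("c")
cache.invalidate("top_posts")
fresh = cache.get("top_posts", query_top_posts)
```

Without the `invalidate` call, readers would keep seeing `["a", "b"]` until the 30-second TTL ran out.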

2009 · query performance

10M rows read.
To find
847.

EXPLAIN B-Tree Index

✓reads: seconds → milliseconds

✗every write must update the index

→index what you actually query on
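
A hedged sketch of the before/after, using Python's built-in SQLite rather than the MySQL named earlier in the talk. SQLite's `EXPLAIN QUERY PLAN` plays the role of `EXPLAIN` here; the `posts` table and `idx_posts_author` index are made-up names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, body TEXT)")
conn.executemany(
    "INSERT INTO posts (author_id, body) VALUES (?, ?)",
    ((i % 1000, "gossip") for i in range(20_000)),
)

def plan(sql):
    """Return SQLite's query plan as one string (the last column is the detail text)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)

query = "SELECT id FROM posts WHERE author_id = 42"

before = plan(query)  # without an index the plan is a full-table SCAN
conn.execute("CREATE INDEX idx_posts_author ON posts (author_id)")
after = plan(query)   # with the index the plan becomes a SEARCH using it

print(before)
print(after)
```

The exact plan wording varies between SQLite versions, but the shift from a scan to an index search is the point. Every later `INSERT` now also maintains `idx_posts_author`, which is the write cost on the slide.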

2009 · global users

Gossip is not local.
It's a
fundamental human need.

You can't refactor
your way out of
geography.

Latency CDN Edge Nodes

✓400ms → 12ms for static files

✓origin servers free for logic

→static only — dynamic still goes to origin

2011

gossip.com/post

9.2s

to publish one post

2011 · async

User waited 9.2s.
Server was
sending emails.

Synchronous Blocking Side Effects

→post is saved in 200ms

✗user blocked by work they don't need

→notifications are side effects — decouple them

2011 · job queue

The post is saved.
The user is
still waiting.

Write it down.
Walk away.
Someone else handles it.

Job Queue Atomicity Background Worker

✓user sees response in 200ms

✓post + job in same DB transaction

→jobs may run more than once — be idempotent
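
One way to sketch "post + job in the same DB transaction" plus an idempotent worker, using an in-memory SQLite database. The schema, `publish`, and `run_job` are assumptions for illustration, not the platform's real code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT);
CREATE TABLE jobs (id INTEGER PRIMARY KEY, kind TEXT, post_id INTEGER, done INTEGER DEFAULT 0);
CREATE TABLE sent_notifications (post_id INTEGER PRIMARY KEY);
""")

def publish(body):
    """Save the post and enqueue its job atomically: both commit or neither does."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute("INSERT INTO posts (body) VALUES (?)", (body,))
        conn.execute(
            "INSERT INTO jobs (kind, post_id) VALUES ('notify', ?)", (cur.lastrowid,)
        )
    return cur.lastrowid

def run_job(job_id):
    """Background worker step; safe to run more than once for the same job."""
    post_id = conn.execute(
        "SELECT post_id FROM jobs WHERE id = ?", (job_id,)
    ).fetchone()[0]
    with conn:
        # The primary key on post_id turns a repeated run into a no-op.
        conn.execute(
            "INSERT OR IGNORE INTO sent_notifications (post_id) VALUES (?)", (post_id,)
        )
        conn.execute("UPDATE jobs SET done = 1 WHERE id = ?", (job_id,))

post_id = publish("did you hear?")
job_id = conn.execute("SELECT id FROM jobs WHERE post_id = ?", (post_id,)).fetchone()[0]

run_job(job_id)
run_job(job_id)  # simulated duplicate delivery
sent = conn.execute("SELECT COUNT(*) FROM sent_notifications").fetchone()[0]
```

The user-facing request ends after `publish`. Sending the notification happens later in the worker, and running the job twice still records only one notification.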

2011 · real-time

In love and in feeds —
everyone wants
to be first.

Stop asking.
Let the server
speak first.

Polling WebSocket Pub/Sub

✓server pushes instantly on event

✗WS connections are stateful — need coordination

→Redis pub/sub: any server can push to any client
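
The coordination problem can be sketched without any network code. Here an in-process `Broker` stands in for Redis pub/sub, and a list per client stands in for an open WebSocket; all class names and the `"feed"` channel are hypothetical.

```python
from collections import defaultdict

class Broker:
    """In-process stand-in for a pub/sub broker such as Redis."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self._subscribers[channel].append(callback)

    def publish(self, channel, message):
        for callback in list(self._subscribers[channel]):
            callback(message)

class WsServer:
    """Each server only knows the clients connected to it."""
    def __init__(self, name, broker):
        self.name = name
        self.clients = {}  # client_id -> messages pushed (stand-in for sockets)
        broker.subscribe("feed", self._on_event)

    def connect(self, client_id):
        self.clients[client_id] = []

    def _on_event(self, message):
        # Fan the event out to every locally connected client.
        for inbox in self.clients.values():
            inbox.append(message)

broker = Broker()
server_a = WsServer("a", broker)
server_b = WsServer("b", broker)
server_a.connect("alice")
server_b.connect("bob")

# Whichever server handles the new post publishes once; every server pushes to its own clients.
broker.publish("feed", "new post #1")
```

Alice and Bob sit on different servers, yet both receive the event, because no server needs to know where any particular client is connected.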

2014

Friday deploy.

20 engineers · 1 codebase · notification team broke payments

2014 · microservices

Payments crashed.
It took Search,
Feeds, Auth —
everything.

Mind your own
business.
One service.
One job.

Microservices Kafka Eventual Consistency

✓deploy, fail, scale independently

✗distributed complexity: no shared DB

→choose availability over perfect consistency

2014 · api gateway

The client memorized
your entire org chart.
And their phone numbers.

One door.
Everything else:
not your problem.

API Gateway Routing Rate Limiting Auth

✓client only knows one address

✓auth, rate limit, logging in one place

✗single point of failure if it goes down or is misconfigured
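
Rate limiting "in one place" is often done with a token bucket. The sketch below is one possible shape, not a description of Kong or any specific gateway; `TokenBucket`, `gateway`, and the chosen rate and burst are assumptions.

```python
import time

class TokenBucket:
    """Allows bursts up to `burst` requests, refilling at `rate_per_sec` tokens per second."""
    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill for the time elapsed since the last check, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def gateway(bucket, handler):
    """Single entry point: reject with 429 before the request reaches any service."""
    if not bucket.allow():
        return 429, "Too Many Requests"
    return 200, handler()

now = [0.0]  # fake clock keeps the example deterministic
bucket = TokenBucket(rate_per_sec=2, burst=3, clock=lambda: now[0])

burst_statuses = [gateway(bucket, lambda: "ok")[0] for _ in range(4)]
now[0] = 1.0  # one second later, two tokens have been refilled
later_statuses = [gateway(bucket, lambda: "ok")[0] for _ in range(3)]
```

A real gateway would keep one bucket per client key (API key, user, or IP) rather than a single global bucket.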

2014 · identity

Identity that
travels without asking.

JWT Token Stateless Auth

✓any service verifies locally — no auth call

✓signed — tamper-evident

✗can't revoke before expiry without extra work

2014 · identity federation

Don't own
identity.
Delegate it.

OAuth2 OIDC SSO

✓never store or touch user credentials

✓one integration pattern — any provider

✓enterprise SSO with existing company accounts

2014 · containers

Works on
my machine.

→ Ship the machine.

Docker Container Reproducible Builds

✓identical env: local → staging → prod

✓no "works on my machine"

→Dockerfile = reproducible single source of truth

Why don't we
ship the machine?

2017

support ticket · 2017-03-14 · 09:41am
"I was charged, but I don't have premium."
and then another one. and five more.

2017 · distributed transactions

That support ticket
is the receipt.

Four services.
Four databases.
No shared undo button.

Partial Failure No Atomicity

✗payment committed · subscription never ran

✗no automatic rollback across services

→this is the fundamental problem of distributed txns

2017 · saga pattern

We can't roll back.
Can we at least
apologize?

Compensate.
Don't pretend
it didn't happen.

Saga Compensation Local Transactions

✓no shared transaction needed

→each step: forward action + compensating action

✗more code — every path must be planned
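
A minimal orchestrated-saga sketch for the "charged but no premium" ticket. Each step pairs a forward action with a compensating action; `run_saga` and the payment and subscription functions are hypothetical stand-ins for calls to separate services.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps in order.

    On failure, run the compensations of the completed steps in reverse order.
    Returns (ok, log).
    """
    completed, log = [], []
    for name, action, compensate in steps:
        try:
            action()
            log.append(f"{name}: done")
            completed.append((name, compensate))
        except Exception as exc:
            log.append(f"{name}: failed ({exc})")
            for prev_name, prev_compensate in reversed(completed):
                prev_compensate()
                log.append(f"{prev_name}: compensated")
            return False, log
    return True, log

state = {"charged": False, "premium": False}

def charge():
    state["charged"] = True

def refund():
    state["charged"] = False

def grant_premium():
    raise RuntimeError("subscription service down")

def revoke_premium():
    state["premium"] = False

ok, log = run_saga([
    ("charge", charge, refund),
    ("grant_premium", grant_premium, revoke_premium),
])
```

The charge really happened and is then refunded, not erased. A production saga would also persist its progress so that a crash in the middle of compensation can be resumed.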

2017 · outbox pattern

The payment went in.
The event
was in the air.

If it's not
in the database,
it didn't happen.

Outbox Pattern At-least-once Idempotency

✓event always published — even if Kafka was down

→at-least-once: event may arrive twice

→idempotency: duplicate = safe to ignore
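
The outbox idea can be sketched with SQLite standing in for the payment service's database and plain functions standing in for the Kafka producer and consumer. The table layout, `relay`, and `on_payment` are assumptions for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE payments (id INTEGER PRIMARY KEY, user_id INTEGER, amount_cents INTEGER);
CREATE TABLE outbox (id INTEGER PRIMARY KEY, event_id TEXT UNIQUE, payload TEXT,
                     published INTEGER DEFAULT 0);
""")

def record_payment(user_id, amount_cents):
    """Write the payment and its event in one local transaction."""
    with conn:
        cur = conn.execute(
            "INSERT INTO payments (user_id, amount_cents) VALUES (?, ?)",
            (user_id, amount_cents),
        )
        event_id = f"payment-{cur.lastrowid}"
        conn.execute(
            "INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
            (event_id, json.dumps({"user_id": user_id})),
        )
    return event_id

def relay(publish):
    """Publish pending events; mark a row published only after publish() succeeds."""
    pending = conn.execute(
        "SELECT id, event_id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_id, payload in pending:
        publish(event_id, payload)  # if this raises, the row stays pending
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))

seen_events, grants = set(), []

def on_payment(event_id, payload):
    """Idempotent consumer: a duplicate event is safe to ignore."""
    if event_id in seen_events:
        return
    seen_events.add(event_id)
    grants.append(json.loads(payload)["user_id"])

def broker_down(event_id, payload):
    raise ConnectionError("broker unavailable")

event_id = record_payment(user_id=7, amount_cents=999)
try:
    relay(broker_down)  # broker is down: nothing is lost, the event waits in the outbox
except ConnectionError:
    pass
relay(on_payment)  # broker is back: the event is delivered
on_payment(event_id, json.dumps({"user_id": 7}))  # simulated redelivery
```

If the relay crashes between `publish` and the `UPDATE`, the event is sent again on the next run. That is where at-least-once delivery comes from, and why the consumer deduplicates on `event_id`.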

2017 · resilience

One slow service.
Fifty threads.
Nobody moves.

Don't wait.
Cut the line.

Cascade Failure Circuit Breaker Backoff + Jitter Rate Limiting

✓sick service doesn't take down healthy ones

→exponential backoff + jitter on retry

→rate limit at gateway: X req/s → 429
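
A compact sketch of the first two bullets: a circuit breaker that fails fast once a dependency looks sick, and a retry delay with exponential backoff and full jitter. The thresholds, class names, and the injectable clock are assumptions, not settings from the talk.

```python
import random

class CircuitOpen(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""

class CircuitBreaker:
    def __init__(self, max_failures, reset_after, clock):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpen("failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result

def backoff_delay(attempt, base=0.1, cap=5.0, rng=random):
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0, min(cap, base * 2 ** attempt))

now = [0.0]
breaker = CircuitBreaker(max_failures=3, reset_after=10, clock=lambda: now[0])
calls = {"n": 0}

def sick_service():
    calls["n"] += 1
    raise TimeoutError("slow")

outcomes = []
for _ in range(5):
    try:
        breaker.call(sick_service)
    except CircuitOpen:
        outcomes.append("fast-fail")
    except TimeoutError:
        outcomes.append("timeout")
```

After three timeouts the breaker opens, so the last two callers are rejected immediately and never wait on the sick service. The jitter in `backoff_delay` spreads retries out so that clients do not all come back at the same moment.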

2004 — 2017 · one gossip platform

Thank you
for watching

<3 <3 <3

Ask any questions!

But I don't have any answers!

GET /api/v1/hope/health

🍜 Eating: 200

💻 Coding: 200

💔 Loving: 404

😴 Sleeping: 503

{ null } days of no incidents
