Distributed Systems · Published

A Raft KV store you can kill, partition, and watch recover

Raft Lab · 2026~9 min read

Open /raft-lab, click "kill" on the node with the star next to it, and within a couple of seconds you'll watch the other four nodes hold an election, agree on a new leader at a higher term, and keep committing writes. Revive the dead node and it catches up. Partition the cluster three-against-two and the minority stops making progress while the majority keeps going. It's a real, from-scratch implementation of the Raft consensus algorithm — leader election, log replication, and the §5.4 safety rules — running as five nodes in one process over a transport I can break at runtime. Everyone watching the page drives the same cluster.

The point of building this wasn't "leader election works." Almost any tutorial gets you that far. The point was the part of the paper that actually bites: making sure a committed entry is never lost across an election, even when nodes crash, logs diverge, and term races happen. That's the difference between "I read the Raft paper" and "I implemented the parts that hurt."

The problem (and why it's actually hard)

A replicated state machine has to stay consistent across nodes that crash, restart, and get cut off from each other. Raft makes that tractable by reducing the whole thing to two ideas: elect one leader, and have it maintain an append-only replicated log. Followers apply entries from the log in order, so if every node has the same log, every node has the same state.

The naive version falls apart at the seams. Two candidates can time out at the same instant and split the vote. A leader can accept a write, replicate it to a minority, and then crash before the entry is safe — if the wrong node wins the next election, that write can be silently overwritten. A follower that was partitioned away comes back with a log that diverges from the leader's at some index, and you have to repair it without corrupting committed history.

Raft's answer is a small set of invariants, and two of them are subtle enough that they're where most hand-rolled implementations are quietly wrong:

The up-to-date vote restriction (§5.4.1). A node only grants its vote to a candidate whose log is at least as up-to-date as its own. This is what guarantees a new leader already contains every committed entry — so an election can't lose data.
The current-term commit rule (§5.4.2). A leader may only advance the commit index by counting replicas of an entry from its own term. Counting replicas of an older entry directly can mark something committed that a later leader then overwrites. The fix is unintuitive and easy to skip.

If you get either of these wrong, the bug doesn't show up in the happy path. It shows up as a committed write vanishing after a leader change — exactly the failure consensus exists to prevent, and exactly the one a "leader election works" demo never exercises.

How it works

Each node is modeled as an actor: every state transition, RPC handler, and timer for that node runs on its own single-thread executor. That makes the consensus logic lock-free and strictly sequential — I never reason about two threads mutating one node's term or log at once, because there's only ever one. Nodes don't call each other directly; they hand messages to a Transport interface, and the transport decides whether (and when) a message arrives.

That indirection is the whole design. The in-memory transport delivers a message by enqueueing it onto the target node's thread — an asynchronous network you can make hostile.

roles: FOLLOWER ──timeout──▶ CANDIDATE ──majority votes──▶ LEADER ▲ │ │ └──── higher term ────┴──────── higher term ───────┘ (step down, reset election timer) n1(L) n2 n3 n4 n5 leader n1 is killed ✗ · · · · heartbeats stop arriving │ election timeout fires first ▼ n2 ── RequestVote(term+1) ──▶ n3,n4,n5 n2 ◀──── vote granted ─────── n3,n4 (log up-to-date check passes) n2 becomes LEADER @ higher term n2 ── AppendEntries(heartbeat) ──▶ all commits keep flowing n1 revived ── AppendEntries(prevLogIndex/Term) ──▶ n1 n1 log diverges → leader backs up nextIndex, retries → n1 catches up

Kill the leader → randomized timeout → RequestVote on a higher term → new leader → rejoined node catches up via AppendEntries

The replicated state machine on top is a tiny key/value store (PUT / DELETE). When an entry is committed — replicated to a majority and safe — each node applies it to its KvStateMachine in log order. Because the logs converge, the KV state converges. The visualizer reads the leader's view and streams a snapshot of every node's role, term, commit index, and log length over a WebSocket at /ws/raft roughly every 400ms, which is what you see updating live on the page.

The design decisions that mattered

From the paper, not a library

I implemented Raft directly from the paper rather than wrapping an existing library. That's a deliberate choice about what this artifact is supposed to demonstrate. Anyone can add a dependency that does consensus; the question a distributed-systems hire actually has to answer is whether you understand consensus — the term races, the log-divergence repair, the commit rule — well enough to debug it when it misbehaves at 3am.

So the implementation includes the parts that are easy to wave away: randomized election timeouts and term-driven step-down (§5.2); AppendEntries with the previous-log consistency check, follower log truncation on conflict, and nextIndex/matchIndex tracking with decrement-and-retry backoff (§5.3); and both §5.4 safety rules. The one piece of logic worth showing is the current-term commit rule, because it's the rule most likely to be silently missing:

java

// Advance commitIndex to the highest index replicated on a majority —
// but ONLY count entries from the leader's current term (§5.4.2).
private void advanceCommit() {
  for (long n = lastLogIndex(); n > commitIndex; n--) {
      if (termAt(n) != currentTerm) {
          continue; // never commit an older-term entry by replica count
      }
      int replicas = 1; // self
      for (String peer : peers) {
          if (matchIndex.getOrDefault(peer, 0L) >= n) replicas++;
      }
      if (replicas > clusterSize() / 2) {
          commitIndex = n;
          applyCommitted();
          return;
      }
  }
}

The termAt(n) != currentTerm guard is the entire point. Drop it and the demo still looks fine — right up until a specific election sequence resurrects a write you thought was gone.

A pluggable transport I can break at runtime

The transport is an interface, and the in-memory implementation is intentionally hostile. At runtime it can take a node down, partition the cluster into two sides that can't talk, inject per-message latency, and drop a configurable fraction of messages. Every failure you can trigger from the browser maps to one of those knobs.

I rejected building this over real gRPC or sockets. A real network would add operational noise — ports, serialization, connection lifecycle — that has nothing to do with the algorithm and would make failures non-deterministic and hard to reproduce. The breakability is the pedagogical payload here: the same Raft code runs unchanged over this transport, and the obvious next seam is a real-RPC Transport implementation. What I traded away is real network behavior (head-of-line blocking, partial sends, TCP semantics). For a lab whose job is to make consensus failures visible and reproducible, that's the right trade.

Five nodes in one process

The whole cluster runs in a single JVM. That means zero infrastructure: no Postgres, no Redis, no message broker, no extra hosts. The live demo runs on one free instance, and the cluster boots with the backend.

The honest cost: this proves the algorithm, not a production network deployment. There's no real partition tolerance across machines, no clock skew, no serialization. I'm not claiming otherwise — it's a correctness demonstration of the consensus logic, deliberately decoupled from the deployment topology. The flip side is that every reader gets a reproducible, breakable cluster on a free tier, which a multi-host deployment could never offer as a public demo.

Does it actually work?

The invariants are pinned by deterministic in-process tests that run on fast timings — no sleep-and-hope, just polling until a condition holds or a timeout fires. The Raft package contributes nine of them, each mapped to a specific claim:

Election: exactly one leader is elected, and the cluster re-elects after the leader is killed.
Replication: a committed entry replicates and applies on every node; committed writes survive a leader change (the §5.4 safety claim).
Recovery: an isolated follower catches up after it rejoins; durable term/vote/log state is restored on construction (simulated restart).
Chaos: the majority keeps making progress during a partition while the minority can't, and the cluster still converges over a lossy link.

These are part of the backend's green suite on JDK 21 — 43 of 43 passing. In the browser, the behavior matches the tests: kill the leader and a new one appears at a higher term; revive the dead node and its commit index climbs to match; write a key and watch it commit to all five nodes; partition three-against-two and watch the two-node side freeze.

What these prove: the consensus logic holds its safety and liveness invariants under crash, partition, restart, and loss. What they don't prove: anything about a real multi-machine network, byzantine faults, or behavior at log sizes large enough to need compaction. The tests are deterministic precisely because the network is simulated — that's a feature for verification and a limit on what the verification covers.

What I'd do differently / what's next

The most obvious gap is snapshotting and log compaction (§7). Right now the replicated log grows unbounded — fine for a demo cluster, unacceptable for anything long-lived. A node that's been down for a while should be caught up with a snapshot, not a replay of the entire log.

After that, membership changes via joint consensus — adding and removing nodes safely while the cluster keeps serving — is the next real piece of the paper I haven't built. And the cleanest extension is a real-RPC transport: because the network is already behind an interface, swapping the in-memory bus for gRPC over actual sockets is a contained change that would turn this from an algorithm demo into a small distributed deployment.

One thing I'd flag as deliberately out of scope rather than missing: /raft-lab is a single shared cluster that every visitor drives, not a per-session sandbox. Anyone can reset it. That's the cost of a genuinely live demo, and I'd make the same call again.

Try it

Break it yourself at /raft-lab — kill the leader, partition the network, add latency or loss, and watch it recover. The code is on GitHub.