🧱 Engineering Brick: The Fault-Tolerant Exchange
🌸 The core may crash, the power may fade, But the Log remembers every trade.
Welcome back to the Stock Exchange Core series. In Part 1, we built the Order Book in RAM. In Part 2, we confined the Matching Engine to a single CPU core using the LMAX Disruptor.
But a fatal question remains: If the entire system lives in volatile RAM, what happens when the server loses power? Today, we zoom out to the Macro Network Level to build a system that is blindingly fast, yet practically immortal.
🌠 1. The Formal Specification (Problem Model)
We must ensure High Availability (HA) and Disaster Recovery (DR) for our in-memory Matching Engine.
The Constraints (SLA):
- Zero Data Loss: Once an order is acknowledged to the trader, it must never be lost.
- Failover Latency: If the Primary Engine crashes, the Backup Engine must take over in $< 1$ millisecond.
- The “No-Database” Rule: Writing to a traditional database (PostgreSQL, MongoDB) requires Disk I/O or network round-trips taking $1 - 5$ milliseconds. Our latency budget is $< 50 \mu s$. The engine must never wait for a disk.
⚡ 2. The Design Dialogue (Socratic Review)
I simulate an architectural review with a Database Expert (The Challenger) to break down the Event Sourcing mindset.
🕵️ The Challenger: To ensure we don’t lose trades during a crash, we need to save the Order Book to a clustered database using a Two-Phase Commit (2PC).
🧑‍💻 The Architect: Two-Phase Commits and distributed locks are the enemies of low latency. If our Single-Threaded Matching Engine pauses to wait for a database to acknowledge a write on a hard drive, our throughput drops from 100,000 TPS to 500 TPS.
🕵️ The Challenger: If we don’t save the state to a database, the state disappears when the RAM loses power. How do we recover?
🧑‍💻 The Architect: We don’t save the State (The Order Book); we save the Inputs (The Orders). Our Matching Engine is a strictly Deterministic State Machine. If we feed the exact same sequence of orders into an empty engine, we will deterministically arrive at the exact same Order Book state. This is called Event Sourcing.
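A minimal sketch of this determinism claim, using a hypothetical toy engine (not the series' actual order book): because applying events involves no randomness and no wall-clock reads, replaying the same input sequence into an empty engine always reproduces the same state.

```python
from dataclasses import dataclass, field

@dataclass
class ToyEngine:
    # price -> resting quantity for a single instrument (toy model)
    bids: dict = field(default_factory=dict)

    def apply(self, order):
        """Apply one input event; no randomness, no clock reads."""
        kind, price, qty = order
        if kind == "BUY":
            self.bids[price] = self.bids.get(price, 0) + qty
        elif kind == "CANCEL":
            self.bids.pop(price, None)

def replay(events):
    """Feed a sequence of inputs into a fresh, empty engine."""
    engine = ToyEngine()
    for e in events:
        engine.apply(e)
    return engine.bids

events = [("BUY", 100, 5), ("BUY", 101, 3), ("CANCEL", 100, 0)]
# Determinism: identical inputs always yield identical state.
assert replay(events) == replay(events)
```

This is exactly why we can persist the inputs instead of the state: the state is merely a cached function of the input log.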
🕵️ The Challenger: So we just log the inputs? What if two identical backup engines receive the inputs from the network in a slightly different order due to network jitter?
🧑‍💻 The Architect: That is the fatal flaw of distributed systems. If Engine A sees [BUY, CANCEL] and Engine B sees [CANCEL, BUY], their states will diverge. To fix this, we introduce a strict gatekeeper: The Sequencer.
🧩 3. The Architecture: Sequencer & UDP Multicast
To guarantee global order, no client connects directly to the Matching Engine. They connect to Gateways, which forward orders to the Sequencer.
- The Sequencer: A highly optimized, lightweight component. Its only job is to receive an order, stamp it with a monotonically increasing Sequence ID (e.g., #1001, #1002), and broadcast it.
- UDP Multicast: Instead of sending point-to-point TCP messages, the Sequencer uses UDP Multicast to blast the sequenced order to the network switch.
- The Replicated State Machine (RSM): The Primary Engine, the Secondary (Backup) Engine, and the Logger all listen to this exact same UDP multicast stream. They all process #1001 before #1002, guaranteeing perfectly synchronized parallel universes.
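The Sequencer's entire job fits in a few lines. Here is a hedged sketch (the `broadcast` callback stands in for the real UDP multicast publish; class and method names are illustrative, not from the series):

```python
import itertools

class Sequencer:
    """Stamps every inbound order with a monotonically increasing ID,
    then hands it to a broadcast callback (UDP multicast in production,
    a plain function here)."""
    def __init__(self, broadcast):
        self._next_id = itertools.count(1001)  # illustrative starting ID
        self._broadcast = broadcast

    def submit(self, raw_order):
        seq_id = next(self._next_id)
        self._broadcast((seq_id, raw_order))
        return seq_id

received = []
seq = Sequencer(received.append)
seq.submit({"side": "BUY", "px": 100})
seq.submit({"side": "CANCEL", "px": 100})
# Every listener observes the same total order: #1001 before #1002.
assert [m[0] for m in received] == [1001, 1002]
```

Because every replica consumes the same stamped stream, the ordering ambiguity the Challenger raised simply cannot occur.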
The Macro Architecture Diagram

```mermaid
flowchart TD
    subgraph "The Entry Layer"
        GW["Network Gateways"]
        SEQ["The Sequencer<br/>(Stamps Sequence ID)"]
    end
    subgraph "The UDP Multicast Bus (Aeron / Reliable UDP)"
        BUS(("Multicast Stream"))
    end
    subgraph "The Execution Layer (Deterministic State Machines)"
        ME1["Primary Matching Engine<br/>(Active - Emits Trades)"]
        ME2["Secondary Matching Engine<br/>(Passive - Mutes Output)"]
    end
    subgraph "The Persistence Layer"
        LOG["The Logger<br/>(Appends to WAL)"]
        DB[("End-of-Day<br/>Data Warehouse")]
    end
    %% Connections
    GW -->|"Raw Orders"| SEQ
    SEQ -->|"Sequenced Orders"| BUS
    BUS -->|"#1001, #1002"| ME1
    BUS -->|"#1001, #1002"| ME2
    BUS -->|"#1001, #1002"| LOG
    LOG -.->|"Asynchronous Sync"| DB
```
🔄 4. Lifecycle Walkthrough: Crash & Recovery
Let’s trace how this architecture handles a catastrophic failure.
T0: Normal Operation
- The Sequencer multicasts Order #1001.
- Primary Engine receives #1001, matches it, updates its RAM, and sends the trade confirmation back to the Gateway.
- Secondary Engine receives #1001, matches it, updates its RAM, but drops the output. It acts as a hot-standby shadow.
- The Logger receives #1001 and appends it to a Write-Ahead Log (WAL) on a fast NVMe SSD using memory-mapped files (mmap).
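The Logger's append path can be sketched with length-prefixed records. Production loggers typically mmap a pre-allocated file for speed; this hedged sketch uses a plain append plus fsync, which expresses the same durability contract (the record hits stable storage before anyone relies on it). The record layout here is an assumption, not the series' actual format:

```python
import os, struct, tempfile

def wal_append(f, seq_id, payload: bytes):
    """Append one record: [seq_id:u64][len:u32][payload]."""
    f.write(struct.pack("<QI", seq_id, len(payload)) + payload)
    f.flush()
    os.fsync(f.fileno())  # durable before we consider it logged

def wal_read(path):
    """Scan the log back into (seq_id, payload) records."""
    records = []
    with open(path, "rb") as f:
        while header := f.read(12):
            seq_id, n = struct.unpack("<QI", header)
            records.append((seq_id, f.read(n)))
    return records

path = os.path.join(tempfile.mkdtemp(), "orders.wal")
with open(path, "ab") as f:
    wal_append(f, 1001, b"BUY 100@5")
    wal_append(f, 1002, b"CANCEL 100")
assert [r[0] for r in wal_read(path)] == [1001, 1002]
```

Note that the Matching Engine never waits on this path; the Logger runs as an independent listener on the multicast stream.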
T1: The Crash (Primary Engine Dies)
- The motherboard of the Primary Engine catches fire.
- A cluster manager (e.g., ZooKeeper/Raft heartbeat) detects the failure.
- Within microseconds, the Secondary Engine is promoted to Primary. Because it has been processing the exact same multicast stream, its RAM state is 100% identical to that of the dead engine. It simply stops muting its outputs and takes over. Traders notice zero downtime.
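The promotion mechanics above reduce to flipping one flag, because state is already synchronized. A hedged sketch (names are illustrative):

```python
class ReplicatedEngine:
    """Both replicas run identical logic; only the active one emits."""
    def __init__(self, emit, primary=False):
        self.primary = primary
        self._emit = emit
        self.state = []

    def on_order(self, seq_id, order):
        self.state.append((seq_id, order))  # identical state mutation everywhere
        if self.primary:                    # passive replica mutes its output
            self._emit(f"FILL #{seq_id}")

    def promote(self):
        """Failover: flip one flag; RAM is already up to date."""
        self.primary = True

sent = []
backup = ReplicatedEngine(sent.append, primary=False)
backup.on_order(1001, "BUY")
assert sent == []               # shadow mode: state updated, output suppressed
backup.promote()
backup.on_order(1002, "SELL")
assert sent == ["FILL #1002"]   # now active: outputs flow
```

No state transfer, no replay, no handshake: that is why failover fits inside the sub-millisecond budget.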
T2: Total Cluster Failure (Disaster Recovery)
- A power outage takes down both engines.
- When power returns, we boot up a fresh, empty Matching Engine.
- The engine reads the Snapshot from midnight (a compressed dump of the Order Book).
- The engine then reads the Write-Ahead Log (WAL) from the Logger, fast-forwarding and replaying every order from #1 to #1001.
- Within seconds, the RAM is perfectly rebuilt.
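The recovery steps above can be sketched as one function: load the snapshot, then replay only the WAL records newer than the snapshot's sequence number. This is an illustrative toy (the `apply` rules model a trivial single-instrument book, not the real engine):

```python
def recover(snapshot_state, snapshot_seq, wal_records, apply):
    """Cold start: load last snapshot, then replay every WAL record
    newer than the snapshot, in sequence order."""
    state = dict(snapshot_state)
    for seq_id, order in wal_records:
        if seq_id > snapshot_seq:   # skip orders already inside the snapshot
            apply(state, order)
    return state

def apply(state, order):
    side, px, qty = order
    if side == "BUY":
        state[px] = state.get(px, 0) + qty
    elif side == "CANCEL":
        state.pop(px, None)

snapshot = {100: 5}  # midnight dump, taken after sequence #1
wal = [(1, ("BUY", 100, 5)), (2, ("BUY", 101, 3)), (3, ("CANCEL", 100, 0))]
rebuilt = recover(snapshot, 1, wal, apply)
assert rebuilt == {101: 3}  # identical to the pre-crash book
```

Determinism does the heavy lifting: the rebuilt state is bit-for-bit the pre-crash state, because it is the same function of the same inputs.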
⚗️ 5. The Architect’s Crucible: Real-World Edge Cases
A mid-level engineer stops at the “Happy Path”; a Principal/Staff Architect must address what happens when the physics of the system break down.
1. Backpressure & Flow Control (The Burst) What happens if the market goes crazy and the incoming UDP burst exceeds the Matching Engine’s capacity? We cannot block the Sequencer (that would freeze the whole exchange).
- The Strategy: The Matching Engine’s Ring Buffer absorbs micro-bursts. If the buffer hits 90% capacity, we must shed load at the edge. The Network Gateways detect the backpressure and immediately respond with HTTP 429 (Too Many Requests) or FIX Reject messages. We drop packets at the gates to protect the core.
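A hedged sketch of that edge-level shedding, using a deque as a stand-in for the engine's ring buffer (the 90% watermark and response strings follow the text; the class itself is illustrative):

```python
from collections import deque

class Gateway:
    """Sheds load at the edge when the core's ring buffer runs hot."""
    def __init__(self, ring_capacity=8):
        self.ring = deque(maxlen=ring_capacity)
        self.capacity = ring_capacity

    def submit(self, order):
        # Reject before the core saturates (90% watermark from the text).
        if len(self.ring) >= 0.9 * self.capacity:
            return "429 Too Many Requests"  # or a FIX Reject message
        self.ring.append(order)
        return "ACK"

gw = Gateway(ring_capacity=10)
responses = [gw.submit(i) for i in range(10)]
assert responses[:9] == ["ACK"] * 9
assert responses[9] == "429 Too Many Requests"
```

The key design choice: rejection is cheap and explicit at the Gateway, while blocking inside the Sequencer or Engine would stall every participant in the market.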
2. Idempotency vs. UDP Duplicates (Exactly-Once Semantics) UDP Multicast is fast but inherently unreliable. To prevent packet loss, we use Aeron (Reliable UDP) which handles retransmissions. But what if Aeron replays a packet? We cannot execute a BUY order twice.
- The Strategy: The Matching Engine maintains strict Idempotency. It caches the Last_Processed_Sequence_ID. If the engine receives Sequence #1002 but its internal counter is already at #1002 or beyond, it treats the packet as a duplicate and silently discards it. At-least-once delivery + Idempotent receiver = Exactly-Once Semantics.
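The idempotent receiver is a one-comparison guard. A minimal sketch (class name is illustrative; the sequence-ID check mirrors the strategy above):

```python
class IdempotentEngine:
    """Discards any sequence ID at or below the last one processed.
    At-least-once delivery + this guard = exactly-once processing."""
    def __init__(self):
        self.last_processed_seq_id = 0
        self.executed = []

    def on_packet(self, seq_id, order):
        if seq_id <= self.last_processed_seq_id:
            return False                    # duplicate replay: drop silently
        self.executed.append(order)         # side effect runs exactly once
        self.last_processed_seq_id = seq_id
        return True

eng = IdempotentEngine()
assert eng.on_packet(1001, "BUY") is True
assert eng.on_packet(1002, "SELL") is True
assert eng.on_packet(1002, "SELL") is False  # Aeron retransmit, ignored
assert eng.executed == ["BUY", "SELL"]       # each order executed once
```

Because the Sequencer guarantees gap-free, monotonically increasing IDs, a single integer comparison is sufficient; no deduplication cache of past messages is needed.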
3. The Multi-Datacenter Illusion (The Speed of Light) Can we run the Primary Engine in New York and the Secondary Engine in Chicago for ultimate disaster recovery?
- The Physics: The speed of light in fiber puts a hard floor on round-trip times between NY and Chicago of roughly 16 milliseconds over real routes. If we require the Chicago backup to acknowledge every order before NY processes it, our latency skyrockets.
- The Reality: Elite trading systems are strictly Active-Passive across regions. We accept asynchronous replication to the secondary data center. If New York gets wiped out by a hurricane, we accept a few milliseconds of data loss to restore the Chicago cluster. You cannot defeat Einstein.
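A back-of-envelope check on that figure. The distance and fiber speed below are rough assumptions (great-circle NY to Chicago; light in fiber travels at roughly two-thirds of its vacuum speed):

```python
DISTANCE_KM = 1_150      # straight-line NY -> Chicago, approximate
C_FIBER_KM_S = 200_000   # light in fiber: ~2/3 of c in vacuum

one_way_ms = DISTANCE_KM / C_FIBER_KM_S * 1000
round_trip_ms = 2 * one_way_ms
# Even an impossible straight-line fiber run costs ~11.5 ms round trip;
# real routes are longer, hence the ~16 ms observed in practice.
assert round_trip_ms > 11
# A synchronous cross-region acknowledgement would exceed the
# 50 microsecond latency budget by more than two orders of magnitude.
```

The arithmetic makes the conclusion unavoidable: cross-region replication must be asynchronous, whatever your availability goals are.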
🗝 The “Brick” Summary (Mental Model)
- 🌠 Signal: The need for zero-data-loss durability without the latency penalty of Disk I/O or Databases.
- 🧩 Structure: Sequencer + UDP Multicast + Replicated State Machines (Primary/Secondary) + Append-Only WAL.
- 🏛 Invariant: Determinism. f(State_N, Sequence_N+1) -> State_N+1. Identical inputs always produce identical state.
- 💠 Pivot Insight: Decouple Execution from Persistence. The Engine does math in RAM; a completely separate Logger component handles the Disk I/O.
🪷 One sentence to trigger the reflex: “The Sequencer stamps the time, Multicast spreads the rhyme, the Logger writes it down, and the Engine wears the crown.”
Next up: In the Grand Finale, Part 4, we will tear this entire architecture down. We will discuss the ultimate trade-offs, when NOT to build this system, and the alternative architectures used by modern Tech Giants.
📚 Series: Stock Exchange Core
- Stock Exchange Core (1/4): Anatomy of a Microsecond Order Book
- Stock Exchange Core (2/4): The Single-Threaded Matching Engine
- Stock Exchange Core (3/4): High Availability & Deterministic Fault Tolerance (You are here)
- Stock Exchange Core (4/4): Trade-offs, Scaling Limits & Alternative Architectures