Matching Engine Determinism: Why Replay Fidelity Is the Real Test of Exchange Architecture

Ariel Silahian

Ariel Silahian is a senior technology executive in institutional electronic trading, with 30+ years across the buy and sell side (New York, Miami, London, Hong Kong). He is the author of "C++ High Performance for Financial Systems" (Packt) and the creator of VisualHFT, the open-source microstructure analytics stack. He writes on exchange architecture, market microstructure, and execution quality, and advises a select number of trading firms on infrastructure decisions that move P&L. Talk architecture: https://hftadvisory.com

The Test That Reveals Everything
Why “It Works” and “It’s Deterministic” Are Different Statements
The Five Architecture Decisions That Determine Replay Fidelity
The Gemini Case: What a 10-Hour Recovery Actually Means
The Regulatory Dimension: CAT, Reg SCI, and MiFID II RTS 25
Determinism at the Ingestion Boundary
Does Determinism Require a Single Thread? The Sharded-Engine Question
The Journal-Replay Diagnostic: A Practical Framework
Conclusion

The Test That Reveals Everything

There is one test that immediately separates a deterministic sequencer from one that merely behaves correctly under normal load. Take yesterday’s journal. Replay it into an empty order book. If the resulting book state, queue positions included, is not identical to the state you had at market close, your matching engine was never deterministic.

Most teams never run that test until recovery forces it on them.

A matching engine can pass every functional test your QA team designs and still carry non-determinism that only surfaces under restart or failover. The functional test suite verifies that orders match correctly, that prices are right, that fills are attributed to the correct accounts. It does not verify that the engine’s internal ordering decisions are stable across time, across restarts, or across replicas. Those are different questions, and the second set is harder.

The gap between “it works” and “it’s deterministic” has cost firms significantly in recovery time, regulatory standing, and architectural complexity. I have reviewed this failure mode across venue types, and the pattern is consistent: teams discover the problem at the worst possible moment, during a live incident, when the pressure to get back online is highest and the margin for careful reconciliation work is smallest.

Why “It Works” and “It’s Deterministic” Are Different Statements

Functional correctness and replay equivalence are orthogonal properties.

A matching engine is functionally correct if it produces valid matches: the right price, the right quantity, the right counterparties, in priority order. Most engines that reach production are functionally correct. A matching engine is deterministic if, given the same ordered input stream, it produces exactly the same output stream every time, including queue positions, timestamps assigned, and internal state transitions.

The distinction matters because production state is the result of thousands of edge cases, partial fills, amendments, and latency asymmetries over a full trading day. If you can only reproduce “a valid book” and not “the exact book,” you have an engine that agrees with itself about prices but may disagree about queue position, adverse selection attribution, or the precise causal sequence of events. That disagreement is invisible during normal operation and catastrophic during recovery.

The replay equivalence test is the only honest measure, and recovery forces it on you whether you planned for it or not. When the engine goes down and comes back up, the team is implicitly running a replay. If that replay does not produce identical state, the reconciliation work begins: which version of state is authoritative, which fills happened in which order, which orders were live at the point of failure. That work is expensive, and the Gemini December 2024 outage provides a specific, documented cost.

The Five Architecture Decisions That Determine Replay Fidelity

These are not design preferences. Each one has a specific failure consequence when violated.

1. Single-Threaded Processing Within the Matching Loop

The matching loop must be single-threaded. LMAX runs 6 million orders per second on a single thread. At that throughput, the single-thread constraint is not a performance concession; it is what makes determinism achievable.

Multi-threaded matching loops introduce ordering indeterminacy because the OS thread scheduler does not produce a stable ordering of operations across runs. Two threads both processing order arrivals will interleave differently depending on scheduler state, cache state, and system load. The resulting book state is a function of the OS as much as the input stream. You cannot replay it.

Teams that add parallelism inside the matching loop to push throughput eventually find recovery is non-reproducible. The book comes back slightly differently each time, because the thread interleaving at restart never matches the original run.
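A minimal sketch of why interleaving matters, assuming a toy book that only tracks time priority at each price. The function name `apply_in_order` and the two-gateway setup are illustrative, not taken from any real engine. Both runs produce a valid book; they do not produce the same book.

```python
from collections import deque

def apply_in_order(events):
    """Single-threaded loop: apply one event at a time, in the order given."""
    queue_at_price = {}  # price -> FIFO of order ids (time priority)
    for order_id, price in events:
        queue_at_price.setdefault(price, deque()).append(order_id)
    return {price: list(q) for price, q in queue_at_price.items()}

# Two gateways each deliver one order at the same price.
from_gateway_a = [("A1", 100)]
from_gateway_b = [("B1", 100)]

# Two interleavings a scheduler could produce if both were applied concurrently.
run_1 = apply_in_order(from_gateway_a + from_gateway_b)
run_2 = apply_in_order(from_gateway_b + from_gateway_a)

print(run_1)  # {100: ['A1', 'B1']}
print(run_2)  # {100: ['B1', 'A1']}
```

Same orders, same prices, different queue position. With one thread consuming one ordered stream, only one of these outcomes is possible, and it is the same one on every replay.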

2. Logical Sequence Numbers Over Wall-Clock Timestamps

The sequencer’s integer sequence number is the ordering authority. Wall-clock time is not.

NTP achieves millisecond-range accuracy on public networks, and sub-millisecond to tens-of-microseconds jitter on a LAN. Even the best-case figure is too coarse to serve as an ordering tiebreaker for events that arrive microseconds apart. When two orders arrive closer together than the clock’s error, their timestamps can come out tied or reversed, so the timestamp cannot serve as a stable ordering reference.

The correct design: every event that enters the matching engine carries a monotonically increasing sequence number assigned by the sequencer. That number is the ordering. If you need to reconstruct the book, you sort by sequence number, not by timestamp. The timestamps are useful for regulatory reporting and latency analysis, but they are not the causal chain. The sequence number is.
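As a rough illustration, here is a sketch of a sequencer that stamps each event with a monotonic integer and carries the wall-clock reading along only as data. The class and field names (`Sequencer`, `SequencedEvent`, `wall_clock_ns`) and the timestamp values are assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import count

@dataclass(frozen=True)
class SequencedEvent:
    seq: int            # ordering authority
    wall_clock_ns: int  # kept for reporting and latency analysis only
    payload: str

class Sequencer:
    def __init__(self):
        self._next = count(1)  # monotonically increasing integers

    def stamp(self, payload, wall_clock_ns):
        return SequencedEvent(next(self._next), wall_clock_ns, payload)

sequencer = Sequencer()
# Hypothetical jitter: the clock reads earlier for the second arrival.
events = [
    sequencer.stamp("order-1", wall_clock_ns=1_000_000_500),
    sequencer.stamp("order-2", wall_clock_ns=1_000_000_200),
]

by_seq = [e.payload for e in sorted(events, key=lambda e: e.seq)]
by_clock = [e.payload for e in sorted(events, key=lambda e: e.wall_clock_ns)]

print(by_seq)    # ['order-1', 'order-2']
print(by_clock)  # ['order-2', 'order-1']
```

Sorting by sequence number recovers the order the sequencer actually saw. Sorting by timestamp recovers whatever the clock happened to say.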

MiFID II RTS 25 requires a maximum 100 microsecond divergence from UTC for the HFT activity class. The standard sets an accuracy target rather than naming a protocol, but in practice meeting that target means PTP rather than NTP. That accuracy is for reporting, not for ordering. A compliant exchange timestamps with PTP-disciplined clocks for external reporting while keeping monotonic sequence numbers as the internal ordering authority.

Teams that use wall-clock timestamps as the ordering authority discover that replays produce different results depending on clock state at replay time versus production time. The book reconstructs differently on different hardware, under different system loads, or simply at a different time of day.

3. Gap Detection and Recovery for Market Data

If your engine relies on market data from upstream feeds, the determinism of the book depends on the completeness of the feed. Gap detection must be built into the protocol layer.

MoldUDP64 detects gaps by comparing the received packet’s sequence number against the client’s expected next sequence number. On a mismatch, the client requests a retransmit over a separate recovery session. This comparison is exact and fast, and it is the reason MoldUDP64 is widely used for exchange data distribution: gaps are structurally unambiguous, not probabilistic.
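The comparison described above can be sketched in a few lines. This assumes the MoldUDP64 downstream header layout (10-byte session, 8-byte sequence number, 2-byte message count, big-endian) and leaves out message-block parsing, buffering, and the retransmit request itself. `GapDetector` and `make_packet` are hypothetical helper names for the example.

```python
import struct

# Downstream packet header: session (10 bytes), sequence number (8 bytes),
# message count (2 bytes), all big-endian.
HEADER = struct.Struct(">10sQH")
END_OF_SESSION = 0xFFFF

class GapDetector:
    def __init__(self, first_expected=1):
        self.expected = first_expected

    def on_packet(self, packet):
        """Return (first_missing, missing_count) on a gap, otherwise None."""
        _session, seq, message_count = HEADER.unpack_from(packet)
        if seq > self.expected:
            # Exact integer mismatch: messages [expected, seq) never arrived.
            # `expected` stays put until a retransmit fills the hole.
            return (self.expected, seq - self.expected)
        if message_count != END_OF_SESSION:
            # Heartbeats carry a count of 0, so they leave `expected` unchanged.
            self.expected = max(self.expected, seq + message_count)
        return None

def make_packet(seq, message_count):
    # Header only; a real packet is followed by the message blocks.
    return HEADER.pack(b"SESSION001", seq, message_count)

detector = GapDetector()
results = [
    detector.on_packet(make_packet(1, 3)),  # messages 1..3
    detector.on_packet(make_packet(4, 2)),  # messages 4..5
    detector.on_packet(make_packet(9, 1)),  # messages 6..8 are missing
]
print(results)  # [None, None, (6, 3)]
```

The gap is reported as an exact range, which is what lets the client ask for precisely the missing messages.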

Per the CME MDP 3.0 specification, the maximum number of packets that can be requested in one resend request message is 2,000. That ceiling matters for recovery planning: a gap wider than 2,000 packets requires multiple recovery cycles, each with its own round-trip latency. For an engine under load, a large gap in the underlying feed can translate directly into a stale or incomplete book at the point of failure.
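To make the recovery-planning arithmetic concrete, the sketch below splits a gap into resend requests under the 2,000-packet ceiling cited above. The function name and the example gap are assumptions for illustration; the request format itself is not modeled.

```python
MAX_PACKETS_PER_REQUEST = 2_000  # ceiling cited from the CME MDP 3.0 spec

def resend_requests(first_missing, last_missing, limit=MAX_PACKETS_PER_REQUEST):
    """Split an inclusive gap [first_missing, last_missing] into request ranges."""
    requests = []
    start = first_missing
    while start <= last_missing:
        end = min(start + limit - 1, last_missing)
        requests.append((start, end))
        start = end + 1
    return requests

# Hypothetical gap of 5,000 packets.
plan = resend_requests(10_001, 15_000)
print(len(plan))  # 3
print(plan)       # [(10001, 12000), (12001, 14000), (14001, 15000)]
```

Each entry in `plan` is one recovery cycle, so the time to close the gap scales with the number of requests times the round trip, not just with the gap size.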

Engines without correct gap detection silently accept incomplete data, and the missing state is never flagged. A journal replay is only as complete as the input it captured, which cannot be guaranteed without gap detection.

4. Journal-Plus-Snapshot Architecture

The journal is the append-only log of every event processed by the engine. The snapshot is a point-in-time serialization of engine state, taken at intervals, that allows recovery without replaying the entire journal from genesis.

LMAX takes nightly snapshots. A full restart, including loading the recent snapshot and replaying a day’s worth of journals, completes in under a minute. The engine’s state is entirely derivable from processing the input events, which means the journal is authoritative and the snapshot is an optimization that reduces recovery time without sacrificing correctness.

The key property of the journal: it must be append-only, and it must be complete. Any write that goes to the engine state but not to the journal creates a divergence between what the journal says happened and what actually happened. That divergence will surface in replay.

Engines without persistent journals cannot replay. Engines with incomplete journals will replay to a state that differs from production by exactly the set of unlogged transitions. The book comes back, but with a different shape at the edges.
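A sketch of the property that makes the snapshot safe, assuming a deliberately tiny state (resting quantity per order id) and a journal of `(seq, kind, order_id, qty)` tuples. All names here are illustrative. Recovery from a snapshot plus the journal tail has to land on the same state as a replay from genesis.

```python
import copy

def apply(state, event):
    """Deterministic transition: the state depends only on the event sequence."""
    seq, kind, order_id, qty = event
    if kind == "new":
        state["orders"][order_id] = qty
    elif kind == "cancel":
        state["orders"].pop(order_id, None)
    state["last_seq"] = seq
    return state

def replay(journal, snapshot=None):
    """Rebuild state from genesis, or from a snapshot plus the journal tail."""
    if snapshot is None:
        state = {"orders": {}, "last_seq": 0}
    else:
        state = copy.deepcopy(snapshot)  # never mutate the stored snapshot
    for event in journal:
        if event[0] > state["last_seq"]:  # skip what the snapshot already covers
            apply(state, event)
    return state

journal = [
    (1, "new", "A", 100),
    (2, "new", "B", 50),
    (3, "cancel", "A", 0),
    (4, "new", "C", 75),
]

snapshot_at_2 = replay(journal[:2])
from_genesis = replay(journal)
from_snapshot = replay(journal, snapshot=snapshot_at_2)

print(from_genesis == from_snapshot)  # True
```

If any transition had bypassed the journal, `from_genesis` and the production state would differ by exactly that transition, and the snapshot would hide the problem only until the next full replay.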

5. Idempotent Recovery

Idempotent recovery means that running the same replay twice produces the same result both times. This sounds obvious, but it is easy to violate.

Common violations: recovery code that re-timestamps events from the current clock rather than the journal, re-seeds a random number generator in the path, or reads external state (current market data, current risk parameters) instead of the journaled state at the time of the original event.
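The re-timestamping violation is easy to show in miniature. In this sketch the `clock` parameter and the counter-based `advancing_clock` are assumptions standing in for "the current time has moved on since the last run"; they are not a real clock API.

```python
from itertools import count

journal = [
    {"seq": 1, "ts_ns": 1_700_000_000_000_000_001, "order_id": "A"},
    {"seq": 2, "ts_ns": 1_700_000_000_000_000_002, "order_id": "B"},
]

def replay(journal, clock=None):
    """clock=None uses the journaled timestamp; passing a clock models the bug."""
    out = []
    for event in journal:
        ts = event["ts_ns"] if clock is None else clock()
        out.append((event["seq"], event["order_id"], ts))
    return out

# Stand-in for a wall clock: every read returns a later value than the last.
ticks = count(1_800_000_000_000_000_000)

def advancing_clock():
    return next(ticks)

idempotent = replay(journal) == replay(journal)
drifting = replay(journal, advancing_clock) == replay(journal, advancing_clock)

print(idempotent)  # True
print(drifting)    # False
```

The same shape applies to the other violations listed: anything read from outside the journal during replay plays the role of `clock` here.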

A clean replay will reproduce the same adverse selection rate per timestamp. If the adverse selection rate differs between the original run and the replay, the difference is attributable to ordering, not to market conditions. That is a diagnostic signal: it tells you exactly where the non-determinism is introduced.

The Gemini Case: What a 10-Hour Recovery Actually Means

On December 10, 2024, Gemini experienced a service disruption that resulted in an exchange outage for ten hours. The root cause was a three-node messaging infrastructure failure during a matching-engine upgrade. The recovery required reconciling state divergence across two trading systems.

Reconciling divergent state across two systems means establishing which system is authoritative, resolving differences event by event where they conflict, and verifying consistency before re-opening the book, all under pressure to get the exchange back online. That is tractable when the divergence is small. It becomes a research project when the event log is large and the two systems applied different ordering rules to the same input.

The ten-hour figure points to state reconciliation, not a hardware fault, as the bottleneck. Reconciliation problems of this kind are a direct consequence of non-deterministic recovery: if replay produced identical state on both systems, reconciliation would be trivial. That it took ten hours tells you the divergence was structural.

The architectural implication is straightforward: a journal-plus-snapshot design with single-threaded processing and logical sequence numbers would not have produced this reconciliation requirement. A restart would have replayed to identical state on both sides. The recovery window collapses from hours to minutes.

The Regulatory Dimension: CAT, Reg SCI, and MiFID II RTS 25

Determinism is a compliance requirement as much as an operational property.

In October 2024, FINRA fined Citadel Securities $1 million for Consolidated Audit Trail (CAT) reporting errors covering 42.2 billion inaccurate events in total. Of those, 31.2 billion were canceled-order events in which Citadel reported the canceled quantity in the “leaves quantity” field rather than zero, as required under FINRA Rule 6893(a). FINRA issued a parallel $1.2 million CAT fine to IMC Financial Markets the same day, covering 21.8 billion inaccurate events. CAT enforcement is a pattern.

The Citadel fine was a reporting code error, not a matching engine failure. That distinction is worth making precisely because it illustrates the regulatory principle clearly: the failure mode that mattered was degraded reconstructability. Regulators require the ability to reconstruct the sequence of events behind every order, cancel, and fill. Whether that reconstructability breaks because engine state is non-deterministic or because reporting code writes the wrong field, the regulatory outcome is the same. You cannot give regulators an authoritative account of what happened.

A non-deterministic engine creates a structural version of that problem. If two replays of the same day produce different queue orderings, you cannot tell a regulator with certainty who was first in queue at the point of an adverse fill. You can give them a plausible account. You cannot give them the account.

MiFID II RTS 25 mandates PTP-grade clock synchronization, a maximum 100 microsecond divergence from UTC for the HFT activity class, precisely because regulators need to correlate events across venues. An NTP-disciplined clock with millisecond-range jitter leaves cross-venue ordering ambiguous inside that jitter window; PTP pulls it inside the regulatory threshold. That accuracy is for reporting and correlation. As the second architecture decision established, it does not change the engine’s internal ordering authority, which remains the sequence number.

Reg SCI requires its covered entities to maintain tested business continuity and disaster recovery plans with defined recovery time objectives. The replay fidelity test is the functional measure those requirements are actually after: a team that cannot replay to identical state under controlled conditions cannot credibly claim its recovery plan will produce a consistent book under the much harder conditions of a live incident.

Determinism at the Ingestion Boundary

In-engine determinism is necessary but not sufficient.

The boundary where the engine meets feeds it does not own is where determinism most commonly breaks in practice. The matching engine can be perfectly deterministic given its input stream. If the input stream is itself non-deterministic, the engine’s determinism is irrelevant from a recovery standpoint.

Even a venue that owns every feed it consumes faces the same problems: timestamp normalization and reconnect-gap handling are internal sources of non-determinism, so “the feeds you don’t own” is the sharpest case, not the only one.

The sources of ingestion-boundary non-determinism are specific:

Timestamp normalization. If your feed handler re-timestamps or reformats incoming events, that normalization must be deterministic and journaled. If it applies the current clock rather than a fixed rule, two runs of the same raw feed produce different orderings.

Feed state on reconnect. When a feed drops and reconnects, the engine receives a gap. If it is filled from a feed snapshot, the snapshot’s contents depend on when it was taken, so two reconnects at different times produce different state. The journal must record the gap and its resolution, not just the events on either side.

Multiple feed handlers writing to one book. When primary and secondary handlers write to the same book, the ordering of their writes is a function of arrival timing and scheduling. Stable under normal operation, it can diverge under recovery, where the replay environment schedules differently than production.

The correct design treats the ingestion boundary as a sequencing problem. Every event that crosses into the engine gets a sequence number at the crossing point, not before. Replay always works from that sequence number, never from the original timestamp or from inferences about arrival order.
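One way to picture sequencing at the crossing point, as a sketch under stated assumptions: `IngestionBoundary`, the record kinds `gap_detected` and `gap_snapshot`, and the price fields are all hypothetical. The point is that the gap and its resolution are journaled as events in their own right, and replay reads nothing but the journal.

```python
from itertools import count

class IngestionBoundary:
    """Stamps every inbound record at the crossing point and journals it."""

    def __init__(self):
        self._seq = count(1)
        self.journal = []

    def accept(self, source, kind, body):
        record = {"seq": next(self._seq), "source": source, "kind": kind, "body": body}
        self.journal.append(record)
        return record

boundary = IngestionBoundary()
boundary.accept("primary", "quote", {"px": 100})
boundary.accept("primary", "gap_detected", {"missing_from": 42, "missing_to": 44})
boundary.accept("primary", "gap_snapshot", {"px": 101})  # contents as received
boundary.accept("secondary", "quote", {"px": 102})

def rebuild_last_price(journal):
    """Replay uses journal order only; it never re-fetches a snapshot."""
    last = None
    for record in journal:
        if record["kind"] in ("quote", "gap_snapshot"):
            last = record["body"]["px"]
    return last

print([r["seq"] for r in boundary.journal])  # [1, 2, 3, 4]
print(rebuild_last_price(boundary.journal))  # 102
```

Because the snapshot contents are stored in the journal, a replay tomorrow sees the same gap fill the engine saw today, regardless of what the upstream feed would return if asked again.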

Does Determinism Require a Single Thread? The Sharded-Engine Question

The single-threaded matching loop is the invariant. The scope of “the loop” is a design question.

Modern exchange architectures often partition the order book by instrument. Each instrument runs on its own matching loop, with its own sequence of state transitions. Within each partition, the matching loop is single-threaded. Determinism is preserved within each partition because the invariant holds: one thread, one ordered sequence of events, reproducible on replay.

The single thread is not free. Its cost is a hard throughput ceiling: one core’s worth of ordered processing, in the single-digit millions of orders per second on a tuned engine, with LMAX’s 6M/sec as the reference for a well-optimized loop. When a symbol or instrument group grows past what one loop can carry, you partition by instrument; you do not add threads inside the loop. The rule is simple: partition the workload, never the matching loop. Each partition keeps the single-thread invariant, so determinism is preserved per shard, and the cost moves to the cross-partition ordering layer, which is exactly where the harder problem lives.

The architectural question is cross-partition ordering. If a strategy or a reporting requirement needs to correlate events across instruments, the ordering of events across partitions is not determined by either partition’s sequence number. A heartbeat mechanism or a global sequencer can establish a consistent cross-partition ordering, but this is an additional component with its own failure modes.

The practical implication: per-instrument partitioning is sound, and it is the pattern most modern high-throughput engines use. The single-thread invariant applies within each partition’s matching loop. Cross-partition queries need an explicit ordering layer, and teams that try to reconstruct cross-instrument ordering from per-instrument sequence numbers alone will find the assumption fails on simultaneous events across instruments.
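A small sketch of why per-partition sequence numbers are not enough. The field names `pseq` (per-partition sequence) and `gseq` (global sequence), and the two instruments, are assumptions for the example; the global numbers stand in for whatever a global sequencer would assign.

```python
import heapq

# Each partition journal is ordered by its own sequence number.
instrument_x = [
    {"gseq": 1, "pseq": 1, "event": "X new"},
    {"gseq": 4, "pseq": 2, "event": "X fill"},
]
instrument_y = [
    {"gseq": 2, "pseq": 1, "event": "Y new"},
    {"gseq": 3, "pseq": 2, "event": "Y cancel"},
]

# Per-partition numbers collide across partitions, so they cannot order
# an X event relative to a Y event.
collisions = {r["pseq"] for r in instrument_x} & {r["pseq"] for r in instrument_y}

# With an explicit global sequence, the merge is a total order.
merged = [r["event"] for r in heapq.merge(instrument_x, instrument_y, key=lambda r: r["gseq"])]

print(sorted(collisions))  # [1, 2]
print(merged)              # ['X new', 'Y new', 'Y cancel', 'X fill']
```

The merge is only as trustworthy as the source of `gseq`, which is the additional component with its own failure modes mentioned above.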

The Journal-Replay Diagnostic: A Practical Framework

This is the test that surfaces non-determinism before an incident does.

The test. Take the full journal from yesterday. Replay it against an empty order book in a fresh engine instance. Compare the resulting state to yesterday’s end-of-day state snapshot. The comparison must be exact: queue positions, sequence numbers assigned to fills, open order count and size per price level, and total adverse selection rate per timestamp window.
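For the comparison step, one workable approach is to reduce each state to a canonical digest and compare digests. This is a sketch, assuming the book can be serialized as nested dicts with a list of order ids per price level; the layout and the `state_digest` name are illustrative.

```python
import hashlib
import json

def state_digest(book):
    """Canonical serialization, then hash. List order (queue position) is preserved."""
    canonical = json.dumps(book, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

production_eod = {"bids": {"100": ["A1", "B1"]}, "asks": {"101": ["C1"]}}

# Same content, different dict key order: still identical state.
replayed_ok = {"asks": {"101": ["C1"]}, "bids": {"100": ["A1", "B1"]}}

# Same orders and prices, different queue position: not identical state.
replayed_bad = {"bids": {"100": ["B1", "A1"]}, "asks": {"101": ["C1"]}}

print(state_digest(production_eod) == state_digest(replayed_ok))   # True
print(state_digest(production_eod) == state_digest(replayed_bad))  # False
```

Sorting keys removes incidental differences in serialization, while leaving the per-level lists untouched keeps queue position inside the comparison.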

If the result matches exactly, the engine has replay fidelity for that day’s input. That is a meaningful property claim, not just a unit test.

If the result does not match, the divergence is diagnostic. Work backwards through the diff:

Queue position divergence points to either thread-ordering indeterminacy (the matching loop has some parallel processing) or to external state read during processing (the engine called something outside the journal during the original run that is not reproducible at replay time).

Sequence number drift in the fill log points to the timestamp normalization layer or to a secondary input source that is not journaled.

Adverse selection rate divergence per timestamp window is the most informative signal. A consistent shift (higher or lower adverse selection at the same timestamps in replay versus original) points to ordering differences. The market did not change between yesterday and today; the replay is processing the same events. If the adverse selection rate differs, the order in which the engine saw competing events was different. That order difference is the non-determinism.
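Working backwards through the diff usually starts with locating the first point where the two output streams disagree. A minimal sketch, assuming fill logs are lists of `(seq, order_id, qty)` tuples; the function name and the sample data are made up for the example.

```python
from itertools import zip_longest

def first_divergence(original, replayed):
    """Return (index, original_entry, replayed_entry) at the first mismatch, else None."""
    for index, (a, b) in enumerate(zip_longest(original, replayed)):
        if a != b:
            return index, a, b
    return None

original_fills = [(101, "A1", 50), (102, "B1", 25), (103, "C1", 10)]
replayed_fills = [(101, "A1", 50), (102, "C1", 10), (103, "B1", 25)]

print(first_divergence(original_fills, replayed_fills))
# (1, (102, 'B1', 25), (102, 'C1', 10))
print(first_divergence(original_fills, original_fills))
# None
```

Everything before the reported index replayed identically, so the search for the cause narrows to the inputs processed just before that entry.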

Checklist of failure modes surfaced by this test:

Matching loop uses any parallel execution path (timer-driven, thread-pool-based, or lock-based)
Any state transition during the original run was not journaled (check for calls to external state: risk limits, reference data, fee schedules)
Timestamp normalization at the feed handler applies current clock rather than fixed-rule transformation
Recovery code re-timestamps events rather than using journaled timestamps
Gap-fill from market data uses snapshot timestamp rather than journaled sequence number
Cross-instrument ordering is inferred rather than explicitly sequenced

The test takes about as long as replaying a day’s events through the engine. For a system on the LMAX pattern, that is under a minute. For systems without efficient journal-plus-snapshot infrastructure, the diagnostic cost is itself informative.

Conclusion

The replay test has a clean falsifiability property: you run it, and the answer is binary. Either the book comes back identical, or it does not. There is no partial credit.

If your engine produces an identical book on replay, you have demonstrated something that most teams cannot demonstrate under controlled conditions: that the state your engine holds is a pure function of its input, not of the environment, the clock, or the execution order of concurrent threads. That property is what makes recovery trustworthy, what satisfies a regulatory reconstructability requirement, and what makes the ten-hour reconciliation problem structurally impossible.

If your engine does not produce an identical book on replay, you now have a specific diff and a structured checklist for finding the cause. The diff is a diagnostic instrument. The divergence pattern tells you exactly which of the five architecture decisions was violated, and at which boundary.

One working hypothesis I have not fully closed: in partitioned engines with a global sequencer layer, the global sequencer’s own journal integrity is the new constraint. A replay test per partition passes cleanly, but a cross-partition correlating replay depends on the global sequencer’s log being as complete and deterministic as each shard’s. I have not seen this tested rigorously in any multi-venue architecture I have reviewed. If you have run it, that result is the data point worth contributing.

Originally shared as a LinkedIn post: https://www.linkedin.com/feed/update/urn:li:ugcPost:7466145953830432768/
