Table of Contents
- The Failure Pattern Nobody Sees Coming
- The Reconnect Trap: What Binance’s Own Docs Say You Must Do
- What the April 15, 2025 AWS Failure Looked Like at the Feed Handler Level
- The Harder Problem: Silent Gaps Without a Disconnect Event
- REST as Authority, Stream as Suspect: How to Architect the Reconciliation Loop
- Calibration: The Part That Stays Unsolved
- Pre-Trade Risk Is Only as Good as the Position It Guards
- A Diagnostic Checklist for Your Feed Handler Architecture
The Failure Pattern Nobody Sees Coming
The book quotes against a position the desk does not hold.
That sentence is the whole problem. Everything else — reconnect logic, sequence ID handling, REST reconciliation cadence, tolerance calibration — is engineering detail in service of preventing that one outcome. But the desks that blow up on position drift do not blow up because they missed the engineering. They blow up because the system appeared to be working right up until it wasn’t.
Multi-venue crypto market making stacks are architecturally vulnerable to a specific class of failure: the position state held locally diverges from the position state that actually exists across venues, and the divergence is invisible to every downstream system until the P&L prints it. Pre-trade risk gates check orders against local position. If local position says flat and real position says significantly long, the risk gate is not failing — it is faithfully executing against bad inputs.
A CTO searching for “crypto market making position drift” or “stale position pre-trade risk” is usually doing it after something went wrong. This article is written for the read before that.
The problem has three distinct layers. First, reconnect handlers that resume from a stale snapshot after an exchange outage. Second, silent feed gaps that drop updates without triggering a disconnect event. Third, calibration: the tolerance parameters that determine when the system resnaps versus ignores. Each layer has a known architecture. The calibration layer does not have a solved answer, and I will say so explicitly rather than paper over it.
The Reconnect Trap: What Binance’s Own Docs Say You Must Do
Every crypto exchange using a snapshot-then-delta WebSocket feed publishes the same core contract: initialize from a REST snapshot, resume from deltas using sequence IDs, and if you detect a gap, discard your local state and resnapshot.
Binance’s published documentation on order book management states the rule clearly: if the first update ID of the first post-reconnect event is greater than the locally stored update ID plus one, events were missed. Discard the local order book and restart the process from the beginning. Binance futures documentation adds a second check: each incoming event’s previous-update-ID field should match the previous event’s final update ID; if it does not, reinitialize.
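The two checks described above can be sketched as a pair of predicates. This is a minimal illustration, not a full feed handler: it assumes events arrive as dicts carrying Binance's documented `U` (first update ID), `u` (final update ID), and `pu` (previous final update ID, futures only) fields, and the function names are my own.

```python
from typing import Optional


def spot_gap_detected(local_update_id: int, event: dict) -> bool:
    """True when events were missed: the event's first update ID (U)
    is more than one ahead of the locally stored update ID."""
    return event["U"] > local_update_id + 1


def futures_gap_detected(prev_event_u: Optional[int], event: dict) -> bool:
    """True when the event's pu does not match the previous event's u.

    The first event after a snapshot has no predecessor to compare with,
    so this sketch reports no gap for it; validating that first event
    against the snapshot is a separate step not shown here.
    """
    if prev_event_u is None:
        return False
    return event["pu"] != prev_event_u
```

Either predicate returning True is the signal to discard local state and reinitialize from a fresh snapshot.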
This is not an obscure edge case in a footnote. It is in the primary documentation for developers building market data handlers. Most teams have read it. The failure is not reading the docs — it is the reconnect handler not enforcing the rule under the conditions that make it hard to enforce.
The hard condition is this: after a multi-second outage, the handler reconnects. It has a snapshot in memory. The first incoming delta carries a sequence number. The correct behavior is to check whether that delta is contiguous with the stored snapshot and, if not, trigger a full resnapshot before resuming. The shortcut — resume from the cached snapshot without the check — compiles, passes tests, and works fine in development where reconnects are clean. In production, during an outage that affected multiple venues simultaneously, it silently builds a position that is not what the desk holds.
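As a rough sketch of the correct behavior versus the shortcut, the handler below refuses to resume from the cached snapshot when the first post-reconnect delta is not contiguous. `VenueLink`, `fetch_snapshot`, and the return strings are hypothetical names for illustration; `fetch_snapshot` stands in for whatever REST call the venue provides and is assumed to return a dict with a `lastUpdateId` field.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VenueLink:
    fetch_snapshot: Callable[[], dict]  # hypothetical REST call: {"lastUpdateId": int}
    last_update_id: int = 0
    resnapshots: int = 0

    def resnapshot(self) -> None:
        """Discard local sequence state and reinitialize from REST."""
        snap = self.fetch_snapshot()
        self.last_update_id = snap["lastUpdateId"]
        self.resnapshots += 1

    def on_reconnect_delta(self, delta: dict) -> str:
        """Handle a delta received after a reconnect.

        Returns "dropped_stale", "resnapshotted", or "applied".
        """
        if delta["u"] <= self.last_update_id:
            # Already covered by the snapshot we hold.
            return "dropped_stale"
        if delta["U"] > self.last_update_id + 1:
            # Gap: updates were missed while the link was down.
            # The delta itself is discarded; later deltas are checked
            # against the fresh snapshot.
            self.resnapshot()
            return "resnapshotted"
        self.last_update_id = delta["u"]
        return "applied"
```

The shortcut version is the same method with the second `if` removed. It passes every test in which the reconnect is clean, which is why it survives into production.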
One data vendor has estimated that stale quotes from delayed or missed feed events cost market makers in the range of 5 to 10 basis points per trade in adverse selection over time. That figure is their estimate, not a universal benchmark. The right number depends on the desk, the venue, and the volatility regime. What is not in dispute is the direction: stale position state means the book quotes at the wrong price in the wrong size against counterparties who have access to the same market data the handler is missing.

What the April 15, 2025 AWS Failure Looked Like at the Feed Handler Level
On April 15, 2025, an AWS connectivity issue in the Tokyo region caused a roughly 36-minute disruption beginning around 1:15 AM PDT. At least eight exchanges were affected, including Binance and KuCoin. Binance suspended withdrawals for approximately 23 minutes during the incident. The outage was confirmed in public reporting across multiple tier-1 sources.
I was reviewing the logs of a desk I was advising that had multiple venue links active during the outage. When connectivity restored, several of those links came back onto stale snapshots. The reconnect handlers resumed deltas without resnapshotting. Local position showed approximately flat. The real position — reconstructed from post-incident fill reconciliation — was significantly long, in the range of $180K. The desk quoted as if flat. Adverse selection climbed as it got picked off by counterparties who saw the book correctly.
Pre-trade risk gates cleared the orders. Not because the risk system failed, but because the position it was guarding against was wrong. From the risk system’s perspective, everything was within limits. The exposure was invisible until fills started printing in a direction that did not make sense at the prices being quoted.
What we found in the logs matched the rule Binance documents: on those links, the sequence number of the first post-reconnect delta was more than one ahead of the stored update ID. The gap was detectable. The handler simply did not check for it before resuming.
The October 2025 AWS US-East-1 outage, which took Coinbase offline and affected multiple chains including Ethereum Mainnet and several Layer 2 networks, represents the same class of risk with broader infrastructure reach. AWS-adjacent failures affecting crypto venues are not anomalies: there were at least two major incidents in 2025 alone. Desks that do not have tested reconnect protocols with verified resnapshot logic are not running a low-probability scenario. They are running a known failure mode with no defense.
The Harder Problem: Silent Gaps Without a Disconnect Event
The reconnect trap has a tractable solution: enforce the sequence ID check on every reconnect, trigger a resnapshot when a gap is detected, resume only after the snapshot is confirmed current. Painful but not complicated.
The silent gap is harder.
Exchange WebSocket feeds drop updates without disconnecting. No reconnect event fires. The handler is running, the stream is live, the sequence number ticks forward — but an internal exchange event caused a message to be skipped without the disconnect-reconnect cycle that would normally trigger the recovery logic. This is documented behavior, not a corner case. KuCoin’s public issue tracker includes reports of sequence numbers being skipped in production, with confirmed cases where actual order book changes were not broadcast to the WebSocket — the handler received the surrounding events but not the one in the middle.
The naive fix — resnapshot on every sequence gap — does not work in practice. Exchanges have in-band sequence skips that are benign: administrative events, heartbeat anomalies, brief internal reordering. Resnapshotting on every gap means thrashing through clean sessions, introducing latency at exactly the moments when throughput matters most.
The architecture that handles this correctly treats REST as the authority and the stream as suspect. Reconcile local position against the REST endpoint on a cadence. REST snapshots represent confirmed exchange state; the stream represents the exchange’s best-effort delivery of state changes. When they agree, the stream is trusted. When they diverge, the stream is suspect.
Gating the resnapshot requires a tolerance: how much divergence, for how long, before action is taken. A tolerance defined as K consecutive check intervals showing divergence above a size threshold prevents thrashing on benign in-band skips while still catching real desync events. The key variables are the check cadence, the number of consecutive failures (K), and the size threshold above which divergence is treated as a real problem rather than a transient artifact.
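The gate itself is small. A minimal sketch, assuming positions are expressed in notional terms and that `DivergenceGate` and its field names are illustrative rather than anything a venue or library defines:

```python
from dataclasses import dataclass


@dataclass
class DivergenceGate:
    k: int                 # consecutive divergent checks required
    size_threshold: float  # divergence below this is ignored
    streak: int = 0

    def observe(self, local_pos: float, rest_pos: float) -> bool:
        """Record one REST check; return True when a resnapshot should fire."""
        if abs(local_pos - rest_pos) > self.size_threshold:
            self.streak += 1
        else:
            # Any agreeing check resets the streak, so isolated
            # divergent readings never accumulate.
            self.streak = 0
        if self.streak >= self.k:
            self.streak = 0
            return True
        return False
```

With `k=3` and a 10,000 threshold, two divergent checks followed by an agreeing one produce no action; three in a row fire the resnapshot.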

REST as Authority, Stream as Suspect: How to Architect the Reconciliation Loop
The reconciliation loop is the architectural answer to silent gaps. It runs independently of the WebSocket handler. It does not interfere with the stream. It periodically checks whether the position the stream has built matches what the exchange reports as confirmed state via REST.
The design principles I build into handlers on desks I advise follow this structure:
REST poll cadence. Frequent enough to catch divergence before it compounds, infrequent enough to stay within exchange rate limits. On most major venues, a 500ms to 1s cadence is achievable without hitting rate limit thresholds. The desk’s tolerance for undetected divergence sets the minimum poll frequency; the exchange’s rate limit policy sets the maximum.
Divergence measurement. Compare position locally held for each instrument against the position the REST response reports. The comparison is per-venue, per-instrument. Aggregate position across venues is a separate concern — venue-level divergence is the one the reconciliation loop addresses.
The K-consecutive-check gate. A single check showing divergence does not trigger a resnapshot. One abnormal reading could be a transient artifact of the REST endpoint’s own state. K consecutive checks — where K is calibrated per venue based on observed in-band skip frequency — must show divergence before the resnapshot fires. This is the mechanism that separates benign gaps from real desync.
Size threshold. Below a certain notional size, divergence is treated as operationally immaterial even if technically real. Above the threshold, it is treated as a risk exposure. The threshold is set relative to the desk’s position limits and volatility exposure at the time of the check, not as a fixed absolute dollar value.
Resnapshot trigger. When K consecutive checks exceed the size threshold, the handler discards its local state for that venue link and reinitializes from a fresh REST snapshot. The WebSocket stream resumes from that point. The reconnect-protocol logic runs even though no disconnect-reconnect cycle occurred.
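Put together, one pass of the loop might look like the sketch below. Everything named here is an assumption for illustration: `fetch_rest_positions` is a stand-in for the venue's REST position endpoint and is assumed to return notional per instrument, the threshold is expressed as a fraction of a position limit, and "resnapshot" is reduced to overwriting local state with the REST view. Restarting the stream from that point is outside the sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

Key = Tuple[str, str]  # (venue, instrument)


@dataclass
class Reconciler:
    fetch_rest_positions: Callable[[str], Dict[str, float]]  # hypothetical REST call
    position_limit: float
    threshold_fraction: float  # size threshold relative to the position limit
    k: int
    local: Dict[Key, float] = field(default_factory=dict)
    streaks: Dict[Key, int] = field(default_factory=dict)

    def reconcile_once(self, venue: str) -> bool:
        """Run one REST check for a venue; return True if it was resnapshotted."""
        rest = self.fetch_rest_positions(venue)
        threshold = self.position_limit * self.threshold_fraction
        tripped = False
        # Divergence is measured per venue, per instrument.
        for instrument, rest_pos in rest.items():
            key = (venue, instrument)
            diverged = abs(self.local.get(key, 0.0) - rest_pos) > threshold
            self.streaks[key] = self.streaks.get(key, 0) + 1 if diverged else 0
            if self.streaks[key] >= self.k:
                tripped = True
        if tripped:
            # REST is the authority: replace local state for the whole venue link.
            for instrument, rest_pos in rest.items():
                self.local[(venue, instrument)] = rest_pos
                self.streaks[(venue, instrument)] = 0
        return tripped
```

In a live handler this method would be called on the poll cadence from its own task or thread, separate from the WebSocket handler. Instruments held locally but missing from the REST response are not handled here and would need their own rule.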
The surviving desks build this loop as a first-class infrastructure component, not as a monitoring afterthought. The desks that blow up build it as a monitoring job that only sends alerts.
Calibration: The Part That Stays Unsolved
The architecture above works. The calibration of its parameters does not have a solved answer, and anyone telling you it does is selling you static configuration.
The core problem: a volatility spike raises real desync risk and benign-gap frequency at the same time. A calm book with low throughput produces very few in-band skips; even a tight K-and-threshold configuration will not thrash. The same book during a volatility event generates more in-band skips and more real desync risk simultaneously. A fixed tolerance that is appropriate for calm conditions thrashes during the event and catches nothing real. A tolerance loosened for the event misses the real desyncs that happen at exactly that moment.
The desks that have solved this operationally — not theoretically — tune per venue and per regime. Venue-level calibration is required because exchange implementations differ. KuCoin and Binance do not have the same in-band skip frequencies. Their REST endpoints have different response characteristics. A K of 3 with a $10K threshold might be the right configuration for one venue and wrong for another by an order of magnitude.
Regime detection is the harder layer. A vol-regime signal needs to be sensitive enough to widen the resnapshot tolerance during real stress events and tight enough not to trigger on the volatility inherent in normal crypto trading. The desks using static tolerances are one AWS outage away from the failure described in this article. The ones using dynamic per-venue tolerances have reduced that risk — they have not eliminated it.
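To make "per venue and per regime" concrete, here is one possible shape for the configuration. The venue names, the K and threshold values, the two-regime split, and the use of a simple standard deviation of recent returns as the regime signal are all placeholders I chose for illustration. None of them is a recommended calibration.

```python
from dataclasses import dataclass
from statistics import pstdev
from typing import Dict, Sequence, Tuple


@dataclass(frozen=True)
class Tolerance:
    k: int
    size_threshold: float


# Illustrative values only; real numbers come from observed venue behavior.
TOLERANCES: Dict[Tuple[str, str], Tolerance] = {
    ("venue_a", "calm"): Tolerance(3, 10_000.0),
    ("venue_a", "stressed"): Tolerance(5, 25_000.0),
    ("venue_b", "calm"): Tolerance(2, 5_000.0),
    ("venue_b", "stressed"): Tolerance(4, 15_000.0),
}


def classify_regime(returns: Sequence[float], stress_vol: float) -> str:
    """Label the regime from recent returns; too little data defaults to calm."""
    if len(returns) < 2:
        return "calm"
    return "stressed" if pstdev(returns) > stress_vol else "calm"


def tolerance_for(venue: str, returns: Sequence[float], stress_vol: float) -> Tolerance:
    """Look up the gate parameters for a venue under the current regime."""
    return TOLERANCES[(venue, classify_regime(returns, stress_vol))]
```

The lookup is the easy part. The `stress_vol` cutoff and the stressed-regime values are exactly the numbers that have no principled derivation, and the sketch does nothing to change that.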
What remains unresolved in my work across desks: a benign in-band skip at elevated vol and a real desync that happens to occur during elevated vol look the same to the K-consecutive-check gate. Both present as divergence. The distinction is only clear after the fact. Every desk I have reviewed with a live version of this architecture has a different answer to where they drew that boundary, and none of them has a principled derivation for it. They have a number that has not blown up yet.

Pre-Trade Risk Is Only as Good as the Position It Guards
Pre-trade risk is a gate, not a guarantee. A gate checks inputs against limits. If the inputs are wrong, the gate is faithfully executing against incorrect information.
The failure mode described in this article is not a risk system failure. The risk system did exactly what it was designed to do. It checked orders against a locally held position that showed flat and cleared them within limits. The failure was upstream — in the position state the risk system was given to work with.
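A toy illustration of the same gate given two different inputs. The limit and order size are hypothetical; the 180,000 figure is the reconstructed exposure from the incident described earlier.

```python
def within_limit(position: float, order_notional: float, limit: float) -> bool:
    """Pre-trade check: would the resulting absolute position stay within the limit?"""
    return abs(position + order_notional) <= limit


LIMIT = 200_000.0          # hypothetical position limit
ORDER = 50_000.0           # hypothetical buy order
local_position = 0.0       # what the handler believed
real_position = 180_000.0  # what the desk actually held

cleared_on_local = within_limit(local_position, ORDER, LIMIT)  # gate clears the order
cleared_on_real = within_limit(real_position, ORDER, LIMIT)    # gate would have rejected it
```

The function is identical in both calls. Only the position it was handed differs.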
This is the architectural point that matters for CTOs reviewing their stack. The question is not “does our pre-trade risk system work?” The question is “what does our pre-trade risk system believe our position is, and how do we know that belief is current?”
Adverse selection from stale quotes accumulates per trade. One data vendor estimates 5 to 10 basis points per trade — their estimate, not a universal figure. On a desk running meaningful volume across multiple venues, the cost is not an anomaly. It is a recurring tax on every trade placed against stale state, and it continues until the desync is detected and corrected.
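For a sense of scale, the arithmetic is simple. The trade count and notional per trade below are hypothetical inputs chosen for round numbers; only the 5 to 10 basis point range comes from the vendor estimate above.

```python
def adverse_selection_cost(notional_per_trade: float, trades: int, bps: float) -> float:
    """Cost of trading against stale state: notional * trades * bps / 10,000."""
    return notional_per_trade * trades * bps / 10_000


# Hypothetical desync window: 400 trades at 25,000 notional each.
low = adverse_selection_cost(25_000.0, 400, 5)    # 5,000.0
high = adverse_selection_cost(25_000.0, 400, 10)  # 10,000.0
```

The cost scales linearly with the number of trades placed before the desync is detected, which is why detection latency matters as much as detection itself.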
The stack review I apply when auditing multi-venue market making infrastructure always includes three specific checks:
- Reconnect handler verification. Does the handler enforce a sequence ID gap check on reconnect, or does it resume from cached snapshot unconditionally? Unconditional resume is a bug, not a design choice.
- Silent gap coverage. Is there a REST reconciliation loop running independently of the WebSocket handler, or does the system rely exclusively on the stream to maintain position correctness?
- Tolerance calibration state. Are the K-and-threshold parameters documented, versioned, and reviewed against live venue behavior, or were they set at build time and never touched?
The third question is the one that gets the most silence in the room.
A Diagnostic Checklist for Your Feed Handler Architecture
Forward this section to the engineer responsible for your venue connectivity.
Reconnect Protocol
- On reconnect, does the handler compare the first incoming delta’s sequence ID against the stored snapshot update ID?
- If the first delta’s sequence ID is more than one ahead of the stored update ID, does the handler discard local state and initiate a resnapshot before resuming the stream?
- Is this behavior tested under simulated mid-session disconnects, not just clean startup?
- Is it tested under the specific condition where multiple venue links reconnect simultaneously?
Silent Gap Coverage
- Is there a REST reconciliation job running on a defined cadence, independent of the WebSocket stream?
- Does the job compare per-instrument, per-venue position against the REST response, not just aggregate position?
- Does divergence require K consecutive checks above threshold before triggering a resnapshot, or does a single divergent check trigger action?
- What is the current K value and size threshold for each active venue? Is it documented?
Calibration
- Are K and threshold values venue-specific, or is a single global configuration applied to all venues?
- Has the calibration been reviewed against live venue behavior in the last 90 days?
- Is there a volatility-regime mechanism that adjusts tolerance, or is the configuration static?
- What is the escalation path when the reconciliation loop detects divergence above threshold? Alert only, or automated resnapshot?
Outage Coverage
- Has reconnect logic been tested against a simulated multi-venue simultaneous disconnect?
- Is there a runbook for the scenario where multiple venue links come back simultaneously after an exchange-level outage?
If more than three of these checkboxes are not confirmed, the stack has a position drift exposure. The question is not whether it will surface — it is which volatility event or infrastructure incident makes it visible.
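For the simulated-disconnect items in the checklist, a test harness does not need a real exchange. The sketch below is one way to structure it, with `FakeVenue`, `Link`, and `simulate_simultaneous_outage` all being hypothetical names: a fake venue keeps advancing its update ID while every link is down, then each link receives its first post-reconnect delta.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class FakeVenue:
    """Stand-in exchange whose update ID advances while links are down."""
    update_id: int = 100

    def advance(self, n: int) -> None:
        self.update_id += n

    def snapshot(self) -> dict:
        return {"lastUpdateId": self.update_id}

    def next_delta(self) -> dict:
        first = self.update_id + 1
        self.update_id += 3
        return {"U": first, "u": self.update_id}


@dataclass
class Link:
    venue: FakeVenue
    last_update_id: int = 0
    resnapshots: int = 0

    def start(self) -> None:
        self.last_update_id = self.venue.snapshot()["lastUpdateId"]

    def on_delta(self, delta: dict) -> None:
        if delta["U"] > self.last_update_id + 1:
            # Gap detected: reinitialize from a fresh snapshot.
            self.last_update_id = self.venue.snapshot()["lastUpdateId"]
            self.resnapshots += 1
            return
        self.last_update_id = max(self.last_update_id, delta["u"])


def simulate_simultaneous_outage(names: Sequence[str], missed_updates: int) -> Dict[str, Link]:
    links = {name: Link(FakeVenue()) for name in names}
    for link in links.values():
        link.start()
        link.on_delta(link.venue.next_delta())  # clean pre-outage traffic
    for link in links.values():
        link.venue.advance(missed_updates)      # updates no link ever saw
    for link in links.values():
        link.on_delta(link.venue.next_delta())  # first delta after reconnect
    return links
```

A test then asserts that every link resnapshotted exactly once and ended aligned with its venue. Swapping `Link` for the production handler, behind the same interface, is what turns this from a sketch into a regression test.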
Conclusion
The architecture is tractable. Enforce the sequence ID check on reconnect. Run a REST reconciliation loop on cadence with K-consecutive-check gating. Tune per venue.
What I have not solved — and have not seen solved cleanly on any desk I have reviewed — is the calibration boundary between benign in-band skips and real desync events at elevated volatility. Every desk has a number. None of them have a principled derivation for it. If you have built a regime-adaptive tolerance that demonstrably separates benign skips from real desyncs at elevated vol, and you have live data to support the calibration, that is the architecture review worth having.
Originally shared as a LinkedIn post.
HFT Systems Architect & Consultant | 20+ years architecting high-frequency trading systems. Author of "Trading Systems Performance Unleashed" (Packt, 2024). Creator of VisualHFT.
I help financial institutions architect high-frequency trading systems that are fast, stable, and profitable.