Table of Contents
- The Pattern Is Now Public
- What Each Firm Actually Built
- The Crypto-Native Proof of Concept
- What Nobody Is Publishing: The Thresholds
- The Governance Deadline
- The Bloomberg Problem
- What a CTO Should Audit Right Now
The Pattern Is Now Public
Four firms have published enough about their AI agent architectures that the shape is now legible to anyone reading carefully. The pattern is not accidental. It is not the product of shared tooling or a common vendor stack. It converged independently, and that convergence tells you something.
The architecture has four recognizable elements: a containment layer that limits what agents can access and do, an audit trail that makes every LLM call traceable after the fact, a sequential veto structure that routes output through human review before anything enters live trading, and a federated guardrail model that distributes deployment while centralizing standards.
Reading the March 2026 OpenAI case study on Balyasny, the Two Sigma AI outlook published March 30, 2026, the Bloomberg coverage of Man Group’s AlphaGPT, and industry analyst reporting on D.E. Shaw’s internal stack, the same four elements appear across all of them. The firms differ in size, strategy, and culture. The architecture looks the same.
The AIMA 2025 survey of 150 fund managers representing $788 billion in AUM found that 95 percent are now using generative AI, up from 86 percent in 2023. Adoption is not the hard question anymore. The hard question is: what does a deployment look like when it is built to survive contact with production? These four firms answer that question in enough public detail to learn from. What they do not publish (the calibration values, the budget thresholds, the trigger logic) is another matter.
Two Sigma named the strategic shift precisely: “Large language models (LLMs) are widening the top, shifting the bottleneck from ‘we need more ideas’ to ‘we need to evaluate ideas faster.’” That is not a technology observation. It is an organizational architecture observation. The constraint has moved.
What Each Firm Actually Built
D.E. Shaw: Gateway and audit hashes
Public coverage of D.E. Shaw’s internal stack describes an LLM Gateway that logs every call, strips PII before it reaches any model, and throttles usage per-desk with budget controls. A component called DocLab adds cryptographic audit hashes to every document retrieval, creating a timestamped chain of what the model saw when it generated an output. Quants build tools against these interfaces in approximately ten lines of code. The productivity gain is real, but the envelope is enforced. Every call goes through the gateway. Nothing goes around it.
I want to be precise about the sourcing here: these details trace to industry analyst coverage (notably Resonanz Capital’s November 2025 synthesis of hedge-fund AI deployments) rather than a D.E. Shaw primary publication. D.E. Shaw does not publish internal stack specifications. The pattern (gateway, PII filter, per-desk budget, retrieval audit) is consistent with how every firm in this cohort has approached the same problem.
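To make the gateway pattern concrete, here is a minimal sketch in Python. It is not D.E. Shaw's implementation, which is not public. The class name `Gateway`, the two regex patterns, and the word-count stand-in for token cost are all my own assumptions, chosen only to show the shape: redact first, check the desk budget second, log everything.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical PII patterns; a production filter would need far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


class BudgetExceeded(Exception):
    """Raised when a desk does not have enough allowance left for a call."""


def redact(text: str) -> str:
    """Mask e-mail addresses and US-style SSNs before the prompt reaches a model."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return SSN_RE.sub("[SSN]", text)


@dataclass
class Gateway:
    model: Callable[[str], str]      # any callable: prompt in, text out (assumption)
    budgets: Dict[str, int]          # desk -> remaining allowance, illustrative units
    log: List[dict] = field(default_factory=list)

    def call(self, desk: str, prompt: str) -> str:
        clean = redact(prompt)
        cost = len(clean.split())    # crude stand-in for a real token count
        if self.budgets.get(desk, 0) < cost:
            # Blocked calls are logged too, so the audit trail has no gaps.
            self.log.append({"desk": desk, "status": "blocked", "cost": cost})
            raise BudgetExceeded(desk)
        self.budgets[desk] -= cost
        response = self.model(clean)
        self.log.append({"desk": desk, "status": "ok", "cost": cost,
                         "prompt": clean, "response": response})
        return response


# Usage with a stub model that just upper-cases the prompt.
gw = Gateway(model=lambda p: p.upper(), budgets={"macro": 10})
reply = gw.call("macro", "Summarize note from jane.doe@example.com")
```

The design choice that matters is that the model callable is only reachable through `call`. If a desk can import the model client directly, the budget and the log are advisory rather than enforced.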
Man Group: AlphaGPT and the three-agent chain
Man Group’s AlphaGPT is the most publicly documented AI agent system in buy-side trading. Bloomberg has reported its structure: a three-agent chain handling ideation, implementation, and evaluation. The chain is sequential. A hypothesis-generating agent produces research directions. An implementing agent converts those into executable code. An evaluating agent applies statistical scrutiny before anything advances.
What matters architecturally is not the chain itself (multi-agent pipelines are common) but the placement of human review. Humans review every step before any signal enters live trading. Man Group has also been explicit about the failure mode: hallucination “remains a big issue,” per Bloomberg’s July 2025 coverage, and the firm ships the system anyway. The architecture is designed to absorb hallucination rather than prevent it. The containment layer is the answer to the failure mode, not the elimination of the failure mode.
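The sequential-veto shape can be sketched in a few lines of Python. This is an illustration of the pattern as reported, not AlphaGPT's code. The names `run_chain`, `Outcome`, and the stub stages are hypothetical, and the reviewer is modeled as a plain callback standing in for a human decision.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Stage = Callable[[str], str]             # agent stage: artifact in, next artifact out
Reviewer = Callable[[str, str], bool]    # human gate: (stage name, artifact) -> approve?


@dataclass
class Outcome:
    approved: bool
    stopped_at: Optional[str]   # stage whose output was vetoed, or None
    artifact: str               # last artifact produced


def run_chain(seed: str, stages: List[Tuple[str, Stage]], review: Reviewer) -> Outcome:
    """Run stages in order; a veto at any review gate halts the chain."""
    artifact = seed
    for name, stage in stages:
        artifact = stage(artifact)
        if not review(name, artifact):
            return Outcome(False, name, artifact)
    return Outcome(True, None, artifact)


# Stub agents standing in for ideation, implementation, and evaluation.
stages = [
    ("ideation", lambda s: f"hypothesis({s})"),
    ("implementation", lambda h: f"code[{h}]"),
    ("evaluation", lambda c: f"report<{c}>"),
]

passed = run_chain("momentum", stages, lambda name, artifact: True)
vetoed = run_chain("momentum", stages, lambda name, artifact: name != "implementation")
```

Note that nothing downstream of a vetoed stage ever runs. A hallucinated hypothesis or a broken implementation costs one review, not a live order.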
Balyasny: Federated deployment, central guardrails
Balyasny established a dedicated Applied AI team in late 2022 (approximately 20 researchers, engineers, and domain experts) as the central function responsible for guardrails, model evaluation, and deployment standards. Individual investment teams operate within that framework, using scoped tools built to the central team’s specifications.
I read the March 2026 OpenAI case study on Balyasny carefully, because the workflow-compression claims are specific and verifiable. The evaluation pipeline the Applied AI team runs covers 12 or more dimensions: forecasting accuracy, numerical reasoning, scenario analysis, robustness to noisy inputs, and related criteria. Adoption across investment teams has reached approximately 95 percent. One documented example is a Central Bank Speech Analyst agent that compressed a two-day workflow to thirty minutes. The speed gain is real. What makes it safe is that the Applied AI team controls the guardrail layer centrally while each desk deploys locally within that envelope: federated execution, centralized standards.
Two Sigma: Research funnel inversion
Two Sigma’s contribution to this pattern is the clearest strategic framing. The bottleneck in quantitative research has historically been idea generation: analysts producing enough hypotheses to keep evaluation pipelines busy. LLMs have inverted that. The bottleneck is now evaluation. “We need to evaluate ideas faster” is not a tools problem; it is an architecture problem. Two Sigma’s AI deployment targets the evaluation side of the funnel specifically, which changes what an AI agent system is for. It is not an idea machine. It is a throughput accelerator for a pipeline where human judgment concentrates at evaluation, not generation.
The Crypto-Native Proof of Concept

Crypto-native operators have encoded the same four-element pattern into protocol rather than policy. Olas Open Autonomy requires a 2/3 keeper threshold for any external transaction. The signature requirement is enforced by a multisig Safe. The veto is not a human reviewing a queue; it is a cryptographic constraint on execution. No transaction clears without quorum.
The significance for buy-side AI architecture is architectural, not operational. Olas demonstrates that the containment-and-veto pattern does not require human-in-the-loop at every step; it requires a verifiable threshold before state changes commit. In traditional buy-side deployments, that threshold is expressed as policy and enforced by review processes. In protocol-native deployments, it is expressed as a consensus requirement and enforced by contract. The mechanism differs. The architectural shape is the same.
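The "verifiable threshold before state changes commit" idea reduces to a small check. The Python sketch below is not the Olas or Safe API. It assumes the 2/3 figure means "at least two thirds of the keeper set", and the function names `quorum_size` and `can_commit` are mine. Real multisig enforcement verifies signatures on-chain; here approvals are plain identifiers.

```python
from typing import Iterable, Set


def quorum_size(n_keepers: int) -> int:
    """Smallest approval count that is at least two thirds of the keeper set.

    Integer arithmetic avoids floating-point rounding: ceil(2n / 3) == (2n + 2) // 3.
    """
    return (2 * n_keepers + 2) // 3


def can_commit(approvals: Iterable[str], keepers: Set[str]) -> bool:
    """True only if enough distinct, recognized keepers approved."""
    valid = set(approvals) & keepers   # drops duplicates and non-keepers
    return len(valid) >= quorum_size(len(keepers))


keepers = {"keeper_a", "keeper_b", "keeper_c"}
ok = can_commit(["keeper_a", "keeper_b"], keepers)          # two distinct keepers
replayed = can_commit(["keeper_a", "keeper_a"], keepers)    # same keeper twice
outsider = can_commit(["keeper_a", "intruder"], keepers)    # one non-keeper
```

The same function describes a policy-based desk: replace keepers with named reviewers and the check is the sign-off rule, enforced by process instead of by contract.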
What Nobody Is Publishing: The Thresholds

What these four firms have published is structural. What they have not published is calibration.
The architecture (containment, audit, sequential veto, federated guardrails) is now effectively public domain. Any competent engineering team can replicate the shape. The OpenAI case study gives you Balyasny’s evaluation framework dimensions. Bloomberg gives you Man Group’s agent sequence. Industry coverage gives you the gateway pattern. You can build a deployment that looks like theirs.
What you cannot build from public sources is the numbers. What is the token budget per desk per day before a flag triggers? At what confidence interval does the statistical evaluator block a hypothesis from advancing? What is the empirical ratio of hypotheses generated to hypotheses that survive evaluation in a functioning pipeline? What threshold separates “high signal, act now” from “noisy, wait for more data”?
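One way to see what is missing is to write the architecture with the thresholds as explicit parameters. In the Python sketch below every number is a placeholder I made up for illustration. None of them comes from any firm. I have also substituted a p-value cutoff for the confidence-interval question, since it is the simplest stand-in for "the evaluator blocks this hypothesis".

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Thresholds:
    max_p_value: float        # evaluator blocks hypotheses with a p-value above this
    daily_token_budget: int   # a desk is flagged once usage exceeds this


def survivors(p_values: List[float], t: Thresholds) -> List[float]:
    """Hypotheses (represented by their p-values) that pass the evaluator."""
    return [p for p in p_values if p <= t.max_p_value]


def survival_ratio(p_values: List[float], t: Thresholds) -> float:
    """Fraction of generated hypotheses that survive evaluation."""
    if not p_values:
        return 0.0
    return len(survivors(p_values, t)) / len(p_values)


def budget_flag(tokens_used: int, t: Thresholds) -> bool:
    """True when a desk's usage should trigger a flag."""
    return tokens_used > t.daily_token_budget


# Placeholder values only; the real ones are exactly what nobody publishes.
t = Thresholds(max_p_value=0.01, daily_token_budget=1000)
ratio = survival_ratio([0.2, 0.005, 0.03, 0.001], t)
flagged = budget_flag(1200, t)
```

The code is trivial on purpose. Everything that differentiates one firm from another lives in the two numbers passed to `Thresholds`, and in how those numbers are re-tuned over time.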
None of that is published. In my observation of how quantitative firms describe their AI deployments publicly, the architecture is always the headline. The thresholds are always absent.
This is not an accident. The architecture is commoditizing. The calibration is the edge. A firm that has tuned its evaluation threshold to match its specific strategy, market regime, and signal half-life has an advantage that cannot be read off a case study. The architecture gives you a working system. The thresholds give you a better one.
The Governance Deadline

The firms above built containment architectures for operational reasons. Hallucination is a genuine problem, audit trails are needed for post-mortems, and federated deployment requires central standards. The regulatory frameworks are now arriving at the same architectural requirements from the top down.
NIST published AI Risk Management Framework 1.0 on January 26, 2023. ISO/IEC 42001, the international standard for AI management systems, was published December 18, 2023. FINRA released Regulatory Notice 24-09 on June 27, 2024, addressing AI governance obligations for broker-dealers. The EU AI Act entered into force August 1, 2024; most obligations for high-risk AI systems begin to apply in August 2026, per the current published timeline.
The containment architecture these firms built anticipates what these frameworks are beginning to mandate. An LLM Gateway with per-desk budget controls and PII filtering addresses AI risk management requirements. Cryptographic audit hashes on document retrieval create the traceability that AI governance frameworks require. Human review before signals enter live trading maps directly to the human-oversight provisions that appear across NIST, ISO, and the EU Act.
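The traceability claim is easier to evaluate with a concrete structure in mind. Below is a minimal hash-chain sketch in Python using only the standard library. It is not DocLab's design, which is not public. The field names and the functions `append_retrieval` and `verify` are assumptions. Each entry commits to the document content, a caller-supplied timestamp, and the previous entry's hash, so editing any earlier record invalidates the chain.

```python
import hashlib
import json
from typing import List

GENESIS = "0" * 64   # placeholder "previous hash" for the first entry


def _digest(prev: str, doc_id: str, content_hash: str, ts: str) -> str:
    """Hash of one entry's fields, serialized deterministically."""
    payload = json.dumps(
        {"prev": prev, "doc_id": doc_id, "content": content_hash, "ts": ts},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_retrieval(chain: List[dict], doc_id: str, content: bytes, ts: str) -> dict:
    """Record that `content` was retrieved for `doc_id` at time `ts`."""
    prev = chain[-1]["hash"] if chain else GENESIS
    content_hash = hashlib.sha256(content).hexdigest()
    entry = {"prev": prev, "doc_id": doc_id, "content": content_hash, "ts": ts,
             "hash": _digest(prev, doc_id, content_hash, ts)}
    chain.append(entry)
    return entry


def verify(chain: List[dict]) -> bool:
    """Recompute every link; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for e in chain:
        expected = _digest(e["prev"], e["doc_id"], e["content"], e["ts"])
        if e["prev"] != prev or e["hash"] != expected:
            return False
        prev = e["hash"]
    return True


chain: List[dict] = []
append_retrieval(chain, "filing-001", b"10-K text", "2026-01-05T09:30:00Z")
append_retrieval(chain, "speech-042", b"central bank speech", "2026-01-05T09:31:00Z")
intact = verify(chain)
```

A chain like this proves the log was not altered after the fact. It does not prove the log is complete, which is why it only works when every retrieval is forced through the component that writes it.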
Firms that built the architecture for operational reasons are in a better position than firms that will need to build it for compliance reasons starting in 2026. The operational deployment teaches you the calibration. The compliance deployment teaches you the paperwork.
The Bloomberg Problem

In May 2026, Bloomberg reported on Alpha Arena, an AI trading contest run to stress-test autonomous trading agents against real market conditions. The result: most leading AI systems lost money. The failure modes were specific. The agents traded too much. They made wildly different decisions when given identical instructions in separate test runs.
The second failure mode is the more dangerous one for production deployment. Nondeterminism in decision-making is not a hallucination problem; it is an architecture problem. A system that produces different trading decisions from identical inputs cannot be reliably backtested, audited, or corrected. The variance in outcomes is not signal. It is noise in the decision layer.
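Run-to-run consistency is measurable before anything reaches production. The Python sketch below is my own illustration and is not drawn from the Alpha Arena methodology. `agreement_rate` is a hypothetical helper that calls a decision function repeatedly on one identical input and reports how often the most common answer appears.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Hashable


def agreement_rate(decide: Callable[[str], Hashable], prompt: str, runs: int = 5) -> float:
    """Fraction of runs that return the modal decision for one identical input."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    decisions = [decide(prompt) for _ in range(runs)]
    top_count = Counter(decisions).most_common(1)[0][1]
    return top_count / runs


# Deterministic stub: the same instruction always yields the same decision.
stable = agreement_rate(lambda p: "HOLD", "rebalance?", runs=4)

# Inconsistent stub: cycles through answers regardless of the input.
answers = cycle(["BUY", "SELL", "HOLD", "BUY"])
unstable = agreement_rate(lambda p: next(answers), "rebalance?", runs=4)
```

An agreement rate below 1.0 on identical input is a property of the decision layer, not of the market. What rate is acceptable is a calibration question, and like the other thresholds it is not something the public case studies answer.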
The containment architectures described above are one answer to this problem. Sequential veto gates intercept high-variance outputs before they reach order entry. Audit trails make the variance visible after the fact. Human review at evaluation provides a consistency check the model itself cannot provide.
The Bloomberg finding is not an argument against AI agent deployment. It is the empirical case for the architecture that the four firms above built before the contest ran. Uncontained agents lose money and produce inconsistent outputs. Contained agents, by the evidence available, are what the industry’s most sophisticated practitioners built when they chose to deploy.
What a CTO Should Audit Right Now
The architecture pattern is public. The question for any buy-side technology leader is whether their existing deployment matches the pattern, or whether they built the capability without building the containment.
Here is the diagnostic frame I use when reviewing a deployment:
Containment layer: Is there a gateway or proxy between your LLMs and your firm’s data? Does it enforce per-desk or per-team budget controls? Does it strip or mask PII before model calls?
Audit trail: Is every LLM call logged with enough metadata to reconstruct what the model saw and what it returned? Can you trace a trading signal back to the specific model output that informed it?
Sequential veto: At what point in your agent pipeline does human review occur? Is review mandatory before signals enter live trading, or is it optional? What would trigger automatic blocking versus flagging for review?
Federated governance: If multiple desks are deploying AI tools, who owns the guardrail standards centrally? Is there a function, equivalent to Balyasny’s Applied AI team, responsible for the standards that scoped deployments must meet?
Test your own architecture against these four questions. Most deployments I have reviewed have one or two elements in place. The firms that have built all four are the ones that show up in the case studies.
The unsolved problem across all of them is the same: the thresholds. The architecture tells you where to put the veto. It does not tell you when to trigger it. That calibration, built from production data, refined over regime changes, specific to strategy and venue, is what the published case studies omit and what no regulatory framework yet specifies. I will keep tracking who publishes it first.

This article expands on a LinkedIn post published May 26, 2026.
Ariel Silahian is the founder of HFT Advisory, providing electronic trading systems architecture and advisory services across TradFi HFT, crypto-native quant funds, CEX architecture, and DEX protocol teams. Architecture assessment inquiries: electronictradinghub.com/discovery
HFT Systems Architect & Consultant | 20+ years architecting high-frequency trading systems. Author of "Trading Systems Performance Unleashed" (Packt, 2024). Creator of VisualHFT.
I help financial institutions architect high-frequency trading systems that are fast, stable, and profitable.