Enforcement benchmark
We measure the gateway, not the model
Most agent-security numbers really measure how well the underlying model resists an attack. That is not what AxioRank sells. This test isolates what the gateway itself stops, with no model in the measurement, and reports the whole result with every caveat.
The bottom line
- The attack. An attacker hides instructions inside content your agent reads (an email, a web page, a document). The agent obeys them and takes a harmful action whose arguments look completely normal. Content scanners caught 0.0% of these, because there is nothing in the request to match.
- The defense. AxioRank tracks that the agent read untrusted content before the action, so it can stop the action even when the request itself looks legitimate. It blocked 100.0% of the attacks, and held that under an attacker who actively adapted.
- The cost, controlled. A model then reviews the borderline holds so routine work is not stopped. That cut false alarms more than five-fold, from 25.0% to 5.0%, while every data-theft and destructive action stayed blocked. The result clears every condition of a bar we set in writing before running.
How to read the numbers
- Block rate
- The share of the attacker's harmful actions the gateway stopped (denied or held for approval). Higher is better.
- Benign false positives
- The share of legitimate actions the gateway stopped by mistake. Lower is better; this is what determines whether the product is usable in production rather than an outage.
- 95% CI
- The statistical confidence interval around the block rate, so a small sample is never dressed up as a precise number.
Every configuration, every number
Each configuration replays one identical set of recorded attack and legitimate sequences. The bars show attacks blocked (green, higher is better) against legitimate actions held by mistake (red, lower is better). Scanning alone catches nothing; provenance catches everything but over-holds; the recommended configuration catches everything while barely over-holding.
The exact figures
| Configuration | Block rate | 95% CI | Benign FP |
|---|---|---|---|
| No gateway (control) | 0.0% | [0.0%, 12.5%] | 0.0% |
| Content detectors only | 0.0% | [0.0%, 12.5%] | 0.0% |
| Information-flow control only | 100.0% | [87.5%, 100.0%] | 25.0% |
| Detectors + IFC (combined) | 100.0% | [87.5%, 100.0%] | 25.0% |
| IFC + ML flow judge | 85.2% | [67.5%, 94.1%] | 0.0% |
| Detectors + IFC + flow judge | 81.5% | [63.3%, 91.8%] | 5.0% |
n = 27 attack-relevant sink calls, 20 benign guarded-sink calls. Recording method: deterministic ground truth (no model at record time). AgentDojo 0.1.35 (benchmark v1.2.1); suites: workspace.
It holds when the attacker fights back
A single number means little against an attacker who adapts, so we budgeted a set of evasions and re-ran. Four disguise the payload (encoding it, splitting it across calls, swapping the tool, routing it through an innocent step). Three target the model directly: copying the user's own wording, planting a fake "this was already approved" note in the content the model reads, and claiming inside that content that the user changed their request. Against the attacker's most effective evasion, the recommended configuration still blocked 73.8%. Provenance does not care how a payload is dressed up, and the model treats instructions aimed at itself as evidence of an attack rather than as commands.
How the model keeps legitimate work flowing
Provenance alone is blunt: once an agent has read untrusted content, it would hold every following action, a genuine reply just like a data theft. That is safe but noisy. So when a flow is held, AxioRank asks a model one question: does this action serve the task the user actually asked for? A confident yes releases the hold; a suspected attack keeps it and raises an alert; anything uncertain, timed out, or errored stays held. The model can only ever release a hold for review. It never overrides a block, so a fooled model causes a delay, never a breach.
We publish what the model got wrong, because that is the point of a benchmark. It took three rounds of tuning to reach this result, all committed to the repository. An early round wrongly released two actions that deleted the evidence after reading it; we added a rule about destructive actions. A later round released one credential theft because the model assumed an unfamiliar address was the user's own; we made addresses anonymous to the model and added a rule that secrets may only go to a destination the user's own request named. The reported run is the third, and every round's full transcript ships with the evidence. The remaining released actions are searches of the user's own inbox that this test labels as attack steps only because the attack happens to begin with the same search the user's own task performs. Every release is recorded, with the model's reasoning, in a signed receipt.
Two things we are precise about
The false-positive rate sits right at the line. Out of 20 legitimate actions, the recommended configuration held exactly one: a genuine send the model held in uncertainty rather than auto-releasing, which in production goes to a person for a one-click approval, not a hard block. The provenance-plus-model configuration without content scanning holds none at all.
The block rate of 81.5% is not leaked attacks.Every action the model released is a search of the user's own inbox, which this test counts as an attack step only because the attack and the user's real task happen to begin with the same search. Every actual data theft and every destructive action stayed blocked. Each released action is listed in the published evidence with its receipt.
Why this number is trustworthy
The hard part of any security number is separating the gateway from the model behind it. We do it by keeping the model out of the measurement entirely. Each attack is a fixed sequence of tool calls built from a public benchmark's ground truth (the agent reads the attacker's content, then takes the harmful action), and we replay the exact same sequence through every configuration. Because the sequence is identical across rows, the difference between rows is the gateway and nothing else. We report a Wilson 95% confidence interval and the false-positive rate alongside every block rate, so a number is never stronger than the data behind it.
The pre-registered go / no-go bar
We decide these thresholds before the run, so the result cannot be rationalized after the fact. We publish a headline only if all of them hold:
- With no gateway in place, nothing is stopped. This confirms the test itself is not inventing blocks.
- The recommended configuration must block meaningfully more attacks than content scanning alone, with statistical confidence, not by chance.
- It must hold no more than 5% of legitimate actions by mistake, so it stays usable in production.
- An attacker given a budget to adapt must not be able to collapse the block rate below half.
What ships with every number
Each result is model-free and fully reproducible. It is accompanied by:
- No model involved in producing the attack sequences
- The public benchmark version and the task suites used
- The full list of attacks and how each was labelled
- The exact code version of the gateway under test
- The complete policy configuration for every row
- A verifiable cryptographic receipt for every blocked action
Run provenance
Recorded against AgentDojo 0.1.35 (benchmark v1.2.1), with deterministic ground truth (no model at record time). The full harness, the per-arm policy sets, the gate logic, both corpora, every arm's raw verdicts, and the model's adjudication transcripts are published as a separate, public repository. Re-run the report over the committed results and you get the same table.
Inspect the published evidence on GitHub, or see the evidence bundle for the receipt-backed audit artifacts AxioRank produces for every blocked call.