Detection benchmark methodology
How the competitive detection benchmark works. The normalized case schema, the adapter contract, the tools we run and the ones we cannot, the corpora and how benign controls are built, the catch-rate-at-a-false-positive-budget metric, and the pre-registered gate. Everything you need to reproduce every number.
The detection benchmark puts AxioRank's attack catch-rate
next to the open-source guardrails we can actually run, on public and red-team
corpora, at a fixed low false-positive budget. It is the competitive sibling of
the enforcement benchmark, which measures the gateway against
itself. The harness lives at benchmarks/competitive/ and follows the same
chain: corpora to results.json to a generated data file the page renders
directly, with a consistency test that forbids any published number from
drifting off the evidence.
We chart only tools we ran end to end on the same corpora. Every tool we could not run is named with the reason, never estimated.
What we run, and what we do not
| Tool | Runs offline, no key | Notes |
|---|---|---|
AxioRank inspect() | yes | The subject. A Node shim over the shipped @axiorank/detectors node build (recursive base64/hex/gzip decode), so the score matches the gateway default. |
| Regex / keyword baseline | yes | The content-scanning floor. |
| allow-all | yes | Instrument control: 0% catch, 0% false positives. |
| block-all | yes | 100% catch, 100% false positives; the budget punishes it. |
| Protect AI LLM Guard | yes, model download | PromptInjection scanner, a local DeBERTa model, no key. Scored on injection only. |
| Microsoft Presidio | yes, model download | PII entity recognizer, a local spaCy model, no key. Scored on PII only. |
| Rebuff | no | Needs a paid model API and an external vector store; cannot run offline. |
| NVIDIA NeMo Guardrails | no | Its rails invoke an LLM backend; cannot run offline. |
| Vigil | stretch | Heavy, uncertain local model and YARA setup; not run in this build. |
Fairness: each tool on its own turf
LLM Guard is injection-only and Presidio is PII-only, while AxioRank spans every class. A single blended number would be apples to oranges, so each tool is scored only on the categories it declares support for, and the case count and scope are stated on every panel. The honest headline is breadth at zero false positives, not a single-axis win: a dedicated injection classifier can match a gateway on injection text, so AxioRank's edge is covering every class at once while holding false positives down.
The normalized case
Every corpus is mapped into one shape so the runner is written once:
@dataclass(frozen=True)
class Case:
id: str
kind: str # "attack" | "benign"
category: str # injection | indirect_injection | pii | secret | destructive | ...
modality: str # "tool_call" | "text"
phase: str # "request" | "result"
source: str
tool: str | None
args: dict | None
text: str | NoneEach tool is an adapter that turns a Case into a Decision{blocked, score?, latency_ms}
and declares which categories it is fair to score on:
class Adapter:
name: str
scored: bool # emits a continuous score (threshold-sweepable)
supports: frozenset # categories it is fair to evaluate on
requires_download: bool
requires_key: bool
def prepare(self) -> None: ...
def decide(self, case: Case) -> Decision: ...Corpora and benign controls
- InjecAgent (indirect injection): the poisoned tool response is an attack case; the same template with the injection removed is a matched benign case. The upstream repository ships no license, so the derived fixture is generated locally by the loader and not committed; only aggregate metrics are published.
- Red-team corpus (overt attacks): AxioRank's own labeled single-call
scenarios, including obfuscated payloads such as a base64-encoded key. Its
allowscenarios are benign controls.
A benign case is the same workflow step with the injection removed, so any block on it is a true false positive. Multi-step kill chains need stateful taint and are measured in the enforcement benchmark, not here.
The metric: catch-rate at a false-positive budget
Per panel, per tool, restricted to the tool's supported categories, at a default budget of 2%:
- Scored tools sweep the threshold and report catch-rate at the lowest threshold whose benign false positives stay within budget.
- Binary tools have one operating point; if their false positives exceed the budget, they are marked over budget and cannot claim their catch-rate as usable.
- Intervals are the Wilson 95% score interval, imported from the enforcement harness. A hold counts as a catch, since it stops autonomous execution.
The pre-registered gate
The headline publishes only if all hold, evaluated by the same harness:
- allow-all blocks nothing (the harness invents no catches).
- On every overt class, AxioRank stays within the budget and catches something.
- On every overt class, no in-budget competitor beats AxioRank. A tool that leads only by exceeding the budget does not count.
Indirect injection is reported as honest context, not part of the claim: offline content scanners are weak on it because the attack is a legitimate-looking request with no payload to match, which is exactly what the gateway's information-flow control is for.
Reproduce it
cd benchmarks/competitive
pip install -e '.[all]' # torch + spaCy + models
python -m spacy download en_core_web_lg
pnpm --filter @axiorank/detectors build # the engine the shim scores through
node corpora/load_redteam.mjs # no network
python corpora/load_injecagent.py # network once
python run.py # writes results/results.json
python ci_smoke.py # no-network gate + metric self-test
cd ../../apps/web
node scripts/gen-detection-benchmark-data.mjsModel ids and corpus versions are recorded in the results provenance so a re-run is deterministic. The committed results are what the consistency test guards; the Python harness runs outside CI because it needs the model downloads.
Assurance center
One health view for the parts of AxioRank that run in the background, your integrations, the ML lane, data residency, and the async job queue, so a silent degradation never hides.
Verify our log yourself
Pin our public key, pull the signed checkpoints, and verify receipts offline. Trust nothing of ours.