Detection benchmark methodology

How the competitive detection benchmark works. The normalized case schema, the adapter contract, the tools we run and the ones we cannot, the corpora and how benign controls are built, the catch-rate-at-a-false-positive-budget metric, and the pre-registered gate. Everything you need to reproduce every number.

The detection benchmark puts AxioRank's attack catch-rate next to the open-source guardrails we can actually run, on public and red-team corpora, at a fixed low false-positive budget. It is the competitive sibling of the enforcement benchmark, which measures the gateway against itself. The harness lives at benchmarks/competitive/ and follows the same chain: corpora to results.json to a generated data file the page renders directly, with a consistency test that forbids any published number from drifting off the evidence.

We chart only tools we ran end to end on the same corpora. Every tool we could not run is named with the reason, never estimated.

What we run, and what we do not

Tool	Runs offline, no key	Notes
AxioRank `inspect()`	yes	The subject. A Node shim over the shipped `@axiorank/detectors` node build (recursive base64/hex/gzip decode), so the score matches the gateway default.
Regex / keyword baseline	yes	The content-scanning floor.
allow-all	yes	Instrument control: 0% catch, 0% false positives.
block-all	yes	100% catch, 100% false positives; the budget punishes it.
Protect AI LLM Guard	yes, model download	`PromptInjection` scanner, a local DeBERTa model, no key. Scored on injection only.
Microsoft Presidio	yes, model download	PII entity recognizer, a local spaCy model, no key. Scored on PII only.
Rebuff	no	Needs a paid model API and an external vector store; cannot run offline.
NVIDIA NeMo Guardrails	no	Its rails invoke an LLM backend; cannot run offline.
Vigil	stretch	Heavy, uncertain local model and YARA setup; not run in this build.

Fairness: each tool on its own turf

LLM Guard is injection-only and Presidio is PII-only, while AxioRank spans every class. A single blended number would be apples to oranges, so each tool is scored only on the categories it declares support for, and the case count and scope are stated on every panel. The honest headline is breadth at zero false positives, not a single-axis win: a dedicated injection classifier can match a gateway on injection text, so AxioRank's edge is covering every class at once while holding false positives down.

The normalized case

Every corpus is mapped into one shape so the runner is written once:

@dataclass(frozen=True)
class Case:
    id: str
    kind: str          # "attack" | "benign"
    category: str      # injection | indirect_injection | pii | secret | destructive | ...
    modality: str      # "tool_call" | "text"
    phase: str          # "request" | "result"
    source: str
    tool: str | None
    args: dict | None
    text: str | None

Each tool is an adapter that turns a Case into a Decision{blocked, score?, latency_ms} and declares which categories it is fair to score on:

class Adapter:
    name: str
    scored: bool                # emits a continuous score (threshold-sweepable)
    supports: frozenset          # categories it is fair to evaluate on
    requires_download: bool
    requires_key: bool
    def prepare(self) -> None: ...
    def decide(self, case: Case) -> Decision: ...

Corpora and benign controls

InjecAgent (indirect injection): the poisoned tool response is an attack case; the same template with the injection removed is a matched benign case. The upstream repository ships no license, so the derived fixture is generated locally by the loader and not committed; only aggregate metrics are published.
Red-team corpus (overt attacks): AxioRank's own labeled single-call scenarios, including obfuscated payloads such as a base64-encoded key. Its allow scenarios are benign controls.

A benign case is the same workflow step with the injection removed, so any block on it is a true false positive. Multi-step kill chains need stateful taint and are measured in the enforcement benchmark, not here.

The metric: catch-rate at a false-positive budget

Per panel, per tool, restricted to the tool's supported categories, at a default budget of 2%:

Scored tools sweep the threshold and report catch-rate at the lowest threshold whose benign false positives stay within budget.
Binary tools have one operating point; if their false positives exceed the budget, they are marked over budget and cannot claim their catch-rate as usable.
Intervals are the Wilson 95% score interval, imported from the enforcement harness. A hold counts as a catch, since it stops autonomous execution.

The pre-registered gate

The headline publishes only if all hold, evaluated by the same harness:

allow-all blocks nothing (the harness invents no catches).
On every overt class, AxioRank stays within the budget and catches something.
On every overt class, no in-budget competitor beats AxioRank. A tool that leads only by exceeding the budget does not count.

Indirect injection is reported as honest context, not part of the claim: offline content scanners are weak on it because the attack is a legitimate-looking request with no payload to match, which is exactly what the gateway's information-flow control is for.

Reproduce it

cd benchmarks/competitive
pip install -e '.[all]'                      # torch + spaCy + models
python -m spacy download en_core_web_lg
pnpm --filter @axiorank/detectors build      # the engine the shim scores through
node corpora/load_redteam.mjs                # no network
python corpora/load_injecagent.py            # network once
python run.py                                # writes results/results.json
python ci_smoke.py                           # no-network gate + metric self-test

cd ../../apps/web
node scripts/gen-detection-benchmark-data.mjs

Model ids and corpus versions are recorded in the results provenance so a re-run is deterministic. The committed results are what the consistency test guards; the Python harness runs outside CI because it needs the model downloads.