Prompt injection defense for AI agents

Stop prompt injection before your agent acts

Prompt injection is the top security risk for AI agents. AxioRank catches overt injection payloads at the content layer and the subtler indirect kind at the gateway, which holds an agent that has read untrusted content before it can act on it.

Two kinds of prompt injection

Direct injection

An attacker plants "ignore your instructions" text or an SSRF payload directly in a tool call to override the agent's task. Such attacks carry a recognizable payload that a detector can match.

Indirect injection

A benign-sounding instruction is smuggled into content a tool returns: a fetched page, an email, an MCP reply. There is no payload to match, so content scanning alone cannot catch it. The agent version is what researchers call agentjacking.

How AxioRank defends both

Detected at the content layer

The shipped detection engine flags overt injection and SSRF markers on every tool call and on the output a tool returns, with recursive base64 and hex decoding so an obfuscated payload cannot hide. New rules install in monitor mode first, then promote to deny or hold.
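
To make the recursive decoding idea concrete, here is a minimal sketch in Python. It is not AxioRank's engine: the marker patterns, the run-length cutoffs, and the names `MARKERS`, `_decode_layers`, and `scan` are all assumptions chosen for illustration. It only shows why peeling base64 and hex layers before matching stops a wrapped payload from slipping through.

```python
import base64
import binascii
import re

# Hypothetical markers for illustration only; not AxioRank's rule set.
MARKERS = [
    re.compile(r"ignore (all )?(previous|prior|your) instructions", re.I),
    re.compile(r"169\.254\.169\.254"),  # cloud metadata address, a common SSRF target
]

# Runs long enough to plausibly be an encoded payload (cutoffs are assumptions).
B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")
HEX_RUN = re.compile(r"(?:[0-9a-fA-F]{2}){8,}")


def _decode_layers(text):
    """Yield strings recovered from base64 or hex runs found in text."""
    for match in B64_RUN.finditer(text):
        try:
            raw = base64.b64decode(match.group(0), validate=True)
        except (binascii.Error, ValueError):
            continue  # not valid base64, skip it
        yield raw.decode("utf-8", errors="ignore")
    for match in HEX_RUN.finditer(text):
        yield bytes.fromhex(match.group(0)).decode("utf-8", errors="ignore")


def scan(text, max_depth=3):
    """Return the marker patterns found in text or in any decoded layer."""
    hits = [m.pattern for m in MARKERS if m.search(text)]
    if max_depth > 0:
        for decoded in _decode_layers(text):
            hits.extend(scan(decoded, max_depth - 1))
    return hits
```

The depth limit bounds the work an attacker can force with deeply nested encodings. The same `scan` would run on a tool call's arguments and on the output the tool returns.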

Caught at the gateway

Indirect injection is an information-flow problem, not a string match. AxioRank tracks that an agent read untrusted content, and when that tainted flow reaches a sensitive action the call is held or denied. This is the part single-purpose scanners miss.
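
The information-flow idea can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not the gateway's implementation: the tool names, the `Session` and `Verdict` types, and the `gate` function are hypothetical. The point is that the decision depends on what the agent has read, not on the text of the call.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    HOLD = "hold"


# Hypothetical tool classification for illustration only.
UNTRUSTED_SOURCES = {"fetch_url", "read_email", "mcp_call"}
SENSITIVE_ACTIONS = {"send_email", "delete_resource", "http_post"}


@dataclass
class Session:
    """Per-agent state: the untrusted sources the agent has read so far."""
    tainted_by: list = field(default_factory=list)


def gate(session, tool):
    """Decide on a tool call from information flow, not from its content."""
    if tool in SENSITIVE_ACTIONS and session.tainted_by:
        return Verdict.HOLD  # tainted flow reached a sensitive action
    if tool in UNTRUSTED_SOURCES:
        session.tainted_by.append(tool)  # remember that untrusted content was read
    return Verdict.ALLOW
```

In this sketch, `send_email` is allowed in a fresh session and held once the same session has called `fetch_url`, even though the email call itself contains nothing a scanner could flag. Session-wide taint is the coarsest possible granularity; it is used here only to keep the example short.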

Governed, with proof

Every verdict lands in the audit log with the signal that fired, and each decision can be sealed into an offline-verifiable proof. You can show that an injection attempt was caught, not just claim it.
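
This page does not say how the proof is constructed. As one possible reading of "offline-verifiable", the sketch below uses a plain SHA-256 hash chain, a common technique for tamper-evident logs. The record fields and the names `GENESIS`, `seal`, and `verify` are assumptions for illustration, not AxioRank's format.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical starting value for the chain


def seal(prev_hash, record):
    """Return a log entry whose hash covers the record and the previous hash."""
    body = json.dumps(record, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()
    return {"prev": prev_hash, "record": record, "hash": digest}


def verify(entries):
    """Recompute every hash with no network access; edits or reordering fail."""
    prev = GENESIS
    for entry in entries:
        recomputed = seal(prev, entry["record"])["hash"]
        if entry["prev"] != prev or recomputed != entry["hash"]:
            return False
        prev = entry["hash"]
    return True
```

A chain like this only shows that the log was not altered after the fact if the latest hash is also anchored somewhere the log's owner cannot rewrite, for example by signing or publishing it.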

The proof, on overt injection

A reproducible head-to-head against the open-source guardrails we can actually run, at a fixed 2% false-positive budget. A bar drawn in red reaches its height only by exceeding that budget, so a tall red bar is an outage, not a defense.
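
Scoring a detector at a fixed false-positive budget can be sketched as follows. This is a generic illustration, not the benchmark's harness: the function name `recall_at_fpr_budget` and the convention that a sample is flagged when its score is at or above the threshold are assumptions.

```python
def recall_at_fpr_budget(benign_scores, attack_scores, budget=0.02):
    """Pick the lowest threshold whose false-positive rate fits the budget,
    then return (threshold, false_positive_rate, recall) at that threshold."""
    candidates = sorted(set(benign_scores) | set(attack_scores))
    candidates.append(float("inf"))  # flagging nothing always fits the budget
    for threshold in candidates:
        fpr = sum(s >= threshold for s in benign_scores) / len(benign_scores)
        if fpr <= budget:
            recall = sum(s >= threshold for s in attack_scores) / len(attack_scores)
            return threshold, fpr, recall
```

Because both rates can only fall as the threshold rises, the lowest threshold that fits the budget gives the best recall the detector can honestly claim. A detector that reports higher recall than this has spent more than its budget, which is what a red bar marks.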

On 2 injection flows, AxioRank caught 100% at 0.0% false positives, scored through the shipped engine so it measures exactly what the gateway default does. See every class and caveat.

Where content scanning ends and the gateway begins

On subtle indirect injection (a benign-sounding instruction hidden in tool output), every offline content scanner is weak, ours included. This is not a scanning problem to tune away: the attack reads as a legitimate request, so there is no payload to match. AxioRank catches it at the gateway by tracking that the agent read untrusted content before it acted, which is what the enforcement benchmark measures. We show the weak scanning result here rather than quietly dropping the panel.

Questions

What is prompt injection?

Prompt injection is when untrusted text steers an AI agent into doing something its operator did not intend, either by overriding its instructions directly in a tool call or by hiding an instruction inside content the agent reads. For an autonomous agent that can call tools, a successful injection can turn into real action: exfiltrating data, deleting resources, or calling an attacker's server.

Can you catch indirect (second-order) injection?

Yes, but not by scanning for a payload, because there is not one to match. AxioRank tracks provenance: it records that the agent read untrusted content, and when that tainted data flows into a sensitive action the gateway holds or denies the call. Offline content scanners, ours included, are weak on this class, and we say so on the benchmark rather than quietly dropping the panel.

Do I have to write detection rules?

No. The detection engine ships with the injection and SSRF detectors on by default, and a one-click starter pack adds sensible policies in monitor mode so you see what would have been blocked before anything is enforced. You promote a rule to deny or hold when you are ready.

How is this different from a WAF or a single injection classifier?

A single classifier covers one class and often only at a false-positive rate that would block legitimate work. AxioRank covers every overt attack class at zero false positives in our benchmark, and it adds the gateway information-flow layer that catches indirect injection a classifier cannot see. It is defense at the agent's action boundary, not at a network edge.

See it catch an injection payload

Paste a tool call into the playground and watch the engine score it in your browser, no signup. Then wire the SDK to govern your own agents.
