Detection intelligence

How a call becomes a verdict beyond the deterministic signals: ML assessment, a semantic judge, taint provenance, and kill-chain correlation.

The content-inspection engine and your policies decide every call in-band, deterministically. Detection intelligence is the layer around that decision: a model's semantic verdict, value-level provenance, and multi-step correlation. It catches what a single-call regex pass cannot.

Two lanes

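The gateway never waits on a model to reach its deterministic verdict. Detection runs in two lanes, the in-band hot path and post-hoc jobs; the only in-band model call, the opt-in flow judge, runs on calls that are already held: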

What	Lane	Affects
Content inspection, risk score, redaction	Hot path, in-band	This call's verdict
Policy evaluation, including the IFC sink check	Hot path, in-band	This call's verdict
Flow judge (opt-in, on IFC holds only)	In-band, on held calls	May release this hold to allow
ML assessment (the semantic judge)	Post-hoc job	Alerts, response rules, future calls via `mlThreatClass`
Kill-chain correlation	Post-hoc job	Alerts, response rules

The post-hoc lane never delays a decision. It enriches the record, raises alerts, and feeds the next decision.

The flow judge

Coarse information-flow control is deliberately blunt: after an agent reads untrusted content, every later guarded action holds for approval, a legitimate reply just like an exfiltration. The flow judge adds the missing precision. When an IFC rule holds a call (require_approval, never a deny), the gateway makes one synchronous model call that weighs the user's declared task (intent on the SDK trace), the untrusted content the agent read, and the sink call's destination, then answers one question: does this flow serve the task, or goals embedded in the untrusted content?

A confident benign verdict releases the hold. The release is recorded on the audit row, carried in the signed receipt (provenance token v3), and visible on the intelligence page.
An attack verdict keeps the hold and raises a high-severity alert. The judge never denies; a human still decides.
Everything else fails secure: uncertain, low confidence, timeout, model error, or quota exhaustion all leave the hold exactly as the deterministic ladder produced it.
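
The three outcomes above can be sketched as a small decision function. This is an illustrative sketch, not the gateway's code: `JudgeResult`, `resolve_hold`, and the 0.8 confidence floor are hypothetical names and values (the text says "confident" without giving a number), and it assumes an attack verdict must also be confident before it raises the alert.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold: the text says "confident" without giving a number.
CONFIDENCE_FLOOR = 0.8


@dataclass
class JudgeResult:
    verdict: str       # "benign", "attack", or "uncertain"
    confidence: float  # 0.0 to 1.0


def resolve_hold(result: Optional[JudgeResult]) -> dict:
    """Apply a judge outcome to a held call.

    `None` stands for timeout, model error, or quota exhaustion.
    """
    # Default: the hold stays exactly as the deterministic ladder produced it.
    outcome = {"decision": "require_approval", "released": False, "alert": None}
    if result is None:
        return outcome
    if result.verdict == "benign" and result.confidence >= CONFIDENCE_FLOOR:
        # Only a confident benign verdict releases the hold.
        outcome.update(decision="allow", released=True)
    elif result.verdict == "attack" and result.confidence >= CONFIDENCE_FLOOR:
        # The judge never denies: the hold stays and a human decides.
        outcome["alert"] = "high"
    return outcome
```

Note that no branch produces a deny: the worst case for a legitimate call is that it stays held.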

The judge is prompt-hardened for its position on the trust boundary: untrusted excerpts are delimited as data, instructions found inside them (including instructions aimed at the judge itself) are treated as evidence of attack, and only the operator-supplied intent can define the task. Its measured behavior, including judge-aware adaptive attacks, is published on the enforcement benchmark.

Enable it under workspace settings (AI inference must be on; Team and Enterprise plans). Strict mode additionally refuses to release any hold whose sink arguments provably carry a value read from an untrusted source. Declare the task with the SDK:

with axio.trace(intent="Reply to Bob's email about the offsite") as t:
    ...

Every human approval or denial of an IFC hold also becomes a labeled flow example. After three consistent approvals of the same flow shape by at least two reviewers, AxioRank drafts a narrowly scoped allow policy for that exact flow (one agent, one tool, one destination class), born disabled for review unless the workspace chose auto mode. One later denial disables it. Learned trust stays an ordinary, receipt-citable policy row, never an opaque model override.
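
As a rough illustration of the drafting rule, the check below takes the reviews recorded for one flow shape. `Review` and `should_draft_allow_policy` are hypothetical names, and "consistent" is read here as "no denial among the reviews", which is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    reviewer: str
    approved: bool


def should_draft_allow_policy(reviews: list[Review]) -> bool:
    """True when the reviews of one flow shape justify drafting an allow policy."""
    # Assumption: a single denial makes the history inconsistent.
    if any(not r.approved for r in reviews):
        return False
    distinct_reviewers = {r.reviewer for r in reviews}
    # Three approvals, spread over at least two reviewers.
    return len(reviews) >= 3 and len(distinct_reviewers) >= 2
```

Requiring two reviewers means one person approving the same flow three times is not enough to draft a policy.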

ML assessment

After a tool call is evaluated, a background job can send it to AxioRank's model service for a semantic verdict: prompt injection, jailbreak, or exfiltration intent that pattern matching misses. The call is gated, in order:

Global config: the model service must be configured for the deployment.
Workspace opt-in: external model egress is off by default; a workspace setting turns it on.
Plan entitlement: AI assessments are a Team and Enterprise feature.
Worth-it gate: a call is always assessed when it has a critical signal, an ambiguous output-injection, or a heuristic risk at or above 40; otherwise a deterministic 5% sample of benign calls is assessed to keep a baseline of "normal".
Monthly cap: assessments are metered against a per-plan monthly limit, checked before the spend.

Only the redacted payload (secrets already masked) ever leaves the platform.
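
The gate order can be sketched as a chain of early returns. Everything named here (`Call`, `should_assess`, the plan strings) is hypothetical, and hashing the call id into 100 buckets is only one way to get a deterministic 5% sample; the source does not say how the sample is drawn.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Call:
    call_id: str
    critical_signal: bool
    ambiguous_output_injection: bool
    heuristic_risk: int  # 0 to 100


def in_baseline_sample(call_id: str, rate_percent: int = 5) -> bool:
    """Deterministic sample: the same call id always gives the same answer."""
    digest = hashlib.sha256(call_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rate_percent


def worth_assessing(call: Call) -> bool:
    if call.critical_signal or call.ambiguous_output_injection:
        return True
    if call.heuristic_risk >= 40:
        return True
    return in_baseline_sample(call.call_id)


def should_assess(call: Call, *, service_configured: bool, workspace_opted_in: bool,
                  plan: str, used_this_month: int, monthly_cap: int) -> bool:
    """Walk the gates in the documented order; any failed gate skips the assessment."""
    if not service_configured:
        return False
    if not workspace_opted_in:
        return False
    if plan not in {"team", "enterprise"}:  # hypothetical plan identifiers
        return False
    if not worth_assessing(call):
        return False
    # The cap is checked before the spend, so the last unit is never overdrawn.
    return used_this_month < monthly_cap
```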

Fail-open by design

If the model service is unreachable, the assessment is recorded as unavailable and nothing else happens. The deterministic decision already stood and was already returned; ML is enrichment, never a dependency.

A completed verdict carries a calibrated mlRisk (0 to 100), a recommendation (allow · review · block · escalate), and a threatClass: benign · prompt_injection · jailbreak · data_exfiltration · malware · social_engineering · policy_violation · unknown. When the recommendation is block or escalate, or mlRisk is 80 or above, an ml_threat alert is raised through your normal channels. Every persisted verdict fires the ml.assessed webhook and drives ml_* response-rule predicates.
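
The alert condition is a simple disjunction. A minimal sketch, with `raises_ml_threat_alert` as a hypothetical helper name:

```python
ALERTING_RECOMMENDATIONS = {"block", "escalate"}


def raises_ml_threat_alert(recommendation: str, ml_risk: int) -> bool:
    """True when a completed verdict should raise an ml_threat alert."""
    return recommendation in ALERTING_RECOMMENDATIONS or ml_risk >= 80
```

A `review` recommendation with an mlRisk of 80 still alerts, because either condition alone is sufficient.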

A policy can also match on the verdict with the mlThreatClass predicate. Because the verdict is produced asynchronously, the predicate matches the agent's latest completed assessment, not the current call's. It is fail-open: with no verdict on record, the predicate is simply unmet, so a policy never denies on missing ML data.
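
The fail-open behavior can be pictured as follows. This is a sketch under assumptions: `ml_threat_class_matches` is a hypothetical function, and the verdict is modeled as a plain dict holding the agent's latest completed assessment.

```python
from typing import Optional


def ml_threat_class_matches(latest_verdict: Optional[dict], wanted: set[str]) -> bool:
    """Evaluate an mlThreatClass predicate against the agent's latest completed verdict."""
    if latest_verdict is None:
        # No verdict on record: the predicate is unmet, never an error or a deny.
        return False
    return latest_verdict.get("threatClass") in wanted
```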

The semantic judge

The judge's defining job is adjudicating ambiguity. When the deterministic detectors flag a possible injection in a tool output (a lone forged role marker or embedded tool directive can score below the risk floor), that is exactly the "regex flagged it, but is it real?" case, so the call is always sent for assessment regardless of risk.

Confirmed verdicts feed a self-improving loop. When the judge confirms an injection-family threat (prompt_injection, jailbreak, data_exfiltration) with confidence at or above 0.8, and the deterministic layer under-scored it, a second model call generalizes the finding into a reusable custom detector. The proposal is born disabled, marked ai_proposed, capped at 20 AI-proposed detectors per workspace, and metered like any assessment. A human reviews and arms it; a model never enables detection unattended.
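
The conditions for proposing a detector can be sketched as below. `propose_detector` and the shape of the returned record are hypothetical, the generalizing model call is left out, and how "under-scored" is computed is not specified in the text, so it is passed in as a flag.

```python
from typing import Optional

INJECTION_FAMILY = {"prompt_injection", "jailbreak", "data_exfiltration"}
MAX_AI_PROPOSED = 20  # per-workspace cap on AI-proposed detectors


def propose_detector(threat_class: str, confidence: float, under_scored: bool,
                     ai_proposed_count: int) -> Optional[dict]:
    """Return a disabled detector proposal, or None when the loop should not fire."""
    if threat_class not in INJECTION_FAMILY:
        return None
    if confidence < 0.8 or not under_scored:
        return None
    if ai_proposed_count >= MAX_AI_PROPOSED:
        return None
    # Born disabled: only a human review can arm it.
    return {"source": "ai_proposed", "enabled": False}
```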

Taint provenance

Information-flow control (IFC) tracks values, not just calls. When an untrusted tool returns a result, the payload's string leaves are fingerprinted at ingress: opaque salted hashes over several normalized variants of each leaf (raw, whitespace and case normalized, and base64/hex decoded) so trivial obfuscation does not break the match. The raw value is never stored, only fingerprints.
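
To make the fingerprinting idea concrete, here is a minimal sketch. It assumes HMAC-SHA256 as the salted hash and a per-workspace salt, neither of which the source specifies, and all function names are illustrative.

```python
import base64
import hashlib
import hmac
from typing import Any, Iterator


def string_leaves(payload: Any) -> Iterator[str]:
    """Yield every string leaf of a nested dict/list payload."""
    if isinstance(payload, str):
        yield payload
    elif isinstance(payload, dict):
        for value in payload.values():
            yield from string_leaves(value)
    elif isinstance(payload, (list, tuple)):
        for item in payload:
            yield from string_leaves(item)


def variants(leaf: str) -> set[str]:
    """Raw, whitespace/case-normalized, and base64/hex-decoded forms of one leaf."""
    out = {leaf, " ".join(leaf.split()).lower()}
    stripped = leaf.strip()
    try:
        out.add(base64.b64decode(stripped, validate=True).decode("utf-8"))
    except ValueError:  # also covers binascii.Error and UnicodeDecodeError
        pass
    try:
        out.add(bytes.fromhex(stripped).decode("utf-8"))
    except ValueError:
        pass
    out.discard("")
    return out


def fingerprint_payload(payload: Any, salt: bytes) -> set[str]:
    """Opaque fingerprints for every variant of every string leaf; raw values are not kept."""
    prints = set()
    for leaf in string_leaves(payload):
        for variant in variants(leaf):
            prints.add(hmac.new(salt, variant.encode("utf-8"), hashlib.sha256).hexdigest())
    return prints
```

Because both the ingress value and the later sink argument are expanded into the same variants, a base64-encoded or re-cased copy of a tainted value still shares at least one fingerprint with the original.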

Untrusted sources are MCP servers not marked trusted (tag mcp_untrusted) plus tool-name classes: web_fetch, inbound_email, file_read, db_read.

When a later call in the trace is a sink (egress · destructive · state_change), its own argument leaves are fingerprinted and checked against the trace's accumulated untrusted set. An IFC policy rule chooses the propagation mode:

explicit: fires only when a tainted value provably reappeared in the sink arguments. Evasion-resistant, and it records which prior step minted the value, so the flow is a provable chain rather than an inference.
coarse: fires when any untrusted output was seen earlier in the trace. The high-recall backstop for transformations explicit matching cannot follow.
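
The difference between the two modes can be sketched with a small in-memory taint set. `TraceTaint` and `check_sink` are hypothetical names, and fingerprints are treated as opaque strings.

```python
from dataclasses import dataclass, field


@dataclass
class TraceTaint:
    saw_untrusted: bool = False
    # fingerprint -> index of the step that first minted it
    minted_by: dict[str, int] = field(default_factory=dict)

    def record_untrusted(self, step: int, fingerprints: set[str]) -> None:
        self.saw_untrusted = True
        for fp in fingerprints:
            self.minted_by.setdefault(fp, step)


def check_sink(taint: TraceTaint, sink_fingerprints: set[str], mode: str) -> dict:
    """Decide whether an IFC rule fires for a sink call under the given mode."""
    if mode == "explicit":
        # Fires only on a proven reappearance, and names the minting steps.
        steps = sorted({taint.minted_by[fp] for fp in sink_fingerprints
                        if fp in taint.minted_by})
        return {"fires": bool(steps), "source_steps": steps}
    if mode == "coarse":
        # Fires on any earlier untrusted output, whatever the arguments are.
        return {"fires": taint.saw_untrusted, "source_steps": []}
    raise ValueError(f"unknown propagation mode: {mode}")
```

With the same trace, a sink whose arguments share no fingerprint with earlier untrusted output fires under `coarse` but not under `explicit`.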

All IFC work is gated on the workspace having an enabled IFC policy; a workspace without one pays nothing. Results proxied through the MCP gateway are fingerprinted automatically. On the SDK path the platform only sees outputs your code reports, so call inspectResult or pass inspectResults: true to a framework adapter to bring tool outputs into the taint trace.

Taint across an agent handoff

Taint normally lives inside one trace. But a multi-agent system launders values across the boundary: agent A reads an untrusted page, hands its output to agent B, and B's run starts with no idea the value is tainted. Cross-agent lineage closes that gap.

When a result-phase call carries taint, the gateway mints a signed taint handle: a canonicalized, Ed25519-signed envelope of the run's accumulated fingerprints, returned as taintHandle (TypeScript) / taint_handle (Python) on the result. On the receiving side, pass it on the handoff (the taintHandle argument to report_result / inspect_result, or the matching SDK field) and the gateway re-seeds agent B's trace from it, so a value laundered through a second agent is still caught at B's next sink.

The handle is never trusted blindly, and inbound is never false-clean. A handle that is forged, expired, or issued by another workspace is rejected, and B's trace is still seeded with an a2a_inbound taint computed from the inbound payload itself. A trusted, same-workspace handle additionally records a lineage edge (in taint_lineage, with taint_facts.parent_trace_id) so the trace view shows the provable chain from A's read to B's sink. An a2a_inbound value is conservative by construction: it is seeded, not exonerated.
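
The seeding rule on the receiving side can be sketched as follows. Signature verification is deliberately left out and represented by a boolean; `HandleCheck` and `seed_receiving_trace` are hypothetical names, and the sketch assumes the payload-derived `a2a_inbound` seeding happens whether or not the handle is trusted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HandleCheck:
    signature_valid: bool       # result of Ed25519 verification, done elsewhere
    expired: bool
    issuer_workspace: str
    parent_trace_id: str
    fingerprints: frozenset


def seed_receiving_trace(check: Optional[HandleCheck], workspace: str,
                         payload_fingerprints: set[str]) -> dict:
    """Seed agent B's trace from an inbound handoff; never start false-clean."""
    # Inbound payload is always seeded as a2a_inbound taint.
    seeded = {fp: "a2a_inbound" for fp in payload_fingerprints}
    trusted = (check is not None and check.signature_valid and not check.expired
               and check.issuer_workspace == workspace)
    lineage_parent = None
    if trusted:
        # Re-seed from agent A's accumulated fingerprints and record the lineage edge.
        for fp in check.fingerprints:
            seeded.setdefault(fp, "handle")
        lineage_parent = check.parent_trace_id
    return {"seeded": seeded, "trusted": trusted, "lineage_parent": lineage_parent}
```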

Kill-chain correlation

Single-call scoring misses the most dangerous behavior: a sequence whose steps each look fine alone. After every evaluated call, a post-hoc job loads the run's prior calls (by trace id, or by agent within a time window when the run is uninstrumented) and checks whether the just-landed call completes a dangerous ordered pattern over the most recent 20 steps:

Pattern	Sequence	Severity
`exfiltration`	A sensitive read (secret/PII signal or a read-shaped tool), then egress	`high`; `critical` when a live secret was seen
`recon_then_destroy`	Three or more reads/lists, then a destructive call	`high`
`injection_then_action`	An injection signal, then a state-changing or egress call	`high`

When the IFC pass proved that a value read at an earlier step reached this call's egress arguments, the finding is marked valueConfirmed and is always critical: the contributing steps are the exact provenance chain, not a heuristic window.
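
The three patterns can be sketched as checks on the just-landed call against the steps before it. The `Step` flags and `detect_chains` are hypothetical; the sketch assumes the 20-step window includes the landed call and that `valueConfirmed` upgrades the exfiltration finding.

```python
from dataclasses import dataclass

WINDOW = 20


@dataclass
class Step:
    sensitive_read: bool = False
    read_or_list: bool = False
    egress: bool = False
    destructive: bool = False
    state_change: bool = False
    injection_signal: bool = False
    live_secret: bool = False


def detect_chains(prior: list[Step], landed: Step,
                  value_confirmed: bool = False) -> dict[str, str]:
    """Return {pattern: severity} for every chain the landed call completes."""
    window = (prior + [landed])[-WINDOW:]
    earlier = window[:-1]
    findings: dict[str, str] = {}
    if landed.egress and any(s.sensitive_read for s in earlier):
        live = any(s.live_secret for s in earlier)
        findings["exfiltration"] = "critical" if (live or value_confirmed) else "high"
    if landed.destructive and sum(s.read_or_list for s in earlier) >= 3:
        findings["recon_then_destroy"] = "high"
    if (landed.state_change or landed.egress) and any(s.injection_signal for s in earlier):
        findings["injection_then_action"] = "high"
    return findings
```

Each pattern is ordered: the triggering condition is always on the landed call, and the precondition must appear among the earlier steps in the window.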

A denied call never raises a chain alert (the attempt was already blocked), and dedup keeps a long chain to one alert per pattern within a 30-minute cooldown. Findings land as kill_chain alerts in the normal triage lifecycle. A critical exfiltration chain also emits the kill_chain.detected event, which armed response rules can act on (for example, quarantining the agent) and webhooks deliver to your own systems.
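
The suppression rules can be sketched with an in-memory cooldown table. `ChainAlerter` is a hypothetical class; keying the cooldown on run id plus pattern, and not restarting it on suppressed findings, are both assumptions.

```python
from datetime import datetime, timedelta

COOLDOWN = timedelta(minutes=30)


class ChainAlerter:
    def __init__(self) -> None:
        # (run_id, pattern) -> time of the last alert that was actually raised
        self._last_alert: dict[tuple[str, str], datetime] = {}

    def should_alert(self, run_id: str, pattern: str, denied: bool,
                     now: datetime) -> bool:
        """Decide whether a chain finding becomes a kill_chain alert."""
        if denied:
            # The attempt was already blocked, so no chain alert.
            return False
        key = (run_id, pattern)
        last = self._last_alert.get(key)
        if last is not None and now - last < COOLDOWN:
            return False
        self._last_alert[key] = now
        return True
```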

Cross-tenant threat intel (k of 5)

Participation is opt-in (threat_intel_enabled) and covers both directions: opting in contributes your anonymized flags and lets network signal enrich your decisions. A k of 5 anonymity floor governs every shared indicator: an external identity is only ever surfaced once at least five distinct workspaces have independently flagged it. Below the floor it stays invisible, your flags are never attributable to you, and the lookup that enriches your decisions excludes your own workspace. Once an indicator clears the floor, what participating tenants have seen enriches verification for every participating tenant, with no single tenant's activity ever exposed. See the response engine for how flags are contributed.
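
The anonymity floor can be sketched as a lookup over a map from indicator to the set of workspaces that flagged it. `network_signal` is a hypothetical function. Excluding the asking workspace before applying the floor is one conservative reading of the text, not a documented rule.

```python
K_FLOOR = 5


def network_signal(flags: dict[str, set[str]], indicator: str,
                   asking_workspace: str) -> bool:
    """True when the network has surfaced this indicator to the asking workspace."""
    # Exclude the asker's own flags so a tenant never enriches itself.
    others = flags.get(indicator, set()) - {asking_workspace}
    # Only a yes/no leaves the lookup: no counts, no workspace identities.
    return len(others) >= K_FLOOR
```

Returning only a boolean keeps the result from revealing how many tenants flagged the indicator, or which ones.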
