Detection intelligence
How a call becomes a verdict beyond the deterministic signals, with ML assessment, a semantic judge, taint provenance, and kill-chain correlation.
The content-inspection engine and your policies decide every call in-band, deterministically. Detection intelligence is the layer around that decision: a model's semantic verdict, value-level provenance, and multi-step correlation. It catches what a single-call regex pass cannot.
Two lanes
The gateway never waits on a model. Detection runs in two lanes:
| What | Lane | Affects |
|---|---|---|
| Content inspection, risk score, redaction | Hot path, in-band | This call's verdict |
| Policy evaluation, including the IFC sink check | Hot path, in-band | This call's verdict |
| ML assessment (the semantic judge) | Post-hoc job | Alerts, response rules, future calls via mlThreatClass |
| Kill-chain correlation | Post-hoc job | Alerts, response rules |
The post-hoc lane never delays a decision. It enriches the record, raises alerts, and feeds the next decision.
ML assessment
After a tool call is evaluated, a background job can send it to AxioRank's model service for a semantic verdict: prompt injection, jailbreak, or exfiltration intent that pattern matching misses. The call is gated, in order:
- Global config: the model service must be configured for the deployment.
- Workspace opt-in: external model egress is off by default; a workspace setting turns it on.
- Plan entitlement: AI assessments are a Team and Enterprise feature.
- Worth-it gate: always assessed for a critical signal, an ambiguous output-injection, or heuristic risk at or above 40; otherwise a deterministic 5% sample of benign calls keeps a baseline of "normal".
- Monthly cap: assessments are metered against a per-plan monthly limit, checked before the spend.
Only the redacted payload (secrets already masked) ever leaves the platform.
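The gate order above can be sketched as one function that returns the first gate to stop a call. This is a minimal illustration, not AxioRank's implementation: the `Call` fields, the plan names as lowercase strings, and the hash-bucket sampling scheme are assumptions. Only the ordering, the risk threshold of 40, and the 5% sample rate come from the text.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Call:
    call_id: str
    has_critical_signal: bool = False
    ambiguous_output_injection: bool = False
    heuristic_risk: int = 0  # 0 to 100


def in_baseline_sample(call_id: str, rate_percent: int = 5) -> bool:
    """Deterministic sample: the same call id always gets the same answer.

    Hashing the id into 100 buckets is an assumed scheme for illustration.
    """
    digest = hashlib.sha256(call_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rate_percent


def worth_assessing(call: Call) -> bool:
    if call.has_critical_signal or call.ambiguous_output_injection:
        return True
    if call.heuristic_risk >= 40:
        return True
    # Otherwise keep a small baseline of benign calls.
    return in_baseline_sample(call.call_id)


def should_assess(call: Call, *, service_configured: bool, workspace_opted_in: bool,
                  plan: str, used_this_month: int, monthly_cap: int) -> str:
    """Return "assess", or the name of the first gate that stopped the call."""
    if not service_configured:
        return "global_config"
    if not workspace_opted_in:
        return "workspace_opt_in"
    if plan not in {"team", "enterprise"}:
        return "plan_entitlement"
    if not worth_assessing(call):
        return "worth_it"
    if used_this_month >= monthly_cap:  # checked before the spend
        return "monthly_cap"
    return "assess"
```

Returning the gate name rather than a boolean makes it easy to record why a call was not assessed.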
Fail-open by design
If the model service is unreachable, the assessment is recorded as unavailable and nothing else happens. The deterministic decision already stood and was already returned; ML is enrichment, never a dependency.
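A fail-open wrapper for the background job might look like the sketch below. The `AssessmentRecord` shape and the `call_model` callable are hypothetical stand-ins for the real job and model client; the point is only that a failure is recorded and never propagates.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AssessmentRecord:
    status: str  # "completed" or "unavailable"
    verdict: Optional[dict] = None


def assess_fail_open(redacted_payload: dict,
                     call_model: Callable[[dict], dict]) -> AssessmentRecord:
    """Run the model call; on any failure, record "unavailable" and stop.

    The deterministic decision was already returned to the caller, so
    nothing here can change or delay it.
    """
    try:
        verdict = call_model(redacted_payload)
    except Exception:
        # Broad on purpose in this sketch: enrichment must never raise.
        return AssessmentRecord(status="unavailable")
    return AssessmentRecord(status="completed", verdict=verdict)
```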
A completed verdict carries a calibrated mlRisk (0 to 100), a recommendation
(allow · review · block · escalate), and a threatClass: benign ·
prompt_injection · jailbreak · data_exfiltration · malware ·
social_engineering · policy_violation · unknown. When the recommendation is
block or escalate, or mlRisk is 80 or above, an ml_threat alert is raised
through your normal channels. Every persisted verdict fires the
ml.assessed webhook and drives ml_* response-rule predicates.
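The alert condition above reduces to a single predicate. The function name is hypothetical; the thresholds are the ones stated in the text.

```python
def raises_ml_threat_alert(recommendation: str, ml_risk: int) -> bool:
    """True when a completed verdict should raise an ml_threat alert."""
    return recommendation in {"block", "escalate"} or ml_risk >= 80
```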
A policy can also match on the verdict with the mlThreatClass
predicate. Because the verdict is produced asynchronously, the predicate matches
the agent's latest completed assessment, not the current call's. It is
fail-open: with no verdict on record, the predicate is simply unmet, so a policy
never denies on missing ML data.
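The fail-open behavior of the predicate can be illustrated as follows. This is a sketch under assumptions: the function name and the way the latest assessment is passed in are hypothetical, and `None` stands for "no verdict on record".

```python
from typing import Iterable, Optional


def ml_threat_class_matches(latest_threat_class: Optional[str],
                            wanted: Iterable[str]) -> bool:
    """Evaluate an mlThreatClass predicate against the agent's latest
    completed assessment (not the current call's, which is not ready yet).
    """
    if latest_threat_class is None:
        # No verdict on record: the predicate is unmet, so the policy
        # cannot deny on missing ML data.
        return False
    return latest_threat_class in set(wanted)
```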
The semantic judge
The judge's defining job is adjudicating ambiguity. The deterministic detectors can flag a possible injection in a tool output that still scores below the risk floor, for example a lone forged role marker or an embedded tool directive. That is exactly the "regex flagged it, but is it real?" case, so it is always sent for assessment regardless of risk.
Confirmed verdicts feed a self-improving loop. When the judge confirms an
injection-family threat (prompt_injection, jailbreak, data_exfiltration)
with confidence at or above 0.8, and the deterministic layer under-scored it, a
second model call generalizes the finding into a reusable custom detector. The
proposal is born disabled, marked ai_proposed, capped at 20 AI-proposed
detectors per workspace, and metered like any assessment. A human reviews and
arms it; a model never enables detection unattended.
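The conditions for proposing a detector can be sketched as a guard plus a constructor. How "under-scored" is computed is not stated in the text, so it is taken here as a boolean input; the dictionary keys are hypothetical field names.

```python
INJECTION_FAMILY = {"prompt_injection", "jailbreak", "data_exfiltration"}
MAX_AI_PROPOSED = 20  # per workspace, per the text


def should_propose_detector(threat_class: str, confidence: float,
                            under_scored: bool, existing_ai_proposed: int) -> bool:
    """True when a confirmed verdict qualifies for a detector proposal."""
    return (
        threat_class in INJECTION_FAMILY
        and confidence >= 0.8
        and under_scored
        and existing_ai_proposed < MAX_AI_PROPOSED
    )


def new_proposal(pattern: str) -> dict:
    """A proposal is always created disabled; only a human can arm it."""
    return {"pattern": pattern, "enabled": False, "origin": "ai_proposed"}
```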
Taint provenance
Information-flow control (IFC) tracks values, not just calls. When an untrusted tool returns a result, the payload's string leaves are fingerprinted at ingress: opaque salted hashes over several normalized variants of each leaf (raw, whitespace and case normalized, and base64/hex decoded) so trivial obfuscation does not break the match. The raw value is never stored, only fingerprints.
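A minimal sketch of variant fingerprinting is shown below. It assumes HMAC-SHA256 as the salted hash and a simple normalization (collapse whitespace, lowercase); the platform's actual hash, salt handling, and variant set are not specified in the text, and every function name here is hypothetical.

```python
import base64
import hashlib
import hmac


def string_leaves(payload):
    """Yield every string leaf of a nested dict/list payload."""
    if isinstance(payload, str):
        yield payload
    elif isinstance(payload, dict):
        for value in payload.values():
            yield from string_leaves(value)
    elif isinstance(payload, (list, tuple)):
        for value in payload:
            yield from string_leaves(value)


def variants(leaf: str) -> set:
    """Raw, whitespace/case normalized, and base64/hex decoded forms."""
    out = {leaf, " ".join(leaf.split()).lower()}
    stripped = leaf.strip()
    try:
        decoded = base64.b64decode(stripped, validate=True).decode("utf-8")
        if decoded:
            out.add(decoded)
    except ValueError:  # covers binascii.Error and UnicodeDecodeError
        pass
    try:
        decoded = bytes.fromhex(stripped).decode("utf-8")
        if decoded:
            out.add(decoded)
    except ValueError:
        pass
    return out


def fingerprints(leaf: str, salt: bytes) -> set:
    """Opaque salted hashes of each variant; the raw value is not kept."""
    return {
        hmac.new(salt, v.encode("utf-8"), hashlib.sha256).hexdigest()
        for v in variants(leaf)
    }
```

Because both the ingress leaf and the sink argument are expanded into variants, a value that reappears base64-encoded or with altered case still shares at least one fingerprint with the original.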
Untrusted sources are MCP servers not marked trusted (tag mcp_untrusted) plus
tool-name classes: web_fetch, inbound_email, file_read, db_read.
When a later call in the trace is a sink (egress · destructive ·
state_change), its own argument leaves are fingerprinted and checked against
the trace's accumulated untrusted set. An IFC policy rule chooses the propagation
mode:
- explicit: fires only when a tainted value provably reappeared in the sink arguments. Evasion-resistant, and it records which prior step minted the value, so the flow is a provable chain rather than an inference.
- coarse: fires when any untrusted output was seen earlier in the trace. The high-recall backstop for transformations explicit matching cannot follow.
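The difference between the two propagation modes can be sketched as a single check. The data layout is an assumption: `tainted_by_step` maps each earlier untrusted step's index to the set of fingerprints minted there, and the function returns whether the rule fires plus the minting step when one is known.

```python
from typing import Dict, Optional, Set, Tuple


def ifc_sink_fires(mode: str, sink_fingerprints: Set[str],
                   tainted_by_step: Dict[int, Set[str]]) -> Tuple[bool, Optional[int]]:
    """Return (fired, minting_step) for a sink call under the given mode."""
    if mode == "coarse":
        # Any untrusted output earlier in the trace is enough.
        return (len(tainted_by_step) > 0, None)
    if mode == "explicit":
        # A tainted value must provably reappear in the sink arguments;
        # report the earliest step that minted a matching fingerprint.
        for step in sorted(tainted_by_step):
            if tainted_by_step[step] & sink_fingerprints:
                return (True, step)
        return (False, None)
    raise ValueError(f"unknown propagation mode: {mode}")
```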
All IFC work is gated on the workspace having an enabled IFC policy; a workspace
without one pays nothing. Results proxied through the MCP gateway
are fingerprinted automatically. On the SDK path the platform only sees outputs
your code reports, so call inspectResult or pass inspectResults: true to a
framework adapter to bring tool outputs into the taint trace.
Kill-chain correlation
Single-call scoring misses the most dangerous behavior: a sequence whose steps each look fine alone. After every evaluated call, a post-hoc job loads the run's prior calls (by trace id, or by agent within a window when un-instrumented) and checks whether the just-landed call completes a dangerous ordered pattern over the most recent 20 steps:
| Pattern | Sequence | Severity |
|---|---|---|
| exfiltration | A sensitive read (secret/PII signal or a read-shaped tool), then egress | high; critical when a live secret was seen |
| recon_then_destroy | Three or more reads/lists, then a destructive call | high |
| injection_then_action | An injection signal, then a state-changing or egress call | high |
When the IFC pass proved that a value read at an earlier step reached this call's
egress arguments, the finding is marked valueConfirmed and is always
critical: the contributing steps are the exact provenance chain, not a
heuristic window.
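The three patterns can be sketched as a check on whether the just-landed call completes a sequence within the last 20 steps. The `Step` fields are a hypothetical simplification of the signals the platform records, and applying `value_confirmed` only to the exfiltration finding is an assumption based on the text's mention of egress arguments.

```python
from dataclasses import dataclass
from typing import List, Tuple

WINDOW = 20  # most recent steps considered, per the text


@dataclass
class Step:
    kind: str  # "read", "list", "egress", "destructive", "state_change", "other"
    sensitive: bool = False         # secret/PII signal on this step
    live_secret: bool = False
    injection_signal: bool = False


def detect_chains(steps: List[Step], value_confirmed: bool = False) -> List[Tuple[str, str]]:
    """Return (pattern, severity) pairs completed by the last step."""
    window = steps[-WINDOW:]
    if len(window) < 2:
        return []
    last, prior = window[-1], window[:-1]
    findings = []

    # exfiltration: a sensitive read (signal or read-shaped tool), then egress
    sensitive_reads = [s for s in prior if s.sensitive or s.kind == "read"]
    if last.kind == "egress" and sensitive_reads:
        critical = value_confirmed or any(s.live_secret for s in prior)
        findings.append(("exfiltration", "critical" if critical else "high"))

    # recon_then_destroy: three or more reads/lists, then a destructive call
    recon = sum(1 for s in prior if s.kind in ("read", "list"))
    if last.kind == "destructive" and recon >= 3:
        findings.append(("recon_then_destroy", "high"))

    # injection_then_action: an injection signal, then state change or egress
    if last.kind in ("state_change", "egress") and any(s.injection_signal for s in prior):
        findings.append(("injection_then_action", "high"))

    return findings
```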
A denied call never raises a chain alert (the attempt was already blocked), and
dedup keeps a long chain to one alert per pattern within a 30-minute cooldown.
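The suppression rules can be sketched as a small in-memory deduper. Keying the cooldown on trace id plus pattern is an assumption (the text only says one alert per pattern within the cooldown), and a real deployment would keep this state in shared storage rather than a dictionary.

```python
COOLDOWN_SECONDS = 30 * 60


class ChainAlertDeduper:
    def __init__(self) -> None:
        self._last_alert = {}  # (trace_id, pattern) -> timestamp in seconds

    def should_alert(self, trace_id: str, pattern: str, now: float,
                     denied: bool = False) -> bool:
        """Decide whether a completed chain raises a kill_chain alert."""
        if denied:
            # The attempt was already blocked; no chain alert.
            return False
        key = (trace_id, pattern)
        last = self._last_alert.get(key)
        if last is not None and now - last < COOLDOWN_SECONDS:
            return False
        self._last_alert[key] = now
        return True
```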
Findings land as kill_chain alerts in the normal triage lifecycle. A critical
exfiltration chain also emits the kill_chain.detected event. Armed response
rules can act on it (for example, by quarantining the agent), and webhooks
deliver it to your own systems.