AI detection

A model in the loop, judging the hard calls.

When a rule holds a call, a model adjudicates whether to release it, never overriding a deny and failing secure if it cannot answer. The same lane scores what you send a model and what it sends back for toxicity, jailbreaks, and groundedness.


synchronous judge · fail-secure · model-I/O scored fail-open

flow judge · verdict

{
  "decision": "hold -> allow",
  "judge": {
    "model": "gpt-4.1",
    "verdict": "benign",
    "confidence": 0.94
  },
  "neverOverrides": "deny"
}

release-only · fail-secure · a deny stays a deny

Benign holds after the judge: 25% → 5%

Denies the judge can override: none

Adjudicating model: gpt-4.1

Available from: Team and up

The trade every gateway hits

Catch everything, or stop holding good work. Pick both.

Plain detectors miss the integrity attacks where untrusted data drives a dangerous action. Information-flow control catches them all, but it over-holds legitimate work. A model judge resolves the held calls so the catch rate stays high and the false holds fall.

[Chart: share of attack sinks blocked (green) against share of legitimate actions held by mistake (red). Numbers from the enforcement benchmark; see the methodology.]

The flow judge

It releases held calls. It cannot release a denied one.

The judge runs only on calls the policy engine put on hold, and only to decide release. A deny is final. If the judge times out or is unsure, the call stays held. Intelligence sharpens the decision without ever weakening it.
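The release-only, fail-secure behavior can be pictured with a minimal sketch. This is an illustration, not the product's implementation: `resolve`, `Verdict`, and the 0.9 confidence floor are assumptions made here, since the source states no threshold or function names.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Assumed threshold for illustration; the source does not publish one.
CONFIDENCE_FLOOR = 0.9


@dataclass
class Verdict:
    label: str         # e.g. "benign"
    confidence: float  # 0.0 to 1.0


def resolve(decision: str, judge: Callable[[], Optional[Verdict]]) -> str:
    """Return the final decision for a call: "allow", "hold", or "deny"."""
    if decision != "hold":
        # Release-only: the judge is consulted for holds and nothing else,
        # so an allow stays an allow and a deny stays a deny.
        return decision
    try:
        verdict = judge()  # may raise on timeout or error
    except Exception:
        return "hold"      # fail-secure: an error leaves the call held
    if verdict is None:
        return "hold"      # no verdict in time
    if verdict.label == "benign" and verdict.confidence >= CONFIDENCE_FLOOR:
        return "allow"     # the only transition the judge can make
    return "hold"          # low confidence or a non-benign verdict
```

Under these assumptions, the only path to "allow" runs through a confident benign verdict on a held call; every other branch returns the decision unchanged or keeps the hold.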

Release-only

The judge can turn a hold into an allow. It is never consulted to turn a deny into anything else.

Fail-secure

No verdict in time, low confidence, or an error all leave the call held for a human.

Learns your flows

A human approving a held pattern can mint a scoped endorsement, so the same safe flow stops being held.
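A scoped endorsement might be modeled as below. This is a rough sketch under stated assumptions: the `Endorsement` fields (`source`, `sink`, `scope`) and the `EndorsementStore` class are hypothetical names chosen for illustration, and the source does not describe how endorsements are stored or matched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endorsement:
    source: str  # hypothetical: where the data came from
    sink: str    # hypothetical: the action the data flows into
    scope: str   # hypothetical: the team or project the approval covers


class EndorsementStore:
    def __init__(self) -> None:
        self._endorsed = set()

    def approve(self, source: str, sink: str, scope: str) -> Endorsement:
        """Record a human approval of a held flow as a scoped endorsement."""
        endorsement = Endorsement(source, sink, scope)
        self._endorsed.add(endorsement)
        return endorsement

    def decide(self, decision: str, source: str, sink: str, scope: str) -> str:
        """Release a hold only when the exact flow was endorsed in this scope."""
        if decision == "hold" and Endorsement(source, sink, scope) in self._endorsed:
            return "allow"
        return decision  # denies and unendorsed holds pass through unchanged
```

Matching on the full (source, sink, scope) triple keeps the approval narrow: the same flow in another scope is still held, and a deny is never affected.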

Model I/O guardrails

Score the prompt and the response, not just the tools.

The same lane assesses what goes into a model and what comes out: toxicity from an omni-moderation pass and groundedness from a model check, returned as a class your policies and response rules can target. It runs fail-open, so it raises risk without ever blocking on its own.

944 assessed

Grounded and safe: 812
Jailbreak attempt: 64
Injected instruction: 41
Toxic or ungrounded: 27

Assess a prompt and a completion

POST /v1/assess/model-io
{
  "phase": "completion",
  "text": "<model output>",
  "checks": ["toxicity", "groundedness"]
}
// -> { "class": "policy_violation", "score": 0.81 }
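
A caller might wrap this endpoint so that it stays fail-open, as in the sketch below. The request and response fields mirror the example above; `build_request`, `assess_fail_open`, and the injected `send` transport are hypothetical names introduced here, and authentication and the base URL are left out because the source does not give them.

```python
import json
from typing import Callable, List, Optional


def build_request(phase: str, text: str, checks: List[str]) -> str:
    """Serialize the body for POST /v1/assess/model-io."""
    return json.dumps({"phase": phase, "text": text, "checks": checks})


def assess_fail_open(
    send: Callable[[str], str],  # hypothetical transport: takes a body, returns the response body
    phase: str,
    text: str,
    checks: List[str],
) -> Optional[dict]:
    """Return the class and score, or None when the assessment is unavailable.

    Fail-open: a transport error or malformed response yields no signal
    rather than a block, so policies decide what to do with the result.
    """
    try:
        result = json.loads(send(build_request(phase, text, checks)))
        return {"class": result["class"], "score": float(result["score"])}
    except Exception:
        return None
```

Passing the transport in as a function keeps the sketch independent of any HTTP client; a real caller would supply one that posts the body and returns the response text.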

Team and above

Closing the loop