AI Gateway (drop-in proxy)

Govern every model call across your org by changing one base URL. No application code changes.

The AI Gateway is a drop-in, OpenAI-compatible proxy. Point your existing OpenAI or Azure OpenAI client at it and add one header. Every model call your org makes is then scored, policy-checked, spend-metered, and audited, with no per-team code changes. It is the fastest way to put governance in front of model traffic that is already running in production.

Two changes, nothing else:

Set base_url to https://app.axiorank.com/api/proxy/v1.
Add the header X-AxioRank-Key: axr_live_… (a gateway key from Settings → API keys).

No key yet? The gateway quickstart mints one, fills it into these snippets, and watches for your first call to land.

Your provider key keeps riding in Authorization exactly as before. The proxy forwards it to the provider for that one request and never stores it.

Spend capture is automatic on every call, on any plan, so pointing a base_url here gives you cost visibility immediately. Inline enforcement (deny, hold, and redact) plus audit logging activate once you turn on model-I/O governance in Settings → Governance.

OpenAI

from openai import OpenAI

client = OpenAI(
    base_url="https://app.axiorank.com/api/proxy/v1",
    api_key="sk-...",  # your OpenAI key, forwarded upstream, never stored
    default_headers={"X-AxioRank-Key": "axr_live_..."},
)

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize Q3 results"}],
)

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://app.axiorank.com/api/proxy/v1",
  apiKey: process.env.OPENAI_API_KEY, // forwarded upstream, never stored
  defaultHeaders: { "X-AxioRank-Key": process.env.AXIORANK_KEY! },
});

const resp = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Summarize Q3 results" }],
});

The same client also covers the Responses API. Once base_url points at the proxy, client.responses.create(...) is governed exactly like chat completions:

resp = client.responses.create(
    model="gpt-4o",
    instructions="You are a helpful assistant.",
    input="Summarize Q3 results",
)

Azure OpenAI

Use the standard OpenAI client and tell the proxy to route to Azure with a few headers. The proxy builds the deployment-scoped Azure URL and authenticates with the api-key Azure expects.

from openai import OpenAI

client = OpenAI(
    base_url="https://app.axiorank.com/api/proxy/v1",
    api_key="<your-azure-api-key>",
    default_headers={
        "X-AxioRank-Key": "axr_live_...",
        "X-AxioRank-Upstream": "azure",
        "X-AxioRank-Azure-Resource": "my-resource",      # my-resource.openai.azure.com
        "X-AxioRank-Azure-Deployment": "gpt-4o",
        "X-AxioRank-Azure-Api-Version": "2024-10-21",
    },
)
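
For orientation, the deployment-scoped URL mentioned above follows Azure OpenAI's public URL layout. The sketch below only illustrates that layout from the three routing headers; azure_chat_url is a hypothetical helper name, and how the proxy assembles the URL internally is not documented here.

```python
from urllib.parse import quote, urlencode


def azure_chat_url(resource: str, deployment: str, api_version: str) -> str:
    """Return the Azure OpenAI chat completions URL for one deployment.

    Hypothetical helper: it mirrors Azure's public URL layout, not the
    gateway's internal code.
    """
    base = f"https://{quote(resource, safe='')}.openai.azure.com"
    path = f"/openai/deployments/{quote(deployment, safe='')}/chat/completions"
    return f"{base}{path}?{urlencode({'api-version': api_version})}"


# Values taken from the header example above.
print(azure_chat_url("my-resource", "gpt-4o", "2024-10-21"))
```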

Named providers (presets)

X-AxioRank-Upstream accepts a provider name for every major OpenAI-compatible provider, using the same presets the OSS gateway ships. The gateway expands the name to that provider's endpoint. Your key for that provider rides in Authorization exactly as an OpenAI key would.

default_headers={
    "X-AxioRank-Key": "axr_live_...",
    "X-AxioRank-Upstream": "groq",   # Authorization carries your Groq key
}

Preset	Endpoint
`groq`	`https://api.groq.com/openai/v1`
`together`	`https://api.together.xyz/v1`
`fireworks`	`https://api.fireworks.ai/inference/v1`
`mistral`	`https://api.mistral.ai/v1`
`deepseek`	`https://api.deepseek.com/v1`
`xai`	`https://api.x.ai/v1`
`perplexity`	`https://api.perplexity.ai`
`anthropic`	`https://api.anthropic.com/v1` (OpenAI-compatible)
`gemini`	`https://generativelanguage.googleapis.com/v1beta/openai`
`cerebras`	`https://api.cerebras.ai/v1`
`sambanova`	`https://api.sambanova.ai/v1`
`deepinfra`	`https://api.deepinfra.com/v1/openai`
`novita`	`https://api.novita.ai/openai`
`hyperbolic`	`https://api.hyperbolic.xyz/v1`
`nebius`	`https://api.tokenfactory.nebius.com/v1`
`moonshot`	`https://api.moonshot.ai/v1` (Kimi)
`zhipu`	`https://api.z.ai/api/paas/v4` (GLM)
`qwen`	`https://dashscope-intl.aliyuncs.com/compatible-mode/v1`
`nvidia`	`https://integrate.api.nvidia.com/v1` (NIM)
`huggingface`	`https://router.huggingface.co/v1` (Inference Providers)
`github`	`https://models.github.ai/inference` (GitHub Models, PAT auth)
`cohere`	`https://api.cohere.ai/compatibility/v1`
`baseten`	`https://inference.baseten.co/v1` (Model APIs)
`featherless`	`https://api.featherless.ai/v1`
`siliconflow`	`https://api.siliconflow.com/v1`
`scaleway`	`https://api.scaleway.ai/v1`
`minimax`	`https://api.minimax.io/v1`

The anthropic and gemini presets use those providers' OpenAI-compatible endpoints, for clients that only speak the OpenAI shape. If you use the native Anthropic or Gemini SDK, prefer the native surfaces below.

Where a provider has international and China-domestic domains (moonshot, zhipu, qwen, siliconflow, minimax), the preset is the international endpoint; point a custom base URL at the domestic one if you need it.

The OSS gateway's local presets (ollama, vllm, lmstudio, llamacpp, sglang, koboldcpp, jan) point at localhost and are rejected here; run npx @axiorank/gateway next to a local model instead.

See the model catalog for every model these presets reach, with prices and copy-paste snippets.

Other OpenAI-compatible endpoints

Set X-AxioRank-Upstream to openrouter, or to any full base URL of an OpenAI-compatible endpoint (vLLM, LiteLLM, or a self-hosted server). The default is OpenAI.

default_headers={
    "X-AxioRank-Key": "axr_live_...",
    "X-AxioRank-Upstream": "https://llm.example.com/v1",
}

A custom upstream must be a public https endpoint. Loopback, private-network, link-local, and internal hostnames are rejected: the hosted gateway will not forward into its own network. For a model on localhost or a private network, run the OSS gateway (npx @axiorank/gateway) next to it; it applies the same guardrails and receipts with no such restriction.
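
A local pre-flight check can catch an obviously unusable upstream before a request is sent. The sketch below is an approximation under stated assumptions: looks_like_public_https is a hypothetical helper, the list of internal hostname suffixes is a guess, and no DNS resolution is performed, so the gateway's own validation remains the authority.

```python
import ipaddress
from urllib.parse import urlparse


def looks_like_public_https(url: str) -> bool:
    """Best-effort local check that a custom upstream is a public https URL."""
    parsed = urlparse(url)
    host = parsed.hostname
    if parsed.scheme != "https" or not host:
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. Assumption: dotless names and these suffixes
        # count as internal hostnames; the gateway's real list is not documented.
        internal_suffixes = (".local", ".internal", ".localhost")
        return "." in host and not host.endswith(internal_suffixes)
    # IP literal: reject loopback, private-network, and link-local ranges.
    return not (ip.is_loopback or ip.is_private or ip.is_link_local)


print(looks_like_public_https("https://llm.example.com/v1"))
print(looks_like_public_https("https://10.0.0.5/v1"))
```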

Amazon Bedrock

Bedrock uses AWS SigV4, which binds a signature to the request host, so it cannot be proxied transparently. Instead, send the Converse request body to the gateway and pass your AWS credentials per request. The gateway re-signs for the Bedrock host and never stores them.

curl https://app.axiorank.com/api/proxy/bedrock/v1/converse \
  -H "content-type: application/json" \
  -H "X-AxioRank-Key: axr_live_..." \
  -H "X-AxioRank-AWS-Region: us-east-1" \
  -H "X-AxioRank-AWS-Access-Key-Id: $AWS_ACCESS_KEY_ID" \
  -H "X-AxioRank-AWS-Secret-Access-Key: $AWS_SECRET_ACCESS_KEY" \
  -H "X-AxioRank-Bedrock-Model: anthropic.claude-3-5-sonnet-20240620-v1:0" \
  -d '{"messages":[{"role":"user","content":[{"text":"Summarize Q3 results"}]}]}'

Add X-AxioRank-AWS-Session-Token for temporary credentials. Streaming is supported at …/api/proxy/bedrock/v1/converse-stream (the AWS binary event-stream format), with the same headers.
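
The same request can be assembled in Python. This is a sketch that only builds the URL, headers, and body from the curl example above and leaves the actual send to whichever HTTP client you already use; bedrock_converse_request is a hypothetical helper name, and reading the session token from AWS_SESSION_TOKEN is an assumption based on the usual AWS environment variable.

```python
import json
from typing import Dict, Mapping, Tuple

GATEWAY_CONVERSE_URL = "https://app.axiorank.com/api/proxy/bedrock/v1/converse"


def bedrock_converse_request(
    gateway_key: str,
    region: str,
    model_id: str,
    prompt: str,
    aws_env: Mapping[str, str],
) -> Tuple[str, Dict[str, str], bytes]:
    """Build (url, headers, body) for a non-streaming Converse call."""
    headers = {
        "content-type": "application/json",
        "X-AxioRank-Key": gateway_key,
        "X-AxioRank-AWS-Region": region,
        "X-AxioRank-AWS-Access-Key-Id": aws_env["AWS_ACCESS_KEY_ID"],
        "X-AxioRank-AWS-Secret-Access-Key": aws_env["AWS_SECRET_ACCESS_KEY"],
        "X-AxioRank-Bedrock-Model": model_id,
    }
    # Temporary credentials carry a session token; add it only when present.
    session_token = aws_env.get("AWS_SESSION_TOKEN")
    if session_token:
        headers["X-AxioRank-AWS-Session-Token"] = session_token
    body = json.dumps(
        {"messages": [{"role": "user", "content": [{"text": prompt}]}]}
    ).encode("utf-8")
    return GATEWAY_CONVERSE_URL, headers, body
```

Pass os.environ as aws_env in real use; a plain dict works for tests.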

Google Vertex AI

Vertex exposes an OpenAI-compatible endpoint, so it rides the same chat completions proxy. Set the upstream to vertex and pass your project and location; the GCP access token rides in the API key field.

from openai import OpenAI

client = OpenAI(
    base_url="https://app.axiorank.com/api/proxy/v1",
    api_key="<output of: gcloud auth print-access-token>",
    default_headers={
        "X-AxioRank-Key": "axr_live_...",
        "X-AxioRank-Upstream": "vertex",
        "X-AxioRank-GCP-Project": "my-project",
        "X-AxioRank-GCP-Location": "us-central1",
    },
)

resp = client.chat.completions.create(
    model="google/gemini-2.0-flash",
    messages=[{"role": "user", "content": "Summarize Q3 results"}],
)

Anthropic

Point the Anthropic SDK's base_url at the proxy and add the governance header. Your Anthropic key keeps riding in x-api-key (the SDK sets it), forwarded upstream and never stored.

from anthropic import Anthropic

client = Anthropic(
    api_key="sk-ant-...",  # forwarded upstream, never stored
    base_url="https://app.axiorank.com/api/proxy/anthropic",
    default_headers={"X-AxioRank-Key": "axr_live_..."},
)

resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    system="You are a helpful assistant.",
    messages=[{"role": "user", "content": "Summarize Q3 results"}],
)

Google Gemini (native API)

The native Gemini API puts the model and method in the URL path, so point a Gemini client's base URL at …/api/proxy/gemini and add the governance header. Your Gemini key keeps riding in x-goog-api-key, forwarded upstream and never stored.

curl "https://app.axiorank.com/api/proxy/gemini/v1beta/models/gemini-2.0-flash:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "X-AxioRank-Key: axr_live_..." \
  -H "content-type: application/json" \
  -d '{"contents":[{"parts":[{"text":"Summarize Q3 results"}]}]}'

With the google-genai SDK, set the base URL and add the header via http_options:

from google import genai
from google.genai import types

client = genai.Client(
    api_key="...",  # forwarded as x-goog-api-key, never stored
    http_options=types.HttpOptions(
        base_url="https://app.axiorank.com/api/proxy/gemini",
        headers={"X-AxioRank-Key": "axr_live_..."},
    ),
)

resp = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Summarize Q3 results",
)

Streaming (:streamGenerateContent?alt=sse) is supported. Google Vertex AI users can also reach Gemini through the OpenAI-compatible path documented above.

Drop it into your existing stack

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o",
    base_url="https://app.axiorank.com/api/proxy/v1",
    api_key="sk-...",
    default_headers={"X-AxioRank-Key": "axr_live_..."},
)

import litellm

resp = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "hi"}],
    api_base="https://app.axiorank.com/api/proxy/v1",
    api_key="sk-...",
    extra_headers={"X-AxioRank-Key": "axr_live_..."},
)

What the proxy does to a call

Decision	Result
allow	The provider response is returned unchanged.
deny (prompt)	The model is never called. The proxy returns HTTP 403 with an OpenAI-shaped error.
hold (prompt)	The model is never called. The proxy returns HTTP 409 with an `X-AxioRank-Approval-Id` header to poll.
redact (completion)	The flagged spans in the response content are masked. Every other field is preserved.
deny (completion)	The response content is replaced with a blocked notice and `finish_reason` becomes `content_filter`.

Token usage from the provider response is rolled up into the spend dashboard and counts against workspace budgets.
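
On the caller side, the prompt-phase rows of the table map onto HTTP status codes. The sketch below shows one way to branch on them; GatewayOutcome and classify_prompt_outcome are hypothetical names, and the code assumes a 403 or 409 originates from the gateway rather than from the provider, which a real client may need to verify from the error body.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class GatewayOutcome:
    decision: str                      # "deny", "hold", "delivered", or "error"
    approval_id: Optional[str] = None  # set only for "hold"


def classify_prompt_outcome(status: int, headers: Mapping[str, str]) -> GatewayOutcome:
    """Map an HTTP status and headers onto the prompt-phase decisions."""
    lowered = {k.lower(): v for k, v in headers.items()}
    if status == 403:
        return GatewayOutcome("deny")
    if status == 409:
        # The approval id to poll travels in a response header.
        return GatewayOutcome("hold", lowered.get("x-axiorank-approval-id"))
    if 200 <= status < 300:
        # Delivered: the content may still have been redacted or blocked
        # in the completion phase (finish_reason == "content_filter").
        return GatewayOutcome("delivered")
    return GatewayOutcome("error")


print(classify_prompt_outcome(409, {"X-AxioRank-Approval-Id": "example-id"}))
```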

Run correlation headers

Send the Agent Runs correlation headers and the proxy's model turns stitch into the same run timeline as your governed tool calls:

Header	Value
`X-AxioRank-Trace-Id`	The run's trace id (uuid). The SDK trace handle emits it via `t.headers()` (TypeScript) / `t.headers` (Python).
`X-AxioRank-Session-Id`	Optional session id (uuid) grouping runs into one conversation.
`X-AxioRank-Step-Index`	Optional 0 to 100000; omit to order turns by time.
`X-AxioRank-Parent-Step-Index`	Optional; nests the turn under a spawning step (sub-agents).

Malformed values are dropped silently; correlation never fails a model call. With model-I/O governance off, a turn carrying a trace header is still recorded as an observation-only run step (allow, risk 0, telemetry only, no content).
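
Because malformed values are dropped silently, validating them locally makes mistakes visible. A minimal sketch, assuming you build the headers yourself rather than through the SDK trace handle; correlation_headers is a hypothetical helper name.

```python
import uuid
from typing import Dict, Optional


def correlation_headers(
    trace_id: str,
    session_id: Optional[str] = None,
    step_index: Optional[int] = None,
    parent_step_index: Optional[int] = None,
) -> Dict[str, str]:
    """Build the run correlation headers, raising on malformed input."""
    # uuid.UUID raises ValueError for anything that is not a valid uuid.
    headers = {"X-AxioRank-Trace-Id": str(uuid.UUID(trace_id))}
    if session_id is not None:
        headers["X-AxioRank-Session-Id"] = str(uuid.UUID(session_id))
    if step_index is not None:
        if not 0 <= step_index <= 100000:
            raise ValueError("step_index must be between 0 and 100000")
        headers["X-AxioRank-Step-Index"] = str(step_index)
    if parent_step_index is not None:
        headers["X-AxioRank-Parent-Step-Index"] = str(parent_step_index)
    return headers


print(correlation_headers(str(uuid.uuid4()), step_index=3))
```

With the OpenAI Python SDK, the result can be passed per call through the extra_headers argument of client.chat.completions.create.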

Streaming

Streaming (stream: true) is supported. When model-I/O governance is off, the provider's server-sent events stream straight through and the proxy reads the final usage chunk to meter spend. When governance is on, the proxy buffers the response so it can apply completion-phase redaction or blocks, then re-emits it as server-sent events. That trades incremental delivery for inline enforcement.

Routing, load balancing, and failover

The fastest way to manage routes is the AI Gateway hub in the dashboard (Outbound → AI Gateway). It lists your routes, opens a create/edit dialog with a "Test resolution" button that dry-runs the ordered attempt list before you save, and charts routed traffic, cache hit rate, failover rate, latency, and provider health. The HTTP API below is the same surface for scripting and CI.

A route maps a model alias (the value you send as model) to an ordered list of provider and model targets. Instead of pinning an app to one model, point it at the alias and let the gateway choose and fail over:

Strategy decides the primary target: failover (the order you declared), cost (cheapest first by the price catalog), or round_robin (weighted load balancing across targets).
Failover is automatic: a 429, a timeout, or a 5xx from one target transparently retries the next. Any other 4xx, such as 400 or 401, is a real error and is returned as-is, so a caller mistake is never masked by trying another provider.
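
The failover rule can be sketched as a small loop. This illustrates the documented behavior only, not the gateway's implementation; is_retryable and run_with_failover are hypothetical names, and a status of None stands for a timeout.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def is_retryable(status: Optional[int]) -> bool:
    """A timeout (None), a 429, or any 5xx moves on to the next target."""
    return status is None or status == 429 or 500 <= status <= 599


def run_with_failover(
    targets: Sequence[str],
    call: Callable[[str], Optional[int]],
) -> Tuple[Optional[int], List[str]]:
    """Try targets in order; stop at the first non-retryable outcome."""
    attempts: List[str] = []
    status: Optional[int] = None
    for target in targets:
        status = call(target)
        attempts.append(target)
        if not is_retryable(status):
            break  # success, or a caller error such as 400 or 401
    return status, attempts


# Simulated upstreams: groq is rate limited, openai answers.
outcomes = {"groq": 429, "openai": 200, "openrouter": 200}
print(run_with_failover(["groq", "openai", "openrouter"], outcomes.get))
```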

Routes cover the OpenAI-compatible surface: openai, azure, openrouter, any named preset (groq, together, mistral, ...), and any custom OpenAI-compatible endpoint. Each target authenticates with the managed credential stored for its provider (a preset target looks up its own provider slot first, then the shared custom one), falling back to the key you forward when the provider matches. A preset target with no usable credential is skipped, so the route fails over to the targets that can authenticate. Full routing needs the Pro plan or higher; the free plan includes 1 failover alias with up to 2 targets.

# Define a route (admin). Cheapest target first, with cross-provider backups.
curl https://app.axiorank.com/api/proxy/routes \
  -H "content-type: application/json" \
  -b "<your dashboard session>" \
  -d '{
    "alias": "axio/auto",
    "strategy": "cost",
    "targets": [
      { "upstream": "groq", "model": "llama-3.3-70b-versatile" },
      { "upstream": "openai", "model": "gpt-4o-mini" },
      { "upstream": "openrouter", "model": "anthropic/claude-haiku-4-5" }
    ]
  }'

Then call the alias like any other model. The response carries X-AxioRank-Route, X-AxioRank-Route-Target, and X-AxioRank-Route-Attempts so you can see which target served the call and how many were tried.

client.chat.completions.create(model="axio/auto", messages=[...])
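
To inspect which target served a call, read the route headers from the HTTP response. The helper below is a sketch that extracts them from any header mapping; route_info is a hypothetical name, and the values are kept as raw strings because their exact format is not specified here.

```python
from typing import Dict, Mapping, Optional

ROUTE_HEADERS = (
    "X-AxioRank-Route",
    "X-AxioRank-Route-Target",
    "X-AxioRank-Route-Attempts",
)


def route_info(headers: Mapping[str, str]) -> Dict[str, Optional[str]]:
    """Return the three route headers, matching names case-insensitively."""
    lowered = {k.lower(): v for k, v in headers.items()}
    return {name: lowered.get(name.lower()) for name in ROUTE_HEADERS}


# Placeholder values for illustration only.
print(route_info({"x-axiorank-route": "axio/auto", "x-axiorank-route-target": "example-target"}))
```

With the OpenAI Python SDK, client.chat.completions.with_raw_response.create(...) returns an object exposing .headers alongside .parse() for the usual completion.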

GET /api/proxy/routes lists routes; DELETE /api/proxy/routes/{id} removes one. PATCH /api/proxy/routes/{id} flips enabled or updates the caching, retry, and timeout knobs below without resending the target list. Routing applies to the Chat Completions surface; the other surfaces are governed but call the provider directly.

Canary releases

There is no separate canary switch because weighted round_robin already is one. A round_robin route splits traffic across its targets in proportion to each target's weight (default 1). To send 5 percent of calls to a new model and keep 95 percent on the current one, give them weights 95 and 5:

{
  "alias": "axio/chat",
  "strategy": "round_robin",
  "targets": [
    { "upstream": "openai", "model": "gpt-4o-mini", "weight": 95 },
    { "upstream": "openai", "model": "gpt-4o", "weight": 5 }
  ]
}

Every response reports X-AxioRank-Route-Target, so you can attribute results to the target that served each call. Widen the canary by raising its weight, or roll back by setting it to 0.
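
The expected split follows directly from the weights. A short sketch of the arithmetic, assuming targets shaped like the route body above; traffic_shares is a hypothetical helper name.

```python
from typing import Dict, List


def traffic_shares(targets: List[dict]) -> Dict[str, float]:
    """Return each target's expected fraction of traffic (default weight 1)."""
    total = sum(t.get("weight", 1) for t in targets)
    if total <= 0:
        raise ValueError("at least one target needs a positive weight")
    return {
        f"{t['upstream']}/{t['model']}": t.get("weight", 1) / total
        for t in targets
    }


canary = [
    {"upstream": "openai", "model": "gpt-4o-mini", "weight": 95},
    {"upstream": "openai", "model": "gpt-4o", "weight": 5},
]
print(traffic_shares(canary))
```

Setting the canary's weight to 0 gives it a share of 0, which is the rollback case.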

Retries and timeouts

Two per-route reliability knobs (Pro and above) sit on top of failover:

retryCount (0 to 2, default 0): on a retryable failure the gateway retries the same target, with exponential backoff, before it fails over to the next one. Use it to ride out a brief provider blip without changing models.
timeoutMs (1000 to 300000): a per-attempt upstream timeout. The timer guards the wait for response headers only, so a slow streamed body is never cut off mid-flight. A timeout counts as a retryable failure. Set it per target to override the route default.

Retries never apply to a non-retryable 4xx such as 400 or 401, so a caller mistake is returned immediately rather than retried.
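
The two knobs combine with failover as follows. The sketch validates the documented ranges and lists the worst-case attempt order; reliability_knobs and worst_case_attempts are hypothetical names, and using the returned dict as a PATCH body is an assumption about the request shape.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def reliability_knobs(retry_count: int = 0, timeout_ms: Optional[int] = None) -> Dict[str, int]:
    """Validate the documented ranges and return the knob fields."""
    if not 0 <= retry_count <= 2:
        raise ValueError("retryCount must be between 0 and 2")
    knobs = {"retryCount": retry_count}
    if timeout_ms is not None:
        if not 1000 <= timeout_ms <= 300000:
            raise ValueError("timeoutMs must be between 1000 and 300000")
        knobs["timeoutMs"] = timeout_ms
    return knobs


def worst_case_attempts(targets: Sequence[str], retry_count: int) -> List[Tuple[str, int]]:
    """Order of attempts if every try fails retryably: (target, retry number)."""
    return [(target, n) for target in targets for n in range(retry_count + 1)]


print(reliability_knobs(retry_count=1, timeout_ms=30000))
print(worst_case_attempts(["groq", "openai"], retry_count=1))
```

With retryCount set to 1, each target is tried twice before the route moves on, so two targets can produce up to four upstream attempts.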

Response caching

Turn on exact-match response caching per route by setting cacheTtlSeconds (30 to 86400). A repeated request answers from the workspace cache in single-digit milliseconds and pays the provider nothing.

The cache key is a hash of the normalized request body (transport-only fields stream and stream_options are ignored, everything else including temperature is significant) plus the route. Editing the route flushes its cache automatically.
Only clean responses are cached: a 200 whose final decision was allow with no redaction. Streaming requests can be served from the cache but are never written to it, and denied, held, or redacted outputs are never cached.
Under model-I/O enforcement, a cache hit is still run through completion governance before it is served, so a policy you tightened after the write still applies to cached responses. This is the difference between caching bytes and caching a governed decision.
Response headers report X-AxioRank-Cache: hit | miss | bypass | refresh. Send the request header X-AxioRank-Cache: bypass to skip the cache for one call, or refresh to skip the read and rewrite the entry from a fresh response. Flush a route's cache from the hub or with DELETE /api/proxy/routes/{id}/cache.
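
The cache key rule can be illustrated with a short sketch. The hash function, the JSON canonicalization, and the way the route is mixed in are assumptions chosen for the example (the gateway's actual scheme is not specified); cache_key is a hypothetical name.

```python
import hashlib
import json

# Transport-only fields that do not affect the key.
TRANSPORT_ONLY = {"stream", "stream_options"}


def cache_key(route_id: str, body: dict) -> str:
    """Hash the request body without transport-only fields, plus the route."""
    normalized = {k: v for k, v in body.items() if k not in TRANSPORT_ONLY}
    # Sorted keys and fixed separators give a stable serialization.
    canonical = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{route_id}\n{canonical}".encode("utf-8")).hexdigest()


request = {"model": "axio/auto", "messages": [{"role": "user", "content": "hi"}]}
same = cache_key("route-1", request) == cache_key("route-1", {**request, "stream": True})
different = cache_key("route-1", request) != cache_key("route-1", {**request, "temperature": 0.2})
print(same, different)
```

The stream flag leaves the key unchanged, while a different temperature produces a different key, matching the rule that everything except the transport-only fields is significant.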

Caching is exact match today. Semantic caching (answering near-duplicate prompts from the cache) is on the roadmap. Response caching needs the Pro plan or higher.

Managed credentials (one key)

By default the proxy forwards your provider credential from the request (the Authorization bearer, x-api-key, x-goog-api-key, or the AWS headers), so it never stores it. On the Enterprise plan you can instead store a provider credential once and send only the X-AxioRank-Key. The secret is encrypted at rest (AES-256-GCM) and supplied by the proxy when the per-request credential is absent.

# Store an OpenAI key once (admin, Enterprise)
curl https://app.axiorank.com/api/proxy/credentials \
  -H "content-type: application/json" \
  -b "<your dashboard session>" \
  -d '{"provider":"openai","apiKey":"sk-...","label":"prod"}'

After that, callers send only base_url + X-AxioRank-Key (no Authorization). Provider values: openai, azure, openrouter, anthropic, gemini, bedrock (awsAccessKeyId + awsSecretAccessKey + awsRegion), custom, and every named preset (groq, cerebras, nvidia, and the rest of the preset table), so a route can hold a per-provider key for each of its targets. GET lists stored credentials (never the secret); DELETE /api/proxy/credentials/{id} removes one. Routing headers that are not secrets (Azure resource/deployment, custom base URL, the Bedrock model) are still sent per request.

Notes and limits

By default the provider key is used for one request and never persisted. Enterprise workspaces can instead store it once (see Managed credentials above) so the AxioRank key is the only credential sent.
OpenAI Chat Completions, OpenAI Responses, Azure OpenAI, the Anthropic Messages API, Amazon Bedrock (Converse and ConverseStream), the native Google Gemini API (generateContent), and Google Vertex (via its OpenAI-compatible endpoint) are supported today, streaming and non-streaming.
Under model-I/O governance a streamed response is buffered, governed, and re-emitted in the provider's native streaming format. Text is redactable; tool calls are preserved.
Under model-I/O governance, a streaming Responses request is buffered and re-emitted as typed events (response.created, response.output_text.delta, response.completed), so redaction and blocks still apply.
The proxy sits inline in your request path. Governance failures fail open by default so availability is never gated on the control plane.
