API Reference

OpenAI-compatible AI gateway. Point any OpenAI SDK at it by changing base_url and api_key; existing Chat Completions code runs unchanged.

Authentication

Every request requires an API key in the Authorization header:

Authorization: Bearer cai_your_key_here

Create keys in the dashboard (per-team). Keys are Argon2-hashed at rest — the full key is shown only once at creation.

Base URL

Set this as base_url in any OpenAI SDK. All paths below are relative to this.

Quickstart

Python
from openai import OpenAI

client = OpenAI(
    base_url="YOUR_BASE_URL",  # placeholder: use the value from the Base URL section
    api_key="cai_your_key_here",
)

response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)

cURL
curl YOUR_BASE_URL/chat/completions \
  -H "Authorization: Bearer cai_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Any OpenAI-compatible SDK works. Python, Node, Go, Rust — just change base_url and api_key.

List Models

GET /v1/models

Returns all available model codenames.

Response
{
  "object": "list",
  "data": [
    { "id": "auto", "object": "model" },
    { "id": "claude-sonnet", "object": "model" },
    { "id": "gpt-4o", "object": "model" }
  ]
}
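As a small illustration of consuming this response, the sketch below pulls the codenames out of a raw response body. The helper name `model_ids` is hypothetical, and the sample body is copied from the response above.

```python
import json

def model_ids(raw_body: str) -> list:
    """Return the `id` of every model entry in a /v1/models response body."""
    body = json.loads(raw_body)
    return [
        entry["id"]
        for entry in body.get("data", [])
        if entry.get("object") == "model"
    ]

# Sample body copied from the documented response.
sample_models = """
{
  "object": "list",
  "data": [
    { "id": "auto", "object": "model" },
    { "id": "claude-sonnet", "object": "model" },
    { "id": "gpt-4o", "object": "model" }
  ]
}
"""

print(model_ids(sample_models))
```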

Chat Completions

POST /v1/chat/completions

Send a conversation, get a completion. Fully compatible with the OpenAI Chat Completions API.

Request body

Parameter	Type	Required	Description
model	string	required	Model codename (`gpt-4o`, `claude-sonnet`) or `auto` for intelligent routing. Append a strategy suffix to control provider ranking: `auto:speed`, `gpt-4o:cost`, `claude-sonnet:quality`. See Routing.
messages	array	required	The conversation. Each message has `role` (`"system"`, `"user"`, `"assistant"`, `"tool"`) and `content` (string or multipart array).
stream	boolean	optional	Stream response as Server-Sent Events. See Streaming.
tools	array	optional	Tool/function definitions. See Tool calls.
tool_choice	string \| object	optional	`"auto"`, `"none"`, `"required"`, or `{"type": "function", "function": {"name": "..."}}`.
response_format	object	optional	Force JSON output. See Structured output.
max_tokens	integer	optional	Max tokens to generate. Capped at the model's limit. `max_completion_tokens` is an alias (takes precedence if both set).
temperature	float	optional	`0`–`2`. Higher = more random.
top_p	float	optional	`0`–`1`. Nucleus sampling.
stop	string \| array	optional	Up to 4 stop sequences.
presence_penalty	float	optional	`-2`–`2`.
frequency_penalty	float	optional	`-2`–`2`.
seed	integer	optional	Deterministic sampling (best-effort).
n	integer	optional	Number of completions. Default: `1`.
logit_bias	object	optional	Token ID to bias value mapping.
user	string	optional	End-user identifier for abuse detection.
cailos	object	optional	Routing hints. See Routing.
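The ranges in the table can be checked client-side before a request is sent. The following is a minimal sketch, not part of any SDK: `build_chat_request` is a hypothetical helper, and it only enforces the limits stated in the table (the gateway itself remains the authority on validation).

```python
# Numeric ranges taken from the parameter table above.
RANGES = {
    "temperature": (0, 2),
    "top_p": (0, 1),
    "presence_penalty": (-2, 2),
    "frequency_penalty": (-2, 2),
}

def build_chat_request(model, messages, **options):
    """Assemble a chat completion body, rejecting out-of-range options."""
    if not model or not messages:
        raise ValueError("model and messages are required")
    for name, (low, high) in RANGES.items():
        value = options.get(name)
        if value is not None and not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
    stop = options.get("stop")
    if isinstance(stop, (list, tuple)) and len(stop) > 4:
        raise ValueError("stop accepts at most 4 sequences")
    body = {"model": model, "messages": messages}
    # Leave unset options out so the gateway applies its own defaults.
    body.update({k: v for k, v in options.items() if v is not None})
    return body

request_body = build_chat_request(
    "auto:speed",
    [{"role": "user", "content": "Hello"}],
    temperature=0.2,
    stop=["END"],
)
print(request_body)
```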

Response

Standard OpenAI chat completion format, plus a cailos metadata object:

200 OK
{
  "id": "chatcmpl-8a3b2c1d...",
  "object": "chat.completion",
  "model": "gpt-4o-2024-08-06",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you?",
      "reasoning_content": null,     // present only for reasoning models
      "tool_calls": null              // present only when tools are invoked
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21,
    "reasoning_tokens": 0          // tokens used for chain-of-thought
  },
  "cailos": {
    "optimise": "quality",
    "provider": "openai",
    "endpoint": "gpt-4o-2024-08-06",
    "trust_level": 2,
    "detected_languages": ["en"],
    "tier": "paid"
  }
}

Field	Description
message.content	The model's text response. `null` when the model responds with tool calls only.
message.reasoning_content	Chain-of-thought reasoning. Absent for non-reasoning models. See Reasoning.
message.tool_calls	Function calls made by the model. See Tool calls.
finish_reason	`"stop"`, `"length"`, or `"tool_calls"`.
usage.reasoning_tokens	Tokens used for thinking. Included in `completion_tokens`.
cailos.optimise	Strategy used: `"speed"`, `"cost"`, `"quality"`, or `"balanced"`.
cailos.provider	Provider that served the request (e.g. `"openai"`, `"anthropic"`).
cailos.endpoint	Provider's native model ID.
cailos.trust_level	Trust level of the endpoint (`0`–`3`).
cailos.detected_languages	Auto-detected input languages (e.g. `["en", "he"]`), or `null`.
cailos.tier	Your team's plan (`"free"` or `"paid"`).

The cailos object is a Cailos extension — not part of the OpenAI spec. SDKs ignore unknown keys, so it's safe to leave in your response handling.
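If you work with the raw JSON body, the routing metadata can be read like any other key. The sketch below is illustrative only: `routing_summary` is a hypothetical helper, and it treats the whole cailos object as optional so the same code also works against a plain OpenAI response.

```python
import json

def routing_summary(raw_body: str) -> str:
    """Describe which provider and endpoint served a chat completion."""
    body = json.loads(raw_body)
    meta = body.get("cailos") or {}  # absent on non-Cailos responses
    provider = meta.get("provider", "unknown")
    endpoint = meta.get("endpoint", "unknown")
    strategy = meta.get("optimise", "unknown")
    return f"{provider}/{endpoint} via {strategy}"

# Trimmed version of the documented 200 OK response.
sample_completion = json.dumps({
    "id": "chatcmpl-8a3b2c1d...",
    "object": "chat.completion",
    "cailos": {
        "optimise": "quality",
        "provider": "openai",
        "endpoint": "gpt-4o-2024-08-06",
        "trust_level": 2,
    },
})

print(routing_summary(sample_completion))
```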

Examples

Streaming

Set stream: true to receive Server-Sent Events. Each event is a ChatCompletionChunk with incremental deltas, ending with data: [DONE].

Python
stream = client.chat.completions.create(
    model="auto:speed",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="")

SSE format
data: {"choices": [{"delta": {"content": "Hello"}}]}

data: {"choices": [{"delta": {"content": "!"}}]}

data: {"choices": [{"delta": {}, "finish_reason": "stop"}]}

data: [DONE]
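For clients that read the event stream without an SDK, the format above can be decoded line by line. This is a minimal sketch that assumes each event arrives as a single `data:` line, as in the sample; `collect_stream_text` is a hypothetical helper name.

```python
import json

def collect_stream_text(lines):
    """Concatenate the content deltas from an iterable of SSE lines."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)

# The events from the SSE sample above.
sample_events = [
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "!"}}]}',
    "",
    'data: {"choices": [{"delta": {}, "finish_reason": "stop"}]}',
    "",
    "data: [DONE]",
]

print(collect_stream_text(sample_events))
```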

Tool calls

Define tools in the request. When the model invokes one, finish_reason is "tool_calls" and content is null:

Request
response = client.chat.completions.create(
    model="auto:quality",
    messages=[{"role": "user", "content": "What's the weather in Amsterdam?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"]
            }
        }
    }],
)

Response
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"city\": \"Amsterdam\"}"
        }
      }]
    },
    "finish_reason": "tool_calls"
  }]
}

To continue, append the assistant message (with tool_calls) and a tool message (with the result) back into messages.
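The continuation step can be sketched as follows. The `get_weather` implementation and the `append_tool_results` helper are hypothetical; the `tool_call_id` field on the tool message follows the standard OpenAI Chat Completions convention, which this gateway is described as compatible with.

```python
import json

def get_weather(city):
    """Hypothetical local implementation of the tool declared in the request."""
    return {"city": city, "forecast": "sunny"}

def append_tool_results(messages, assistant_message, handlers):
    """Return messages extended with the assistant turn and one tool result per call."""
    updated = list(messages)
    updated.append(assistant_message)
    for call in assistant_message.get("tool_calls") or []:
        function = call["function"]
        arguments = json.loads(function["arguments"])  # arguments arrive as a JSON string
        result = handlers[function["name"]](**arguments)
        updated.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        })
    return updated

history = [{"role": "user", "content": "What's the weather in Amsterdam?"}]
# Assistant message copied from the documented response.
assistant_turn = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"city\": \"Amsterdam\"}"},
    }],
}

next_messages = append_tool_results(history, assistant_turn, {"get_weather": get_weather})
print(json.dumps(next_messages, indent=2))
```

Sending `next_messages` in a second chat completion request lets the model turn the tool result into a final answer.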

Structured output

Force the model to respond with valid JSON using response_format:

JSON mode
import json

response = client.chat.completions.create(
    model="auto:quality",
    messages=[{"role": "user", "content": "List 3 planets as JSON"}],
    response_format={"type": "json_object"},
)

data = json.loads(response.choices[0].message.content)
# {"planets": ["Mercury", "Venus", "Earth"]}

Healing. If the model wraps JSON in code fences, has trailing commas, or truncates output, Cailos repairs it automatically before returning.

Vision

Send images as multipart content in the messages array:

Image input
response = client.chat.completions.create(
    model="auto:quality",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {
                "url": "https://example.com/photo.jpg"
            }}
        ]
    }],
)

Both URLs and base64 data URIs (data:image/png;base64,...) are supported. Cailos auto-detects vision requirements and routes to a capable model.
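For local images, a base64 data URI can be built with the standard library. This is a small sketch; `image_part_from_bytes` is a hypothetical helper, and the caller is assumed to know the image's MIME type.

```python
import base64

def image_part_from_bytes(data: bytes, mime_type: str = "image/png") -> dict:
    """Wrap raw image bytes as an image_url content part using a data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }

# Placeholder bytes; in practice read them with open(path, "rb").read().
part = image_part_from_bytes(b"abc")
message = {
    "role": "user",
    "content": [{"type": "text", "text": "What's in this image?"}, part],
}
print(part["image_url"]["url"])
```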

Reasoning

When a reasoning model is used (DeepSeek R1, Qwen3, Claude with extended thinking, OpenAI o-series), the chain-of-thought is returned in reasoning_content — never as raw tags in content:

Response
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "The answer is 42.",
      "reasoning_content": "Let me think through this step by step..."
    }
  }],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 120,
    "reasoning_tokens": 98
  }
}

Python
msg = response.choices[0].message
print(msg.content)             # The actual answer

if hasattr(msg, "reasoning_content") and msg.reasoning_content:
    print(msg.reasoning_content)   # Chain-of-thought

reasoning_content is absent for non-reasoning models.
reasoning_tokens are included in completion_tokens.
Thinking tags (<think>, <thinking>, etc.) are extracted automatically — they never appear in content.
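Because reasoning_tokens are counted inside completion_tokens, the tokens spent on the visible answer are the difference between the two. A minimal sketch (the helper name is hypothetical), using the usage block from the response above:

```python
def answer_tokens(usage: dict) -> int:
    """Tokens spent on the visible answer, excluding chain-of-thought."""
    return usage.get("completion_tokens", 0) - usage.get("reasoning_tokens", 0)

# Usage block copied from the documented reasoning response.
sample_usage = {"prompt_tokens": 15, "completion_tokens": 120, "reasoning_tokens": 98}

print(answer_tokens(sample_usage))  # 120 - 98 = 22
```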

Routing

The model field takes a codename (gpt-4o, claude-sonnet). The same codename can be served by multiple providers — Cailos picks the best one based on the strategy. When model is "auto", the ML classifier also picks the best model.

Append a strategy suffix to any model value with : to control how providers are ranked:

Syntax	Behaviour
auto	ML classifier picks the best model and strategy (falls back to `balanced` when uncertain).
gpt-4o	Specific model, strategy auto-predicted from prompt content.
model:speed	Fastest response — ranks providers by lowest latency and highest TPS.
model:cost	Cheapest — ranks providers by lowest $/M tokens.
model:quality	Most capable — ranks providers by highest intelligence rating.
model:balanced	Composite blend of quality (40%), speed (35%), and cost (25%).

In the table, model stands for either auto or a specific codename. Examples: auto:speed, gpt-4o:cost, claude-sonnet:quality.
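The suffix syntax is simple enough to split client-side, for example when logging which strategy a request asked for. A sketch under the rules in the table; `split_model` is a hypothetical helper, not part of the gateway or any SDK.

```python
STRATEGIES = {"speed", "cost", "quality", "balanced"}

def split_model(value: str):
    """Split 'codename[:strategy]' into (codename, strategy or None)."""
    codename, separator, strategy = value.partition(":")
    if not codename:
        raise ValueError("model codename is empty")
    if not separator:
        return codename, None  # no suffix: the gateway chooses the strategy
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return codename, strategy

for value in ("auto:speed", "gpt-4o:cost", "claude-sonnet"):
    print(value, "->", split_model(value))
```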

Choosing a strategy

Use case	Recommended
Real-time chat, autocomplete, typing indicators	`:speed` — minimises time-to-first-token and prioritises throughput.
High-volume batch processing, background jobs, bulk classification	`:cost` — routes to the cheapest available provider for the model.
Complex reasoning, code generation, high-stakes analysis	`:quality` — picks the provider with the highest eval scores.
General-purpose, unsure, or mixed workloads	`:balanced` or omit the suffix — blends all three factors.

Tip. When you specify a model like gpt-4o without a suffix, Cailos auto-predicts the best strategy from your prompt content. The suffix is only needed when you want explicit control.
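To make the balanced weighting concrete, the sketch below applies the documented 40/35/25 blend to two made-up providers. How Cailos normalises its inputs is not specified here, so the example assumes each factor is already a score between 0 and 1 where higher is better (a cheaper provider gets a higher cost score). The provider names and numbers are invented for illustration.

```python
# Weights from the routing table: quality 40%, speed 35%, cost 25%.
WEIGHTS = {"quality": 0.40, "speed": 0.35, "cost": 0.25}

def balanced_score(scores: dict) -> float:
    """Weighted blend of per-factor scores (assumed normalised to 0..1)."""
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())

# Hypothetical providers serving the same codename.
providers = {
    "provider_a": {"quality": 0.9, "speed": 0.5, "cost": 0.4},
    "provider_b": {"quality": 0.7, "speed": 0.9, "cost": 0.8},
}

best = max(providers, key=lambda name: balanced_score(providers[name]))
for name, scores in providers.items():
    print(name, round(balanced_score(scores), 3))
print("selected:", best)
```

Here provider_b wins despite its lower quality score, because its speed and cost advantages outweigh the difference under these weights.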

Routing hints

Fine-tune routing with the cailos object. Most fields are auto-detected from the request — you only need to set them for explicit overrides.

Field	Type	Description
trust_level	integer	`0`–`3`. Only routes to endpoints at or above this trust level. Cannot go below your team's floor.
require_tools	boolean	Only route to models with function calling support. Auto-detected from `tools`.
require_vision	boolean	Only route to models with image input support. Auto-detected from `image_url` in messages.
require_structured_output	boolean	Only route to models with JSON output support. Auto-detected from `response_format`.
require_web_access	boolean	Only route to models with web access.
language	string	ISO 639-1 code. Only route to models supporting this language. Auto-detected from message content.

Example
{
  "model": "auto",
  "messages": [...],
  "cailos": {
    "trust_level": 2,
    "language": "en"
  }
}

Auto-detection. Fields marked auto-detected are inferred from your request payload. You don't need to set them unless you want to override.
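A sketch of building the hints object with basic type checks follows. `routing_hints` is a hypothetical helper; the field names and types come from the table above, and the two-letter check is only a rough stand-in for ISO 639-1 validation.

```python
BOOLEAN_HINTS = {
    "require_tools",
    "require_vision",
    "require_structured_output",
    "require_web_access",
}

def routing_hints(**hints) -> dict:
    """Validate routing hints and wrap them under the 'cailos' key."""
    for name, value in hints.items():
        if name == "trust_level":
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or not 0 <= value <= 3:
                raise ValueError("trust_level must be an integer from 0 to 3")
        elif name in BOOLEAN_HINTS:
            if not isinstance(value, bool):
                raise ValueError(f"{name} must be a boolean")
        elif name == "language":
            if not (isinstance(value, str) and len(value) == 2 and value.isalpha()):
                raise ValueError("language must be a two-letter ISO 639-1 code")
        else:
            raise ValueError(f"unknown routing hint: {name}")
    return {"cailos": dict(hints)}

extra = routing_hints(trust_level=2, language="en")
print(extra)
```

With the OpenAI Python SDK, a dictionary like this can typically be passed as `extra_body=extra` to `client.chat.completions.create(...)`, which merges it into the JSON request body; treat that as an assumption to verify against your SDK version.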

Feedback

Submit human feedback on any routed request. Feedback is used to improve routing quality — satisfaction ratings per endpoint influence model selection over time.

POST /accounts/requests/{id}/feedback/

The {id} in the URL is the id field returned in every chat completion response — pass it through unchanged. No authentication required; the id is unguessable and feedback is only accepted within 1 week of the request being created. After that window, the endpoint returns 410 Gone.

Request body

Parameter	Type	Required	Description
positive	string	required	`"1"` for thumbs up, `"0"` for thumbs down.
comment	string	optional	Short note on what went wrong or right. Max 100 characters.

cURL
curl -X POST /accounts/requests/550e8400-e29b-41d4-a716-446655440000/feedback/ \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d 'positive=0&comment=wrong language in response'

The endpoint returns an HTML partial (designed for htmx). The feedback is also visible on the request detail page in the dashboard, and aggregated satisfaction ratings appear per endpoint on the model browse page.
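The form body can also be built programmatically. The sketch below only encodes the payload and enforces the 100-character comment limit; `feedback_form` is a hypothetical helper, and sending the request (for example with `urllib.request` or `requests`) is left out because the host is not shown here.

```python
from urllib.parse import urlencode

MAX_COMMENT_LENGTH = 100  # limit stated in the request body table

def feedback_form(positive: bool, comment: str = "") -> str:
    """Encode feedback as application/x-www-form-urlencoded data."""
    if len(comment) > MAX_COMMENT_LENGTH:
        raise ValueError("comment must be at most 100 characters")
    fields = {"positive": "1" if positive else "0"}
    if comment:
        fields["comment"] = comment
    return urlencode(fields)

form_body = feedback_form(False, "wrong language in response")
print(form_body)
```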

Why give feedback? Even sparse feedback (a few percent of requests) gives Cailos a quality signal per endpoint and task type that static eval scores can't capture. Over time this directly improves which provider gets picked for your workload.

Errors

All errors follow the OpenAI error envelope:

{
  "error": {
    "message": "Model 'nonexistent' not found or not active.",
    "type": "not_found_error"
  }
}

Status	Meaning
200	Success.
400	Malformed JSON or invalid parameters.
401	Missing or invalid API key.
403	Key valid but team is inactive.
404	Model codename not found or no active providers.
429	Rate limit exceeded. Retry after 60 seconds.
502	Upstream provider returned an error.
503	Model unavailable — circuit breaker open or provider down.
504	Upstream provider timed out.
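One way to handle these responses in a client is to parse the envelope and decide whether a retry is worthwhile. In the sketch below, treating 429, 502, 503, and 504 as retryable is an assumption based on the meanings in the table, not a documented policy, and `describe_error` is a hypothetical helper.

```python
import json

# Assumption: transient upstream and rate-limit statuses are worth retrying.
RETRYABLE_STATUSES = {429, 502, 503, 504}

def describe_error(status: int, raw_body: str) -> dict:
    """Summarise an error response that follows the OpenAI error envelope."""
    try:
        parsed = json.loads(raw_body)
    except ValueError:
        parsed = {}
    error = parsed.get("error") if isinstance(parsed, dict) else None
    if not isinstance(error, dict):
        error = {}  # body was not the expected envelope
    return {
        "status": status,
        "type": error.get("type"),
        "message": error.get("message"),
        "retryable": status in RETRYABLE_STATUSES,
    }

# Envelope copied from the documented example.
sample_error = (
    '{"error": {"message": "Model \'nonexistent\' not found or not active.", '
    '"type": "not_found_error"}}'
)

print(describe_error(404, sample_error))
```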

Rate Limits

Per-key, per-minute. When exceeded, requests return 429. The window resets every 60 seconds. A rate limit of 0 means unlimited.
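A client can respect the window by pausing for 60 seconds after a 429 before trying again. The following is a sketch of that pattern with the transport abstracted away: `send` is a hypothetical callable returning a `(status, body)` pair, and `sleep` is injectable so the example runs without actually waiting.

```python
import time

def call_with_rate_limit_retry(send, max_attempts=3, wait_seconds=60, sleep=time.sleep):
    """Call send() and retry after a pause whenever it returns status 429."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status != 429 or attempt == max_attempts:
            return status, body
        sleep(wait_seconds)  # wait for the per-minute window to reset

# Demonstration with a fake transport: one 429, then success.
fake_responses = iter([(429, "rate limited"), (200, "ok")])
recorded_waits = []

result = call_with_rate_limit_retry(
    send=lambda: next(fake_responses),
    sleep=recorded_waits.append,  # record the wait instead of sleeping
)
print(result, recorded_waits)
```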

Providers

You never interact with providers directly — just send a codename. For transparency:

Provider	Format	Notes
OpenAI	native	Forwarded as-is.
Anthropic	converted	System prompt, tools, vision, thinking — fully translated.
Google	converted	Gemini API. Thinking parts extracted.
Cohere	converted	v2 Chat API.
Groq, Together, DeepInfra, Cerebras, Scaleway, SambaNova, xAI, OpenRouter, Novita, NCompass, NScale, Inception, Tinfoil	native	OpenAI-compatible.