API Reference
OpenAI-compatible AI gateway. Point any OpenAI SDK at it by setting base_url and api_key; no other code changes are needed.
Authentication
Every request requires an API key in the Authorization header:
Authorization: Bearer cai_your_key_here
Create keys in the dashboard (per-team). Keys are Argon2-hashed at rest — the full key is shown only once at creation.
Base URL
Set this as base_url in any OpenAI SDK. All paths below are relative to this.
Quickstart
Python

    from openai import OpenAI

    client = OpenAI(
        base_url=,
        api_key="cai_your_key_here",
    )

    response = client.chat.completions.create(
        model="auto",
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(response.choices[0].message.content)
cURL

    curl /chat/completions \
      -H "Authorization: Bearer cai_your_key_here" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "auto",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
List Models
Returns all available model codenames.
Response

    {
      "object": "list",
      "data": [
        { "id": "auto", "object": "model" },
        { "id": "claude-sonnet", "object": "model" },
        { "id": "gpt-4o", "object": "model" }
      ]
    }
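As a small illustration, the response above can be reduced to a plain list of codenames. The helper name `model_ids` is hypothetical, and the payload is the sample from this section rather than a live call.

```python
import json

# Sample payload copied from the List Models response in this section.
SAMPLE = """
{"object": "list", "data": [
  {"id": "auto", "object": "model"},
  {"id": "claude-sonnet", "object": "model"},
  {"id": "gpt-4o", "object": "model"}
]}
"""


def model_ids(payload: dict) -> list:
    """Return the codenames from a List Models payload (hypothetical helper)."""
    return [
        entry["id"]
        for entry in payload.get("data", [])
        if entry.get("object") == "model"
    ]


ids = model_ids(json.loads(SAMPLE))
print(ids)
```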
Chat Completions
Send a conversation, get a completion. Fully compatible with the OpenAI Chat Completions API.
Request body
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | required | Model codename (gpt-4o, claude-sonnet) or auto for intelligent routing. Append a strategy suffix to control provider ranking: auto:speed, gpt-4o:cost, claude-sonnet:quality. See Routing. |
| messages | array | required | The conversation. Each message has role ("system", "user", "assistant", "tool") and content (string or multipart array). |
| stream | boolean | optional | Stream response as Server-Sent Events. See Streaming. |
| tools | array | optional | Tool/function definitions. See Tool calls. |
| tool_choice | string or object | optional | "auto", "none", "required", or {"type": "function", "function": {"name": "..."}}. |
| response_format | object | optional | Force JSON output. See Structured output. |
| max_tokens | integer | optional | Max tokens to generate. Capped at the model's limit. max_completion_tokens is an alias (takes precedence if both set). |
| temperature | float | optional | 0–2. Higher = more random. |
| top_p | float | optional | 0–1. Nucleus sampling. |
| stop | string or array | optional | Up to 4 stop sequences. |
| presence_penalty | float | optional | -2–2. |
| frequency_penalty | float | optional | -2–2. |
| seed | integer | optional | Deterministic sampling (best-effort). |
| n | integer | optional | Number of completions. Default: 1. |
| logit_bias | object | optional | Token ID to bias value mapping. |
| user | string | optional | End-user identifier for abuse detection. |
| cailos | object | optional | Routing hints. See Routing. |
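The constraints in the table can be checked client-side before a request is sent. The sketch below is one possible way to do that; `check_request` and `effective_max_tokens` are hypothetical helpers, not part of any SDK, and the gateway remains the source of truth for validation.

```python
def check_request(body: dict) -> list:
    """Return a list of problems found in a chat completion request body."""
    problems = []
    if not isinstance(body.get("model"), str) or not body.get("model"):
        problems.append("model is required")
    if not body.get("messages"):
        problems.append("messages is required")

    # Numeric ranges taken from the parameter table.
    ranges = {
        "temperature": (0, 2),
        "top_p": (0, 1),
        "presence_penalty": (-2, 2),
        "frequency_penalty": (-2, 2),
    }
    for name, (low, high) in ranges.items():
        value = body.get(name)
        if value is not None and not (low <= value <= high):
            problems.append(f"{name} must be between {low} and {high}")

    stop = body.get("stop")
    if isinstance(stop, list) and len(stop) > 4:
        problems.append("stop accepts at most 4 sequences")
    return problems


def effective_max_tokens(body: dict):
    """max_completion_tokens takes precedence over max_tokens when both are set."""
    if body.get("max_completion_tokens") is not None:
        return body["max_completion_tokens"]
    return body.get("max_tokens")


request = {
    "model": "auto",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.7,
}
print(check_request(request))
```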
Response
Standard OpenAI chat completion format, plus a cailos metadata object:
200 OK

    {
      "id": "chatcmpl-8a3b2c1d...",
      "object": "chat.completion",
      "model": "gpt-4o-2024-08-06",
      "choices": [{
        "message": {
          "role": "assistant",
          "content": "Hello! How can I help you?",
          "reasoning_content": null,  // present only for reasoning models
          "tool_calls": null          // present only when tools are invoked
        },
        "finish_reason": "stop"
      }],
      "usage": {
        "prompt_tokens": 9,
        "completion_tokens": 12,
        "total_tokens": 21,
        "reasoning_tokens": 0  // tokens used for chain-of-thought
      },
      "cailos": {
        "optimise": "quality",
        "provider": "openai",
        "endpoint": "gpt-4o-2024-08-06",
        "trust_level": 2,
        "detected_languages": ["en"],
        "tier": "paid"
      }
    }
| Field | Description |
|---|---|
| message.content | The model's text response. null when the model responds with tool calls only. |
| message.reasoning_content | Chain-of-thought reasoning. Absent for non-reasoning models. See Reasoning. |
| message.tool_calls | Function calls made by the model. See Tool calls. |
| finish_reason | "stop", "length", or "tool_calls". |
| usage.reasoning_tokens | Tokens used for thinking. Included in completion_tokens. |
| cailos.optimise | Strategy used: "speed", "cost", "quality", or "balanced". |
| cailos.provider | Provider that served the request (e.g. "openai", "anthropic"). |
| cailos.endpoint | Provider's native model ID. |
| cailos.trust_level | Trust level of the endpoint (0–3). |
| cailos.detected_languages | Auto-detected input languages (e.g. ["en", "he"]), or null. |
| cailos.tier | Your team's plan ("free" or "paid"). |
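The fields in the table can be read straight from the parsed response body. The following is a minimal sketch that works on a plain dictionary; `routing_summary` is a hypothetical helper, and the sample values are the ones shown in the example response in this section.

```python
import json

# Trimmed copy of the example response, keeping only the cailos metadata.
RAW = """
{"id": "chatcmpl-example",
 "cailos": {"optimise": "quality", "provider": "openai",
            "endpoint": "gpt-4o-2024-08-06", "trust_level": 2,
            "detected_languages": ["en"], "tier": "paid"}}
"""


def routing_summary(response: dict) -> str:
    """Summarise how a request was routed; tolerates a missing cailos object."""
    meta = response.get("cailos") or {}
    languages = meta.get("detected_languages") or []  # may be null
    return (
        f"{meta.get('provider', 'unknown')}/{meta.get('endpoint', 'unknown')} "
        f"strategy={meta.get('optimise', 'unknown')} "
        f"trust={meta.get('trust_level', 'unknown')} "
        f"languages={','.join(languages) or 'none'}"
    )


summary = routing_summary(json.loads(RAW))
print(summary)
```

Falling back to defaults keeps the same code usable against other OpenAI-compatible endpoints that do not return this object.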
The cailos object is a Cailos extension and is not part of the OpenAI spec. SDKs ignore unknown keys, so it's safe to leave in your response handling.
Examples
Streaming
Set stream: true to receive Server-Sent Events. Each event is a ChatCompletionChunk with incremental deltas, ending with data: [DONE].
Python

    stream = client.chat.completions.create(
        model="auto:speed",
        messages=[{"role": "user", "content": "Explain quantum computing"}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="")
SSE format

    data: {"choices": [{"delta": {"content": "Hello"}}]}

    data: {"choices": [{"delta": {"content": "!"}}]}

    data: {"choices": [{"delta": {}, "finish_reason": "stop"}]}

    data: [DONE]
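For clients that read the event stream without an SDK, the wire format above can be decoded with the standard library alone. This is a rough sketch; `collect_content` is a hypothetical helper, and it assumes each event arrives as a single `data:` line, as in the sample.

```python
import json


def collect_content(lines):
    """Concatenate delta.content from SSE lines, stopping at data: [DONE]."""
    parts = []
    finish_reason = None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                parts.append(text)
            finish_reason = choice.get("finish_reason") or finish_reason
    return "".join(parts), finish_reason


# The sample events from this section.
SAMPLE = [
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "!"}}]}',
    "",
    'data: {"choices": [{"delta": {}, "finish_reason": "stop"}]}',
    "",
    "data: [DONE]",
]

text, reason = collect_content(SAMPLE)
print(text, reason)
```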
Tool calls
Define tools in the request. When the model invokes one, finish_reason is "tool_calls" and content is null:
Request

    response = client.chat.completions.create(
        model="auto:quality",
        messages=[{"role": "user", "content": "What's the weather in Amsterdam?"}],
        tools=[{
            "type": "function",
            "function": {
                "name": "get_weather",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
    )
Response

    {
      "choices": [{
        "message": {
          "role": "assistant",
          "content": null,
          "tool_calls": [{
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"city\": \"Amsterdam\"}"
            }
          }]
        },
        "finish_reason": "tool_calls"
      }]
    }
To continue, append the assistant message (with tool_calls) and a tool message (with the result) back into messages.
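The continuation step can be sketched as follows. The `tool_call_id` field on the tool message comes from the OpenAI Chat Completions format and is assumed to apply here unchanged; `append_tool_results` is a hypothetical helper, and `get_weather` is a stand-in that returns placeholder data.

```python
import json


def get_weather(city: str) -> dict:
    # Stand-in implementation; a real tool would query a weather service.
    return {"city": city, "forecast": "unknown"}


TOOLS = {"get_weather": get_weather}


def append_tool_results(messages: list, assistant_message: dict) -> list:
    """Return a new list with the assistant turn plus one tool message per call."""
    updated = list(messages) + [assistant_message]
    for call in assistant_message.get("tool_calls") or []:
        function = call["function"]
        arguments = json.loads(function["arguments"])  # arrives as a JSON string
        result = TOOLS[function["name"]](**arguments)
        updated.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        })
    return updated


history = [{"role": "user", "content": "What's the weather in Amsterdam?"}]
assistant = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Amsterdam"}'},
    }],
}
follow_up = append_tool_results(history, assistant)
print(json.dumps(follow_up[-1]))
```

The resulting list would be sent as `messages` in the next request so the model can turn the tool result into a final answer.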
Structured output
Force the model to respond with valid JSON using response_format:
JSON mode

    import json

    response = client.chat.completions.create(
        model="auto:quality",
        messages=[{"role": "user", "content": "List 3 planets as JSON"}],
        response_format={"type": "json_object"},
    )
    # The content is a JSON string, so parse it before use.
    data = json.loads(response.choices[0].message.content)
    # {"planets": ["Mercury", "Venus", "Earth"]}
Vision
Send images as multipart content in the messages array:
Image input

    response = client.chat.completions.create(
        model="auto:quality",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {
                    "url": "https://example.com/photo.jpg"
                }},
            ],
        }],
    )
Both URLs and base64 data URIs (data:image/png;base64,...) are supported. Cailos auto-detects vision requirements and routes to a capable model.
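For local files, the base64 data URI form can be produced with the standard library. This is a minimal sketch; `image_part` is a hypothetical helper, and the bytes used here are only the 8-byte PNG signature rather than a real image.

```python
import base64


def image_part(data: bytes, mime_type: str = "image/png") -> dict:
    """Wrap raw image bytes as an image_url content part using a data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


# In practice the bytes would come from something like open(path, "rb").read().
part = image_part(b"\x89PNG\r\n\x1a\n")
print(part["image_url"]["url"])
```

The returned dictionary can be placed in the `content` array next to a `text` part, in the same position as the URL-based part shown above.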
Reasoning
When a reasoning model is used (DeepSeek R1, Qwen3, Claude with extended thinking, OpenAI o-series), the chain-of-thought is returned in reasoning_content — never as raw tags in content:
Response

    {
      "choices": [{
        "message": {
          "role": "assistant",
          "content": "The answer is 42.",
          "reasoning_content": "Let me think through this step by step..."
        }
      }],
      "usage": {
        "prompt_tokens": 15,
        "completion_tokens": 120,
        "reasoning_tokens": 98
      }
    }
Python

    msg = response.choices[0].message
    print(msg.content)  # The actual answer
    if hasattr(msg, "reasoning_content") and msg.reasoning_content:
        print(msg.reasoning_content)  # Chain-of-thought
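Because reasoning_tokens are counted inside completion_tokens, the share of output spent on the visible answer can be derived by subtraction. A small worked example using the usage numbers from the response above; `answer_tokens` is a hypothetical helper.

```python
def answer_tokens(usage: dict) -> int:
    """Completion tokens that were not spent on chain-of-thought."""
    return usage["completion_tokens"] - usage.get("reasoning_tokens", 0)


usage = {"prompt_tokens": 15, "completion_tokens": 120, "reasoning_tokens": 98}
visible = answer_tokens(usage)
print(visible)  # 120 - 98 = 22
```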
- reasoning_content is absent for non-reasoning models.
- reasoning_tokens are included in completion_tokens.
- Thinking tags (<think>, <thinking>, etc.) are extracted automatically; they never appear in content.
Routing
The model field takes a codename (gpt-4o, claude-sonnet). The same codename can be served by multiple providers — Cailos picks the best one based on the strategy. When model is "auto", the ML classifier also picks the best model.
Append a strategy suffix to any model value with : to control how providers are ranked:
| Syntax | Behaviour |
|---|---|
| auto | ML classifier picks the best model and strategy (falls back to balanced when uncertain). |
| gpt-4o | Specific model, strategy auto-predicted from prompt content. |
| model:speed | Fastest response — ranks providers by lowest latency and highest TPS. |
| model:cost | Cheapest — ranks providers by lowest $/M tokens. |
| model:quality | Most capable — ranks providers by highest intelligence rating. |
| model:balanced | Composite blend of quality (40%), speed (35%), and cost (25%). |
Where model is either auto or a specific codename. Examples: auto:speed, gpt-4o:cost, claude-sonnet:quality.
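The `codename:strategy` syntax is simple enough to split on the client, for example to log which strategy was requested. A minimal sketch; `split_model` is a hypothetical helper, and rejecting unknown suffixes is a choice made here, not documented gateway behaviour.

```python
STRATEGIES = {"speed", "cost", "quality", "balanced"}


def split_model(value: str):
    """Split 'codename:strategy' into (codename, strategy or None)."""
    codename, separator, suffix = value.partition(":")
    if not codename:
        raise ValueError("model codename is empty")
    if not separator:
        return codename, None  # no suffix: strategy is predicted by the gateway
    if suffix not in STRATEGIES:
        raise ValueError(f"unknown strategy suffix: {suffix!r}")
    return codename, suffix


for value in ("auto:speed", "gpt-4o:cost", "claude-sonnet"):
    print(value, "->", split_model(value))
```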
Choosing a strategy
| Use case | Recommended |
|---|---|
| Real-time chat, autocomplete, typing indicators | :speed — minimises time-to-first-token and prioritises throughput. |
| High-volume batch processing, background jobs, bulk classification | :cost — routes to the cheapest available provider for the model. |
| Complex reasoning, code generation, high-stakes analysis | :quality — picks the provider with the highest eval scores. |
| General-purpose, unsure, or mixed workloads | :balanced blends all three factors. Omitting the suffix lets Cailos predict the strategy from the prompt. |
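The :balanced blend (quality 40%, speed 35%, cost 25%) can be illustrated as a weighted sum. This sketch assumes each factor is already normalised to a 0 to 1 score where higher is better; that normalisation, the provider names, and the numbers are invented for illustration and say nothing about how Cailos computes its scores internally.

```python
# Weights from the routing table.
WEIGHTS = {"quality": 0.40, "speed": 0.35, "cost": 0.25}


def balanced_score(scores: dict) -> float:
    """Weighted sum of normalised factor scores (assumed 0..1, higher is better)."""
    return sum(weight * scores[factor] for factor, weight in WEIGHTS.items())


# Invented example providers, not real measurements.
candidates = {
    "provider_a": {"quality": 0.9, "speed": 0.5, "cost": 0.4},
    "provider_b": {"quality": 0.7, "speed": 0.9, "cost": 0.8},
}

ranked = sorted(candidates, key=lambda name: balanced_score(candidates[name]), reverse=True)
for name in ranked:
    print(name, round(balanced_score(candidates[name]), 3))
```

With these made-up inputs, the second provider ranks first despite its lower quality score, because its speed and cost scores outweigh the difference.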
When you send a codename such as gpt-4o without a suffix, Cailos auto-predicts the best strategy from your prompt content. The suffix is only needed when you want explicit control.
Routing hints
Fine-tune routing with the cailos object. Most fields are auto-detected from the request — you only need to set them for explicit overrides.
| Field | Type | Description |
|---|---|---|
| trust_level | integer | 0–3. Only routes to endpoints at or above this trust level. Cannot go below your team's floor. |
| require_tools | boolean | Only route to models with function calling support. Auto-detected from tools. |
| require_vision | boolean | Only route to models with image input support. Auto-detected from image_url in messages. |
| require_structured_output | boolean | Only route to models with JSON output support. Auto-detected from response_format. |
| require_web_access | boolean | Only route to models with web access. |
| language | string | ISO 639-1 code. Only route to models supporting this language. Auto-detected from message content. |
Example

    {
      "model": "auto",
      "messages": [...],
      "cailos": {
        "trust_level": 2,
        "language": "en"
      }
    }
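A request body with routing hints can be assembled and sanity-checked before sending. The sketch below is illustrative only; `routing_hints` is a hypothetical helper, and its checks mirror the field table rather than any published validation rules.

```python
BOOLEAN_HINTS = {
    "require_tools",
    "require_vision",
    "require_structured_output",
    "require_web_access",
}


def routing_hints(trust_level=None, language=None, **requirements) -> dict:
    """Build the cailos object, keeping only the hints that were set."""
    hints = {}
    if trust_level is not None:
        if trust_level not in (0, 1, 2, 3):
            raise ValueError("trust_level must be an integer from 0 to 3")
        hints["trust_level"] = trust_level
    if language is not None:
        if len(language) != 2 or not language.isalpha():
            raise ValueError("language must be a two-letter ISO 639-1 code")
        hints["language"] = language.lower()
    for name, value in requirements.items():
        if name not in BOOLEAN_HINTS:
            raise ValueError(f"unknown routing hint: {name}")
        hints[name] = bool(value)
    return hints


body = {
    "model": "auto",
    "messages": [{"role": "user", "content": "Hello"}],
    "cailos": routing_hints(trust_level=2, language="en"),
}
print(body["cailos"])
```

With the OpenAI Python SDK, non-standard fields like this are usually passed through the `extra_body` argument of `create`, for example `extra_body={"cailos": hints}`; treat that as an assumption to verify against the SDK version in use.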
Feedback
Submit human feedback on any routed request. Feedback is used to improve routing quality — satisfaction ratings per endpoint influence model selection over time.
The {id} in the URL is the id field returned in every chat completion response — pass it through unchanged. No authentication required; the id is unguessable and feedback is only accepted within 1 week of the request being created. After that window, the endpoint returns 410 Gone.
Request body
| Parameter | Type | Required | Description |
|---|---|---|---|
| positive | string | required | "1" for thumbs up, "0" for thumbs down. |
| comment | string | optional | Short note on what went wrong or right. Max 100 characters. |
cURL

    curl -X POST /accounts/requests/550e8400-e29b-41d4-a716-446655440000/feedback/ \
      -H "Content-Type: application/x-www-form-urlencoded" \
      -d 'positive=0' \
      --data-urlencode 'comment=wrong language in response'
The endpoint returns an HTML partial (designed for htmx). The feedback is also visible on the request detail page in the dashboard, and aggregated satisfaction ratings appear per endpoint on the model browse page.
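The form body for a feedback submission can be built with the standard library, which also takes care of URL-encoding the comment. A minimal sketch; `feedback_path` and `feedback_form` are hypothetical helpers, and the path is relative to the base URL as in the cURL example.

```python
from urllib.parse import urlencode


def feedback_path(request_id: str) -> str:
    """Relative path of the feedback endpoint for a completion id."""
    return f"/accounts/requests/{request_id}/feedback/"


def feedback_form(positive: bool, comment: str = "") -> str:
    """Encode the feedback fields as application/x-www-form-urlencoded."""
    if len(comment) > 100:
        raise ValueError("comment is limited to 100 characters")
    fields = {"positive": "1" if positive else "0"}
    if comment:
        fields["comment"] = comment
    return urlencode(fields)


path = feedback_path("550e8400-e29b-41d4-a716-446655440000")
form = feedback_form(False, "wrong language in response")
print(path)
print(form)
```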
Errors
All errors follow the OpenAI error envelope:
    {
      "error": {
        "message": "Model 'nonexistent' not found or not active.",
        "type": "not_found_error"
      }
    }
| Status | Meaning |
|---|---|
| 200 | Success. |
| 400 | Malformed JSON or invalid parameters. |
| 401 | Missing or invalid API key. |
| 403 | Key valid but team is inactive. |
| 404 | Model codename not found or no active providers. |
| 429 | Rate limit exceeded. Retry after 60 seconds. |
| 502 | Upstream provider returned an error. |
| 503 | Model unavailable — circuit breaker open or provider down. |
| 504 | Upstream provider timed out. |
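A client can combine the status code with the error envelope to decide what to do next. In the sketch below, `describe_error` is a hypothetical helper, and treating 429, 502, 503 and 504 as retryable is an assumption based on the meanings in the table, not a documented guarantee.

```python
import json

# Assumption: these statuses describe transient conditions worth retrying.
RETRYABLE = {429, 502, 503, 504}


def describe_error(status: int, body: str) -> dict:
    """Extract type and message from an error response, tolerating non-JSON bodies."""
    try:
        parsed = json.loads(body)
    except ValueError:
        parsed = None
    error = parsed.get("error") if isinstance(parsed, dict) else None
    if not isinstance(error, dict):
        error = {}
    return {
        "status": status,
        "type": error.get("type", "unknown"),
        "message": error.get("message", body),
        "retryable": status in RETRYABLE,
    }


sample = '{"error": {"message": "Model \'nonexistent\' not found or not active.", "type": "not_found_error"}}'
info = describe_error(404, sample)
print(info)
```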
Rate Limits
Limits are enforced per key, per minute. When the limit is exceeded, requests return 429 until the 60-second window resets. A rate limit of 0 means unlimited.
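One way to respect the 60-second window is to pause and retry when a 429 comes back. The following is a sketch under stated assumptions: `send_with_retry` is a hypothetical helper, `send` is any callable returning a `(status, payload)` pair, and the demo uses a fake transport and a fake sleep so nothing actually waits.

```python
import time


def send_with_retry(send, max_attempts=3, wait_seconds=60, sleep=time.sleep):
    """Call send() until it returns a non-429 status or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        status, payload = send()
        if status != 429 or attempt == max_attempts:
            return status, payload
        sleep(wait_seconds)  # wait for the rate-limit window to reset


# Demo: two rate-limited responses followed by a success.
responses = iter([(429, None), (429, None), (200, {"ok": True})])
waits = []
result = send_with_retry(lambda: next(responses), sleep=waits.append)
print(result, waits)
```

Injecting `sleep` keeps the helper testable; production code would leave the default `time.sleep` in place.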
Providers
You never interact with providers directly — just send a codename. For transparency:
| Provider | Format | Notes |
|---|---|---|
| OpenAI | native | Forwarded as-is. |
| Anthropic | converted | System prompt, tools, vision, thinking — fully translated. |
| Google | converted | Gemini API. Thinking parts extracted. |
| Cohere | converted | v2 Chat API. |
| Groq, Together, DeepInfra, Cerebras, Scaleway, SambaNova, xAI, OpenRouter, Novita, NCompass, NScale, Inception, Tinfoil | native | OpenAI-compatible. |