API Reference

OpenAI-compatible AI gateway. Point any OpenAI-compatible SDK at it by changing base_url and api_key.

Authentication

Every request requires an API key in the Authorization header:

Authorization: Bearer cai_your_key_here

Create keys in the dashboard (per-team). Keys are Argon2-hashed at rest — the full key is shown only once at creation.

Base URL


    

Set this as base_url in any OpenAI SDK. All paths below are relative to this.

Quickstart

Python
from openai import OpenAI

client = OpenAI(
    base_url="YOUR_BASE_URL",  # placeholder: use the gateway base URL from the Base URL section
    api_key="cai_your_key_here",
)

response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
cURL
curl /chat/completions \
  -H "Authorization: Bearer cai_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
Any OpenAI-compatible SDK works. Python, Node, Go, Rust — just change base_url and api_key.

List Models

GET /v1/models

Returns all available model codenames.

Response
{
  "object": "list",
  "data": [
    { "id": "auto", "object": "model" },
    { "id": "claude-sonnet", "object": "model" },
    { "id": "gpt-4o", "object": "model" }
  ]
}
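
As a small illustration of consuming this response, the sketch below pulls the codenames out of a payload shaped like the example above. The function name model_ids is hypothetical, and the payload is the sample from this page, not live output.

```python
import json


def model_ids(payload: dict) -> list:
    """Return the codenames from a /v1/models style response body."""
    return [
        entry["id"]
        for entry in payload.get("data", [])
        if entry.get("object") == "model"
    ]


# Sample body copied from the response shown above.
sample = json.loads("""
{
  "object": "list",
  "data": [
    { "id": "auto", "object": "model" },
    { "id": "claude-sonnet", "object": "model" },
    { "id": "gpt-4o", "object": "model" }
  ]
}
""")

print(model_ids(sample))
```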

Chat Completions

POST /v1/chat/completions

Send a conversation, get a completion. Fully compatible with the OpenAI Chat Completions API.

Request body

Parameter | Type | Description
model | string, required | Model codename (gpt-4o, claude-sonnet) or auto for intelligent routing. Append a strategy suffix to control provider ranking: auto:speed, gpt-4o:cost, claude-sonnet:quality. See Routing.
messages | array, required | The conversation. Each message has role ("system", "user", "assistant", "tool") and content (string or multipart array).
stream | boolean, optional | Stream response as Server-Sent Events. See Streaming.
tools | array, optional | Tool/function definitions. See Tool calls.
tool_choice | string or object, optional | "auto", "none", "required", or {"type": "function", "function": {"name": "..."}}.
response_format | object, optional | Force JSON output. See Structured output.
max_tokens | integer, optional | Max tokens to generate. Capped at the model's limit. max_completion_tokens is an alias (takes precedence if both are set).
temperature | float, optional | 0 to 2. Higher = more random.
top_p | float, optional | 0 to 1. Nucleus sampling.
stop | string or array, optional | Up to 4 stop sequences.
presence_penalty | float, optional | -2 to 2.
frequency_penalty | float, optional | -2 to 2.
seed | integer, optional | Deterministic sampling (best-effort).
n | integer, optional | Number of completions. Default: 1.
logit_bias | object, optional | Token ID to bias value mapping.
user | string, optional | End-user identifier for abuse detection.
cailos | object, optional | Routing hints. See Routing.
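
A request can be checked client-side before it is sent. The sketch below is one way to do that, assuming the numeric ranges match the OpenAI Chat Completions API (temperature 0 to 2, top_p 0 to 1, both penalties -2 to 2) and that stop accepts at most 4 sequences. The helper name check_sampling_params is hypothetical and the gateway's own validation may differ.

```python
# Assumed ranges, taken from the OpenAI Chat Completions API.
RANGES = {
    "temperature": (0.0, 2.0),
    "top_p": (0.0, 1.0),
    "presence_penalty": (-2.0, 2.0),
    "frequency_penalty": (-2.0, 2.0),
}


def check_sampling_params(params: dict) -> list:
    """Return human-readable problems found in the optional parameters."""
    problems = []
    for name, (low, high) in RANGES.items():
        value = params.get(name)
        if value is not None and not (low <= value <= high):
            problems.append(f"{name} must be between {low} and {high}")
    stop = params.get("stop")
    if isinstance(stop, list) and len(stop) > 4:
        problems.append("stop accepts at most 4 sequences")
    return problems


print(check_sampling_params({"temperature": 3, "stop": ["a", "b"]}))
```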

Response

Standard OpenAI chat completion format, plus a cailos metadata object:

200 OK
{
  "id": "chatcmpl-8a3b2c1d...",
  "object": "chat.completion",
  "model": "gpt-4o-2024-08-06",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you?",
      "reasoning_content": null,     // present only for reasoning models
      "tool_calls": null              // present only when tools are invoked
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21,
    "reasoning_tokens": 0          // tokens used for chain-of-thought
  },
  "cailos": {
    "optimise": "quality",
    "provider": "openai",
    "endpoint": "gpt-4o-2024-08-06",
    "trust_level": 2,
    "detected_languages": ["en"],
    "tier": "paid"
  }
}
Field | Description
message.content | The model's text response. null when the model responds with tool calls only.
message.reasoning_content | Chain-of-thought reasoning. Absent for non-reasoning models. See Reasoning.
message.tool_calls | Function calls made by the model. See Tool calls.
finish_reason | "stop", "length", or "tool_calls".
usage.reasoning_tokens | Tokens used for thinking. Included in completion_tokens.
cailos.optimise | Strategy used: "speed", "cost", "quality", or "balanced".
cailos.provider | Provider that served the request (e.g. "openai", "anthropic").
cailos.endpoint | Provider's native model ID.
cailos.trust_level | Trust level of the endpoint (0 to 3).
cailos.detected_languages | Auto-detected input languages (e.g. ["en", "he"]), or null.
cailos.tier | Your team's plan ("free" or "paid").
The cailos object is a Cailos extension — not part of the OpenAI spec. SDKs ignore unknown keys, so it's safe to leave in your response handling.
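
To show how the extension can be read without breaking on responses that lack it, here is a minimal sketch that works on the raw JSON body decoded to a dict. The helper name routing_summary is hypothetical. How a given SDK exposes unknown keys varies and is not covered here.

```python
def routing_summary(payload: dict) -> str:
    """Describe which provider served a request, tolerating a missing cailos object."""
    meta = payload.get("cailos") or {}
    provider = meta.get("provider", "unknown")
    endpoint = meta.get("endpoint", "unknown")
    strategy = meta.get("optimise", "unknown")
    return f"{provider}/{endpoint} via {strategy}"


# Metadata copied from the 200 OK example above.
sample = {
    "model": "gpt-4o-2024-08-06",
    "cailos": {
        "optimise": "quality",
        "provider": "openai",
        "endpoint": "gpt-4o-2024-08-06",
        "trust_level": 2,
        "detected_languages": ["en"],
        "tier": "paid",
    },
}

print(routing_summary(sample))
```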

Examples

Streaming

Set stream: true to receive Server-Sent Events. Each event is a ChatCompletionChunk with incremental deltas, ending with data: [DONE].

Python
stream = client.chat.completions.create(
    model="auto:speed",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="")
SSE format
data: {"choices": [{"delta": {"content": "Hello"}}]}

data: {"choices": [{"delta": {"content": "!"}}]}

data: {"choices": [{"delta": {}, "finish_reason": "stop"}]}

data: [DONE]
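
If you consume the stream without an SDK, the events above can be reassembled by hand. The sketch below is a minimal parser under the assumption that every event is a single data: line, as in the sample. The function name collect_stream_text is hypothetical.

```python
import json


def collect_stream_text(lines) -> str:
    """Concatenate delta.content from SSE data lines until the [DONE] marker."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip the blank separators between events
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


# The sample events from the SSE format block above.
sample = [
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "!"}}]}',
    "",
    'data: {"choices": [{"delta": {}, "finish_reason": "stop"}]}',
    "",
    "data: [DONE]",
]

print(collect_stream_text(sample))
```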

Tool calls

Define tools in the request. When the model invokes one, finish_reason is "tool_calls" and content is null:

Request
response = client.chat.completions.create(
    model="auto:quality",
    messages=[{"role": "user", "content": "What's the weather in Amsterdam?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"]
            }
        }
    }],
)
Response
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"city\": \"Amsterdam\"}"
        }
      }]
    },
    "finish_reason": "tool_calls"
  }]
}

To continue, append the assistant message (with tool_calls) and a tool message (with the result) back into messages.
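
The continuation step can be sketched as a pure function over message dicts. The helper name with_tool_results and the weather result are hypothetical. The tool_call_id field on the tool message follows the OpenAI Chat Completions format.

```python
import json


def with_tool_results(messages, assistant_message, results):
    """Return a new message list with the tool-call turn and one tool message per call.

    results maps each tool call id to a JSON-serialisable value.
    """
    extended = list(messages) + [assistant_message]
    for call in assistant_message.get("tool_calls") or []:
        extended.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(results[call["id"]]),
        })
    return extended


history = [{"role": "user", "content": "What's the weather in Amsterdam?"}]
assistant = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"city\": \"Amsterdam\"}"},
    }],
}
# Made-up tool output, for illustration only.
followup = with_tool_results(history, assistant, {"call_abc123": {"temp_c": 14}})
print([m["role"] for m in followup])
```

The returned list is what you would pass as messages in the next request.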

Structured output

Force the model to respond with valid JSON using response_format:

JSON mode
import json

response = client.chat.completions.create(
    model="auto:quality",
    messages=[{"role": "user", "content": "List 3 planets as JSON"}],
    response_format={"type": "json_object"},
)

data = json.loads(response.choices[0].message.content)
# {"planets": ["Mercury", "Venus", "Earth"]}
Healing. If the model wraps JSON in code fences, has trailing commas, or truncates output, Cailos repairs it automatically before returning.
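
To make the kind of repair described here concrete, the sketch below strips a wrapping code fence and drops trailing commas before parsing. It is a client-side illustration only, not Cailos's implementation. It does not attempt to fix truncated output, and the name heal_json is hypothetical.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def heal_json(text: str):
    """Best-effort parse: remove a wrapping code fence, then trailing commas."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line, which may carry a language tag.
        _, _, cleaned = cleaned.partition("\n")
        cleaned = cleaned.rstrip()
        if cleaned.endswith(FENCE):
            cleaned = cleaned[: -len(FENCE)]
    # Remove commas that sit directly before a closing brace or bracket.
    # Caveat: this naive pattern would also touch such text inside string values.
    cleaned = re.sub(r",\s*([}\]])", r"\1", cleaned)
    return json.loads(cleaned)


wrapped = FENCE + 'json\n{"planets": ["Mercury", "Venus", "Earth",],}\n' + FENCE
print(heal_json(wrapped))
```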

Vision

Send images as multipart content in the messages array:

Image input
response = client.chat.completions.create(
    model="auto:quality",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {
                "url": "https://example.com/photo.jpg"
            }}
        ]
    }],
)

Both URLs and base64 data URIs (data:image/png;base64,...) are supported. Cailos auto-detects vision requirements and routes to a capable model.
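
For local files, the data URI form can be built with the standard library. The sketch below assumes you already have the image bytes in memory. The helper name image_part_from_bytes is hypothetical.

```python
import base64


def image_part_from_bytes(data: bytes, mime_type: str = "image/png") -> dict:
    """Build an image_url content part that carries the image as a base64 data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


# Two placeholder bytes stand in for real image data.
part = image_part_from_bytes(b"hi")
print(part["image_url"]["url"])
```

The resulting dict goes into the content array alongside the text part, in place of the URL variant shown above.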

Reasoning

When a reasoning model is used (DeepSeek R1, Qwen3, Claude with extended thinking, OpenAI o-series), the chain-of-thought is returned in reasoning_content — never as raw tags in content:

Response
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "The answer is 42.",
      "reasoning_content": "Let me think through this step by step..."
    }
  }],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 120,
    "reasoning_tokens": 98
  }
}
Python
msg = response.choices[0].message
print(msg.content)             # The actual answer

if hasattr(msg, "reasoning_content") and msg.reasoning_content:
    print(msg.reasoning_content)   # Chain-of-thought
  • reasoning_content is absent for non-reasoning models.
  • reasoning_tokens are included in completion_tokens.
  • Thinking tags (<think>, <thinking>, etc.) are extracted automatically — they never appear in content.
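
Because reasoning_tokens are counted inside completion_tokens, the tokens spent on the visible answer are the difference between the two. A short worked example using the usage block above (the helper name answer_tokens is hypothetical):

```python
def answer_tokens(usage: dict) -> int:
    """Completion tokens left for the visible answer after subtracting reasoning."""
    return usage["completion_tokens"] - usage.get("reasoning_tokens", 0)


usage = {"prompt_tokens": 15, "completion_tokens": 120, "reasoning_tokens": 98}
print(answer_tokens(usage))  # 120 - 98 = 22
```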

Routing

The model field takes a codename (gpt-4o, claude-sonnet). The same codename can be served by multiple providers — Cailos picks the best one based on the strategy. When model is "auto", the ML classifier also picks the best model.

Append a strategy suffix to any model value with : to control how providers are ranked:

Syntax | Behaviour
auto | ML classifier picks the best model and strategy (falls back to balanced when uncertain).
gpt-4o | Specific model, strategy auto-predicted from prompt content.
model:speed | Fastest response: ranks providers by lowest latency and highest TPS.
model:cost | Cheapest: ranks providers by lowest $/M tokens.
model:quality | Most capable: ranks providers by highest intelligence rating.
model:balanced | Composite blend of quality (40%), speed (35%), and cost (25%).

Where model is either auto or a specific codename. Examples: auto:speed, gpt-4o:cost, claude-sonnet:quality.
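
The balanced weights listed above (quality 40%, speed 35%, cost 25%) can be illustrated with a small ranking sketch. It assumes each factor has already been normalised to a 0 to 1 score where higher is better. That normalisation, the provider names, and the scores are assumptions for illustration, not documented behaviour.

```python
# Weights from the balanced strategy described above.
WEIGHTS = {"quality": 0.40, "speed": 0.35, "cost": 0.25}


def balanced_score(scores: dict) -> float:
    """Weighted blend of normalised quality, speed, and cost scores."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical providers with made-up normalised scores.
providers = {
    "provider_a": {"quality": 0.9, "speed": 0.5, "cost": 0.4},
    "provider_b": {"quality": 0.7, "speed": 0.9, "cost": 0.8},
}

ranked = sorted(providers, key=lambda name: balanced_score(providers[name]), reverse=True)
print(ranked)
```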

Choosing a strategy

Use case | Recommended
Real-time chat, autocomplete, typing indicators | :speed. Minimises time-to-first-token and prioritises throughput.
High-volume batch processing, background jobs, bulk classification | :cost. Routes to the cheapest available provider for the model.
Complex reasoning, code generation, high-stakes analysis | :quality. Picks the provider with the highest eval scores.
General-purpose, unsure, or mixed workloads | :balanced or omit the suffix. Blends all three factors.
Tip. When you specify a model like gpt-4o without a suffix, Cailos auto-predicts the best strategy from your prompt content. The suffix is only needed when you want explicit control.
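
When the suffix is chosen at runtime, a small helper keeps the model string well-formed. This is a sketch: the name model_with_strategy is hypothetical, and the accepted strategies are the four listed in this section.

```python
from typing import Optional

STRATEGIES = {"speed", "cost", "quality", "balanced"}


def model_with_strategy(codename: str, strategy: Optional[str] = None) -> str:
    """Return codename, or codename:strategy after validating the strategy."""
    if strategy is None:
        return codename  # no suffix: the strategy is left to auto-prediction
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return f"{codename}:{strategy}"


print(model_with_strategy("gpt-4o", "cost"))
print(model_with_strategy("auto"))
```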

Routing hints

Fine-tune routing with the cailos object. Most fields are auto-detected from the request — you only need to set them for explicit overrides.

Field | Type | Description
trust_level | integer | 0 to 3. Only routes to endpoints at or above this trust level. Cannot go below your team's floor.
require_tools | boolean | Only route to models with function calling support. Auto-detected from tools.
require_vision | boolean | Only route to models with image input support. Auto-detected from image_url in messages.
require_structured_output | boolean | Only route to models with JSON output support. Auto-detected from response_format.
require_web_access | boolean | Only route to models with web access.
language | string | ISO 639-1 code. Only route to models supporting this language. Auto-detected from message content.
Example
{
  "model": "auto",
  "messages": [...],
  "cailos": {
    "trust_level": 2,
    "language": "en"
  }
}
Auto-detection. Fields marked auto-detected are inferred from your request payload. You don't need to set them unless you want to override.
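
A request body with explicit hints can be assembled as below. The helper name build_request is hypothetical, and it only accepts the hint fields listed in this section. With the OpenAI Python SDK, non-standard top-level fields such as cailos are usually passed through the extra_body argument of create. That wiring is an assumption here, not something this page states.

```python
ALLOWED_HINTS = {
    "trust_level",
    "require_tools",
    "require_vision",
    "require_structured_output",
    "require_web_access",
    "language",
}


def build_request(model: str, messages: list, **hints) -> dict:
    """Build a chat completion body, adding a cailos object only when hints are given."""
    unknown = sorted(set(hints) - ALLOWED_HINTS)
    if unknown:
        raise ValueError(f"unknown routing hints: {unknown}")
    body = {"model": model, "messages": messages}
    if hints:
        body["cailos"] = dict(hints)
    return body


body = build_request(
    "auto",
    [{"role": "user", "content": "Hello"}],
    trust_level=2,
    language="en",
)
print(body["cailos"])
```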

Feedback

Submit human feedback on any routed request. Feedback is used to improve routing quality — satisfaction ratings per endpoint influence model selection over time.

POST /accounts/requests/{id}/feedback/

The {id} in the URL is the id field returned in every chat completion response — pass it through unchanged. No authentication required; the id is unguessable and feedback is only accepted within 1 week of the request being created. After that window, the endpoint returns 410 Gone.

Request body

Parameter | Type | Description
positive | string, required | "1" for thumbs up, "0" for thumbs down.
comment | string, optional | Short note on what went wrong or right. Max 100 characters.
cURL
curl -X POST /accounts/requests/550e8400-e29b-41d4-a716-446655440000/feedback/ \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d 'positive=0&comment=wrong language in response'

The endpoint returns an HTML partial (designed for htmx). The feedback is also visible on the request detail page in the dashboard, and aggregated satisfaction ratings appear per endpoint on the model browse page.
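
The form body from the cURL example can also be produced programmatically. The sketch below encodes the two documented fields and enforces the 100-character comment limit before anything is sent. The helper name feedback_form is hypothetical, and sending the request is left to your HTTP client.

```python
from urllib.parse import urlencode


def feedback_form(positive: bool, comment: str = "") -> str:
    """Return an application/x-www-form-urlencoded body for the feedback endpoint."""
    if len(comment) > 100:
        raise ValueError("comment is limited to 100 characters")
    fields = {"positive": "1" if positive else "0"}
    if comment:
        fields["comment"] = comment
    return urlencode(fields)


print(feedback_form(False, "wrong language in response"))
```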

Why give feedback? Even sparse feedback (a few percent of requests) gives Cailos a quality signal per endpoint and task type that static eval scores can't capture. Over time this directly improves which provider gets picked for your workload.

Errors

All errors follow the OpenAI error envelope:

{
  "error": {
    "message": "Model 'nonexistent' not found or not active.",
    "type": "not_found_error"
  }
}
Status | Meaning
200 | Success.
400 | Malformed JSON or invalid parameters.
401 | Missing or invalid API key.
403 | Key valid but team is inactive.
404 | Model codename not found or no active providers.
429 | Rate limit exceeded. Retry after 60 seconds.
502 | Upstream provider returned an error.
503 | Model unavailable: circuit breaker open or provider down.
504 | Upstream provider timed out.
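
One way to turn these status codes into a retry policy is sketched below. The 60-second wait for 429 comes from this page. Treating 502, 503, and 504 as retryable with capped exponential backoff is an assumption, as is the helper name retry_delay.

```python
from typing import Optional

# Assumption: upstream failures are transient and worth retrying.
RETRYABLE_UPSTREAM = {502, 503, 504}


def retry_delay(status: int, attempt: int) -> Optional[float]:
    """Seconds to wait before retrying, or None when a retry is unlikely to help.

    attempt counts retries already made, starting at 0.
    """
    if status == 429:
        return 60.0  # documented retry interval for rate limiting
    if status in RETRYABLE_UPSTREAM:
        return min(2.0 ** attempt, 30.0)  # assumed backoff: 1s, 2s, 4s ... capped at 30s
    return None  # 400, 401, 403, 404: fix the request instead of retrying


for status in (429, 503, 401):
    print(status, retry_delay(status, attempt=2))
```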

Rate Limits

Rate limits are applied per key, per minute. When the limit is exceeded, requests return 429. The window resets every 60 seconds. A rate limit of 0 means unlimited.
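
To avoid hitting 429 in the first place, a client can keep its own count per window. The sketch below assumes a fixed 60-second window that starts when the counter is created. How the server aligns its windows is not documented here, so treat this as an approximation. The class name MinuteWindow is hypothetical.

```python
import time


class MinuteWindow:
    """Client-side request counter for a per-minute limit (0 means unlimited)."""

    def __init__(self, limit: int, clock=time.monotonic):
        self.limit = limit
        self.clock = clock
        self.window_start = clock()
        self.count = 0

    def allow(self) -> bool:
        """Return True and count the request if it fits in the current window."""
        now = self.clock()
        if now - self.window_start >= 60:
            # A new window begins: reset the counter.
            self.window_start = now
            self.count = 0
        if self.limit != 0 and self.count >= self.limit:
            return False
        self.count += 1
        return True


window = MinuteWindow(limit=2)
print([window.allow() for _ in range(3)])
```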

Providers

You never interact with providers directly — just send a codename. For transparency:

Provider | Format | Notes
OpenAI | native | Forwarded as-is.
Anthropic | converted | System prompt, tools, vision, and thinking are fully translated.
Google | converted | Gemini API. Thinking parts extracted.
Cohere | converted | v2 Chat API.
Groq, Together, DeepInfra, Cerebras, Scaleway, SambaNova, xAI, OpenRouter, Novita, NCompass, NScale, Inception, Tinfoil | native | OpenAI-compatible.