Price-performance arbitrage for LLM inference.
| TIME | STRATEGY | MODEL |
|---|---|---|
Simulated routing decisions. Request content is illustrative only.
Every major model. Every major provider.
We solve model selection with four signals: evals, an ML classifier, RLHF feedback, and a 15-stage ranking pipeline that blends them on every request.
Continuous evals
Every endpoint is graded against an internal benchmark suite covering intelligence, tool calling, structured output, and vision before it's eligible for routing. Models that score poorly on a task class get filtered out when prompts of that class arrive.
Intent classifier
An ML model reads each prompt and predicts the right strategy axis: speed, quality, cost, or balanced. Inference runs in milliseconds and adapts to your traffic over time.
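A toy sketch of the classifier's contract, a prompt in, one of the four strategy axes out. The real system is a trained ML model; this keyword heuristic is purely illustrative and the keywords are assumptions:

```python
def classify_intent(prompt: str) -> str:
    # Hypothetical stand-in for the ML classifier: map a prompt to a
    # strategy axis. Real inference runs in milliseconds per request.
    p = prompt.lower()
    if any(k in p for k in ("quick", "summarize", "tl;dr")):
        return "speed"
    if any(k in p for k in ("prove", "analyze", "debug")):
        return "quality"
    if len(p) < 80:          # short, simple prompts: optimize for cost
        return "cost"
    return "balanced"

print(classify_intent("Debug this stack trace"))  # → quality
```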
RLHF feedback
Every thumbs up or down on a response sharpens our picture of which endpoints excel at which task classes, like SQL, debugging, or creative writing. The router learns the matching automatically.
15-stage pipeline
A multi-stage ranking pipeline blends every signal in real time (evals, reliability, user satisfaction, cost, latency) and picks the cheapest endpoint that meets the constraints. End to end in under 10ms.
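The selection step can be sketched as a filter-then-rank pass. Field names, blend weights, and thresholds below are illustrative assumptions, not Cailos internals:

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    eval_score: float     # 0-1, task-class benchmark score
    reliability: float    # 0-1, rolling success rate
    satisfaction: float   # 0-1, RLHF thumbs-up rate
    latency_ms: float
    cost_per_mtok: float  # cost per million tokens

def route(endpoints, min_quality=0.7, min_reliability=0.95, max_latency_ms=2000):
    # Blend eval and satisfaction into one quality score, enforce hard
    # constraints, then pick the cheapest endpoint that clears them.
    eligible = [e for e in endpoints
                if 0.7 * e.eval_score + 0.3 * e.satisfaction >= min_quality
                and e.reliability >= min_reliability
                and e.latency_ms <= max_latency_ms]
    return min(eligible, key=lambda e: e.cost_per_mtok, default=None)

fleet = [Endpoint("a", 0.9, 0.99, 0.8, 400, 2.5),
         Endpoint("b", 0.75, 0.98, 0.7, 600, 0.6),
         Endpoint("c", 0.6, 0.99, 0.9, 300, 0.2)]
print(route(fleet).name)  # → b  (c fails the quality bar; b is cheapest left)
```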
Drop Cailos into any OpenAI-compatible agent framework. The SDK doesn't change. Just the endpoint. Model selection becomes automatic.
from agents import Agent
triage = Agent(
name="triage",
model="gpt-4o-mini",
instructions="Classify: billing, technical, or escalate.",
)
resolver = Agent(
name="resolver",
model="gpt-4o",
instructions="Draft resolution from KB.",
tools=[search_kb, lookup_customer],
)
from agents import Agent, set_default_openai_client
from openai import AsyncOpenAI
cailos = AsyncOpenAI(base_url="https://cailos.com/v1", api_key="cai_...")
set_default_openai_client(cailos)  # route every Agent call through Cailos
triage = Agent(
name="triage",
model="auto", # fastest cheap model
instructions="Classify: billing, technical, or escalate.",
)
resolver = Agent(
name="resolver",
model="auto", # best tool-calling model
instructions="Draft resolution from KB.",
tools=[search_kb, lookup_customer],
)
Infrastructure is fragile
Every provider has outages, rate limits, and degraded performance windows. Building on a single provider means inheriting their worst day as yours. Cailos treats failure as a routing event, not an incident.
Provider fails
OpenAI returns 503. Anthropic hits a rate limit. Google times out. Every provider has bad minutes.
Circuit breakers trip
Persistent failures trigger circuit breakers that isolate the problem. Traffic shifts to healthy endpoints automatically.
Your request lands
Every request tries ranked endpoints in sequence. If the first fails, the second picks up. Your users never see the failure.
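The failover sequence above reduces to a loop over the ranked list: first success wins, any failure falls through to the next endpoint. The `call_endpoint` helper and endpoint names are hypothetical:

```python
def complete_with_failover(request, ranked_endpoints, call_endpoint):
    # Try endpoints in ranked order; a 503, rate limit, or timeout
    # surfaces as an exception and we fall through to the next one.
    errors = []
    for endpoint in ranked_endpoints:
        try:
            return call_endpoint(endpoint, request)
        except Exception as exc:
            errors.append((endpoint, exc))
    raise RuntimeError(f"all endpoints failed: {errors}")
```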
Strong systems require strong infrastructure. Single-provider dependencies are a design flaw, not a tradeoff.
Three steps. Two lines of code. Every model, one endpoint.
You send a request
Standard OpenAI format. Set model="auto" or name a specific model. Add optional routing hints.
Cailos evaluates 102 endpoints
Filters by capability, trust level, and budget. Ranks by your strategy: quality, speed, or cost. Selects the best match.
Best model responds
PII is cloaked before transit. If the provider fails, the next-best endpoint picks up automatically. Your request always lands.
from openai import OpenAI
client = OpenAI(
base_url="https://cailos.com/v1",
api_key="cai_...",
)
response = client.chat.completions.create(
model="auto:quality",
messages=[{"role": "user", "content": "..."}],
)
Standard OpenAI SDK. Change base_url and api_key. Append a strategy to any model: auto:quality, gpt-4o:cost, claude-sonnet:speed, or just auto.
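The model string splits cleanly on the first colon; a one-line sketch of the parse (not the server's actual parser):

```python
def parse_model(model: str, default_strategy: str = "balanced"):
    # "gpt-4o:cost" → ("gpt-4o", "cost"); bare "auto" falls back to the default.
    name, _, strategy = model.partition(":")
    return name, strategy or default_strategy

print(parse_model("auto:quality"))  # → ('auto', 'quality')
print(parse_model("gpt-4o"))        # → ('gpt-4o', 'balanced')
```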
No SDK sprawl
One format for every provider. Format translation, tool schemas, and vision payloads handled automatically.
No maintenance
Provider APIs change. Models deprecate. Cailos absorbs every breaking change so your integration doesn't.
No lock-in
Switch providers in seconds. Circuit breakers auto-failover when a provider goes down. Your on-call never wakes up.
Live evals
Every endpoint is evaluated on intelligence, tool calling, and vision. Routing always reflects the current model landscape.
Every request is processed by llmshield before it reaches any provider. Sensitive data (emails, names, phone numbers, addresses, IDs) is detected and replaced with placeholder tokens before transit, then reconstructed in the response on its way back. Upstream providers see cloaked tokens, never raw user data.
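A minimal sketch of the cloak/uncloak round trip. llmshield's real detection covers names, phone numbers, addresses, and IDs; this illustration handles only emails, and the token format is an assumption:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def cloak(text: str):
    # Replace each match with a placeholder token; remember the mapping.
    mapping = {}
    def repl(m):
        token = f"<PII_{len(mapping)}>"
        mapping[token] = m.group(0)
        return token
    return EMAIL.sub(repl, text), mapping

def uncloak(text: str, mapping):
    # Restore original values in the provider's response on the way back.
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text

cloaked, m = cloak("Contact jane@example.com now")
print(cloaked)  # → Contact <PII_0> now  (what the provider sees)
```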
Cailos itself does not log the text of your prompts. We persist metadata only: which endpoint was picked, latency, token counts, cost, and the routing decision trace. The raw prompt passes through llmshield to the provider and back, then is discarded. The only derived signal we retain is a small set of decoupled keyphrases used to improve routing over time. All processing runs on our servers; no data reaches any third party other than the provider itself.
Every endpoint in our registry runs through an internal eval suite covering intelligence, tool calling, structured output, and vision before it's eligible for routing. Models that score below threshold for a task class get filtered automatically when prompts of that class arrive. Eval scores are continuously updated as providers ship new model versions, and the full per-endpoint breakdown is visible in your dashboard.
Routing decisions take under 10ms end to end. The first-token latency you experience is bounded by the underlying provider, not Cailos. In most cases the optimal endpoint Cailos picks is faster than what you'd hit by defaulting to a single provider, which more than offsets any routing overhead.
Every request is given a ranked list of fallback endpoints. If the first choice errors, times out, or hits a rate limit, Cailos retries against the next-ranked endpoint in under 50ms. After 3 consecutive failures, a circuit breaker trips and traffic shifts away from the failing provider until probes confirm it has recovered. Your users never see provider outages.
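The trip-and-recover behavior above can be sketched as a per-provider breaker: three consecutive failures open it, and after a cooldown a probe request is allowed through. The threshold matches the text; the cooldown value is an assumption:

```python
import time

class CircuitBreaker:
    def __init__(self, threshold=3, cooldown_s=30.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None   # None = closed, traffic flows

    def allow(self, now=None):
        if self.opened_at is None:
            return True
        now = time.monotonic() if now is None else now
        # Open: block traffic until cooldown elapses, then permit a probe.
        return now - self.opened_at >= self.cooldown_s

    def record(self, success: bool, now=None):
        if success:
            self.failures, self.opened_at = 0, None      # probe passed: close
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                now = time.monotonic() if now is None else now
                self.opened_at = now                     # trip: shift traffic
```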
Yes. Pass model="gpt-5" instead of model="auto" and Cailos becomes a transparent gateway. Specific-model requests still benefit from the unified API, automatic failover across providers, llmshield protection, and the same observability dashboard. You only get routing when you ask for it.
cailos.com/v1
Drop-in replacement for any OpenAI SDK.
Change two lines. Access 102 endpoints from 18 providers.