Price-performance arbitrage for LLM inference.
| TIME | STRATEGY | MODEL |
|---|---|---|
Simulated routing decisions. Request content is illustrative only.
Every major model. Every major provider.
We solve model selection with four signals: evals, an ML classifier, RLHF feedback, and a 15-stage ranking pipeline that blends them on every request.
Continuous evals
Every endpoint is graded against an internal benchmark suite covering intelligence, tool calling, structured output, and vision before it's eligible for routing. Models that score poorly on a task class get filtered out when prompts of that class arrive.
Intent classifier
An ML model reads each prompt and predicts the right strategy axis: speed, quality, cost, or balanced. Inference runs in milliseconds and adapts to your traffic over time.
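A toy sketch of the classifier's contract, a prompt in, one of the four strategy axes out. The real system is a trained ML model; this keyword heuristic is purely illustrative and the keywords are assumptions:

```python
def classify_intent(prompt: str) -> str:
    # Hypothetical stand-in for the ML classifier: map a prompt to a
    # strategy axis. Real inference runs in milliseconds per request.
    p = prompt.lower()
    if any(k in p for k in ("quick", "summarize", "tl;dr")):
        return "speed"
    if any(k in p for k in ("prove", "analyze", "debug")):
        return "quality"
    if len(p) < 80:          # short, simple prompts: optimize for cost
        return "cost"
    return "balanced"

print(classify_intent("Debug this stack trace"))  # → quality
```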
RLHF feedback
Every thumbs up or down on a response sharpens our picture of which endpoints excel at which task classes, like SQL, debugging, or creative writing. The router learns the matching automatically.
15-stage pipeline
A multi-stage ranking pipeline blends every signal in real time (evals, reliability, user satisfaction, cost, latency) and picks the cheapest endpoint that meets the constraints. End to end in under 10ms.
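The selection step can be sketched as a filter-then-rank pass. Field names, blend weights, and thresholds below are illustrative assumptions, not Cailos internals:

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    eval_score: float     # 0-1, task-class benchmark score
    reliability: float    # 0-1, rolling success rate
    satisfaction: float   # 0-1, RLHF thumbs-up rate
    latency_ms: float
    cost_per_mtok: float  # cost per million tokens

def route(endpoints, min_quality=0.7, min_reliability=0.95, max_latency_ms=2000):
    # Blend eval and satisfaction into one quality score, enforce hard
    # constraints, then pick the cheapest endpoint that clears them.
    eligible = [e for e in endpoints
                if 0.7 * e.eval_score + 0.3 * e.satisfaction >= min_quality
                and e.reliability >= min_reliability
                and e.latency_ms <= max_latency_ms]
    return min(eligible, key=lambda e: e.cost_per_mtok, default=None)

fleet = [Endpoint("a", 0.9, 0.99, 0.8, 400, 2.5),
         Endpoint("b", 0.75, 0.98, 0.7, 600, 0.6),
         Endpoint("c", 0.6, 0.99, 0.9, 300, 0.2)]
print(route(fleet).name)  # → b  (c fails the quality bar; b is cheapest left)
```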
Drop Cailos into any OpenAI-compatible agent framework. The SDK doesn't change. Just the endpoint. Model selection becomes automatic.
from agents import Agent
triage = Agent(
name="triage",
model="gpt-4o-mini",
instructions="Classify: billing, technical, or escalate.",
)
resolver = Agent(
name="resolver",
model="gpt-4o",
instructions="Draft resolution from KB.",
tools=[search_kb, lookup_customer],
)
from agents import Agent, set_default_openai_client
from openai import AsyncOpenAI
cailos = AsyncOpenAI(base_url="https://cailos.com/v1", api_key="cai_...")
set_default_openai_client(cailos)  # route every Agent call through Cailos
triage = Agent(
name="triage",
model="auto", # fastest cheap model
instructions="Classify: billing, technical, or escalate.",
)
resolver = Agent(
name="resolver",
model="auto", # best tool-calling model
instructions="Draft resolution from KB.",
tools=[search_kb, lookup_customer],
)
Infrastructure is fragile
Every provider has outages, rate limits, and degraded performance windows. Building on a single provider means inheriting their worst day as yours. Cailos treats failure as a routing event, not an incident.
Provider fails
OpenAI returns 503. Anthropic hits a rate limit. Google times out. Every provider has bad minutes.
Circuit breakers trip
Persistent failures trigger circuit breakers that isolate the problem. Traffic shifts to healthy endpoints automatically.
Your request lands
Every request tries ranked endpoints in sequence. If the first fails, the second picks up. Your users never see the failure.
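The failover sequence above reduces to a loop over the ranked list: first success wins, any failure falls through to the next endpoint. The `call_endpoint` helper and endpoint names are hypothetical:

```python
def complete_with_failover(request, ranked_endpoints, call_endpoint):
    # Try endpoints in ranked order; a 503, rate limit, or timeout
    # surfaces as an exception and we fall through to the next one.
    errors = []
    for endpoint in ranked_endpoints:
        try:
            return call_endpoint(endpoint, request)
        except Exception as exc:
            errors.append((endpoint, exc))
    raise RuntimeError(f"all endpoints failed: {errors}")
```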
Strong systems require strong infrastructure. Single-provider dependencies are a design flaw, not a tradeoff.
Three steps. Two lines of code. Every model, one endpoint.
You send a request
Standard OpenAI format. Set model="auto" or name a specific model. Add optional routing hints.
Cailos evaluates 102 endpoints
Filters by capability, trust level, and budget. Ranks by your strategy: quality, speed, or cost. Selects the best match.
Best model responds
PII is cloaked before transit. If the provider fails, the next-best endpoint picks up automatically. Your request always lands.
from openai import OpenAI
client = OpenAI(
base_url="https://cailos.com/v1",
api_key="cai_...",
)
response = client.chat.completions.create(
model="auto:quality",
messages=[{"role": "user", "content": "..."}],
)
Standard OpenAI SDK. Change base_url and api_key. Append a strategy to any model: auto:quality, gpt-4o:cost, claude-sonnet:speed, or just auto.
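The model string splits cleanly on the first colon; a one-line sketch of the parse (not the server's actual parser):

```python
def parse_model(model: str, default_strategy: str = "balanced"):
    # "gpt-4o:cost" → ("gpt-4o", "cost"); bare "auto" falls back to the default.
    name, _, strategy = model.partition(":")
    return name, strategy or default_strategy

print(parse_model("auto:quality"))  # → ('auto', 'quality')
print(parse_model("gpt-4o"))        # → ('gpt-4o', 'balanced')
```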
No SDK sprawl
One format for every provider. Format translation, tool schemas, and vision payloads handled automatically.
No maintenance
Provider APIs change. Models deprecate. Cailos absorbs every breaking change so your integration doesn't.
No lock-in
Switch providers in seconds. Circuit breakers auto-failover when a provider goes down. Your on-call never wakes up.
Live evals
Every endpoint is evaluated on intelligence, tool calling, and vision. Routing always reflects the current model landscape.
Every request is processed by llmshield before it reaches any provider. Sensitive data (emails, names, phone numbers, addresses, IDs) is detected and replaced with placeholder tokens before transit, then reconstructed in the response on its way back. Upstream providers see cloaked tokens, never raw user data.
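A minimal sketch of the cloak/uncloak round trip. llmshield's real detection covers names, phone numbers, addresses, and IDs; this illustration handles only emails, and the token format is an assumption:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def cloak(text: str):
    # Replace each match with a placeholder token; remember the mapping.
    mapping = {}
    def repl(m):
        token = f"<PII_{len(mapping)}>"
        mapping[token] = m.group(0)
        return token
    return EMAIL.sub(repl, text), mapping

def uncloak(text: str, mapping):
    # Restore original values in the provider's response on the way back.
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text

cloaked, m = cloak("Contact jane@example.com now")
print(cloaked)  # → Contact <PII_0> now  (what the provider sees)
```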
Cailos itself does not log the text of your prompts. We persist metadata only: which endpoint was picked, latency, token counts, cost, and the routing decision trace. The raw prompt passes through llmshield to the provider and back, then is discarded. The only derived signal we retain is a small set of decoupled keyphrases used to improve routing over time. All processing runs on our servers; no data reaches any third party other than the provider itself.
Every endpoint in our registry runs through an internal eval suite covering intelligence, tool calling, structured output, and vision before it's eligible for routing. Models that score below threshold for a task class get filtered automatically when prompts of that class arrive. Eval scores are continuously updated as providers ship new model versions, and the full per-endpoint breakdown is visible in your dashboard.
Routing decisions take under 10ms end to end. The first-token latency you experience is bounded by the underlying provider, not Cailos. In most cases the optimal endpoint Cailos picks is faster than what you'd hit by defaulting to a single provider, which more than offsets any routing overhead.
Every request is given a ranked list of fallback endpoints. If the first choice errors, times out, or hits a rate limit, Cailos retries against the next-ranked endpoint in under 50ms. After 3 consecutive failures, a circuit breaker trips and traffic shifts away from the failing provider until probes confirm it has recovered. Your users never see provider outages.
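The trip-and-recover behavior above can be sketched as a per-provider breaker: three consecutive failures open it, and after a cooldown a probe request is allowed through. The threshold matches the text; the cooldown value is an assumption:

```python
import time

class CircuitBreaker:
    def __init__(self, threshold=3, cooldown_s=30.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None   # None = closed, traffic flows

    def allow(self, now=None):
        if self.opened_at is None:
            return True
        now = time.monotonic() if now is None else now
        # Open: block traffic until cooldown elapses, then permit a probe.
        return now - self.opened_at >= self.cooldown_s

    def record(self, success: bool, now=None):
        if success:
            self.failures, self.opened_at = 0, None      # probe passed: close
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                now = time.monotonic() if now is None else now
                self.opened_at = now                     # trip: shift traffic
```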
Yes. Pass model="gpt-5" instead of model="auto" and Cailos becomes a transparent gateway. Specific-model requests still benefit from the unified API, automatic failover across providers, llmshield protection, and the same observability dashboard. You only get routing when you ask for it.
cailos.com/v1
Drop-in replacement for any OpenAI SDK.
Change two lines. Access 102 endpoints from 18 providers.