Opinion
Five things flat-rate, single-provider LLM inference gets wrong — and the routing decisions that fix each.
01
A backend extracts a date from a string. Routed to a flagship reasoning model at flat-rate pricing, each call costs about one cent. At ten million calls a day that's $100,000/day for a task a small model handles at a tenth the cost with equal accuracy. The opposite mistake is worse: a strategy-synthesis prompt routed to a fast cheap model returns plausible fluff that misses every constraint.
What cailos does
Ranks endpoints per request by cost, latency, and quality signals. Extractions route to a cheap fast model. Reasoning-heavy calls route to a flagship. Same OpenAI-compatible API, different endpoint per call, arbitraged at request time.
02
Monday, 09:00 UTC. OpenAI returns servers at capacity. Anthropic's rate-limit bucket 429s your agent. Gemini returns two different answers to the same deterministic prompt. Your workflow depends on all three, so it is down on all three. Oncall pages you for an incident that isn't yours to fix.
What cailos does
Every request carries a ranked fallback chain. On error, rate-limit, or timeout, cailos retries the next-ranked endpoint in under 50ms. After three consecutive failures a circuit breaker trips and traffic shifts away until probes confirm recovery. Your provider's outage stops being your outage.
03
Confidential client data routed to provider A. Strategic plans to provider B. Financial projections to provider C. Three retention policies, three sets of subprocessors, three sets of terms of service — all changeable at the provider's discretion. Every model you add widens the attack surface.
What cailos does
Two controls. llmshield cloaks PII — names, emails, addresses, IDs — at the request boundary before any upstream provider sees the payload, and uncloaks on return. Per-endpoint trust levels (0–3) enforce a hard routing filter: a request marked trust_level=3 is never routed to a lower-trust endpoint, regardless of cost or latency wins.
04
GPT-5 for writing. Claude for reasoning. Gemini for search. DeepSeek's new release benchmarked eight points higher last Thursday. Llama 5 ships next week. Your team's morning ritual has become reading leaderboards instead of shipping.
What cailos does
Pass model="auto". An ML classifier reads your prompt, picks the strategy (cost / speed / quality / balanced), and the registry picks the endpoint. Cailos's eval system tracks per-endpoint quality across task types continuously — you don't read the leaderboard because it's already routed into the decision.
05
OpenAI deprecates gpt-4-turbo-preview Friday. Anthropic renames an endpoint. Azure shifts a region. Each provider's breaking change is a ticket on your team's board, an integration test to rewrite, a customer outage if you miss the migration window.
What cailos does
Absorbs the churn. Providers change; the cailos API doesn't. When an upstream endpoint deprecates, routing shifts to the successor automatically. You pin to a codename — the provider underneath is a routing concern, not an integration concern.