How we measure performance and why our approach gives you numbers you can actually trust.
Total time from your API call to the complete response. We report the median (p50) for typical performance and the 95th percentile (p95) for the near-worst case that your slowest 5% of requests hit.
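As a rough illustration, here is one way end-to-end latency can be timed. This is a sketch, not Theo's harness; the endpoint URL, payload shape, and auth header are placeholders.

```python
import json
import time
import urllib.request

def time_request(url: str, payload: dict, api_key: str) -> float:
    """Return wall-clock seconds from request start to the complete response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()  # wait for the full body, not just the status line
    return time.perf_counter() - start
```

The timer stops only after the full body is read, so the number reflects the complete response, not just the first bytes.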
How quickly the first word of a streaming response appears, also known as time to first token. This is the number that determines whether your app feels instant or sluggish to end users.
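A minimal sketch of how time to first token can be measured, assuming the server streams the response body; the first byte of the stream stands in for the first token here.

```python
import time
import urllib.request

def time_to_first_token(req: urllib.request.Request) -> float:
    """Seconds from request start until the first streamed byte arrives."""
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read(1)  # blocks until the first byte of the stream lands
        ttft = time.perf_counter() - start
        resp.read()   # drain the rest of the stream
    return ttft
```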
What you actually pay per request. Theo's intelligent routing sends simple questions to fast, affordable models and reserves premium models for complex tasks, reducing your average cost without sacrificing quality.
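The blended cost is a traffic-weighted average across the models a router chooses. A toy calculation with hypothetical prices and a hypothetical routing mix, not real Theo numbers:

```python
# Hypothetical per-request prices and routing mix, for illustration only.
price = {"fast_model": 0.0004, "premium_model": 0.0120}  # $ per request
mix   = {"fast_model": 0.80,   "premium_model": 0.20}    # share of traffic

blended_cost = sum(price[m] * mix[m] for m in price)
print(f"${blended_cost:.4f} per request")  # $0.0027, vs $0.0120 premium-only
```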
How often Theo's auto-mode selects the correct model for the task. Measured against a labeled dataset of real-world prompts spanning code, creative, research, and mixed-intent queries.
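Routing accuracy reduces to exact-match accuracy over the labeled set. A sketch, where `route` stands in for the router under test and the two-item dataset is purely illustrative:

```python
def routing_accuracy(dataset, route):
    """Fraction of labeled prompts for which the router picks the expected model.

    `dataset` is a list of (prompt, expected_model) pairs; `route` maps a
    prompt to a model name.
    """
    correct = sum(1 for prompt, expected in dataset if route(prompt) == expected)
    return correct / len(dataset)

# Tiny illustrative dataset; real labels come from human annotation.
dataset = [
    ("Fix this Python TypeError", "code_model"),
    ("Write a haiku about autumn", "creative_model"),
]
print(routing_accuracy(dataset, lambda p: "code_model"))  # 0.5
```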
When a provider goes down, Theo automatically switches to a backup. We measure how quickly this happens and whether you experience any interruption. The answer is usually: you don't.
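The failover pattern itself is simple to sketch. This shows the general shape, with `primary` and `backup` as stand-ins for provider clients; it is not Theo's internal implementation.

```python
import time

def call_with_failover(primary, backup, request):
    """Try the primary provider; on failure, fall back and record the delay."""
    start = time.perf_counter()
    try:
        return primary(request), None  # no failover occurred
    except Exception:
        failover_at = time.perf_counter() - start
        return backup(request), failover_at  # seconds lost to the switch
```

The second return value is the measurement we care about: how much time the switch itself cost before the backup took over.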
Most AI benchmarks measure model quality on academic datasets. That's useful, but it doesn't tell you how fast your app will feel or what you'll pay per request. Our benchmarks measure the things that affect your product: latency, cost, and reliability.
A coding task and a quick question have completely different performance profiles. We benchmark each of Theo's six routing modes separately so you know exactly what to expect for your specific use case.
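In practice that means one independent run per mode. A sketch of the harness shape, with hypothetical mode names and a random stand-in where the real timed request (like the latency sketch above) would go:

```python
import random

# Hypothetical mode names; the point is one separate run per mode.
MODES = ["auto", "fast", "code", "creative", "research", "premium"]

def sample_latency(mode: str) -> float:
    """Stand-in for a real timed request against the given routing mode."""
    return random.uniform(0.2, 2.0)

per_mode = {mode: sorted(sample_latency(mode) for _ in range(200))
            for mode in MODES}
```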
Some benchmarks only report averages, which hide the worst cases. We report p50 (the typical experience) and p95 (what your unluckiest 5% of users see). No data points are trimmed or excluded.
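Both numbers come straight from the raw samples. A sketch using Python's standard library (statistics.quantiles requires Python 3.8+):

```python
import statistics

def report(samples):
    """p50 and p95 over the raw samples; nothing trimmed or excluded."""
    p50 = statistics.median(samples)
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99,
    # so index 94 is the 95th percentile.
    p95 = statistics.quantiles(samples, n=100)[94]
    return p50, p95
```

We lead with the median rather than the mean because a handful of slow outliers can drag a mean far away from what a typical request actually feels like.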
Our benchmarks include actual network latency because that's what your users experience. We don't run tests on localhost to make numbers look better.