How we measure performance and why our approach gives you numbers you can actually trust.
Total time from your API call to the complete response. We report the median (p50) for typical performance and the 95th percentile (p95) for the near-worst case that your slowest 5% of requests hit.
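As a rough illustration, here is one way end-to-end latency can be timed. This is a sketch, not Theo's harness; the endpoint URL, payload shape, and auth header are placeholders.

```python
import json
import time
import urllib.request

def time_request(url: str, payload: dict, api_key: str) -> float:
    """Return wall-clock seconds from request start to the complete response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()  # wait for the full body, not just the status line
    return time.perf_counter() - start
```

The timer stops only after the full body is read, so the number reflects the complete response, not just the first bytes.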
How quickly the first word of a streaming response appears, also known as time to first token. This is the number that determines whether your app feels instant or sluggish to end users.
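A minimal sketch of how time to first token can be measured, assuming the server streams the response body; the first byte of the stream stands in for the first token here.

```python
import time
import urllib.request

def time_to_first_token(req: urllib.request.Request) -> float:
    """Seconds from request start until the first streamed byte arrives."""
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read(1)  # blocks until the first byte of the stream lands
        ttft = time.perf_counter() - start
        resp.read()   # drain the rest of the stream
    return ttft
```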
What you actually pay per request. Theo's intelligent routing sends simple questions to fast, affordable models and reserves premium models for complex tasks, reducing your average cost without sacrificing quality.
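The blended cost is a traffic-weighted average across the models a router chooses. A toy calculation with hypothetical prices and a hypothetical routing mix, not real Theo numbers:

```python
# Hypothetical per-request prices and routing mix, for illustration only.
price = {"fast_model": 0.0004, "premium_model": 0.0120}  # $ per request
mix   = {"fast_model": 0.80,   "premium_model": 0.20}    # share of traffic

blended_cost = sum(price[m] * mix[m] for m in price)
print(f"${blended_cost:.4f} per request")  # $0.0027, vs $0.0120 premium-only
```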
How often Theo's auto-mode selects the correct model for the task. Measured against a labeled dataset of real-world prompts spanning code, creative, research, and mixed-intent queries.
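Routing accuracy reduces to exact-match accuracy over the labeled set. A sketch, where `route` stands in for the router under test and the two-item dataset is purely illustrative:

```python
def routing_accuracy(dataset, route):
    """Fraction of labeled prompts for which the router picks the expected model.

    `dataset` is a list of (prompt, expected_model) pairs; `route` maps a
    prompt to a model name.
    """
    correct = sum(1 for prompt, expected in dataset if route(prompt) == expected)
    return correct / len(dataset)

# Tiny illustrative dataset; real labels come from human annotation.
dataset = [
    ("Fix this Python TypeError", "code_model"),
    ("Write a haiku about autumn", "creative_model"),
]
print(routing_accuracy(dataset, lambda p: "code_model"))  # 0.5
```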
When a provider goes down, Theo automatically switches to a backup. We measure how quickly this happens and whether you experience any interruption. The answer is usually: you don't.
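The failover pattern itself is simple to sketch. This shows the general shape, with `primary` and `backup` as stand-ins for provider clients; it is not Theo's internal implementation.

```python
import time

def call_with_failover(primary, backup, request):
    """Try the primary provider; on failure, fall back and record the delay."""
    start = time.perf_counter()
    try:
        return primary(request), None  # no failover occurred
    except Exception:
        failover_at = time.perf_counter() - start
        return backup(request), failover_at  # seconds lost to the switch
```

The second return value is the measurement we care about: how much time the switch itself cost before the backup took over.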
Most AI benchmarks measure model quality on academic datasets. That's useful, but it doesn't tell you how fast your app will feel or what you'll pay per request. Our benchmarks measure the things that affect your product: latency, cost, and reliability.
A coding task and a quick question have completely different performance profiles. We benchmark each of Theo's six routing modes separately so you know exactly what to expect for your specific use case.
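In practice that means one independent run per mode. A sketch of the harness shape, with hypothetical mode names and a random stand-in where the real timed request (like the latency sketch above) would go:

```python
import random

# Hypothetical mode names; the point is one separate run per mode.
MODES = ["auto", "fast", "code", "creative", "research", "premium"]

def sample_latency(mode: str) -> float:
    """Stand-in for a real timed request against the given routing mode."""
    return random.uniform(0.2, 2.0)

per_mode = {mode: sorted(sample_latency(mode) for _ in range(200))
            for mode in MODES}
```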
Some benchmarks only report averages, which hide the worst cases. We report p50 (the typical experience) and p95 (what your unluckiest 5% of users see). No data points are trimmed or excluded.
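Both numbers come straight from the raw samples. A sketch using Python's standard library (statistics.quantiles requires Python 3.8+):

```python
import statistics

def report(samples):
    """p50 and p95 over the raw samples; nothing trimmed or excluded."""
    p50 = statistics.median(samples)
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99,
    # so index 94 is the 95th percentile.
    p95 = statistics.quantiles(samples, n=100)[94]
    return p50, p95
```

We lead with the median rather than the mean because a handful of slow outliers can drag a mean far away from what a typical request actually feels like.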
Our benchmarks include actual network latency because that's what your users experience. We don't run tests on localhost to make numbers look better.