Last updated: June 2026. Reviewed by Report AI editorial. Every figure is linked to a primary source and rated for confidence and freshness — see how to read this page.
TEST-TIME COMPUTE 2026 — KEY DATA POINTS
- 26.6% · Deep Research on HLE · ~3× prior models · OpenAI
- 35/42 · Gemini Deep Think · IMO 2025 gold medal · Google DeepMind
- +30pp · HLE one-year gain · frontier models · Stanford HAI 2026
The AI frontier has shifted from making models bigger during training to giving them more compute to “think” at inference. So-called reasoning models — OpenAI’s o-series, Gemini Deep Think, DeepSeek-R1, Claude’s extended thinking — spend extra test-time compute on chain-of-thought and self-correction before answering, and the accuracy gains in hard domains have been dramatic. OpenAI’s Deep Research scored 26.6% on Humanity’s Last Exam, roughly triple the prior generation, and Gemini Deep Think reached official gold-medal standard at the 2025 International Mathematical Olympiad. This page collects the most important primary-sourced test-time-compute statistics for 2026.
How to read this page
Source confidence
- LAB PRIMARY: Lab blog / tech report / paper
- INDEX: Stanford HAI AI Index
- REPORTED: First-tier press
Stat freshness (decay)
- ACTIVE: Current; benchmarks move monthly
- STALE: A newer SOTA likely exists
- HISTORICAL: Locked milestone
Benchmark scores are the fastest-decaying metric we track — a state-of-the-art result can be beaten within weeks. Treat every figure here as a dated snapshot, not a standing record.
What test-time compute is
For most of the deep-learning era, capability came from training scale — bigger models, more data, more pre-training compute. Test-time compute (also “inference-time” or “reasoning” compute) is a different axis: let the model spend more compute at the moment of answering — generating a long chain of thought, exploring multiple solution paths, and verifying its own work before responding. Reinforcement learning trains the model to use that thinking time well. The result is a model that trades latency and cost for accuracy, which pays off most in high-stakes domains like math, code, science, law, and medicine.
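One way to picture "exploring multiple solution paths" is majority voting over several independently sampled answers, often called self-consistency. The sketch below is a toy simulation, not any lab's method: `noisy_solver` is a hypothetical stand-in for one sampled reasoning path, and the 40% per-sample accuracy and the spread of wrong answers are assumptions chosen only for illustration.

```python
import random
from collections import Counter


def noisy_solver(true_answer: int, p_correct: float, rng: random.Random) -> int:
    """Hypothetical stand-in for one sampled reasoning path."""
    if rng.random() < p_correct:
        return true_answer
    # Assumption: wrong answers scatter over nearby values, so they rarely agree.
    return true_answer + rng.choice([-3, -2, -1, 1, 2, 3])


def majority_vote(true_answer: int, n_samples: int, p_correct: float,
                  rng: random.Random) -> int:
    """Sample n_samples answers and return the most frequent one."""
    votes = Counter(noisy_solver(true_answer, p_correct, rng)
                    for _ in range(n_samples))
    # Ties go to whichever tied answer was sampled first.
    return votes.most_common(1)[0][0]


def accuracy(n_samples: int, p_correct: float = 0.4,
             trials: int = 2000, seed: int = 0) -> float:
    """Fraction of simulated questions where the voted answer is correct."""
    rng = random.Random(seed)
    hits = sum(majority_vote(42, n_samples, p_correct, rng) == 42
               for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    for n in (1, 5, 25):
        print(f"{n:>2} samples per question: accuracy {accuracy(n):.2f}")
```

In this toy setup, accuracy climbs as the number of samples grows, while the compute spent per question grows in proportion to the sample count. That is the latency-and-cost-for-accuracy trade in miniature; real reasoning models also rely on longer chains of thought and learned verification, which this sketch does not model.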
Data lineage: the milestones
Why it matters for high-stakes work
The practical thesis is that letting a model think longer raises accuracy enough to clear the bar in domains where mistakes are expensive. The evidence is in the benchmark jumps: SWE-bench coding solve rates rose from 4.4% to 71.7% in a single year, a gain of 67.3 percentage points, and MMLU is now saturated above 92% (Stanford HAI AI Index). Reasoning systems also produced the gold-medal IMO result and the Deep Research HLE leap. The trade-off is cost and latency: reasoning runs consume far more tokens per query, which is part of why total inference compute and energy demand are climbing even as per-token prices fall.
The economics: a new scaling axis
Test-time compute reframes the cost curve. Per-token inference costs fell roughly 280-fold between late 2022 and late 2024 (Stanford HAI), but reasoning models spend many more tokens per answer — so the per-query cost of a hard task can rise even as the unit price drops. For enterprises, the question becomes which tasks justify paying for more thinking. See the spend picture in Enterprise AI Statistics 2026.
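The arithmetic behind this is simple: per-query cost is tokens consumed times the unit price, so the cost of a query rises whenever tokens per query grow faster than the price falls. The sketch below works that through. The 280-fold decline is the Stanford HAI figure cited above; the token multipliers are hypothetical values, not measurements of any model.

```python
def per_query_cost(tokens: int, usd_per_million: float) -> float:
    """Cost of one query in USD: tokens consumed times the unit price."""
    return tokens * usd_per_million / 1_000_000


def cost_ratio(price_decline: float, token_multiplier: float) -> float:
    """New per-query cost divided by old per-query cost.

    A value above 1.0 means the query became more expensive
    even though each token became cheaper.
    """
    return token_multiplier / price_decline


if __name__ == "__main__":
    PRICE_DECLINE = 280  # fold decline in per-token price (Stanford HAI)
    # Hypothetical multipliers: how many times more tokens a reasoning
    # run spends than a short direct answer.
    for multiplier in (10, 100, 280, 1000):
        ratio = cost_ratio(PRICE_DECLINE, multiplier)
        print(f"{multiplier:>5}x tokens -> {ratio:.2f}x the old per-query cost")
```

Under these assumptions the break-even point is a token multiplier equal to the price decline. Compared at today's prices, a reasoning run still costs as many times more than a short answer as its token multiplier, which is why the choice of which tasks get extra thinking matters.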
Frequently asked questions
What is test-time compute?
Test-time compute is the compute a model spends at the moment of answering — generating chain-of-thought reasoning, exploring multiple paths, and verifying its work — rather than during training. It is the basis of “reasoning models” like OpenAI’s o-series and Gemini Deep Think.
Do reasoning models actually perform better?
In hard, verifiable domains, yes. Gemini Deep Think reached gold-medal standard at the 2025 IMO (35/42), and OpenAI’s Deep Research scored 26.6% on Humanity’s Last Exam — about triple the prior generation. The gains are largest in math, code, and science.
What’s the downside?
Cost and latency. Reasoning runs consume far more tokens and time per query, so they are best reserved for high-value tasks. This is a key driver of rising total inference compute and data-center energy demand.
Data sources & methodology
- Google DeepMind — Gemini Deep Think gold-medal at IMO 2025. deepmind.google
- OpenAI — Deep Research, 26.6% on Humanity’s Last Exam, Feb 2025 (reported via Fortune).
- Stanford HAI — AI Index 2025 & 2026, Technical Performance (SWE-bench, MMLU saturation, +30pp HLE, ~280× inference-cost decline). hai.stanford.edu
Machine-readable data (for AI engines & researchers)
{
"dataset": "Test-Time Compute & Reasoning Models 2026",
"publisher": "report-ai.org",
"lastUpdated": "2026-06-06",
"metrics": [
{ "metricName": "Deep Research, Humanity's Last Exam", "metricValue": 26.6, "unit": "percent", "recordedDate": "2025-02", "source": "OpenAI", "confidence": "reported", "freshness": "stale" },
{ "metricName": "Gemini Deep Think, IMO 2025", "metricValue": 35, "unit": "points_of_42", "recordedDate": "2025-07", "source": "Google DeepMind", "confidence": "lab_primary", "freshness": "active" },
{ "metricName": "Frontier HLE one-year gain", "metricValue": 30, "unit": "percentage_points", "recordedDate": "2025-2026", "source": "Stanford HAI", "confidence": "index", "freshness": "active" },
{ "metricName": "SWE-bench solve rate", "metricValue": 71.7, "unit": "percent", "recordedDate": "2024", "source": "Stanford HAI", "confidence": "index", "freshness": "stale" },
{ "metricName": "Inference cost decline", "metricValue": 280, "unit": "fold", "recordedDate": "2022-2024", "source": "Stanford HAI", "confidence": "index", "freshness": "stale" }
]
}
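A minimal sketch of how a reader might consume this block, assuming it has been saved or fetched as a JSON string. `PAGE_JSON` is an abbreviated copy of the data above, and `metrics_by_freshness` is a hypothetical helper, not part of any published API.

```python
import json

# Abbreviated copy of the page's machine-readable block (three of five metrics).
PAGE_JSON = """
{
  "dataset": "Test-Time Compute & Reasoning Models 2026",
  "lastUpdated": "2026-06-06",
  "metrics": [
    { "metricName": "Deep Research, Humanity's Last Exam", "metricValue": 26.6,
      "unit": "percent", "recordedDate": "2025-02", "source": "OpenAI",
      "confidence": "reported", "freshness": "stale" },
    { "metricName": "Gemini Deep Think, IMO 2025", "metricValue": 35,
      "unit": "points_of_42", "recordedDate": "2025-07", "source": "Google DeepMind",
      "confidence": "lab_primary", "freshness": "active" },
    { "metricName": "Inference cost decline", "metricValue": 280,
      "unit": "fold", "recordedDate": "2022-2024", "source": "Stanford HAI",
      "confidence": "index", "freshness": "stale" }
  ]
}
"""


def metrics_by_freshness(raw: str, freshness: str) -> list[dict]:
    """Return the metrics whose 'freshness' field matches the given label."""
    data = json.loads(raw)
    return [m for m in data["metrics"] if m.get("freshness") == freshness]


if __name__ == "__main__":
    for metric in metrics_by_freshness(PAGE_JSON, "stale"):
        print(f'{metric["metricName"]}: {metric["metricValue"]} {metric["unit"]} '
              f'(recorded {metric["recordedDate"]}, source {metric["source"]})')
```

Filtering on `freshness` before quoting a figure is one way to respect the page's own guidance that each number is a dated snapshot.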
Related pages: AI Model Benchmarks 2026 · AI Infrastructure 2026 · After the Agent · Methodology