Last updated: June 2026. Reviewed by Report AI editorial. Every figure on this page is linked to its primary source.
MODELS & BENCHMARKS 2026 — KEY DATA POINTS
- 71.7%: SWE-bench solve rate in 2024, up from 4.4% in 2023 (Stanford HAI)
- 92%+: frontier MMLU score; the benchmark is now saturated (Stanford HAI)
- 2.7%: U.S. lead over China, top model, March 2026 (Stanford HAI)
One-year benchmark gains, 2023→2024 (Stanford HAI)
AI model capability is improving faster than benchmarks can keep up. On SWE-bench, models went from solving 4.4% of coding tasks in 2023 to 71.7% in 2024 (Stanford HAI 2025). MMLU is now effectively saturated, with frontier scores above 92%, and the gap between the top U.S. and Chinese models has stayed in the single digits all year (Stanford HAI 2026). This page collects the most important primary-sourced AI model and benchmark statistics for 2026.
Executive summary
- Fast Fact: In a single year (2023–2024), scores rose 18.8, 48.9, and 67.3 percentage points on MMMU, GPQA, and SWE-bench respectively.
- Saturation: MMLU launched in 2020 near 32% frontier accuracy; by Q1 2026 every frontier system scores above 92%, near the benchmark’s ~95% ceiling.
- The Race: U.S. and Chinese models have repeatedly traded the top spot since early 2025; as of March 2026 the top U.S. model leads by just 2.7%.
- Open vs. closed: The closed-weight lead on Chatbot Arena narrowed from 8.0% (Jan 2024) to 1.7% (Feb 2025).
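One rough way to read the saturation bullet is as the share of the distance between the 2020 starting point and the ceiling that frontier models have already covered. The sketch below uses only the figures quoted above (about 32% at launch, 92% as a lower bound today, roughly 95% as the ceiling). `headroom_used` is a hypothetical helper for illustration, not a metric used by Stanford HAI.

```python
def headroom_used(start_pct: float, current_pct: float, ceiling_pct: float) -> float:
    """Fraction of the start-to-ceiling range already covered (0.0 to 1.0)."""
    if ceiling_pct <= start_pct:
        raise ValueError("ceiling must be above the starting score")
    return (current_pct - start_pct) / (ceiling_pct - start_pct)


# Figures as quoted on this page; 92.0 is a lower bound ("above 92%"),
# and 95.0 is the approximate ceiling, so both results are approximate.
MMLU_START, MMLU_NOW, MMLU_CEILING = 32.0, 92.0, 95.0

used = headroom_used(MMLU_START, MMLU_NOW, MMLU_CEILING)
remaining_pp = MMLU_CEILING - MMLU_NOW

print(f"Range covered: at least {used * 100:.1f}%")    # at least 95.2%
print(f"Headroom left: at most {remaining_pp:.1f} pp")  # at most 3.0 pp
```

With at most about 3 percentage points left to gain, differences between frontier models on MMLU are too small to rank them reliably, which is the practical meaning of "saturated" here.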
Benchmark progress, 2023–2024
| Benchmark | 2023 | 2024 | Gain |
|---|---|---|---|
| SWE-bench (coding) | 4.4% | 71.7% | +67.3 pp |
| GPQA (graduate-level QA) | — | — | +48.9 pp |
| MMMU (multimodal) | — | — | +18.8 pp |
Source: Stanford HAI, AI Index 2025 (one-year gains, 2023–2024).
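The Gain column is an absolute difference in percentage points (pp), not a relative change, and the two read very differently for SWE-bench. A minimal Python sketch of the arithmetic, using only the SWE-bench figures from the table (the function names are illustrative, not from the AI Index):

```python
def point_gain(before_pct: float, after_pct: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal."""
    return round(after_pct - before_pct, 1)


def relative_multiple(before_pct: float, after_pct: float) -> float:
    """How many times larger the new score is than the old one."""
    if before_pct <= 0:
        raise ValueError("relative change is undefined for a non-positive baseline")
    return after_pct / before_pct


# SWE-bench solve rates as reported in the table above.
swe_2023, swe_2024 = 4.4, 71.7

print(point_gain(swe_2023, swe_2024))                   # 67.3
print(round(relative_multiple(swe_2023, swe_2024), 1))  # 16.3
```

The same one-year change is therefore +67.3 pp in absolute terms and roughly a 16-fold increase in relative terms. The GPQA and MMMU rows give only the point gain, so the relative multiple cannot be computed for them from this table.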
Deep dive: why benchmarks keep changing
As classic benchmarks saturate, frontier labs have moved to harder tests. The benchmarks model cards actually report in 2026 include Humanity’s Last Exam (HLE), FrontierMath, ARC-AGI-2, GPQA Diamond, SWE-bench Verified, AIME 2025, and τ-bench. Even these are falling quickly — frontier models gained roughly 30 percentage points on HLE in a single year, a test explicitly designed to be hard for AI (Stanford HAI 2026).
The competitive landscape
The frontier is close and contested. In February 2025, China’s DeepSeek-R1 briefly matched the top U.S. model; since then the lead has changed hands repeatedly while staying in the single digits. As of March 2026 the top U.S. model leads by 2.7%. The open-vs-closed gap has also compressed — from 8.0% in January 2024 to 1.7% by February 2025 on the Chatbot Arena Leaderboard (Stanford HAI 2026).
| Metric | Figure | Source |
|---|---|---|
| Top US lead over China (Mar 2026) | 2.7% | Stanford HAI 2026 |
| Closed–open gap (Jan 2024) | 8.0% | Stanford HAI 2026 |
| Closed–open gap (Feb 2025) | 1.7% | Stanford HAI 2026 |
| HLE one-year gain | +30 pp | Stanford HAI 2026 |
| Inference cost decline | ~280× (late 2022–late 2024) | Stanford HAI 2025 |
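The ~280× inference-cost figure spans about two years, so it can help to restate it as an equivalent yearly factor. The sketch below assumes the "late 2022 to late 2024" window is exactly 24 months, which is an approximation because the table gives no exact dates. It also assumes a constant rate of decline, which the source does not claim. `annualized_factor` is a hypothetical helper name.

```python
def annualized_factor(total_factor: float, months: float) -> float:
    """Constant yearly factor that compounds to total_factor over `months`."""
    if total_factor <= 0 or months <= 0:
        raise ValueError("total_factor and months must be positive")
    return total_factor ** (12.0 / months)


TOTAL_DECLINE = 280.0  # ~280x, as listed in the table above
WINDOW_MONTHS = 24.0   # assumption: late 2022 to late 2024 is about 24 months

per_year = annualized_factor(TOTAL_DECLINE, WINDOW_MONTHS)
print(f"Equivalent decline: about {per_year:.1f}x per year")  # about 16.7x per year
```

Under these assumptions, a 280× drop over two years corresponds to costs falling by roughly 17× each year.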
Frequently asked questions
What is the best AI benchmark in 2026?
There is no single benchmark. As MMLU and similar tests saturate above 92%, frontier labs now report on harder ones — Humanity’s Last Exam (HLE), FrontierMath, ARC-AGI-2, GPQA Diamond, and SWE-bench Verified, among others.
How fast is AI improving on benchmarks?
Very fast. In 2023–2024 alone, SWE-bench jumped from 4.4% to 71.7% (+67.3 pp), and frontier models have since gained about 30 percentage points in a year on Humanity’s Last Exam.
Are US or Chinese AI models better?
They are very close. The lead has changed hands repeatedly since early 2025; as of March 2026 the top U.S. model leads China’s best by about 2.7%, per Stanford HAI.
Data sources & methodology
- Stanford HAI, AI Index Report 2025 (Technical Performance).
Verified data points: SWE-bench 4.4% (2023) → 71.7% (2024); one-year gains of +18.8 (MMMU), +48.9 (GPQA), +67.3 (SWE-bench) pp; ~280× inference-cost decline.
Source: hai.stanford.edu/ai-index/2025-ai-index-report
- Stanford HAI, AI Index Report 2026 (Technical Performance).
Verified data points: MMLU saturated >92% (Q1 2026); top US model leads China by 2.7% (Mar 2026); DeepSeek-R1 matched top US (Feb 2025); closed–open gap 8.0% (Jan 2024) → 1.7% (Feb 2025); +30 pp on HLE in one year.
Source: hai.stanford.edu/ai-index/2026-ai-index-report