AI Model Benchmarks 2026: SWE-bench, MMLU & Frontier Models

Name: AI Models & Benchmarks Statistics 2026
Creator: Report AI
Published: 2026-06-05

Last updated: June 2026. Reviewed by Report AI editorial. Every figure on this page is linked to its primary source.

MODELS & BENCHMARKS 2026 — KEY DATA POINTS

71.7%: SWE-bench solve rate (2024), up from 4.4% in 2023 (Stanford HAI)
92%+: Frontier MMLU score; benchmark now saturated (Stanford HAI)
2.7%: U.S. lead over China, top model, March 2026 (Stanford HAI)

One-year benchmark gains, 2023→2024 (Stanford HAI)

AI model capability is improving faster than benchmarks can keep up. On SWE-bench, models went from solving 4.4% of coding tasks in 2023 to 71.7% in 2024 (Stanford HAI 2025). MMLU is now effectively saturated above 92%, and the gap between the top U.S. and Chinese models has stayed in the single digits all year (Stanford HAI 2026). This page collects the most important primary-sourced AI model and benchmark statistics for 2026.

Executive summary

Fast Fact: In a single year (2023–2024), scores rose 18.8, 48.9, and 67.3 percentage points on MMMU, GPQA, and SWE-bench respectively.
Saturation: MMLU launched in 2020 near 32% frontier accuracy; by Q1 2026 every frontier system scores above 92%, near the benchmark’s ~95% ceiling.
The Race: U.S. and Chinese models have repeatedly traded the top spot since early 2025; as of March 2026 the top U.S. model leads by just 2.7%.
Open vs. closed: The closed-weight lead on Chatbot Arena narrowed from 8.0% (Jan 2024) to 1.7% (Feb 2025).
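
One way to make the saturation point concrete is to ask how much of MMLU's usable range has already been covered. The sketch below is illustrative only: it takes the roughly 32% starting point, the 92% frontier floor, and the approximate 95% ceiling quoted above as given, and `range_consumed` is a hypothetical helper name, not part of any benchmark toolkit.

```python
def range_consumed(start: float, current: float, ceiling: float) -> float:
    """Fraction of the start-to-ceiling range already covered by `current`."""
    if not start < ceiling:
        raise ValueError("ceiling must be above start")
    return (current - start) / (ceiling - start)


# Figures quoted in the summary; the ceiling is approximate (assumption).
MMLU_START = 32.0    # frontier accuracy near launch (2020)
MMLU_CURRENT = 92.0  # lower bound for frontier systems, Q1 2026
MMLU_CEILING = 95.0  # approximate practical ceiling

consumed = range_consumed(MMLU_START, MMLU_CURRENT, MMLU_CEILING)
headroom_pp = MMLU_CEILING - MMLU_CURRENT

print(f"Range consumed: {consumed:.1%}")
print(f"Headroom left: {headroom_pp:.1f} pp")
```

Under these assumptions about 95% of the range is used up and only 3 percentage points remain, which is why small differences between frontier models on MMLU carry little signal.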

Benchmark progress, 2023–2024

Benchmark	2023	2024	Gain
SWE-bench (coding)	4.4%	71.7%	+67.3 pp
GPQA (graduate-level QA)	—	—	+48.9 pp
MMMU (multimodal)	—	—	+18.8 pp

Source: Stanford HAI, AI Index 2025 (one-year gains, 2023–2024).
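
The Gain column is an absolute difference in percentage points, not a relative change, and the two can look very different. A minimal sketch using the SWE-bench row, the only row with both endpoints listed (the helper names are illustrative):

```python
def gain_pp(before: float, after: float) -> float:
    """Absolute change in percentage points, rounded to one decimal."""
    return round(after - before, 1)


def improvement_ratio(before: float, after: float) -> float:
    """How many times larger the later score is than the earlier one."""
    if before <= 0:
        raise ValueError("before must be positive")
    return after / before


# SWE-bench solve rates from the table above.
swe_2023, swe_2024 = 4.4, 71.7

swe_gain = gain_pp(swe_2023, swe_2024)
swe_ratio = improvement_ratio(swe_2023, swe_2024)

print(f"Gain: +{swe_gain} pp")
print(f"Ratio: {swe_ratio:.1f}x the 2023 solve rate")
```

On these inputs the gain is 67.3 pp, while the 2024 solve rate is roughly 16 times the 2023 figure.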

Deep dive: why benchmarks keep changing

As classic benchmarks saturate, frontier labs have moved to harder tests. The benchmarks that model cards actually report in 2026 include Humanity’s Last Exam (HLE), FrontierMath, ARC-AGI-2, GPQA Diamond, SWE-bench Verified, AIME 2025, and τ-bench. Even these are falling quickly: on HLE, a test explicitly designed to be hard for AI, frontier models gained roughly 30 percentage points in a single year (Stanford HAI 2026).

The competitive landscape

The frontier is close and contested. In February 2025, China’s DeepSeek-R1 briefly matched the top U.S. model; since then the lead has changed hands repeatedly while staying in the single digits. As of March 2026 the top U.S. model leads by 2.7%. The open-vs-closed gap has also compressed — from 8.0% in January 2024 to 1.7% by February 2025 on the Chatbot Arena Leaderboard (Stanford HAI 2026).

Metric	Figure	Source
Top US lead over China (Mar 2026)	2.7%	Stanford HAI 2026
Closed–open gap (Jan 2024)	8.0%	Stanford HAI 2026
Closed–open gap (Feb 2025)	1.7%	Stanford HAI 2026
HLE one-year gain	+30 pp	Stanford HAI 2026
Inference cost decline	~280× (late 2022–late 2024)	Stanford HAI 2025
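
To put the inference-cost row on a per-year footing, the total decline can be converted to a compound annual factor. This is a rough sketch, assuming the "late 2022 to late 2024" window is treated as exactly two years and that the decline was smooth; `annualized_factor` is a hypothetical helper.

```python
def annualized_factor(total_factor: float, years: float) -> float:
    """Constant per-year factor that compounds to `total_factor` over `years`."""
    if total_factor <= 0 or years <= 0:
        raise ValueError("total_factor and years must be positive")
    return total_factor ** (1.0 / years)


TOTAL_DECLINE = 280.0  # ~280x cheaper, per the table above
YEARS = 2.0            # assumption: window treated as exactly two years

per_year = annualized_factor(TOTAL_DECLINE, YEARS)

print(f"Implied decline: about {per_year:.1f}x per year")
```

With these inputs the implied rate is a little under 17× per year. A shorter or longer window would change the result, so the figure is only an order-of-magnitude guide.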

Frequently asked questions

What is the best AI benchmark in 2026?

There is no single best benchmark. As MMLU and similar tests saturate above 92%, frontier labs now report on harder ones, including Humanity’s Last Exam (HLE), FrontierMath, ARC-AGI-2, GPQA Diamond, and SWE-bench Verified.

How fast is AI improving on benchmarks?

Very fast. In 2023–2024 alone, SWE-bench jumped from 4.4% to 71.7% (+67.3 pp), and frontier models have since gained about 30 percentage points in a year on Humanity’s Last Exam.

Are US or Chinese AI models better?

They are very close. The lead has changed hands repeatedly since early 2025; as of March 2026 the top U.S. model leads China’s best by about 2.7%, per Stanford HAI.

Data sources & methodology

Stanford HAI — AI Index Report 2025 (Technical Performance).
Verified data points: SWE-bench 4.4% (2023) → 71.7% (2024); one-year gains of +18.8 (MMMU), +48.9 (GPQA), +67.3 (SWE-bench) pp; ~280× inference-cost decline.
Source: hai.stanford.edu/ai-index/2025-ai-index-report
Stanford HAI — AI Index Report 2026 (Technical Performance).
Verified data points: MMLU saturated >92% (Q1 2026); top US model leads China by 2.7% (Mar 2026); DeepSeek-R1 matched top US (Feb 2025); closed–open gap 8.0% (Jan 2024) → 1.7% (Feb 2025); +30 pp on HLE in one year.
Source: hai.stanford.edu/ai-index/2026-ai-index-report
