Benchmark

An AI benchmark is a standardized test used to measure and compare model capabilities on a defined task, such as coding, math, reasoning, multimodal understanding, or agent execution. Examples include SWE-bench (coding), MMLU (broad knowledge), GPQA (graduate-level QA), MMMU (multimodal), and Humanity’s Last Exam (HLE).

How it works

A benchmark consists of a fixed set of inputs (problems, questions, tasks) and a scoring method (typically accuracy or pass rate). Researchers run a model against the dataset and publish the score. Because every model is evaluated on the same inputs with the same scoring method, the results are directly comparable, and competing frontier models can be ranked head-to-head. A benchmark becomes saturated when leading models score near its ceiling, at which point it no longer separates them.
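
As a rough illustration of the loop described above, the following Python sketch scores a stand-in model on a tiny fixed dataset and checks for saturation. Everything here is hypothetical: the four-item dataset, the `toy_model` function, and the 5-point saturation margin are assumptions made for the example, not part of any real benchmark or its official harness.

```python
from typing import Callable, List, Tuple

# Hypothetical toy dataset: fixed inputs paired with reference answers.
BENCHMARK: List[Tuple[str, str]] = [
    ("2 + 2", "4"),
    ("3 * 5", "15"),
    ("10 - 7", "3"),
    ("9 / 3", "3"),
]


def toy_model(prompt: str) -> str:
    """Stand-in for a real model; it gets the last item wrong on purpose."""
    canned = {"2 + 2": "4", "3 * 5": "15", "10 - 7": "3", "9 / 3": "4"}
    return canned[prompt]


def accuracy(model: Callable[[str], str], dataset: List[Tuple[str, str]]) -> float:
    """Fraction of items where the model's answer exactly matches the reference."""
    correct = sum(model(question).strip() == answer for question, answer in dataset)
    return correct / len(dataset)


def is_saturated(score: float, ceiling: float = 1.0, margin: float = 0.05) -> bool:
    """Assumed rule of thumb: saturated if within `margin` of the ceiling."""
    return score >= ceiling - margin


score = accuracy(toy_model, BENCHMARK)
print(f"accuracy = {score:.1%}, saturated = {is_saturated(score)}")
```

Real benchmarks usually need more than exact string matching (for example, running unit tests for coding tasks or normalizing answers), so the scoring function is the part most likely to change in practice.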

Why it matters

Benchmarks are the primary way the industry tracks AI progress, and progress in 2024–2026 has been steep. Top SWE-bench scores jumped from 4.4% to 71.7% in a single year, MMLU is now saturated with leading scores above 92%, and frontier models gained roughly 30 percentage points on HLE in one year. Full timeline in AI Model Benchmarks 2026.

Related terms: Frontier Model · Large Language Model · Foundation Model