AI Benchmarks Explained: MMLU, HellaSwag, GSM8K, and More

Key takeaways

  • AI benchmarks are standardized tests that let researchers compare model capabilities across labs.
  • MMLU covers 57 academic subjects; GSM8K measures grade-school math; HellaSwag tests commonsense reasoning.
  • Coding benchmarks like HumanEval and SWE-bench have become central as AI coding tools matured.
  • All benchmarks saturate eventually — frontier models now score near ceiling on MMLU and HumanEval, forcing new, harder benchmarks like MMLU-Pro and SWE-bench Verified.
  • Benchmark numbers are useful but easy to game; contamination (test data leaking into training) is a persistent problem.

Why benchmarks exist

Without benchmarks, claims about AI progress are anecdotes. A new model might sound impressive in a demo but be worse on tasks you care about. Benchmarks give a common scorecard — the same questions, scored the same way — so that Anthropic’s Claude, OpenAI’s GPT, Google’s Gemini, and open-source Llama or Qwen can be compared apples-to-apples.


The downside: benchmarks shape what labs optimize for. Teach to the test long enough and you eventually reach a ceiling that says more about the test than about real capability. See our model evaluation coverage for a deeper look at evaluation design.

The major language benchmarks

MMLU — Massive Multitask Language Understanding

MMLU is 57 multiple-choice subject tests, from elementary math to professional law and medicine. Released in 2020, it became the default benchmark for raw knowledge breadth. Early GPT-3 scored around 44%. Frontier models in 2024-2025 scored in the 86-92% range. The ceiling prompted a harder successor, MMLU-Pro, which expands each question from four answer choices to ten and filters for more challenging, reasoning-heavy questions.
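The scoring itself is simple: the model picks a letter per question, and accuracy is the fraction of correct picks. A minimal sketch, where `model_answer` is a hypothetical stand-in for a real model call:

```python
# Minimal sketch of MMLU-style multiple-choice scoring.
# Each item has a question, lettered choices, and a gold answer letter.

def model_answer(question: str, choices: dict[str, str]) -> str:
    # Placeholder: a real evaluator would prompt an LLM with the question
    # and choices, then parse the letter it picks.
    return "A"

def mmlu_accuracy(items: list[dict]) -> float:
    correct = sum(
        1 for item in items
        if model_answer(item["question"], item["choices"]) == item["answer"]
    )
    return correct / len(items)

items = [
    {"question": "2 + 2 = ?",
     "choices": {"A": "4", "B": "5", "C": "3", "D": "22"}, "answer": "A"},
    {"question": "Capital of France?",
     "choices": {"A": "Rome", "B": "Paris", "C": "Lyon", "D": "Nice"}, "answer": "B"},
]
print(mmlu_accuracy(items))  # 0.5 with this trivial stub
```

Real harnesses add few-shot examples and answer-letter parsing, which is exactly where prompt sensitivity (discussed below) creeps in.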

GSM8K — grade-school math

GSM8K contains 8,500 grade-school math word problems. Models must parse the English, identify the operations needed, and work through multi-step arithmetic. It was a stretch goal for years; recent frontier models score above 95%, so it has been largely retired as a frontier test in favor of MATH, AIME, and similar harder sets.

HellaSwag — commonsense sentence completion

HellaSwag asks models to pick the most plausible continuation of a short scenario. It was designed to be easy for humans (~95% accuracy) but hard for 2019-era models. Modern LLMs saturate it above 95%.

TruthfulQA — resistance to plausible falsehoods

TruthfulQA tests whether a model repeats common misconceptions. It is harder than it sounds — strong models often repeat myths they learned from internet text. Scores correlate with post-training alignment quality more than with raw capability.

ARC — science reasoning

The AI2 Reasoning Challenge (ARC) tests grade-school science questions that require reasoning rather than lookup. Its harder split, ARC-Challenge, was long a difficult benchmark for LLMs, though modern frontier models now score near its ceiling.

Coding benchmarks

HumanEval

OpenAI’s HumanEval contains 164 hand-written Python coding problems with unit tests. A model’s score is the fraction of problems it solves with functioning code. Top models now pass 90%+, so it is less discriminative than it was.
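HumanEval results are usually reported as pass@k: generate n candidate solutions per problem, count how many pass the unit tests, and estimate the probability that at least one of k sampled candidates would pass. A sketch of the unbiased estimator from the original HumanEval paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n = candidate solutions generated per problem
    c = candidates that pass the unit tests
    k = sampling budget being evaluated
    """
    if n - c < k:
        # Fewer failing candidates than the budget: some sample must pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 3 of 10 candidates passing, pass@1 is the plain pass rate:
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```

The benchmark-wide score is this quantity averaged over all 164 problems; headline numbers like "90%+ on HumanEval" are typically pass@1.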

MBPP and APPS

Google’s MBPP (Mostly Basic Python Problems) and APPS (from Hendrycks et al.) are broader coding benchmarks covering a wider range of difficulty and problem types.

SWE-bench

SWE-bench changed the frame from “solve a leetcode problem” to “close a real GitHub issue”. It gives the model a real repository and an issue ticket, asking it to produce a pull-request-style patch. SWE-bench scores have become a key metric for agentic coding capabilities. Frontier models in 2025 scored in the 40-70% range depending on the subset, with SWE-bench Verified (a curated subset with stronger test coverage) used for headline comparisons.

Reasoning and math benchmarks

MATH

MATH contains 12,500 competition problems at AMC- and AIME-level difficulty, making it much harder than GSM8K. Frontier models now score above 80% with chain-of-thought prompting.

GPQA

GPQA (Graduate-level Google-Proof Q&A) is a set of PhD-level questions in physics, chemistry, and biology, designed so that experts score ~65% but Google-assisted laypeople score ~35%. Stanford AI Index reported GPQA scores jumping by nearly 49 percentage points in a single year — a signal that frontier models rapidly closed in on expert performance.

AIME and FrontierMath

As LLMs saturated existing math benchmarks, harder ones appeared. AIME (American Invitational Mathematics Examination) and FrontierMath (expert-curated novel problems) probe the ceiling of current models.

Multimodal and vision benchmarks

  • MMMU — multi-discipline multimodal understanding, college-level questions with images.
  • ChartQA, DocVQA — reading charts and documents visually.
  • Video-MME, Video-MMMU — video understanding.
  • Visual Question Answering (VQA) — broad but aging benchmark.

Agent and tool-use benchmarks

As AI shifted from chat to agents — systems that plan, call tools, and take actions — new benchmarks emerged. WebArena and VisualWebArena test browser agents. AgentBench covers varied environments. Cybench evaluates cybersecurity tasks. These are still rapidly evolving as the agent landscape changes.

Common pitfalls in reading benchmarks

Contamination

When benchmark questions leak into training data, scores go up without real capability improvements. All major labs now report contamination analyses, but subtle leakage (paraphrased questions, online discussions of answers) is hard to fully eliminate.
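One common (and admittedly crude) contamination check is n-gram overlap: flag a benchmark item if a long verbatim word sequence from it appears in a training document. A minimal sketch of the idea; real pipelines use larger n, normalization, and fuzzy matching:

```python
# Rough n-gram overlap check for benchmark contamination.
# A single shared 8-gram is treated as a contamination signal.

def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(benchmark_item: str, training_doc: str, n: int = 8) -> bool:
    # Any shared n-gram between the benchmark item and the training
    # document counts as a hit.
    return bool(ngrams(benchmark_item, n) & ngrams(training_doc, n))

question = "A store sold 48 clips in April and then half as many clips in May"
doc = "forum post: a store sold 48 clips in april and then half as many clips in may"
print(looks_contaminated(question, doc))  # True: an 8-gram overlap is found
```

Note what this misses: paraphrased questions and discussions of the answers share no long verbatim n-gram, which is exactly the subtle leakage the paragraph above describes.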

Prompt sensitivity

Scores can swing 5-15 percentage points based on prompt engineering — which few-shot examples, which formatting, which decoding parameters. Reported scores often reflect the best prompt the evaluator could find, not a like-for-like comparison with prior work.

Cherry-picking

With dozens of benchmarks, a lab can always find a subset where their model wins. Independent leaderboards like LMArena (formerly Chatbot Arena), OpenRouter rankings, and HELM help by aggregating many benchmarks or using head-to-head human preference.

Benchmark vs. real usage

High MMLU does not guarantee the model handles your customer-support tickets well. Always complement standardized benchmarks with your own task-specific evaluations. See our large language models coverage for how benchmarks map to real deployments.
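A task-specific evaluation can be as simple as a labeled set of your own examples and an exact-match score. A minimal sketch, where `route_ticket` is a hypothetical stand-in for whatever model call your product actually makes:

```python
# Minimal shape of a task-specific evaluation: run your own labeled
# examples through the model and score accuracy on YOUR task.

def route_ticket(text: str) -> str:
    # Placeholder: a real harness would call your deployed model here.
    return "billing" if "invoice" in text.lower() else "other"

eval_set = [
    ("Where is my invoice for March?", "billing"),
    ("The app crashes on startup", "tech_support"),
    ("Invoice shows the wrong amount", "billing"),
]

correct = sum(route_ticket(text) == label for text, label in eval_set)
print(f"accuracy: {correct}/{len(eval_set)}")  # accuracy: 2/3 with this stub
```

Even a few dozen such examples, drawn from real traffic, will tell you more about a model's fit for your product than any public leaderboard position.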

Frequently asked questions

Which benchmark matters most?
None in isolation. Benchmarks are most useful in combination — a model strong on MMLU, GPQA, SWE-bench, and human-preference arenas is likely capable across the board. A model that wins one benchmark but loses others probably optimized narrowly. For product decisions, your own evaluations on your actual use case matter more than any public score.

Why are scores improving so fast?
A mix of factors: larger models, better training data, better post-training techniques (RLHF, DPO), chain-of-thought prompting, and — honestly — some benchmark contamination. The Stanford HAI AI Index 2025 documented gains of 18.8 percentage points on MMMU, 48.9 on GPQA, and 67.3 on SWE-bench in a single year. The real-world impact is smaller than the benchmark deltas suggest, but the trend is real.

Can I trust a leaderboard?
Trust with nuance. Static benchmark leaderboards are vulnerable to contamination and prompt engineering. Live head-to-head arenas like LMArena are harder to game but measure chat-style preference rather than capability depth. HELM from Stanford is a thoughtful aggregate. Use several sources and treat large gaps as signal, small gaps as noise. For industry trends, see our AI industry coverage.

Digital Mind News

Digital Mind News is an AI-operated newsroom. Every article here is synthesized from multiple trusted external sources by our automated pipeline, then checked before publication. We disclose our AI authorship openly because transparency is part of the product.