
Dear Readers,

Benchmarks are the stethoscope of the AI world: inconspicuous, but indispensable. They reveal whether a model merely produces shiny headlines or whether real capabilities lie behind them. But the higher the scores climb, the louder the question becomes: do these numbers really measure intelligence, or just exam routine? This is where the real tension begins: benchmarks are necessary, but they are never neutral. They shape what they measure, shifting the focus from ability to passing.

In this issue, we map the field: we look at the major benchmark families, from MMLU-Pro to GPQA to SWE-bench, and show their strengths and blind spots. We also look to the future: dynamic, multimodal tests that reflect real-world applications such as medicine or law. In short: less multiple choice, more everyday life. Anyone who wants to understand where AI really stands, and where it needs to go, will find a reliable compass in the following lines.


All the best,

What are AI and LLM benchmarks – why do we need them, and which ones matter today?

The TLDR

AI benchmarks are essential, standardized tests that act as speedometers for the industry, creating comparability and transparency for new models. However, they come with a major risk known as "Goodhart's Law": when a benchmark becomes a target, models can learn to "game the test" without actually acquiring the underlying skills being measured. Therefore, to get a true sense of an AI's capabilities, it's crucial to rely on a combination of new, difficult, and robust benchmarks rather than a single score.

Benchmarks are the speedometers of AI development: sober measuring points in a field that is moving at breakneck speed. Without them, there is no comparison, no direction, and no early warning system against regressions. A benchmark is nothing mysterious, but rather a clearly defined, repeatable task with a key figure at the end – accuracy, pass@1, exact match, or a preference rating. But measurements change what they measure. As soon as a number becomes a target, optimization begins along that very number – that is the old insight behind Goodhart's law: "When a measure becomes a target, it ceases to be a good measure." This is even more true for AI: models can "learn" tests without actually acquiring the skills those tests are looking for. Behind every impressive score therefore lies the real question: does this test measure what we really want to know?

In the following, I will organize the field: why benchmarks are necessary; what types there are; which ones are specifically important – with strengths, limitations, and practical relevance. The guiding question that underpins the text: How can benchmarks be read and combined in such a way that they reveal real competence and not just exam routine?

Why benchmarks are indispensable

Benchmarks create comparability across models, versions, and training methods. They provide regression protection (does a new model fail on tasks an older one solved?) and enable external transparency, for example in leaderboards or papers. Equally important: good benchmarks differentiate between models at the top instead of lumping them all together at ">95%." This is exactly where newer, more robust suites come into play, which are deliberately harder and less susceptible to shortcuts (see MMLU-Pro). And yet the downsides remain: overfitting (contamination by training data), artifacts in the dataset, and gaming of metrics. Those who take benchmarks seriously therefore pay attention to recency, test robustness, and a combination of tests instead of a single "magic number."


The big families of benchmarks – and the heavyweights among them

(a) General knowledge & understanding

MMLU ("Massive Multitask Language Understanding") tests 57 subjects from history to medicine in multiple-choice form – long the de facto standard, but now saturated at the top. MMLU-Pro responds with more answer options (ten instead of four), a greater need for reasoning and transfer, and less prompt sensitivity.

ARC Challenge tests scientific knowledge in the form of tasks that combine reasoning and knowledge retrieval. HellaSwag, on the other hand, focuses on everyday logic – the plausible continuation of scenarios. Both show how everyday understanding and specialist knowledge interact.

(b) Mathematical thinking

GSM8K measures multi-step word problems at roughly middle-school level. MATH goes one level higher: competition problems with complete solution sketches. GPQA, finally, is "Google-proof" at the graduate level – deliberately designed so that pure retrieval does not help and genuine derivation is required.

(c) Instruction following & dialogue quality

IFEval checks whether a model follows instructions exactly (“write 400 words,” “use exactly three bullet points”). MT-Bench and the Chatbot Arena, on the other hand, rely on preference comparisons: humans or other models evaluate which answers they prefer. Advantage: proximity to real-world use. Disadvantage: bias and style effects.
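Part of IFEval's appeal is that such instructions are verifiable by simple deterministic rules rather than by a judge model. A minimal sketch of what such checks might look like (the function names and parameters are illustrative, not IFEval's actual implementation):

```python
def check_bullet_count(text: str, required: int) -> bool:
    """Verify the response contains exactly `required` bullet-point lines."""
    bullets = [ln for ln in text.splitlines()
               if ln.lstrip().startswith(("-", "*", "•"))]
    return len(bullets) == required

def check_word_count(text: str, target: int, tolerance: int = 0) -> bool:
    """Verify the response's word count is within `tolerance` of `target`."""
    return abs(len(text.split()) - target) <= tolerance

reply = "- first point\n- second point\n- third point"
print(check_bullet_count(reply, 3))  # True
```

Because the checks are rule-based, they are cheap and reproducible – but they also only cover instructions that can be stated as hard constraints, which is exactly the trade-off against preference-based evaluations like MT-Bench.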

(d) Programming & software engineering

Early code benchmarks such as HumanEval are often too easy for today's models. SWE-bench tests real GitHub issues: bug fixes in large repositories, tested with unit tests. SWE-bench Verified is a human-verified subset to avoid measurement errors. LiveCodeBench continuously delivers new tasks from coding platforms, thus protecting against data contamination.
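Schematically, a SWE-bench-style evaluation applies the model's proposed patch to the repository and then runs the issue's previously failing unit tests; an issue counts as resolved only if both steps succeed. The sketch below compresses that loop into callables – the `Task` structure and names are illustrative, not the actual harness:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    issue_id: str
    apply_patch: Callable[[], bool]  # does the model's patch apply cleanly?
    run_tests: Callable[[], bool]    # do the fail-to-pass unit tests now pass?

def resolved_rate(tasks: list[Task]) -> float:
    """Fraction of issues where the patch applies and the tests pass."""
    ok = sum(1 for t in tasks if t.apply_patch() and t.run_tests())
    return ok / len(tasks)

# Toy example: one resolving patch, one patch that applies but fails the tests.
tasks = [
    Task("repo#1", lambda: True, lambda: True),
    Task("repo#2", lambda: True, lambda: False),
]
print(resolved_rate(tasks))  # 0.5
```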

(e) Multimodality (text+image)

MMMU includes over 11,000 tasks from subjects such as physics, medicine, and music—with tables, diagrams, and sheet music. MMMU-Pro raises the bar even higher. MathVista focuses on visual mathematics: diagrams, function plots, and combined logic.

(f) Truthfulness, safety, and robustness

TruthfulQA measures whether models reproduce popular misconceptions. RealToxicityPrompts tests how models respond to toxic or sensitive prompts. Both are central to safety, but they are not sufficient: they show tendencies, not guarantees.

How to interpret benchmark figures correctly

First: Understand the metrics. Accuracy on multiple-choice questions says something different from pass@1 in coding or from preference scores.
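Pass@k, the standard coding metric, is usually computed with the unbiased estimator introduced alongside HumanEval: generate n samples per task, count the c correct ones, and estimate the probability that at least one of k drawn samples passes:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn from n generations (c of them correct) passes."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than k: a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 generations, 3 correct: pass@1 reduces to the raw success rate c/n.
print(round(pass_at_k(10, 3, 1), 3))  # 0.3
```

This is why a pass@1 score is not directly comparable to multiple-choice accuracy: it folds in sampling variance and the strictness of the unit tests, not just whether the model "knows" the answer.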

Second: Exclude contamination. Have test tasks already been seen in training? Dynamic benchmarks such as LiveCodeBench minimize the risk.

Third: Check robustness. Are results significantly skewed by wording or style?

Fourth: Task proximity. For practical decisions, the benchmarks that resemble real-world tasks are the ones that count.

Preliminary conclusion: No single benchmark is a good judge of “intelligence.” Those who seriously compare models combine comprehension (MMLU-Pro, ARC/HellaSwag), thinking/math (GSM8K, MATH, GPQA), instruction following (IFEval, MT-Bench), application relevance (SWE-bench, LiveCodeBench), and safety (TruthfulQA, RealToxicityPrompts).
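In practice, "combining instead of averaging away" can be as simple as reporting a per-category profile rather than one blended number. The sketch below does exactly that; all scores are made up for illustration:

```python
# Hypothetical benchmark scores on a 0-1 scale (illustrative, not real results).
scores = {
    "comprehension": {"MMLU-Pro": 0.78, "ARC-Challenge": 0.92},
    "math":          {"GSM8K": 0.95, "MATH": 0.71, "GPQA": 0.48},
    "instructions":  {"IFEval": 0.86},
    "coding":        {"SWE-bench": 0.33, "LiveCodeBench": 0.41},
    "safety":        {"TruthfulQA": 0.62},
}

def category_profile(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Average within each category; report a profile, not a single number."""
    return {cat: sum(v.values()) / len(v) for cat, v in scores.items()}

for cat, avg in category_profile(scores).items():
    print(f"{cat:14s} {avg:.2f}")
```

A profile like this makes the trade-offs visible that a single leaderboard score hides: a model can top the comprehension category while its coding scores tell a very different story.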

Conclusion

Benchmarks are necessary because they make progress measurable and setbacks visible. They are limited because every measure can be strategically optimized and real-world tasks are multidimensional. The answer to the initial question is: benchmarks are indispensable—provided that they are read in combination and, depending on the application, those tests are selected that actually reflect the target competence.

The outlook: The future belongs to dynamic, low-contamination, multimodal, and agentic benchmarks—tests that incorporate tool use, long-term planning, and team interaction. Until then, benchmarks are a compass, not a final destination. The map is only created in real-world use.

Sources:

🔗 MMLU-Pro (2024, Wang et al.) - https://arxiv.org/abs/2406.01574

🔗 ARC-Challenge (2018, Clark et al.) - https://arxiv.org/abs/1803.05457

🔗 HellaSwag (2019, Zellers et al.) - https://arxiv.org/abs/1905.07830

🔗 GSM8K (2021, Cobbe et al.) - https://arxiv.org/pdf/2110.14168

🔗 MATH (2021, Hendrycks et al.) - https://arxiv.org/abs/2103.03874

🔗 GPQA (2023, Rein et al.) - https://arxiv.org/abs/2311.12022

🔗 IFEval (2023, Zhou et al.) - https://arxiv.org/abs/2311.07911

🔗 MT-Bench / Chatbot Arena (LMSYS, 2023–24) - https://huggingface.co/papers/2306.05685; and https://lmarena.ai/?arena=

🔗 SWE-bench (2023, Jimenez et al.) - https://arxiv.org/abs/2310.06770

🔗 SWE-bench Verified (OpenAI, 2024) - https://openai.com/index/introducing-swe-bench-verified/

🔗 LiveCodeBench (2024, Jain et al.) - https://arxiv.org/abs/2403.07974

🔗 MMMU / MMMU-Pro (2023/24, Yue et al.) - https://arxiv.org/abs/2311.16502

🔗 MathVista (2023, Lu et al.) - https://arxiv.org/abs/2310.02255

🔗 TruthfulQA (2021, Lin et al.) - https://arxiv.org/abs/2109.07958

🔗 RealToxicityPrompts (2020, Gehman et al.) - https://aclanthology.org/2020.findings-emnlp.301/

🔗 Overview of Goodhart's law (Wikipedia / Thomas 2022) - https://en.wikipedia.org/wiki/Goodhart's_law

🔗 and, as a somewhat more in-depth treatment: - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9122957/

Chubby’s Opinion Corner

Traditional benchmarks have served their purpose: they made progress visible, enabled models to be compared, and brought initial order to the field. But the further the models have developed, the more apparent the limitations of these test formats have become. Multiple-choice questions, mathematical tasks, and synthetic instruction tests are, in a sense, "laboratory conditions." They measure skills in isolation, but often fail to reflect the complexity of reality. With models now routinely achieving 90% or more on standard benchmarks, the raw numbers are losing their significance.

The future therefore belongs to benchmarks that reflect the real world – not only through greater difficulty, but through genuine relevance. This applies above all to fields such as medicine, law, and engineering. In medicine, it is not enough to reproduce facts correctly: a model must interpret clinical cases, weigh probabilities, assess risks, and communicate responsibly. In law, it is not just a matter of knowing legal texts, but of reviewing contracts, balancing interests, and making justifications transparent. In all these fields, it is not only the correct answer that counts, but also the quality of the argumentation and the robustness of the process. Real-world benchmarks must therefore be multimodal – with text, images, tables, and possibly also audio – and they must allow for interaction and measure the ability to collaborate with humans.

Another crucial point: realistic benchmarks evaluate not only the final answer, but also the path to it. In practice, people work with AI not as consumers of ready-made solutions, but in dialogue. A doctor would use AI to test hypotheses; a lawyer, to prepare lines of argument. The new generation of benchmarks must reflect this cooperative dimension: agent-based tasks that steer the model through scenarios, role-plays with virtual patients or clients, and tasks that require several steps before a decision is reached. Only then can it be determined whether a model is truly usable or merely excels in the exam room.

This shifts the overall significance of benchmarks away from rigid scores and toward process evaluation. Away from the test lab and toward everyday life. The next generation will be more specialized, more interactive, and more tailored to real-world applications. This is costly—it requires real specialist data, ethical reviews, experts for evaluation, and high standards of data protection. But this effort is necessary if benchmarks are to be more than just performance parades and become a reliable bridge between research and social practice.

How'd We Do?

Please let us know what you think! Also feel free to just reply to this email with suggestions (we read everything you send us)!
