arxiv: 2406.19314 · v2 · submitted 2024-06-27 · 💻 cs.CL · cs.AI· cs.LG

Recognition: 2 theorem links

· Lean Theorem

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

Colin White , Samuel Dooley , Manley Roberts , Arka Pal , Ben Feuer , Siddhartha Jain , Ravid Shwartz-Ziv , Neel Jain

show 10 more authors

Khalid Saifullah Sreemanti Dey Shubh-Agrawal Sandeep Singh Sandha Siddartha Naidu Chinmay Hegde Yann LeCun Tom Goldstein Willie Neiswanger Micah Goldblum

Authors on Pith no claims yet

Pith reviewed 2026-05-15 04:43 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG

keywords LLM benchmarktest contaminationautomatic evaluationlive benchmarkmath reasoningcoding tasksobjective scoring

0 comments

The pith

LiveBench is an LLM benchmark that pulls questions monthly from recent sources and scores them automatically against objective answers to avoid test contamination.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LiveBench to solve the problem of test set contamination, where benchmark questions leak into model training data and render evaluations unreliable. It sources questions from fresh math competitions, arXiv papers, news articles, and datasets, then updates them regularly while using automatic objective scoring instead of human or LLM judges. The benchmark covers challenging tasks across math, coding, reasoning, language, instruction following, and data analysis, including harder versions of prior tasks like Big-Bench Hard. Even leading models score below 70 percent, and the full set of questions, code, and answers is released for ongoing use. Questions and tasks continue to be added over time so the benchmark can track genuine capability gains.

Core claim

LiveBench is the first benchmark that contains frequently-updated questions from recent information sources, scores answers automatically according to objective ground-truth values, and contains a wide variety of challenging tasks spanning math, coding, reasoning, language, instruction following, and data analysis.

What carries the argument

Sourcing questions from recently released sources such as math competitions, arXiv papers, news articles, and datasets, paired with automatic objective scoring against ground-truth values.

If this is right

Closed-source and open-source models from 0.5B to 405B parameters can be compared on the same contamination-resistant tasks.
Monthly additions of new questions and harder task versions allow the benchmark to remain useful as model performance rises.
Objective auto-scoring removes reliance on subjective human or LLM judges for difficult problems.
Releasing all questions and model answers supports reproducible evaluation and community extensions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Other domains such as science or multimodal evaluation could adopt similar live sourcing plus auto-scoring to stay ahead of data leakage.
Models may need training strategies that emphasize recent data or better generalization to maintain high scores over repeated benchmark updates.
Public leaderboards built on LiveBench could become more stable references for comparing capability trends across years.

Load-bearing premise

Questions drawn from recently released sources stay out of the training data of the models being tested, and automatic objective scoring measures true capability without adding new biases.

What would settle it

A newly released model quickly reaches high accuracy on LiveBench questions drawn from sources that post-date its training cutoff, or inspection shows those exact questions already present in its training corpus.

read the original abstract

Test set contamination, wherein test data from a benchmark ends up in a newer model's training set, is a well-documented obstacle for fair LLM evaluation and can quickly render benchmarks obsolete. To mitigate this, many recent benchmarks crowdsource new prompts and evaluations from human or LLM judges; however, these can introduce significant biases, and break down when scoring hard questions. In this work, we introduce a new benchmark for LLMs designed to be resistant to both test set contamination and the pitfalls of LLM judging and human crowdsourcing. We release LiveBench, the first benchmark that (1) contains frequently-updated questions from recent information sources, (2) scores answers automatically according to objective ground-truth values, and (3) contains a wide variety of challenging tasks, spanning math, coding, reasoning, language, instruction following, and data analysis. To achieve this, LiveBench contains questions that are based on recently-released math competitions, arXiv papers, news articles, and datasets, and it contains harder, contamination-limited versions of tasks from previous benchmarks such as Big-Bench Hard, AMPS, and IFEval. We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 405B in size. LiveBench is difficult, with top models achieving below 70% accuracy. We release all questions, code, and model answers. Questions are added and updated on a monthly basis, and we release new tasks and harder versions of tasks over time so that LiveBench can distinguish between the capabilities of LLMs as they improve in the future. We welcome community engagement and collaboration for expanding the benchmark tasks and models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LiveBench gives a practical, auto-scored benchmark with monthly fresh questions, but its contamination resistance for closed models rests on assumptions that cannot be checked.

read the letter

LiveBench is worth a look if you work on LLM evaluation. The paper puts out a benchmark that draws questions from recent arXiv papers, news, competitions, and datasets, updates them monthly, and scores answers automatically against objective ground truth. It also adds harder versions of tasks from Big-Bench Hard, AMPS, and IFEval. That mix of frequent updates, automatic scoring, and broad coverage is the actual new piece relative to the benchmarks they cite. They release every question, the code, and the model answers, which makes the whole thing usable right away. They run it on dozens of open models up to 405B and several closed ones, with top scores staying below 70 percent, so the difficulty claim holds up in the numbers they show. The automatic scoring avoids the biases that come with human or LLM judges on hard items. The soft spot is the contamination story for closed models. The design assumes that material from recently released sources has not made it into the training data of GPT-4, Claude, or similar systems, but those training mixtures are not public and web-scraped data has variable lag. Monthly updates and harder variants do not close that gap, so the “contamination-limited” label is conditional for the strongest models. For open models the claim is easier to inspect. This paper is for anyone who needs an evolving benchmark that stays relevant as models improve and who values reproducible, judge-free scoring. A reader focused on practical evaluation methods will get concrete value from the released artifacts. I would send it to peer review. The release itself is a useful contribution, and the remaining issues are addressable with clearer discussion of the unverifiable parts rather than a fundamental flaw in the work.

Referee Report

3 major / 2 minor

Summary. The paper introduces LiveBench, a new LLM benchmark designed to resist test set contamination through frequently updated questions drawn from recent arXiv papers, news articles, math competitions, and datasets. It features automatic scoring against objective ground-truth values across diverse tasks including math, coding, reasoning, language, instruction following, and data analysis, plus harder contamination-limited versions of tasks from benchmarks such as Big-Bench Hard, AMPS, and IFEval. The work evaluates numerous closed- and open-source models (0.5B to 405B parameters), reports top-model accuracy below 70%, releases all questions/code/answers, and plans monthly updates with new tasks over time.

Significance. If the contamination resistance and objective scoring claims hold, LiveBench would provide a valuable, evolving benchmark that addresses a documented weakness in static LLM evaluations without relying on subjective human or LLM judges. The public release of questions, code, and model answers, combined with the commitment to ongoing updates, supports reproducibility and long-term utility for tracking model progress.

major comments (3)

[Abstract and §2] Abstract and §2 (LiveBench construction): the central claim that LiveBench is 'contamination-limited' for closed models rests on the untestable premise that questions from recently released sources are absent from proprietary training data; no verification procedure, cutoff analysis, or empirical check is provided for models such as GPT-4 or Claude, which directly undermines the primary design goal.
[§3] §3 (Evaluation and scoring): exact scoring rules, ground-truth extraction methods, and handling of edge cases (e.g., partial credit, formatting variations) are not fully specified for each task category, making it impossible to confirm that automatic objective scoring is free of new measurement biases as asserted.
[§4] §4 (Results): the reported aggregate scores and model rankings depend on the contamination-limited property; without addressing the unverifiable assumption for closed models, the claim that LiveBench can 'distinguish between the capabilities of LLMs as they improve' lacks sufficient support.

minor comments (2)

[§4] A table enumerating all evaluated models with parameter counts, sources, and exact LiveBench scores would improve readability and allow direct comparison.
[Abstract] The abstract and introduction use 'dozens of open-source models' without a precise count or breakdown; adding this detail would strengthen the evaluation description.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive comments. We address each point below, making revisions to improve clarity on scoring and to acknowledge limitations on contamination verification for closed models.

read point-by-point responses

Referee: [Abstract and §2] Abstract and §2 (LiveBench construction): the central claim that LiveBench is 'contamination-limited' for closed models rests on the untestable premise that questions from recently released sources are absent from proprietary training data; no verification procedure, cutoff analysis, or empirical check is provided for models such as GPT-4 or Claude, which directly undermines the primary design goal.

Authors: We acknowledge that complete verification is impossible for closed models since their training data is proprietary. Our design uses questions from very recent sources (e.g., arXiv papers from the past month) to minimize the likelihood of contamination, assuming standard training cutoffs. We have added explicit discussion of this assumption and its limitations in the revised §2. For open-source models, we provide contamination checks. This does not fully resolve the issue for closed models but is the practical approach given the constraints. revision: partial
Referee: [§3] §3 (Evaluation and scoring): exact scoring rules, ground-truth extraction methods, and handling of edge cases (e.g., partial credit, formatting variations) are not fully specified for each task category, making it impossible to confirm that automatic objective scoring is free of new measurement biases as asserted.

Authors: We agree that more detail is needed. In the revised manuscript, we have expanded §3 with precise scoring rules for each category: for mathematical tasks, we normalize answers and check for exact equivalence; for coding, we execute code against hidden test cases; for reasoning tasks, we use string matching with tolerance for minor variations. Edge cases like partial answers or formatting issues are now explicitly addressed with examples. This ensures transparency and reduces potential biases. revision: yes
Referee: [§4] §4 (Results): the reported aggregate scores and model rankings depend on the contamination-limited property; without addressing the unverifiable assumption for closed models, the claim that LiveBench can 'distinguish between the capabilities of LLMs as they improve' lacks sufficient support.

Authors: We have revised §4 to include a dedicated subsection discussing the reliance on the contamination-limited assumption, particularly noting the unverifiable aspect for closed models. We argue that the benchmark still distinguishes capabilities through its challenging nature and planned updates, as evidenced by the current top scores below 70%. We also provide breakdowns showing performance gaps that persist even under conservative assumptions about contamination. revision: partial

standing simulated objections not resolved

The inability to empirically verify the absence of LiveBench questions from the training data of closed-source models such as GPT-4 and Claude.

Circularity Check

0 steps flagged

No circularity: benchmark introduction is self-contained contribution

full rationale

The paper introduces and releases LiveBench as a new benchmark with frequently-updated questions from recent sources and automatic objective scoring. No derivation chain, equations, fitted parameters, or predictions are present that reduce to inputs by construction. The central claims rest on the act of curation and release rather than any self-referential mathematical reduction or load-bearing self-citation of unverified uniqueness theorems. Self-citations to prior benchmarks (Big-Bench Hard, AMPS, IFEval) are used only to describe task extensions, not to justify the core contamination-limitation property via circular logic. The contamination resistance is presented as a design choice based on sourcing from recent releases, which is externally verifiable in principle and does not collapse to a fitted input or self-definition.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The benchmark rests on domain assumptions about data freshness and scoring objectivity rather than new mathematical derivations, fitted parameters, or invented entities.

axioms (2)

domain assumption Questions drawn from recently released math competitions, arXiv papers, news articles, and datasets are absent from the training sets of the evaluated models.
This assumption underpins the claim of contamination-limited evaluation.
domain assumption Automatic scoring against objective ground-truth values accurately reflects LLM capability without the biases that affect human or LLM judges.
This assumption is required for the benchmark to avoid the pitfalls of crowdsourced judging described in the abstract.

pith-pipeline@v0.9.0 · 5680 in / 1370 out tokens · 47638 ms · 2026-05-15T04:43:43.381988+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We introduce LiveBench, the first benchmark that (1) contains frequently-updated questions from recent information sources, (2) scores answers automatically according to objective ground-truth values, and (3) contains a wide variety of challenging tasks, spanning math, coding, reasoning, language, instruction following, and data analysis.
Foundation.DimensionForcing dimension_forced unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

LiveBench contains questions that are based on recently-released math competitions, arXiv papers, news articles, and datasets, and it contains harder, contamination-limited versions of tasks from previous benchmarks such as Big-Bench Hard, AMPS, and IFEval.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs
cs.LG 2026-05 unverdicted novelty 8.0

MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.
Re$^2$Math: Benchmarking Theorem Retrieval in Research-Level Mathematics
cs.AI 2026-05 unverdicted novelty 7.0

Re²Math is a new benchmark that evaluates AI models on retrieving and verifying the applicability of theorems from math literature to advance steps in partial proofs, accepting any sufficient theorem while controlling...
FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks
cs.AI 2026-04 unverdicted novelty 7.0

FinTrace supplies trajectory-level metrics for LLM financial tool calling, exposing gaps in information use and output quality, while its preference dataset enables DPO training that boosts intermediate metrics.
DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math?
cs.AI 2026-04 unverdicted novelty 7.0

DRBENCHER generates multi-hop questions across biochemistry, finance, geophysics, security, and history that test interleaved browsing and computation, where the strongest models reach only 20% accuracy and human vali...
Assessing Large Language Models for Stabilizing Numerical Expressions in Scientific Software
cs.SE 2026-04 conditional novelty 7.0

LLMs match or exceed state-of-the-art traditional methods for stabilizing numerical expressions in scientific software, succeeding on 97.9% of expressions where baselines fail to improve accuracy, but struggle with co...
MathArena: Evaluating LLMs on Uncontaminated Math Competitions
cs.AI 2025-05 unverdicted novelty 7.0

MathArena evaluates over 50 LLMs on 162 fresh competition problems across seven contests, detects contamination in AIME 2024, and reports top models scoring below 40 percent on IMO 2025 proof tasks.
Unified Reward Model for Multimodal Understanding and Generation
cs.CV 2025-03 unverdicted novelty 7.0

UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.
ShapeCodeBench: A Renewable Benchmark for Perception-to-Program Reconstruction of Synthetic Shape Scenes
cs.CV 2026-05 accept novelty 6.0

ShapeCodeBench introduces a renewable benchmark for perception-to-program reconstruction of synthetic shapes, with evaluations showing low exact-match performance from current models and heuristics.
Decision-aware User Simulation Agent for Evaluating Conversational Recommender Systems
cs.IR 2026-05 unverdicted novelty 6.0

Hesitator is a theory-grounded simulator that separates utility-based item selection from overload-aware commitment decisions to reduce unrealistic high acceptance rates in conversational recommender evaluations.
Counting as a minimal probe of language model reliability
cs.CL 2026-05 unverdicted novelty 6.0

Language models have limited stable counting capacity well below context limits and rely on a finite set of count-like internal states, collapsing to guessing once exhausted.
You Don't Need Public Tests to Generate Correct Code
cs.SE 2026-04 unverdicted novelty 6.0

DryRUN lets LLMs create their own test inputs and run internal simulations for self-correcting code generation, matching the performance of test-dependent methods like CodeSIM on LiveCodeBench without public tests or ...
LLMs for Qualitative Data Analysis Fail on Security-specificComments in Human Experiments
cs.SE 2026-04 unverdicted novelty 6.0

LLMs improve with detailed code descriptions but remain insufficient to replace human annotators for security-specific qualitative coding.
Babbling Suppression: Making LLMs Greener One Token at a Time
cs.SE 2026-04 unverdicted novelty 6.0

Babbling Suppression stops LLM code generation upon test passage to reduce token output and energy consumption by up to 65% across Python and Java benchmarks.
Fighting AI with AI: AI-Agent Augmented DNS Blocking of LLM Services during Student Evaluations
cs.NI 2026-03 unverdicted novelty 6.0

AI-Sinkhole uses AI classification with quantized LLMs and Pi-Hole DNS blocking to dynamically prevent access to LLM services during student evaluations, reporting F1 scores above 0.83.
Kimi Linear: An Expressive, Efficient Attention Architecture
cs.CL 2025-10 unverdicted novelty 6.0

Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
cs.SE 2025-09 conditional novelty 6.0

SWE-Bench Pro is a new benchmark with 1,865 long-horizon tasks from 41 repositories designed to evaluate AI agents on realistic enterprise-level software engineering problems beyond prior benchmarks.
Process Reinforcement through Implicit Rewards
cs.LG 2025-02 conditional novelty 6.0

PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 1...
Qwen2.5-1M Technical Report
cs.CL 2025-01 accept novelty 6.0

Qwen2.5-1M models reach 1M token context with improved long-context performance, no short-context loss, and 3-7x prefill speedup via open inference optimizations.
Qwen3 Technical Report
cs.CL 2025-05 unverdicted novelty 5.0

Pith review generated a malformed one-line summary.
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
cs.AI 2025-07 accept novelty 4.0

The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs
cs.CL 2026-05 unverdicted novelty 3.0

EngGPT2MoE-16B-A3B matches or beats other Italian models on most international benchmarks but trails top international models such as GPT-5 nano and Qwen3-8B.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · cited by 21 Pith papers · 21 internal anchors

[1]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Nemotron-4 340b technical report

Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H Anh, Pallab Bhattacharya, Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, et al. Nemotron-4 340b technical report. arXiv preprint arXiv:2406.11704,

work page arXiv
[3]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

URL https://openreview.net/forum?id=uyTL5Bvosj

ISSN 2835-8856. URL https://openreview.net/forum?id=uyTL5Bvosj. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901,

work page 1901
[5]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna. lmsys. org (accessed 14 April 2023), 2(3):6,

work page 2023
[7]

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference. arXiv preprint arXiv:2403.04132,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

c4ai-command-r-plus-08-2024,

11 Published as a conference paper at ICLR 2025 Cohere For AI. c4ai-command-r-plus-08-2024,

work page 2025
[10]

URL https://huggingface.co/ CohereForAI/c4ai-command-r-plus-08-2024 . DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Ho...

work page 2024
[11]

Investigating data contamination in modern benchmarks for large language models

Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, and Arman Cohan. Investigating data contamination in modern benchmarks for large language models. arXiv preprint arXiv:2311.09783,

work page arXiv
[12]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Generalization or memorization: Data con- tamination and trustworthy evaluation for large language models.arXiv preprint arXiv:2402.15938,

Yihong Dong, Xue Jiang, Huanyu Liu, Zhi Jin, and Ge Li. Generalization or memorization: Data con- tamination and trustworthy evaluation for large language models.arXiv preprint arXiv:2402.15938,

work page arXiv
[14]

The Llama 3 Herd of Models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

arXiv:2310.18208 [cs]

URL http://arxiv.org/abs/2310.18208. arXiv:2310.18208 [cs]. Bofei Gao and Tianyu Liu. Omni-math: A universal olympiad level mathematic benchmark for large language models. https://omni-math.github.io/,

work page arXiv
[17]

Datasheets for datasets

12 Published as a conference paper at ICLR 2025 Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé Iii, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64(12): 86–92,

work page 2025
[18]

Data contamination quiz: A tool to detect and estimate contamination in large language models

Shahriar Golchin and Mihai Surdeanu. Data contamination quiz: A tool to detect and estimate contamination in large language models. arXiv preprint arXiv:2311.06233, 2023a. Shahriar Golchin and Mihai Surdeanu. Time travel in llms: Tracing data contamination in large language models. arXiv preprint arXiv:2308.08493, 2023b. Anthony Goldbloom. The overlooked ...

work page arXiv
[19]

Guardian Media Group

Accessed: 2024-11-23. Guardian Media Group. The guardian. https://www.theguardian.com/,

work page 2024
[20]

Measuring Mathematical Problem Solving With the MATH Dataset

Accessed: 2024-01-20. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874,

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Os- trow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276,

work page internal anchor Pith review Pith/arXiv arXiv
[22]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Mistral 7B

Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825,

work page internal anchor Pith review Pith/arXiv arXiv
[24]

Training on the test model: Contamina- tion in ranking distillation

Vishakha Suresh Kalal, Andrew Parry, and Sean MacAvaney. Training on the test model: Contamina- tion in ranking distillation. arXiv preprint arXiv:2411.02284,

work page arXiv
[25]

StarCoder: may the source be with you!

Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023a. Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From live data to high-qu...

work page internal anchor Pith review Pith/arXiv arXiv
[26]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

URL https://lmsys.org/blog/2024-04-19-arena-hard/ . Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023b. Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chengg...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

doi: 10.7717/peerj-cs.103

ISSN 2376-5992. doi: 10.7717/peerj-cs.103. URL https://doi.org/10.7717/peerj-cs.103. V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Pe- tersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wi...

work page doi:10.7717/peerj-cs.103
[28]

URL http://dx.doi.org/10.1038/nature14236

ISSN 00280836. URL http://dx.doi.org/10.1038/nature14236. OpenAI. Gpt-4 technical report. Technical Report,

work page doi:10.1038/nature14236
[29]

Proving test set contamination in black box language models

Yonatan Oren, Nicole Meister, Niladri Chatterji, Faisal Ladhak, and Tatsunori B Hashimoto. Proving test set contamination in black box language models. arXiv preprint arXiv:2310.17623,

work page arXiv
[30]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530,

work page internal anchor Pith review Pith/arXiv arXiv
[31]

Rematch: Retrieval enhanced schema matching with llms

Eitam Sheetrit, Menachem Brief, Moshik Mishaeli, and Oren Elisha. Rematch: Retrieval enhanced schema matching with llms. arXiv preprint arXiv:2403.01567,

work page arXiv
[32]

Evaluation data contamination in llms: how do we measure it and (when) does it matter? arXiv preprint arXiv:2411.03923,

Aaditya K Singh, Muhammed Yusuf Kocyigit, Andrew Poulton, David Esiobu, Maria Lomeli, Gergely Szilvasy, and Dieuwke Hupkes. Evaluation data contamination in llms: how do we measure it and (when) does it matter? arXiv preprint arXiv:2411.03923,

work page arXiv
[33]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615,

work page internal anchor Pith review Pith/arXiv arXiv
[34]

Functional benchmarks for robust evaluation of reasoning performance, and the reasoning gap

14 Published as a conference paper at ICLR 2025 Saurabh Srivastava, Anto PV , Shashank Menon, Ajay Sukumar, Alan Philipose, Stevin Prince, Sooraj Thomas, et al. Functional benchmarks for robust evaluation of reasoning performance, and the reasoning gap. arXiv preprint arXiv:2402.19450,

work page arXiv 2025
[35]

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261,

work page internal anchor Pith review Pith/arXiv arXiv
[36]

Challenging big-bench tasks and whether chain-of-thought can solve them

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 13003–13051,

work page 2023
[37]

Gemma 2: Improving Open Language Models at a Practical Size

Gemma Team. Gemma, 2024a. URL https://www.kaggle.com/m/3301. Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118,

work page internal anchor Pith review Pith/arXiv arXiv
[38]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288,

work page internal anchor Pith review Pith/arXiv arXiv
[39]

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. arXiv preprint arXiv:2406.01574,

work page internal anchor Pith review Pith/arXiv arXiv
[40]

Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks

Cong Yan and Yeye He. Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp. 1539–1554, Portland OR USA, June

work page 2020
[41]

ISBN 978-1-4503- 6735-6

ACM. ISBN 978-1-4503- 6735-6. doi: 10.1145/3318464.3389738. URL https://dl.acm.org/doi/10.1145/ 3318464.3389738. An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, ...

work page doi:10.1145/3318464.3389738
[42]

Tree of thoughts: Deliberate problem solving with large language models

15 Published as a conference paper at ICLR 2025 Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), 36,

work page 2025
[43]

A careful examination of large language model performance on grade school arithmetic

Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Dylan Slack, Qin Lyu, et al. A careful examination of large language model performance on grade school arithmetic. arXiv preprint arXiv:2405.00332,

work page arXiv
[44]

Instruction-Following Evaluation for Large Language Models

Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023a. Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large ...

work page internal anchor Pith review Pith/arXiv arXiv
[45]

This is the full version of Figure 1 (left)

16 Published as a conference paper at ICLR 2025 0 10 20 30 40 50 60 70 phi-3-mini-4k-instruct phi-3-mini-128k-instruct meta-llama-3.1-8b-instruct-turbo command-r-plus-04-2024 command-r-08-2024 mistral-small-2402 gemma-2-9b-it command-r-plus-08-2024 mixtral-8x22b-instruct-v0.1 claude-3-haiku-20240307 qwen2.5-7b-instruct-turbo gemini-1.5-flash-8b-exp-0827 g...

work page 2025
[46]

We output the results for each model on each main category, as well as each model’s overallLiveBench score

17 Published as a conference paper at ICLR 2025 Table 2: LiveBench results across 40 models. We output the results for each model on each main category, as well as each model’s overallLiveBench score. Model LiveBenchCoding Data Instruction Language Math Reasoning Score Analysis Following o1-preview-2024-09-12 64.7 50.8 64.0 74.6 68.7 62.9 67.4 claude-3-5-...

work page 2025
[47]

error for each category compared to the overall average LiveBench score, computed using data from all 40 models

20 Published as a conference paper at ICLR 2025 Table 5: Pearson correlation coefficient and std. error for each category compared to the overall average LiveBench score, computed using data from all 40 models. Category Correlation Std Error math 0.9477 0.0643 reasoning 0.9439 0.0709 data_analysis 0.9315 0.0489 coding 0.8970 0.0840 language 0.8970 0.0831 ...

work page 2025
[48]

[[rating]]

In Table 7, we compute the relative best and worst task for each model, specifically, the tasks with the highest and lowest residuals of the best fit line vs. overall LiveBench performance. In other words, we compute the task that each model most outperforms and underperforms on, relative to a theoretical model with the same overall performance but has ba...

work page 2025
[49]

However, we 22 Published as a conference paper at ICLR 2025 Table 8: LLM judges cannot accurately evaluate challenging math and reasoning questions

We find that the error rate for all tasks is far above a reasonable value, indicating that LLM judges are not appropriate for challenging math and logic tasks. However, we 22 Published as a conference paper at ICLR 2025 Table 8: LLM judges cannot accurately evaluate challenging math and reasoning questions. Error rate of LLM-as-a-judge scoring on challeng...

work page 2025
[50]

All three are challenging multiple-choice competitions for high school students in the USA (AMC) and UK (SMC) that build in difficulty, meant as the first step for high school students to qualify for their country’s team for the International Mathematical Olympiad (IMO). The questions test mathematical problem solving with arithmetic, algebra, counting, g...

work page 2022
[51]

Real numbers x and y with x, y > 1 satisfy logx(yx) = log y(x4y) = 10

An example of a problem of this type from the AIME I 2024 problem set is below: An example question from the Math Competitions task. Real numbers x and y with x, y > 1 satisfy logx(yx) = log y(x4y) = 10 . What is the value of xy? Please think step by step, and then display the answer at the very end of your response. The answer is an integer consisting of...

work page 2024
[52]

Remember to have the three digits as the last part of the response

If you cannot determine the correct answer, take your best guess. Remember to have the three digits as the last part of the response. Ground Truth: 025 Proof-based questions. We consider the USA Math Olympiad (USAMO) 2024 and International Math Olympiad (IMO) 2024 competitions, released on March 20, 2024 and July 22, 2024, respectively. These contests are...

work page 2024
[53]

An example of a integral problem is as follows: 25 Published as a conference paper at ICLR 2025 An example question from the AMPS Hard task

where we check for semantic as well as numerical equivalence of mathematical expressions. An example of a integral problem is as follows: 25 Published as a conference paper at ICLR 2025 An example question from the AMPS Hard task. Find an indefinite integral (which can vary by a constant) of the following function: 5 sec2(5x +

work page 2025
[54]

A.3.2 C ODING CATEGORY The coding ability of LLMs is one of the most widely studied and sought-after skills for LLMs (Mnih et al., 2015; Jain et al., 2024; Li et al., 2023a). We include two coding tasks in LiveBench: a modified version of the code generation task from LiveCodeBench (Jain et al., 2024), and a novel code completion task combining LiveCodeBe...

work page 2015
[55]

We have taken 78 randomly selected problems from the April 2024 release of LiveCodeBench, selecting only problems released in or after November

included several tasks to assess the coding capabilities of large language models. We have taken 78 randomly selected problems from the April 2024 release of LiveCodeBench, selecting only problems released in or after November

work page 2024
[56]

and AtCoder(Team, 2012), defined with a textual description and solved by writing full programs in Python 3 code. These problems are presented as in LiveCodeBench’s Code Generation task, with minor prompting differences and with only one chance at generating a correct solution per question, per model. We report pass@1, a metric which describes the proport...

work page 2012
[57]

As with Code Generation, we report pass@1

April 2024 release, combined with matching solutions from https://github.com/kamyu104/LeetCode-Solutions, omitting the last 15% of each medium/hard solution and 30-70% of each easy solution and asking the LLM to complete the solution. As with Code Generation, we report pass@1. A.3.3 R EASONING CATEGORY The reasoning abilities of large language models is a...

work page 2024
[58]

The task is to evaluate the truth value of a random Boolean function expressed as a natural-language word problem

and Big- Bench Hard (Suzgun et al., 2023). The task is to evaluate the truth value of a random Boolean function expressed as a natural-language word problem. In particular, the LLM must evaluate fn(fn−1(...f1(x)...)), where each fi is either negation or identity, and x is True or False. We represent x by the sentence: X0 {tells the truth, lies}, and we re...

work page 2023
[59]

JSON", "JSONL

reasoning task that tests the ability of the model to follow a set of statements that set up constraints, and then logically deduce the requested information. The following is an example with three people and three attributes: An example question from the Zebra Puzzle task. There are 3 people standing in a line numbered 1 through 3 in a left to right orde...

work page 2023
[60]

The following are the beginning sentences of a news article from the Guardian.\n——-\n{guardian article}\n——-\n{subtask prompt} {instructions}

To construct, the full prompt, containing the news article sentences, the prompt, and the instructions, we use the following meta prompt: “The following are the beginning sentences of a news article from the Guardian.\n——-\n{guardian article}\n——-\n{subtask prompt} {instructions}”. 2https://open-platform.theguardian.com/ 29 Published as a conference paper...

work page 2025
[61]

and RoBERTa (Liu et al., 2019). The authors found that GPT-4 has an overall completion rate below 40% on the puzzles (when allowed multiple tries to get it correct), concluding that ‘large language models in the GPT family are able to solve these puzzles with moderate reliability, indicating that the task is possible but remains a formidable challenge.’ I...

work page 2019
[62]

The lengths of the synopses vary from as few as 7 sentences to as many as 66 sentences; at the upper end, this is a very challenging task

These synopses are then split into their constituent sentences and are randomly shuffled. The lengths of the synopses vary from as few as 7 sentences to as many as 66 sentences; at the upper end, this is a very challenging task. The LLM is provided the shuffled sentences with the prompt: ‘The following plot 31 Published as a conference paper at ICLR 2025 ...

work page 2025
[63]

to determine the closest match using a version of the Ratcliff/Obershelp algorithm (Ratcliff & Metzener, 1988). For 2), we calculate the score as 1 − d n_sentences_gt, where n_sentences_gt is the number of sentences in the ground truth synopsis, and d is the Levenshtein distance (Levenshtein,

work page 1988
[64]

Thus if the model’s sentence ordering perfectly matches the ground truth, the distanced would be 0, and the score would be 1 for that sample

of the model’s sentence ordering to the ground truth synopsis ordering. Thus if the model’s sentence ordering perfectly matches the ground truth, the distanced would be 0, and the score would be 1 for that sample. One might think that it is plausible that synopsis unscrambling cannot always be solved with the information provided. However, note that even ...

work page 2025
[65]

The first is that an instruction-following element is added to the task

On the other hand, automated grading methods have two main pitfalls, which we take care to circumvent. The first is that an instruction-following element is added to the task. For example, a reasoning task with a complex answer instruction format tests instruction-following as well as reasoning, rather than pure reasoning. The second is that care must be ...

work page 2025
[66]

The latter example, despite LLMs already exhibiting some degree of this type of high generalization, is still ‘fair game’ with respect to pretraining and fine-tuning: many LLMs are trained on competitive math problem tasks and then evaluated on new, unseen test problems from the same distribution. Despite it being accepted practice, we attempt to guard ag...

work page 2024
[67]

for LiveBench in https://github.com/ LiveBench/LiveBench/blob/main/docs/DATASHEET.md. B.5 B ENCHMARK STATISTICS Here, we give statistics on the number of questions and average number of output tokens per task, and the total cost of running LiveBench with common API models. For the number of questions for each task, as well as the mean and std. dev number ...

work page 2024
[68]

36 Published as a conference paper at ICLR 2025 Table 16: Prices for running GPT and Claude models on LiveBench

o1-preview-2024-09-12 is the most expensive at $47.87, while claude-3-haiku is the cheapest, at $0.90. 36 Published as a conference paper at ICLR 2025 Table 16: Prices for running GPT and Claude models on LiveBench. This table gives the ap- proximate cost for running models on LiveBench as of Oct 1,

work page 2024
[69]

Note that we used the gpt-4-turbo tokenizer for all computations, so all other prices are approximate. Model Price in USD o1-preview-2024-09-12 ≈ 47.87 o1-mini-2024-09-12 ≈ 9.57 gpt-4o-2024-05-13 ≈ 13.98 gpt-4-turbo-2024-04-09 27.97 gpt-4-1106-preview ≈ 27.97 gpt-3.5-turbo-0125 ≈ 1.40 claude-3-opus ≈ 53.80 claude-3-5-sonnet ≈ 10.76 claude-3-sonnet ≈ 10.76...

work page 2024