Pith · machine review for the scientific record

arxiv: 2406.19314 · v2 · submitted 2024-06-27 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 Lean theorem links

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 04:43 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords LLM benchmark · test contamination · automatic evaluation · live benchmark · math reasoning · coding tasks · objective scoring

The pith

LiveBench is an LLM benchmark that pulls questions monthly from recent sources and scores them automatically against objective answers to avoid test contamination.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LiveBench to solve the problem of test set contamination, where benchmark questions leak into model training data and render evaluations unreliable. It sources questions from fresh math competitions, arXiv papers, news articles, and datasets, then updates them regularly while using automatic objective scoring instead of human or LLM judges. The benchmark covers challenging tasks across math, coding, reasoning, language, instruction following, and data analysis, including harder versions of tasks from prior benchmarks such as Big-Bench Hard. Even leading models score below 70 percent accuracy, and the full set of questions, code, and answers is released for ongoing use. Questions and tasks continue to be added over time so the benchmark can track genuine capability gains.
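A minimal sketch of the mechanism described above, assuming a hypothetical question store and an assumed training cutoff: keep only questions whose source post-dates the cutoff, then score answers against stored ground truth without any human or LLM judge. The records, dates, and helper names below are illustrative, not LiveBench's released code.

    # Hedged sketch: date-filter questions against an assumed model cutoff,
    # then score answers objectively. All data here is invented.
    from datetime import date

    questions = [
        {"id": "amc-2024-q7", "released": date(2024, 11, 8), "answer": "025"},
        {"id": "arxiv-2024-05-x", "released": date(2024, 5, 2), "answer": "yes"},
    ]

    def eligible(question, model_cutoff):
        # Contamination-limited only if the source post-dates the cutoff
        # (an assumption about training data, not a proof of absence).
        return question["released"] > model_cutoff

    def score(model_answer, ground_truth):
        # Objective scoring: normalize and compare, no judge in the loop.
        return int(model_answer.strip().lower() == ground_truth.strip().lower())

    model_cutoff = date(2024, 10, 1)  # assumed cutoff for an illustrative model
    live_set = [q for q in questions if eligible(q, model_cutoff)]
    print([q["id"] for q in live_set])  # only the post-cutoff question remains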

Core claim

LiveBench is the first benchmark that contains frequently-updated questions from recent information sources, scores answers automatically according to objective ground-truth values, and contains a wide variety of challenging tasks spanning math, coding, reasoning, language, instruction following, and data analysis.

What carries the argument

Sourcing questions from recently released sources such as math competitions, arXiv papers, news articles, and datasets, paired with automatic objective scoring against ground-truth values.

If this is right

  • Closed-source models and open-source models ranging from 0.5B to 405B parameters can be compared on the same contamination-resistant tasks (a small aggregation sketch follows this list).
  • Monthly additions of new questions and harder task versions allow the benchmark to remain useful as model performance rises.
  • Objective auto-scoring removes reliance on subjective human or LLM judges for difficult problems.
  • Releasing all questions and model answers supports reproducible evaluation and community extensions.
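
To make the first point concrete, here is a small aggregation sketch: per-category scores are collapsed into one number so models of very different sizes can be ranked on the same task set. The unweighted mean is an assumption for illustration and may differ from the paper's exact aggregation; the model names and scores are invented.

    # Hedged sketch: combine per-category scores into one overall number per
    # model. Categories follow the paper; names, weights, and values are made up.
    from statistics import mean

    CATEGORIES = ["math", "coding", "reasoning", "language",
                  "instruction_following", "data_analysis"]

    results = {
        "small-open-0.5B": dict(zip(CATEGORIES, [12, 8, 15, 20, 30, 18])),
        "large-closed":    dict(zip(CATEGORIES, [62, 55, 60, 70, 72, 64])),
    }

    def overall(scores):
        # Unweighted mean across the six categories (an assumption).
        return mean(scores[c] for c in CATEGORIES)

    for model, scores in sorted(results.items(), key=lambda kv: -overall(kv[1])):
        print(f"{model}: {overall(scores):.1f}")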

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Other domains such as science or multimodal evaluation could adopt similar live sourcing plus auto-scoring to stay ahead of data leakage.
  • Models may need training strategies that emphasize recent data or better generalization to maintain high scores over repeated benchmark updates.
  • Public leaderboards built on LiveBench could become more stable references for comparing capability trends across years.

Load-bearing premise

Questions drawn from recently released sources stay out of the training data of the models being tested, and automatic objective scoring measures true capability without adding new biases.

What would settle it

A newly released model quickly reaches high accuracy on LiveBench questions drawn from sources that post-date its training cutoff, or inspection shows those exact questions already present in its training corpus.
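
One concrete form the second branch of that test could take, offered here as a generic heuristic rather than a procedure from the paper: check whether long n-grams from a benchmark question already appear verbatim in a sample of a model's training text. A high overlap would suggest the question, or its source, leaked into training. The texts and threshold below are placeholders.

    # Hedged sketch: n-gram overlap between a benchmark question and a corpus
    # sample. A crude contamination probe, not the paper's method.
    def ngrams(text, n=8):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def overlap_fraction(question, corpus_sample, n=8):
        # Fraction of the question's n-grams found verbatim in the sample.
        q = ngrams(question, n)
        return len(q & ngrams(corpus_sample, n)) / max(len(q), 1)

    question = "Real numbers x and y with x, y > 1 satisfy certain log equations ..."
    corpus_sample = "unrelated pretraining text about language models and benchmarks"
    flagged = overlap_fraction(question, corpus_sample) > 0.5  # placeholder threshold
    print(flagged)  # False for this toy pair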

read the original abstract

Test set contamination, wherein test data from a benchmark ends up in a newer model's training set, is a well-documented obstacle for fair LLM evaluation and can quickly render benchmarks obsolete. To mitigate this, many recent benchmarks crowdsource new prompts and evaluations from human or LLM judges; however, these can introduce significant biases, and break down when scoring hard questions. In this work, we introduce a new benchmark for LLMs designed to be resistant to both test set contamination and the pitfalls of LLM judging and human crowdsourcing. We release LiveBench, the first benchmark that (1) contains frequently-updated questions from recent information sources, (2) scores answers automatically according to objective ground-truth values, and (3) contains a wide variety of challenging tasks, spanning math, coding, reasoning, language, instruction following, and data analysis. To achieve this, LiveBench contains questions that are based on recently-released math competitions, arXiv papers, news articles, and datasets, and it contains harder, contamination-limited versions of tasks from previous benchmarks such as Big-Bench Hard, AMPS, and IFEval. We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 405B in size. LiveBench is difficult, with top models achieving below 70% accuracy. We release all questions, code, and model answers. Questions are added and updated on a monthly basis, and we release new tasks and harder versions of tasks over time so that LiveBench can distinguish between the capabilities of LLMs as they improve in the future. We welcome community engagement and collaboration for expanding the benchmark tasks and models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces LiveBench, a new LLM benchmark designed to resist test set contamination through frequently updated questions drawn from recent arXiv papers, news articles, math competitions, and datasets. It features automatic scoring against objective ground-truth values across diverse tasks including math, coding, reasoning, language, instruction following, and data analysis, plus harder contamination-limited versions of tasks from benchmarks such as Big-Bench Hard, AMPS, and IFEval. The work evaluates numerous closed- and open-source models (0.5B to 405B parameters), reports top-model accuracy below 70%, releases all questions/code/answers, and plans monthly updates with new tasks over time.

Significance. If the contamination resistance and objective scoring claims hold, LiveBench would provide a valuable, evolving benchmark that addresses a documented weakness in static LLM evaluations without relying on subjective human or LLM judges. The public release of questions, code, and model answers, combined with the commitment to ongoing updates, supports reproducibility and long-term utility for tracking model progress.

major comments (3)
  1. [Abstract and §2] Abstract and §2 (LiveBench construction): the central claim that LiveBench is 'contamination-limited' for closed models rests on the untestable premise that questions from recently released sources are absent from proprietary training data; no verification procedure, cutoff analysis, or empirical check is provided for models such as GPT-4 or Claude, which directly undermines the primary design goal.
  2. [§3] §3 (Evaluation and scoring): exact scoring rules, ground-truth extraction methods, and handling of edge cases (e.g., partial credit, formatting variations) are not fully specified for each task category, making it impossible to confirm that automatic objective scoring is free of new measurement biases as asserted.
  3. [§4] §4 (Results): the reported aggregate scores and model rankings depend on the contamination-limited property; without addressing the unverifiable assumption for closed models, the claim that LiveBench can 'distinguish between the capabilities of LLMs as they improve' lacks sufficient support.
minor comments (2)
  1. [§4] A table enumerating all evaluated models with parameter counts, sources, and exact LiveBench scores would improve readability and allow direct comparison.
  2. [Abstract] The abstract and introduction use 'dozens of open-source models' without a precise count or breakdown; adding this detail would strengthen the evaluation description.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive comments. We address each point below, making revisions to improve clarity on scoring and to acknowledge limitations on contamination verification for closed models.

read point-by-point responses
  1. Referee: [Abstract and §2] Abstract and §2 (LiveBench construction): the central claim that LiveBench is 'contamination-limited' for closed models rests on the untestable premise that questions from recently released sources are absent from proprietary training data; no verification procedure, cutoff analysis, or empirical check is provided for models such as GPT-4 or Claude, which directly undermines the primary design goal.

    Authors: We acknowledge that complete verification is impossible for closed models since their training data is proprietary. Our design uses questions from very recent sources (e.g., arXiv papers from the past month) to minimize the likelihood of contamination, assuming standard training cutoffs. We have added explicit discussion of this assumption and its limitations in the revised §2. For open-source models, we provide contamination checks. This does not fully resolve the issue for closed models but is the practical approach given the constraints. revision: partial

  2. Referee: [§3] §3 (Evaluation and scoring): exact scoring rules, ground-truth extraction methods, and handling of edge cases (e.g., partial credit, formatting variations) are not fully specified for each task category, making it impossible to confirm that automatic objective scoring is free of new measurement biases as asserted.

    Authors: We agree that more detail is needed. In the revised manuscript, we have expanded §3 with precise scoring rules for each category: for mathematical tasks, we normalize answers and check for exact equivalence; for coding, we execute code against hidden test cases; for reasoning tasks, we use string matching with tolerance for minor variations. Edge cases like partial answers or formatting issues are now explicitly addressed with examples. This ensures transparency and reduces potential biases. (A hedged code sketch of these scoring rules appears after the point-by-point responses.) revision: yes

  3. Referee: [§4] §4 (Results): the reported aggregate scores and model rankings depend on the contamination-limited property; without addressing the unverifiable assumption for closed models, the claim that LiveBench can 'distinguish between the capabilities of LLMs as they improve' lacks sufficient support.

    Authors: We have revised §4 to include a dedicated subsection discussing the reliance on the contamination-limited assumption, particularly noting the unverifiable aspect for closed models. We argue that the benchmark still distinguishes capabilities through its challenging nature and planned updates, as evidenced by the current top scores below 70%. We also provide breakdowns showing performance gaps that persist even under conservative assumptions about contamination. revision: partial
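
The scoring rules described in response 2 could look roughly like the dispatch below. This is a hedged reconstruction from the rebuttal's wording, not the benchmark's actual implementation; the helper names, normalization choices, and timeout are assumptions.

    # Hedged reconstruction: per-category objective scoring as described above.
    import subprocess, sys, tempfile

    def score_math(model_answer, ground_truth):
        # Normalize, then require exact equivalence (case/whitespace-insensitive).
        norm = lambda s: "".join(s.strip().lower().split())
        return float(norm(model_answer) == norm(ground_truth))

    def score_code(program, hidden_test):
        # Execute the candidate program against a hidden test case (pass@1 style).
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(program + "\n" + hidden_test)
            path = f.name
        try:
            run = subprocess.run([sys.executable, path], capture_output=True, timeout=10)
            return float(run.returncode == 0)
        except subprocess.TimeoutExpired:
            return 0.0

    def score_reasoning(model_answer, ground_truth):
        # String match with tolerance for minor formatting variation.
        return float(ground_truth.strip().lower() in model_answer.strip().lower())

    print(score_math(" 025 ", "025"))                                              # 1.0
    print(score_reasoning("The answer is TRUE.", "true"))                          # 1.0
    print(score_code("def add(a, b):\n    return a + b", "assert add(2, 2) == 4")) # 1.0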

standing simulated objections not resolved
  • The inability to empirically verify the absence of LiveBench questions from the training data of closed-source models such as GPT-4 and Claude.

Circularity Check

0 steps flagged

No circularity: benchmark introduction is self-contained contribution

full rationale

The paper introduces and releases LiveBench as a new benchmark with frequently-updated questions from recent sources and automatic objective scoring. No derivation chain, equations, fitted parameters, or predictions are present that reduce to inputs by construction. The central claims rest on the act of curation and release rather than any self-referential mathematical reduction or load-bearing self-citation of unverified uniqueness theorems. Self-citations to prior benchmarks (Big-Bench Hard, AMPS, IFEval) are used only to describe task extensions, not to justify the core contamination-limitation property via circular logic. The contamination resistance is presented as a design choice based on sourcing from recent releases, which is externally verifiable in principle and does not collapse to a fitted input or self-definition.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The benchmark rests on domain assumptions about data freshness and scoring objectivity rather than new mathematical derivations, fitted parameters, or invented entities.

axioms (2)
  • domain assumption Questions drawn from recently released math competitions, arXiv papers, news articles, and datasets are absent from the training sets of the evaluated models.
    This assumption underpins the claim of contamination-limited evaluation.
  • domain assumption Automatic scoring against objective ground-truth values accurately reflects LLM capability without the biases that affect human or LLM judges.
    This assumption is required for the benchmark to avoid the pitfalls of crowdsourced judging described in the abstract.

pith-pipeline@v0.9.0 · 5680 in / 1370 out tokens · 47638 ms · 2026-05-15T04:43:43.381988+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · tag: unclear

    Relation between the paper passage and the cited Recognition theorem.

    We introduce LiveBench, the first benchmark that (1) contains frequently-updated questions from recent information sources, (2) scores answers automatically according to objective ground-truth values, and (3) contains a wide variety of challenging tasks, spanning math, coding, reasoning, language, instruction following, and data analysis.

  • Foundation.DimensionForcing dimension_forced · tag: unclear

    Relation between the paper passage and the cited Recognition theorem.

    LiveBench contains questions that are based on recently-released math competitions, arXiv papers, news articles, and datasets, and it contains harder, contamination-limited versions of tasks from previous benchmarks such as Big-Bench Hard, AMPS, and IFEval.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

    cs.LG 2026-05 unverdicted novelty 8.0

    MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.

  2. Re²Math: Benchmarking Theorem Retrieval in Research-Level Mathematics

    cs.AI 2026-05 unverdicted novelty 7.0

    Re²Math is a new benchmark that evaluates AI models on retrieving and verifying the applicability of theorems from math literature to advance steps in partial proofs, accepting any sufficient theorem while controlling...

  3. FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks

    cs.AI 2026-04 unverdicted novelty 7.0

    FinTrace supplies trajectory-level metrics for LLM financial tool calling, exposing gaps in information use and output quality, while its preference dataset enables DPO training that boosts intermediate metrics.

  4. DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math?

    cs.AI 2026-04 unverdicted novelty 7.0

    DRBENCHER generates multi-hop questions across biochemistry, finance, geophysics, security, and history that test interleaved browsing and computation, where the strongest models reach only 20% accuracy and human vali...

  5. Assessing Large Language Models for Stabilizing Numerical Expressions in Scientific Software

    cs.SE 2026-04 conditional novelty 7.0

    LLMs match or exceed state-of-the-art traditional methods for stabilizing numerical expressions in scientific software, succeeding on 97.9% of expressions where baselines fail to improve accuracy, but struggle with co...

  6. MathArena: Evaluating LLMs on Uncontaminated Math Competitions

    cs.AI 2025-05 unverdicted novelty 7.0

    MathArena evaluates over 50 LLMs on 162 fresh competition problems across seven contests, detects contamination in AIME 2024, and reports top models scoring below 40 percent on IMO 2025 proof tasks.

  7. Unified Reward Model for Multimodal Understanding and Generation

    cs.CV 2025-03 unverdicted novelty 7.0

    UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.

  8. ShapeCodeBench: A Renewable Benchmark for Perception-to-Program Reconstruction of Synthetic Shape Scenes

    cs.CV 2026-05 accept novelty 6.0

    ShapeCodeBench introduces a renewable benchmark for perception-to-program reconstruction of synthetic shapes, with evaluations showing low exact-match performance from current models and heuristics.

  9. Decision-aware User Simulation Agent for Evaluating Conversational Recommender Systems

    cs.IR 2026-05 unverdicted novelty 6.0

    Hesitator is a theory-grounded simulator that separates utility-based item selection from overload-aware commitment decisions to reduce unrealistic high acceptance rates in conversational recommender evaluations.

  10. Counting as a minimal probe of language model reliability

    cs.CL 2026-05 unverdicted novelty 6.0

    Language models have limited stable counting capacity well below context limits and rely on a finite set of count-like internal states, collapsing to guessing once exhausted.

  11. You Don't Need Public Tests to Generate Correct Code

    cs.SE 2026-04 unverdicted novelty 6.0

    DryRUN lets LLMs create their own test inputs and run internal simulations for self-correcting code generation, matching the performance of test-dependent methods like CodeSIM on LiveCodeBench without public tests or ...

  12. LLMs for Qualitative Data Analysis Fail on Security-specific Comments in Human Experiments

    cs.SE 2026-04 unverdicted novelty 6.0

    LLMs improve with detailed code descriptions but remain insufficient to replace human annotators for security-specific qualitative coding.

  13. Babbling Suppression: Making LLMs Greener One Token at a Time

    cs.SE 2026-04 unverdicted novelty 6.0

    Babbling Suppression stops LLM code generation upon test passage to reduce token output and energy consumption by up to 65% across Python and Java benchmarks.

  14. Fighting AI with AI: AI-Agent Augmented DNS Blocking of LLM Services during Student Evaluations

    cs.NI 2026-03 unverdicted novelty 6.0

    AI-Sinkhole uses AI classification with quantized LLMs and Pi-Hole DNS blocking to dynamically prevent access to LLM services during student evaluations, reporting F1 scores above 0.83.

  15. Kimi Linear: An Expressive, Efficient Attention Architecture

    cs.CL 2025-10 unverdicted novelty 6.0

    Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

  16. SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

    cs.SE 2025-09 conditional novelty 6.0

    SWE-Bench Pro is a new benchmark with 1,865 long-horizon tasks from 41 repositories designed to evaluate AI agents on realistic enterprise-level software engineering problems beyond prior benchmarks.

  17. Process Reinforcement through Implicit Rewards

    cs.LG 2025-02 conditional novelty 6.0

    PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 1...

  18. Qwen2.5-1M Technical Report

    cs.CL 2025-01 accept novelty 6.0

    Qwen2.5-1M models reach 1M token context with improved long-context performance, no short-context loss, and 3-7x prefill speedup via open inference optimizations.

  19. Qwen3 Technical Report

    cs.CL 2025-05 unverdicted novelty 5.0

    Pith review generated a malformed one-line summary.

  20. A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

    cs.AI 2025-07 accept novelty 4.0

    The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.

  21. Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs

    cs.CL 2026-05 unverdicted novelty 3.0

    EngGPT2MoE-16B-A3B matches or beats other Italian models on most international benchmarks but trails top international models such as GPT-5 nano and Qwen3-8B.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · cited by 21 Pith papers · 21 internal anchors
