LiveBench: A Challenging, Contamination-Limited LLM Benchmark

Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain · 2024 · cs.CL · arXiv 2406.19314

Canonical reference. 78% of citing Pith papers cite this work as background.

38 Pith papers citing it

abstract

Test set contamination, wherein test data from a benchmark ends up in a newer model's training set, is a well-documented obstacle for fair LLM evaluation and can quickly render benchmarks obsolete. To mitigate this, many recent benchmarks crowdsource new prompts and evaluations from human or LLM judges; however, these can introduce significant biases, and break down when scoring hard questions. In this work, we introduce a new benchmark for LLMs designed to be resistant to both test set contamination and the pitfalls of LLM judging and human crowdsourcing. We release LiveBench, the first benchmark that (1) contains frequently-updated questions from recent information sources, (2) scores answers automatically according to objective ground-truth values, and (3) contains a wide variety of challenging tasks, spanning math, coding, reasoning, language, instruction following, and data analysis. To achieve this, LiveBench contains questions that are based on recently-released math competitions, arXiv papers, news articles, and datasets, and it contains harder, contamination-limited versions of tasks from previous benchmarks such as Big-Bench Hard, AMPS, and IFEval. We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 405B in size. LiveBench is difficult, with top models achieving below 70% accuracy. We release all questions, code, and model answers. Questions are added and updated on a monthly basis, and we release new tasks and harder versions of tasks over time so that LiveBench can distinguish between the capabilities of LLMs as they improve in the future. We welcome community engagement and collaboration for expanding the benchmark tasks and models.

citation-role summary

background: 6 · dataset: 3

citation-polarity summary

background: 7 · use dataset: 2

representative citing papers

MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.

TabArena: A Living Benchmark for Machine Learning on Tabular Data

cs.LG · 2025-06-20 · conditional · novelty 8.0

TabArena launches a dynamic, updatable benchmarking system for tabular ML that shows boosted trees remain competitive, deep learning matches them under larger budgets with ensembling, foundation models excel on small data, and cross-model ensembles advance SOTA while flagging validation overfitting.

LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

LiveBrowseComp shows search agents rely on intrinsic knowledge on standard benchmarks, with scores dropping 25-40 points and closed-book accuracy below 2% on questions about facts from the prior 90 days.

Forecasting Scientific Progress with Artificial Intelligence

cs.AI · 2026-05-21 · unverdicted · novelty 7.0

Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and insensitive to training cutoffs.

Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Presents a likelihood-based benchmark for equation-suffix prediction in technical papers with controls to detect shortcut vulnerabilities in model forecasts.

Re²Math: Benchmarking Theorem Retrieval in Research-Level Mathematics

cs.AI · 2026-05-09 · unverdicted · novelty 7.0

Re²Math is a new benchmark that evaluates AI models on retrieving and verifying the applicability of theorems from math literature to advance steps in partial proofs, accepting any sufficient theorem while controlling for leakage.

FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks

cs.AI · 2026-04-11 · unverdicted · novelty 7.0

FinTrace supplies trajectory-level metrics for LLM financial tool calling, exposing gaps in information use and output quality, while its preference dataset enables DPO training that boosts intermediate metrics.

DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math?

cs.AI · 2026-04-10 · unverdicted · novelty 7.0

DRBENCHER generates multi-hop questions across biochemistry, finance, geophysics, security, and history that test interleaved browsing and computation, where the strongest models reach only 20% accuracy and human validation finds 76% validity.

Assessing Large Language Models for Stabilizing Numerical Expressions in Scientific Software

cs.SE · 2026-04-06 · conditional · novelty 7.0

LLMs match or exceed state-of-the-art traditional methods for stabilizing numerical expressions in scientific software, succeeding on 97.9% of expressions where baselines fail to improve accuracy, but struggle with control flow and high-precision literals.

MathArena: Evaluating LLMs on Uncontaminated Math Competitions

cs.AI · 2025-05-29 · unverdicted · novelty 7.0

MathArena evaluates over 50 LLMs on 162 fresh competition problems across seven contests, detects contamination in AIME 2024, and reports top models scoring below 40 percent on IMO 2025 proof tasks.

PRIMETIME: Limits of LLMs in Temporal Primitives

cs.NE · 2025-04-22 · unverdicted · novelty 7.0

PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.

Unified Reward Model for Multimodal Understanding and Generation

cs.CV · 2025-03-07 · unverdicted · novelty 7.0

UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.

Consistent and Distinctive: LLM Benchmark Efficiency via Maximum Independent Set Prompt Selection on Similarity Graphs

cs.CL · 2026-05-31 · unverdicted · novelty 6.0

A graph-based MIS prompt selection method on embedding similarity graphs yields reduced benchmark subsets with highly consistent LLM rankings (Kendall's W ≥ 0.90 in 99.2% of cases) and 25-48% size reduction at higher thresholds.

SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations

cs.CL · 2026-05-21 · unverdicted · novelty 6.0

SynAE is a multi-metric framework that evaluates how well synthetic benchmarks replicate real data characteristics for multi-turn tool-calling agent testing.

ShapeCodeBench: A Renewable Benchmark for Perception-to-Program Reconstruction of Synthetic Shape Scenes

cs.CV · 2026-05-12 · accept · novelty 6.0

ShapeCodeBench introduces a renewable benchmark for perception-to-program reconstruction of synthetic shapes, with evaluations showing low exact-match performance from current models and heuristics.

Decision-aware User Simulation Agent for Evaluating Conversational Recommender Systems

cs.IR · 2026-05-05 · unverdicted · novelty 6.0

Hesitator is a theory-grounded simulator that separates utility-based item selection from overload-aware commitment decisions to reduce unrealistic high acceptance rates in conversational recommender evaluations.

You Don't Need Public Tests to Generate Correct Code

cs.SE · 2026-04-23 · unverdicted · novelty 6.0

DryRUN lets LLMs create their own test inputs and run internal simulations for self-correcting code generation, matching the performance of test-dependent methods like CodeSIM on LiveCodeBench without public tests or external signals.

LLMs for Qualitative Data Analysis Fail on Security-specific Comments in Human Experiments

cs.SE · 2026-04-12 · unverdicted · novelty 6.0

LLMs improve with detailed code descriptions but remain insufficient to replace human annotators for security-specific qualitative coding.

Babbling Suppression: Making LLMs Greener One Token at a Time

cs.SE · 2026-04-08 · unverdicted · novelty 6.0

Babbling Suppression stops LLM code generation upon test passage to reduce token output and energy consumption by up to 65% across Python and Java benchmarks.

Fighting AI with AI: AI-Agent Augmented DNS Blocking of LLM Services during Student Evaluations

cs.NI · 2026-03-20 · unverdicted · novelty 6.0

AI-Sinkhole uses AI classification with quantized LLMs and Pi-Hole DNS blocking to dynamically prevent access to LLM services during student evaluations, reporting F1 scores above 0.83.

Kimi Linear: An Expressive, Efficient Attention Architecture

cs.CL · 2025-10-30 · unverdicted · novelty 6.0

Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

cs.SE · 2025-09-21 · conditional · novelty 6.0

SWE-Bench Pro is a new benchmark with 1,865 long-horizon tasks from 41 repositories designed to evaluate AI agents on realistic enterprise-level software engineering problems beyond prior benchmarks.

EMERGE: A Benchmark for Updating Knowledge Graphs with Emerging Textual Knowledge

cs.CL · 2025-07-04 · accept · novelty 6.0

EMERGE is a benchmark dataset of 233K Wikipedia passages paired with 1.45 million Wikidata edit operations across seven yearly snapshots from 2019 to 2025 for evaluating knowledge graph updates from emerging text.

Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource

cs.CL · 2025-06-13 · conditional · novelty 6.0

MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.

citing papers explorer

Showing 38 of 38 citing papers.

MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs cs.LG · 2026-05-08 · unverdicted · none · ref 60 · internal anchor
MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.
TabArena: A Living Benchmark for Machine Learning on Tabular Data cs.LG · 2025-06-20 · conditional · none · ref 54 · internal anchor
TabArena launches a dynamic, updatable benchmarking system for tabular ML that shows boosted trees remain competitive, deep learning matches them under larger budgets with ensembling, foundation models excel on small data, and cross-model ensembles advance SOTA while flagging validation overfitting.
LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know? cs.AI · 2026-05-27 · unverdicted · none · ref 56 · internal anchor
LiveBrowseComp shows search agents rely on intrinsic knowledge on standard benchmarks, with scores dropping 25-40 points and closed-book accuracy below 2% on questions about facts from the prior 90 days.
Forecasting Scientific Progress with Artificial Intelligence cs.AI · 2026-05-21 · unverdicted · none · ref 53 · internal anchor
Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and insensitive to training cutoffs.
Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities cs.LG · 2026-05-11 · unverdicted · none · ref 4 · internal anchor
Presents a likelihood-based benchmark for equation-suffix prediction in technical papers with controls to detect shortcut vulnerabilities in model forecasts.
Re²Math: Benchmarking Theorem Retrieval in Research-Level Mathematics cs.AI · 2026-05-09 · unverdicted · none · ref 23 · internal anchor
Re²Math is a new benchmark that evaluates AI models on retrieving and verifying the applicability of theorems from math literature to advance steps in partial proofs, accepting any sufficient theorem while controlling for leakage.
FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks cs.AI · 2026-04-11 · unverdicted · none · ref 14 · internal anchor
FinTrace supplies trajectory-level metrics for LLM financial tool calling, exposing gaps in information use and output quality, while its preference dataset enables DPO training that boosts intermediate metrics.
DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math? cs.AI · 2026-04-10 · unverdicted · none · ref 11 · internal anchor
DRBENCHER generates multi-hop questions across biochemistry, finance, geophysics, security, and history that test interleaved browsing and computation, where the strongest models reach only 20% accuracy and human validation finds 76% validity.
Assessing Large Language Models for Stabilizing Numerical Expressions in Scientific Software cs.SE · 2026-04-06 · conditional · none · ref 38 · internal anchor
LLMs match or exceed state-of-the-art traditional methods for stabilizing numerical expressions in scientific software, succeeding on 97.9% of expressions where baselines fail to improve accuracy, but struggle with control flow and high-precision literals.
MathArena: Evaluating LLMs on Uncontaminated Math Competitions cs.AI · 2025-05-29 · unverdicted · none · ref 33 · internal anchor
MathArena evaluates over 50 LLMs on 162 fresh competition problems across seven contests, detects contamination in AIME 2024, and reports top models scoring below 40 percent on IMO 2025 proof tasks.
PRIMETIME: Limits of LLMs in Temporal Primitives cs.NE · 2025-04-22 · unverdicted · none · ref 26 · internal anchor
PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.
Unified Reward Model for Multimodal Understanding and Generation cs.CV · 2025-03-07 · unverdicted · none · ref 45 · internal anchor
UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.
Consistent and Distinctive: LLM Benchmark Efficiency via Maximum Independent Set Prompt Selection on Similarity Graphs cs.CL · 2026-05-31 · unverdicted · none · ref 30 · internal anchor
A graph-based MIS prompt selection method on embedding similarity graphs yields reduced benchmark subsets with highly consistent LLM rankings (Kendall's W ≥ 0.90 in 99.2% of cases) and 25-48% size reduction at higher thresholds.
SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations cs.CL · 2026-05-21 · unverdicted · none · ref 35 · internal anchor
SynAE is a multi-metric framework that evaluates how well synthetic benchmarks replicate real data characteristics for multi-turn tool-calling agent testing.
ShapeCodeBench: A Renewable Benchmark for Perception-to-Program Reconstruction of Synthetic Shape Scenes cs.CV · 2026-05-12 · accept · none · ref 17 · internal anchor
ShapeCodeBench introduces a renewable benchmark for perception-to-program reconstruction of synthetic shapes, with evaluations showing low exact-match performance from current models and heuristics.
Decision-aware User Simulation Agent for Evaluating Conversational Recommender Systems cs.IR · 2026-05-05 · unverdicted · none · ref 15 · internal anchor
Hesitator is a theory-grounded simulator that separates utility-based item selection from overload-aware commitment decisions to reduce unrealistic high acceptance rates in conversational recommender evaluations.
You Don't Need Public Tests to Generate Correct Code cs.SE · 2026-04-23 · unverdicted · none · ref 21 · internal anchor
DryRUN lets LLMs create their own test inputs and run internal simulations for self-correcting code generation, matching the performance of test-dependent methods like CodeSIM on LiveCodeBench without public tests or external signals.
LLMs for Qualitative Data Analysis Fail on Security-specific Comments in Human Experiments cs.SE · 2026-04-12 · unverdicted · none · ref 55 · internal anchor
LLMs improve with detailed code descriptions but remain insufficient to replace human annotators for security-specific qualitative coding.
Babbling Suppression: Making LLMs Greener One Token at a Time cs.SE · 2026-04-08 · unverdicted · none · ref 45 · internal anchor
Babbling Suppression stops LLM code generation upon test passage to reduce token output and energy consumption by up to 65% across Python and Java benchmarks.
Fighting AI with AI: AI-Agent Augmented DNS Blocking of LLM Services during Student Evaluations cs.NI · 2026-03-20 · unverdicted · none · ref 10 · internal anchor
AI-Sinkhole uses AI classification with quantized LLMs and Pi-Hole DNS blocking to dynamically prevent access to LLM services during student evaluations, reporting F1 scores above 0.83.
Kimi Linear: An Expressive, Efficient Attention Architecture cs.CL · 2025-10-30 · unverdicted · none · ref 108 · internal anchor
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? cs.SE · 2025-09-21 · conditional · none · ref 13 · internal anchor
SWE-Bench Pro is a new benchmark with 1,865 long-horizon tasks from 41 repositories designed to evaluate AI agents on realistic enterprise-level software engineering problems beyond prior benchmarks.
EMERGE: A Benchmark for Updating Knowledge Graphs with Emerging Textual Knowledge cs.CL · 2025-07-04 · accept · none · ref 69 · internal anchor
EMERGE is a benchmark dataset of 233K Wikipedia passages paired with 1.45 million Wikidata edit operations across seven yearly snapshots from 2019 to 2025 for evaluating knowledge graph updates from emerging text.
Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource cs.CL · 2025-06-13 · conditional · none · ref 40 · internal anchor
MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.
Process Reinforcement through Implicit Rewards cs.LG · 2025-02-03 · conditional · none · ref 128 · internal anchor
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.
Qwen2.5-1M Technical Report cs.CL · 2025-01-26 · accept · none · ref 20 · internal anchor
Qwen2.5-1M models reach 1M token context with improved long-context performance, no short-context loss, and 3-7x prefill speedup via open inference optimizations.
Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces cs.CL · 2026-06-05 · unverdicted · none · ref 118 · internal anchor
Reasoning in large output spaces proceeds via shortlisting then fine-grained reasoning; this characterization enables a mechanistic distillation strategy that outperforms standard distillation.
CodeGolf Bench: A Multi-Language Benchmark for Evaluating Concise Code Generation Capabilities of Large Language Models cs.SE · 2026-05-28 · unverdicted · none · ref 13 · internal anchor
CodeGolf Bench is a dynamic benchmark for LLM concise code generation in 60 languages, showing reasoning models reach 70.97% average human percentile on Python and C++ tasks while non-reasoning models lag.
PTCG-Bench: Can LLM Agents Master Pokémon Trading Card Game? cs.AI · 2026-05-28 · unverdicted · none · ref 4 · internal anchor
PTCG-Bench shows LLM agents reach non-trivial PTCG performance but struggle with sustained self-evolution and remain sensitive to harness design.
Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs cs.CL · 2026-05-08 · conditional · none · ref 60 · 2 links · internal anchor
EngGPT2MoE-16B-A3B matches or exceeds other Italian open-source LLMs on most international benchmarks while remaining competitive on ITALIC, though it trails some top international models.
Language models fail at extended rule following cs.CL · 2026-05-03 · unverdicted · none · ref 6 · 2 links · internal anchor
LLMs fail at extended counting of repeated characters due to finite internal states, with abrupt errors persisting across model scales and inference methods.
A Large Language Model Based Pipeline for Review of Systems Entity Recognition from Clinical Notes cs.CL · 2025-05-31 · unverdicted · none · ref 41 · internal anchor
LLM pipeline with novel attribution algorithm extracts ROS entities, negation status, and body systems from 24 clinical notes at up to 0.952 F1 using open-source models.
Qwen3 Technical Report cs.CL · 2025-05-14 · unverdicted · none · ref 36 · internal anchor
SCAN: Structured Capability Assessment and Navigation for LLMs cs.CL · 2025-05-10 · unverdicted · none · ref 5 · internal anchor
SCAN is a framework for fine-grained LLM capability assessment via automatic taxonomy construction from queries, query synthesis for coverage, visualization tools, and a PC2-enhanced LLM-as-a-judge method, applied to 21 models showing intra-family variations.
Distribution-Free Uncertainty Quantification for Continuous AI Agent Evaluation cs.AI · 2026-05-19 · unverdicted · none · ref 14 · internal anchor
Adapts conformal prediction methods to provide distribution-free uncertainty quantification and coverage guarantees for continuous evaluation of AI agent quality scores.
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence cs.AI · 2025-07-28 · accept · none · ref 145 · internal anchor
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
Gemma 3 Technical Report cs.CL · 2025-03-25 · accept · none · ref 44 · internal anchor
Gemma 3 introduces multimodal open models with architectural changes for efficient long context, trained via distillation and a new post-training recipe that makes the 4B version competitive with prior 27B models and the 27B version comparable to Gemini-1.5-Pro.
Beyond the Singular: Revealing the Value of Multiple Generations in Benchmark Evaluation cs.CL · 2025-02-13 · unverdicted · none · ref 3 · internal anchor
A hierarchical statistical model demonstrates that multiple LLM generations per prompt improve benchmark score accuracy, reduce variance, and enable prompt-level difficulty scoring via correct ratios.
