hub Canonical reference

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain · 2024 · cs.CL · arXiv 2406.19314

Canonical reference. 78% of citing Pith papers cite this work as background.

53 Pith papers citing it

Background 78% of classified citations

open full Pith review browse 53 citing papers arXiv PDF

abstract

Test set contamination, wherein test data from a benchmark ends up in a newer model's training set, is a well-documented obstacle for fair LLM evaluation and can quickly render benchmarks obsolete. To mitigate this, many recent benchmarks crowdsource new prompts and evaluations from human or LLM judges; however, these can introduce significant biases, and break down when scoring hard questions. In this work, we introduce a new benchmark for LLMs designed to be resistant to both test set contamination and the pitfalls of LLM judging and human crowdsourcing. We release LiveBench, the first benchmark that (1) contains frequently-updated questions from recent information sources, (2) scores answers automatically according to objective ground-truth values, and (3) contains a wide variety of challenging tasks, spanning math, coding, reasoning, language, instruction following, and data analysis. To achieve this, LiveBench contains questions that are based on recently-released math competitions, arXiv papers, news articles, and datasets, and it contains harder, contamination-limited versions of tasks from previous benchmarks such as Big-Bench Hard, AMPS, and IFEval. We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 405B in size. LiveBench is difficult, with top models achieving below 70% accuracy. We release all questions, code, and model answers. Questions are added and updated on a monthly basis, and we release new tasks and harder versions of tasks over time so that LiveBench can distinguish between the capabilities of LLMs as they improve in the future. We welcome community engagement and collaboration for expanding the benchmark tasks and models.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 6 dataset 3

citation-polarity summary

background 7 use dataset 2

representative citing papers

MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.

TabArena: A Living Benchmark for Machine Learning on Tabular Data

cs.LG · 2025-06-20 · conditional · novelty 8.0

TabArena launches a dynamic, updatable benchmarking system for tabular ML that shows boosted trees remain competitive, deep learning matches them under larger budgets with ensembling, foundation models excel on small data, and cross-model ensembles advance SOTA while flagging validation overfitting.

Meta-Benchmarks for Financial-Services LLM Evaluation

cs.AI · 2026-07-02 · unverdicted · novelty 7.0

A meta-benchmarking framework organizes 452 LLM benchmarks into 41 O*NET Generalized Work Activities and 38 BIAN domains, using discrimination-coverage-recency weights to scale K-factors in an Elo tournament for comparable financial-services scores.

AlgoBench: Benchmarking Algorithmic Adaptation in Code Generation

cs.SE · 2026-06-30 · unverdicted · novelty 7.0

AlgoBench creates traceable variants of competitive programming problems via constraint shifts that invalidate original algorithms, paired with complexity metrics that reveal LLMs often produce functionally correct but asymptotically unsuitable solutions.

Power Systems Agent Benchmark: Executable Evaluation of AI Agents in Electric Power Engineering

cs.AI · 2026-06-18 · unverdicted · novelty 7.0

Introduces the Power Systems Agent Benchmark with 41 task families across eight power engineering areas for executable evaluation of AI agents using deterministic feasibility checks.

Flaws in the LLM Automation Narrative

stat.OT · 2026-06-09 · unverdicted · novelty 7.0

A new code-writing data analysis benchmark shows human experts outperforming a frontier LLM on average with lower performance variance.

Unified Energy for Invariant and Independent Decoding in Diffusion Language Models

cs.CL · 2026-06-08 · unverdicted · novelty 7.0

The paper introduces Uni-E, a unified energy for DLMs that accounts for model capacity, dependency and invariance, can be computed exactly, and corrects distribution shifts from dependency and invariance.

CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks

cs.CL · 2026-06-02 · conditional · novelty 7.0

CoEval generates task-specific benchmarks by rotating models through teacher, student, and judge roles, then weights questions by discriminative power and judges by panel consensus to recover accurate model rankings without labels.

LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

LiveBrowseComp shows search agents rely on intrinsic knowledge on standard benchmarks, with scores dropping 25-40 points and closed-book accuracy below 2% on questions about facts from the prior 90 days.

Forecasting Scientific Progress with Artificial Intelligence

cs.AI · 2026-05-21 · unverdicted · novelty 7.0

Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and insensitive to training cutoffs.

Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Presents a likelihood-based benchmark for equation-suffix prediction in technical papers with controls to detect shortcut vulnerabilities in model forecasts.

Re$^2$Math: Benchmarking Theorem Retrieval in Research-Level Mathematics

cs.AI · 2026-05-09 · unverdicted · novelty 7.0

Re²Math is a new benchmark that evaluates AI models on retrieving and verifying the applicability of theorems from math literature to advance steps in partial proofs, accepting any sufficient theorem while controlling for leakage.

FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks

cs.AI · 2026-04-11 · unverdicted · novelty 7.0

FinTrace supplies trajectory-level metrics for LLM financial tool calling, exposing gaps in information use and output quality, while its preference dataset enables DPO training that boosts intermediate metrics.

DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math?

cs.AI · 2026-04-10 · unverdicted · novelty 7.0

DRBENCHER generates multi-hop questions across biochemistry, finance, geophysics, security, and history that test interleaved browsing and computation, where the strongest models reach only 20% accuracy and human validation finds 76% validity.

Assessing Large Language Models for Stabilizing Numerical Expressions in Scientific Software

cs.SE · 2026-04-06 · conditional · novelty 7.0

LLMs match or exceed state-of-the-art traditional methods for stabilizing numerical expressions in scientific software, succeeding on 97.9% of expressions where baselines fail to improve accuracy, but struggle with control flow and high-precision literals.

MathArena: Evaluating LLMs on Uncontaminated Math Competitions

cs.AI · 2025-05-29 · unverdicted · novelty 7.0

MathArena evaluates over 50 LLMs on 162 fresh competition problems across seven contests, detects contamination in AIME 2024, and reports top models scoring below 40 percent on IMO 2025 proof tasks.

PRIMETIME : Limits of LLMs in Temporal Primitives

cs.NE · 2025-04-22 · unverdicted · novelty 7.0

PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.

Unified Reward Model for Multimodal Understanding and Generation

cs.CV · 2025-03-07 · unverdicted · novelty 7.0

UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.

EduArt: An educational-level benchmark for evaluating art history knowledge in large language models

cs.CL · 2026-07-02 · unverdicted · novelty 6.0

EduArt is a new benchmark of 871 educational questions that reveals multimodal LLMs perform near ceiling on multiple-choice art history items but drop sharply on open completion and error identification tasks.

REKEY: Metadata-Grounded Visual-Key Regeneration for Contamination-Resilient VQA Evaluation

cs.CV · 2026-06-17 · unverdicted · novelty 6.0

ReKey introduces a live benchmark protocol that regenerates visual keys in images to produce contamination-resilient VQA evaluations, showing 9.5-18.8 point higher scores on original items across eight VLMs.

Beyond Fully Random Masking: Attention-Guided Denoising and Optimization for Diffusion Language Models

cs.CL · 2026-06-10 · unverdicted · novelty 6.0

AGDO improves dLLM reasoning performance by determining denoising order and emphasizing tokens based on attention-derived dependencies rather than random masking.

Consistent and Distinctive: LLM Benchmark Efficiency via Maximum Independent Set Prompt Selection on Similarity Graphs

cs.CL · 2026-05-31 · unverdicted · novelty 6.0

A graph-based MIS prompt selection method on embedding similarity graphs yields reduced benchmark subsets with highly consistent LLM rankings (Kendall's W ≥ 0.90 in 99.2% of cases) and 25-48% size reduction at higher thresholds.

TRACER: A Semantic-Aware Framework for Fine-Grained Contamination Detection in Code LLMs

cs.SE · 2026-05-22 · unverdicted · novelty 6.0

TRACER presents a semantic-aware framework and the first benchmark for fine-grained code contamination detection across three levels of overlap, reporting F1 scores of 0.91-0.92 and large gains over prior methods.

SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations

cs.CL · 2026-05-21 · unverdicted · novelty 6.0

SynAE is a multi-metric framework that evaluates how well synthetic benchmarks replicate real data characteristics for multi-turn tool-calling agent testing.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Flaws in the LLM Automation Narrative stat.OT · 2026-06-09 · unverdicted · none · ref 15 · internal anchor
A new code-writing data analysis benchmark shows human experts outperforming a frontier LLM on average with lower performance variance.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer