hub

Dynabench: Rethinking benchmarking in NLP

Kiela, Douwe, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, et al · 2021 · arXiv 2104.14337

16 Pith papers cite this work. Polarity classification is still indexing.

16 Pith papers citing it

read on arXiv browse 16 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

Meta-Benchmarks for Financial-Services LLM Evaluation

cs.AI · 2026-07-02 · unverdicted · novelty 7.0

A meta-benchmarking framework organizes 452 LLM benchmarks into 41 O*NET Generalized Work Activities and 38 BIAN domains, using discrimination-coverage-recency weights to scale K-factors in an Elo tournament for comparable financial-services scores.

CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks

cs.CL · 2026-06-02 · conditional · novelty 7.0

CoEval generates task-specific benchmarks by rotating models through teacher, student, and judge roles, then weights questions by discriminative power and judges by panel consensus to recover accurate model rankings without labels.

Agent Island: A Saturation- and Contamination-Resistant Benchmark from Multiagent Games

cs.AI · 2026-05-05 · unverdicted · novelty 7.0

Agent Island is a new multiagent game environment that functions as a dynamic benchmark resistant to saturation and contamination, with Bayesian ranking showing OpenAI GPT-5.5 as the strongest performer among 49 models across 999 games.

TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation

cs.CV · 2026-04-29 · accept · novelty 7.0

TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.

Analysis and Explainability of LLMs Via Evolutionary Methods

cs.NE · 2026-04-27 · unverdicted · novelty 7.0

Evolutionary trees from LLM weights recover ground-truth training topologies and identify key datasets and layers through phenotypic analysis.

CT Open: An Open-Access, Uncontaminated, Live Platform for the Open Challenge of Clinical Trial Outcome Prediction

cs.AI · 2026-04-17 · accept · novelty 7.0

CT Open is a new live platform with an automated LLM-powered decontamination pipeline that supplies uncontaminated benchmarks for predicting clinical trial outcomes.

Open-World Evaluations for Measuring Frontier AI Capabilities

cs.AI · 2026-05-19 · conditional · novelty 6.0

Open-world evaluations using qualitative review of real-world tasks can give earlier warnings of frontier AI capabilities than automated benchmarks, as demonstrated by an AI agent publishing a simple iOS app with one minor human fix.

Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.

QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks

cs.CL · 2026-04-20 · unverdicted · novelty 6.0

QuickScope uses modified COUP Bayesian optimization to find truly difficult questions in dynamic LLM benchmarks more sample-efficiently than baselines while cutting false positives.

RoboPlayground: Democratizing Robotic Evaluation through Structured Physical Domains

cs.RO · 2026-04-06 · unverdicted · novelty 6.0

RoboPlayground reframes robotic manipulation evaluation as a language-driven process over structured physical domains, letting users author varied yet reproducible tasks that reveal policy generalization failures.

Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI

cs.CL · 2026-03-16 · unverdicted · novelty 6.0

Defines agentic trustworthiness via five properties and proposes HAAF, a scenario-distribution framework with a Trustworthy Optimization Factory that transfers interventions across 13 models from seven families on a 100-scenario suite.

Inflated Excellence or True Performance? Rethinking Medical Diagnostic Benchmarks with Dynamic Evaluation

cs.CL · 2025-10-10 · conditional · novelty 6.0

DyReMe generates fresh, confounded medical consultation cases to expose weaknesses in state-of-the-art LLMs across accuracy, veracity, helpfulness, and consistency beyond static benchmarks.

League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models

cs.AI · 2025-07-30 · unverdicted · novelty 6.0

League of LLMs organizes LLMs into a self-governed mutual evaluation league using dynamic, transparent, objective, and professional criteria to distinguish model capabilities with 70.7% top-k ranking stability.

Interactive Evaluation Requires a Design Science

cs.AI · 2026-05-18 · unverdicted · novelty 5.0

Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axis taxonomy and design principles.

Humanity's Last Exam

cs.LG · 2025-01-24 · unverdicted · novelty 5.0

Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.

Too long; didn't solve

cs.AI · 2026-04-08

citing papers explorer

Showing 16 of 16 citing papers.

Meta-Benchmarks for Financial-Services LLM Evaluation cs.AI · 2026-07-02 · unverdicted · none · ref 12
A meta-benchmarking framework organizes 452 LLM benchmarks into 41 O*NET Generalized Work Activities and 38 BIAN domains, using discrimination-coverage-recency weights to scale K-factors in an Elo tournament for comparable financial-services scores.
CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks cs.CL · 2026-06-02 · conditional · none · ref 14
CoEval generates task-specific benchmarks by rotating models through teacher, student, and judge roles, then weights questions by discriminative power and judges by panel consensus to recover accurate model rankings without labels.
Agent Island: A Saturation- and Contamination-Resistant Benchmark from Multiagent Games cs.AI · 2026-05-05 · unverdicted · none · ref 17
Agent Island is a new multiagent game environment that functions as a dynamic benchmark resistant to saturation and contamination, with Bayesian ranking showing OpenAI GPT-5.5 as the strongest performer among 49 models across 999 games.
TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation cs.CV · 2026-04-29 · accept · none · ref 21
TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.
Analysis and Explainability of LLMs Via Evolutionary Methods cs.NE · 2026-04-27 · unverdicted · none · ref 17
Evolutionary trees from LLM weights recover ground-truth training topologies and identify key datasets and layers through phenotypic analysis.
CT Open: An Open-Access, Uncontaminated, Live Platform for the Open Challenge of Clinical Trial Outcome Prediction cs.AI · 2026-04-17 · accept · none · ref 3
CT Open is a new live platform with an automated LLM-powered decontamination pipeline that supplies uncontaminated benchmarks for predicting clinical trial outcomes.
Open-World Evaluations for Measuring Frontier AI Capabilities cs.AI · 2026-05-19 · conditional · none · ref 6
Open-world evaluations using qualitative review of real-world tasks can give earlier warnings of frontier AI capabilities than automated benchmarks, as demonstrated by an AI agent publishing a simple iOS app with one minor human fix.
Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks cs.AI · 2026-05-11 · unverdicted · none · ref 13
Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.
QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks cs.CL · 2026-04-20 · unverdicted · none · ref 20
QuickScope uses modified COUP Bayesian optimization to find truly difficult questions in dynamic LLM benchmarks more sample-efficiently than baselines while cutting false positives.
RoboPlayground: Democratizing Robotic Evaluation through Structured Physical Domains cs.RO · 2026-04-06 · unverdicted · none · ref 15
RoboPlayground reframes robotic manipulation evaluation as a language-driven process over structured physical domains, letting users author varied yet reproducible tasks that reveal policy generalization failures.
Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI cs.CL · 2026-03-16 · unverdicted · none · ref 7
Defines agentic trustworthiness via five properties and proposes HAAF, a scenario-distribution framework with a Trustworthy Optimization Factory that transfers interventions across 13 models from seven families on a 100-scenario suite.
Inflated Excellence or True Performance? Rethinking Medical Diagnostic Benchmarks with Dynamic Evaluation cs.CL · 2025-10-10 · conditional · none · ref 3
DyReMe generates fresh, confounded medical consultation cases to expose weaknesses in state-of-the-art LLMs across accuracy, veracity, helpfulness, and consistency beyond static benchmarks.
League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models cs.AI · 2025-07-30 · unverdicted · none · ref 25
League of LLMs organizes LLMs into a self-governed mutual evaluation league using dynamic, transparent, objective, and professional criteria to distinguish model capabilities with 70.7% top-k ranking stability.
Interactive Evaluation Requires a Design Science cs.AI · 2026-05-18 · unverdicted · none · ref 28
Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axis taxonomy and design principles.
Humanity's Last Exam cs.LG · 2025-01-24 · unverdicted · none · ref 29
Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
Too long; didn't solve cs.AI · 2026-04-08 · unreviewed · ref 4

Dynabench: Rethinking benchmarking in NLP

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer