Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, and Arman Cohan

The benchmark lottery · 2021 · arXiv 2107.07002

22 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Pooled Leaderboards Hide System-Specific Winners: A Reporting-Protocol Audit of Offline Root-Cause Analysis Benchmarks

cs.AI · 2026-06-28 · unverdicted · novelty 7.0

Pooled top-1 accuracy rankings in RCA benchmarks do not reliably identify per-subsystem winners, as pairwise comparisons across 11 subsystems show effects of both signs and leave-one-system-out selection incurs regret up to 24.8 pp.

Matter to Mechanism: A Benchmark for AI Co-Scientists in Materials and Battery Research

cs.CE · 2026-06-01 · unverdicted · novelty 7.0

Introduces the Matter to Mechanism benchmark of 2,645 structured instances and a composite metric suite for evaluating AI co-scientists on problem-to-hypothesis reasoning in battery materials research.

How Hard is it to Rig a Benchmark? A Social Choice Analysis of Leaderboard Robustness

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

Benchmark-specific training maps to shift bribery and is NP-hard under Borda and mean win rate; mean win rate has the highest instance-level robustness (median 22 tasks on BBH) among tested aggregation rules.

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

cs.AI · 2026-05-12 · conditional · novelty 7.0

BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.

Validity Threats for Foundation Model Research

cs.LG · 2026-06-03 · accept · novelty 6.0

Maps common low-compute research strategies for foundation models onto statistical, internal, external, and construct validity threats via a causal-inference lens.

Welfare, Improvability, and Variance: A Principal-Agent Approach to Optimal Benchmark Item Aggregation

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

Models benchmarking as a principal-agent game, derives welfare loss from welfare alignment, improvability, and variance, and applies an audit framework to OLMES items.

On Effectiveness and Efficiency of Agentic Tool-calling and RL Training

cs.LG · 2026-05-28 · unverdicted · novelty 6.0

Tool-calling evaluations for LLM agents are highly sensitive to implementation details such as random seeds and history handling, and two new techniques accelerate RL training with wall-clock speedup and no performance degradation.

Towards Evaluation Engineering: An Empirical Study of ML Evaluation Harnesses in the Wild

cs.SE · 2026-05-22 · unverdicted · novelty 6.0

An empirical study of 57 ML evaluation harnesses shows 41.4% of operational issues occur in the specification stage, driven mainly by unimplemented features, documentation gaps, and missing input validation.

Are Sparse Autoencoder Benchmarks Reliable?

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

An audit of SAEBench reveals that Targeted Probe Perturbation and Spurious Correlation Removal metrics fail reliability tests and should not be used to evaluate sparse autoencoders.

Neural Fields for NV-Center Inverse Sensing

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

NeTMY neural fields with annealed encoding, multiscale optimization, and spectrum-fidelity losses achieve superior localization and distributional accuracy in NV-center inverse sensing by using a tensor power-summed dipolar operator that exposes and mitigates center-collapse failures.

No One Knows the State of the Art in Geospatial Foundation Models

cs.CV · 2026-05-12 · accept · novelty 6.0

An audit of 152 papers reveals that geospatial foundation models lack standardized evaluations, training controls, and weight releases, so no one knows the state of the art.

From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

A practical evaluation protocol for AI pentesting agents that uses validated vulnerability discovery, LLM semantic matching, and bipartite scoring to assess performance in realistic, complex targets.

Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability

cs.LG · 2026-04-23 · conditional · novelty 6.0

Different valid temporal partitions of the same streaming dataset can produce materially different rankings and performance numbers for continual learning methods.

On the Opportunities and Risks of Foundation Models

cs.LG · 2021-08-16 · accept · novelty 6.0

Foundation models are large adaptable AI systems with emergent capabilities that offer broad opportunities but carry risks from homogenization, opacity, and inherited defects across downstream applications.

Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents

cs.AI · 2026-06-18 · unverdicted · novelty 5.0

Aggregate leaderboards for LLM agents lack predictive validity for out-of-distribution settings, and the paper proposes ranking by in-sample to out-of-sample rank correlation instead of mean score.

Cross-Layer Subspace Coupling for LLM Compression: A Unifying Framework and Its Empirical Limits

cs.LG · 2026-05-29 · unverdicted · novelty 5.0

Unifying cross-layer SVD compression for LLMs improves weight reconstruction error by up to 46% on Pythia models but causes severe degradation in perplexity and accuracy due to residual stream decoupling.

Rethinking FID Through the Geometry of the Reference Dataset

cs.CV · 2026-05-28 · unverdicted · novelty 5.0

FID improves with better samples only on concentrated reference datasets but can worsen on dispersed ones, as shown by density and effective rank in a controlled study across six datasets.

Higher Resolution, Better Generalization: Unlocking Visual Scaling in Deep Reinforcement Learning

cs.LG · 2026-05-11 · unverdicted · novelty 5.0

Higher-resolution observations with global-average-pooling encoders improve RL performance and generalization by enabling more localized visual attention, yielding up to 28% gains over standard Impala encoders.

Unstable Rankings in Bayesian Deep Learning Evaluation

cs.LG · 2026-04-25 · unverdicted · novelty 5.0

Bayesian deep learning method rankings are unstable at small sample sizes, dataset-dependent, and require uncertainty-aware evaluation using hierarchical models and minimum detectable difference curves.

Benchmarking on Tasks That Matter: Dataset Selection for Preserving Model Rankings

cs.LG · 2026-06-26 · unverdicted · novelty 4.0

Framework for dataset subset selection via clustering, A/D-optimality, and FAFI with bootstrap intervals to preserve model rankings, showing high Spearman correlation (0.95 with 5 datasets) in TSC but limited gains in recommender systems.

Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering

cs.SE · 2026-06-16 · unverdicted · novelty 4.0

Coding benchmarks misalign with agentic software engineering because they conflate model and harness, grade against single references, and provide no component-level iteration signals.

Measuring AI Reasoning: A Guide for Researchers

cs.AI · 2026-05-04 · unverdicted · novelty 4.0

Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.

citing papers explorer

Showing 22 of 22 citing papers.

Pooled Leaderboards Hide System-Specific Winners: A Reporting-Protocol Audit of Offline Root-Cause Analysis Benchmarks · cs.AI · 2026-06-28 · unverdicted · none · ref 5
Matter to Mechanism: A Benchmark for AI Co-Scientists in Materials and Battery Research · cs.CE · 2026-06-01 · unverdicted · none · ref 44
How Hard is it to Rig a Benchmark? A Social Choice Analysis of Leaderboard Robustness · cs.LG · 2026-05-22 · unverdicted · none · ref 4
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack · cs.AI · 2026-05-12 · conditional · none · ref 14
Validity Threats for Foundation Model Research · cs.LG · 2026-06-03 · accept · none · ref 20
Welfare, Improvability, and Variance: A Principal-Agent Approach to Optimal Benchmark Item Aggregation · cs.LG · 2026-05-29 · unverdicted · none · ref 13
On Effectiveness and Efficiency of Agentic Tool-calling and RL Training · cs.LG · 2026-05-28 · unverdicted · none · ref 23
Towards Evaluation Engineering: An Empirical Study of ML Evaluation Harnesses in the Wild · cs.SE · 2026-05-22 · unverdicted · none · ref 14
Are Sparse Autoencoder Benchmarks Reliable? · cs.LG · 2026-05-18 · unverdicted · none · ref 10
Neural Fields for NV-Center Inverse Sensing · cs.LG · 2026-05-13 · unverdicted · none · ref 16
No One Knows the State of the Art in Geospatial Foundation Models · cs.CV · 2026-05-12 · accept · none · ref 17
From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World · cs.AI · 2026-05-11 · unverdicted · none · ref 7
Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability · cs.LG · 2026-04-23 · conditional · none · ref 47
On the Opportunities and Risks of Foundation Models · cs.LG · 2021-08-16 · accept · none · ref 5
Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents · cs.AI · 2026-06-18 · unverdicted · none · ref 32
Cross-Layer Subspace Coupling for LLM Compression: A Unifying Framework and Its Empirical Limits · cs.LG · 2026-05-29 · unverdicted · none · ref 1
Rethinking FID Through the Geometry of the Reference Dataset · cs.CV · 2026-05-28 · unverdicted · none · ref 3
Higher Resolution, Better Generalization: Unlocking Visual Scaling in Deep Reinforcement Learning · cs.LG · 2026-05-11 · unverdicted · none · ref 5
Unstable Rankings in Bayesian Deep Learning Evaluation · cs.LG · 2026-04-25 · unverdicted · none · ref 4
Benchmarking on Tasks That Matter: Dataset Selection for Preserving Model Rankings · cs.LG · 2026-06-26 · unverdicted · none · ref 11
Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering · cs.SE · 2026-06-16 · unverdicted · none · ref 11
Measuring AI Reasoning: A Guide for Researchers · cs.AI · 2026-05-04 · unverdicted · none · ref 149
