Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, and Arman Cohan
22 Pith papers cite this work.
citation-role summary: background 3
citation-polarity summary: background 3
citing papers explorer
- Pooled Leaderboards Hide System-Specific Winners: A Reporting-Protocol Audit of Offline Root-Cause Analysis Benchmarks
Pooled top-1 accuracy rankings in RCA benchmarks do not reliably identify per-subsystem winners, as pairwise comparisons across 11 subsystems show effects of both signs and leave-one-system-out selection incurs regret up to 24.8 pp.
- Matter to Mechanism: A Benchmark for AI Co-Scientists in Materials and Battery Research
Introduces the Matter to Mechanism benchmark of 2,645 structured instances and a composite metric suite for evaluating AI co-scientists on problem-to-hypothesis reasoning in battery materials research.
- How Hard is it to Rig a Benchmark? A Social Choice Analysis of Leaderboard Robustness
Benchmark-specific training maps to shift bribery and is NP-hard under Borda and mean win rate; mean win rate has the highest instance-level robustness (median 22 tasks on BBH) among tested aggregation rules.
- Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.
- Validity Threats for Foundation Model Research
Maps common low-compute research strategies for foundation models onto statistical, internal, external, and construct validity threats via a causal-inference lens.
- Welfare, Improvability, and Variance: A Principal-Agent Approach to Optimal Benchmark Item Aggregation
Models benchmarking as a principal-agent game, derives welfare loss from welfare alignment, improvability, and variance, and applies an audit framework to OLMES items.
- On Effectiveness and Efficiency of Agentic Tool-calling and RL Training
Tool-calling evaluations for LLM agents are highly sensitive to implementation details such as random seeds and history handling, and two new techniques accelerate RL training with wall-clock speedup and no performance degradation.
- Towards Evaluation Engineering: An Empirical Study of ML Evaluation Harnesses in the Wild
An empirical study of 57 ML evaluation harnesses shows 41.4% of operational issues occur in the specification stage, driven mainly by unimplemented features, documentation gaps, and missing input validation.
- Are Sparse Autoencoder Benchmarks Reliable?
An audit of SAEBench reveals that Targeted Probe Perturbation and Spurious Correlation Removal metrics fail reliability tests and should not be used to evaluate sparse autoencoders.
- Neural Fields for NV-Center Inverse Sensing
NeTMY neural fields with annealed encoding, multiscale optimization, and spectrum-fidelity losses achieve superior localization and distributional accuracy in NV-center inverse sensing by using a tensor power-summed dipolar operator that exposes and mitigates center-collapse failures.
- No One Knows the State of the Art in Geospatial Foundation Models
An audit of 152 papers reveals that geospatial foundation models lack standardized evaluations, training controls, and weight releases, so no one knows the state of the art.
- From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World
A practical evaluation protocol for AI pentesting agents that uses validated vulnerability discovery, LLM semantic matching, and bipartite scoring to assess performance in realistic, complex targets.
- Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability
Different valid temporal partitions of the same streaming dataset can produce materially different rankings and performance numbers for continual learning methods.
- On the Opportunities and Risks of Foundation Models
Foundation models are large adaptable AI systems with emergent capabilities that offer broad opportunities but carry risks from homogenization, opacity, and inherited defects across downstream applications.
- Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents
Aggregate leaderboards for LLM agents lack predictive validity for out-of-distribution settings, and the paper proposes ranking by in-sample to out-of-sample rank correlation instead of mean score.
- Cross-Layer Subspace Coupling for LLM Compression: A Unifying Framework and Its Empirical Limits
Unifying cross-layer SVD compression for LLMs improves weight reconstruction error by up to 46% on Pythia models but causes severe degradation in perplexity and accuracy due to residual stream decoupling.
- Rethinking FID Through the Geometry of the Reference Dataset
FID improves with better samples only on concentrated reference datasets but can worsen on dispersed ones, as shown by density and effective rank in a controlled study across six datasets.
- Higher Resolution, Better Generalization: Unlocking Visual Scaling in Deep Reinforcement Learning
Higher-resolution observations with global-average-pooling encoders improve RL performance and generalization by enabling more localized visual attention, yielding up to 28% gains over standard Impala encoders.
- Unstable Rankings in Bayesian Deep Learning Evaluation
Bayesian deep learning method rankings are unstable at small sample sizes, dataset-dependent, and require uncertainty-aware evaluation using hierarchical models and minimum detectable difference curves.
- Benchmarking on Tasks That Matter: Dataset Selection for Preserving Model Rankings
Framework for dataset subset selection via clustering, A/D-optimality, and FAFI with bootstrap intervals to preserve model rankings, showing high Spearman correlation (0.95 with 5 datasets) in TSC but limited gains in recommender systems.
- Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering
Coding benchmarks misalign with agentic software engineering because they conflate model and harness, grade against single references, and provide no component-level iteration signals.
- Measuring AI Reasoning: A Guide for Researchers
Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.