BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices

Hardy, Amelia; Hardy, Malcolm; Kochenderfer, Mykel; Lamparth, Max; Reuel, Anka; Smith, Chandler · DOI 10.52202/079017-0685

3 Pith papers cite this work.

Representative citing papers

A Tale of Two Variances: When Single-Seed Benchmarks Fail in Bayesian Deep Learning

cs.LG · 2026-04-25 · unverdicted · novelty 5.0 · ref 17

Single-seed CRPS estimates in limited-data Bayesian deep learning (BDL) show high variance, with peaks for heteroscedastic methods; local variance correlates above 0.96 with single-seed error.

Intelligent Automation for Embodied Benchmark Construction: Pipelines, Embodiments, Simulators, and Trends

cs.RO · 2026-06-10 · unverdicted · novelty 3.0 · ref 70

Automation in embodied benchmark construction shifts costs from acquisition toward validation, auditability, version control, and long-term governance rather than simply lowering total cost.

Position: State-of-the-Art Claims Require State-of-the-Art Evidence

cs.LG · 2026-05-17 · unreviewed · ref 64