
arxiv: 2605.30329 · v1 · pith:JZKWUJSSnew · submitted 2026-05-28 · 💻 cs.LG

SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?

keywords: research, soundnessbench, benchmark, false, llms, models, prompting, proposals

Autonomous AI research agents aim to accelerate scientific discovery by automating the research pipeline, from hypothesis generation to peer review. However, existing benchmarks rarely test a fundamental bottleneck: whether Large Language Models can judge the methodological viability of a research idea before expending time and computational resources. We introduce SoundnessBench, a curated benchmark of 1,099 machine-learning research proposals reconstructed from ICLR submissions, labeled with reviewer soundness sub-scores, and audited against source papers. SoundnessBench should be interpreted as a benchmark for recoverable proposal-stage soundness rather than exact prediction of full-paper review outcomes. Across 12 frontier LLMs, we find a pervasive optimism bias: under standard prompting, models frequently rate low-soundness proposals as sound, while aggressive prompting largely shifts errors from false positives to false negatives. Additional controls for public-corpus contamination, paper-identifying phrases, surface features, and human audit quality suggest that this behavior is not explained by a single confounder. Our results indicate that current LLMs are not yet reliable as standalone first-gate evaluators for scientific rigor.
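The error shift described in the abstract can be made concrete with a small sketch. The function below computes false-positive and false-negative rates for binary soundness judgments. The helper name `error_rates`, the binary encoding of soundness, and the toy data are all assumptions made for illustration; none of them come from the paper or the benchmark itself.

```python
def error_rates(is_sound, predicted_sound):
    """Return (false_positive_rate, false_negative_rate).

    is_sound: ground-truth booleans (True = proposal is sound).
    predicted_sound: model judgments, same encoding and length.
    Hypothetical helper for illustration, not the paper's metric code.
    """
    pairs = list(zip(is_sound, predicted_sound))
    # False positive: an unsound proposal judged sound.
    fp = sum(1 for truth, pred in pairs if not truth and pred)
    # False negative: a sound proposal judged unsound.
    fn = sum(1 for truth, pred in pairs if truth and not pred)
    negatives = sum(1 for truth, _ in pairs if not truth)
    positives = sum(1 for truth, _ in pairs if truth)
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr


# Toy data, invented for illustration only (not benchmark results).
labels = [True, True, False, False, False, True]
standard = [True, True, True, True, False, True]        # optimistic judge
aggressive = [False, True, False, False, False, False]  # pessimistic judge

print("standard:  ", error_rates(labels, standard))
print("aggressive:", error_rates(labels, aggressive))
```

In this toy setup the optimistic judge makes only false-positive errors and the pessimistic judge makes only false-negative errors, which mirrors the kind of trade-off the abstract reports without reproducing any of its numbers.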



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues

    cs.CL 2026-06 unverdicted novelty 7.0

    ReproRepo uses GitHub issues as natural supervision to benchmark LLM agents on detecting reproducibility blockers across 1,149 ML papers, with the top agent finding related issues for roughly 90% of cases.

  2. Socratic agents for autonomous scientific discovery in high-dimensional physical systems

    cs.AI 2026-06 unverdicted novelty 6.0

    AHOIS is a Socratic multi-agent AI that autonomously discovers and validates a random-interference encoding strategy for multimode fiber optics, achieving 76.97% MNIST and 83.17% Fashion-MNIST accuracy with 16x16 meas...

  3. Agon: An Autonomous Large-Scale Omnidisciplinary Research System Built on Prompt Economy

    cs.SE 2026-06 unverdicted novelty 5.0

    Agon is a new autonomous research system using prompt economy loops across 444 iterations to demonstrate scalable omnidisciplinary research and a taxonomy separating machine-fixable failures from those needing human judgment.