ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition

Ben Gao; Dongzhan Zhou; Erik Cambria; Jinjie Ni; Shixiang Tang; Tong Xie; Wanli Ouyang; Yujie Liu; Yuqiang Li; Zonglin Yang

arXiv: 2503.21248 · v3 · submitted 2025-03-27 · cs.CL · cs.AI · cs.CE

Yujie Liu, Zonglin Yang, Tong Xie, Jinjie Ni, Ben Gao, Yuqiang Li, Shixiang Tang, Wanli Ouyang, Erik Cambria, Dongzhan Zhou

Pith reviewed 2026-05-22 22:16 UTC · model grok-4.3

keywords: LLM benchmarking · scientific discovery · inspiration retrieval · hypothesis generation · task decomposition · research benchmark · contamination-free evaluation

The pith

LLMs retrieve inspirations for research hypotheses effectively across disciplines even on recent papers outside their training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ResearchBench as the first large-scale benchmark that decomposes scientific discovery into three sub-tasks—inspiration retrieval, hypothesis composition, and hypothesis ranking—defined as sufficient because perfect performance on them would solve the full discovery problem. An automated LLM-based extraction process pulls research questions, background surveys, inspirations, and hypotheses from papers published in 2024 or later across 12 disciplines, with expert checks confirming accuracy and the recency cutoff minimizing pretraining overlap. Evaluation results show LLMs perform strongly on the inspiration retrieval task, which the authors treat as out-of-distribution, pointing to their capacity for forming novel knowledge associations.

Core claim

Scientific discovery can be reduced to the three sub-tasks of retrieving relevant inspirations from background material, composing hypotheses from those inspirations, and ranking the resulting hypotheses; when LLMs are tested on these sub-tasks extracted from 2024+ papers, they succeed particularly at inspiration retrieval, indicating an ability to surface novel associations even when the source material post-dates their training.

What carries the argument

Inspiration-based task decomposition into inspiration retrieval, hypothesis composition, and hypothesis ranking, presented as a sufficient breakdown of the overall scientific discovery task.
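
As an illustration of the decomposition, the sketch below models one benchmark entry and scores the retrieval sub-task. The class name, field names, and the hit-ratio metric are assumptions made for this example; the text above does not specify the paper's data schema or its exact retrieval metric.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DiscoveryItem:
    """One benchmark entry. All field names are hypothetical."""
    research_question: str
    background_survey: str
    inspirations: tuple[str, ...]  # ground-truth inspiration titles
    hypothesis: str
    published: date
    discipline: str


def hit_ratio(selected: list[str], ground_truth: tuple[str, ...]) -> float:
    """Fraction of ground-truth inspirations found among the selected candidates."""
    if not ground_truth:
        raise ValueError("item has no ground-truth inspirations")
    chosen = set(selected)
    return sum(1 for g in ground_truth if g in chosen) / len(ground_truth)
```

A model that selects one of two ground-truth inspirations for an item would score 0.5 under this assumed metric; composition and ranking would need their own scorers, which are not sketched here.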

If this is right

Strong results on inspiration retrieval imply LLMs can already surface useful cross-paper connections for researchers.
The automated extraction pipeline allows the benchmark to be renewed automatically with newer papers as LLM training cutoffs advance.
Performance differences across the three sub-tasks can guide which parts of the discovery process to automate first.
The multi-discipline coverage provides evidence that the observed retrieval strength is not limited to any single field.
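
The renewal idea in the second point can be sketched as a date filter. This is a minimal illustration, assuming each paper record is a dict with a `published` date; the function name and record layout are hypothetical and not taken from the paper.

```python
from datetime import date


def renew_benchmark(papers: list[dict], training_cutoff: date) -> list[dict]:
    """Keep only papers published strictly after the model's training cutoff."""
    return [p for p in papers if p["published"] > training_cutoff]
```

Re-running such a filter with a later cutoff, followed by extraction on the surviving papers, is one way the benchmark could stay ahead of newer models' pretraining data.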

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Benchmarks of this form could be used to rank LLMs for specific stages of research assistance rather than general capability.
Testing whether high sub-task scores translate into hypotheses that survive peer review or experimental validation would check real utility.
The same decomposition might be applied to non-scientific creative tasks such as engineering design or policy idea generation.

Load-bearing premise

That the automated LLM extraction framework correctly pulls out the true research questions, surveys, inspirations, and hypotheses from papers, and that mastering the three sub-tasks is enough to achieve scientific discovery.

What would settle it

Expert manual inspection of a sample of extracted papers revealing frequent mismatches between the framework's identified inspirations or hypotheses and the actual content of those papers.
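
To make "frequent mismatches" measurable, an inspection of this kind could report the mismatch rate with a confidence interval. The sketch below uses the standard Wilson score interval; the function name and the idea of counting one mismatch per inspected paper are assumptions for illustration, not the paper's validation protocol.

```python
import math


def wilson_interval(mismatches: int, sample_size: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a mismatch rate (z = 1.96 gives roughly 95%)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= mismatches <= sample_size:
        raise ValueError("mismatches must lie between 0 and sample_size")
    p = mismatches / sample_size
    denom = 1 + z * z / sample_size
    centre = (p + z * z / (2 * sample_size)) / denom
    half = z * math.sqrt(p * (1 - p) / sample_size + z * z / (4 * sample_size ** 2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)
```

A wide interval would signal that the inspected sample is too small to say whether the extraction is reliable.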

Original abstract

Large language models (LLMs) have shown potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined due to the lack of a dedicated benchmark. To address this gap, we introduce the first large-scale benchmark for evaluating LLMs on a sufficient set of scientific discovery sub-tasks (inspiration retrieval, hypothesis composition, and hypothesis ranking), where sufficient means that perfectly solving these sub-tasks perfectly solves the overall discovery task. We develop an automated LLM-based framework that extracts critical components (research questions, background surveys, inspirations, and hypotheses) from papers across 12 disciplines, with expert validation confirming its accuracy. To prevent data contamination, we focus exclusively on publications from 2024 onward, ensuring minimal overlap with LLM pretraining data; our automated framework further enables automatic extraction of even more recent papers as LLM pretraining cutoffs advance, supporting scalable and contamination-free automatic renewal of this discovery benchmark. Our evaluation shows that, across disciplines, LLMs excel at inspiration retrieval, an out-of-distribution task, suggesting their ability to surface novel knowledge associations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ResearchBench gives a clean new decomposition and contamination shield for LLM discovery benchmarks, but the claim that the three sub-tasks are sufficient is asserted rather than shown.

The paper's real contribution is ResearchBench: a benchmark that splits scientific discovery into inspiration retrieval, hypothesis composition, and ranking, built only from 2024+ papers with an automated extraction pipeline that can renew itself. That design directly tackles data leakage, which most prior work ignores, and the multi-discipline scope is a practical step forward. The evaluation result that LLMs do well on the retrieval piece is at least worth noting, since it is framed as out-of-distribution.

Referee Report

3 major / 1 minor

Summary. The paper introduces ResearchBench, the first large-scale benchmark for evaluating LLMs on scientific discovery. It decomposes the task into three sub-tasks—inspiration retrieval, hypothesis composition, and hypothesis ranking—extracted automatically via an LLM-based framework from 2024+ papers across 12 disciplines, with expert validation. The authors define these sub-tasks as 'sufficient' for the overall discovery task by construction and report that LLMs excel at inspiration retrieval (an out-of-distribution task), suggesting potential for surfacing novel knowledge associations. The benchmark is designed for scalability and contamination avoidance through automatic renewal.

Significance. If the sufficiency claim and extraction accuracy hold, the benchmark would provide a scalable, renewable framework for assessing LLMs on hypothesis-related tasks, with the reported strength on inspiration retrieval indicating a potential advantage in retrieving cross-domain associations not seen in training data.

major comments (3)

[Abstract] Abstract: The claim that 'perfectly solving these sub-tasks perfectly solves the overall discovery task' is asserted by definition without argument or evidence that the three retrospective sub-tasks (all recombinations from published papers) encompass necessary forward elements such as experimental design, handling negative results, or generating ideas outside the source distribution.
[Extraction framework] Extraction framework: Expert validation is stated to confirm accuracy of the automated LLM-based extraction of research questions, background surveys, inspirations, and hypotheses, but no quantitative metrics (e.g., accuracy percentages, inter-rater agreement, or sample sizes) are provided, leaving the reliability of the benchmark data unquantified.
[Evaluation] Evaluation: The claim that LLMs 'excel' at inspiration retrieval across disciplines is presented without baseline comparisons, error bars, statistical tests, or per-discipline breakdowns, weakening support for the implication that this demonstrates ability to surface novel knowledge associations.
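
The baseline comparison and error bars requested in the last comment could take the form of a paired bootstrap over per-item scores. This is a sketch under stated assumptions: the function name, the per-item score lists, and the percentile method are choices made for illustration, not the paper's evaluation code.

```python
import random
from statistics import mean


def paired_bootstrap_ci(
    model_scores: list[float],
    baseline_scores: list[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (mean difference, lower bound, upper bound) for model minus baseline."""
    if len(model_scores) != len(baseline_scores) or not model_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = [m - b for m, b in zip(model_scores, baseline_scores)]
    n = len(diffs)
    # Resample the per-item differences with replacement and record each mean.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lower, upper
```

An interval that excludes zero would support the claim that the model beats the baseline on the sampled items; running it per discipline would give the requested breakdown.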

minor comments (1)

The manuscript would benefit from explicit discussion of how the three sub-tasks relate to or omit standard stages of the scientific method beyond the sufficiency definition.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the scope and presentation of ResearchBench. We address each major comment below and will make revisions to improve the manuscript.

Referee: [Abstract] Abstract: The claim that 'perfectly solving these sub-tasks perfectly solves the overall discovery task' is asserted by definition without argument or evidence that the three retrospective sub-tasks (all recombinations from published papers) encompass necessary forward elements such as experimental design, handling negative results, or generating ideas outside the source distribution.

Authors: We appreciate this observation. Our use of 'sufficient' is explicitly by construction for the retrospective sub-tasks (inspiration retrieval, hypothesis composition, and ranking) extracted from published papers, meaning that solving them would solve the defined discovery task within that scope. We do not claim these sub-tasks cover all aspects of scientific discovery, such as experimental design, negative results, or ideas outside the source distribution. In the revision, we will update the abstract and introduction to explicitly state the retrospective nature of the benchmark and acknowledge these limitations, providing clearer context without overstating the claim. revision: yes
Referee: [Extraction framework] Extraction framework: Expert validation is stated to confirm accuracy of the automated LLM-based extraction of research questions, background surveys, inspirations, and hypotheses, but no quantitative metrics (e.g., accuracy percentages, inter-rater agreement, or sample sizes) are provided, leaving the reliability of the benchmark data unquantified.

Authors: The referee correctly notes the absence of quantitative metrics. While expert validation was performed, the current manuscript does not report details such as accuracy percentages, inter-rater agreement, or sample sizes. We will add a dedicated subsection in the revision with these quantitative results from the validation process to substantiate the reliability of the extraction framework. revision: yes
Referee: [Evaluation] Evaluation: The claim that LLMs 'excel' at inspiration retrieval across disciplines is presented without baseline comparisons, error bars, statistical tests, or per-discipline breakdowns, weakening support for the implication that this demonstrates ability to surface novel knowledge associations.

Authors: We agree that the abstract's summary of results would benefit from additional supporting details. The full evaluation section reports performance across disciplines, but to strengthen the evidence, we will incorporate baseline comparisons (e.g., against non-LLM retrieval methods), error bars, statistical tests, and per-discipline breakdowns in the revised manuscript. This will better support the observed strengths in inspiration retrieval. revision: yes

Circularity Check

1 step flagged

Sufficiency of sub-tasks for discovery defined by construction in abstract

specific steps

self-definitional [Abstract]
"we introduce the first large-scale benchmark for evaluating LLMs on a sufficient set of scientific discovery sub-tasks (inspiration retrieval, hypothesis composition, and hypothesis ranking), where sufficient means that perfectly solving these sub-tasks perfectly solves the overall discovery task"

The paper presents the sub-tasks as a sufficient proxy for the overall scientific discovery task, but 'sufficient' is defined tautologically as the sub-tasks perfectly solving the discovery task, without independent demonstration that retrospective extraction of inspirations/hypotheses from published papers encompasses all necessary forward-looking elements of discovery.

full rationale

The paper's core evaluation results (LLM performance on inspiration retrieval from held-out papers from 2024 onward) are empirical measurements on extracted data and do not reduce to fitted parameters or self-referential equations. The sole load-bearing definitional step is the assertion that the three sub-tasks form a 'sufficient' benchmark for scientific discovery, which is introduced via an explicit definitional clause rather than derived or validated against external criteria for discovery. This affects interpretation of the benchmark's scope but leaves the reported numerical results independent. No self-citation chains, ansatz smuggling, or renaming of known results appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the three sub-tasks are jointly sufficient for discovery and that the automated extraction faithfully recovers the original paper components; no free parameters or invented entities are introduced.

axioms (2)

domain assumption: The three sub-tasks (inspiration retrieval, hypothesis composition, hypothesis ranking) are jointly sufficient to solve the overall scientific discovery task.
Stated explicitly in the abstract as the definition of 'sufficient'.
domain assumption: An LLM-based automated framework can accurately extract research questions, background surveys, inspirations, and hypotheses from papers.
Required for the benchmark construction; expert validation is mentioned but not quantified.
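
One way the second assumption could be quantified is inter-rater agreement between two experts who each label an extracted component. The sketch below computes Cohen's kappa; the label values and function name are hypothetical, and nothing here reflects how the paper's validation was actually run.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Reporting kappa alongside raw accuracy and the sample size would address the "not quantified" gap noted above.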

pith-pipeline@v0.9.0 · 5748 in / 1429 out tokens · 56849 ms · 2026-05-22T22:16:06.077744+00:00 · methodology

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Forecasting Scientific Progress with Artificial Intelligence
cs.AI 2026-05 unverdicted novelty 7.0

Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and in...
AstroAlertBench: Evaluating the Accuracy, Reasoning, and Honesty of Multimodal LLMs in Astronomical Classification
astro-ph.IM 2026-05 unverdicted novelty 7.0

AstroAlertBench evaluates multimodal LLMs on astronomical classification accuracy, reasoning, and honesty using real ZTF alerts, revealing that high accuracy often diverges from self-assessed reasoning quality.
AI scientists produce results without reasoning scientifically
cs.AI 2026-04 conditional novelty 7.0

LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.
IDRBench: Understanding the Capability of Large Language Models on Interdisciplinary Research
cs.CL 2025-07 unverdicted novelty 7.0

IDRBench is presented as the first benchmark framework consisting of datasets and three evaluation tasks to measure LLMs' ability to perform interdisciplinary research.
LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design
cs.LG 2026-05 unverdicted novelty 6.0

LEAPBench shows trajectory scoring changes best-model rankings on 53% of tasks, LLMs do not beat Bayesian optimization, and domain-aware prompting underperforms domain-agnostic on biology tasks aligned with published ...
MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
cs.LG 2026-05 unverdicted novelty 6.0

MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.
AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery
cs.AI 2026-05 unverdicted novelty 4.0

A survey organizing AI-powered research automation into five workflow stages, defining AutoResearch and Vibe Research, and proposing five evaluation dimensions while noting domain-conditioned limits on autonomy.
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
cs.AI 2025-04 accept novelty 4.0

A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.