Sound Agentic Science Requires Adversarial Experiments
Pith reviewed 2026-05-09 21:08 UTC · model grok-4.3
The pith
LLM agents in science must hunt for ways their claims could fail instead of only building supportive narratives.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLM-based agents are rapidly being adopted for scientific data analysis, automating tasks once limited by human time and expertise. This capability is often framed as an acceleration of discovery, but it also accelerates a familiar failure mode: the rapid production of plausible, endlessly revisable analyses that turn hypothesis space into candidate claims supported by selectively chosen analyses, optimized for publishable positives. Unlike software, scientific knowledge is not validated by the iterative accumulation of code and post hoc statistical support; a fluent explanation or a significant result on a single dataset is not verification. The missing evidence is a negative space: the experiments and analyses that would have falsified the claim were never run or never published.
What carries the argument
The falsification-first standard, under which agents are directed to actively search for the ways in which a generated claim can fail rather than to craft the most compelling narrative.
Load-bearing premise
It is practically feasible to design and run automated adversarial experiments that meaningfully falsify claims without introducing new biases or requiring human oversight that defeats the purpose of agentic assistance.
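The premise is easiest to weigh in miniature. Below is a hedged sketch, not the paper's protocol, of what one automated falsification pass over a claimed correlation might look like: the agent tries to break the claim on a held-out split and under label permutation, and reports every check the claim fails. The claim form, thresholds, and check names are illustrative assumptions.

```python
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def falsification_report(xs, ys, r_claimed, seed=0):
    """Try to break the claim 'corr(x, y) ~ r_claimed' instead of defending it."""
    rng = random.Random(seed)
    failures = []

    # Check 1: does the effect survive on a held-out half of the data?
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    half = idx[len(idx) // 2:]
    r_holdout = pearson([xs[i] for i in half], [ys[i] for i in half])
    if abs(r_holdout) < abs(r_claimed) / 2:
        failures.append(f"holdout: r={r_holdout:.2f} vs claimed {r_claimed:.2f}")

    # Check 2: permutation null -- how often does shuffled y match the claim?
    hits = 0
    for _ in range(1000):
        ys_perm = ys[:]
        rng.shuffle(ys_perm)
        if abs(pearson(xs, ys_perm)) >= abs(r_claimed):
            hits += 1
    if hits / 1000 > 0.05:
        failures.append(f"permutation: p~{hits / 1000:.3f}")

    return failures  # empty list = the claim survived these attempts
```

An empty report is survival of the attempted attacks, not verification; the paper's point is that such attacks must be run at all, and its open question is whether the choice of checks can itself smuggle in new selection biases.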
What would settle it
A controlled comparison in which the same set of agent-generated claims is evaluated once under standard positive-optimization prompting and once under explicit adversarial-falsification prompting, followed by independent human replication attempts to measure whether the latter set survives at markedly higher rates.
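That comparison is simple enough to state as an analysis plan. A minimal sketch, assuming replication-survival counts from the two prompting conditions are already graded (all counts below are placeholders, not reported results): a two-proportion z-test on survival rates.

```python
import math

def two_proportion_z(survived_a, n_a, survived_b, n_b):
    """z statistic for H0: survival rates are equal across conditions."""
    p_a, p_b = survived_a / n_a, survived_b / n_b
    p_pool = (survived_a + survived_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Placeholder counts: 60 claims per condition, graded by independent
# human replication attempts (the paper reports no such numbers).
z = two_proportion_z(survived_a=21, n_a=60,   # positive-optimization prompting
                     survived_b=39, n_b=60)   # adversarial-falsification prompting
# z > 1.645 would support "markedly higher survival" at one-sided alpha = 0.05
```

The design choice that matters is the independent human replication step: without it, the comparison only measures which prompting style produces claims the grading agent likes.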
read the original abstract
LLM-based agents are rapidly being adopted for scientific data analysis, automating tasks once limited by human time and expertise. This capability is often framed as an acceleration of discovery, but it also accelerates a familiar failure mode, the rapid production of plausible, endlessly revisable analyses that are easy to generate, effectively turning hypothesis space into candidate claims supported by selectively chosen analyses, optimized for publishable positives. Unlike software, scientific knowledge is not validated by the iterative accumulation of code and post hoc statistical support. A fluent explanation or a significant result on a single dataset is not verification. Because the missing evidence is a negative space, experiments and analyses that would have falsified the claim were never run or never published. We therefore propose that non-experimental claims produced with agentic assistance be evaluated under a falsification-first standard: agents should not be used primarily to craft the most compelling narrative, but to actively search for the ways in which the claim can fail.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that LLM-based agents accelerate the production of plausible but selectively supported scientific claims by optimizing for positive analyses and narratives, and proposes that non-experimental agent-assisted claims should instead follow a falsification-first standard in which agents are directed to actively search for disconfirming evidence and ways the claim can fail.
Significance. If the proposed standard is adopted, it could meaningfully improve the reliability of agent-assisted science by aligning workflows with established falsification principles from the philosophy of science. The manuscript earns credit for its internally consistent normative argument that avoids self-referential definitions or fitted parameters, its explicit acknowledgment of open practical questions around bias and automation, and its clear contrast between narrative crafting and adversarial search.
major comments (2)
- [Abstract] The central motivation, that agents 'accelerate a familiar failure mode' of plausible but unverified claims, is presented without any concrete examples, case studies, or prevalence data on current agent behavior, which weakens the load-bearing claim that this risk is widespread enough to require a new standard.
- [Proposal] The recommendation, as summarized in the abstract, that 'agents should not be used primarily to craft the most compelling narrative, but to actively search for the ways in which the claim can fail' rests on the unexamined assumption that automated adversarial experiments can be designed and executed without introducing new selection biases or requiring prohibitive human oversight; this feasibility issue is flagged as open but not analyzed in sufficient detail to support the standard as practically actionable.
minor comments (1)
- The abstract and full text would benefit from explicit section headings or a short roadmap paragraph to help readers locate the risk description versus the proposed standard.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and recommendation for minor revision. We respond to each major comment below, indicating where we will revise the manuscript.
read point-by-point responses
- Referee: [Abstract] The central motivation, that agents 'accelerate a familiar failure mode' of plausible but unverified claims, is presented without any concrete examples, case studies, or prevalence data on current agent behavior, which weakens the load-bearing claim that this risk is widespread enough to require a new standard.
  Authors: We agree that illustrative examples would make the motivation more tangible. The argument draws from established patterns in scientific practice (selective reporting, file-drawer effects) that agentic systems can amplify through rapid iteration. As the paper is a normative proposal rather than an empirical study, we did not include prevalence statistics. In revision we will add brief, hypothetical scenarios drawn from common agent workflows to illustrate the selective-support risk without asserting new empirical claims. revision: yes
- Referee: [Proposal] The recommendation, as summarized in the abstract, that 'agents should not be used primarily to craft the most compelling narrative, but to actively search for the ways in which the claim can fail' rests on the unexamined assumption that automated adversarial experiments can be designed and executed without introducing new selection biases or requiring prohibitive human oversight; this feasibility issue is flagged as open but not analyzed in sufficient detail to support the standard as practically actionable.
  Authors: We appreciate the referee noting this tension. The manuscript already identifies bias and oversight as open questions to avoid overclaiming practicality. To strengthen the proposal we will expand the discussion section with additional analysis of how selection biases could arise in automated falsification searches and the likely continued need for human judgment, while preserving the stance that these remain unresolved implementation challenges rather than solved engineering problems. revision: partial
Circularity Check
No significant circularity
full rationale
The paper is a normative position piece whose central claim—that agentic scientific claims should be evaluated under a falsification-first standard—rests on standard philosophy-of-science considerations (negative evidence space, limits of selective positive analyses) rather than any derivation, equation, fitted parameter, or self-referential definition. No load-bearing step reduces to the paper's own inputs by construction; the argument is self-contained against external benchmarks and acknowledges practical difficulties as open questions instead of asserting them as solved.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLM agents accelerate the production of plausible but selectively supported analyses that are difficult to falsify post hoc
- standard math: Scientific claims require active attempts at falsification rather than accumulation of positive evidence alone