pith. machine review for the scientific record.

arxiv: 2604.22080 · v1 · submitted 2026-04-23 · 💻 cs.AI

Recognition: unknown

Sound Agentic Science Requires Adversarial Experiments


Pith reviewed 2026-05-09 21:08 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM agents · scientific discovery · falsification · adversarial testing · reproducibility · agentic science · hypothesis validation

The pith

LLM agents in science must hunt for ways their claims could fail instead of only building supportive narratives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that LLM-based agents accelerate not only data analysis but also the generation of plausible claims backed by selectively chosen evidence, because they optimize for fluent positives without exploring counter-evidence. This matters because scientific knowledge depends on claims surviving attempts at disconfirmation, not merely on the presence of supporting results, and unchecked agent use risks flooding research with analyses that look verified but never faced serious tests. The authors therefore call for a falsification-first standard in which agents are tasked first with finding the ways a claim could break. The missing negative space of unrun experiments is what currently allows non-experimental agent outputs to appear sound.

Core claim

LLM-based agents accelerate not only scientific data analysis but also a familiar failure mode: the rapid production of plausible, endlessly revisable analyses, effectively turning hypothesis space into candidate claims supported by selectively chosen evidence and optimized for publishable positives. Unlike software, scientific knowledge is not validated by the iterative accumulation of code and post hoc statistical support; a fluent explanation or a significant result on a single dataset is not verification. The paper therefore proposes that non-experimental claims produced with agentic assistance be evaluated under a falsification-first standard.

What carries the argument

The falsification-first standard, under which agents are directed to actively search for the ways in which a generated claim can fail rather than to craft the most compelling narrative.
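The standard is normative rather than algorithmic, but its core move can be sketched in code: instead of reporting only the headline estimate, the agent actively probes how often the claimed effect breaks under perturbation. The harness below is a minimal hypothetical sketch, not the authors' method; the bootstrap check, the sign-flip failure criterion, and all names (`falsification_pass`, `claim_effect`, `break_rate`) are illustrative assumptions.

```python
import random
import statistics

def falsification_pass(claim_effect, data, n_resamples=200, seed=0):
    """Hypothetical falsification-first check (illustrative, not from the paper).

    `claim_effect` maps a dataset to an effect estimate. Rather than
    reporting the estimate alone, we bootstrap-resample the data and
    count how often the estimated effect's sign flips -- i.e., how often
    the claim breaks under a mild perturbation it should survive.
    """
    rng = random.Random(seed)
    full_effect = claim_effect(data)
    sign_flips = 0
    for _ in range(n_resamples):
        resample = [rng.choice(data) for _ in data]  # bootstrap resample
        if claim_effect(resample) * full_effect <= 0:
            sign_flips += 1
    # Report the rate at which the claim broke alongside the estimate.
    return {"effect": full_effect, "break_rate": sign_flips / n_resamples}

# Toy usage: a weak mean-difference "claim" on synthetic paired observations.
gen = random.Random(1)
data = [(x, x + gen.gauss(0.1, 1.0)) for x in range(50)]
effect = lambda d: statistics.mean(b - a for a, b in d)
report = falsification_pass(effect, data)
```

A real agent workflow would replace the bootstrap with a richer adversarial search (alternative specifications, covariate controls, held-out data), but the reporting contract is the point: the output foregrounds how the claim failed to break, not how well it can be narrated.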

Load-bearing premise

It is practically feasible to design and run automated adversarial experiments that meaningfully falsify claims without introducing new biases or requiring human oversight that defeats the purpose of agentic assistance.

What would settle it

A controlled comparison in which the same set of agent-generated claims is evaluated once under standard positive-optimization prompting and once under explicit adversarial-falsification prompting, followed by independent human replication attempts to measure whether the latter set survives at markedly higher rates.
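The analysis step of such a comparison is simple to state precisely. The sketch below is a hypothetical illustration of it, assuming per-claim replication outcomes are already in hand; the function name, the two-proportion z statistic, and the toy numbers are all assumptions, not results from the paper.

```python
from math import sqrt

def survival_gap(standard_outcomes, adversarial_outcomes):
    """Hypothetical analysis for the settling experiment (illustrative).

    Each list holds True/False replication outcomes for claims evaluated
    under standard positive-optimization prompting vs. adversarial-
    falsification prompting. Returns both survival rates, their gap, and
    a two-proportion z statistic for the difference.
    """
    n1, n2 = len(standard_outcomes), len(adversarial_outcomes)
    p1 = sum(standard_outcomes) / n1
    p2 = sum(adversarial_outcomes) / n2
    pooled = (sum(standard_outcomes) + sum(adversarial_outcomes)) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se if se > 0 else 0.0
    return {"standard": p1, "adversarial": p2, "gap": p2 - p1, "z": z}

# Toy usage with invented outcomes (not data from the paper):
standard = [True] * 9 + [False] * 21      # 30 claims, 30% survive replication
adversarial = [True] * 19 + [False] * 11  # 30 claims, ~63% survive
result = survival_gap(standard, adversarial)
```

The paper's prediction corresponds to `gap` being markedly positive; a null or negative gap across a representative claim set would count against the falsification-first standard.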

Figures

Figures reproduced from arXiv: 2604.22080 by Dionizije Fa, Marko Čuljak.

Figure 1. The verification gap between software agents and general data analysis agents.
Original abstract

LLM-based agents are rapidly being adopted for scientific data analysis, automating tasks once limited by human time and expertise. This capability is often framed as an acceleration of discovery, but it also accelerates a familiar failure mode, the rapid production of plausible, endlessly revisable analyses that are easy to generate, effectively turning hypothesis space into candidate claims supported by selectively chosen analyses, optimized for publishable positives. Unlike software, scientific knowledge is not validated by the iterative accumulation of code and post hoc statistical support. A fluent explanation or a significant result on a single dataset is not verification. Because the missing evidence is a negative space, experiments and analyses that would have falsified the claim were never run or never published. We therefore propose that non-experimental claims produced with agentic assistance be evaluated under a falsification-first standard: agents should not be used primarily to craft the most compelling narrative, but to actively search for the ways in which the claim can fail.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper argues that LLM-based agents accelerate the production of plausible but selectively supported scientific claims by optimizing for positive analyses and narratives, and proposes that non-experimental agent-assisted claims should instead follow a falsification-first standard in which agents are directed to actively search for disconfirming evidence and ways the claim can fail.

Significance. If the proposed standard is adopted, it could meaningfully improve the reliability of agent-assisted science by aligning workflows with established falsification principles from the philosophy of science. The manuscript earns credit for its internally consistent normative argument that avoids self-referential definitions or fitted parameters, its explicit acknowledgment of open practical questions around bias and automation, and its clear contrast between narrative crafting and adversarial search.

major comments (2)
  1. [Abstract] The central motivation—that agents 'accelerate a familiar failure mode' of plausible but unverified claims—is presented without any concrete examples, case studies, or prevalence data on current agent behavior, which weakens the load-bearing claim that this risk is widespread enough to require a new standard.
  2. [Proposal] The recommendation (as summarized in the abstract) that 'agents should not be used primarily to craft the most compelling narrative, but to actively search for the ways in which the claim can fail' rests on the unexamined assumption that automated adversarial experiments can be designed and executed without introducing new selection biases or requiring prohibitive human oversight; this feasibility issue is flagged as open but not analyzed in sufficient detail to support the standard as practically actionable.
minor comments (1)
  1. The abstract and full text would benefit from explicit section headings or a short roadmap paragraph to help readers locate the risk description versus the proposed standard.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and recommendation for minor revision. We respond to each major comment below, indicating where we will revise the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The central motivation—that agents 'accelerate a familiar failure mode' of plausible but unverified claims—is presented without any concrete examples, case studies, or prevalence data on current agent behavior, which weakens the load-bearing claim that this risk is widespread enough to require a new standard.

    Authors: We agree that illustrative examples would make the motivation more tangible. The argument draws from established patterns in scientific practice (selective reporting, file-drawer effects) that agentic systems can amplify through rapid iteration. As the paper is a normative proposal rather than an empirical study, we did not include prevalence statistics. In revision we will add brief, hypothetical scenarios drawn from common agent workflows to illustrate the selective-support risk without asserting new empirical claims. revision: yes

  2. Referee: [Proposal] The recommendation (as summarized in the abstract) that 'agents should not be used primarily to craft the most compelling narrative, but to actively search for the ways in which the claim can fail' rests on the unexamined assumption that automated adversarial experiments can be designed and executed without introducing new selection biases or requiring prohibitive human oversight; this feasibility issue is flagged as open but not analyzed in sufficient detail to support the standard as practically actionable.

    Authors: We appreciate the referee noting this tension. The manuscript already identifies bias and oversight as open questions to avoid overclaiming practicality. To strengthen the proposal we will expand the discussion section with additional analysis of how selection biases could arise in automated falsification searches and the likely continued need for human judgment, while preserving the stance that these remain unresolved implementation challenges rather than solved engineering problems. revision: partial

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is a normative position piece whose central claim—that agentic scientific claims should be evaluated under a falsification-first standard—rests on standard philosophy-of-science considerations (negative evidence space, limits of selective positive analyses) rather than any derivation, equation, fitted parameter, or self-referential definition. No load-bearing step reduces to the paper's own inputs by construction; the argument is self-contained against external benchmarks and acknowledges practical difficulties as open questions instead of asserting them as solved.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central proposal rests on two background assumptions about current agent behavior and the requirements of scientific validation; no new entities or fitted parameters are introduced.

axioms (2)
  • domain assumption LLM agents accelerate the production of plausible but selectively supported analyses that are difficult to falsify post hoc
    Stated directly in the abstract as the motivating failure mode.
  • standard math Scientific claims require active attempts at falsification rather than accumulation of positive evidence alone
    Invoked as the contrast to current agent usage; draws on established philosophy of science without new justification in the text.

pith-pipeline@v0.9.0 · 5452 in / 1302 out tokens · 45943 ms · 2026-05-09T21:08:29.637684+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

16 extracted references · 9 canonical work pages · 2 internal anchors
