pith. sign in

arxiv: 2606.23175 · v1 · pith:HAANFKKTnew · submitted 2026-06-22 · 💻 cs.LG

Position: Correct Answer, Wrong Mechanism -- When AI Scientists Defend General Claims Their Own Data Contradicts

Pith reviewed 2026-06-26 08:48 UTC · model grok-4.3

classification 💻 cs.LG
keywords AI scientistsmechanism fidelityepistemic honestyCorrect Answer Wrong Mechanismcoding agentssimulation-based discoveryGeant4outcome evaluation
0
0 comments X

The pith

Coding agents reach correct simulation results through reasoning that contradicts their own data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that AI scientist systems cannot be judged only on whether their final answers look right. Experiments with a coding agent rediscovering a known particle identification observable in Geant4 simulations produced several cases where the output appeared correct but rested on incorrect reasoning that failed when conditions changed. The authors label this pattern Correct Answer, Wrong Mechanism and show that outcome success can dissociate from mechanism fidelity and epistemic honesty inside a single agent run. They conclude that such agents function reliably as tools yet cannot be trusted as scientific co-authors for open-ended claims without external verification of the reasoning steps.

Core claim

Across 28 episodes, agents produced right-looking results via incorrect mechanisms in 4 of 20 primary-model runs and 3 of 8 cross-model runs. When supplied a partially misleading prior, all five agents rejected the false component on evidence, yet one still defended its chosen observable with physics inconsistent with the data it had generated. The paper demonstrates that honesty and mechanism fidelity can separate within one trajectory and proposes two lightweight checks that together flag every such case in the study.

What carries the argument

Correct Answer, Wrong Mechanism (CAWM), the pattern in which an agent produces a result that matches a known observable but reaches it through reasoning that breaks under a regime shift.

If this is right

  • Task outcome must be measured separately from mechanism fidelity and epistemic honesty.
  • Coding agents remain reliable for tool-like tasks but unreliable as co-authors for open-ended claim-making.
  • A one-step regime-shift check using only the agent's claim flags over-generalized cases.
  • When the correct observable is known, a recomputation check flags the remaining cases.
  • Together the two checks identify every CAWM instance in the reported episodes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dissociation between outcome and reasoning could appear in other simulation or modeling domains where agents generate explanatory claims.
  • Requiring agents to run an internal regime-shift test on every claim would make the failure mode detectable without external knowledge of the right answer.
  • Benchmarks for AI scientists will need explicit regime-shift and consistency tests rather than outcome-only scoring.
  • The pattern raises the question of whether human scientists exhibit analogous over-generalization that is harder to detect because it is not produced by explicit code.

Load-bearing premise

The observed episodes of rediscovering one known observable in a particle simulation represent how AI systems make open-ended scientific claims more generally.

What would settle it

A new set of episodes in which every agent that reaches a correct result also passes both the regime-shift check and a recomputation check against its own generated data.

Figures

Figures reproduced from arXiv: 2606.23175 by Steven Young Eulig.

Figure 1
Figure 1. Figure 1: The outcome–mechanism matrix, with all 28 episodes placed by the codings of §3 and §4, with outcome correctness as de￾fined in §2. Single-regime outcome evaluation cannot distinguish the top-right ideal quadrant from the shaded CAWM quadrant, where a correct result is defended with fabricated physics. All seven CAWM-coded episodes sit in the shaded cell. The episode at lower left pairs an unsupported mecha… view at source ↗
Figure 2
Figure 2. Figure 2: Per-event photon arrival times for muons (blue) and electrons (red) at d = 10, 20 m, rebased to the first hit per event. One illustrative 50-event run at E = 40 GeV. Insets give per-event means of σ and fearly(W = 1 ns). At d = 10 m, σµ ≈ σe and σ fails to discriminate. At d = 20 m, σµ > σe emerges as the δ-ray tail dominates at lower photon counts. fearly(µ) > fearly(e) holds at both distances. coincidenc… view at source ↗
Figure 3
Figure 3. Figure 3: Robustness of σ and fearly, pooled per d. (a) Mean per-event σµ falls by about a factor of 2.5 when the latest 5% of photons in each event are removed (from 7.2 to 2.6 ns at d = 10 m; from 11.4 to 5.0 ns at d = 25 m), so its width is carried by a sparse late tail of δ-ray and scattered photons rather than the track-length geometry the agents invoke. (b) Per-episode σµ − σe. The sign reverses in 10 of 25 si… view at source ↗
read the original abstract

AI scientist systems are described as tools, coauthors, or founders, but we evaluate them as if only the final answer matters. This position paper argues that outcome-only evaluation is insufficient, and that task outcome, mechanism fidelity, and epistemic honesty must be measured separately. Our evidence comes from 28 episodes of a coding agent attempting to rediscover a known particle identification observable in a Geant4 simulation, including an 8-episode probe across two additional frontier models. In 4/20 primary-model and 3/8 cross-model episodes, agents reach right-looking results through incorrect reasoning that breaks when conditions change, which we call Correct Answer, Wrong Mechanism (CAWM). Honesty and mechanism fidelity dissociate within a single agent trajectory. When given a partially misleading prior, all five agents reject the false component on evidence, yet one defends its chosen observable with physics inconsistent with its own data. In the simulation-based discovery setting studied here, coding agents prove reliable tools but unreliable scientific co-authors for open-ended claim-making, where co-author trust requires mechanism-fidelity verification they do not reliably self-apply. The failure is detectable, and we propose a lightweight test. A one-step regime-shift check needs only the agent's claim and flags the over-generalized cases. A companion recomputation flags the remaining cases when the correct observable is known. Together, these checks flag every CAWM case in this study.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper argues that evaluating AI scientist systems solely on final task outcomes is insufficient and that task outcome, mechanism fidelity, and epistemic honesty must be assessed separately. Evidence is drawn from 28 episodes in which a coding agent attempts to rediscover a known particle identification observable in a Geant4 simulation, with an additional 8-episode cross-model probe using two other frontier models. In 4/20 primary-model episodes and 3/8 cross-model episodes the agents produce right-looking results via incorrect reasoning that fails under changed conditions (termed CAWM); honesty and mechanism fidelity are shown to dissociate within trajectories. The work concludes that such agents are reliable tools but unreliable scientific co-authors for open-ended claim-making and proposes two lightweight detection tests: a one-step regime-shift check and a recomputation check when the correct observable is known.

Significance. If the central observations hold, the paper identifies a practically important gap in current evaluation practices for AI systems used in scientific discovery. The concrete demonstration that correct answers can arise from non-generalizable mechanisms, together with the dissociation of honesty from fidelity and the proposal of simple, falsifiable checks, supplies a clear, actionable direction for improving co-author trust. The cross-model replication and the explicit scoping to the simulation-based rediscovery setting studied here are strengths that make the position testable and extensible.

major comments (1)
  1. [Methods (implied by abstract reporting of episode counts and cross-model results)] The manuscript does not supply the precise operational definition of 'incorrect reasoning,' the episode inclusion/exclusion criteria, or the annotation protocol used to classify the 4/20 and 3/8 CAWM cases. These details are load-bearing for interpreting the reported rates and for assessing whether the observed dissociation between honesty and mechanism fidelity is robust.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for identifying the need for explicit methodological transparency in a position paper that relies on empirical episode counts. We address the single major comment below and will incorporate the requested details in revision.

read point-by-point responses
  1. Referee: The manuscript does not supply the precise operational definition of 'incorrect reasoning,' the episode inclusion/exclusion criteria, or the annotation protocol used to classify the 4/20 and 3/8 CAWM cases. These details are load-bearing for interpreting the reported rates and for assessing whether the observed dissociation between honesty and mechanism fidelity is robust.

    Authors: We agree that these elements must be stated explicitly. The revised manuscript will add a Methods subsection that defines: (i) 'incorrect reasoning' as any chain that produces a numerically correct observable yet fails to generalize when the simulation regime is shifted (e.g., different particle species, energy range, or detector geometry) or that asserts physics inconsistent with the agent's own generated data; (ii) inclusion criteria as every completed agent trajectory that returned a candidate observable without runtime failure or explicit refusal; exclusion only for trajectories that crashed before producing output; (iii) annotation protocol as dual independent coding by two authors using a shared rubric, with disagreements resolved by joint review of the full trajectory log. These additions will make the 4/20 and 3/8 classifications fully auditable and will clarify how honesty–fidelity dissociation was identified within individual runs. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical episodes are independent evidence

full rationale

The paper presents a position based on 28 new experimental episodes of a coding agent rediscovering a known observable in Geant4, plus cross-model probes. No derivations, equations, fitted parameters, or self-citations are invoked to support the central CAWM claim or the proposed checks; the evidence consists of direct experimental outcomes rather than reductions to inputs by construction. The setup is self-contained against external benchmarks (the known particle identification observable) and does not rely on load-bearing self-citations or ansatzes smuggled from prior work. This matches the default case of no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The position rests on the assumption that the chosen simulation task and observable constitute a valid proxy for open-ended scientific claim-making; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The Geant4 particle identification task is representative of broader AI scientist claim-making scenarios.
    Invoked to generalize from the 28 episodes to the reliability of AI co-authors.

pith-pipeline@v0.9.1-grok · 5785 in / 1203 out tokens · 28218 ms · 2026-06-26T08:48:40.782610+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

18 extracted references · 3 canonical work pages · 3 internal anchors

  1. [1]

    Nature , volume=

    Autonomous chemical research with large language models , author=. Nature , volume=

  2. [2]

    Agostinelli, Sea and Allison, John and Amako, Ka and Apostolakis, J and Araujo, H and Arce, P and Asai, M and Axen, D and Banerjee, S and Barrand, G and others , journal=

  3. [3]

    and Burns, Benjamin and Adu-Ampratwum, Daniel and Huang, Xuhui and Ning, Xia and Gao, Song and Su, Yu and Sun, Huan , booktitle=

    Chen, Ziru and Chen, Shijie and Ning, Yuting and Zhang, Qianheng and Wang, Boshi and Yu, Botao and Li, Yifei and Liao, Zeyi and Wei, Chen and Lu, Zitong and Dey, Vishal and Xue, Mingyi and Baker, Frazier N. and Burns, Benjamin and Adu-Ampratwum, Daniel and Huang, Xuhui and Ning, Xia and Gao, Song and Su, Yu and Sun, Huan , booktitle=

  4. [4]

    Ren, Richard and Agarwal, Arunim and Mazeika, Mantas and Menghini, Cristina and Vacareanu, Robert and Kenstler, Brad and Yang, Mick and Barrass, Isabelle and Gatti, Alice and Yin, Xuwang and Trevino, Eduardo and Geralnik, Matias and Khoja, Adam and Lee, Dean and Yue, Summer and Hendrycks, Dan , journal=. The

  5. [5]

    Language Models Don't Always Say What They Think: Unfaithful Explanations in

    Turpin, Miles and Michael, Julian and Perez, Ethan and Bowman, Samuel R , journal=. Language Models Don't Always Say What They Think: Unfaithful Explanations in

  6. [6]

    Measuring Faithfulness in Chain-of-Thought Reasoning

    Lanham, Tamera and Chen, Anna and Radhakrishnan, Ansh and Steiner, Benoit and Denison, Carson and Hernandez, Danny and Li, Dustin and Durmus, Esin and Hubinger, Evan and Kernion, Jackson and Luko. Measuring Faithfulness in. arXiv preprint arXiv:2307.13702 , year=

  7. [7]

    International Conference on Learning Representations , year=

    Let's Verify Step by Step , author=. International Conference on Learning Representations , year=

  8. [8]

    Solving math word problems with process- and outcome-based feedback

    Solving Math Word Problems with Process- and Outcome-Based Feedback , author=. arXiv preprint arXiv:2211.14275 , year=

  9. [9]

    Nature Machine Intelligence , volume=

    Shortcut Learning in Deep Neural Networks , author=. Nature Machine Intelligence , volume=

  10. [10]

    Evaluating

    Beel, Joeran and Kan, Min-Yen and Baumgart, Moritz , journal=. Evaluating. 2025 , note=

  11. [11]

    Advances in Neural Information Processing Systems , volume=

    Jansen, Peter and C\^. Advances in Neural Information Processing Systems , volume=

  12. [12]

    Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (

    Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference , author=. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (

  13. [13]

    Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (

    Probing Neural Network Comprehension of Natural Language Arguments , author=. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (

  14. [14]

    , journal=

    Luo, Ziming and Kasirzadeh, Atoosa and Shah, Nihar B. , journal=. The More You Automate, the Less You See: Hidden Pitfalls of. 2025 , note=

  15. [15]

    Towards an AI co-scientist

    Gottweis, Juraj and Weng, Wei-Hung and Daryin, Alexander and Tu, Tao and Palepu, Anil and Sirkovic, Petar and Myaskovsky, Artiom and Weissenberger, Felix and Rong, Keran and Tanno, Ryutaro and Saab, Khaled and Popovici, Dan and Blum, Jacob and Zhang, Fan and Chou, Katherine and Hassidim, Avinatan and Gokturk, Burak and Vahdat, Amin and Kohli, Pushmeet and...

  16. [16]

    Yamada, Yutaro and Lange, Robert Tjarko and Lu, Cong and Hu, Shengran and Lu, Chris and Foerster, Jakob and Clune, Jeff and Ha, David , journal=. The

  17. [17]

    Towards end-to-end automation of

    Lu, Chris and Lu, Cong and Lange, Robert Tjarko and Yamada, Yutaro and Hu, Shengran and Foerster, Jakob and Ha, David and Clune, Jeff , journal=. Towards end-to-end automation of

  18. [18]

    and Barabasi, Daniel L

    Mitchener, Ludovico and Yiu, Angela and Chang, Benjamin and Bourdenx, Mathieu and Nadolski, Tyler and Sulovari, Arvis and Landsness, Eric C. and Barabasi, Daniel L. and Narayanan, Siddharth and Evans, Nicky and Reddy, Shriya and Foiani, Martha and Kamal, Aizad and Shriver, Leah P. and Cao, Fang and Wassie, Asmamaw T. and Laurent, Jon M. and Melville-Green...