The Last Human-Written Paper: Agent-Native Research Artifacts

Alex Pentland; Ang Chen; Ao Qu; Baoyu Zhou; Beidi Chen; Carl Chen; Chenglei Si; Chenyu You; Fan Lai; Haizhong Zheng

arxiv: 2604.24658 · v3 · pith:VS23ZKI7new · submitted 2026-04-27 · 💻 cs.LG

The Last Human-Written Paper: Agent-Native Research Artifacts

Jiachen Liu , Jiaxin Pei , Jintao Huang , Chenglei Si , Ao Qu , Xiangru Tang , Runyu Lu , Lichang Chen

show 29 more authors

Xiaoyan Bai Haizhong Zheng Carl Chen Zhiyang Chen Haojie Ye Yujuan Fu Zexue He Zijian Jin Zhenyu Zhang Shangquan Sun Maestro Harmon John Dianzhuo Wang Jianqiao Zeng Jiachen Sun Mingyuan Wu Baoyu Zhou Chenyu You Shijian Lu Yiming Qiu Fan Lai Yuan Yuan Yao Li Junyuan Hong Ruihao Zhu Beidi Chen Alex Pentland Ang Chen Mosharaf Chowdhury Zechen Zhang

This is my paper

Pith reviewed 2026-05-20 23:42 UTC · model grok-4.3

classification 💻 cs.LG

keywords Agent-Native Research Artifactresearch reproducibilityAI agentsscientific publicationexploration graphmachine-executable researchstorytelling taxengineering tax

0 comments

The pith

Agent-native research artifacts replace narrative papers with four-layer machine-executable packages to cut information loss for AI agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Traditional scientific papers compress branching research into a linear story, discarding failed experiments and omitting details that machines need to run the work. This creates a storytelling tax and an engineering tax that block AI agents from understanding, reproducing, or extending the findings. The paper introduces the Agent-Native Research Artifact, a protocol built on four layers of scientific logic, full executable code, an exploration graph that keeps the discarded paths, and raw evidence for every claim. Three supporting tools capture decisions during development, convert old papers, and automate objective review checks so humans focus on bigger judgments. If the structure succeeds, agents show large gains in answering questions about the work and modest gains in reproducing results.

Core claim

The paper claims that replacing the conventional linear narrative paper with an Agent-Native Research Artifact structured around four layers—scientific logic, executable code with full specifications, an exploration graph that preserves failures and dead ends, and evidence that grounds every claim in raw outputs—removes the storytelling tax and engineering tax. This change allows AI agents to understand, reproduce, and extend published research more effectively, as shown by accuracy rising from 72.4 percent to 93.7 percent on question-answering benchmarks and reproduction success rising from 57.4 percent to 64.4 percent.

What carries the argument

The four-layer Agent-Native Research Artifact (ARA) protocol, supported by a Live Research Manager that records decisions and dead ends, an ARA Compiler that converts legacy PDFs and repositories, and an ARA-native review system that automates objective checks.

If this is right

AI agents reach higher accuracy when answering questions drawn from published research.
Reproduction success rates increase because full code specifications and raw evidence are preserved.
Human reviewers spend less time on implementation details and more time on significance, novelty, and taste.
Preserved failure traces in the exploration graph can accelerate agent progress on extension tasks or, for some agents, limit exploration outside prior paths.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Teams might adopt new development workflows that treat the exploration graph as a first-class output rather than an afterthought.
If widely adopted, credit and review processes could shift toward evaluating the completeness of the artifact instead of only the final narrative.
The format could encourage research to be conducted in a machine-first manner from the initial experiments onward.

Load-bearing premise

That typical research teams can create and maintain the four-layer ARA structure without prohibitive extra effort and that the reported benchmark gains will hold for other research tasks.

What would settle it

An experiment in which independent teams produce ARAs for their own papers and independent agents using those ARAs show no improvement in reproduction success or question-answering accuracy compared with agents using traditional papers on the same tasks.

Figures

Figures reproduced from arXiv: 2604.24658 by Alex Pentland, Ang Chen, Ao Qu, Baoyu Zhou, Beidi Chen, Carl Chen, Chenglei Si, Chenyu You, Fan Lai, Haizhong Zheng, Haojie Ye, Jiachen Liu, Jiachen Sun, Jianqiao Zeng, Jiaxin Pei, Jintao Huang, John Dianzhuo Wang, Junyuan Hong, Lichang Chen, Maestro Harmon, Mingyuan Wu, Mosharaf Chowdhury, Ruihao Zhu, Runyu Lu, Shangquan Sun, Shijian Lu, Xiangru Tang, Xiaoyan Bai, Yao Li, Yiming Qiu, Yuan Yuan, Yujuan Fu, Zechen Zhang, Zexue He, Zhenyu Zhang, Zhiyang Chen, Zijian Jin.

**Figure 1.** Figure 1: Publishing compiles a rich research object into a lossy narrative (left); ARA preserves the original as a high-fidelity, agentexecutable knowledge package (right). this process into a polished linear story, discarding every failed experiment, rejected hypothesis, and abandoned approach. This emphasis on success leaves failures undocumented; although modern platforms archive final artifacts, the branchin… view at source ↗

**Figure 2.** Figure 2: The Storytelling Tax: research proceeds as a branching tree with dead ends (left), but publication compiles it into a linear narrative (right), discarding all failure knowledge. repositories but does not address the epistemic structure of research itself. None of these efforts jointly structure scientific logic, executable code, and exploration history into a single operable object (§8). We propose the Ag… view at source ↗

**Figure 3.** Figure 3: The reproduction information gap across 8,921 PaperBench requirements. (a) PDFs systematically under-specify code development tasks. (b) The three largest gap types are precisely the categories ARA’s structured layers address. • We identify two structural costs of compiling research into narrative—the Storytelling Tax and the Engineering Tax—and introduce the Agent-Native Research Artifact (ARA): a protoc… view at source ↗

**Figure 6.** Figure 6: The Live Research Manager operates at session boundaries: a three-stage pipeline (Context Harvester → Event Router → Maturity Tracker) distills each researcher–agent conversation into typed events that accumulate across ARA layers over time. We consider an ARA sufficient when a sufficiently capable coding agent can reproduce the core claim zero-shot from it, without human intervention or external context … view at source ↗

**Figure 9.** Figure 9: Three-stage ARA-native review pipeline. Stages 1–2 invoke the ARA Seal levels of view at source ↗

**Figure 10.** Figure 10: The (Human+AI)2 research network. Each researcher works through a research agent that interfaces with a shared ARA network via /submit, /retrieve, and /fork; agents may also collaborate directly. contribution significant—does it address a real problem that matters? Is the core insight genuinely novel, or an incremental recombination of known ideas? Is this the right formulation of the problem, and are t… view at source ↗

**Figure 11.** Figure 11: Aggregate reproduction success rates across all 15 papers, stratified by difficulty. The ARA advantage widens monotonically with difficulty (+4.9% easy, +5.6% medium, +8.5% hard), tracking the tiers where reproduction depends most heavily on configuration content the PDF underspecifies. and completed all medium and hard subtasks; the baseline agent fought the JAX environment and completed only 3 training… view at source ↗

**Figure 12.** Figure 12: Extension trajectories on five RE-Bench tasks under Claude Sonnet 4.6. One task per column: top row is score-vs-wall-clocktime, bottom row is score-vs-cumulative-cost; the y-axis is shared down each column. Faint markers are individual scoring attempts, solid lines are the best-so-far envelope, and stars mark the best attempt; dotted grey lines mark each task’s RE-Bench reference. Arrows in the column ti… view at source ↗

**Figure 13.** Figure 13: Per-paper ARA − baseline delta (percentage points) on each difficulty stratum, sorted by mean advantage. Green indicates ARA wins, red indicates baseline wins. Gains concentrate in the medium and hard columns across most papers; the few baseline wins are confined to a small set, most prominently self-expansion and ftrl. environment and completed only 3 training attempts before exhausting its budget. The… view at source ↗

**Figure 14.** Figure 14: triton_cumsum on Sonnet 4.5: paper vs. ARA score-vs-time (left) and score-vs-cost (right). Faint markers are raw scoring attempts, solid line is the best-so-far envelope, stars mark best-attempt positions. Dotted line is the original RE-Bench reference (0.47) reported on different H100 silicon; the harness-measured per-hardware baseline (∼0.64) is where both arms start. The 4.6 trajectories are in the bod… view at source ↗

**Figure 15.** Figure 15: restricted_mlm on Sonnet 4.5: paper vs. ARA score-vs-time (left) and score-vs-cost (right). Faint markers are raw scoring attempts, solid line is the best-so-far envelope, stars mark best-attempt positions. Dotted lines mark the untrained-MLP baseline (1.84) and the RE-Bench reference (1.13). Both arms are anchored at the 1.84 baseline at t = 0; ARA-4.5’s pre-agent harness baseline crashed (corrupted star… view at source ↗

read the original abstract

Scientific publication compresses a branching, iterative research process into a linear narrative, discarding the majority of what was discovered along the way. This compilation imposes two structural costs: a Storytelling Tax, where failed experiments, rejected hypotheses, and the branching exploration process are discarded to fit a linear narrative; and an Engineering Tax, where the gap between reviewer-sufficient prose and agent-sufficient specification leaves critical implementation details unwritten. Tolerable for human readers, these costs become critical when AI agents must understand, reproduce, and extend published work. We introduce the Agent-Native Research Artifact (ARA), a protocol that replaces the narrative paper with a machine-executable research package structured around four layers: scientific logic, executable code with full specifications, an exploration graph that preserves the failures compilation discards, and evidence grounding every claim in raw outputs. Three mechanisms support the ecosystem: a Live Research Manager that captures decisions and dead ends during ordinary development; an ARA Compiler that translates legacy PDFs and repos into ARAs; and an ARA-native review system that automates objective checks so human reviewers can focus on significance, novelty, and taste. On PaperBench and RE-Bench, ARA raises question-answering accuracy from 72.4% to 93.7% and reproduction success from 57.4% to 64.4%. On RE-Bench's five open-ended extension tasks, preserved failure traces in ARA accelerate progress, but can also constrain a capable agent from stepping outside the prior-run box depending on the agent's capabilities. Our code is open-sourced at https://github.com/Orchestra-Research/Agent-Native-Research-Artifact.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that narrative scientific papers impose a Storytelling Tax and Engineering Tax that hinder AI agents, and introduces the Agent-Native Research Artifact (ARA) as a four-layer machine-executable package (scientific logic, executable code with specifications, exploration graph preserving failures, and evidence). It describes supporting mechanisms including a Live Research Manager, ARA Compiler for legacy conversion, and an automated review system. On the newly introduced PaperBench and RE-Bench, ARA improves agent question-answering accuracy from 72.4% to 93.7% and reproduction success from 57.4% to 64.4%, with mixed observations on five open-ended extension tasks; the implementation is open-sourced.

Significance. If the empirical gains hold under controlled conditions and the four-layer structure proves feasible for typical teams, the work could meaningfully advance agent-assisted research by reducing information loss in publications. The open-sourcing of the code at the provided repository is a clear strength that supports reproducibility and community validation of the protocol.

major comments (2)

[Evaluation section] Evaluation section (and abstract): the headline improvements on PaperBench and RE-Bench are reported without details on agent architectures, prompting methods, baseline configurations, or how ARA packages were constructed for the specific test cases. These controls are load-bearing for attributing the deltas (72.4% → 93.7% QA; 57.4% → 64.4% reproduction) to the ARA structure rather than implementation specifics.
[ARA adoption and extension tasks] Section on ARA adoption and extension tasks: no measurements or ablations are provided for the additional effort required to create and maintain the four-layer structure (scientific logic, code, exploration graph, evidence) via the Live Research Manager, which directly bears on the claim that ARA can replace narrative papers at scale for ordinary research teams.

minor comments (2)

[Abstract and discussion] The abstract and discussion note that preserved failure traces can constrain capable agents on extension tasks, but no quantitative breakdown by agent capability or task type is supplied to support this observation.
[ARA definition] The four-layer ARA structure is described at a high level; a concrete example or diagram showing how a single claim maps across all four layers would improve clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to improve clarity and rigor.

read point-by-point responses

Referee: [Evaluation section] Evaluation section (and abstract): the headline improvements on PaperBench and RE-Bench are reported without details on agent architectures, prompting methods, baseline configurations, or how ARA packages were constructed for the specific test cases. These controls are load-bearing for attributing the deltas (72.4% → 93.7% QA; 57.4% → 64.4% reproduction) to the ARA structure rather than implementation specifics.

Authors: We agree that additional experimental details are required to support attribution of the reported gains to the ARA structure. In the revised manuscript we will expand the Evaluation section (and update the abstract if space permits) with a dedicated subsection describing the agent architectures, base models, tool-use configurations, prompting strategies for both ARA and baseline conditions, baseline implementation details, and the exact procedure used to construct ARA packages from the benchmark test cases. This will include relevant hyperparameters and construction steps to enable independent verification. revision: yes
Referee: [ARA adoption and extension tasks] Section on ARA adoption and extension tasks: no measurements or ablations are provided for the additional effort required to create and maintain the four-layer structure (scientific logic, code, exploration graph, evidence) via the Live Research Manager, which directly bears on the claim that ARA can replace narrative papers at scale for ordinary research teams.

Authors: We acknowledge that the current manuscript does not contain quantitative measurements or ablations of authoring and maintenance effort. The Live Research Manager is designed to integrate into ordinary development workflows with the intent of minimizing overhead, and the manuscript includes qualitative observations from our own development process. In revision we will add an expanded discussion of this design rationale together with explicit plans for future user studies that will quantify effort. We do not have the requested ablations available from the present experiments. revision: partial

Circularity Check

0 steps flagged

No load-bearing circularity; empirical gains measured on newly introduced benchmarks without reduction to self-fitted inputs

full rationale

The paper introduces the ARA protocol with its four-layer structure and reports empirical improvements on the newly defined PaperBench and RE-Bench. These gains (72.4% to 93.7% QA accuracy, 57.4% to 64.4% reproduction success) are presented as direct benchmark measurements rather than outputs of equations or parameters fitted from prior self-citations that would reduce to the inputs by construction. No self-citation load-bearing steps, uniqueness theorems imported from the authors' prior work, or ansatzes smuggled via citation appear in the abstract or described claims. The derivation chain consists of defining a new artifact format and evaluating it on purpose-built benchmarks; while this introduces a minor risk that the benchmarks align closely with ARA features, it does not meet the criteria for specific circular reductions. The paper remains self-contained as an empirical proposal without the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the newly introduced ARA concept and the assumption that structured artifacts reduce the two named taxes; no numerical free parameters are fitted, and the only invented entity is the ARA itself.

axioms (1)

domain assumption Traditional narrative papers systematically discard failed experiments and implementation details that are necessary for agent reproduction.
Stated directly in the opening sentences of the abstract as the Storytelling Tax and Engineering Tax.

invented entities (1)

Agent-Native Research Artifact (ARA) no independent evidence
purpose: Machine-executable replacement for narrative papers that preserves scientific logic, code, exploration graph, and evidence.
Newly defined in this work; no independent evidence outside the paper is provided.

pith-pipeline@v0.9.0 · 5970 in / 1512 out tokens · 61359 ms · 2026-05-20T23:42:33.974572+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Toward an Engineering of Science: Rebalancing Generation and Verification in the Age of AI
cs.CY 2026-05 unverdicted novelty 5.0

AI lowers the cost of generating plausible scientific artifacts without lowering verification costs, so the paper proposes blueprints as typed graph components that decompose claims, evidence, and assumptions to enabl...

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

Whole Tale

URL https://www.wandb.com/. Software available from wandb.com. Boiko, D. A., MacKnight, R., Kline, B., and Gomes, G. Autonomous chemical research with large language mod- els.Nature, 624(7992):570–578, 2023. doi: 10.1038/ s41586-023-06792-0. Booeshaghi, A. S., Luebbert, L., and Pachter, L. Science should be machine-readable.bioRxiv, 2026. doi: 10. 64898/2...

work page doi:10.1016/j.future.2017.12.029 2023
[2]

Information Services and Use30(1–2), 51–56 (2010)

doi: 10.3233/ISU-2010-0613. Hua, T., Hua, H., Xiang, V ., Klieger, B., Truong, S. T., Liang, W., Sun, F.-Y ., and Haber, N. ResearchCodeBench: Benchmarking LLMs on implementing novel machine learning research code.arXiv preprint arXiv:2506.02314, 2025. Huang, M. DecMetrics: Structured claim decomposition scoring for factually consistent LLM outputs.arXiv ...

work page doi:10.3233/isu-2010-0613 2010
[3]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

doi: 10.18653/v1/2020.acl-main.447. Lu, C., Lu, C., Lange, R. T., Foerster, J., Clune, J., and Ha, D. The AI scientist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024. Luo, Y ., Yu, Z., Wang, X., Zhu, Y ., Zhang, N., Wei, L., Du, L., Zheng, D., and Chen, H. What makes AI research replicable? Executable knowle...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2020.acl-main.447 2020
[4]

Medawar, P

doi: 10.1242/dmm.015123. Medawar, P. B. Is the scientific paper a fraud?The Listener, 70:377–378, 1963. Reprinted inThe Strange Case of the Spotted Mice, Oxford University Press, 1996. OpenAI. AGENTS.md: A standard for agent-oriented repository documentation. https://github.com/ openai/agents.md, 2025. Accessed 2026-03-01. Pineau, J., Vincent-Lamarre, P.,...

work page doi:10.1242/dmm.015123 1963
[5]

We evaluate all methods on three task sequences with 10 seeds each

Combinatorial experiment matrix (24.1%).The sin- gle largest category consists of requirements that enumerate which model variant must be run on which dataset, with which configuration, for how many seeds. In PDFs, this combinatorial structure is compressed into a single sentence (“We evaluate all methods on three task sequences with 10 seeds each”) or a ...

work page
[6]

Compute cosine similarity between δi and δmlp,i for layers 0, 2, 4, 6, 8, 10, 12, 14, 16, 18 using 1,199 prompts from RealToxicityPrompts

Evaluation protocol (18.5%).Requirements specify- ingwhichmetric to compute, onwhichtest split, usingwhich evaluation-time configuration (e.g., beam size, number of evaluation episodes, specific layers to probe). These details are often scattered: the metric definition appears in §3, the test split in §4, the evaluation episodes in the appendix, and the l...

work page
[7]

AdamW optimizer with learning rate 5e-6, weight decay 0.01; batch size 64; 6,000 training steps

Hyperparameters (17.2%).Classic training config- uration: learning rates, batch sizes, optimizer parameters, temperature, weight decay, LoRA rank, number of epochs. While these are themost commonly discussedreproduction barrier, they account for only 17% of leaf requirements. In PDFs, hyperparameters are typically consolidated in an ap- pendix table, but ...

work page
[8]

instrumentation

Metric computation and logging (10.4%).Require- ments that the agent mustrecordspecific intermediate quan- tities during runs: loss curves, attention distributions, cost tracking (dollars per 1k questions), episodic returns logged every N steps. This “instrumentation” knowledge is rarely specified in papers—authors implicitly know what to log but do not d...

work page
[9]

After adapting with DPO, the principal component of the residual streams shift in the same direction, and the activation of the toxic vectors decrease

Result interpretation (8.6%).Qualitative claims about what the results shouldshow—directional trends, compar- ative rankings, mechanistic explanations. These carry the highest weight in PaperBench rubrics (weight = 2) because they test whether the agentunderstandsthe results, not just whether the code ran. Examples: • mechanistic-understanding: “After ada...

work page
[10]

CNN has three convolutional layers with 32, 64, and 64 channels and filter sizes of 8, 4, and 3

Architecture specification (5.8%).Layer counts, chan- nel sizes, activation functions, output head structure, em- bedding dimensions. In PDFs, architecture details are split across a figure (showing the high-level diagram), the meth- ods section (describing components in prose), and the ap- pendix (listing dimensions in a table). An agent must men- tally ...

work page
[11]

Output attention: softmax(qK T /√dmodel)·V

Mathematical formulation (4.5%).Specific equations that must be implemented exactly: loss functions, atten- 24 The Last Human-Written Paper: Agent-Native Research Artifacts tion operations, PDE boundary conditions, update rules. In PDFs, equations are referenced by number, but the reader must trace through variable definitions scattered across mul- tiple ...

work page
[12]

Single CNN encoder per policy; new encoder initialized with weights of the previous one

Implementation tricks (4.2%).Non-obvious design choices that distinguish faithful reproduction from naive re-implementation: weight freezing schedules, initialization from prior checkpoints, gradient clipping thresholds, nor- malization details, optimizer switching strategies. These are the hardest items to recover from a PDF because they ap- pear as pare...

work page
[13]

we use the standard train/test split

Data pipeline (3.8%).Dataset acquisition, split ratios, filtering criteria, preprocessing steps, data augmentation, collocation point sampling strategies. These details are often under-specified in papers (“we use the standard train/test split”) or tucked into a single appendix paragraph. Examples: • bbox: “Split GSM8K into 7,473 training and 1,319 test s...

work page
[14]

obvious” and omitted entirely from the paper, yet they are essential for reproduction. Examples: • bbox: “API access configured for davinci-002

Environment and infrastructure (2.9%).Specific API endpoints, model version strings, library versions, sim- ulator names, hardware requirements. These are often as- sumed to be “obvious” and omitted entirely from the paper, yet they are essential for reproduction. Examples: • bbox: “API access configured for davinci-002”; “Code to execute fine-tuning jobs...

work page 2026
[15]

score": X,

The sign pattern (8–2) is itself statistically improbable under the null hypothesis of no difference (p= 0.039 , exact binomial), confirming that the aggregate advantage is not driven by a single outlier paper. F.2. Per-Paper Reproduction Analysis This section provides the detailed per-paper analysis for the reproduction experiment (§7.3). Per-difficulty ...

work page 2025

[1] [1]

Whole Tale

URL https://www.wandb.com/. Software available from wandb.com. Boiko, D. A., MacKnight, R., Kline, B., and Gomes, G. Autonomous chemical research with large language mod- els.Nature, 624(7992):570–578, 2023. doi: 10.1038/ s41586-023-06792-0. Booeshaghi, A. S., Luebbert, L., and Pachter, L. Science should be machine-readable.bioRxiv, 2026. doi: 10. 64898/2...

work page doi:10.1016/j.future.2017.12.029 2023

[2] [2]

Information Services and Use30(1–2), 51–56 (2010)

doi: 10.3233/ISU-2010-0613. Hua, T., Hua, H., Xiang, V ., Klieger, B., Truong, S. T., Liang, W., Sun, F.-Y ., and Haber, N. ResearchCodeBench: Benchmarking LLMs on implementing novel machine learning research code.arXiv preprint arXiv:2506.02314, 2025. Huang, M. DecMetrics: Structured claim decomposition scoring for factually consistent LLM outputs.arXiv ...

work page doi:10.3233/isu-2010-0613 2010

[3] [3]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

doi: 10.18653/v1/2020.acl-main.447. Lu, C., Lu, C., Lange, R. T., Foerster, J., Clune, J., and Ha, D. The AI scientist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024. Luo, Y ., Yu, Z., Wang, X., Zhu, Y ., Zhang, N., Wei, L., Du, L., Zheng, D., and Chen, H. What makes AI research replicable? Executable knowle...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2020.acl-main.447 2020

[4] [4]

Medawar, P

doi: 10.1242/dmm.015123. Medawar, P. B. Is the scientific paper a fraud?The Listener, 70:377–378, 1963. Reprinted inThe Strange Case of the Spotted Mice, Oxford University Press, 1996. OpenAI. AGENTS.md: A standard for agent-oriented repository documentation. https://github.com/ openai/agents.md, 2025. Accessed 2026-03-01. Pineau, J., Vincent-Lamarre, P.,...

work page doi:10.1242/dmm.015123 1963

[5] [5]

We evaluate all methods on three task sequences with 10 seeds each

Combinatorial experiment matrix (24.1%).The sin- gle largest category consists of requirements that enumerate which model variant must be run on which dataset, with which configuration, for how many seeds. In PDFs, this combinatorial structure is compressed into a single sentence (“We evaluate all methods on three task sequences with 10 seeds each”) or a ...

work page

[6] [6]

Compute cosine similarity between δi and δmlp,i for layers 0, 2, 4, 6, 8, 10, 12, 14, 16, 18 using 1,199 prompts from RealToxicityPrompts

Evaluation protocol (18.5%).Requirements specify- ingwhichmetric to compute, onwhichtest split, usingwhich evaluation-time configuration (e.g., beam size, number of evaluation episodes, specific layers to probe). These details are often scattered: the metric definition appears in §3, the test split in §4, the evaluation episodes in the appendix, and the l...

work page

[7] [7]

AdamW optimizer with learning rate 5e-6, weight decay 0.01; batch size 64; 6,000 training steps

Hyperparameters (17.2%).Classic training config- uration: learning rates, batch sizes, optimizer parameters, temperature, weight decay, LoRA rank, number of epochs. While these are themost commonly discussedreproduction barrier, they account for only 17% of leaf requirements. In PDFs, hyperparameters are typically consolidated in an ap- pendix table, but ...

work page

[8] [8]

instrumentation

Metric computation and logging (10.4%).Require- ments that the agent mustrecordspecific intermediate quan- tities during runs: loss curves, attention distributions, cost tracking (dollars per 1k questions), episodic returns logged every N steps. This “instrumentation” knowledge is rarely specified in papers—authors implicitly know what to log but do not d...

work page

[9] [9]

After adapting with DPO, the principal component of the residual streams shift in the same direction, and the activation of the toxic vectors decrease

Result interpretation (8.6%).Qualitative claims about what the results shouldshow—directional trends, compar- ative rankings, mechanistic explanations. These carry the highest weight in PaperBench rubrics (weight = 2) because they test whether the agentunderstandsthe results, not just whether the code ran. Examples: • mechanistic-understanding: “After ada...

work page

[10] [10]

CNN has three convolutional layers with 32, 64, and 64 channels and filter sizes of 8, 4, and 3

Architecture specification (5.8%).Layer counts, chan- nel sizes, activation functions, output head structure, em- bedding dimensions. In PDFs, architecture details are split across a figure (showing the high-level diagram), the meth- ods section (describing components in prose), and the ap- pendix (listing dimensions in a table). An agent must men- tally ...

work page

[11] [11]

Output attention: softmax(qK T /√dmodel)·V

Mathematical formulation (4.5%).Specific equations that must be implemented exactly: loss functions, atten- 24 The Last Human-Written Paper: Agent-Native Research Artifacts tion operations, PDE boundary conditions, update rules. In PDFs, equations are referenced by number, but the reader must trace through variable definitions scattered across mul- tiple ...

work page

[12] [12]

Single CNN encoder per policy; new encoder initialized with weights of the previous one

Implementation tricks (4.2%).Non-obvious design choices that distinguish faithful reproduction from naive re-implementation: weight freezing schedules, initialization from prior checkpoints, gradient clipping thresholds, nor- malization details, optimizer switching strategies. These are the hardest items to recover from a PDF because they ap- pear as pare...

work page

[13] [13]

we use the standard train/test split

Data pipeline (3.8%).Dataset acquisition, split ratios, filtering criteria, preprocessing steps, data augmentation, collocation point sampling strategies. These details are often under-specified in papers (“we use the standard train/test split”) or tucked into a single appendix paragraph. Examples: • bbox: “Split GSM8K into 7,473 training and 1,319 test s...

work page

[14] [14]

obvious” and omitted entirely from the paper, yet they are essential for reproduction. Examples: • bbox: “API access configured for davinci-002

Environment and infrastructure (2.9%).Specific API endpoints, model version strings, library versions, sim- ulator names, hardware requirements. These are often as- sumed to be “obvious” and omitted entirely from the paper, yet they are essential for reproduction. Examples: • bbox: “API access configured for davinci-002”; “Code to execute fine-tuning jobs...

work page 2026

[15] [15]

score": X,

The sign pattern (8–2) is itself statistically improbable under the null hypothesis of no difference (p= 0.039 , exact binomial), confirming that the aggregate advantage is not driven by a single outlier paper. F.2. Per-Paper Reproduction Analysis This section provides the detailed per-paper analysis for the reproduction experiment (§7.3). Per-difficulty ...

work page 2025