The Last Human-Written Paper: Agent-Native Research Artifacts
Pith reviewed 2026-05-20 23:42 UTC · model grok-4.3
The pith
Agent-native research artifacts replace narrative papers with four-layer machine-executable packages to cut information loss for AI agents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that replacing the conventional linear narrative paper with an Agent-Native Research Artifact structured around four layers—scientific logic, executable code with full specifications, an exploration graph that preserves failures and dead ends, and evidence that grounds every claim in raw outputs—removes the storytelling tax and engineering tax. This change allows AI agents to understand, reproduce, and extend published research more effectively, as shown by accuracy rising from 72.4 percent to 93.7 percent on question-answering benchmarks and reproduction success rising from 57.4 percent to 64.4 percent.
What carries the argument
The four-layer Agent-Native Research Artifact (ARA) protocol, supported by a Live Research Manager that records decisions and dead ends, an ARA Compiler that converts legacy PDFs and repositories, and an ARA-native review system that automates objective checks.
If this is right
- AI agents reach higher accuracy when answering questions drawn from published research.
- Reproduction success rates increase because full code specifications and raw evidence are preserved.
- Human reviewers spend less time on implementation details and more time on significance, novelty, and taste.
- Preserved failure traces in the exploration graph can accelerate agent progress on extension tasks or, for some agents, limit exploration outside prior paths.
Where Pith is reading between the lines
- Teams might adopt new development workflows that treat the exploration graph as a first-class output rather than an afterthought.
- If widely adopted, credit and review processes could shift toward evaluating the completeness of the artifact instead of only the final narrative.
- The format could encourage research to be conducted in a machine-first manner from the initial experiments onward.
Load-bearing premise
That typical research teams can create and maintain the four-layer ARA structure without prohibitive extra effort and that the reported benchmark gains will hold for other research tasks.
What would settle it
An experiment in which independent teams produce ARAs for their own papers and independent agents using those ARAs show no improvement in reproduction success or question-answering accuracy compared with agents using traditional papers on the same tasks.
Figures
read the original abstract
Scientific publication compresses a branching, iterative research process into a linear narrative, discarding the majority of what was discovered along the way. This compilation imposes two structural costs: a Storytelling Tax, where failed experiments, rejected hypotheses, and the branching exploration process are discarded to fit a linear narrative; and an Engineering Tax, where the gap between reviewer-sufficient prose and agent-sufficient specification leaves critical implementation details unwritten. Tolerable for human readers, these costs become critical when AI agents must understand, reproduce, and extend published work. We introduce the Agent-Native Research Artifact (ARA), a protocol that replaces the narrative paper with a machine-executable research package structured around four layers: scientific logic, executable code with full specifications, an exploration graph that preserves the failures compilation discards, and evidence grounding every claim in raw outputs. Three mechanisms support the ecosystem: a Live Research Manager that captures decisions and dead ends during ordinary development; an ARA Compiler that translates legacy PDFs and repos into ARAs; and an ARA-native review system that automates objective checks so human reviewers can focus on significance, novelty, and taste. On PaperBench and RE-Bench, ARA raises question-answering accuracy from 72.4% to 93.7% and reproduction success from 57.4% to 64.4%. On RE-Bench's five open-ended extension tasks, preserved failure traces in ARA accelerate progress, but can also constrain a capable agent from stepping outside the prior-run box depending on the agent's capabilities. Our code is open-sourced at https://github.com/Orchestra-Research/Agent-Native-Research-Artifact.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that narrative scientific papers impose a Storytelling Tax and Engineering Tax that hinder AI agents, and introduces the Agent-Native Research Artifact (ARA) as a four-layer machine-executable package (scientific logic, executable code with specifications, exploration graph preserving failures, and evidence). It describes supporting mechanisms including a Live Research Manager, ARA Compiler for legacy conversion, and an automated review system. On the newly introduced PaperBench and RE-Bench, ARA improves agent question-answering accuracy from 72.4% to 93.7% and reproduction success from 57.4% to 64.4%, with mixed observations on five open-ended extension tasks; the implementation is open-sourced.
Significance. If the empirical gains hold under controlled conditions and the four-layer structure proves feasible for typical teams, the work could meaningfully advance agent-assisted research by reducing information loss in publications. The open-sourcing of the code at the provided repository is a clear strength that supports reproducibility and community validation of the protocol.
major comments (2)
- [Evaluation section] Evaluation section (and abstract): the headline improvements on PaperBench and RE-Bench are reported without details on agent architectures, prompting methods, baseline configurations, or how ARA packages were constructed for the specific test cases. These controls are load-bearing for attributing the deltas (72.4% → 93.7% QA; 57.4% → 64.4% reproduction) to the ARA structure rather than implementation specifics.
- [ARA adoption and extension tasks] Section on ARA adoption and extension tasks: no measurements or ablations are provided for the additional effort required to create and maintain the four-layer structure (scientific logic, code, exploration graph, evidence) via the Live Research Manager, which directly bears on the claim that ARA can replace narrative papers at scale for ordinary research teams.
minor comments (2)
- [Abstract and discussion] The abstract and discussion note that preserved failure traces can constrain capable agents on extension tasks, but no quantitative breakdown by agent capability or task type is supplied to support this observation.
- [ARA definition] The four-layer ARA structure is described at a high level; a concrete example or diagram showing how a single claim maps across all four layers would improve clarity for readers.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to improve clarity and rigor.
read point-by-point responses
-
Referee: [Evaluation section] Evaluation section (and abstract): the headline improvements on PaperBench and RE-Bench are reported without details on agent architectures, prompting methods, baseline configurations, or how ARA packages were constructed for the specific test cases. These controls are load-bearing for attributing the deltas (72.4% → 93.7% QA; 57.4% → 64.4% reproduction) to the ARA structure rather than implementation specifics.
Authors: We agree that additional experimental details are required to support attribution of the reported gains to the ARA structure. In the revised manuscript we will expand the Evaluation section (and update the abstract if space permits) with a dedicated subsection describing the agent architectures, base models, tool-use configurations, prompting strategies for both ARA and baseline conditions, baseline implementation details, and the exact procedure used to construct ARA packages from the benchmark test cases. This will include relevant hyperparameters and construction steps to enable independent verification. revision: yes
-
Referee: [ARA adoption and extension tasks] Section on ARA adoption and extension tasks: no measurements or ablations are provided for the additional effort required to create and maintain the four-layer structure (scientific logic, code, exploration graph, evidence) via the Live Research Manager, which directly bears on the claim that ARA can replace narrative papers at scale for ordinary research teams.
Authors: We acknowledge that the current manuscript does not contain quantitative measurements or ablations of authoring and maintenance effort. The Live Research Manager is designed to integrate into ordinary development workflows with the intent of minimizing overhead, and the manuscript includes qualitative observations from our own development process. In revision we will add an expanded discussion of this design rationale together with explicit plans for future user studies that will quantify effort. We do not have the requested ablations available from the present experiments. revision: partial
Circularity Check
No load-bearing circularity; empirical gains measured on newly introduced benchmarks without reduction to self-fitted inputs
full rationale
The paper introduces the ARA protocol with its four-layer structure and reports empirical improvements on the newly defined PaperBench and RE-Bench. These gains (72.4% to 93.7% QA accuracy, 57.4% to 64.4% reproduction success) are presented as direct benchmark measurements rather than outputs of equations or parameters fitted from prior self-citations that would reduce to the inputs by construction. No self-citation load-bearing steps, uniqueness theorems imported from the authors' prior work, or ansatzes smuggled via citation appear in the abstract or described claims. The derivation chain consists of defining a new artifact format and evaluating it on purpose-built benchmarks; while this introduces a minor risk that the benchmarks align closely with ARA features, it does not meet the criteria for specific circular reductions. The paper remains self-contained as an empirical proposal without the enumerated circular patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Traditional narrative papers systematically discard failed experiments and implementation details that are necessary for agent reproduction.
invented entities (1)
-
Agent-Native Research Artifact (ARA)
no independent evidence
Forward citations
Cited by 1 Pith paper
-
Toward an Engineering of Science: Rebalancing Generation and Verification in the Age of AI
AI lowers the cost of generating plausible scientific artifacts without lowering verification costs, so the paper proposes blueprints as typed graph components that decompose claims, evidence, and assumptions to enabl...
Reference graph
Works this paper leans on
-
[1]
URL https://www.wandb.com/. Software available from wandb.com. Boiko, D. A., MacKnight, R., Kline, B., and Gomes, G. Autonomous chemical research with large language mod- els.Nature, 624(7992):570–578, 2023. doi: 10.1038/ s41586-023-06792-0. Booeshaghi, A. S., Luebbert, L., and Pachter, L. Science should be machine-readable.bioRxiv, 2026. doi: 10. 64898/2...
-
[2]
Information Services and Use30(1–2), 51–56 (2010)
doi: 10.3233/ISU-2010-0613. Hua, T., Hua, H., Xiang, V ., Klieger, B., Truong, S. T., Liang, W., Sun, F.-Y ., and Haber, N. ResearchCodeBench: Benchmarking LLMs on implementing novel machine learning research code.arXiv preprint arXiv:2506.02314, 2025. Huang, M. DecMetrics: Structured claim decomposition scoring for factually consistent LLM outputs.arXiv ...
-
[3]
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
doi: 10.18653/v1/2020.acl-main.447. Lu, C., Lu, C., Lange, R. T., Foerster, J., Clune, J., and Ha, D. The AI scientist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024. Luo, Y ., Yu, Z., Wang, X., Zhu, Y ., Zhang, N., Wei, L., Du, L., Zheng, D., and Chen, H. What makes AI research replicable? Executable knowle...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2020.acl-main.447 2020
-
[4]
doi: 10.1242/dmm.015123. Medawar, P. B. Is the scientific paper a fraud?The Listener, 70:377–378, 1963. Reprinted inThe Strange Case of the Spotted Mice, Oxford University Press, 1996. OpenAI. AGENTS.md: A standard for agent-oriented repository documentation. https://github.com/ openai/agents.md, 2025. Accessed 2026-03-01. Pineau, J., Vincent-Lamarre, P.,...
-
[5]
We evaluate all methods on three task sequences with 10 seeds each
Combinatorial experiment matrix (24.1%).The sin- gle largest category consists of requirements that enumerate which model variant must be run on which dataset, with which configuration, for how many seeds. In PDFs, this combinatorial structure is compressed into a single sentence (“We evaluate all methods on three task sequences with 10 seeds each”) or a ...
-
[6]
Evaluation protocol (18.5%).Requirements specify- ingwhichmetric to compute, onwhichtest split, usingwhich evaluation-time configuration (e.g., beam size, number of evaluation episodes, specific layers to probe). These details are often scattered: the metric definition appears in §3, the test split in §4, the evaluation episodes in the appendix, and the l...
-
[7]
AdamW optimizer with learning rate 5e-6, weight decay 0.01; batch size 64; 6,000 training steps
Hyperparameters (17.2%).Classic training config- uration: learning rates, batch sizes, optimizer parameters, temperature, weight decay, LoRA rank, number of epochs. While these are themost commonly discussedreproduction barrier, they account for only 17% of leaf requirements. In PDFs, hyperparameters are typically consolidated in an ap- pendix table, but ...
-
[8]
Metric computation and logging (10.4%).Require- ments that the agent mustrecordspecific intermediate quan- tities during runs: loss curves, attention distributions, cost tracking (dollars per 1k questions), episodic returns logged every N steps. This “instrumentation” knowledge is rarely specified in papers—authors implicitly know what to log but do not d...
-
[9]
Result interpretation (8.6%).Qualitative claims about what the results shouldshow—directional trends, compar- ative rankings, mechanistic explanations. These carry the highest weight in PaperBench rubrics (weight = 2) because they test whether the agentunderstandsthe results, not just whether the code ran. Examples: • mechanistic-understanding: “After ada...
-
[10]
CNN has three convolutional layers with 32, 64, and 64 channels and filter sizes of 8, 4, and 3
Architecture specification (5.8%).Layer counts, chan- nel sizes, activation functions, output head structure, em- bedding dimensions. In PDFs, architecture details are split across a figure (showing the high-level diagram), the meth- ods section (describing components in prose), and the ap- pendix (listing dimensions in a table). An agent must men- tally ...
-
[11]
Output attention: softmax(qK T /√dmodel)·V
Mathematical formulation (4.5%).Specific equations that must be implemented exactly: loss functions, atten- 24 The Last Human-Written Paper: Agent-Native Research Artifacts tion operations, PDE boundary conditions, update rules. In PDFs, equations are referenced by number, but the reader must trace through variable definitions scattered across mul- tiple ...
-
[12]
Single CNN encoder per policy; new encoder initialized with weights of the previous one
Implementation tricks (4.2%).Non-obvious design choices that distinguish faithful reproduction from naive re-implementation: weight freezing schedules, initialization from prior checkpoints, gradient clipping thresholds, nor- malization details, optimizer switching strategies. These are the hardest items to recover from a PDF because they ap- pear as pare...
-
[13]
we use the standard train/test split
Data pipeline (3.8%).Dataset acquisition, split ratios, filtering criteria, preprocessing steps, data augmentation, collocation point sampling strategies. These details are often under-specified in papers (“we use the standard train/test split”) or tucked into a single appendix paragraph. Examples: • bbox: “Split GSM8K into 7,473 training and 1,319 test s...
-
[14]
Environment and infrastructure (2.9%).Specific API endpoints, model version strings, library versions, sim- ulator names, hardware requirements. These are often as- sumed to be “obvious” and omitted entirely from the paper, yet they are essential for reproduction. Examples: • bbox: “API access configured for davinci-002”; “Code to execute fine-tuning jobs...
work page 2026
-
[15]
The sign pattern (8–2) is itself statistically improbable under the null hypothesis of no difference (p= 0.039 , exact binomial), confirming that the aggregate advantage is not driven by a single outlier paper. F.2. Per-Paper Reproduction Analysis This section provides the detailed per-paper analysis for the reproduction experiment (§7.3). Per-difficulty ...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.