arxiv: 2605.12087 · v1 · submitted 2026-05-12 · 💻 cs.AI · cs.MA

Recognition: 2 theorem links

· Lean Theorem

Intermediate Artifacts as First-Class Citizens: A Data Model for Durable Intermediate Artifacts in Agentic Systems

Josh Rosen , Seth Rosen

Authors on Pith no claims yet

Pith reviewed 2026-05-13 04:34 UTC · model grok-4.3

classification 💻 cs.AI cs.MA

keywords intermediate artifactsagentic systemsdata modeldurable staterevisable AI workartifact lineageinspectable outputsmulti-step reasoning

0 comments

The pith

AI agentic systems should store intermediate work products as durable, typed, versioned data objects that later steps and humans can inspect and revise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

AI systems often run loops of reasoning and tool use to produce final outputs such as plans or analyses, yet discard the steps that shaped those outputs. The paper argues that these intermediate results should instead be kept as persistent, structured artifacts that carry explicit type, structure, address, version, dependencies, and authority. Such artifacts differ from private chain-of-thought or chat logs; they are consumable work products like evidence maps, claim structures, and partial syntheses. If the claim holds, downstream agents and people can examine, edit, supersede, or extend prior work without reconstructing lost context. The authors supply a systems data model that defines how these artifacts receive additive or superseding updates and how their lineage maintains state across revisions.

Core claim

The paper presents a data model in which intermediate artifacts become first-class, durable objects: typed, structured, addressable, versioned, dependency-aware, authoritative, and consumable by other computation. These objects receive explicit additive and superseding update semantics with current-state resolution rules. Lineage tracking keeps the artifacts consistent across revisions. The model treats evaluation as a question of maintained-state quality rather than final-output quality alone.

What carries the argument

Durable intermediate artifact: a structured, addressable, versioned data object that represents maintained work products such as evidence maps or plans and carries explicit dependencies and update semantics so downstream computation can consume or revise it directly.

If this is right

Artifact lineage supplies durable intermediate state that survives revisions without requiring regeneration.
Evaluation of agentic systems must measure quality of the preserved artifacts, not only the final output.
Downstream agents can consume and extend specific artifacts directly rather than re-deriving lost context.
Humans gain direct access to edit or supersede individual artifacts instead of rewriting entire final products.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The model could be added to existing agent runtimes to give persistent, queryable memory of reasoning steps.
Collaborative workflows become possible when multiple agents and users edit the same shared artifact set over time.
A practical test would measure whether agents that read prior artifacts produce fewer redundant steps than agents that start from scratch each time.

Load-bearing premise

Final outputs lose important details from earlier steps, and preserving the intermediates as durable objects will improve inspectability and revisability without adding meaningful overhead.

What would settle it

Compare revision time and error rate for a human editor fixing a multi-step agent task when only the final artifact is available versus when the full set of typed, versioned intermediate artifacts and their lineage are also present.

Figures

Figures reproduced from arXiv: 2605.12087 by Josh Rosen, Seth Rosen.

**Figure 1.** Figure 1: A final-answer loop collapses intermediate state into a transcript and answer. An artifact-first system pre [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

read the original abstract

Many AI systems are organized around loops in which models reason, call tools, observe results, and continue until a task is complete. These systems often produce final artifacts such as memos, plans, recommendations, and analyses, while the intermediate work that shaped those outputs remains ephemeral. For multi-step, revisable AI work, final artifacts are often lossy projections over upstream state. We argue that such systems should preserve durable, inspectable intermediate artifacts: typed, structured, addressable, versioned, dependency-aware, authoritative, and consumable by downstream computation. These artifacts are not the model's private chain-of-thought. They are maintained work products such as evidence maps, claim structures, criteria, assumptions, plans, transformation rules, synthesis procedures, unresolved tensions, and partial products that later humans and agents can inspect, revise, supersede, and improve. The contribution is a systems-level data model. We distinguish intermediate artifacts from chat transcripts, memory, hidden chain-of-thought, narration, thinking, and final answers; formalize additive and superseding update semantics with explicit current-state resolution; describe how artifact lineage supports durable intermediate state across revisions; and argue that evaluation must target maintained-state quality, not only final-output quality. The claim is not that artifacts make models smarter. It is that durable intermediate artifacts make AI-generated work more inspectable, revisable, and maintainable over time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

A conceptual data model for durable intermediate artifacts in agent loops that draws useful distinctions but skips any check on storage or query costs.

read the letter

This paper sketches a data model that keeps intermediate work products—evidence maps, plans, partial syntheses, unresolved tensions—as typed, versioned, dependency-aware objects that later agents and humans can inspect and update. It treats these as separate from chain-of-thought, chat logs, or final outputs and gives rules for additive versus superseding changes plus lineage tracking to preserve state across revisions. The core pitch is that final artifacts lose too much upstream information, so making the intermediates durable should improve revisability and maintainability.

Referee Report

1 major / 2 minor

Summary. The paper claims that agentic AI systems organized around reasoning-tool-observation loops often produce final artifacts that are lossy projections of upstream state because intermediate work remains ephemeral. It argues that such systems should instead preserve durable, inspectable intermediate artifacts that are typed, structured, addressable, versioned, dependency-aware, authoritative, and consumable by downstream computation. These artifacts include evidence maps, claim structures, criteria, assumptions, plans, transformation rules, synthesis procedures, unresolved tensions, and partial products. The contribution is a systems-level data model that distinguishes intermediate artifacts from chat transcripts, memory, hidden chain-of-thought, narration, thinking, and final answers; formalizes additive and superseding update semantics with explicit current-state resolution; describes how artifact lineage supports durable intermediate state across revisions; and advocates evaluating maintained-state quality rather than only final-output quality.

Significance. If the distinctions and update semantics hold in practice, the framework could meaningfully shift agentic system design toward greater long-term inspectability, revisability, and maintainability by treating intermediate work products as persistent, queryable entities. The paper's strengths include its self-contained conceptual framework, clear separation of artifact types, and explicit formalization of additive/superseding semantics and lineage mechanics, which provide a reusable vocabulary for future implementations even without empirical validation or machine-checked proofs.

major comments (1)

[Data model and lineage mechanics] The manuscript describes the data model and lineage mechanics at a systems level but supplies neither asymptotic bounds, storage growth estimates under realistic revision patterns, nor any prototype measurements for maintaining the artifact graph in long-running agent loops. This is load-bearing for the central claim that durable intermediates will improve inspectability and maintainability without introducing prohibitive overhead or complexity.

minor comments (2)

The abstract and introduction would benefit from one or two concrete, worked examples contrasting an ephemeral intermediate state with its durable counterpart to illustrate the claimed lossiness of final artifacts.
[Data model description] The terms 'authoritative' and 'consumable' are introduced without explicit definitions or operational criteria, which could lead to inconsistent interpretations when the model is applied.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review and for identifying a point that merits clarification. The manuscript is a conceptual contribution focused on a systems-level data model, distinctions among artifact types, and update/lineage semantics. We address the major comment below and have made a targeted revision to acknowledge practical considerations without altering the paper's scope.

read point-by-point responses

Referee: The manuscript describes the data model and lineage mechanics at a systems level but supplies neither asymptotic bounds, storage growth estimates under realistic revision patterns, nor any prototype measurements for maintaining the artifact graph in long-running agent loops. This is load-bearing for the central claim that durable intermediates will improve inspectability and maintainability without introducing prohibitive overhead or complexity.

Authors: We agree that concrete performance data would strengthen any claim about non-prohibitive overhead. However, the manuscript's central claim concerns the semantic and structural properties that make durable intermediates inspectable, revisable, and authoritative, rather than a performance guarantee for a particular implementation. The data model is deliberately implementation-agnostic: it specifies typed, versioned, dependency-aware entities with additive and superseding update rules, but leaves storage representation, indexing, and retention policies to concrete systems. Lineage mechanics explicitly support superseding and current-state resolution, which in principle permit pruning of obsolete versions and delta-based storage—standard techniques that avoid unbounded growth. Because the paper does not present or evaluate an implementation, we do not supply asymptotic bounds or prototype measurements; doing so would require choosing specific data structures and workloads outside the stated contribution. We have added a brief discussion paragraph noting that storage overhead is governed by implementation choices (e.g., retention policies, garbage collection of superseded artifacts) and that the model's explicit versioning and dependency tracking are intended to make such management feasible rather than to eliminate it. revision: partial

Circularity Check

0 steps flagged

No circularity in conceptual data model proposal

full rationale

The paper is a systems-level conceptual proposal that defines durable intermediate artifacts, distinguishes them from chat transcripts and chain-of-thought, formalizes additive/superseding semantics, and argues for improved inspectability without any equations, derivations, fitted parameters, or predictions. No load-bearing steps reduce to self-definition, self-citation chains, or renamed known results; the framework is self-contained against external benchmarks and contains no internal circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

This is a conceptual proposal. The central claim rests on domain assumptions about the value of preserved intermediate state rather than on fitted parameters or new entities with independent evidence.

axioms (2)

domain assumption Final artifacts in AI systems are lossy projections that discard useful upstream state
Invoked in the opening argument that intermediate work should be preserved.
domain assumption Durable, typed, versioned artifacts will improve inspectability and maintainability
Core premise of the proposed data model and evaluation shift.

invented entities (1)

Durable intermediate artifacts data model no independent evidence
purpose: To formalize preservation, versioning, and consumption of intermediate work products
The model itself is the proposed contribution; no independent evidence or falsifiable predictions are supplied in the abstract.

pith-pipeline@v0.9.0 · 5552 in / 1370 out tokens · 75693 ms · 2026-05-13T04:34:12.816504+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We argue that such systems should preserve durable, inspectable intermediate artifacts: typed, structured, addressable, versioned, dependency-aware, authoritative, and consumable by downstream computation.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Additive versus superseding update semantics... current-state resolution

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 8 internal anchors

[1]

ReAct: Syn- ergizing Reasoning and Acting in Language Models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Thomas Narasimhan, and Yuan Cao. ReAct: Syn- ergizing Reasoning and Acting in Language Models. InInternational Conference on Learning Representations, 2023. 16 Intermediate Artifacts as First-Class Citizens: A Data Model for Durable Intermediate Artifacts in Agentic Systems

work page 2023
[2]

Toolformer: Language Models Can Teach Themselves to Use Tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dess `ı, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language Models Can Teach Themselves to Use Tools. In Advances in Neural Information Processing Systems, 2023

work page 2023
[3]

CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. CAMEL: Communicative Agents for “Mind” Exploration of Large Language Model Society. arXiv:2303.17760, 2023

work page internal anchor Pith review arXiv 2023
[4]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. V oyager: An Open-Ended Embodied Agent with Large Language Models. arXiv:2305.16291, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and J¨urgen Schmid- huber. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. arXiv:2308.00352, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. arXiv:2308.08155, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Show Your Work: Scratchpads for Intermediate Computation with Language Models

Max Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, Samy Bengio, and others. Show Your Work: Scratchpads for Inter- mediate Computation with Language Models. arXiv:2112.00114, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[8]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. InAdvances in Neural Information Processing Systems, 2022

work page 2022
[9]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-Consistency Improves Chain of Thought Reasoning in Language Models. InInternational Conference on Learning Representations, 2023

work page 2023
[10]

Wei, Tianyi Zhou, Xuezhi Wang, Dale Schuurmans, Claire Cui, Hao Bian, and others

Denny Zhou, Nathanael Sch ¨arli, Le Hou, Jason Wei, Jason Y . Wei, Tianyi Zhou, Xuezhi Wang, Dale Schuurmans, Claire Cui, Hao Bian, and others. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. InInternational Conference on Learning Representations, 2023

work page 2023
[11]

Griffiths, Yuan Cao, and Karthik Narasimhan

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. InAdvances in Neural Information Processing Systems, 2023

work page 2023
[12]

Re- flexion: Language Agents with Verbal Reinforcement Learning

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Re- flexion: Language Agents with Verbal Reinforcement Learning. InAdvances in Neural Information Processing Systems, 2023

work page 2023
[13]

Self-Refine: Iterative Refinement with Self-Feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, and others. Self-Refine: Iterative Refinement with Self-Feedback. In Advances in Neural Information Processing Systems, 2023

work page 2023
[14]

Language agent tree search unifies reasoning, acting, and planning in language models

Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models. arXiv:2310.04406, 2023

work page arXiv 2023
[15]

Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. arXiv:2305.04388, 2023

work page arXiv 2023
[16]

Natural-Language Agent Harnesses

Linyue Pan, Lexiao Zou, Shuo Guo, Jingchen Ni, and Hai-Tao Zheng. Natural-Language Agent Harnesses. arXiv:2603.25723, 2026

work page arXiv 2026
[17]

General modular harness for llm agents in multi-turn gaming environments.arXiv preprint arXiv:2507.11633, 2025

Yuxuan Zhang, Haoyang Yu, Lanxiang Hu, Haojian Jin, and Hao Zhang. General Modular Harness for LLM Agents in Multi-Turn Gaming Environments. arXiv:2507.11633, 2025

work page arXiv 2025
[18]

Agent Workflow Memory

Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent Workflow Memory. arXiv:2409.07429, 2024

work page internal anchor Pith review arXiv 2024
[19]

Workflowllm: Enhancing workflow orchestration capability of large language models.arXiv preprint arXiv:2411.05451, 2024

Shengda Fan, Xin Cong, Yuepeng Fu, Zhong Zhang, Shuyan Zhang, Yuanwei Liu, Yesai Wu, Yankai Lin, Zhiyuan Liu, and Maosong Sun. WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Lan- guage Models. arXiv:2411.05451, 2024

work page arXiv 2024
[20]

LEGOMem : Modular procedural memory for multi-agent LLM systems for workflow automation, 2025

Dongge Han, Camille Couturier, Daniel Madrigal Diaz, Xuchao Zhang, Victor R ¨uhle, and Saravan Rajmo- han. LEGOMem: Modular Procedural Memory for Multi-agent LLM Systems for Workflow Automation. arXiv:2510.04851, 2025. 17 Intermediate Artifacts as First-Class Citizens: A Data Model for Durable Intermediate Artifacts in Agentic Systems

work page arXiv 2025
[21]

Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning

Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Hinrich Sch¨utze, V olker Tresp, and Yunpu Ma. Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning. arXiv:2508.19828, 2025

work page internal anchor Pith review arXiv 2025
[22]

Borro, Luiz A

Luiz C. Borro, Luiz A. B. Macarini, Gordon Tindall, Michael Montero, and Adam B. Struck. Memori: A Persistent Memory Layer for Efficient, Context-Aware LLM Agents. arXiv:2603.19935, 2026

work page arXiv 2026
[23]

From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work

Josh Rosen and Seth Rosen. From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work. arXiv:2605.06365, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[24]

Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks

Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. InEuroSys, 2007

work page 2007
[25]

Franklin, Scott Shenker, and Ion Stoica

Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster Com- puting with Working Sets. InUSENIX HotCloud, 2010

work page 2010
[26]

Airflow Documentation: DAGs.https://airflow.apache.org/docs/ apache-airflow/3.0.4/core-concepts/dags.html, accessed May 6, 2026

Apache Software Foundation. Airflow Documentation: DAGs.https://airflow.apache.org/docs/ apache-airflow/3.0.4/core-concepts/dags.html, accessed May 6, 2026

work page 2026
[27]

dbt Developer Hub.https://docs.getdbt.com/, accessed May 6, 2026

dbt Labs. dbt Developer Hub.https://docs.getdbt.com/, accessed May 6, 2026

work page 2026
[28]

Why and Where: A Characterization of Data Prove- nance

Peter Buneman, Sanjeev Khanna, and Wang-Chiew Tan. Why and Where: A Characterization of Data Prove- nance. InInternational Conference on Database Theory, 2001

work page 2001
[29]

Juliana Freire, David Koop, Emanuele Santos, and Claudio T. Silva. Provenance for Computational Tasks: A Survey.Computing in Science & Engineering, 10(3):11–21, 2008

work page 2008
[30]

LLM Agents for Interactive Workflow Provenance: Reference Architecture and Evaluation Methodology

Renan Souza, Timothy Poteet, Brian Etz, Daniel Rosendo, Amal Gueroudji, Woong Shin, Prasanna Balaprakash, and Rafael Ferreira da Silva. LLM Agents for Interactive Workflow Provenance: Reference Architecture and Evaluation Methodology. arXiv:2509.13978, 2025

work page arXiv 2025
[31]

Donald E. Knuth. Literate Programming.The Computer Journal, 27(2):97–111, 1984

work page 1984
[32]

Jupyter Notebooks – a Publishing Format for Reproducible Computational Workflows

Thomas Kluyver, Benjamin Ragan-Kelley, Fernando P´erez, Brian Granger, Matthias Bussonnier, Jonathan Fred- eric, Kyle Kelley, Jessica Hamrick, Jason Grout, Sylvain Corlay, and others. Jupyter Notebooks – a Publishing Format for Reproducible Computational Workflows. InPositioning and Power in Academic Publishing: Players, Agents and Agendas, 2016

work page 2016
[33]

MLIR: Scaling Compiler Infrastructure for Do- main Specific Computation

Chris Lattner, Jacques Pienaar, Mehdi Amini, Uday Bondhugula, River Riddle, Albert Cohen, Tatiana Shpeis- man, Andy Davis, Nicolas Vasilache, and Oleksandr Zinenko. MLIR: Scaling Compiler Infrastructure for Do- main Specific Computation. InInternational Symposium on Code Generation and Optimization, 2021

work page 2021
[34]

Knowledge Graphs.ACM Computing Surveys, 54(4):1–37, 2021

Aidan Hogan, Eva Blomqvist, Michael Cochez, Claudia d’Amato, Gerard de Melo, Claudio Gutierrez, Sabrina Kirrane, Jos Lehmann, Roberto Navigli, Axel Polleres, and others. Knowledge Graphs.ACM Computing Surveys, 54(4):1–37, 2021. 18

work page 2021