pith. machine review for the scientific record.

arxiv: 2605.12087 · v1 · submitted 2026-05-12 · 💻 cs.AI · cs.MA

Intermediate Artifacts as First-Class Citizens: A Data Model for Durable Intermediate Artifacts in Agentic Systems

Pith reviewed 2026-05-13 04:34 UTC · model grok-4.3

classification 💻 cs.AI cs.MA
keywords intermediate artifacts · agentic systems · data model · durable state · revisable AI work · artifact lineage · inspectable outputs · multi-step reasoning

The pith

AI agentic systems should store intermediate work products as durable, typed, versioned data objects that later steps and humans can inspect and revise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

AI systems often run loops of reasoning and tool use to produce final outputs such as plans or analyses, yet discard the steps that shaped those outputs. The paper argues that these intermediate results should instead be kept as persistent, structured artifacts that carry explicit type, structure, address, version, dependencies, and authority. Such artifacts differ from private chain-of-thought or chat logs; they are consumable work products like evidence maps, claim structures, and partial syntheses. If the claim holds, downstream agents and people can examine, edit, supersede, or extend prior work without reconstructing lost context. The authors supply a systems data model that defines how these artifacts receive additive or superseding updates and how their lineage maintains state across revisions.

Core claim

The paper presents a data model in which intermediate artifacts become first-class, durable objects: typed, structured, addressable, versioned, dependency-aware, authoritative, and consumable by other computation. These objects receive explicit additive and superseding update semantics with current-state resolution rules. Lineage tracking keeps the artifacts consistent across revisions. The model treats evaluation as a question of maintained-state quality rather than final-output quality alone.
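
The update semantics named above can be illustrated with a minimal sketch. The paper's formal rules are not reproduced on this page, so the sketch assumes a simple reading: a superseding update discards the prior state, an additive update appends to it, and the current state is resolved by replaying updates in version order. The `Update` record and `resolve_current` function are hypothetical names, not the authors' API.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class Update:
    """One revision to an artifact (hypothetical record shape)."""
    version: int
    kind: Literal["additive", "superseding"]
    items: tuple[str, ...]


def resolve_current(updates: list[Update]) -> list[str]:
    """Replay updates in version order to obtain the current state.

    Assumed rule: a superseding update discards everything before it;
    an additive update appends to whatever is current.
    """
    state: list[str] = []
    for update in sorted(updates, key=lambda u: u.version):
        if update.kind == "superseding":
            state = list(update.items)
        else:
            state.extend(update.items)
    return state


history = [
    Update(1, "superseding", ("claim A",)),
    Update(2, "additive", ("claim B",)),
    Update(3, "superseding", ("claim A, revised",)),
    Update(4, "additive", ("claim C",)),
]
current = resolve_current(history)
print(current)  # ['claim A, revised', 'claim C']
```

Under this reading, versions 1 and 2 remain in the history for inspection but no longer contribute to the resolved state once version 3 supersedes them.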

What carries the argument

Durable intermediate artifact: a structured, addressable, versioned data object that represents maintained work products such as evidence maps or plans and carries explicit dependencies and update semantics so downstream computation can consume or revise it directly.
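
One way to picture such an object is as an immutable record whose revisions produce new versions rather than overwriting old ones. The sketch below is an assumption about shape only: the field names, the `id@vN` address format, and the `authority` string are all hypothetical, chosen to mirror the properties listed above (typed, structured, addressable, versioned, dependency-aware, authoritative).

```python
from dataclasses import dataclass, replace
from typing import Any


@dataclass(frozen=True)
class Artifact:
    """A durable intermediate artifact (illustrative field names)."""
    artifact_id: str                   # stable identity across versions
    artifact_type: str                 # e.g. "evidence_map", "plan"
    version: int
    payload: dict[str, Any]            # structured content
    depends_on: tuple[str, ...] = ()   # addresses of upstream artifact versions
    authority: str = "agent"           # who asserted this version (assumed field)

    @property
    def address(self) -> str:
        """Address of this exact version (assumed 'id@vN' format)."""
        return f"{self.artifact_id}@v{self.version}"

    def revise(self, payload: dict[str, Any], authority: str) -> "Artifact":
        """Return a new version; the old version is left untouched."""
        return replace(self, version=self.version + 1,
                       payload=payload, authority=authority)


evidence = Artifact("evidence-map", "evidence_map", 1, {"sources": ["report-1"]})
plan = Artifact("plan", "plan", 1, {"steps": ["draft memo"]},
                depends_on=(evidence.address,))
plan_v2 = plan.revise({"steps": ["draft memo", "verify figures"]},
                      authority="human:editor")
print(plan.address, plan_v2.address, plan_v2.depends_on)
```

Because the record is frozen, a human edit such as `plan_v2` leaves `plan` available for comparison, which is the property that makes revision history inspectable.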

If this is right

  • Artifact lineage supplies durable intermediate state that survives revisions without requiring regeneration.
  • Evaluation of agentic systems must measure quality of the preserved artifacts, not only the final output.
  • Downstream agents can consume and extend specific artifacts directly rather than re-deriving lost context.
  • Humans gain direct access to edit or supersede individual artifacts instead of rewriting entire final products.
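
The lineage implication can be made concrete with a small sketch: if dependencies are recorded explicitly, a system can work out which downstream artifacts to revisit after one artifact is edited or superseded, instead of regenerating everything. The artifact names and the `downstream_of` helper below are hypothetical, and the dependency graph is assumed to be a plain mapping from each artifact to its upstream inputs.

```python
from collections import deque


def downstream_of(changed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every artifact that transitively depends on `changed`."""
    # Invert the "depends on" edges into "is consumed by" edges.
    dependents: dict[str, list[str]] = {}
    for artifact, upstream in depends_on.items():
        for parent in upstream:
            dependents.setdefault(parent, []).append(artifact)

    # Breadth-first walk over the inverted edges.
    affected: set[str] = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in dependents.get(node, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


lineage = {
    "evidence_map": [],
    "claims": ["evidence_map"],
    "criteria": [],
    "plan": ["claims", "criteria"],
    "memo": ["plan"],
}
affected = downstream_of("evidence_map", lineage)
print(sorted(affected))  # ['claims', 'memo', 'plan']
```

In this example an edit to the evidence map flags the claims, plan, and memo for review, while the independently authored criteria are left alone.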

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The model could be added to existing agent runtimes to give persistent, queryable memory of reasoning steps.
  • Collaborative workflows become possible when multiple agents and users edit the same shared artifact set over time.
  • A practical test would measure whether agents that read prior artifacts produce fewer redundant steps than agents that start from scratch each time.

Load-bearing premise

Final outputs lose important details from earlier steps, and preserving the intermediates as durable objects will improve inspectability and revisability without adding meaningful overhead.

What would settle it

Compare revision time and error rate for a human editor fixing a multi-step agent task when only the final artifact is available versus when the full set of typed, versioned intermediate artifacts and their lineage are also present.

Figures

Figures reproduced from arXiv: 2605.12087 by Josh Rosen, Seth Rosen.

Figure 1: A final-answer loop collapses intermediate state into a transcript and answer. An artifact-first system pre…
Original abstract

Many AI systems are organized around loops in which models reason, call tools, observe results, and continue until a task is complete. These systems often produce final artifacts such as memos, plans, recommendations, and analyses, while the intermediate work that shaped those outputs remains ephemeral. For multi-step, revisable AI work, final artifacts are often lossy projections over upstream state. We argue that such systems should preserve durable, inspectable intermediate artifacts: typed, structured, addressable, versioned, dependency-aware, authoritative, and consumable by downstream computation. These artifacts are not the model's private chain-of-thought. They are maintained work products such as evidence maps, claim structures, criteria, assumptions, plans, transformation rules, synthesis procedures, unresolved tensions, and partial products that later humans and agents can inspect, revise, supersede, and improve. The contribution is a systems-level data model. We distinguish intermediate artifacts from chat transcripts, memory, hidden chain-of-thought, narration, thinking, and final answers; formalize additive and superseding update semantics with explicit current-state resolution; describe how artifact lineage supports durable intermediate state across revisions; and argue that evaluation must target maintained-state quality, not only final-output quality. The claim is not that artifacts make models smarter. It is that durable intermediate artifacts make AI-generated work more inspectable, revisable, and maintainable over time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that agentic AI systems organized around reasoning-tool-observation loops often produce final artifacts that are lossy projections of upstream state because intermediate work remains ephemeral. It argues that such systems should instead preserve durable, inspectable intermediate artifacts that are typed, structured, addressable, versioned, dependency-aware, authoritative, and consumable by downstream computation. These artifacts include evidence maps, claim structures, criteria, assumptions, plans, transformation rules, synthesis procedures, unresolved tensions, and partial products. The contribution is a systems-level data model that distinguishes intermediate artifacts from chat transcripts, memory, hidden chain-of-thought, narration, thinking, and final answers; formalizes additive and superseding update semantics with explicit current-state resolution; describes how artifact lineage supports durable intermediate state across revisions; and advocates evaluating maintained-state quality rather than only final-output quality.

Significance. If the distinctions and update semantics hold in practice, the framework could meaningfully shift agentic system design toward greater long-term inspectability, revisability, and maintainability by treating intermediate work products as persistent, queryable entities. The paper's strengths include its self-contained conceptual framework, clear separation of artifact types, and explicit formalization of additive/superseding semantics and lineage mechanics, which provide a reusable vocabulary for future implementations even without empirical validation or machine-checked proofs.

major comments (1)
  1. [Data model and lineage mechanics] The manuscript describes the data model and lineage mechanics at a systems level but supplies neither asymptotic bounds, storage growth estimates under realistic revision patterns, nor any prototype measurements for maintaining the artifact graph in long-running agent loops. This is load-bearing for the central claim that durable intermediates will improve inspectability and maintainability without introducing prohibitive overhead or complexity.
minor comments (2)
  1. The abstract and introduction would benefit from one or two concrete, worked examples contrasting an ephemeral intermediate state with its durable counterpart to illustrate the claimed lossiness of final artifacts.
  2. [Data model description] The terms 'authoritative' and 'consumable' are introduced without explicit definitions or operational criteria, which could lead to inconsistent interpretations when the model is applied.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for identifying a point that merits clarification. The manuscript is a conceptual contribution focused on a systems-level data model, distinctions among artifact types, and update/lineage semantics. We address the major comment below and have made a targeted revision to acknowledge practical considerations without altering the paper's scope.

Point-by-point responses
  1. Referee: The manuscript describes the data model and lineage mechanics at a systems level but supplies neither asymptotic bounds, storage growth estimates under realistic revision patterns, nor any prototype measurements for maintaining the artifact graph in long-running agent loops. This is load-bearing for the central claim that durable intermediates will improve inspectability and maintainability without introducing prohibitive overhead or complexity.

    Authors: We agree that concrete performance data would strengthen any claim about non-prohibitive overhead. However, the manuscript's central claim concerns the semantic and structural properties that make durable intermediates inspectable, revisable, and authoritative, rather than a performance guarantee for a particular implementation. The data model is deliberately implementation-agnostic: it specifies typed, versioned, dependency-aware entities with additive and superseding update rules, but leaves storage representation, indexing, and retention policies to concrete systems. Lineage mechanics explicitly support superseding and current-state resolution, which in principle permit pruning of obsolete versions and delta-based storage, both standard techniques that avoid unbounded growth. Because the paper does not present or evaluate an implementation, we do not supply asymptotic bounds or prototype measurements; doing so would require choosing specific data structures and workloads outside the stated contribution. We have added a brief discussion paragraph noting that storage overhead is governed by implementation choices (e.g., retention policies, garbage collection of superseded artifacts) and that the model's explicit versioning and dependency tracking are intended to make such management feasible rather than to eliminate it.

    Revision: partial
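
The kind of retention policy the rebuttal gestures at can be sketched briefly. This is an illustration under stated assumptions, not something the paper specifies: versions are plain integers per artifact, the policy keeps the most recent `keep_last` versions, and it also keeps any older version that a retained artifact still depends on. The `prune` function and the `pinned` set are hypothetical names.

```python
def prune(
    versions: dict[str, list[int]],
    pinned: set[tuple[str, int]],
    keep_last: int = 1,
) -> dict[str, list[int]]:
    """Return the versions to retain for each artifact.

    `pinned` holds (artifact_id, version) pairs that other retained
    artifacts still reference and that must therefore survive pruning.
    """
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    retained: dict[str, list[int]] = {}
    for artifact_id, numbers in versions.items():
        ordered = sorted(numbers)
        recent = set(ordered[-keep_last:])
        referenced = {v for v in ordered if (artifact_id, v) in pinned}
        retained[artifact_id] = sorted(recent | referenced)
    return retained


stored = {"evidence_map": [1, 2, 3, 4], "plan": [1, 2]}
# Assumption for the example: the current plan still cites evidence_map v2.
pinned = {("evidence_map", 2)}
retained = prune(stored, pinned, keep_last=1)
print(retained)  # {'evidence_map': [2, 4], 'plan': [2]}
```

With this policy, stored versions per artifact are bounded by `keep_last` plus the number of pinned versions, which is the sense in which explicit dependency tracking makes storage management tractable. Whether that bound is acceptable in long-running loops is exactly the measurement the referee asks for.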

Circularity Check

0 steps flagged

No circularity in conceptual data model proposal

Full rationale

The paper is a systems-level conceptual proposal that defines durable intermediate artifacts, distinguishes them from chat transcripts and chain-of-thought, formalizes additive/superseding semantics, and argues for improved inspectability without any equations, derivations, fitted parameters, or predictions. No load-bearing steps reduce to self-definition, self-citation chains, or renamed known results; the framework is self-contained against external benchmarks and contains no internal circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

This is a conceptual proposal. The central claim rests on domain assumptions about the value of preserved intermediate state rather than on fitted parameters or new entities with independent evidence.

axioms (2)
  • domain assumption: Final artifacts in AI systems are lossy projections that discard useful upstream state
    Invoked in the opening argument that intermediate work should be preserved.
  • domain assumption: Durable, typed, versioned artifacts will improve inspectability and maintainability
    Core premise of the proposed data model and evaluation shift.
invented entities (1)
  • Durable intermediate artifacts data model (no independent evidence)
    purpose: To formalize preservation, versioning, and consumption of intermediate work products
    The model itself is the proposed contribution; no independent evidence or falsifiable predictions are supplied in the abstract.

pith-pipeline@v0.9.0 · 5552 in / 1370 out tokens · 75693 ms · 2026-05-13T04:34:12.816504+00:00 · methodology
