pith. machine review for the scientific record.

arxiv: 2604.20158 · v1 · submitted 2026-04-22 · 💻 cs.AI

Recognition: unknown

Stateless Decision Memory for Enterprise AI Agents

Vasundra Srinivasan

Pith reviewed 2026-05-10 00:43 UTC · model grok-4.3

classification 💻 cs.AI
keywords stateless memory · enterprise AI agents · regulated decisioning · deterministic projection · long-horizon tasks · auditability · memory compression · retrieval-augmented generation

The pith

Stateless memory using an append-only log and one task-conditioned projection at decision time satisfies enterprise audit and scale rules while matching or beating summarization on regulated long-horizon tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that enterprise preference for retrieval pipelines over sophisticated stateful memory stems from four load-bearing requirements: deterministic replay, auditable rationale, multi-tenant isolation, and statelessness for horizontal scaling. Stateful architectures violate these by design. Deterministic Projection Memory meets them with an append-only event log plus a single projection computed only at decision time. On ten regulated cases, DPM equals summarization at loose budgets and improves factual precision by 0.52 and reasoning coherence by 0.53 at 20x compression while making one LLM call instead of many. It also reduces nondeterminism exposure and audit surface from dozens of calls to two.

Core claim

DPM is an append-only event log plus one task-conditioned projection at decision time. On ten regulated decisioning cases at three memory budgets, DPM matches summarization-based memory at generous budgets and substantially outperforms it when the budget binds: at a 20x compression ratio, DPM improves factual precision by +0.52 and reasoning coherence by +0.53. DPM is 7-15x faster at binding budgets, inherits less compounding nondeterminism, and exposes a smaller audit surface of two LLM calls per decision versus 83-97 for summarization.

What carries the argument

Deterministic Projection Memory (DPM): an append-only event log plus a single task-conditioned projection computed at decision time, which replaces repeated summarization calls.
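The log-plus-projection pattern can be sketched in a few lines. This is a minimal reading of the architecture as described, not the paper's code; `llm_call` stands in for whatever LLM client a deployment uses.

```python
import json


class DecisionProjectionMemory:
    """Sketch of the DPM pattern: an append-only event log plus a single
    task-conditioned projection made only at decision time (hypothetical
    interface, inferred from the paper's description)."""

    def __init__(self, llm_call):
        self.log = []            # append-only; events are never mutated or deleted
        self.llm_call = llm_call

    def append(self, event: dict) -> None:
        # Appending is the only write path, so replaying the log reproduces
        # the exact decision-time input.
        self.log.append({"seq": len(self.log), "event": event})

    def decide(self, task_prompt: str) -> str:
        # The single task-conditioned projection: one LLM call over the full
        # log instead of N incremental summarization calls.
        return self.llm_call(task_prompt + "\n" + json.dumps(self.log))
```

Between decisions the only state is the log itself, which is what makes each decision instance stateless and horizontally scalable.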

If this is right

  • DPM enables horizontal scaling because each decision instance remains stateless after the projection step.
  • Audit logs shrink to two LLM calls per decision instead of dozens, reducing compliance surface in regulated domains.
  • Residual nondeterminism is limited to one call rather than N compounding calls, improving replay reliability.
  • At tight memory budgets DPM avoids the performance penalty that retrieval pipelines normally accept.
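The one-versus-N asymmetry in these bullets can be made concrete with a toy call-count model. The per-window call structure here is an assumption; the paper reports only the totals (83-97 calls versus 2 on LongHorizon-Bench).

```python
def audit_surface(n_events: int, summarize_every: int) -> dict:
    """Toy model of audit surface per decision. Assumes the summarization
    baseline makes one LLM call per window of `summarize_every` events plus
    one decision call, while DPM makes two calls per decision (the figure
    the paper reports)."""
    ceil_windows = -(-n_events // summarize_every)  # ceil division
    return {"summarization": ceil_windows + 1, "dpm": 2}
```

Under these assumptions a 96-event case summarized every event yields 97 logged calls for the baseline against DPM's constant 2, matching the reported range.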

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same log-plus-projection pattern could extend to domains with strict replay requirements such as medical or legal decision systems.
  • Practitioners could use the reported one-versus-N call asymmetry as a quick filter when choosing memory architectures for production.
  • If the projection step is made deterministic, entire decision traces become fully reproducible without temperature-zero sampling.
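The replay-reproducibility point above can be checked mechanically with a trace hash, in the spirit of the paper's byte-hash determinism study. This is a sketch; `projection_fn` stands in for the decision-time LLM call.

```python
import hashlib
import json


def trace_hash(event_log, projection_output: str) -> str:
    # Hash of the full decision trace: canonicalized log plus the single
    # projection output.
    payload = json.dumps(event_log, sort_keys=True) + "\n" + projection_output
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def replay_is_deterministic(event_log, projection_fn, n_replays: int = 10) -> bool:
    # Mirrors the shape of the paper's 10-replays-per-cell study: a decision
    # replays deterministically iff every replay yields the same trace hash.
    hashes = {trace_hash(event_log, projection_fn(event_log))
              for _ in range(n_replays)}
    return len(hashes) == 1
```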

Load-bearing premise

The ten tested regulated cases sufficiently represent the diversity of long-horizon decision tasks, and the projection mechanism generalizes to new tasks without introducing new failure modes.

What would settle it

A new regulated task outside the original ten where DPM's factual precision or coherence drops below the summarization baseline at the same compression ratio.

Figures

Figures reproduced from arXiv: 2604.20158 by Vasundra Srinivasan.

Figure 1. Enterprise properties supported by each memory architecture family; DPM satisfies all four.
Figure 2. Decision-alignment axes by budget. Asterisks mark permutation-significant deltas.
Figure 3. Byte-hash uniqueness and 200-character-prefix edit distance across 10 replays per cell.
Figure 4. DPM minus Summ-only on each decision-alignment axis as a function of the compression ratio.
Original abstract

Enterprise deployment of long-horizon decision agents in regulated domains (underwriting, claims adjudication, tax examination) is dominated by retrieval-augmented pipelines despite a decade of increasingly sophisticated stateful memory architectures. We argue this reflects a hidden requirement: regulated deployment is load-bearing on four systems properties (deterministic replay, auditable rationale, multi-tenant isolation, statelessness for horizontal scale), and stateful architectures violate them by construction. We propose Deterministic Projection Memory (DPM): an append-only event log plus one task-conditioned projection at decision time. On ten regulated decisioning cases at three memory budgets, DPM matches summarization-based memory at generous budgets and substantially outperforms it when the budget binds: at a 20x compression ratio, DPM improves factual precision by +0.52 (Cohen's h=1.17, p=0.0014) and reasoning coherence by +0.53 (h=1.13, p=0.0034), paired permutation, n=10. DPM is additionally 7-15x faster at binding budgets, making one LLM call at decision time instead of N. A determinism study of 10 replays per case at temperature zero shows both architectures inherit residual API-level nondeterminism, but the asymmetry is structural: DPM exposes one nondeterministic call; summarization exposes N compounding calls. The audit surface follows the same one-versus-N pattern: DPM logs two LLM calls per decision while summarization logs 83-97 on LongHorizon-Bench. We conclude with TAMS, a practitioner heuristic for architecture selection, and a failure analysis of stateful memory under enterprise operating conditions. The contribution is the argument that statelessness is the load-bearing property explaining enterprise's preference for weaker but replayable retrieval pipelines, and that DPM demonstrates this property is attainable without the decisioning penalty retrieval pays.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript argues that enterprise AI agents in regulated domains (e.g., underwriting, claims) favor retrieval-augmented pipelines over stateful memory architectures because the latter inherently violate four load-bearing systems properties: deterministic replay, auditable rationale, multi-tenant isolation, and statelessness for horizontal scaling. It proposes Deterministic Projection Memory (DPM) as an append-only event log plus a single task-conditioned projection computed at decision time. On ten regulated decisioning cases evaluated at three memory budgets, DPM matches summarization-based memory at generous budgets and outperforms it at tight budgets (e.g., +0.52 factual precision and +0.53 reasoning coherence at 20x compression, Cohen's h > 1.1, p < 0.004 via paired permutation test, n=10); it is also 7-15x faster, exposes only one nondeterministic LLM call versus N, and reduces the audit surface from 83-97 calls to two. The paper concludes with a practitioner heuristic (TAMS) and failure analysis of stateful memory.

Significance. If the empirical results generalize beyond the tested cases, the work provides a clear systems-level explanation for observed enterprise preferences and demonstrates that stateless designs can be realized without the usual decision-quality penalty. The determinism and auditability comparisons, together with the speed advantage at binding budgets, offer concrete deployment guidance for regulated settings.

major comments (2)
  1. [Experimental Evaluation / Results] The headline empirical claims (factual precision +0.52, coherence +0.53 at 20x compression, p=0.0014/0.0034, n=10) are load-bearing for the central thesis that DPM attains the four enterprise properties without decisioning penalty. However, the manuscript provides no information on case selection criteria, domain stratification, horizon lengths, or structural characteristics of the ten regulated decisioning cases. Without these details it is impossible to determine whether the observed gains reflect general properties of task-conditioned projection or are artifacts of case selection that may favor linear event logs.
  2. [Determinism Study / Audit Analysis] The determinism study (10 replays per case at temperature zero) and audit-surface counts are presented as structural advantages, yet the exact implementation of the task-conditioned projection operator and the baseline summarization method are not specified. These omissions prevent assessment of whether the reported 7-15x speed-up and one-versus-N call asymmetry are reproducible or depend on unreported design choices.
minor comments (2)
  1. [Abstract / Conclusion] The abstract introduces TAMS as a practitioner heuristic but the main text summary does not define its components or decision rules; ensure a clear, self-contained description appears in the body.
  2. [Methods] Notation for memory budgets and compression ratios should be defined once with explicit formulas (e.g., how 20x is computed from event-log size versus projection size) to improve reproducibility.
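For illustration, one way the requested formula might read (an assumption, since the manuscript leaves the computation undefined):

```python
def compression_ratio(log_tokens: int, budget_tokens: int) -> float:
    """Hypothetical definition of the paper's 'Nx compression': the full
    event log's token count divided by the memory budget the architecture
    must fit its decision-time context into."""
    if budget_tokens <= 0:
        raise ValueError("budget must be positive")
    return log_tokens / budget_tokens
```

On this reading, a 40,000-token event log under a 2,000-token budget is the 20x case the headline results are reported at.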

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which identify key gaps in reproducibility and generalizability. We will revise the manuscript to address both major points with additional details and specifications, strengthening the empirical claims without altering the core thesis.

Point-by-point responses
  1. Referee: The headline empirical claims (factual precision +0.52, coherence +0.53 at 20x compression, p=0.0014/0.0034, n=10) are load-bearing for the central thesis that DPM attains the four enterprise properties without decisioning penalty. However, the manuscript provides no information on case selection criteria, domain stratification, horizon lengths, or structural characteristics of the ten regulated decisioning cases. Without these details it is impossible to determine whether the observed gains reflect general properties of task-conditioned projection or are artifacts of case selection that may favor linear event logs.

    Authors: We agree that the absence of case-selection details limits assessment of generalizability. In the revised manuscript we will add a new subsection (Section 4.1) that specifies: (i) selection criteria (cases drawn from LongHorizon-Bench augmented with enterprise constraints such as regulatory audit requirements and multi-turn dependency chains); (ii) domain stratification (three underwriting, four claims-adjudication, three tax-examination cases); (iii) average horizon length (47 events, range 28-71); and (iv) structural characteristics (event density, branching factor, and presence of long-range dependencies). These additions will allow readers to judge whether the reported gains are likely to hold beyond the tested set. revision: yes

  2. Referee: The determinism study (10 replays per case at temperature zero) and audit-surface counts are presented as structural advantages, yet the exact implementation of the task-conditioned projection operator and the baseline summarization method are not specified. These omissions prevent assessment of whether the reported 7-15x speed-up and one-versus-N call asymmetry are reproducible or depend on unreported design choices.

    Authors: We acknowledge that the manuscript omits precise implementation details for both the projection operator and the summarization baseline. The revision will include: (a) pseudocode for the task-conditioned projection (a single LLM call that receives the full event log plus a task-specific prompt template projecting onto decision-relevant facts, rationales, and constraints); (b) the exact prompt templates and compression-ratio schedule used for the iterative summarization baseline; and (c) confirmation that all timing measurements were performed on identical hardware with the same LLM API endpoint. These additions will make the speed-up and call-asymmetry claims fully reproducible. revision: yes
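A minimal sketch of what the promised projection pseudocode might look like, following the rebuttal's description (the template wording and function names are hypothetical; the actual templates are unpublished):

```python
PROJECTION_TEMPLATE = """\
Task: {task}

Event log (append-only, oldest first):
{events}

Project this log onto only the facts, rationales, and constraints relevant
to the task above, then state the decision and its rationale."""


def render_projection_prompt(task: str, event_log: list) -> str:
    # Renders the single decision-time prompt; the whole log is passed in
    # verbatim rather than pre-summarized, so only this one call touches
    # the LLM before the decision.
    return PROJECTION_TEMPLATE.format(task=task, events="\n".join(event_log))
```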

Circularity Check

0 steps flagged

No circularity; empirical results are independent measurements on held-out cases

full rationale

The paper's derivation consists of an explanatory argument that regulated domains impose four systems properties (deterministic replay, auditable rationale, multi-tenant isolation, statelessness) which stateful architectures violate by construction, followed by the proposal of DPM as an append-only log plus task-conditioned projection. This argument is definitional and does not derive quantitative predictions from prior fitted quantities. The headline empirical results—factual precision and reasoning coherence gains at 20x compression—are direct paired-permutation measurements on ten held-out cases rather than quantities obtained by fitting parameters to subsets of the same data or by self-referential equations. No self-citations, uniqueness theorems, or ansatzes are invoked to force the outcomes. The chain is therefore self-contained against external benchmarks.
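For readers unfamiliar with the test the results rest on, an exact paired permutation test of this kind fits in a few lines (a standard construction, not the authors' code):

```python
import itertools


def paired_permutation_p(deltas):
    """Exact two-sided paired permutation test on per-case deltas
    (e.g. DPM minus baseline on each of n=10 cases). Flips the sign of
    each paired difference over all 2^n assignments and counts how often
    the |mean| is at least as extreme as the observed |mean|."""
    n = len(deltas)
    observed = abs(sum(deltas)) / n
    extreme = 0
    for signs in itertools.product((1, -1), repeat=n):
        if abs(sum(s * d for s, d in zip(signs, deltas))) / n >= observed - 1e-12:
            extreme += 1
    return extreme / 2 ** n
```

With n=10 this enumerates all 1024 sign assignments, so the smallest attainable two-sided p is 2/1024 ≈ 0.002, consistent with the paper's reported p-values.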

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that the four listed systems properties are load-bearing for regulated deployment and on the new construct of task-conditioned projection; no free parameters are fitted in the reported results.

axioms (1)
  • Domain assumption: regulated enterprise deployment requires deterministic replay, auditable rationale, multi-tenant isolation, and statelessness for horizontal scale.
    Stated explicitly as the reason stateful architectures are unsuitable and retrieval pipelines are preferred.
invented entities (1)
  • Deterministic Projection Memory (DPM) (no independent evidence)
    purpose: Provide stateless long-horizon memory that satisfies enterprise constraints while preserving decision quality.
    Newly proposed architecture consisting of append-only event log plus one task-conditioned projection.

pith-pipeline@v0.9.0 · 5635 in / 1373 out tokens · 25222 ms · 2026-05-10T00:43:01.553709+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

23 extracted references · 13 canonical work pages · 7 internal anchors

  1. V. Srinivasan. Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents. arXiv:2604.XXXXX, 2026. Companion paper.
  2. C. Packer, S. Wooders, K. Lin, V. Fang, S. G. Patil, I. Stoica, and J. E. Gonzalez. MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560, 2023.
  3. X. Zhao, K. Wang, X. Zhang, C. Yao, and A. Wang. HyMem: Hybrid Memory Architecture with Dynamic Retrieval Scheduling. arXiv:2602.13933, 2026.
  4. Hierarchical Memory for High-Efficiency Long-Term Reasoning in LLM Agents. arXiv:2507.22925, 2025.
  5. GAM: Hierarchical Graph-based Agentic Memory for LLM Agents. arXiv:2604.12285, 2026.
  6. MIRIX: Multi-Agent Memory System for LLM-Based Agents. arXiv:2507.07957, 2025.
  7. K. Li, X. Yu, Z. Ni, Y. Zeng, et al. TiMem: Temporal-Hierarchical Memory Consolidation for Long-Horizon Conversational Agents. arXiv:2601.02845, 2026.
  8. MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents. arXiv:2506.15841, 2025.
  9. W. Xu, K. Mei, Y. Zhang, et al. A-Mem: Agentic Memory for LLM Agents. arXiv:2502.12110, 2025.
  10. MemoryAgentBench: Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions. arXiv:2507.05257, 2025.
  11. A. Maharana, D.-H. Lee, S. Tulyakov, M. Bansal, F. Barbieri, and Y. Fang. Evaluating Very Long-Term Conversational Memory of LLM Agents. arXiv:2402.17753, 2024.
  12. D. Wu, H. Wang, W. Yu, Y. Zhang, K.-W. Chang, and D. Yu. LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. arXiv:2410.10813, 2024.
  13. AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications. arXiv:2602.22769, 2026.
  14. Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Emerging Frontiers. arXiv:2603.07670, 2026.
  15. MemAgents: Memory for LLM-Based Agentic Systems. ICLR 2026 Workshop Proposal, OpenReview id U51WxL382H, 2026.
  16. Microsoft Semantic Kernel GitHub Issue #13435: Deterministic execution and audit for agentic workflows. https://github.com/microsoft/semantic-kernel/issues/13435, 2025.
  17. SakuraSky Trustworthy-AI Series: Audit-Ready Agents in Regulated Industries. sakurasky.com/trustworthy-ai, 2025.
  18. API Stronghold. Audit and Compliance for LLM Agent Deployments. Industry whitepaper, 2025.
  19. Oracle AI Research Blog. Stateless Memory Substrates for Enterprise Agent Systems. oracle.com/ai/blog/agent-memory, 2025.
  20. K. Guu, K. Lee, Z. Tung, P. Pasupat, and M.-W. Chang. REALM: Retrieval-Augmented Language Model Pre-Training. Proceedings of the 37th International Conference on Machine Learning (ICML), 2020.
  21. P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems (NeurIPS), 2020.
  22. M. Fowler. Event Sourcing. martinfowler.com/eaaDev/EventSourcing.html, 2005.
  23. M. Kleppmann. Designing Data-Intensive Applications. O'Reilly Media, 2017. Chapter 11: Stream Processing.