Nous: A Predictive World Model for Long-Term Agent Memory

Pranav Singh

arXiv: 2606.22030 · v1 · pith:7BFHVNFJnew · submitted 2026-06-20 · cs.AI · cs.CL · cs.IR · cs.LG

Pith reviewed 2026-06-26 11:46 UTC · model grok-4.3

keywords agent memory · predictive world model · Bayesian belief updating · long-term conversational memory · LoCoMo benchmark · surprise minimization · categorical distributions · belief deltas

The pith

Nous models long-term agent memory as a predictive world model of categorical probability distributions updated by Bayesian surprise rather than stored facts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Nous as an agent memory system that treats knowledge as ongoing prediction instead of static storage. It builds a collection of independent categorical probability distributions, each tied to an entity-attribute pair from the conversation, and revises them with closed-form Bayesian updates driven by the surprise of new observations. The stored record is only the change in belief distribution, not the observation itself, so forgetting appears automatically as the distribution relaxes toward uniformity. Evaluated on the LoCoMo benchmark of ten long conversations using GPT-4o-mini, the system reports F1 scores of 63.50 on single-hop, 55.32 on multi-hop, 58.57 on temporal, and 62.50 on open-domain questions while requiring no external vector store or graph database.

Core claim

Nous maintains a predictive world model consisting of categorical probability distributions called dimensions, one per observed entity-attribute pair. Each new observation updates its dimension through a closed-form Bayesian posterior computed from information-theoretic surprise S = -log2 P(obs | D). The system records only the delta between prior and posterior rather than the fact, lets forgetting emerge as entropy decay to the uniform distribution, and resolves entity identity via mutual information across dimension sets.
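The abstract does not give the closed form, so the following is only a minimal sketch of one update that fits the description, assuming each dimension is stored as Dirichlet pseudo-counts over a fixed, known set of values. The class name `Dimension`, the method `observe`, and the city example are hypothetical and do not come from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class Dimension:
    """Belief over the values of one entity-attribute pair (hypothetical sketch)."""
    counts: dict  # value -> Dirichlet pseudo-count (assumed representation)

    def prob(self, value):
        # Predictive probability P(value | D) under the current counts.
        return self.counts[value] / sum(self.counts.values())

    def surprise(self, value):
        # S = -log2 P(obs | D), measured in bits.
        return -math.log2(self.prob(value))

    def observe(self, value):
        """Score the observation, update the belief, return (surprise, delta)."""
        prior = {v: self.prob(v) for v in self.counts}
        s = self.surprise(value)
        self.counts[value] += 1.0  # conjugate update: add one pseudo-count
        # The delta is the per-value shift from prior to posterior belief.
        delta = {v: self.prob(v) - prior[v] for v in self.counts}
        return s, delta


# Uniform prior over four candidate cities, then one observation of "Paris".
home_city = Dimension({"Paris": 1.0, "London": 1.0, "Berlin": 1.0, "Rome": 1.0})
s, delta = home_city.observe("Paris")
print(round(s, 3), {v: round(d, 3) for v, d in delta.items()})
```

Under these assumptions the first mention of Paris costs 2 bits of surprise and moves 0.15 of probability mass onto Paris, taken evenly from the other three values; a repeated mention is less surprising and produces a smaller delta.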

What carries the argument

dimensions: categorical probability distributions, one per entity-attribute pair, that form the predictive world model and are updated via closed-form Bayesian posterior on surprise

If this is right

The approach yields F1 scores of 63.50 single-hop, 55.32 multi-hop, 58.57 temporal, and 62.50 open-domain on LoCoMo with GPT-4o-mini.
It exceeds the reported numbers of A-MEM in three of four categories and BeliefMem in all four under the stated evaluation conditions.
No external vector database or graph engine is required for operation.
Forgetting and identity resolution arise directly from entropy increase and mutual information without separate modules.
The primary memory artifact is the belief delta rather than any explicit fact representation.
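The forgetting claim can be illustrated with a short sketch. The decay rule used here, mixing the belief with the uniform distribution at a fixed rate, is an assumption chosen for illustration; the page does not state the paper's actual decay schedule, and the names `entropy_bits` and `decay_toward_uniform` are hypothetical.

```python
import math


def entropy_bits(p):
    # Shannon entropy in bits; zero-probability values contribute nothing.
    return -sum(q * math.log2(q) for q in p if q > 0.0)


def decay_toward_uniform(p, rate):
    # Assumed forgetting rule: (1 - rate) * p + rate * uniform.
    k = len(p)
    return [(1.0 - rate) * q + rate / k for q in p]


# A confident belief over four values that receives no further observations.
belief = [0.85, 0.05, 0.05, 0.05]
history = [entropy_bits(belief)]
for _ in range(20):
    belief = decay_toward_uniform(belief, rate=0.2)
    history.append(entropy_bits(belief))

print(round(history[0], 3), round(history[-1], 3))
```

Because entropy is concave and the uniform distribution maximizes it, every mixing step raises the entropy, so an unrefreshed belief drifts toward the 2-bit maximum for four values without any explicit deletion step.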

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If attribute independence holds across typical conversations, the same architecture could be extended to multi-agent settings by sharing dimensions across agents.
The surprise-driven update rule suggests a natural link to active inference agents that select actions to reduce expected surprise.
A direct test would measure whether the stored deltas alone suffice for downstream planning tasks that require reconstructing full conversation histories.
Standardizing the LoCoMo evaluation pipeline would clarify whether the reported gains over concurrent belief-based systems are reproducible.

Load-bearing premise

That maintaining and updating a collection of independent categorical distributions via closed-form Bayesian updates on surprise is sufficient to capture the memory requirements of long multi-turn conversations without additional mechanisms or external storage.

What would settle it

A controlled test set in which performance collapses once questions require tracking statistical dependencies between different entity attributes that the independent dimensions cannot represent.
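A toy case shows what such a test would probe. The sketch below builds a joint distribution in which two attributes of one entity are perfectly correlated, then compares it with the product of its marginals, which is all that independent dimensions can represent. The attribute values and function names are hypothetical.

```python
import math
from itertools import product

# Hypothetical joint belief over (city, employer) for one person:
# the two attributes always change together.
joint = {("NYC", "Acme"): 0.5, ("Austin", "Globex"): 0.5}


def marginal(joint, axis):
    # Sum the joint over the other attribute.
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out


def mutual_information_bits(joint):
    # I(A; B) = sum_ab P(a, b) * log2( P(a, b) / (P(a) * P(b)) )
    pa, pb = marginal(joint, 0), marginal(joint, 1)
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0.0)


city, employer = marginal(joint, 0), marginal(joint, 1)
# What independent per-attribute dimensions can express: the product of marginals.
product_model = {(a, b): city[a] * employer[b] for a, b in product(city, employer)}

print(mutual_information_bits(joint), product_model[("NYC", "Globex")])
```

The true joint carries 1 bit of mutual information and assigns zero probability to the pair (NYC, Globex), while the product of marginals assigns it 0.25. A question whose answer depends on which city goes with which employer is therefore the kind of item on which an independence-only model would be expected to fail.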

Original abstract

We present Nous, a novel agent memory architecture grounded in the principle that knowledge is prediction, not storage. Rather than persisting facts as database records, vector embeddings, or knowledge-graph triples, Nous maintains a predictive world model: a collection of categorical probability distributions, called dimensions, one per entity-attribute pair observed in conversation. Each incoming observation is scored by its information-theoretic surprise S = -log2 P(obs | D), and the distribution is updated via a closed-form Bayesian posterior. The primary stored artifact is the delta, a record of the shift from prior to posterior belief, rather than the fact itself. Forgetting emerges naturally as entropy decay toward the uniform distribution, and identity resolution is handled through mutual information between entity dimension sets. Evaluated on the LoCoMo long-term conversational memory benchmark across ten conversations (1,540 questions) using GPT-4o-mini as backbone, Nous achieves F1 of 63.50 (single-hop), 55.32 (multi-hop), 58.57 (temporal), and 62.50 (open-domain). Against A-MEM's self-reported GPT-4o-mini numbers, Nous shows substantial gains in three of four categories, though we note that independent citations of A-MEM's results disagree with each other on category assignment, a reproducibility issue we discuss openly rather than resolve unilaterally. We additionally compare against BeliefMem, a concurrently developed system built on the same core premise of belief-based rather than deterministic memory; on the same benchmark and backbone, Nous's self-reported numbers exceed BeliefMem's self-reported numbers on all four categories, though we flag several uncontrolled differences between the two evaluation pipelines that prevent this from being a fully controlled comparison. Nous requires no external vector database or graph engine.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Nous replaces stored facts with per-entity-attribute categorical distributions updated by surprise-driven closed-form Bayes, storing only deltas, but the independence assumption looks shaky for the multi-hop and temporal gains it claims.

The punchline is that this paper swaps traditional memory storage for a predictive model of independent categorical distributions, one per observed entity-attribute pair. Incoming observations get scored by surprise S = -log2 P(obs | D), the distribution updates via closed-form Bayesian posterior, and only the delta from prior to posterior gets stored. Forgetting happens through entropy decay to uniform, and identity resolution uses mutual information across dimension sets. That combination is not in the referenced prior work.

The paper does a couple of things right. It openly flags inconsistencies in the A-MEM baseline numbers and uncontrolled differences with the concurrent BeliefMem system instead of papering over them. It also avoids external vector stores and graph engines, which is a practical plus for some deployment settings.

The soft spots are more substantial. The abstract supplies no derivation, pseudocode, or error analysis for the closed-form updates or the mutual-information identity step, so the mechanics are hard to verify. All numbers are self-reported on LoCoMo with GPT-4o-mini, and the claimed gains sit on the very categories (multi-hop 55.32, temporal 58.57) where the independence assumption is most likely to break. If correlated attributes or higher-order context cannot be recovered from the marginal distributions, the architecture will underperform exactly where the paper says it improves. The stress-test concern about dependencies between attributes therefore lands.

This is for people building long-term conversational agents who want to explore prediction-centric memory instead of retrieval. A reader already working on belief-state or information-theoretic agent designs would get the most from the framing. It is coherent enough on its own terms to deserve referee time, even though the current evidence is preliminary and the evaluation needs tighter controls.

Referee Report

3 major / 1 minor

Summary. The paper introduces Nous, an agent memory architecture that maintains a predictive world model consisting of independent categorical probability distributions (one per observed entity-attribute pair, termed 'dimensions'). Observations trigger information-theoretic surprise scoring S = -log2 P(obs | D) followed by closed-form Bayesian updates, with the delta (belief shift) as the primary stored artifact; forgetting occurs via entropy decay to uniform and identity resolution via mutual information across dimension sets. No external vector DB or graph is required. On the LoCoMo benchmark (10 conversations, 1,540 questions) with GPT-4o-mini, it reports F1 scores of 63.50 (single-hop), 55.32 (multi-hop), 58.57 (temporal), and 62.50 (open-domain), claiming gains over A-MEM in three categories and over BeliefMem in all four, while openly noting baseline inconsistencies and uncontrolled evaluation differences.

Significance. If the independence assumption and empirical results prove robust under controlled re-evaluation, the work could offer a lightweight, storage-efficient alternative to embedding or graph-based memory for long-horizon agents, grounded in predictive coding. The explicit discussion of reproducibility issues in baselines is a positive contribution to the literature.

major comments (3)

[Abstract] Abstract: The central empirical claim of substantial gains rests on self-reported F1 numbers whose baselines are flagged by the authors themselves as inconsistent across citations and subject to uncontrolled pipeline differences; this makes the comparative results difficult to interpret without an independent, controlled replication.
[Abstract] Architecture description (throughout): The load-bearing claim that a collection of independent per-(entity,attribute) categorical distributions suffices for multi-hop, temporal, and open-domain reasoning is not accompanied by any analysis, ablation, or derivation showing how attribute correlations or higher-order context can be recovered from the marginals; if the independence assumption fails, the reported advantages on precisely those categories would not follow.
[Abstract] Abstract and evaluation: No pseudocode, closed-form derivation, or error analysis is supplied for the surprise-driven Bayesian update, delta storage, entropy-decay forgetting, or mutual-information identity resolution, leaving the implementation details unverifiable from the manuscript alone.
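For orientation, one standard closed form that matches the abstract's description is the Dirichlet-categorical update. The derivation below is a sketch under that assumption, not the paper's own; the pseudo-counts \alpha_j and the delta notation \Delta_j are introduced here for illustration.

```latex
% Assumed representation: Dirichlet pseudo-counts \alpha_1, ..., \alpha_K over K values;
% the observation is x = k.
\begin{align*}
P(x = k \mid D) &= \frac{\alpha_k}{\alpha_0}, \qquad \alpha_0 = \sum_{j=1}^{K} \alpha_j \\
S &= -\log_2 P(x = k \mid D) = \log_2 \alpha_0 - \log_2 \alpha_k \\
\alpha_j' &= \alpha_j + \mathbf{1}[j = k] \\
\Delta_j &= \frac{\alpha_j'}{\alpha_0 + 1} - \frac{\alpha_j}{\alpha_0}
  = \begin{cases}
      \dfrac{\alpha_0 - \alpha_k}{\alpha_0(\alpha_0 + 1)} = \dfrac{1 - 2^{-S}}{\alpha_0 + 1}, & j = k \\[2ex]
      -\dfrac{\alpha_j}{\alpha_0(\alpha_0 + 1)}, & j \neq k
    \end{cases}
\end{align*}
```

In this form the deltas sum to zero, the gain on the observed value grows with the surprise S, and every delta shrinks as accumulated evidence \alpha_0 grows, which is the kind of property a revised manuscript could state and verify.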

minor comments (1)

[Abstract] The manuscript would benefit from an explicit table or section contrasting the exact evaluation protocol used for Nous versus the cited A-MEM and BeliefMem numbers to clarify the uncontrolled differences mentioned.
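One detail such a protocol table would need to pin down is the scoring function itself. The sketch below assumes the reported F1 is a token-overlap F1 of the kind commonly used for question answering; the page does not state the exact scoring script or normalization, so `token_f1` and its lowercase, whitespace-split tokenization are illustrative assumptions.

```python
from collections import Counter


def token_f1(prediction, reference):
    # Assumed normalization: lowercase and split on whitespace only.
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(token_f1("in Paris", "Paris"), token_f1("Paris", "Paris"))
```

Even in this small example the choice of normalization matters: an answer of "in Paris" against the reference "Paris" scores two thirds rather than 1.0, so differences in answer post-processing between pipelines can shift category-level F1 without any change in the underlying memory system.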

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback and the recommendation for major revision. We address each major comment point-by-point below, with planned changes to the manuscript where appropriate. All responses focus on the substance of the comments.

Referee: [Abstract] Abstract: The central empirical claim of substantial gains rests on self-reported F1 numbers whose baselines are flagged by the authors themselves as inconsistent across citations and subject to uncontrolled pipeline differences; this makes the comparative results difficult to interpret without an independent, controlled replication.

Authors: We agree that the self-reported nature of the comparisons, combined with the noted inconsistencies in baseline citations and uncontrolled pipeline differences, limits the strength of the empirical claims. The manuscript already flags these issues explicitly in the abstract and evaluation sections. In revision, we will further strengthen the caveats in the abstract (e.g., by qualifying the gains as self-reported and subject to evaluation variations) and expand the discussion section with additional analysis of reproducibility challenges in long-term memory benchmarks. We cannot perform an independent controlled replication ourselves but will highlight this as an important direction for future work. revision: yes
Referee: [Abstract] Architecture description (throughout): The load-bearing claim that a collection of independent per-(entity,attribute) categorical distributions suffices for multi-hop, temporal, and open-domain reasoning is not accompanied by any analysis, ablation, or derivation showing how attribute correlations or higher-order context can be recovered from the marginals; if the independence assumption fails, the reported advantages on precisely those categories would not follow.

Authors: The architecture deliberately adopts the independence assumption to enable tractable closed-form Bayesian updates and storage-efficient delta recording without external databases or graphs. We do not provide an explicit derivation or ablation for recovering correlations because the method operates solely on marginal distributions per dimension; higher-order effects are handled implicitly through cross-dimension queries and mutual-information identity resolution. We will add a dedicated limitations subsection discussing the independence assumption, its computational benefits, and potential failure cases for strongly correlated attributes. A full theoretical analysis of correlation recovery is not part of the current contribution. revision: partial
Referee: [Abstract] Abstract and evaluation: No pseudocode, closed-form derivation, or error analysis is supplied for the surprise-driven Bayesian update, delta storage, entropy-decay forgetting, or mutual-information identity resolution, leaving the implementation details unverifiable from the manuscript alone.

Authors: We agree that the absence of these details reduces verifiability. In the revised manuscript, we will add pseudocode for the full pipeline (surprise scoring, Bayesian posterior update, delta storage, entropy-decay forgetting, and mutual-information identity resolution). We will also include the closed-form derivation of the Bayesian update and a brief error analysis section covering approximation assumptions and numerical stability. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark results independent of architecture definition

The paper defines an architecture of per-(entity,attribute) categorical distributions updated via closed-form Bayesian posteriors driven by surprise S = -log2 P(obs | D), with forgetting as entropy decay and identity resolution via mutual information. It then reports F1 scores on the external LoCoMo benchmark (1,540 questions across 10 conversations) using GPT-4o-mini. These scores are measured outcomes of executing the system on held-out questions, not quantities that reduce to the model definition by construction. No self-citations are invoked as load-bearing uniqueness theorems, no fitted parameters are relabeled as predictions, and the central claim (sufficiency of independent marginals for the benchmark tasks) is presented as an empirical hypothesis rather than a definitional tautology. The derivation chain is therefore self-contained against the external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper rests on the domain assumption that knowledge should be represented as prediction rather than storage, introduces the invented entity of per-entity-attribute dimensions, and reports empirical results without additional free parameters visible in the abstract.

axioms (1)

domain assumption: Knowledge is prediction, not storage.
Stated as the grounding principle for the entire architecture.

invented entities (1)

dimensions (no independent evidence)
purpose: Categorical probability distributions maintained for each observed entity-attribute pair
Core storage artifact of the system; no independent evidence provided outside the model definition.

pith-pipeline@v0.9.1-grok · 5848 in / 1292 out tokens · 25128 ms · 2026-06-26T11:46:22.534359+00:00 · methodology

Reference graph

Works this paper leans on

18 extracted references · 9 canonical work pages · 9 internal anchors

[1] Thomas Bayes. An essay towards solving a problem in the doctrine of chances, 1763.

[2] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready AI agents with scalable long-term memory. arXiv preprint arXiv:2504.19413, 2025.

[4] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, 2nd edition, 2006.

[5] Karl Friston. The free-energy principle: A unified brain theory? Nature Reviews Neuroscience, 11(2):127–138, 2010.

[6] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, 1998.

[7] David C. Knill and Alexandre Pouget. The Bayesian brain: The role of uncertainty in neural coding and computation. Trends in Neurosciences, 27(12):712–719, 2004.

[8] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems (NeurIPS), 33, 2020.

[9] Junfeng Liao, Qizhou Wang, Jianing Zhu, Bo Du, Rui Yan, and Xiuying Chen. Belief memory: Agent memory under partial observability. arXiv preprint arXiv:2605.05583, 2026. MBZUAI, RIKEN AIP, UT Austin, Wuhan University.

[10] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172, 2023.

[11] Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents. arXiv preprint arXiv:2402.17753, 2024.

[12] Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. MemGPT: Towards LLMs as operating systems. arXiv preprint arXiv:2310.08560, 2023.

[13] Rajesh P. N. Rao and Dana H. Ballard. Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1):79–87, 1999.

[14] Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef. Zep: A temporal knowledge graph architecture for agent memory. arXiv preprint arXiv:2501.13956, 2025.

[15] Claude E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948.

[16] Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. LongMemEval: Benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813. Accepted at ICLR 2025.

[18] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-MEM: Agentic memory for LLM agents. arXiv preprint arXiv:2502.12110, 2025.

[19] Joshua C. Yang, Damian Dailisan, and Maurice Flechtner. Belief engine: Bayesian memory for configurable opinion dynamics in LLM agents. In ICLR 2026 Workshop on Memory for LLM-Based Agentic Systems (MemAgents), 2026. Distinct from the similarly-titled arXiv:2605.15343 by an overlapping author set, which addresses multi-agent deliberation rather than agen...
