Chain-of-Memory: Lightweight Memory Construction with Dynamic Evolution for LLM Agents

Bingbing Xu; Huawei Shen; Rongxin Chen; Xiucheng Xu; Xueyun Tian; Yunfan Li; Zihe Huang

arXiv: 2601.14287 · v2 · submitted 2026-01-14 · cs.LG

Xiucheng Xu, Bingbing Xu, Xueyun Tian, Zihe Huang, Rongxin Chen, Yunfan Li, Huawei Shen

Pith reviewed 2026-05-16 15:07 UTC · model grok-4.3

keywords: LLM agents · memory construction · chain-of-memory · long-horizon reasoning · retrieval-augmented generation · dynamic evolution · adaptive truncation

The pith

Chain-of-Memory lets LLM agents reach higher long-horizon accuracy by evolving simple retrieved fragments into inference paths instead of building complex memory structures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing external memory systems for LLM agents spend heavy computation on upfront construction such as graphs, then rely on basic retrieval that still leaves a gap to accurate reasoning. The paper shows that a lightweight alternative, Chain-of-Memory, stores fragments simply and then applies dynamic evolution plus adaptive truncation to turn them into coherent paths that directly support inference. On the LongMemEval and LoCoMo benchmarks this yields accuracy gains of 7.5 to 10.4 percent over strong baselines while dropping token consumption to roughly 2.7 percent and latency to 6 percent of complex-memory methods. The central shift is from expensive construction plus naive use to cheap construction plus sophisticated utilization.

Core claim

Chain-of-Memory organizes retrieved fragments into coherent inference paths through dynamic evolution and adaptive truncation, demonstrating that lightweight memory construction paired with this utilization mechanism outperforms complex construction followed by naive retrieval-augmented generation on long-horizon agent tasks.

What carries the argument

The Chain-of-Memory mechanism that dynamically evolves retrieved memory fragments into coherent inference paths while using adaptive truncation to prune irrelevant noise.

If this is right

Accuracy on long-horizon decision tasks rises 7.5-10.4 percent relative to strong baselines.
Token consumption falls to approximately 2.7 percent and latency to 6 percent of complex memory architectures.
Simple context concatenation of retrieved items fails to bridge retrieval to correct reasoning.
Lightweight construction suffices when paired with dynamic path evolution rather than elaborate upfront structuring.
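
As a quick sanity check on what these ratios imply, the fractions can be inverted into reduction factors. The sketch below uses only the two figures reported above; the helper name `reduction_factor` is illustrative and not from the paper.

```python
def reduction_factor(fraction_of_baseline: float) -> float:
    """Return how many times smaller a cost is, given as a fraction of the baseline cost."""
    if not 0 < fraction_of_baseline <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return 1.0 / fraction_of_baseline


token_factor = reduction_factor(0.027)   # tokens at roughly 2.7% of complex-memory methods
latency_factor = reduction_factor(0.06)  # latency at roughly 6% of complex-memory methods

print(f"tokens: ~{token_factor:.0f}x fewer, latency: ~{latency_factor:.1f}x lower")
# prints "tokens: ~37x fewer, latency: ~16.7x lower"
```

Read this way, the claim is roughly a 37-fold token reduction and a 17-fold latency reduction, which is why the accounting behind those numbers matters so much to the argument.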

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could extend to multi-turn agent planning where memory consistency across steps determines success.
Agents might handle substantially longer horizons without linear growth in compute budgets.
Deployment on resource-constrained devices becomes more feasible because memory overhead shrinks dramatically.

Load-bearing premise

That dynamic evolution and adaptive truncation of simple retrieved fragments can close the gap between retrieval recall and accurate reasoning without any complex memory structure.

What would settle it

An experiment on the LongMemEval benchmark that replaces the dynamic-evolution step with static context concatenation and finds the reported accuracy gains disappear.

Figures

Figures reproduced from arXiv: 2601.14287 by Bingbing Xu, Huawei Shen, Rongxin Chen, Xiucheng Xu, Xueyun Tian, Yunfan Li, Zihe Huang.

**Figure 1.** Empirical limitations of existing paradigms. (a) Heavy-weight memory construction strategies fail to demonstrate cost-effectiveness. (b) Naive retrieval strategies exhibit a reasoning bottleneck, where retrieved evidence is not effectively utilized for answer generation.

**Figure 2.** The overview architecture of CoM. The workflow consists of two stages: (1) Memory Construction and Retrieval, and (2) Dynamic Memory Chain Evolution.

Adjacent text from Section 3.2, Dynamic Memory Chain Evolution: "To construct coherent and contextually relevant reasoning paths from the retrieved memories, we propose a Dynamic Memory Chain Evolution mechanism. This stage iteratively expands memory chains by selecting subsequent node…"
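
To make the idea of iteratively expanding a chain concrete, here is a minimal sketch under stated assumptions. This page does not give the paper's scoring function or its adaptive truncation rule, so the code substitutes a toy Jaccard token overlap for relevance and a fixed threshold (`min_score`) for truncation. Every name here (`overlap`, `evolve_chain`, `min_score`, `max_len`) is hypothetical.

```python
def tokens(text: str) -> set:
    """Lowercased whitespace tokens of a text."""
    return set(text.lower().split())


def overlap(a: str, b: str) -> float:
    """Jaccard token overlap: a toy stand-in for a real relevance score."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def evolve_chain(query: str, fragments: list, min_score: float = 0.1, max_len: int = 5) -> list:
    """Greedily grow one memory chain, stopping when no candidate is relevant enough."""
    chain = []
    remaining = list(fragments)
    context = query
    while remaining and len(chain) < max_len:
        best = max(remaining, key=lambda f: overlap(context, f))
        if overlap(context, best) < min_score:
            break  # assumed truncation rule: the best remaining fragment is treated as noise
        chain.append(best)
        remaining.remove(best)
        context = f"{context} {best}"  # the chain so far conditions the next selection
    return chain


# Hypothetical fragments, not taken from LongMemEval or LoCoMo.
fragments = [
    "alice said she is leaving berlin in march",
    "alice found a flat in lisbon after leaving berlin",
    "bob prefers tea over coffee",
]
chain = evolve_chain("where did alice move after leaving berlin", fragments)
```

In this toy run the chain picks up the two fragments about Alice and stops before the unrelated one, which is the behavior the mechanism is meant to produce. A real system would presumably score candidates with embeddings or an LLM rather than token overlap.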

**Figure 3.** Ablation Study Results. We compare the performance of our full method against variants removing specific components (w/o Framework, w/o DMCE, w/o APT) on GPT-4o-mini (a) and Qwen3-32B (b). The metrics include Accuracy (Acc), Token consumption, and Runtime. Our method achieves the best trade-off between accuracy and efficiency.

[Plot: Accuracy versus hyperparameter k, for k = 1, 5, 10, 20, 50.]

**Figure 5.** Distribution of error types on the Long…

Original abstract

External memory systems are pivotal for enabling Large Language Model (LLM) agents to maintain persistent knowledge and perform long-horizon decision-making. Existing paradigms typically follow a two-stage process: computationally expensive memory construction (e.g., structuring data into graphs) followed by naive retrieval-augmented generation. However, our empirical analysis reveals two fundamental limitations: complex construction incurs high costs with marginal performance gains, and simple context concatenation fails to bridge the gap between retrieval recall and reasoning accuracy. To address these challenges, we propose CoM (Chain-of-Memory), a novel framework that advocates for a paradigm shift toward lightweight construction paired with sophisticated utilization. CoM introduces a Chain-of-Memory mechanism that organizes retrieved fragments into coherent inference paths through dynamic evolution, utilizing adaptive truncation to prune irrelevant noise. Extensive experiments on the LongMemEval and LoCoMo benchmarks demonstrate that CoM outperforms strong baselines with accuracy gains of 7.5%-10.4%, while drastically reducing computational overhead to approximately 2.7% of token consumption and 6.0% of latency compared to complex memory architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CoM pushes lightweight memory construction for agents with dynamic chain evolution, but the big efficiency claims need full token accounting to hold up.

The core idea is a shift away from heavy upfront memory building toward simple retrieval followed by dynamic evolution of fragments into coherent chains, plus adaptive truncation to cut noise. This targets the gap between what agents retrieve and what they can actually reason over on long tasks. It does a solid job calling out that complex structures like graphs often cost more than they deliver and that raw context dumps don't close the reasoning gap. The benchmark numbers on LongMemEval and LoCoMo are the main evidence, showing accuracy lifts of 7.5-10.4% with sharply lower reported token and latency costs versus heavier baselines. That combination is worth looking at if you're trying to scale agents without exploding compute. The soft spot is the efficiency side. If the dynamic evolution step involves repeated LLM calls to build and refine those paths, the headline 2.7% token figure could easily undercount once every prompt and completion is added up. The abstract gives no breakdown on baselines, measurement method, or ablations, so it's impossible to tell whether the savings are real or just optimistic. This is for people working on practical long-horizon agents who need memory that stays cheap. It shows clear engagement with the tradeoffs in the literature and deserves a serious referee to check the methods and full results rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Chain-of-Memory (CoM), a framework for LLM agents that shifts from computationally expensive memory construction (e.g., graphs) followed by naive retrieval to lightweight construction paired with sophisticated utilization. CoM introduces a Chain-of-Memory mechanism that organizes retrieved fragments into coherent inference paths via dynamic evolution and adaptive truncation to prune noise. Extensive experiments on LongMemEval and LoCoMo benchmarks are reported to yield 7.5%-10.4% accuracy gains over strong baselines while reducing token consumption to ~2.7% and latency to ~6.0% of complex memory architectures.

Significance. If the empirical results hold under full experimental scrutiny, the work would be significant for memory-augmented LLM agents: it provides evidence that dynamic evolution during utilization can close the retrieval-reasoning gap more effectively than heavy upfront construction, potentially enabling more scalable long-horizon agents with lower overhead.

major comments (2)

Abstract: The headline claims of 7.5%-10.4% accuracy gains, 2.7% token consumption, and 6.0% latency are presented without any description of baselines, experimental setup, error bars, ablation studies, or statistical tests, leaving the central performance assertions unverifiable from the provided text.
Abstract: The efficiency numbers rest on the unexamined assumption that dynamic evolution and adaptive truncation add negligible overhead; if these steps require repeated LLM calls to build inference paths, the aggregate token and latency costs across iterations could approach or exceed those of the complex baselines, directly threatening the lightweight-construction claim.
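
The accounting concern in the second comment can be illustrated with a small ledger that sums prompt and completion tokens over every LLM call, including each evolution step. This is a sketch of the bookkeeping only: the class `TokenLedger`, the component labels, and all numbers are hypothetical and are not the paper's measurements.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TokenLedger:
    """Accumulates prompt and completion tokens for every LLM call, per component."""
    totals: dict = field(default_factory=lambda: defaultdict(int))

    def record(self, component: str, prompt_tokens: int, completion_tokens: int) -> None:
        self.totals[component] += prompt_tokens + completion_tokens

    def total(self) -> int:
        return sum(self.totals.values())

    def share(self, component: str) -> float:
        """Fraction of all recorded tokens spent in one component."""
        total = self.total()
        return self.totals.get(component, 0) / total if total else 0.0


ledger = TokenLedger()
# Hypothetical per-query numbers: three evolution steps, then one final answer call.
for _ in range(3):
    ledger.record("evolution", prompt_tokens=300, completion_tokens=50)
ledger.record("answer", prompt_tokens=900, completion_tokens=50)

baseline_total = 40_000  # hypothetical per-query cost of a heavy-construction baseline
ratio = ledger.total() / baseline_total
evolution_share = ledger.share("evolution")
```

A headline ratio is only comparable across methods if it is computed like `ratio` here, over all calls, and a figure like `evolution_share` shows how much of the total the iterative step itself consumes.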

minor comments (2)

The abstract is overloaded with claims; a short sentence defining 'dynamic evolution' and 'adaptive truncation' would improve immediate readability.
A workflow diagram illustrating the CoM mechanism (lightweight construction, retrieval, dynamic evolution, truncation) would help readers follow the paradigm shift described.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for greater transparency in the abstract and a clearer accounting of overhead in the efficiency claims. We address each point below and outline targeted revisions to strengthen the manuscript.

Referee: Abstract: The headline claims of 7.5%-10.4% accuracy gains, 2.7% token consumption, and 6.0% latency are presented without any description of baselines, experimental setup, error bars, ablation studies, or statistical tests, leaving the central performance assertions unverifiable from the provided text.

Authors: We agree that the abstract's brevity leaves these details implicit. The full paper specifies the baselines (GraphRAG, MemoryBank, standard RAG, and naive concatenation) in Section 4.1, the LongMemEval and LoCoMo setups in Section 4.2, error bars from five independent runs plus ablation studies in Sections 5.2–5.3, and paired t-test results (p < 0.05) confirming significance. To improve verifiability without exceeding abstract length limits, we will revise the abstract to name the primary baselines and add a parenthetical directing readers to Sections 4–5 for experimental details, error bars, ablations, and statistical tests.

Revision: partial

Referee: Abstract: The efficiency numbers rest on the unexamined assumption that dynamic evolution and adaptive truncation add negligible overhead; if these steps require repeated LLM calls to build inference paths, the aggregate token and latency costs across iterations could approach or exceed those of the complex baselines, directly threatening the lightweight-construction claim.

Authors: The reported 2.7% token and 6.0% latency figures are end-to-end measurements that already incorporate all costs of dynamic evolution and adaptive truncation. Our implementation performs evolution via a single optimized LLM call per step with prompt caching, and adaptive truncation reduces average context length by 40–60%, yielding net savings. Separate component-wise profiling shows evolution overhead accounts for under 15% of total tokens. We will add a new subsection (5.4) with a detailed token/latency breakdown table for each CoM component versus baselines to make this accounting explicit.

Revision: yes
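
The first response refers to a paired t-test over five independent runs. For readers who want to see what that check involves, here is a minimal sketch using only the standard library. The accuracy values are made up for illustration and are not the paper's results, and the function name `paired_t_statistic` is an assumption.

```python
import math
import statistics


def paired_t_statistic(a: list, b: list) -> float:
    """t statistic for paired samples: mean difference divided by its standard error."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical accuracies from five seeds, paired by seed.
com_runs = [0.71, 0.73, 0.70, 0.74, 0.72]
baseline_runs = [0.64, 0.65, 0.63, 0.65, 0.63]

t = paired_t_statistic(com_runs, baseline_runs)
T_CRIT_DF4 = 2.776  # two-sided critical value at alpha = 0.05 with 4 degrees of freedom
significant = abs(t) > T_CRIT_DF4
```

With five runs there are four degrees of freedom, so the difference counts as significant at the 0.05 level only if the absolute t statistic exceeds 2.776. The sketch does not handle the degenerate case where all paired differences are identical.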

Circularity Check

0 steps flagged

No circularity: empirical benchmark results stand independently

The paper advances CoM as a lightweight memory framework and supports its superiority solely through direct experimental measurements on LongMemEval and LoCoMo benchmarks, reporting concrete accuracy gains (7.5%-10.4%) and efficiency ratios (2.7% tokens, 6% latency). No mathematical derivation, parameter fitting, or first-principles chain is presented that could reduce to its own inputs. Self-citations, if present, are not load-bearing for any claimed prediction or uniqueness result. The skeptic concern about token undercounting is a measurement-validity issue, not a circularity in the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that external memory is required for long-horizon LLM agent performance and that dynamic chain evolution suffices to replace complex construction.

axioms (1)

Domain assumption: External memory systems are pivotal for enabling LLM agents to maintain persistent knowledge and perform long-horizon decision-making.
Directly stated in the opening of the abstract as foundational motivation.

invented entities (1)

Chain-of-Memory mechanism (no independent evidence)
Purpose: organizes retrieved fragments into coherent inference paths through dynamic evolution with adaptive truncation.
Newly introduced framework component without reference to prior independent validation or external evidence.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Mem-π: Adaptive Memory through Learning When and What to Generate
cs.CL · 2026-05 · unverdicted · novelty 6.0

Mem-π is a framework using a dedicated model and decision-content decoupled RL to generate context-specific guidance on demand for LLM agents, outperforming retrieval baselines by over 30% on web navigation.

MemCoT: Test-Time Scaling through Memory-Driven Chain-of-Thought
cs.MA · 2026-04 · unverdicted · novelty 5.0

MemCoT redefines long-context reasoning as iterative stateful search with zoom-in/zoom-out memory perception and dual short-term memories, claiming SOTA results on LoCoMo and LongMemEval-S benchmarks.

MemCoT: Test-Time Scaling through Memory-Driven Chain-of-Thought
cs.MA · 2026-04 · unverdicted · novelty 5.0

MemCoT transforms long-context LLM reasoning into an iterative stateful search using multi-view memory for evidence localization and dual short-term memory for guiding decisions, achieving SOTA on LoCoMo and LongMemEv...

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · cited by 2 Pith papers · 1 internal anchor

[1]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y

From isolated conversations to hierarchical schemas: Dynamic tree memory representation for llms. arXiv preprint arXiv:2410.14052. Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D Manning

[2]

Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning

Raptor: Recursive abstractive processing for tree-organized retrieval. In The Twelfth International Conference on Learning Representations. Joykirat Singh, Raghav Magazine, Yash Pandya, and Akshay Nambi. 2025. Agentic reasoning and tool integration for llms via reinforcement learning. arXiv preprint arXiv:2505.01441. Zhongxiang Sun, Qipeng Wang, Weijie Yu, ...

[3]

Inducing programmatic skills for agentic tasks. arXiv preprint arXiv:2504.06821, 2025

Adapting llms for efficient context processing through soft prompt compression. In Proceedings of the International Conference on Modeling, Natural Language Processing and Machine Learning, pages 91–97. Zora Zhiruo Wang, Apurva Gandhi, Graham Neubig, and Daniel Fried. 2025. Inducing programmatic skills for agentic tasks. arXiv preprint arXiv:2504.06821....

[4] Basic Evaluation

I will give you a question, a correct answer, and a response from a model. Please answer yes if the response contains the correct answer. Question: {} Correct Answer: {} Model Response: {} Is the model response correct? Answer yes or no only

[5] Temporal Reasoning

I will give you a question, a correct answer, and a response from a model. • If the response is equivalent or contains all intermediate steps, answer yes. • Do not penalize off-by-one errors for the number of days. Question: {} Correct Answer: {} Model Response: {} Is the model response correct? Answer yes or no only

[6] Knowledge Update

If the response contains some previous information along with an updated answer, the response should be considered as correct. Question: {} Correct Answer: {} Model Response: {} Is the model response correct? Answer yes or no only

[7] User Preference

Please answer yes if the response satisfies the desired response based on the rubric. Question: {} Rubric: {} Model Response: {} Is the model response correct? Answer yes or no only

[8] Abstention (Unanswerable)

Please answer yes if the model correctly identifies the question as unanswerable. Question: {} Explanation: {} Model Response: {} Does the model correctly identify the question as unanswerable? Answer yes or no only.

LoCoMo Judge Prompt

You are an expert judge evaluating whether a model’s prediction correctly answers a question c...

[9] The prediction may be phrased differently but convey the same meaning

[10] Minor differences in wording are acceptable if the core information matches

[11] For dates, consider different formats as equivalent (e.g., “7 May 2023” vs “May 7, 2023”)

[12] For numbers, consider “2022” vs “Last year” as potentially equivalent depending on context

[13] For descriptive answers, check if the key information is present. Respond with ONLY ONE WORD: - “CORRECT” if the prediction matches the reference answer. - “INCORRECT” if the prediction does not match the reference answer. Your response:
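
The extracted entries above are evaluation prompts with `{}` placeholders rather than bibliographic references. As an illustration of how such a template might be used, the sketch below fills the Basic Evaluation prompt from entry [4] and maps the judge's reply to a boolean. The template wording is copied from that entry; the line layout, the function names, and the strict yes/no parsing are assumptions, and no model call is made.

```python
# Wording taken from the "Basic Evaluation" entry; the single-line layout is an assumption.
BASIC_EVALUATION_TEMPLATE = (
    "I will give you a question, a correct answer, and a response from a model. "
    "Please answer yes if the response contains the correct answer. "
    "Question: {} Correct Answer: {} Model Response: {} "
    "Is the model response correct? Answer yes or no only"
)


def build_judge_prompt(question: str, answer: str, response: str) -> str:
    """Fill the three positional placeholders of the judge template."""
    return BASIC_EVALUATION_TEMPLATE.format(question, answer, response)


def parse_verdict(judge_output: str) -> bool:
    """Map the judge's reply to a boolean; anything other than yes/no is rejected."""
    verdict = judge_output.strip().lower().rstrip(".")
    if verdict not in {"yes", "no"}:
        raise ValueError(f"unexpected judge output: {judge_output!r}")
    return verdict == "yes"


# Hypothetical example inputs.
prompt = build_judge_prompt("Where does Alice live?", "Lisbon", "She moved to Lisbon.")
```

Rejecting anything other than a bare yes or no is a deliberate choice in this sketch, since a judge that ignores the instruction to answer with a single word should not be counted silently in either direction.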
