Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents

Feng Liu; Hong Wang; Jingren Hou; Ruiyi Ding; Wei Xia; Wence Ji; Yeqiu Chen; Yongkang Yang; Zhezheng Hao; Ziyan Liu

arxiv: 2605.30159 · v1 · pith:DKVDHXPMnew · submitted 2026-05-28 · 💻 cs.AI

Pith reviewed 2026-06-29 07:28 UTC · model grok-4.3

classification 💻 cs.AI

keywords LLM agents · memory policy optimization · long-horizon tasks · belief entropy · epistemic uncertainty · recursive summarization · metacognitive optimization

The pith

Penalizing memory summaries that leave the model uncertain about the task state lets LLM agents maintain 97.1% performance at 1.75M-token context lengths.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Memory-augmented LLM agents compress long interaction histories into recursive summaries so they can handle extended tasks, yet these summaries gradually discard relevant details and add noise. Training the memory policy solely on final task success gives no signal about where the summaries went wrong along the way. The paper introduces Belief Entropy as a self-supervised measure of how uncertain the model remains about the underlying task state after reading its own summary. MMPO then trains the policy to reduce this entropy at every step, supplying dense feedback instead of waiting for an outcome signal. The result is memory policies that preserve clearer internal beliefs and deliver higher success rates even when total context length reaches 1.75 million tokens.

Core claim

The paper claims that memory optimization for long-horizon LLM agents should target the clarity of the belief induced by intermediate summaries rather than trajectory-level success alone. Belief Entropy acts as a self-supervised proxy for the epistemic uncertainty the model holds about the latent task state given its current memory. MMPO uses this proxy to penalize high-entropy summaries during training, replacing sparse outcome-based reinforcement learning with memory-specific supervision that directly counters progressive belief deviation.

What carries the argument

Belief Entropy, a self-supervised proxy that quantifies the model's remaining uncertainty about the latent task state given its current memory summary.
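
This page does not give the formula for Belief Entropy. As a minimal sketch, assuming the proxy is the Shannon entropy of the answers the model gives to a fixed anchor question when it can read only its current summary, the quantity could be estimated from repeated samples. The function name `belief_entropy` and the sampled answers are hypothetical and purely illustrative, not taken from the paper.

```python
import math
from collections import Counter


def belief_entropy(sampled_answers):
    """Empirical Shannon entropy (in nats) of sampled anchor-question answers.

    Assumption: each element is one answer the model produced when asked the
    anchor question with only the current memory summary in context.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    # Normalize case and whitespace so trivially different strings match.
    counts = Counter(a.strip().lower() for a in sampled_answers)
    total = sum(counts.values())
    # H = sum_i p_i * log(1 / p_i), with p_i the empirical answer frequency.
    return sum((c / total) * math.log(total / c) for c in counts.values())


# Hypothetical samples: a clear summary yields consistent answers,
# an ambiguous one spreads probability mass over several answers.
clear = ["Paris", "Paris", "paris", "Paris"]
ambiguous = ["Paris", "Lyon", "Nice", "Paris"]

print(belief_entropy(clear))      # zero entropy: all samples agree
print(belief_entropy(ambiguous))  # positive entropy: samples disagree
```

Under this reading, a summary that keeps the task-relevant facts drives the entropy toward zero, while a summary that drops them leaves the answers spread out. A token-level entropy over the model's output distribution would be an equally plausible variant.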

If this is right

MMPO outperforms prior outcome-based memory policies on diverse long-horizon tasks.
Performance holds at 97.1 percent when context length reaches 1.75 million tokens.
Training supplies fine-grained supervision at each summary step instead of sparse final rewards.
Recursive summarization produces less progressive loss of task-relevant information.
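
The Figure 4 caption on this page describes per-turn rewards derived from Belief Entropy, normalization via GRPO, and aggregation into future-aware turn-level advantages, but the formulas are not reproduced here. The sketch below is one plausible reading under stated assumptions: the per-turn reward is the negative entropy, normalization is a z-score across the G trajectories at the same turn, and "future-aware" means a discounted sum over the current and later turns. The names `group_normalize`, `future_aware_advantages`, and the discount `gamma` are hypothetical.

```python
import statistics


def group_normalize(rewards_by_traj):
    """Z-score each turn's reward across the group of trajectories.

    rewards_by_traj[g][k] is the reward of trajectory g at turn k.
    Assumption: all trajectories have the same number of turns.
    """
    num_turns = len(rewards_by_traj[0])
    normalized = [[0.0] * num_turns for _ in rewards_by_traj]
    for k in range(num_turns):
        column = [traj[k] for traj in rewards_by_traj]
        mean = statistics.fmean(column)
        std = statistics.pstdev(column)
        for g, r in enumerate(column):
            # A zero spread carries no preference signal, so use 0.
            normalized[g][k] = (r - mean) / std if std > 0 else 0.0
    return normalized


def future_aware_advantages(normalized, gamma=0.9):
    """Turn-level advantage as a discounted sum of this and later turns."""
    advantages = []
    for traj in normalized:
        running = 0.0
        adv = [0.0] * len(traj)
        for k in reversed(range(len(traj))):
            running = traj[k] + gamma * running
            adv[k] = running
        advantages.append(adv)
    return advantages


# Hypothetical Belief Entropy values for G = 2 trajectories over 3 turns.
entropies = [[1.0, 0.6, 0.2], [1.0, 1.1, 1.2]]
# Assumed reward shape: lower entropy gives higher reward.
rewards = [[-h for h in traj] for traj in entropies]
adv = future_aware_advantages(group_normalize(rewards))
print(adv)
```

In this toy example the first trajectory, whose entropy falls at every turn, receives positive advantages at all turns, including the first turn where both trajectories look identical. That is the intended effect of crediting early summaries for the clarity they enable later.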

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same entropy signal could diagnose memory quality in any agent that maintains compressed internal state, not only LLM summarizers.
The approach may extend to compression methods other than natural-language summaries if an analogous uncertainty measure can be defined.
Longer-horizon experiments could test whether the same penalty continues to prevent belief deviation beyond the lengths already measured.

Load-bearing premise

Belief Entropy reliably tracks how clear or degraded the agent's estimate of the task state has become after each summary, and lowering it improves reasoning without harming other behaviors.

What would settle it

A run of MMPO in which belief entropy drops yet the agent still loses track of task-critical facts or shows no gain in final success rate.
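
As a rough sketch of how such a check might be run, one could log the first-turn and last-turn entropy plus the outcome for each episode, then look at the correlation between entropy reduction and success alongside the list of episodes where entropy fell but the task still failed. The record fields (`h_first`, `h_last`, `success`), the helper names, and the example values are all hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(vx * vy)


def counterexamples(episodes, min_drop=0.0):
    """Episodes where entropy fell by more than min_drop yet the task failed."""
    return [
        e for e in episodes
        if e["h_first"] - e["h_last"] > min_drop and not e["success"]
    ]


# Hypothetical per-episode logs: first/last-turn entropy and task outcome.
episodes = [
    {"h_first": 1.2, "h_last": 0.2, "success": True},
    {"h_first": 1.1, "h_last": 0.9, "success": False},
    {"h_first": 1.3, "h_last": 0.1, "success": False},  # entropy fell, task failed
    {"h_first": 1.0, "h_last": 0.3, "success": True},
]
drops = [e["h_first"] - e["h_last"] for e in episodes]
outcomes = [1.0 if e["success"] else 0.0 for e in episodes]

print("correlation:", pearson(drops, outcomes))
print("large-drop failures:", len(counterexamples(episodes, min_drop=0.5)))
```

A weak or negative correlation, or many large-drop failures, would count against the premise that lowering Belief Entropy tracks better task-state estimates.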

Figures

Figures reproduced from arXiv: 2605.30159 by Feng Liu, Hong Wang, Jingren Hou, Ruiyi Ding, Wei Xia, Wence Ji, Yeqiu Chen, Yongkang Yang, Zhezheng Hao, Ziyan Liu.

**Figure 1.** Overview of MMPO. (Top) Existing outcome-based memory policies suffer from sparse credit assignment, failing to prevent ambiguous summaries from accumulating belief deviation. (Bottom) MMPO introduces an anchor-question-based Belief Entropy to provide dense, memory-specific supervision. This fine-grained penalty for epistemic uncertainty preserves clearer summary-induced beliefs and improves long-context r…

**Figure 2.** Belief state under standard and summary-based POMDPs. (a) In standard POMDPs, the belief b = P(s | h) is updated from the full interaction history. (b) In summary-based POMDPs, the memory policy compresses the history into a summary m, inducing a belief b = P(s | m) from the compressed representation. ⟨S, A, Ω, T, O, R, γ⟩. At each step t, the agent observes o_t ∈ Ω (e.g., a retrieved document snippet) whi…

**Figure 3.** Empirical validation of Belief Entropy. (a) Successful trajectories show decreasing …

**Figure 4.** Overview of the MMPO training pipeline. Stage 1: The memory policy π_θ samples G trajectories per task. Stage 2: Each trajectory is decomposed into sub-trajectories τ_{≤1}, …, τ_{≤T}, and Belief Entropy H_BE(m_k) is computed at every turn to produce dense per-step rewards R_k. Stage 3: Sub-trajectory rewards are normalized via GRPO and aggregated into future-aware turn-level advantages for policy optimization.…

**Figure 5.** Belief Entropy analysis. (a) Belief Entropy trajectories over reasoning turns at 56K context length. Successful trajectories show consistent entropy decrease, while failed trajectories stagnate or increase. (b) Correlation between total entropy reduction ΔH_BE and task accuracy across 500 test episodes. MMPO strengthens this correlation compared with MemAgent, supporting Belief Entropy as a proxy for interm…

Original abstract

Memory-augmented LLM agents tackle complex long-horizon tasks by recursively summarizing interaction trajectories into compact memory. However, existing approaches typically train these memory policies using outcome-based reinforcement learning, failing to localize where intermediate memory quality degrades. As interactions unfold, ambiguous recursive summaries progressively discard task-relevant information and introduce semantic noise. This exacerbates belief deviation, obscuring the agent's estimate of the latent task state and ultimately derailing long-horizon reasoning. We therefore argue that memory optimization should focus not merely on trajectory-level success, but on the clarity of the belief induced by intermediate summaries. To this end, we introduce Belief Entropy, a self-supervised proxy that probes how uncertain the model remains about the latent task state given its current memory. Based on this proxy, we propose Metacognitive Memory Policy Optimization (MMPO). Instead of relying only on sparse outcome-based signals, MMPO provides fine-grained, memory-specific supervision via explicitly penalizing summaries that induce high epistemic uncertainty. Experiments show that MMPO consistently outperforms existing methods on diverse long-horizon tasks, maintaining 97.1% performance even when scaled to 1.75M-token contexts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper proposes Belief Entropy to supervise memory policies in long-horizon LLM agents, but the abstract gives no computation details, baselines, or validation that the proxy tracks actual belief quality.

The main takeaway is that this work tries to fix a real bottleneck in memory-augmented LLM agents by adding a self-supervised penalty on uncertain intermediate summaries instead of relying only on final task outcomes. MMPO is the resulting method.

What is new is the framing of memory optimization around the clarity of the belief induced by recursive summaries. The abstract correctly notes that outcome-based RL misses where information degrades over long trajectories, and Belief Entropy is positioned as a way to localize that degradation. The reported scaling to 1.75M-token contexts with 97.1% performance would matter if the experiments hold up.

The paper does identify a practical issue with progressive semantic noise in agent memory. Shifting supervision to epistemic uncertainty in the summaries is a logical next step from standard RL baselines.

The soft spots all stem from how little the abstract specifies. It gives no formula or procedure for computing Belief Entropy, no description of the baselines, and no variance, statistical tests, or data details around the performance numbers. The circularity concern is live: the proxy is defined from the model's own uncertainty, yet nothing shows that it correlates with actual task-state deviation or information loss. On the evidence given, that objection stands.

This is aimed at researchers building long-horizon LLM agents who already work on memory mechanisms. A reader who needs reproducible methods or grounded metrics will not find enough here.

It deserves peer review so the full paper can be checked for the missing validation experiments and controls. The underlying problem is worth referee time even if heavy revision is likely.

Referee Report

2 major / 0 minor

Summary. The paper introduces Belief Entropy, a self-supervised proxy for epistemic uncertainty induced by recursive memory summaries in LLM agents, and proposes Metacognitive Memory Policy Optimization (MMPO) to penalize high-uncertainty summaries during policy optimization. This provides fine-grained supervision beyond sparse outcome-based RL signals. The central empirical claim is that MMPO consistently outperforms existing methods on diverse long-horizon tasks while maintaining 97.1% performance at 1.75M-token contexts.

Significance. If the Belief Entropy proxy is shown to correlate with actual belief deviation and the performance gains are isolated to the metacognitive penalty, the work could meaningfully advance memory-augmented agents by localizing and mitigating information loss in long-horizon reasoning. The self-supervised nature and scaling result to very long contexts would be notable strengths if rigorously supported.

major comments (2)

[Abstract] The performance claim (97.1% at 1.75M tokens) is reported without any description of baselines, variance across runs, data exclusion criteria, or the precise computation of Belief Entropy, rendering it impossible to assess whether the data supports the outperformance claim or whether gains are attributable to the proposed penalty rather than other unisolated changes.
[Abstract] The central claim that penalizing high Belief Entropy improves long-horizon reasoning rests on this quantity faithfully tracking degradation in the agent's estimate of the latent task state. No evidence is provided of correlation between Belief Entropy and downstream task failure, information loss, or an external measure such as KL divergence to an oracle state; without such grounding the self-supervised signal risks misalignment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract's clarity and the need to ground the Belief Entropy proxy. We respond to each major comment below and commit to revisions that address the concerns without overstating current results.

Referee: [Abstract] The performance claim (97.1% at 1.75M tokens) is reported without any description of baselines, variance across runs, data exclusion criteria, or the precise computation of Belief Entropy, rendering it impossible to assess whether the data supports the outperformance claim or whether gains are attributable to the proposed penalty rather than other unisolated changes.

Authors: We agree the abstract is too condensed to support standalone evaluation of the claim. The body of the manuscript provides the requested details on baselines, run variance, and Belief Entropy computation. To resolve the issue, we will revise the abstract to briefly note the comparison to outcome-based RL, state that results are averaged over runs, and refer at a high level to the self-supervised computation. This change will be incorporated in the next version.
revision: yes

Referee: [Abstract] The central claim that penalizing high Belief Entropy improves long-horizon reasoning rests on this quantity faithfully tracking degradation in the agent's estimate of the latent task state. No evidence is provided of correlation between Belief Entropy and downstream task failure, information loss, or an external measure such as KL divergence to an oracle state; without such grounding the self-supervised signal risks misalignment.

Authors: The manuscript introduces Belief Entropy explicitly as a self-supervised proxy and demonstrates its value through end-to-end task improvements rather than direct correlation to oracle measures. We acknowledge that no explicit correlation analysis with task failure, information loss, or KL divergence appears in the current version. We will add a targeted analysis in the experiments section correlating Belief Entropy scores with observed information retention and failure points to provide additional empirical grounding.
revision: yes

Circularity Check

0 steps flagged

No circularity; Belief Entropy is an independently defined proxy validated on external task benchmarks

The paper defines Belief Entropy as a self-supervised probe of model uncertainty given memory summaries, then uses it to supply an auxiliary penalty in MMPO. Task performance (e.g., 97.1% at 1.75M tokens) is measured on separate long-horizon benchmarks, not by construction from the entropy definition itself. No equations, fitted parameters renamed as predictions, or self-citation chains reduce the central claim to its inputs. The derivation remains self-contained against external outcome metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unverified effectiveness of Belief Entropy as a proxy and the assumption that penalizing it yields better long-horizon performance; no free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption: Belief Entropy is an effective self-supervised proxy for the quality of intermediate memory summaries
The method depends on this proxy providing useful supervision signals.

pith-pipeline@v0.9.1-grok · 5755 in / 1081 out tokens · 23254 ms · 2026-06-29T07:28:45.721136+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning
cs.AI · 2026-06 · unverdicted · novelty 6.0

ThoughtFold applies introspective redundancy detection within correct CoT trajectories to create sub-trajectory spectra, then uses masked preference optimization to penalize redundant explorations, yielding 56% token ...

Reference graph

Works this paper leans on

24 extracted references · 22 canonical work pages · cited by 1 Pith paper · 15 internal anchors

[1] Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready AI agents with scalable long-term memory. CoRR, abs/2504.19413.

[2] Process Reinforcement through Implicit Rewards
Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, and Ning Ding. Process reinforcement through implicit rewards. CoRR, abs/2502.01456.

[3] DeepSeek-V3 Technical Report
DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huaj...

[4]
DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

[5] Memory for autonomous LLM agents: Mechanisms, evaluation, and emerging frontiers
Pengfei Du. Memory for autonomous LLM agents: Mechanisms, evaluation, and emerging frontiers. arXiv preprint arXiv:2603.07670.

[6]
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Bethany...

[7] Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective
Zhezheng Hao, Hong Wang, Haoyang Liu, Jian Luo, Jiarui Yu, Hande Dong, Qiang Lin, Can Wang, and Jiawei Chen. Rethinking entropy interventions in RLVR: An entropy change perspective. arXiv preprint arXiv:2510.10150.

[8] Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516.

[9] Language Models (Mostly) Know What They Know
Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.

[10] In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25972–25981
Zhiyu Li, Shichao Song, Hanyu Wang, Simin Niu, Ding Chen, Jiawei Yang, Chenyang Xi, Huayi Lai, Jihao Zhao, Yezhaohui Wang, Junpeng Ren, Zehao Lin, Jiahao Huo, Tianyi Chen, Kai Chen, Kehang Li, Zhiqiang Yin, Qingchen Yu, Bo Tang, Hongkang Yang, Zhi-Qin John Xu, and Feiyu Xiong. Memos: An operating system for memory-augmented generation (MAG) in large langu...

[11] A comprehensive survey on long context language modeling. CoRR, abs/2503.17407.
Jiaheng Liu, Dawei Zhu, Zhiqi Bai, Yancheng He, Huanxuan Liao, Haoran Que, Zekun Wang, Chenchen Zhang, Ge Zhang, Jiebin Zhang, Yuanxing Zhang, Zhuo Chen, Hangyu Guo, Shilong Li, Ziqiang Liu, Yong Shan, Yifan Song, Jiayi Tian, Wenhao Wu, Zhejian Zhou, Ruijie Zhu, Junlan Feng, Yang Gao, Shizhu He, Zhoujun Li, Tianyu Liu, Fanyu Meng, Wenbo Su, Yingshui Tan, ...

[12] MemGPT: Towards LLMs as Operating Systems
Charles Packer, Vivian Fang, Shishir G. Patil, Kevin Lin, Sarah Wooders, and Joseph E. Gonzalez. MemGPT: Towards LLMs as operating systems. CoRR, abs/2310.08560.

[13]
doi: 10.1016/0022-247X(65)90154-X. Han Shen. On entropy control in LLM-RL algorithms. arXiv preprint arXiv:2509.03493.

[14] When to memorize and when to stop: Gated recurrent memory for long-context reasoning
Leheng Sheng, Yongtao Zhang, Wenchang Ma, Yaorui Shi, Ting Huang, Xiang Wang, An Zhang, Ke Shen, and Tat-Seng Chua. When to memorize and when to stop: Gated recurrent memory for long-context reasoning. arXiv preprint arXiv:2602.10560.

[15]
doi: 10.1287/opre.21.5.1071. Susanne Still and Doina Precup. An information-theoretic approach to curiosity-driven reinforcement learning. Theory in Biosciences, 131(3):139–148.

[16] Mem-α: Learning Memory Construction via Reinforcement Learning
Yu Wang, Ryuichi Takanobu, Zhiqi Liang, Yuzhen Mao, Yuanzhe Hu, Julian J. McAuley, and Xiaojian Wu. Mem-α: Learning memory construction via reinforcement learning. CoRR, abs/2509.25911.

[17] A-MEM: Agentic Memory for LLM Agents
Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-MEM: Agentic memory for LLM agents. arXiv preprint arXiv:2502.12110.

[18] MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, and Hao Zhou. MemAgent: Reshaping long-context LLM with multi-conv RL-based memory agent. CoRR, abs/2507.02259.

[19] Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards
Youliang Yuan, Qiuyang Mang, Jingbang Chen, Hong Wan, Xiaoyuan Liu, Junjielong Xu, Jen-tse Huang, Wenxuan Wang, Wenxiang Jiao, and Pinjia He. Curing miracle steps in LLM mathematical reasoning with rubric rewards. CoRR, abs/2510.07774.

[20] A Survey of Reinforcement Learning for Large Reasoning Models
Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, Yu Fu, Xingtai Lv, Yuchen Zhang, Sihang Zeng, Shang Qu, Haozhan Li, Shijie Wang, Yuru Wang, Xinwei Long, Fangfu Liu, Xiang Xu, Jiaze Ma, Xuekai Zhu, Ermo Hua, Yihao Liu, Zonglin Li, Huayu Chen, Xiaoye Qu, Yafu Li, Weize Chen, Zhenzhao Yua...

[21] DeepResearcher: Scaling deep research via reinforcement learning in real-world environments
Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. DeepResearcher: Scaling deep research via reinforcement learning in real-world environments. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 414–431, 2025.

[22] MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents
Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, and Paul Pu Liang. MEM1: Learning to synergize memory and reasoning for efficient long-horizon agents. arXiv preprint arXiv:2506.15841.

[23] The key point is architectural: once the history is compressed into memory, downstream reasoning and action selection can only access the information preserved in m_t
19: end for B Summary-Induced Belief: Architectural Justification This appendix justifies why the belief of a summary-based memory agent is conditioned on the textual memory m_t rather than on the full interaction history h_t. The key point is architectural: once the history is compressed into memory, downstream reasoning and action selection can only access...

[24] Based on current memory, what is the answer to the question?
studies long-horizon agents that jointly maintain internal memory and perform task-directed reasoning. Instead of relying only on raw interaction history, the agent maintains a compact internal memory state across steps and uses it to support subsequent reasoning, querying, or environment interaction. This framework is evaluated in both multi-objective ...
