
arxiv: 2606.31650 · v1 · pith:6GJISB6Cnew · submitted 2026-06-30 · 💻 cs.LG · cs.AI

ECHO: Prune to act, trace to learn with selective turn memory in agentic RL

Pith reviewed 2026-07-01 06:12 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords agentic RL · turn memory · context management · credit assignment · language agents · selective memory · long-horizon agents · BrowseComp-Plus

The pith

ECHO lets agents prune history into indexed records and route RL credit back to the exact turns that produced success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Long-horizon language agents lose access to detailed past evidence as context windows fill, and standard RL cannot easily credit the specific observations that led to a correct answer. ECHO addresses both problems by turning each completed turn into a compact memory record that keeps its original source index. The policy then builds its current context by selecting a subset of these records, and any positive outcome reward is routed back through the same indices to update the policy on the evidence and selection choices that mattered. On BrowseComp-Plus this yields 43.4% held-out accuracy, ahead of GRPO (28.9%) and the rolling-summary baseline SUPO (36.1%), while using fewer turns and lower trajectory volume than SUPO. The resulting policy also shows stronger zero-shot transfer to other QA, code, and search tasks.

Core claim

ECHO is a selective turn-memory framework that compresses each completed environment turn into a compact memory record, reconstructs bounded policy contexts by selecting from these records, and reuses the selected source indices to route positive outcome credit to the evidence and selection actions that support successful answers.

What carries the argument

source-indexed reconstruction: each turn is stored as a compact record carrying its original index; selection from the record pool both assembles the policy context and supplies the addresses for outcome-based credit assignment.
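
The mechanism described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `MemoryRecord`, `compress_turn`, and `reconstruct_context` are hypothetical names, plain truncation stands in for whatever compressor the paper uses, and the `keep_latest` / `select_cap` parameters are an assumed reading of the K and S quantities mentioned in the Figure 3 caption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRecord:
    source_index: int  # position of the original turn in the rollout
    summary: str       # compact stand-in for the full turn (hypothetical format)


def compress_turn(source_index: int, turn_text: str, max_chars: int = 80) -> MemoryRecord:
    # Assumption: truncation is only a placeholder for the real compressor.
    return MemoryRecord(source_index, turn_text[:max_chars])


def reconstruct_context(pool, selected, keep_latest, select_cap):
    """Build a bounded context and return (records, trace).

    pool        -- list of MemoryRecord, one per completed turn, in order
    selected    -- source indices proposed by the policy, in preference order
    keep_latest -- number of most recent turns retained automatically
    select_cap  -- maximum number of policy-selected older turns
    """
    by_index = {r.source_index: r for r in pool}
    latest = {r.source_index for r in pool[-keep_latest:]} if keep_latest > 0 else set()

    chosen = []
    for idx in selected:
        if len(chosen) >= select_cap:
            break
        # Skip unknown indices, duplicates, and turns already kept automatically.
        if idx in by_index and idx not in latest and idx not in chosen:
            chosen.append(idx)

    # The trace is the set of source indices that ended up in the context;
    # the same list can later serve as addresses for credit assignment.
    trace = sorted(set(chosen) | latest)
    return [by_index[i] for i in trace], trace
```

The point of the sketch is that a single return value, `trace`, does double duty: it determines what the policy sees and records which original turns that view came from.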

If this is right

  • Agents maintain direct access to original evidence without progressive loss as turn count grows.
  • Outcome rewards can update the precise turns and selection decisions that contributed to success.
  • Training uses fewer total turns and lower trajectory volume than rolling-summary baselines.
  • The learned policy transfers to multi-objective QA, code generation, and information-seeking tasks on both dense and MoE models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same index-based credit routing could be applied to other memory-compression schemes in long-horizon RL to test whether traceability is the main source of the observed gains.
  • If the compression step itself discards critical details, the framework would benefit from adaptive record granularity rather than fixed-size records.
  • The separation of pruning (via selection) from credit assignment (via indices) suggests a general pattern for making any bounded-context agent trainable with outcome RL.

Load-bearing premise

Reconstructing contexts from selected compressed records with their original indices still supplies enough fine-grained evidence for the policy to use it effectively and lets credit assignment remain stable without introducing selection bias.

What would settle it

Train two policies on the same data: one with ECHO's indexed credit routing and one with identical compression but credit assigned uniformly or without indices; if the indexed version shows no accuracy gain or higher variance on held-out tasks, the traceability benefit is absent.
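
The two arms of that comparison differ only in how a positive outcome reward is spread over turns. A minimal sketch follows, assuming credit is expressed as one scalar weight per turn; the abstract does not specify how credit enters the policy update, so `indexed_credit` and `uniform_credit` are hypothetical helpers for illustration only.

```python
def indexed_credit(num_turns, trace, reward):
    """Route a positive reward only to the turns named in the source trace."""
    credit = [0.0] * num_turns
    if reward > 0:
        for idx in set(trace):
            credit[idx] = reward
    return credit


def uniform_credit(num_turns, reward):
    """Control arm: every turn receives the same positive reward."""
    value = reward if reward > 0 else 0.0
    return [value] * num_turns


# Six-turn rollout whose final context was built from turns 1, 3, 4, and 5.
print(indexed_credit(6, [1, 3, 4, 5], 1.0))  # [0.0, 1.0, 0.0, 1.0, 1.0, 1.0]
print(uniform_credit(6, 1.0))                # [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```

Holding compression and selection fixed while swapping only this function is what would isolate the traceability effect.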

Figures

Figures reproduced from arXiv: 2606.31650 by Aoqi Hu, Binbin Zheng, Enlei Gong, Guanqun Zhao, Jiayao Tang, Jihua Liu, Lingfeng Liu, Yuyang You, Zeyu Chen, Zijun Xie.

Figure 1: Held-out accuracy, tool-use turns per rollout, and trajectory volume over training on BrowseComp-Plus
Figure 2: Training diagnostics on long-horizon search. Summarization-based context management enables longer …
Figure 3: Overview of ECHO. ECHO stores completed turns as source-indexed memories, selects useful memories for bounded context reconstruction, and reuses the same source trace for credit assignment. Autoregressive memory selection. Let S cap model-selected turns and K denote the latest turns retained automatically. At a compression boundary before segment j, let H^bd_j be the bounded local state available there. E…
Figure 4: Ablation study on BrowseComp-Plus. (a) Learned source selection in ECHO outperforms semantic …
Figure 5: Training dynamics on BrowseComp-Plus with the Qwen3-30B-A3B-Instruct MoE backbone. SUPO is …
Figure 6: Existing context reconstruction strategies reduce context length but often lose full traceability. ECHO …
Original abstract

Long-horizon language agents must repeatedly interact with tools, accumulate evidence, and make decisions under bounded context windows. Existing context-management methods make such rollouts feasible by truncating distant history, folding past turns into summaries, or selecting compact memory states. However, these breakthroughs introduce two coupled limitations. First, as the number of turns grows, historical observations are progressively removed or collapsed into compressed states, making it harder for the policy to reuse fine-grained evidence. Second, once the original turns are no longer source-addressable, outcome-based RL loses an explicit path for aligning policy updates with the evidence that supported a successful final answer. To this end, we propose ECHO, a selective turn-memory framework that jointly addresses history collapse and traceable learning through source-indexed reconstruction. Specifically, ECHO compresses each completed environment turn into a compact memory record, reconstructs bounded policy contexts by selecting from these records, and reuses the selected source indices to route positive outcome credit to the evidence and selection actions that support successful answers. On BrowseComp-Plus, ECHO reaches 43.4% held-out accuracy, outperforming GRPO (28.9%) and the rolling-summary baseline SUPO (36.1%), while using fewer turns and lower trajectory volume than SUPO (Figure 1). Additionally, the trained policy improves zero-shot generalization across multi-objective QA, code generation, and deep information-seeking benchmarks on both dense and MoE backbones.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript proposes ECHO, a selective turn-memory framework for long-horizon language agents in RL. It compresses each environment turn into a compact memory record, reconstructs bounded policy contexts via selection from these records, and reuses the source indices to route positive outcome credit back to the supporting evidence and selection actions. On BrowseComp-Plus it reports 43.4% held-out accuracy (vs. GRPO 28.9%, SUPO 36.1%), with fewer turns and lower trajectory volume than SUPO, plus improved zero-shot generalization on multi-objective QA, code generation, and information-seeking tasks across dense and MoE backbones.

Significance. If the source-indexed reconstruction and index-based credit routing function as described without substantial information loss or selection bias, the method would address two persistent limitations in agentic RL—progressive history collapse and untraceable outcome credit—potentially enabling more efficient, evidence-reusing policies at scale.

major comments (3)
  1. [Abstract] The central claim that source-indexed reconstruction preserves sufficient fine-grained evidence for policy reuse is unsupported by any reported diagnostic (e.g., token-level overlap, information-retention metrics, or reconstruction-error statistics between original turns and reconstructed contexts).
  2. [Abstract] No ablation or diagnostic is supplied that isolates the contribution of index-based credit routing from the selection heuristic itself; without this, the 43.4% accuracy and reduced trajectory volume cannot be confidently attributed to the traceable-learning component rather than the compression/selection procedure alone.
  3. [Abstract] The reported accuracy figures lack variance estimates, statistical significance tests, or multiple-run statistics, making it impossible to assess whether the gains over GRPO and SUPO are robust or sensitive to unstated implementation choices.
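
One of the diagnostics named in comment 1, token-level overlap, is cheap to compute. The sketch below is only one possible definition (multiset recall over lowercased whitespace tokens); `token_recall` is a hypothetical helper, and the paper may use a different tokenizer or metric if it reports one.

```python
from collections import Counter


def token_recall(original: str, reconstructed: str) -> float:
    """Fraction of the original turn's tokens that survive in the reconstruction.

    Tokens are lowercased whitespace-separated strings, counted with
    multiplicity, so a word repeated twice must survive twice to count fully.
    """
    orig = Counter(original.lower().split())
    recon = Counter(reconstructed.lower().split())
    total = sum(orig.values())
    if total == 0:
        return 1.0  # nothing to retain
    kept = sum((orig & recon).values())  # multiset intersection
    return kept / total


print(token_recall("the cat sat on the mat", "the cat sat"))  # 0.5
```

Averaging this score per turn, and plotting it against turn age, would show whether older evidence degrades faster than recent evidence.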

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments highlighting areas where additional evidence would strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Point-by-point responses
  1. Referee: [Abstract] The central claim that source-indexed reconstruction preserves sufficient fine-grained evidence for policy reuse is unsupported by any reported diagnostic (e.g., token-level overlap, information-retention metrics, or reconstruction-error statistics between original turns and reconstructed contexts).

    Authors: We acknowledge that the current manuscript does not report explicit diagnostics such as token-level overlap, information-retention metrics, or reconstruction-error statistics. In the revised version we will add these analyses (including quantitative comparisons between original turns and reconstructed contexts) in a new appendix to substantiate the preservation claim.
    Revision: yes

  2. Referee: [Abstract] No ablation or diagnostic is supplied that isolates the contribution of index-based credit routing from the selection heuristic itself; without this, the 43.4% accuracy and reduced trajectory volume cannot be confidently attributed to the traceable-learning component rather than the compression/selection procedure alone.

    Authors: We agree that an ablation isolating the index-based credit routing mechanism from the selection heuristic is necessary to attribute performance gains. The revised manuscript will include this ablation (comparing full ECHO against a variant without credit routing) along with corresponding trajectory and accuracy metrics.
    Revision: yes

  3. Referee: [Abstract] The reported accuracy figures lack variance estimates, statistical significance tests, or multiple-run statistics, making it impossible to assess whether the gains over GRPO and SUPO are robust or sensitive to unstated implementation choices.

    Authors: The reported figures are from single runs without variance or significance statistics. In revision we will rerun all main experiments across multiple random seeds, report means and standard deviations, and include statistical significance tests against the baselines.
    Revision: yes
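
The multi-seed reporting promised in the third response amounts to a mean, a sample standard deviation, and a two-sample test statistic. A minimal sketch using only the Python standard library is given here; `summarize` and `welch_t` are hypothetical helper names, and the inputs in the usage lines are synthetic placeholders, not results from the paper.

```python
import math
import statistics


def summarize(runs):
    """Return (mean, sample standard deviation) over per-seed scores."""
    return statistics.mean(runs), statistics.stdev(runs)


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    var_a = statistics.variance(a)
    var_b = statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    if standard_error == 0:
        raise ValueError("both samples have zero variance")
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


# Synthetic placeholder scores (NOT from the paper), three seeds per method.
method_a = [1.0, 2.0, 3.0]
method_b = [0.0, 1.0, 2.0]
print(summarize(method_a))  # (2.0, 1.0)
print(round(welch_t(method_a, method_b), 4))  # 1.2247
```

Turning the statistic into a p-value additionally requires the Welch-Satterthwaite degrees of freedom and a t-distribution, which the standard library does not provide.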

Circularity Check

0 steps flagged

No circularity: empirical results are direct measurements, not derived quantities.

full rationale

The provided abstract and description contain no equations, fitted parameters, or derivation steps. The central claims are held-out accuracy numbers (43.4% on BrowseComp-Plus) presented as direct experimental outcomes, with comparisons to baselines like GRPO and SUPO. No self-definitional relations, fitted-input predictions, or load-bearing self-citations appear. The method description (source-indexed reconstruction and credit routing) is presented as a proposed framework whose effectiveness is evaluated empirically rather than proven by construction from its own inputs. This matches the default expectation of a non-circular empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The central claim rests on the unverified assumption that indexed records preserve usable evidence and that index-based credit assignment is stable.

pith-pipeline@v0.9.1-grok · 5825 in / 1181 out tokens · 22800 ms · 2026-07-01T06:12:03.807983+00:00 · methodology

