ECHO is a selective turn-memory framework for agentic RL that compresses turns into indexed records, selects them for bounded contexts, and uses source indices to assign outcome credit to supporting evidence, reaching 43.4% accuracy on BrowseComp-Plus versus 28.9% for GRPO and 36.1% for SUPO.
Scaling LLM Multi-turn RL with End-to-end Summarization-based Context Management
12 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years: 2026 (12)
verdicts: UNVERDICTED (12)
roles: background (1)
polarities: background (1)
representative citing papers
HORMA builds a hierarchical memory structure from agent experiences and trains a lightweight RL navigator to retrieve minimal sufficient context, yielding better task performance with at most 22.17% of baseline token usage on ALFWorld, LoCoMo, and LongMemEval.
TMEM lets LLM agents evolve their policy mid-episode by absorbing distilled supervision into online LoRA updates, outperforming summary and retrieval baselines on several long-context benchmarks.
SAM is a standalone memory framework for long-horizon LLM agents that creates state-adaptive cues from interactions, preserves raw trajectories for intent-driven recall, and optimizes the module via expert supervision and RL, outperforming baselines on BrowseComp and related benchmarks.
Slipstream uses asynchronous compaction with trajectory-grounded judge validation to improve long-horizon agent accuracy by up to 8.8 percentage points and reduce latency by up to 39.7%.
Context-ReAct enables agents to dynamically manage context via five atomic operations, and LongSeeker fine-tuned on 10k trajectories achieves 61.5% and 62.5% on BrowseComp benchmarks, outperforming prior agents.
ScrapMem reports a state-of-the-art 51.0% Joint@10 on ATM-Bench, with up to 93% memory reduction and 70.3% Recall@10, via optical forgetting and EM-Graph.
SWE-MeM introduces adaptive memory management for coding agents via synthesized trajectories and Memory-aware GRPO, reporting 43.4% and 60.2% resolve rates on SWE-Bench Verified for 4B and 30B models while beating baselines on performance and token use.
OSU-Mem shows overlapping memory helps retrieval when evidence shares tools or entities but hurts when steps are heterogeneous, with benefits on synthetic benchmarks vanishing on mixed real ones due to query mixing.
LaMR decomposes code context pruning into two rubrics using dedicated CRFs, a mixture-of-experts gate, and AST-derived labels to filter noise and often match or beat full-context baselines on coding benchmarks.
R²-Searcher introduces fine-grained evidence modeling, retrieval reflection, and R²PO RL to calibrate retrieval-reasoning boundaries and improve multi-hop QA performance.
The paper reviews conceptual foundations, methodological innovations, effective designs, critical challenges, and future directions for LLM-based Agentic Reinforcement Learning.
citing papers explorer
- When Does Overlap Help? OSU-Mem and a Cell-Conditional Analysis of Trajectory Memory for LLM Agents
- R²-Searcher: Calibrating Retrieval and Reasoning Boundaries for Agentic Search