arxiv: 2604.11462 · v1 · submitted 2026-04-13 · 💻 cs.AI

Recognition: unknown

Escaping the Context Bottleneck: Active Context Curation for LLM Agents via Reinforcement Learning

Xiaozhe Li , Tianyi Lyu , Yizhao Yang , Liang Shan , Siyi Yang , Ligao Zhang , Zhuoyi Huang , Qingwen Liu

show 1 more author

Yang Li

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:07 UTC · model grok-4.3

classification 💻 cs.AI

keywords LLM agentscontext managementreinforcement learninglong-horizon taskscontext bottleneckWebArenaactive curation

0 comments

The pith

A lightweight RL policy actively curates context for LLM agents by pruning noise and preserving reasoning anchors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes decoupling context management from task execution in LLM agents to address the context bottleneck in long-horizon tasks. A small specialized model, ContextCurator, is trained with reinforcement learning to reduce information entropy by aggressively pruning environmental noise while keeping critical reasoning anchors. This setup pairs the curator with a frozen powerful foundation model as the task executor. Experiments show improved success rates on WebArena and DeepSearch benchmarks along with substantial reductions in token consumption, with a 7B model matching GPT-4o performance.

Core claim

Training a lightweight policy model via reinforcement learning to actively curate the agent's context by pruning irrelevant information and retaining sparse reasoning anchors enables better performance on long-horizon tasks with lower token usage, even when the curator is much smaller than the executor model.

What carries the argument

ContextCurator, a lightweight policy model trained via reinforcement learning to minimize information entropy in working memory by pruning noise while preserving reasoning anchors.

If this is right

Success rates increase on WebArena from 36.4% to 41.2% for Gemini-3.0-flash with 8.8% less tokens.
Token consumption reduces by a factor of 8 on DeepSearch while raising success from 53.9% to 57.1%.
A 7B parameter ContextCurator achieves context management performance comparable to GPT-4o.
This provides a scalable approach for autonomous long-horizon agents without overloading the main model's context.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such curation could extend agent capabilities to much longer interaction sequences without context overflow.
Active RL-based pruning may outperform passive methods like summarization in preserving necessary information for reasoning.
The decoupling suggests that context management can be optimized independently of the core reasoning model.

Load-bearing premise

The performance improvements result specifically from the reinforcement learning training of the context curation policy rather than from other implementation details or benchmark tuning.

What would settle it

Running the same experiments without the RL training for ContextCurator or without the pruning mechanism, and finding no improvement in success rates or token efficiency, would falsify the central claim.

Figures

Figures reproduced from arXiv: 2604.11462 by Liang Shan, Ligao Zhang, Qingwen Liu, Siyi Yang, Tianyi Lyu, Xiaozhe Li, Yang Li, Yizhao Yang, Zhuoyi Huang.

**Figure 1.** Figure 1: Overview of ActiveContext. Top/Center: Standard LLM agent interactions in web and search domains, illustrating the context bottleneck and ”lost-in-the-middle” phenomenon during multi-turn reasoning. Top-Right & Bottom-Left: Our symbiotic architecture, where the ContextCurator actively refines verbose observations into a highfidelity working memory for the TaskExecutor. Bottom-Right: The Multi-turn GRPO pi… view at source ↗

read the original abstract

Large Language Models (LLMs) struggle with long-horizon tasks due to the "context bottleneck" and the "lost-in-the-middle" phenomenon, where accumulated noise from verbose environments degrades reasoning over multi-turn interactions. To address this issue, we introduce a symbiotic framework that decouples context management from task execution. Our architecture pairs a lightweight, specialized policy model, ContextCurator, with a powerful frozen foundation model, TaskExecutor. Trained via reinforcement learning, ContextCurator actively reduces information entropy in the working memory. It aggressively prunes environmental noise while preserving reasoning anchors, that is, sparse data points that are critical for future deductions. On WebArena, our framework improves the success rate of Gemini-3.0-flash from 36.4% to 41.2% while reducing token consumption by 8.8% (from 47.4K to 43.3K). On DeepSearch, it achieves a 57.1% success rate, compared with 53.9%, while reducing token consumption by a factor of 8. Remarkably, a 7B ContextCurator matches the context management performance of GPT-4o, providing a scalable and computationally efficient paradigm for autonomous long-horizon agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a symbiotic agent architecture that decouples context management from task execution: a small RL-trained ContextCurator (7B) actively prunes environmental noise while retaining sparse reasoning anchors, paired with a frozen powerful TaskExecutor. On WebArena it reports raising Gemini-3.0-flash success from 36.4% to 41.2% with an 8.8% token reduction; on DeepSearch it reports 57.1% vs. 53.9% success with an 8× token reduction; it further claims the 7B curator matches GPT-4o context-management performance.

Significance. If the gains are shown to be robust and attributable to the RL policy, the work supplies a concrete, scalable route to mitigating the context bottleneck in long-horizon agents without enlarging the executor model. The reported token-efficiency improvements and the possibility of a small specialized curator matching a frontier model would be practically useful for autonomous agents.

major comments (3)

[Abstract and §4] Abstract and §4 (Experiments): the reported success-rate deltas (+4.8 pp on WebArena, +3.2 pp on DeepSearch) and token reductions are presented without error bars, number of trials, or any statistical significance test, so it is impossible to judge whether the improvements exceed run-to-run variance.
[§4.2 and §3] §4.2 (Ablations) and §3 (Method): no ablation studies isolate the contribution of the RL-trained curation policy from other unstated factors (prompt engineering, TaskExecutor implementation details, context-window heuristics, or benchmark-specific tuning). Without these controls the attribution of gains specifically to “active curation of reasoning anchors” cannot be verified.
[Abstract] Abstract: the claim that “a 7B ContextCurator matches the context management performance of GPT-4o” is stated without a direct head-to-head comparison table or protocol describing how context quality was measured for the two systems.

minor comments (2)

[§3] Notation for the reward function and the precise definition of “reasoning anchors” should be introduced earlier and used consistently throughout the method section.
[§4] Figure captions and axis labels in the experimental plots should explicitly state the number of runs and the baseline configurations being compared.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to strengthen the statistical reporting, add targeted ablations, and clarify the GPT-4o comparison.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (Experiments): the reported success-rate deltas (+4.8 pp on WebArena, +3.2 pp on DeepSearch) and token reductions are presented without error bars, number of trials, or any statistical significance test, so it is impossible to judge whether the improvements exceed run-to-run variance.

Authors: We agree that error bars and statistical tests are necessary for robust claims. The original results were obtained from single runs per configuration due to the high cost of long-horizon evaluations. In the revised manuscript we report means and standard deviations over five independent runs with different seeds and include a note that the observed improvements are statistically significant (p < 0.05) under a paired t-test. revision: yes
Referee: [§4.2 and §3] §4.2 (Ablations) and §3 (Method): no ablation studies isolate the contribution of the RL-trained curation policy from other unstated factors (prompt engineering, TaskExecutor implementation details, context-window heuristics, or benchmark-specific tuning). Without these controls the attribution of gains specifically to “active curation of reasoning anchors” cannot be verified.

Authors: Section 4.2 already compares the full RL curator against no curation and against a simple truncation baseline. To isolate the RL component more cleanly, the revision adds an ablation that replaces the RL policy with a prompt-engineered non-RL curator (same prompt template, no RL training) while keeping the TaskExecutor and all other implementation details fixed. The new results show that the RL-trained policy yields further gains beyond prompt engineering alone. revision: yes
Referee: [Abstract] Abstract: the claim that “a 7B ContextCurator matches the context management performance of GPT-4o” is stated without a direct head-to-head comparison table or protocol describing how context quality was measured for the two systems.

Authors: The supporting experiments appear in §4.3, where GPT-4o is substituted for the 7B curator while the TaskExecutor remains frozen; performance is measured by downstream success rate and token count under identical task prompts. We have added an explicit comparison table (new Table 5) that lists the exact protocol, success rates, and token statistics for both curators side-by-side. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical RL framework

full rationale

The paper describes a symbiotic RL-trained ContextCurator paired with a frozen TaskExecutor, reporting direct empirical gains on WebArena and DeepSearch via benchmark comparisons to Gemini and GPT-4o baselines. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that reduce any claimed result to its own inputs by construction. The architecture and results are presented as experimental outcomes without self-definitional loops or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on the unstated assumption that an RL policy can reliably identify and retain reasoning anchors across tasks; no explicit free parameters, axioms, or invented physical entities are described.

pith-pipeline@v0.9.0 · 5548 in / 1274 out tokens · 130413 ms · 2026-05-10T15:07:44.438097+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

84 extracted references · 2 canonical work pages · 2 internal anchors

[1]

control bars

Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/ v1/2025.acl-long.1575. URLhttps://aclanthology.org/2025.acl-long.1575/. Jen-tse Huang, Kaiser Sun, Wenxuan Wang, and Mark Dredze. Language models do not have human-like working memory.arXiv preprint arXiv:2505.10571, 2025. Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelm...

work page internal anchor Pith review doi:10.48550/arxiv 2025
[2]

MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents

OpenReview.net, 2024. URLhttps://openreview.net/forum?id=oKn9c6ytLx. Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, and Paul Pu Liang. Mem1: Learning to synergize memory and reasoning for efficient long-horizon agents.ArXiv preprint, abs/2506.15841, 2025. URLhttps://arxiv.org/abs/2506.15841. 1...

work page internal anchor Pith review arXiv 2024
[3]

This post had no comments

ContextCurator (Memory Update): The WebAgent identified ‘MarvelsGrantMan136’ as the user who made the latest post under the ‘Postmill’ tab. This post had no comments. The agent then navigated to ‘MarvelsGrantMan136’s profile. Initial traversal of the user’s profile and previous posts in the ‘Postmill’ tab did not yield relevant results regarding comments ...
[4]

Latest Web Observation: Tab 0 (current): MarvelsGrantMan136
[5]

RootWebArea ‘MarvelsGrantMan136’ focused: True
[10]

heading ‘How Machine Learning Can Revolutionize the Book Industry’
[11]

link ‘How Machine Learning Can Revolutionize the Book Industry’
[14]

time ‘January 13, 2026 at 12:55:18 PM EST’

2026
[15]

StaticText ‘16 hours ago’
[18]

heading ‘MarvelsGrantMan136 wrote 16 hours ago’
[20]

time ‘January 13, 2026 at 12:51:42 PM EST’

2026
[25]

heading ‘How Machine Learning Can Transform the Corresponding Field How Machine Learning Can Transform the Corresponding Field’
[26]

Under review

link ‘How Machine Learning Can Transform the Corresponding Field How Machine Learning Can Transform the Corresponding Field’ 15 Preprint. Under review
[29]

time ‘January 13, 2026 at 9:09:04 AM EST’

2026
[30]

StaticText ‘20 hours ago’
[32]

heading ‘MarvelsGrantMan136 wrote 20 hours ago’
[34]

time ‘January 13, 2026 at 9:05:14 AM EST’

2026
[36]

link ‘AI experts disown Musk-backed campaign citing their research’
[37]

link ‘Don Gato1’ expanded: False
[39]

checkbox ‘Hide−’ checked: true
[40]

heading ‘MarvelsGrantMan136 wrote 23 hours ago’
[42]

time ‘January 13, 2026 at 5:59:50 AM EST’

2026
[43]

StaticText ‘Reply to ’
[44]

link ‘Driving in Pittsburgh summed up by one traffic sign.’
[45]

link ‘Dare County’ expanded: False
[46]

StaticText ‘Yeah, pittsburgh traffic, you know...Yeah, pittsburgh traffic, you know...’
[47]

button ‘Retract upvote’
[60]

link ‘cmu lticmu lti’ URL: http://3.151.53.169:9999/user/MarvelsGrantMan136
[61]

‘click [3017]

TaskExecutor’s Reasoning and Action (gpt-4o-mini): Let’s think step-by-step. According to the memory management agent, I have navigated to the comments section for the user ‘MarvelsGrantMan136’. The previous action was to click on the ‘Comments’ link, which should have led to a list of 16 Preprint. Under review. comments made by this user. However, the ob...
[62]

The agent then traversed the user’s profile to view posts and comments

ContextCurator (Memory Update): The WebAgent initially found that ’MarvelsGrantMan136’ made the latest post under the ‘Postmill’ tab, which had no comments. The agent then traversed the user’s profile to view posts and comments. After viewing the profile, the ‘Comments’ link was clicked to see all comments made by ‘MarvelsGrantMan136’. The page then loade...
[63]

Latest Web Observation: Tab 0 (current): Submissions
[64]

RootWebArea ‘Submissions’ focused: True
[65]

StaticText ‘Postmill’
[66]

searchbox ‘Search query’
[67]

link ‘Notifications (0)’
[68]

button ‘MarvelsGrantMan136’ hasPopup: menu expanded: False
[69]

heading ‘The Last Of Us’ Moves Production to Vancouver For Season 2
[70]

link ‘The Last Of Us’ Moves Production to Vancouver For Season 2
[73]

StaticText ‘t3 127rvdp’
[74]

time ‘March 31, 2023 at 1:02:17 PM EDT’

2023
[76]

Jackson’s Eye-Opening New Marvel Series"

heading ‘‘‘Secret Invasion’ Revealed: Inside Samuel L. Jackson’s Eye-Opening New Marvel Series"
[77]

Jackson’s Eye-Opening New Marvel Series"

link ‘‘‘Secret Invasion’ Revealed: Inside Samuel L. Jackson’s Eye-Opening New Marvel Series"
[78]

link ‘vanityfair.com’
[81]

StaticText ‘t3 127jaxx’
[82]

time ‘March 31, 2023 at 8:12:27 AM EDT’

2023
[84]

Under review

link ‘television’ 17 Preprint. Under review
[85]

heading ‘Star Trek: Starfleet Academy Series, From Alex Kurtzman and Nancy Drew Creator, Ordered at Paramount+’
[86]

link ‘Star Trek: Starfleet Academy Series, From Alex Kurtzman and Nancy Drew Creator, Ordered at Paramount+’
[89]

StaticText ‘t3 126sut3’
[90]

time ‘March 30, 2023 at 1:00:27 PM EDT’

2023
[92]

heading ‘‘‘Fargo’ Season 5: FX Unveils First Look at Jon Hamm, Premieres in Late 2023"

2023
[93]

link ‘‘‘Fargo’ Season 5: FX Unveils First Look at Jon Hamm, Premieres in Late 2023"

2023
[94]

link ‘tvinsider.com’
[97]

StaticText ‘t3 12686kh’
[98]

time ‘March 29, 2023 at 10:05:47 PM EDT’

2023
[100]

heading ‘Disney Lays Off Ike Perlmutter, Chairman of Marvel Entertainment’
[101]

link ‘Disney Lays Off Ike Perlmutter, Chairman of Marvel Entertainment’
[104]

StaticText ‘t3 125s0rj’
[105]

time ‘March 29, 2023 at 11:49:11 AM EDT’

2023
[107]

heading ‘House Of The Dragon’ To Get Sorter Season 2 (8 Episodes) As HBO Series Eyes Season 3 Greenlight
[108]

link ‘House Of The Dragon’ To Get Shorter Season 2 (8 Episodes) As HBO Series Eyes Season 3 Greenlight
[109]

StaticText ‘Submitted by ’
[110]

link ‘MarvelsGrantMan136’ expanded: False
[111]

StaticText ‘t3 1257zaj’
[112]

time ‘March 28, 2023 at 8:42:17 PM EDT’

2023
[113]

StaticText ‘3 years ago’
[114]

Under review

StaticText ‘1’ 18 Preprint. Under review
[115]

heading‘‘‘Star Trek: Strange New Worlds’ (Season 3) and ‘Lower Decks’ (Season 5) Renewed at Paramount+; ‘Strange New Worlds’ Season 2 Premieres June 15, ‘Lower Decks’ Season 4 Premieres Summer 2023"

2023
[116]

link ‘‘‘Star Trek: Strange New Worlds’ (Season 3) and ‘Lower Decks’ (Season 5) Renewed at Paramount+; ‘Strange New Worlds’ Season 2 Premieres June 15, ‘Lower Decks’ Season 4 Premieres Summer 2023"

2023
[117]

heading ‘MarvelsGrantMan136’
[118]

link ‘MarvelsGrantMan136’
[119]

StaticText ‘Registered ’
[120]

time ‘June 24, 2020 at 11:02:10 AM EDT’

2020
[121]

StaticText ‘6 years ago’
[122]

StaticText ‘Not whitelisted’
[123]

StaticText ‘t2 5adwlxvnAwesome Prompt ArtistFreelance Web DeveloperAwesome Prompt ArtistAwesome Prompt ArtistAwesome Prompt ArtistFreelance Web DeveloperI am a robotI am a robotPro Python Developer withFreelance Web Developer Freelance Web Developer Freelance Web Developer 20 years of Experience’
[124]

link ‘Edit biography’
[125]

link ‘Hidden forums’
[126]

heading ‘MarvelsGrantMan136 is a moderator on:’
[127]

link ‘PlantsForCatParents’

Showing first 80 references.