{"total":39,"items":[{"citing_arxiv_id":"2607.01916","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ContextSniper: AntTrail's Token-Efficient Code Memory for Repository-Level Program Repair","primary_cat":"cs.AI","submitted_at":"2026-07-02T09:15:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ContextSniper reduces token use by 38.9-51.5% in repository-level program repair agents on SWE-bench Lite with 2 percentage point drops in resolution rate.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00692","ref_index":44,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Self-GC: Self-Governing Context for Long-Horizon LLM Agents","primary_cat":"cs.AI","submitted_at":"2026-07-01T09:41:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Self-GC governs agent context as indexed objects with planner-proposed actions, achieving 84.85% no-impact on future continuations on a hard set versus 54-70% for baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31650","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ECHO: Prune to act, trace to learn with selective turn memory in agentic RL","primary_cat":"cs.LG","submitted_at":"2026-06-30T13:29:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ECHO is a selective turn-memory framework for agentic RL that compresses turns into indexed records, selects them for bounded contexts, and uses source indices to assign outcome credit to supporting evidence, reaching 43.4% accuracy on BrowseComp-Plus versus 28.9% for GRPO and 36.1% for 
SUPO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31564","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ACE: Pluggable Adaptive Context Elasticizer across Agents","primary_cat":"cs.AI","submitted_at":"2026-06-30T12:20:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ACE is a pluggable module that elastically orchestrates historical agent steps as raw, abstract, or dropped to maintain compact yet recoverable context for LLM agents handling long trajectories.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30005","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LLM Agents Are Latent Context Managers: Eliciting Self-Managed Context via a Proprioceptive Dashboard","primary_cat":"cs.CL","submitted_at":"2026-06-29T09:13:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VISTA supplies LLM agents with a visible proprioceptive dashboard of typed context blocks, enabling untrained self-management that lifts performance on long-horizon tool-use benchmarks across multiple model scales.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29251","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When Summaries Distort Decisions: Information Fidelity in LLM-Compressed Financial Analysis","primary_cat":"cs.AI","submitted_at":"2026-06-28T07:44:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM-based compression of financial source material can alter downstream investment decisions via decontextualization and model 
dependency, addressed by an agentic auditing approach that checks multiple compressions against the original.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.22528","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Governance Decay: How Context Compaction Silently Erases Safety Constraints in Long-Horizon LLM Agents","primary_cat":"cs.AI","submitted_at":"2026-06-21T14:30:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Context compaction silently drops governance constraints in LLM agents, raising policy violation rates from 0% to 30% on average, with a proposed pinning mitigation restoring compliance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.21732","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Safe to Check, Unsafe to Use: Relinking at the Compression Boundary of LLM Agents","primary_cat":"cs.CR","submitted_at":"2026-06-19T20:43:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Relinking is a new compression-boundary attack on LLM agents where summarization of split benign fragments produces malicious instructions, shown via Relink tool at 86.9% success rate and mitigated by KBRA defense to 0%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28376","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When Does Overlap Help? 
OSU-Mem and a Cell-Conditional Analysis of Trajectory Memory for LLM Agents","primary_cat":"cs.IR","submitted_at":"2026-06-19T04:23:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OSU-Mem shows overlapping memory helps retrieval when evidence shares tools or entities but hurts when steps are heterogeneous, with benefits on synthetic benchmarks vanishing on mixed real ones due to query mixing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20683","ref_index":125,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Question Answering to Task Completion: A Survey on Agent System and Harness Design","primary_cat":"cs.AI","submitted_at":"2026-06-14T05:40:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Survey framing LLM agents as model-plus-harness systems, decomposing harness responsibilities, mapping them to tasks, and highlighting open challenges in evaluation, safety, and co-evolution.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"based on hierarchical summarization, graph construction, and relation-aware memory organization. The second direction focused on systematic context management. Here the question is not only what to retrieve, but also when to inject information, how to compress it, how to refresh it, and how to preserve task-relevant state over long-horizon execution. This shift is reflected in methods such as ACON [125], which formulates context compression as an optimization problem, ARC [126], which treats context as a dynamically managed internal state updated through reflection, and ContextBudget [127], which makes compression decisions under explicit context-window constraints. 
Related work further examines context maintenance in software and long-horizon settings, including CAT [128], which"},{"citing_arxiv_id":"2606.11680","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Organize then Retrieve: Hierarchical Memory Navigation for Efficient Agents","primary_cat":"cs.AI","submitted_at":"2026-06-10T05:49:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HORMA builds a hierarchical memory structure from agent experiences and trains a lightweight RL navigator to retrieve minimal sufficient context, yielding better task performance with at most 22.17% of baseline token usage on ALFWorld, LoCoMo, and LongMemEval.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11078","ref_index":68,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A History-Aware Visually Grounded Critic for Computer Use Agents","primary_cat":"cs.AI","submitted_at":"2026-06-09T16:39:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HiViG is a test-time critic that combines macro-action history summarization with visual grounding of execution coordinates to reduce short-sighted and visually erroneous actions in long-horizon GUI agents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10616","ref_index":12,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents","primary_cat":"cs.AI","submitted_at":"2026-06-09T09:15:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OSL-MR is a learning-augmented framework 
that casts memory retention as constrained stochastic optimization under partial observability and outperforms heuristic baselines on LoCoMo and LongMemEval.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10209","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Less Context, Better Agents: Efficient Context Engineering for Long-Horizon Tool-Using LLM Agents","primary_cat":"cs.AI","submitted_at":"2026-06-08T22:01:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"On a hotel expense benchmark, pruning LLM agent context to the last 5 tool pairs plus summarization raises completion to 91.6% and cuts tokens by ~63% compared with retaining full conversation history.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.08151","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents","primary_cat":"cs.AI","submitted_at":"2026-06-06T13:02:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CICL scores and compresses context evidence for LLM agents via action-shift and outcome-uplift metrics, lifting hit@1 from 0.58 to 0.78 on 50 SWE-bench retrieval tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06708","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Signal-Driven Observation for Long-Horizon Web 
Agents","primary_cat":"cs.CL","submitted_at":"2026-06-04T20:48:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Signal-Driven Observation decouples observation from action frequency in long-horizon web agents by invoking selective task-relevant DOM reads only on signals such as URL changes or action failures.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06566","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"NTILC: Neural Tool Invocation via Learned Compression","primary_cat":"cs.SE","submitted_at":"2026-06-04T17:27:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"NTILC replaces in-context tool registry lookup with learned latent retrieval using a signature-aware composite loss, reducing context consumption by over 95% and latency by up to 74%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03841","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management","primary_cat":"cs.AI","submitted_at":"2026-06-02T16:20:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EvoDS adds autonomous skill acquisition via synthesis-validation-reuse and adaptive context compression via learned control within a two-stage multi-agent RL scheme, claiming 28.9% average gains over prior agents on four benchmarks plus elimination of out-of-token 
failures.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30842","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CoMem: Context Management with A Decoupled Long-Context Model","primary_cat":"cs.LG","submitted_at":"2026-05-29T04:59:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CoMem decouples memory management from agent workflow with a k-step-off asynchronous pipeline and reward-driven training, achieving 1.4x latency reduction on SWE-Bench-Verified while preserving performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27141","ref_index":84,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions","primary_cat":"cs.AI","submitted_at":"2026-05-26T15:07:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VitaBench 2.0 introduces a benchmark for long-term personalized and proactive agent behavior, with results indicating substantial gaps in current frontier LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26596","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AGORA: Adapter-Grounded Observation-Action Retention for Inference-Free Prompt Compression in LLM Agents","primary_cat":"cs.AI","submitted_at":"2026-05-26T06:29:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AGORA is an inference-free step-level compressor for LLM agent prompts that retains at least 75% of uncompressed performance in most tested settings where token-level 
methods collapse due to action-grammar destruction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24468","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SAM: State-Adaptive Memory for Long-Horizon Reasoning Agent","primary_cat":"cs.AI","submitted_at":"2026-05-23T08:37:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SAM is a standalone memory framework for long-horizon LLM agents that creates state-adaptive cues from interactions, preserves raw trajectories for intent-driven recall, and optimizes the module via expert supervision and RL, outperforming baselines on BrowseComp and related benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24279","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions","primary_cat":"cs.CL","submitted_at":"2026-05-22T23:13:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ContextEcho benchmark shows persona drift occurs across 23 frontier models in long agentic-coding sessions, is not reliably reset by compaction, and can be restored by single-shot anchors with mode-dependent effects.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23296","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Parallel Context Compaction for Long-Horizon LLM Agent Serving","primary_cat":"cs.AI","submitted_at":"2026-05-22T07:12:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Parallel compaction for LLM agent context 
management provides predictable volume control and reduces wall time versus sequential baselines on HotpotQA and LoCoMo.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21996","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Patches to Trajectories: Privileged Process Supervision for Software-Engineering Agents","primary_cat":"cs.SE","submitted_at":"2026-05-21T04:54:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"P2T distills reference patches into a latent process graph and uses it to select shortest effective trajectory segments from teacher rollouts, yielding up to 10.8 point Pass@1 gains on SWE-bench Verified with 15% lower inference cost using only 1.8k instances.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19932","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents","primary_cat":"cs.AI","submitted_at":"2026-05-19T14:51:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PEEK maintains a constant-sized context map via a programmable cache policy to give LLM agents persistent orientation knowledge about recurring external contexts, yielding 6-34% gains and lower cost than prior prompt-learning methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18597","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Latent Action Reparameterization for Efficient Agent 
Inference","primary_cat":"cs.AI","submitted_at":"2026-05-18T16:07:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LAR learns a compact latent action space from trajectories that shortens the effective decision horizon for LLM agents, reducing token count and inference time while preserving task success.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18165","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Elastic-dLLM: Position Preserving Context Compression and Augmentation of Diffusion LLMs","primary_cat":"cs.LG","submitted_at":"2026-05-18T10:09:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Position-preserving MASK token compression reduces redundancy in diffusion LLMs to accelerate parallel decoding and enable context folding for longer sequences.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15315","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Context Pruning for Coding Agents via Multi-Rubric Latent Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-14T18:30:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LaMR decomposes code context pruning into two rubrics using dedicated CRFs, a mixture-of-experts gate, and AST-derived labels to filter noise and often match or beat full-context baselines on coding benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14563","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for 
Consistent and Hierarchical Repository-Level Code Documentation","primary_cat":"cs.SE","submitted_at":"2026-05-14T08:35:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MemDocAgent generates consistent hierarchical repository-level code documentation by combining dependency-aware traversal with memory-guided agent interactions that accumulate work traces.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Yu, Dingchu Zhang, Yong Jiang, et al. Resum: Unlocking long-horizon search intelligence via context summarization. arXiv preprint arXiv:2509.13313, 2025. [42] Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, and Saravan Rajmohan. Acon: Optimizing context compression for long-horizon llm agents. arXiv preprint arXiv:2510.00615, 2025. [43] Mo Li, LH Xu, Qitai Tan, Long Ma, Ting Cao, and Yunxin Liu. Sculptor: Empowering llms with cognitive agency via active context management. arXiv preprint arXiv:2508.04664, 2025. [44] Laura Moreno, Jairo Aponte, Giriprasad Sridhara, Andrian Marcus, Lori Pollock, and K Vijay-Shanker. 
Automatic generation of natural language summaries for java classes."},{"citing_arxiv_id":"2605.11436","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty","primary_cat":"cs.CL","submitted_at":"2026-05-12T02:37:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"as Agent-BRACE; (4) ReAct (RL): PPO trained model that additionally outputs its thinking inside <think>...</think> tokens before taking an action; (5) MEM1 [Zhou et al., 2025]: RL framework that maintains a compact shared state for memory consolidation and reasoning - integrating prior memory with new observations while strategically discarding irrelevant or redundant information; (6) PABU [Jiang et al., 2026]: Belief-state framework that compactly represents an agent's state by explicitly modeling task progress and selectively retaining past actions and observation. 
Implementation Details. The maximum number of turns during training is set to be 15, and during inference, to test the long-horizon capability of the method, we set the maximum number of turns to"},{"citing_arxiv_id":"2605.08646","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PAAC: Privacy-Aware Agentic Device-Cloud Collaboration","primary_cat":"cs.LG","submitted_at":"2026-05-09T03:29:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PAAC aligns planner-executor decomposition with the device-cloud boundary via typed placeholders and on-device sanitization, delivering 15-36% higher accuracy and 2-6x lower leakage than prior device-cloud baselines on agentic benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[15] Matthew Honnibal, Ines Montani, Sofie Van Landeghem, Adriane Boyd, et al. spaCy: Industrial-Strength Natural Language Processing in Python. 2020. [16] Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams. Applied Sciences, 11(14):6421, 2021. [17] Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, and Saravan Rajmohan. ACON: Optimizing Context Compression for Long-horizon LLM Agents. arXiv preprint arXiv:2510.00615, 2025. [18] Jin-Hwa Kim, Nikita Kitaev, Xinlei Chen, Marcus Rohrbach, Byoung-Tak Zhang, Yuandong Tian, Dhruv Batra, and Devi Parikh. 
CoDraw: Collaborative Drawing as a Testbed for Grounded"},{"citing_arxiv_id":"2605.06978","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Group of Skills: Group-Structured Skill Retrieval for Agent Skill Libraries","primary_cat":"cs.CL","submitted_at":"2026-05-07T21:51:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GoSkills converts flat skill lists into role-labeled execution contexts via anchor-centered groups and graph expansion, preserving coverage and improving rewards on SkillsBench and ALFWorld under small skill budgets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04496","ref_index":73,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SCOUT: Active Information Foraging for Long-Text Understanding with Decoupled Epistemic States","primary_cat":"cs.CL","submitted_at":"2026-05-06T04:55:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SCOUT achieves state-of-the-art long-text understanding with up to 8x lower token use by actively foraging for sparse query-relevant information and updating a compact provenance-grounded epistemic state.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26622","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory","primary_cat":"cs.CL","submitted_at":"2026-04-29T12:49:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict 
context limits.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"$K_i$ segments: $\\hat{y}_i(q) = (\\hat{y}_{i,1}, \\ldots, \\hat{y}_{i,K_i}) \\in \\{0,1\\}^{K_i}$ (7) We ensure strict formatting by constraining label tokens to be either \"0\" or \"1\", where $\\hat{y}_{i,k} = 1$ indicates that segment $k$ in image $i$ is selected. Collecting these positive predictions across all $N$ memory images yields the global index set $\\hat{S}(q) = \\{(i,k) \\mid \\hat{y}_{i,k} = 1\\}$, (8) where $1 \\le i \\le N$, $1 \\le k \\le K_i$. (9) This \"index-only\" output is substantially faster and allows the system to deterministically \"transcribe\" content by fetching exact stored texts from the memory $M$: $E = \\mathrm{Fetch}(\\hat{S}(q), M) = \\bigoplus_{(i,k) \\in \\hat{S}(q)} s_{i,k}$, (10) where $\\oplus$ denotes concatenation under a fixed formatting template. This separation of concerns leverages visual grounding for search while reserving"},{"citing_arxiv_id":"2604.20938","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HARBOR: Automated Harness Optimization","primary_cat":"cs.LG","submitted_at":"2026-04-22T13:45:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HARBOR formalizes harness optimization as constrained noisy Bayesian optimization over mixed-variable spaces and reports a case study where it outperforms manual tuning on a production coding agent.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13346","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AgentSPEX: An Agent SPecification and EXecution Language","primary_cat":"cs.CL","submitted_at":"2026-04-14T23:16:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AgentSPEX is a new language and harness for explicitly specifying and running structured LLM-agent workflows with typed steps, control flow, 
parallel execution, and a visual editor.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02688","ref_index":10,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MatClaw: An Autonomous Code-First LLM Agent for End-to-End Materials Exploration","primary_cat":"cond-mat.mtrl-sci","submitted_at":"2026-04-03T03:32:15+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MatClaw shows a code-first LLM agent autonomously generating and executing workflows for ML force field training, Curie temperature prediction, and parameter search on CuInP2S6, succeeding on code but requiring interventions for tacit domain knowledge.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.21354","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project","primary_cat":"cs.LG","submitted_at":"2026-03-22T18:30:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"A direct cost-performance comparison [30] shows that at 100K-token context lengths, per-turn inference charges grow proportionally with context even under prompt caching, and fact-based memory systems become cheaper after approximately ten interaction turns, precisely the regime where agent sessions operate. 
Context compression can mitigate the problem: ACON [31] achieves 26-54% peak-token reduction while preserving 95+% task accuracy via failure-driven guideline optimization, and Focus [32] demonstrates 22.7% token savings through autonomous agent-driven context pruning on SWE-bench tasks. However, both operate on the agent side, optimizing a single session in isolation. This is a fundamental limitation:"}],"limit":50,"offset":0}