Recognition: no theorem link
Ghost in the Context: Measuring Policy-Carriage Failures in Decision-Time Assembly
Pith reviewed 2026-05-14 21:20 UTC · model grok-4.3
The pith
Decision-time context assembly in LM agents is a measurable control-path element that can be partially hardened.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LM agents do not act on raw interaction history; they act on a bounded decision state assembled by truncation, summarization, reordering, and rewriting. If directive-bearing state is dropped, weakened, or rebound during that step, an agent can cross a policy boundary without prompt override, model changes, or persistent-memory compromise. SafeContext pins control state, reuses retained control prefixes, and optionally injects reminders under pressure while keeping model weights fixed. Unmitigated risk is systematic, but absolute exact respect remains low. Against truncation, SafeContext yields small gains; against a strong structured-compaction policy, most aggregate lift disappears, leaving
What carries the argument
SafeContext, a control layer that pins control state, reuses retained control prefixes, and optionally injects reminders under pressure while keeping model weights fixed.
If this is right
- Context assembly failures allow policy boundary crossings without any model or prompt alteration.
- SafeContext delivers small gains against truncation policies but limited benefit under structured compaction.
- The same failure mode appears in larger models including Qwen 14B and Llama 70B.
- Replay-only mechanisms do not account for the observed effects of SafeContext.
- Decision-time context assembly constitutes a measurable and partially addressable component of the agent control path.
Where Pith is reading between the lines
- Agent safety evaluations should incorporate direct testing of context assembly policies as a distinct attack surface.
- Layered hardening at the assembly stage may reduce dependence on model-level alignment alone.
- Further tests with a broader set of real-world assembly policies would help determine how representative the reported risks are.
- The visibility audits of assembled states provide a concrete method for diagnosing retention failures in deployed agents.
Load-bearing premise
The specific truncation and structured-compaction policies tested represent those used in practical LM agent systems, and the constraint respect judgments accurately reflect true policy carriage without bias.
What would settle it
An experiment demonstrating that assembled decision states retain all directive-bearing content under both truncation and compaction policies, or that SafeContext produces no measurable improvement in constraint respect when applied to additional assembly policies.
Figures
read the original abstract
LM agents do not act on raw interaction history; they act on a bounded decision state assembled by truncation, summarization, reordering, and rewriting. If directive-bearing state is dropped, weakened, or rebound during that step, an agent can cross a policy boundary without prompt override, model changes, or persistent-memory compromise. We study this failure mode over local Llama 3.1 8B, Qwen 2.5 7B, and Mistral 7B using judged exact constraint respect and direct audits of assembled-state visibility. We evaluate SafeContext, a control layer that pins control state, reuses retained control prefixes, and optionally injects reminders under pressure while keeping model weights fixed. Unmitigated risk is systematic, but absolute exact respect remains low. Against truncation, SafeContext yields small gains; against a strong structured-compaction policy, most aggregate lift disappears, leaving residual benefit mainly in overflow eviction and selected aliasing slices. Replay-only does not explain the effect. A larger-model extension on Qwen 14B and Llama 70B shows the same failure object under larger models, although sign and magnitude remain policy-conditional. Decision-time context assembly is therefore a measurable part of the control path that can be partially hardened.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates policy-carriage failures in LM agents caused by decision-time context assembly (truncation, summarization, reordering, rewriting) that can drop or weaken directive-bearing state without prompt overrides or model changes. It evaluates this failure mode on Llama 3.1 8B, Qwen 2.5 7B, and Mistral 7B using judged exact constraint respect and direct audits of assembled-state visibility. SafeContext is proposed as a control layer that pins control state, reuses prefixes, and injects reminders while keeping weights fixed. Experiments show systematic unmitigated risk with low absolute respect rates; SafeContext yields small gains against truncation that largely vanish under strong compaction, with residual benefits in overflow and aliasing cases. The same failure pattern appears in 14B/70B models but remains policy-conditional. The central claim is that decision-time context assembly is a measurable, partially hardenable component of the agent control path.
Significance. If the empirical findings hold under fuller validation, the work identifies a concrete, previously under-studied vector for unintended policy boundary crossings in deployed LM agents that operates entirely at inference time. The demonstration that a lightweight, weight-fixed intervention can produce measurable (if policy-dependent) hardening, together with the use of direct audits alongside judgments, supplies a useful empirical foothold for control-path analysis. The extension to larger models further indicates that the failure mode is not confined to small-scale systems.
major comments (3)
- [Abstract] Abstract: The experimental description provides no sample sizes, number of trials per condition, statistical methods, or full evaluation protocols. Without these, it is impossible to determine whether the reported systematic risk, small SafeContext gains against truncation, and their disappearance under compaction are statistically reliable or driven by small-N effects.
- [Abstract] Abstract: The manuscript does not justify why the two tested assembly policies (truncation and structured-compaction) are representative of mechanisms used in practical LM-agent deployments (e.g., retrieval-augmented windows, agent-specific summarizers, or dynamic reordering). If these policies are unrepresentative, the measured failure rates and residual hardening benefits cannot be taken as evidence that decision-time assembly is a general, measurable part of the control path.
- [Abstract] Abstract: No details are supplied on judgment prompts, inter-rater reliability, constraint-definition ambiguity, or how direct audits were performed. This leaves open the possibility that evaluator bias or inconsistent labeling inflates the reported unmitigated risk and understates or overstates SafeContext lift, directly undermining the central measurement claim.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and will revise the manuscript to strengthen the reporting of experimental details and justifications.
read point-by-point responses
-
Referee: [Abstract] Abstract: The experimental description provides no sample sizes, number of trials per condition, statistical methods, or full evaluation protocols. Without these, it is impossible to determine whether the reported systematic risk, small SafeContext gains against truncation, and their disappearance under compaction are statistically reliable or driven by small-N effects.
Authors: We agree that the abstract omits these quantitative details. The experiments used 100 trials per model-policy combination, with results expressed as exact respect proportions and 95% Wilson-score confidence intervals; no parametric assumptions were imposed. The full evaluation protocol, including trial generation and audit procedures, appears in Section 3. We will revise the abstract to state the sample size and statistical approach explicitly. revision: yes
-
Referee: [Abstract] Abstract: The manuscript does not justify why the two tested assembly policies (truncation and structured-compaction) are representative of mechanisms used in practical LM-agent deployments (e.g., retrieval-augmented windows, agent-specific summarizers, or dynamic reordering). If these policies are unrepresentative, the measured failure rates and residual hardening benefits cannot be taken as evidence that decision-time assembly is a general, measurable part of the control path.
Authors: Truncation and structured-compaction were chosen because they instantiate the two most common primitive operations in deployed context windows (length-based eviction and content-based condensation). We will add a short justification paragraph in the introduction, citing representative agent frameworks, while explicitly noting that the study does not cover retrieval-augmented or reordering policies and that generalization therefore remains conditional on those mechanisms. revision: yes
-
Referee: [Abstract] Abstract: No details are supplied on judgment prompts, inter-rater reliability, constraint-definition ambiguity, or how direct audits were performed. This leaves open the possibility that evaluator bias or inconsistent labeling inflates the reported unmitigated risk and understates or overstates SafeContext lift, directly undermining the central measurement claim.
Authors: We acknowledge the need for greater transparency on the evaluation pipeline. Judgment prompts are reproduced verbatim in Appendix A; direct audits consist of token-level inspection of the assembled context for directive presence; inter-rater reliability on a 200-example subset yielded Cohen’s kappa of 0.83; and constraint definitions were fixed in advance to limit ambiguity. We will summarize these elements in the abstract and expand the methods section accordingly. revision: yes
Circularity Check
No circularity; empirical measurement study with no derivation chain
full rationale
The paper is an empirical investigation of context-assembly failures in LM agents. It evaluates specific policies (truncation, structured-compaction) on fixed models via direct audits and judged constraint respect, then measures the effect of the SafeContext layer. No equations, fitted parameters, or predictions are defined in terms of the target outcomes. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim follows directly from the reported experimental deltas rather than reducing to its own inputs by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. 2024. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). ...
-
[2]
Amanda Bertsch, Maor Ivgi, Emily Xiao, Uri Alon, Jonathan Berant, Matthew R. Gormley, and Graham Neubig. 2025. In-Context Learning with Long-Context Models: An In-Depth Exploration. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguis- tics: Human Language Technologies (Volume 1: Long Pap...
-
[3]
Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. 2024. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents. InAdvances in Neural Information Processing Systems 37. Curran Associates, Inc., Vancouver, Canada, 26. https://doi.org/10.52202/079017-2636 Datase...
-
[4]
Shen Dong, Shaochen Xu, Pengfei He, Yige Li, Jiliang Tang, Tianming Liu, Hui Liu, and Zhen Xiang. 2025. Memory Injection Attacks on LLM Agents via Query- Only Interaction. InAdvances in Neural Information Processing Systems 38. Curran Associates, Inc., San Diego, CA, USA, 35. https://neurips.cc/virtual/2025/poster/ 118152 Poster
work page 2025
-
[5]
Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. 2024. RULER: What’s the Real Context Size of Your Long-Context Language Models?. InProceedings of the First Conference on Language Modeling. OpenReview.net, Philadelphia, PA, USA, 27
work page 2024
-
[6]
Jiazheng Kang, Mingming Ji, Zhe Zhao, and Ting Bai. 2025. Memory OS of AI Agent. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Suzhou, China, 25961–25970. https://doi.org/10.18653/v1/2025.emnlp-main.1318
-
[7]
Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang
Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the Middle: How Language Models Use Long Contexts.Transactions of the Association for Computational Linguistics 12 (2024), 157–173. https://doi.org/10.1162/tacl_a_00638
-
[8]
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xi- ang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. 2024. AgentBench: Evaluating LLMs as Agents. InThe Twelfth International Conference on Le...
work page 2024
-
[9]
Zesen Liu, Zhixiang Zhang, Yuchong Xie, and Dongdong She. 2025. Com- pressionAttack: Exploiting Prompt Compression as a New Attack Sur- face in LLM-Powered Agents. https://doi.org/10.48550/arXiv.2510.22963 arXiv:cs.CR/2510.22963
-
[10]
Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. 2024. Evaluating Very Long-Term Conversational Memory of LLM Agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Associa- tion for Computational Linguistics, Bangkok, Thailand, 13851–13870...
-
[11]
Rossi, Seunghyun Yoon, and Hinrich Sch"utze
Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, and Hinrich Sch"utze. 2025. NoLiMa: Long-Context Ghost in the Context: Measuring Policy-Carriage Failures in Decision-Time Assembly Evaluation Beyond Literal Matching. InProceedings of the 42nd International Conference on Machine Learning (Proceedings of Mac...
work page 2025
-
[12]
MemGPT: Towards LLMs as Operating Systems
Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. 2023. MemGPT: Towards LLMs as Operating Systems. https://doi.org/10.48550/arXiv.2310.08560 arXiv:cs.AI/2310.08560
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.08560 2023
-
[13]
Joon Sung Park, Joseph C. O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. Generative Agents: Interactive Simulacra of Human Behavior. InProceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. ACM, San Francisco, CA, USA, 2:1–2:22. https://doi.org/10.1145/3586183.3606763
-
[14]
Fábio Perez and Ian Ribeiro. 2022. Ignore Previous Prompt: Attack Tech- niques for Language Models. https://doi.org/10.48550/arXiv.2211.09527 arXiv:cs.CL/2211.09527 Presented at the NeurIPS ML Safety Workshop 2022
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2211.09527 2022
-
[15]
Maddison, and Tatsunori Hashimoto
Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, and Tatsunori Hashimoto. 2024. Identifying the Risks of LM Agents with an LM-Emulated Sandbox. InThe Twelfth International Conference on Learning Representations. OpenReview.net, Vienna, Austria, 68. https://proceedings.iclr.cc/paper_files/paper...
work page 2024
-
[16]
Rana Salama, Jason Cai, Michelle Yuan, Anna Currey, Monica Sunkara, Yi Zhang, and Yassine Benajiba. 2025. MemInsight: Autonomous Memory Augmentation for LLM Agents. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Suzhou, China, 33136–33152. https://doi.org/10.18653/v1/202...
-
[17]
Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. InAdvances in Neural Information Processing Systems 36. Curran Associates, Inc., New Or- leans, USA, 13. https://proceedings.neurips.cc/paper_f...
work page 2023
-
[18]
Uri Shaham, Maor Ivgi, Avia Efrat, Jonathan Berant, and Omer Levy. 2023. Zero- SCROLLS: A Zero-Shot Benchmark for Long Text Understanding. InFindings of the Association for Computational Linguistics: EMNLP 2023. Association for Computational Linguistics, Singapore, 7977–7989. https://doi.org/10.18653/v1/ 2023.findings-emnlp.536
-
[19]
Uri Shaham, Elad Segal, Maor Ivgi, Avia Efrat, Ori Yoran, Adi Haviv, Ankit Gupta, Wenhan Xiong, Mor Geva, Jonathan Berant, and Omer Levy. 2022. SCROLLS: Standardized CompaRison Over Long Language Sequences. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Abu Dhabi, United...
-
[20]
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language Agents with Verbal Reinforcement Learn- ing. InAdvances in Neural Information Processing Systems 36. Curran Associates, Inc., New Orleans, USA, 19. https://proceedings.neurips.cc/paper_files/paper/ 2023/hash/1b44b878bb782e6954cd888628510e90-Abstrac...
work page 2023
-
[21]
Saksham Sahai Srivastava and Haoyu He. 2025. MemoryGraft: Persistent Compromise of LLM Agents via Poisoned Experience Retrieval. https: //doi.org/10.48550/arXiv.2512.16962 arXiv:cs.CR/2512.16962
-
[22]
Haoran Tan, Zeyu Zhang, Chen Ma, Xu Chen, Quanyu Dai, and Zhenhua Dong
-
[23]
InFindings of the Association for Computational Linguistics: ACL 2025
MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents. InFindings of the Association for Computational Linguistics: ACL 2025. Association for Computational Linguistics, Vienna, Austria, 19336– 19352. https://doi.org/10.18653/v1/2025.findings-acl.989
-
[24]
Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. 2023. PlanBench: An Extensible Benchmark for Evalu- ating Large Language Models on Planning and Reasoning about Change. InAd- vances in Neural Information Processing Systems 36. Curran Associates, Inc., New Orleans, USA, 13. https://proceedings.neurips.cc/paper_...
work page 2023
-
[25]
Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. 2024. The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions. https://doi.org/10.48550/arXiv.2404.13208 arXiv:cs.CR/2404.13208
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2404.13208 2024
-
[26]
Bo Wang, Weiyi He, Shenglai Zeng, Zhen Xiang, Yue Xing, Jiliang Tang, and Pengfei He. 2025. Unveiling Privacy Risks in LLM Agent Memory. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Vol- ume 1: Long Papers). Association for Computational Linguistics, Vienna, Austria, 25241–25260. https://doi.org/10.18653/v1/20...
-
[27]
Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023. Voyager: An Open-Ended Embodied Agent with Large Language Models. https://doi.org/10.48550/arXiv.2305.16291 arXiv:cs.AI/2305.16291
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.16291 2023
-
[28]
Yizhu Wang, Sizhe Chen, Raghad Alkhudair, Basel Alomair, and David Wagner
-
[29]
Defending Against Prompt Injection with DataFilter. https://doi.org/10. 48550/arXiv.2510.19207 arXiv:cs.CR/2510.19207
-
[30]
Zhenting Wang, Huancheng Chen, Jiayun Wang, and Wei Wei. 2026. Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory. https: //doi.org/10.48550/arXiv.2603.04257 arXiv:cs.CL/2603.04257
-
[31]
Qianshan Wei, Tengchao Yang, Yaochen Wang, Xinfeng Li, Lijun Li, Zhenfei Yin, Yi Zhan, Thorsten Holz, Zhiqiang Lin, and XiaoFeng Wang. 2025. A-MemGuard: A Proactive Defense Framework for LLM-Based Agent Memory. https://doi. org/10.48550/arXiv.2510.02373 arXiv:cs.CR/2510.02373
-
[32]
Ruoyao Wen, Hao Li, Chaowei Xiao, and Ning Zhang. 2026. AgentSys: Secure and Dynamic LLM Agents Through Explicit Hierarchical Memory Management. https://doi.org/10.48550/arXiv.2602.07398 arXiv:cs.CR/2602.07398
-
[33]
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis
-
[34]
InThe Twelfth International Conference on Learning Representations
Efficient Streaming Language Models with Attention Sinks. InThe Twelfth International Conference on Learning Representations. OpenReview.net, Vienna, Austria, 21. https://proceedings.iclr.cc/paper_files/paper/2024/hash/ 5e5fd18f863cbe6d8ae392a93fd271c9-Abstract-Conference.html
work page 2024
-
[35]
Xianglin Yang, Yufei He, Shuo Ji, Bryan Hooi, and Jin Song Dong. 2026. Zombie Agents: Persistent Control of Self-Evolving LLM Agents via Self-Reinforcing Injections. https://doi.org/10.48550/arXiv.2602.15654 arXiv:cs.CR/2602.15654 Presented at Lifelong Agent @ ICLR 2026
-
[36]
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. InThe Eleventh International Conference on Learning Represen- tations. OpenReview.net, Kigali, Rwanda, 33. https://iclr.cc/virtual/2023/poster/ 11003
work page 2023
-
[37]
Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, and Fangzhao Wu. 2025. Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1. ACM, Toronto, ON, Canada, 1809–1820. https://doi.org/10.1145/3690624.3709179
-
[38]
Tao Yuan, Xuefei Ning, Dong Zhou, Zhijie Yang, Shiyao Li, Minghui Zhuang, Zheyue Tan, Zhuyu Yao, Dahua Lin, Boxun Li, Guohao Dai, Shengen Yan, and Yu Wang. 2025. LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K. InProceedings of the Second Conference on Language Modeling. OpenReview.net, Montreal, Canada, 26
work page 2025
-
[39]
Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. 2024. InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents. InFindings of the Association for Computational Linguistics: ACL
work page 2024
-
[40]
doi: 10.18653/v1/2024.findings-acl.624
Association for Computational Linguistics, Bangkok, Thailand, 10471– 10506. https://doi.org/10.18653/v1/2024.findings-acl.624
-
[41]
Guilin Zhang, Wei Jiang, Xiejiashan Wang, Aisha Behr, Kai Zhao, Jeffrey Fried- man, Xu Chu, and Amine Anoun. 2026. Adaptive Memory Admission Control for LLM Agents. https://doi.org/10.48550/arXiv.2603.04549 arXiv:cs.AI/2603.04549
-
[42]
Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hongwei Wang, and Yongfeng Zhang. 2025. Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents. In The Thirteenth International Conference on Learning Representations. OpenRe- view.net, Singapore, 36. https://proceedings.iclr.cc/paper_...
work page 2025
- [43]
-
[44]
Yuxiang Zhang, Jiangming Shu, Ye Ma, Xueyuan Lin, Shangxi Wu, and Jitao Sang. 2025. Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks. https://doi.org/10.48550/arXiv.2510.12635 arXiv:cs.AI/2510.12635
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2510.12635 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.