pith. machine review for the scientific record. sign in

arxiv: 2604.03679 · v1 · submitted 2026-04-04 · 💻 cs.CL · cs.AI· cs.IR· cs.LG· cs.MM

Recognition: 2 theorem links

· Lean Theorem

LightThinker++: From Reasoning Compression to Memory Management

Da Zheng, Huajun Chen, Jintian Zhang, Lei Liang, Ningyu Zhang, Shuofei Qiao, Yujie Luo, Yuqi Zhu, Zhengke Gui, Zhenjie Wan

Pith reviewed 2026-05-13 17:20 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.IRcs.LGcs.MM
keywords LLM reasoning efficiencythought compressionadaptive memory managementtoken reductionlong-horizon taskstrajectory synthesismemory primitives
0
0 comments X

The pith

LightThinker++ lets LLMs compress intermediate thoughts into explicit memory primitives, cutting peak token use by 70 percent while raising accuracy on complex tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that static compression of reasoning traces often loses critical details and creates logical errors in long chains. LightThinker++ therefore adds explicit adaptive memory management, where the model learns to store compact semantic summaries and later retrieve or discard them on purpose. Training occurs through a dedicated trajectory synthesis pipeline that generates examples of correct memory scheduling. If this works, models can sustain deep reasoning over dozens of steps without exhausting context windows or introducing new mistakes. The result matters because current LLMs hit hard limits on token budgets during agentic or multi-step work, and a reliable compression-plus-management layer would let them run longer and cheaper on the same hardware.

Core claim

The authors claim that shifting from passive compression to explicit memory primitives, trained via a specialized trajectory synthesis pipeline, enables LLMs to schedule memory purposefully. This produces a 69.9 percent reduction in peak token usage together with a 2.42 percent accuracy increase under fixed context budgets, and keeps a stable low footprint past 80 rounds in long-horizon agentic tasks with a 14.8 percent average performance lift.

What carries the argument

Explicit Adaptive Memory Management, a behavioral-level system that inserts memory primitives into the reasoning trace and trains the model, through synthesized trajectories, to decide when to compress, store, retrieve, or discard intermediate semantic representations.

If this is right

  • Peak token usage drops roughly 70 percent and inference time drops 26 percent with only minimal accuracy loss on standard reasoning benchmarks.
  • Under a fixed context budget the method raises accuracy by 2.42 percent while still using far fewer tokens than full-trace baselines.
  • In tasks spanning more than 80 rounds the memory footprint remains low and average task performance rises 14.8 percent across varied complex scenarios.
  • The same primitives support both short reasoning chains and extended agentic loops without retraining the base model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The compression-plus-management pattern could be applied to multi-agent settings where agents must share compact memory states instead of full histories.
  • If the memory primitives prove reliable, the approach might let smaller open models handle problems that currently require much larger context windows.
  • A natural next test would measure whether the learned scheduling generalizes across domains without additional trajectory synthesis.

Load-bearing premise

The trajectory synthesis pipeline can teach the model to schedule memory in ways that never create new reasoning errors or systematic biases.

What would settle it

Run the same long-horizon agentic benchmark with and without the memory primitives; if accuracy falls below the uncompressed baseline once the context limit is reached, the claim that purposeful scheduling avoids logical bottlenecks is false.

read the original abstract

Large language models (LLMs) excel at complex reasoning, yet their efficiency is limited by the surging cognitive overhead of long thought traces. In this paper, we propose LightThinker, a method that enables LLMs to dynamically compress intermediate thoughts into compact semantic representations. However, static compression often struggles with complex reasoning where the irreversible loss of intermediate details can lead to logical bottlenecks. To address this, we evolve the framework into LightThinker++, introducing Explicit Adaptive Memory Management. This paradigm shifts to behavioral-level management by incorporating explicit memory primitives, supported by a specialized trajectory synthesis pipeline to train purposeful memory scheduling. Extensive experiments demonstrate the framework's versatility across three dimensions. (1) LightThinker reduces peak token usage by 70% and inference time by 26% with minimal accuracy loss. (2) In standard reasoning, LightThinker++ slashes peak token usage by 69.9% while yielding a +2.42% accuracy gain under the same context budget for maximum performance. (3) Most notably, in long-horizon agentic tasks, it maintains a stable footprint beyond 80 rounds (a 60%-70% reduction), achieving an average performance gain of 14.8% across different complex scenarios. Overall, our work provides a scalable direction for sustaining deep LLM reasoning over extended horizons with minimal overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes LightThinker, a method enabling LLMs to dynamically compress intermediate thoughts into compact semantic representations, and evolves it into LightThinker++ by introducing Explicit Adaptive Memory Management. This is supported by a specialized trajectory synthesis pipeline to train purposeful memory scheduling with explicit memory primitives. The paper reports that LightThinker reduces peak token usage by 70% and inference time by 26% with minimal accuracy loss; LightThinker++ achieves 69.9% token reduction with +2.42% accuracy gain under the same context budget; and in long-horizon agentic tasks it maintains a stable footprint beyond 80 rounds (60-70% reduction) with an average 14.8% performance gain across complex scenarios.

Significance. If the results hold under rigorous evaluation, the work offers a practical direction for sustaining deep LLM reasoning over extended horizons by moving from static compression to behavioral-level adaptive memory management. This could meaningfully reduce computational overhead in agentic and long-context applications while preserving or improving accuracy, addressing a core scalability bottleneck in current LLM inference.

major comments (3)
  1. [Abstract] Abstract: The headline quantitative claims (69.9% peak token reduction, +2.42% accuracy gain, 14.8% long-horizon gain) are presented without any description of experimental setup, benchmarks, baselines, number of runs, statistical tests, or error bars. This is load-bearing because the central argument that the trajectory synthesis pipeline enables purposeful scheduling without new reasoning errors cannot be evaluated from the given information.
  2. [Method] Method section (trajectory synthesis pipeline): The paper states that the specialized pipeline trains the model to perform purposeful memory scheduling that avoids introducing new reasoning errors or biases, yet provides no construction details, data curation process, or controls (e.g., comparison to random scheduling or error-rate breakdowns on failed trajectories). Without these, it is impossible to confirm that observed gains are attributable to the proposed Explicit Adaptive Memory Management rather than dataset artifacts.
  3. [Results] Results (long-horizon agentic tasks): The claim of stable performance beyond 80 rounds with 60-70% footprint reduction rests on the assumption that compression does not cause irreversible detail loss in complex cases, but no ablation studies, error analysis, or per-scenario breakdowns are referenced to support this.
minor comments (2)
  1. [Abstract] Abstract: The phrasing 'evolve the framework into LightThinker++' would benefit from an explicit one-sentence contrast between the static compression of LightThinker and the behavioral-level primitives of LightThinker++.
  2. [Method] Notation: The term 'Explicit Adaptive Memory Management' is introduced without a formal definition or pseudocode; a short algorithmic outline would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the current presentation of experimental details and methodological controls requires strengthening to allow full evaluation of the claims. We address each major comment below and will incorporate the requested clarifications and additional analyses in the revised manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The headline quantitative claims (69.9% peak token reduction, +2.42% accuracy gain, 14.8% long-horizon gain) are presented without any description of experimental setup, benchmarks, baselines, number of runs, statistical tests, or error bars. This is load-bearing because the central argument that the trajectory synthesis pipeline enables purposeful scheduling without new reasoning errors cannot be evaluated from the given information.

    Authors: We acknowledge the abstract lacks sufficient context for the reported metrics. In the revision we will expand the abstract to briefly specify the benchmarks (GSM8K, MATH, and long-horizon agentic environments), the primary baselines, that all numbers are means over 5 runs, and that standard deviations and statistical significance tests appear in the main results tables. This will allow readers to assess the claims directly from the abstract. revision: yes

  2. Referee: [Method] Method section (trajectory synthesis pipeline): The paper states that the specialized pipeline trains the model to perform purposeful memory scheduling that avoids introducing new reasoning errors or biases, yet provides no construction details, data curation process, or controls (e.g., comparison to random scheduling or error-rate breakdowns on failed trajectories). Without these, it is impossible to confirm that observed gains are attributable to the proposed Explicit Adaptive Memory Management rather than dataset artifacts.

    Authors: We agree that the trajectory synthesis pipeline description is currently insufficient. The revision will add a dedicated subsection detailing: (1) the teacher-model trajectory generation procedure, (2) the curation filters that retain only trajectories exhibiting correct final answers and effective memory usage, (3) an explicit comparison of purposeful versus random memory scheduling, and (4) error-rate breakdowns on both successful and failed trajectories. These additions will demonstrate that performance gains stem from the learned scheduling policy rather than data artifacts. revision: yes

  3. Referee: [Results] Results (long-horizon agentic tasks): The claim of stable performance beyond 80 rounds with 60-70% footprint reduction rests on the assumption that compression does not cause irreversible detail loss in complex cases, but no ablation studies, error analysis, or per-scenario breakdowns are referenced to support this.

    Authors: We recognize that the long-horizon stability claim requires stronger supporting evidence. The revised manuscript will include: (1) ablation studies isolating the effect of compression on information retention, (2) a categorized error analysis of failure modes across rounds, and (3) per-scenario performance tables for the agentic tasks. These analyses will show that the explicit adaptive memory primitives selectively preserve critical details, preventing irreversible loss. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper proposes LightThinker and LightThinker++ as empirical frameworks for dynamic thought compression and explicit adaptive memory management, trained via a specialized trajectory synthesis pipeline. All central claims (token reductions of 69.9-70%, accuracy gains of +2.42% and +14.8%) are presented as measured experimental outcomes across standard and long-horizon tasks rather than as quantities derived from equations or parameters that reduce to the method's own inputs by construction. No mathematical derivations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations of uniqueness theorems appear in the abstract or description. The method is self-contained against external benchmarks via reported performance metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Review is based solely on the abstract; no explicit free parameters, mathematical axioms, or independently evidenced invented entities are stated. The framework introduces the concept of explicit adaptive memory management as a behavioral-level addition to compression.

invented entities (1)
  • Explicit Adaptive Memory Management no independent evidence
    purpose: Shift from static compression to behavioral-level memory scheduling using explicit primitives
    Introduced to prevent logical bottlenecks from irreversible detail loss during compression

pith-pipeline@v0.9.0 · 5575 in / 1224 out tokens · 54815 ms · 2026-05-13T17:20:19.084487+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · 13 internal anchors

  1. [1]

    A Survey of Large Language Models

    Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models.arXivpreprint arXiv:2303.18223, 1(2), 2023

  2. [2]

    Chatgpt is a remarkable tool—for experts.DataIntelligence, 6(1): 240–296, 2024

    Amos Azaria, Rina Azoulay, and Shulamit Reches. Chatgpt is a remarkable tool—for experts.DataIntelligence, 6(1): 240–296, 2024. doi: 10.1162/dint_a_00235

  3. [3]

    Chi, Quoc V

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors,Advancesin NeuralInformation Processing Systems 35: Annual Conference on Neural I...

  4. [4]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondrich, Andrey ...

  5. [5]

    Qwq: Reflect deeply on the boundaries of the unknown, 2024

    Team Qwen. Qwq: Reflect deeply on the boundaries of the unknown, 2024. URLhttps://qwenlm.github.io/ blog/qwq-32b-preview/

  6. [6]

    DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

  7. [7]

    A comparative study on reasoning patterns of openai’s o1 model

    Siwei Wu, Zhongyuan Peng, Xinrun Du, Tuney Zheng, Minghao Liu, Jialong Wu, Jiachen Ma, Yizhi Li, Jian Yang, Wangchunshu Zhou, Qunshu Lin, Junbo Zhao, Zhaoxiang Zhang, Wenhao Huang, Ge Zhang, Chenghua Lin, and Jiaheng Liu. A comparative study on reasoning patterns of openai’s o1 model.CoRR, abs/2410.13639, 2024. doi: 10.48550/ARXIV.2410.13639. URLhttps://d...

  8. [8]

    Gomez, Lukasz Kaiser, and Illia Polosukhin

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors,Advances in Neural Information Processing Systems 30: Annual Conference ...

  9. [9]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

  10. [10]

    Qwen2.5 Technical Report

    doi: 10.48550/ARXIV.2412.15115. URLhttps://doi.org/10.48550/arXiv.2412.15115

  11. [11]

    Token-budget-aware llm reasoning

    Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu Zhao, Shiqing Ma, and Zhenyu Chen. Token-budget-aware LLM reasoning. CoRR, abs/2412.18547, 2024. doi: 10.48550/ARXIV.2412.18547. URLhttps://doi.org/10. 48550/arXiv.2412.18547

  12. [12]

    Break the chain: Large language models can be shortcut reasoners.CoRR, abs/2406.06580, 2024

    Mengru Ding, Hanmeng Liu, Zhizhang Fu, Jian Song, Wenbo Xie, and Yue Zhang. Break the chain: Large language models can be shortcut reasoners.CoRR, abs/2406.06580, 2024. doi: 10.48550/ARXIV.2406.06580. URL https://doi.org/10.48550/arXiv.2406.06580

  13. [13]

    Buttazzo, Nicolamaria Manes, and Fabrizio Giacomelli

    Sania Nayab, Giulio Rossolini, Giorgio C. Buttazzo, Nicolamaria Manes, and Fabrizio Giacomelli. Concise thoughts: Impact of output length on LLM reasoning and cost.CoRR, abs/2407.19825, 2024. doi: 10.48550/ARXIV.2407.19825. URLhttps://doi.org/10.48550/arXiv.2407.19825

  14. [14]

    Can language models learn to skip steps? In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M

    Tengxiao Liu, Qipeng Guo, Xiangkun Hu, Cheng Jiayang, Yue Zhang, Xipeng Qiu, and Zheng Zhang. Can language models learn to skip steps? In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors,Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Proc...

  15. [15]

    C3ot: Generating shorter chain-of- thought without compromising effectiveness

    Yu Kang, Xianghui Sun, Liangyu Chen, and Wei Zou. C3ot: Generating shorter chain-of-thought without compromising effectiveness.CoRR, abs/2412.11664, 2024. doi: 10.48550/ARXIV.2412.11664. URLhttps://doi. org/10.48550/arXiv.2412.11664

  16. [16]

    Training language models to reason efficiently

    Daman Arora and Andrea Zanette. Training language models to reason efficiently.arXivpreprintarXiv:2502.04463, 2025

  17. [17]

    O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning, 2025

    Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, and Dacheng Tao. O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning, 2025

  18. [18]

    Compressed chain of thought: Efficient reasoning through dense representations

    Jeffrey Cheng and Benjamin Van Durme. Compressed chain of thought: Efficient reasoning through dense representations. CoRR, abs/2412.13171, 2024. doi: 10.48550/ARXIV.2412.13171. URLhttps://doi.org/10. 48550/arXiv.2412.13171

  19. [19]

    Training Large Language Models to Reason in a Continuous Latent Space

    Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large languagemodelstoreasoninacontinuouslatentspace. CoRR,abs/2412.06769,2024.doi: 10.48550/ARXIV.2412.06769. URLhttps://doi.org/10.48550/arXiv.2412.06769

  20. [20]

    Implicit chain of thought reasoning via knowledge distillation, 2023

    YuntianDeng,KiranPrasad,RolandFernandez,PaulSmolensky,VishravChaudhary,andStuartM.Shieber. Implicit chainofthoughtreasoningviaknowledgedistillation. CoRR,abs/2311.01460,2023. doi: 10.48550/ARXIV.2311.01460. URLhttps://doi.org/10.48550/arXiv.2311.01460. 25

  21. [21]

    Yuntian Deng, Yejin Choi, and Stuart M. Shieber. From explicit cot to implicit cot: Learning to internalize cot step by step. CoRR, abs/2405.14838, 2024. doi: 10.48550/ARXIV.2405.14838. URLhttps://doi.org/10.48550/arXiv. 2405.14838

  22. [22]

    Barrett, Zhangyang Wang, and Beidi Chen

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuan- dong Tian, Christopher Ré, Clark W. Barrett, Zhangyang Wang, and Beidi Chen. H2O: heavy-hitter oracle for efficient generative inference of large language models. In Alice Oh, Tristan Naumann, Amir Glober- son, Kate Saenko, Moritz Hardt, and Sergey Levine, edi...

  23. [23]

    Sepllm: Accelerate large language models by compressing one segment into one separator.CoRR, abs/2412.12094,2024

    Guoxuan Chen, Han Shi, Jiawei Li, Yihang Gao, Xiaozhe Ren, Yimeng Chen, Xin Jiang, Zhenguo Li, Weiyang Liu, and Chao Huang. Sepllm: Accelerate large language models by compressing one segment into one separator.CoRR, abs/2412.12094,2024. doi: 10.48550/ARXIV.2412.12094. URLhttps://doi.org/10.48550/arXiv.2412.12094

  24. [24]

    Jesse Mu, Xiang Li, and Noah D. Goodman. Learning to compress prompts with gist tokens. In Alice Oh, Tristan Naumann,AmirGloberson,KateSaenko,MoritzHardt,andSergeyLevine,editors, AdvancesinNeuralInformation Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA,December 10 - 16, 2023, 202...

  25. [25]

    Reasoning with language model prompting: A survey

    Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. Reasoning with language model prompting: A survey. InProceedingsofthe 61stAnnualMeetingof the Association for Computational Linguistics (Volume1: Long Papers), pages 5368–5393, Toronto, Canada, July

  26. [26]

    URLhttps://aclanthology.org/2023.acl-long.294

    Association for Computational Linguistics. URLhttps://aclanthology.org/2023.acl-long.294

  27. [27]

    The empirical case for two systems of reasoning.Psychologicalbulletin, 119(1):3, 1996

    Steven A Sloman. The empirical case for two systems of reasoning.Psychologicalbulletin, 119(1):3, 1996

  28. [28]

    Thinking, fast and slow.Farrar,Strausand Giroux, 2011

    Daniel Kahneman. Thinking, fast and slow.Farrar,Strausand Giroux, 2011

  29. [29]

    Thinking fast and slow in ai

    Grady Booch, Francesco Fabiano, Lior Horesh, Kiran Kate, Jonathan Lenchner, Nick Linck, Andreas Loreggia, Keerthiram Murgesan, Nicholas Mattei, Francesca Rossi, et al. Thinking fast and slow in ai. InProceedingsofthe AAAI Conferenceon ArtificialIntelligence, volume 35, pages 15042–15046, 2021

  30. [30]

    Long context compression with activation beacon, 2024

    Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, and Zhicheng Dou. Long context compression with activation beacon, 2024

  31. [31]

    Wong, Xin He, Wanshun Chen, and Longyue Wang

    Jianhui Pang, Fanghua Ye, Derek F. Wong, Xin He, Wanshun Chen, and Longyue Wang. Anchor-based large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 4958–4976. Association for Computational Linguis...

  32. [32]

    Open Thoughts

    OpenThoughts Team. Open Thoughts. https://open-thoughts.ai, January 2025

  33. [33]

    OneGen: Efficient one-pass unified generation and retrieval for LLMs

    Jintian Zhang, Cheng Peng, Mengshu Sun, Xiang Chen, Lei Liang, Zhiqiang Zhang, Jun Zhou, Huajun Chen, and Ningyu Zhang. OneGen: Efficient one-pass unified generation and retrieval for LLMs. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Findings of the Association for Computational Linguistics: EMNLP 2024, pages 4088–4119, Miami, Florida, U...

  34. [34]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Bethany...

  35. [35]

    Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica

    Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/ DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2 ,

  36. [36]

    Tokenskip: Con- trollable chain-of-thought compression in llms

    Heming Xia, Yongqi Li, Chak Tou Leong, Wenjie Wang, and Wenjie Li. Tokenskip: Controllable chain-of- thought compression in llms. CoRR, abs/2502.12067, 2025. doi: 10.48550/ARXIV.2502.12067. URL https: //doi.org/10.48550/arXiv.2502.12067

  37. [37]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.CoRR, abs/2110.14168, 2021. URLhttps://arxiv.org/abs/2110.14168

  38. [38]

    Measuringmassivemultitasklanguageunderstanding.In 9thInternationalConferenceonLearningRepresentations, ICLR2021,VirtualEvent,Austria,May3-7,2021.OpenReview.net,2021

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuringmassivemultitasklanguageunderstanding.In 9thInternationalConferenceonLearningRepresentations, ICLR2021,VirtualEvent,Austria,May3-7,2021.OpenReview.net,2021. URL https://openreview.net/forum? id=d7KBjmI3GmQ

  39. [39]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. InFirstConferenceon Language Modeling, 2024. URLhttps://openreview.net/forum?id=Ti67584b98

  40. [40]

    Le, Ed H

    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging big-bench tasks and whether chain-of-thought can solve them. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki, editors,Findings of the Association for Computational Linguis...

  41. [41]

    When thinking fails: The pitfalls of reasoning for instruction-following in llms, 2025

    Xiaomin Li, Zhou Yu, Zhiwei Zhang, Xupeng Chen, Ziji Zhang, Yingying Zhuang, Narayanan Sadagopan, and Anurag Beniwal. When thinking fails: The pitfalls of reasoning for instruction-following in llms, 2025. URL https://arxiv.org/abs/2505.11423

  42. [42]

    Hotpotqa: A dataset for diverse, explainable multi-hop question answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. InProceedings of the 2018 Conferenceon EmpiricalMethods inNaturalLanguageProcessing, pages 2369–2380, 2018

  43. [43]

    Musique: Multihop questions via single-hop question composition.Transactionsofthe AssociationforComputationalLinguistics, 10:539–554, 2022

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition.Transactionsofthe AssociationforComputationalLinguistics, 10:539–554, 2022

  44. [44]

    arXiv preprint arXiv:2505.22648 (2025) GeoBrowse 19

    Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Gang Fu,YongJiang,etal. Webdancer: Towardsautonomousinformationseekingagency. arXivpreprintarXiv:2505.22648, 2025

  45. [45]

    Webshaper: Agentically data synthesizing via information-seeking formalization.arXiv preprint arXiv:2507.15061, 2025

    Zhengwei Tao, Jialong Wu, Wenbiao Yin, Junkai Zhang, Baixuan Li, Haiyang Shen, Kuan Li, Liwen Zhang, Xinyu Wang, Yong Jiang, et al. Webshaper: Agentically data synthesizing via information-seeking formalization.arXiv preprint arXiv:2507.15061, 2025

  46. [46]

    Webwalker: Benchmarking llms in web traversal.arXivpreprintarXiv:2501.07572, 2025

    Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, et al. Webwalker: Benchmarking llms in web traversal.arXivpreprintarXiv:2501.07572, 2025

  47. [47]

    Glm-4.6: Advanced agentic, reasoning and coding capabilities, 2025

    Z.ai. Glm-4.6: Advanced agentic, reasoning and coding capabilities, 2025. URLhttps://z.ai/blog/glm-4.6/

  48. [48]

    System card: Claude opus 4 & claude sonnet 4, 2025

    Anthropic. System card: Claude opus 4 & claude sonnet 4, 2025. URLhttps://www-cdn.anthropic.com/ 6d8a8055020700718b0c49369f60816ba2a7c285.pdf

  49. [49]

    Introducing gpt-5, 2025

    OpenAI. Introducing gpt-5, 2025. URLhttps://openai.com/index/introducing-gpt-5/

  50. [50]

    Kimi K2: Open Agentic Intelligence

    Kimi, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence.arXivpreprintarXiv:2507.20534, 2025. 27

  51. [51]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report, 2025. URLhttps://arxiv.org/abs/2505.09388

  52. [52]

    Xbench-deepsearch, 2025

    Xbench Team. Xbench-deepsearch, 2025. URLhttps://xbench.org/agi/aisearch

  53. [53]

    BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

    Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents. arXiv preprintarXiv:2504.12516, 2025

  54. [54]

    Browsecomp-zh: Benchmarking web browsing ability of large language models in chinese.arXiv preprint arXiv:2504.19314, 2025

    Peilin Zhou, Bruce Leon, Xiang Ying, Can Zhang, Yifan Shao, Qichen Ye, Dading Chong, Zhiling Jin, Chenxuan Xie, Meng Cao, et al. Browsecomp-zh: Benchmarking web browsing ability of large language models in chinese.arXiv preprint arXiv:2504.19314, 2025

  55. [55]

    Jina, 2025

    Jina.ai. Jina, 2025. URLhttps://jina.ai/

  56. [56]

    AWQ: activation-aware weight quantization for on-device LLM compression and acceleration

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: activation-aware weight quantization for on-device LLM compression and acceleration. In Phillip B. Gibbons, Gennady Pekhimenko, and Christopher De Sa, editors,Proceedings of the Seventh Annual Conference on Machine Lear...

  57. [57]

    Gpt3.int8(): 8-bit matrix multiplication for trans- formersatscale

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Gpt3.int8(): 8-bit matrix multiplication for trans- formersatscale. InSanmiKoyejo,S.Mohamed,A.Agarwal,DanielleBelgrave,K.Cho,andA.Oh,editors, Advances inNeuralInformationProcessingSystems35: AnnualConferenceonNeuralInformationProcessingSystems2022, NeurIPS 2022, NewOrleans, LA,USA,November28 ...

  58. [58]

    KIVI: A tuning-free asymmetric 2bit quantization for KV cache

    Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. KIVI: A tuning-free asymmetric 2bit quantization for KV cache. InForty-firstInternational ConferenceonMachine Learning, ICML2024,Vienna,Austria,July21-27,2024. OpenReview.net, 2024. URLhttps://openreview.net/ forum?id=L057s2Rq8O

  59. [59]

    Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami

    Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length LLM inference with KV cache quantization. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Pa- quet, Jakub M. Tomczak, and Cheng Zhang, editors, Advances in Neural Informa...

  60. [60]

    Adaptinglanguagemodelstocompresscontexts

    AlexisChevalier,AlexanderWettig,AnirudhAjith,andDanqiChen. Adaptinglanguagemodelstocompresscontexts. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conferenceon EmpiricalMethods inNaturalLanguageProcessing,EMNLP2023,Singapore,December6-10,2023, pages 3829–3846. Association for ComputationalLinguistics,2023. doi: 10.18653/V1/...

  61. [61]

    In-contextautoencoderforcontextcompression inalargelanguagemodel

    TaoGe, JingHu, LeiWang, XunWang, Si-QingChen, andFuruWei. In-contextautoencoderforcontextcompression inalargelanguagemodel. In TheTwelfthInternationalConferenceonLearningRepresentations,ICLR2024,Vienna, Austria,May7-11,2024. OpenReview.net, 2024. URLhttps://openreview.net/forum?id=uREj4ZuGJE

  62. [62]

    In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (Dec 2023)

    Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Llmlingua: Compressing prompts for accelerated inference of large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, edi- tors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 13358–1337...

  63. [63]

    Snapkv: LLM knows what you are looking for before generation

    Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: LLM knows what you are looking for before generation. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors,Advances inNeuralInformationProcessingSyst...

  64. [64]

    PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

    ZefanCai,YichiZhang,BofeiGao,YuliangLiu,TianyuLiu,KemingLu,WayneXiong,YueDong,BaobaoChang,Junjie Hu, and Wen Xiao. Pyramidkv: Dynamic KV cache compression based on pyramidal information funneling.CoRR, abs/2406.02069,2024. doi: 10.48550/ARXIV.2406.02069. URLhttps://doi.org/10.48550/arXiv.2406.02069

  65. [65]

    Efficientstreaminglanguagemodelswith attention sinks

    GuangxuanXiao,YuandongTian,BeidiChen,SongHan,andMikeLewis. Efficientstreaminglanguagemodelswith attention sinks. InTheTwelfthInternational Conferenceon Learning Representations,ICLR2024,Vienna,Austria, May7-11, 2024. OpenReview.net, 2024. URLhttps://openreview.net/forum?id=NG7sS51zVF

  66. [66]

    SCOPE: optimizing key-value cache compression in long-context generation.CoRR, abs/2412.13649, 2024

    Jialong Wu, Zhenglin Wang, Linhai Zhang, Yilong Lai, Yulan He, and Deyu Zhou. SCOPE: optimizing key-value cache compression in long-context generation.CoRR, abs/2412.13649, 2024. doi: 10.48550/ARXIV.2412.13649. URL https://doi.org/10.48550/arXiv.2412.13649

  67. [67]

    A Survey of Context Engineering for Large Language Models

    Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, and Shenghua Liu. A survey of context engineering for large language models.CoRR, abs/2507.13334, 2025. doi: 10.48550/ARXIV.2507.13334. URL https://doi.org/10.48550/arXiv.2507.13334

  68. [68]

    Dynamic long context reasoning over compressed memory via end-to-end reinforcement learning, 2026

    Zhuoen Chen, Dongfang Li, Meishan Zhang, Baotian Hu, and Min Zhang. Dynamic long context reasoning over compressed memory via end-to-end reinforcement learning, 2026. URLhttps://arxiv.org/abs/2602.08382

  69. [69]

    Free(): Learning to forget in malloc-only reasoning models, 2026

    Yilun Zheng, Dongyang Ma, Tian Liang, Jiahao Xu, Xinting Huang, Lijie Chen, Haitao Mi, and Yan Wang. Free(): Learning to forget in malloc-only reasoning models, 2026. URLhttps://arxiv.org/abs/2602.08030

  70. [70]

    The pensieve paradigm: Stateful language models mastering their own context, 2026

    Xiaoyuan Liu, Tian Liang, Dongyang Ma, Deyu Zhou, Haitao Mi, Pinjia He, and Yan Wang. The pensieve paradigm: Stateful language models mastering their own context, 2026. URLhttps://arxiv.org/abs/2602.12108

  71. [71]

    MEM1: learning to synergize memory and reasoning for efficient long-horizon agents.CoRR, abs/2506.15841,2025

    ZijianZhou,AoQu,ZhaoxuanWu,SunghwanKim,AlokPrakash,DanielaRus,JinhuaZhao,BryanKianHsiangLow, and Paul Pu Liang. MEM1: learning to synergize memory and reasoning for efficient long-horizon agents.CoRR, abs/2506.15841,2025. doi: 10.48550/ARXIV.2506.15841. URLhttps://doi.org/10.48550/arXiv.2506.15841

  72. [72]

    Memagent: Re- shaping long-context llm with multi-conv rl-based mem- ory agent.arXiv preprint arXiv:2507.02259,

    Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, and Hao Zhou. Memagent: Reshaping long-context LLM with multi-conv rl-based memory agent. CoRR, abs/2507.02259, 2025. doi: 10.48550/ARXIV.2507.02259. URL https://doi.org/10. 48550/arXiv.2507.02259

  73. [73]

    Resum: Unlocking long-horizon search intelligence via context summarization.CoRR, abs/2509.13313, 2025

    Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou, Huifeng Yin, Zhongwang Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Minhao Cheng, Shuai Wang, Hong Cheng, and Jingren Zhou. Resum: Unlocking long-horizon search intelligence via context summarization.CoRR, abs/2509.13313, 2025. doi: 10.48550/ARXIV.2509.13313. URL https://doi.org/10.48550/arXiv.2509.13313

  74. [74]

    Agentfold: Long-horizon web agents with proactive context management.arXiv preprint arXiv:2510.24699, 2025

    Rui Ye, Zhongwang Zhang, Kuan Li, Huifeng Yin, Zhengwei Tao, Yida Zhao, Liangcai Su, Liwen Zhang, Zile Qiao, Xinyu Wang, Pengjun Xie, Fei Huang, Siheng Chen, Jingren Zhou, and Yong Jiang. Agentfold: Long-horizon web agents with proactive context management.CoRR, abs/2510.24699, 2025. doi: 10.48550/ARXIV.2510.24699. URL https://doi.org/10.48550/arXiv.2510.24699

  75. [75]

    Scaling long-horizon LLM agent via context-folding.CoRR, abs/2510.11967, 2025

    Weiwei Sun, Miao Lu, Zhan Ling, Kang Liu, Xuesong Yao, Yiming Yang, and Jiecao Chen. Scaling long-horizon LLM agent via context-folding.CoRR, abs/2510.11967, 2025. doi: 10.48550/ARXIV.2510.11967. URL https: //doi.org/10.48550/arXiv.2510.11967

  76. [76]

    Bespoke-stratos: The unreasonable effectiveness of reasoning distillation

    Bespoke Labs. Bespoke-stratos: The unreasonable effectiveness of reasoning distillation. https://www.bespokelabs.ai/blog/bespoke-stratos-the-unreasonable-effectiveness-of-reasoning-distillation, 2025. Accessed: 2025-01-22

  77. [77]

    Swift: A scalable lightweight infrastructure for fine-tuning

    Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, Wenmeng Zhou, and Yingda Chen. SWIFT: A scalable lightweight infrastructure for fine-tuning. In Toby Walsh, Julie Shah, and Zico Kolter, editors,AAAI-25, Sponsored bythe Association forthe Advancementof Artificial Intelligence, Februar...

  78. [78]

    System-1.5 reasoning: Traversal in language and latent spaces with dynamic shortcuts

    Xiaoqiang Wang, Suyuchen Wang, Yun Zhu, and Bang Liu. System-1.5 reasoning: Traversal in language and latent spaces with dynamic shortcuts. CoRR, abs/2505.18962, 2025. doi: 10.48550/ARXIV.2505.18962. URL https://doi.org/10.48550/arXiv.2505.18962

  79. [79]

    Thought-basedAttentionMask Construction

    ZhenZhang, XuehaiHe, WeixiangYan, AoShen, ChenyangZhao, ShuohangWang, YelongShen, andXinEricWang. Soft thinking: Unlocking the reasoning potential of llms in continuous concept space.CoRR, abs/2505.15778, 2025. doi: 10.48550/ARXIV.2505.15778. URLhttps://doi.org/10.48550/arXiv.2505.15778. 29 Appendix A Metric:Dependency Peak Tokens Prompt Length (a) Vanill...

  80. [80]

    Use tools based on these logic states: • Information Acquisition: Usesearch(query)or visit(url)to find new data or explore primary sources

    OPERATIONAL LOGIC: TOOL CHOICE Every step in your history is assigned anid(e.g.,[Thought ID], [Observation ID]). Use tools based on these logic states: • Information Acquisition: Usesearch(query)or visit(url)to find new data or explore primary sources. •Deepening or Re-visiting (expand): –Discrepancy Resolution: Useexpand(id)to compare conflicting data po...

Showing first 80 references.