Recognition: no theorem link
Context Training with Active Information Seeking
Pith reviewed 2026-05-15 06:03 UTC · model grok-4.3
The pith
Pairing search tools with multi-candidate pruning during context training produces consistent LLM gains without weight updates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Equipping context optimizers with Wikipedia search and browser tools for active information seeking, when combined with a search-based training procedure that maintains and prunes multiple candidate contexts, produces consistent and substantial performance gains on low-resource translation (Flores+), health scenarios (HealthBench), and reasoning-heavy tasks (LiveCodeBench and Humanity's Last Exam). The method is data-efficient, robust across hyperparameters, and produces textual contexts that generalize to different models.
What carries the argument
A search-based training procedure that maintains multiple candidate contexts and prunes them to incorporate active information from external tools.
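The abstract gives no pseudocode, but the described machinery resembles a beam search over contexts. The following toy sketch shows what such a maintain-and-prune loop with active information seeking could look like; every helper, passage, and scoring rule here is hypothetical and stands in for the paper's unspecified components.

```python
import heapq

# Toy stand-ins: every name, passage, and scoring rule below is hypothetical.
KNOWLEDGE = {
    "glossary": "hello -> bonjour, thanks -> merci",
    "grammar": "adjectives follow nouns in French",
}
NOISE = "unrelated trivia about volcanoes"

def search(query):
    """Pretend Wikipedia search: a passage if the query hits, noise otherwise."""
    return KNOWLEDGE.get(query, NOISE)

def propose_edits(ctx, passage):
    """Candidate contexts: keep the current one, or fold the passage in."""
    return [ctx, ctx + " | " + passage]

def score(ctx, dev_terms):
    """Proxy validation score: how many dev-set terms the context covers."""
    return sum(term in ctx for term in dev_terms)

def optimize_context(seed, queries, dev_terms, beam_width=2):
    """Beam-style loop: actively seek information, expand, then prune."""
    candidates = [seed]
    for query in queries:
        passage = search(query)  # active information seeking
        expanded = {e for c in candidates for e in propose_edits(c, passage)}
        # Pruning step: keep only the top-scoring candidate contexts.
        candidates = heapq.nlargest(beam_width, expanded,
                                    key=lambda c: score(c, dev_terms))
    return max(candidates, key=lambda c: score(c, dev_terms))

best = optimize_context(
    seed="Translate English to French.",
    queries=["glossary", "grammar", "missing"],  # last query returns noise
    dev_terms=["bonjour", "merci", "nouns"],
)
print(score(best, ["bonjour", "merci", "nouns"]))  # → 3: all useful passages retained
```

The pruning step is what keeps the noisy retrieval from round three out of the winning context's score, which is the abstract's claimed distinction from the naive sequential pipeline.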
If this is right
- Performance improves on low-resource translation benchmarks such as Flores+.
- Accuracy rises on health-related queries in HealthBench.
- Reasoning scores increase on LiveCodeBench and Humanity's Last Exam.
- Training requires relatively little data while remaining stable across hyperparameter choices.
- Learned contexts transfer effectively to models not seen during training.
Where Pith is reading between the lines
- The approach could reduce the need for repeated full-model retraining when new domain knowledge appears.
- Similar pruning logic might be applied to other external retrieval sources beyond Wikipedia.
- Real-time systems could use the same active-seeking loop to keep contexts current without human intervention.
- The method may complement existing retrieval-augmented generation pipelines by supplying higher-quality initial contexts.
Load-bearing premise
External search tools return sufficiently accurate and relevant passages, and the pruning step can reliably discard noisy contexts without removing useful ones.
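One way to probe that premise is by simulation: rank candidates by a noisy proxy of their true utility, prune, and count how often the genuinely best candidate is thrown away. Everything below is a hypothetical simulation, not an experiment from the paper.

```python
import random

def prune_false_negative_rate(n_trials=2000, pool=8, keep=4, noise=0.3, seed=0):
    """Fraction of trials in which pruning by a noisy proxy score
    discards the candidate with the highest true utility."""
    rng = random.Random(seed)
    false_negatives = 0
    for _ in range(n_trials):
        true_utility = [rng.random() for _ in range(pool)]
        proxy = [u + rng.gauss(0, noise) for u in true_utility]
        kept = sorted(range(pool), key=lambda i: proxy[i], reverse=True)[:keep]
        best = max(range(pool), key=lambda i: true_utility[i])
        false_negatives += best not in kept
    return false_negatives / n_trials

print(prune_false_negative_rate(noise=0.05))  # mild proxy noise: best rarely lost
print(prune_false_negative_rate(noise=1.0))   # heavy noise: pruning loses signal
```

The premise holds only in the low-noise regime: once proxy noise swamps the utility gaps between candidates, pruning discards the best context almost as often as random selection would.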
What would settle it
Running the same tasks with the pruning step disabled or with deliberately noisy search results and finding no gains relative to the closed-loop baseline would falsify the claim.
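The noisy-retrieval half of that test can be mocked up end to end: under deliberately noisy retrieval, a naive sequential pipeline that appends every retrieved passage degrades, while a maintain-and-prune variant filters the noise. All names, passages, and the scoring rule below are illustrative, not the paper's.

```python
import heapq

# Hypothetical dev set: passages alternate useful and distracting terms.
PASSAGES = ["merci", "volcano", "bonjour", "astrology", "nouns", "volcano"]
GOOD, BAD = {"merci", "bonjour", "nouns"}, {"volcano", "astrology"}

def score(ctx):
    """Useful terms covered minus distracting terms absorbed."""
    words = set(ctx.split())
    return len(words & GOOD) - len(words & BAD)

def sequential(seed):
    """Naive pipeline with tools: append every retrieved passage."""
    ctx = seed
    for p in PASSAGES:
        ctx += " " + p
    return ctx

def beam_prune(seed, beam_width=2):
    """Maintain multiple candidates; prune low scorers each round."""
    candidates = [seed]
    for p in PASSAGES:
        expanded = set(candidates) | {c + " " + p for c in candidates}
        candidates = heapq.nlargest(beam_width, expanded, key=score)
    return max(candidates, key=score)

print(score(sequential("ctx:")))  # → 1: noisy passages drag the score down
print(score(beam_prune("ctx:")))  # → 3: pruning filters them out
```

The falsification condition would be the opposite outcome: if the pruned variant showed no advantage under injected noise, the claimed role of pruning would not hold.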
Original abstract
Most existing large language models (LLMs) are expensive to adapt after deployment, especially when a task requires newly produced information or niche domain knowledge. Recent work has shown that, by manipulating and optimizing their context, LLMs can be tailored to downstream tasks without updating their weights. However, most existing methods remain closed-loop, relying solely on the model's intrinsic knowledge. In this paper, we equip these context optimizers with Wikipedia search and browser tools for active information seeking. We show that naively adding these tools to a standard sequential context optimization pipeline can actually degrade performance compared to baselines. However, when paired with a search-based training procedure that maintains and prunes multiple candidate contexts, active information seeking delivers consistent and substantial gains. We demonstrate these improvements across diverse domains, including low-resource translation (Flores+), health scenarios (HealthBench), and reasoning-heavy tasks (LiveCodeBench and Humanity's Last Exam). Furthermore, our method proves to be data-efficient, robust across different hyperparameters, and capable of generating effective textual contexts that generalize well across different models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes equipping context optimizers for LLMs with external Wikipedia search and browser tools to enable active information seeking. It claims that naively adding these tools to sequential optimization pipelines degrades performance relative to baselines, but pairing them with a search-based training procedure that maintains and prunes multiple candidate contexts produces consistent gains on low-resource translation (Flores+), health scenarios (HealthBench), and reasoning tasks (LiveCodeBench and Humanity's Last Exam). The method is further presented as data-efficient, robust to hyperparameter choices, and capable of producing contexts that generalize across models.
Significance. If the central result holds after addressing the noted gaps, the work would be significant for demonstrating how external tools combined with explicit multi-candidate pruning can enable effective, weight-free adaptation of LLMs to new or niche information, with potential implications for data-efficient deployment in dynamic domains.
major comments (2)
- [Training Procedure and Experiments] The headline claim that active information seeking yields gains only when paired with the multi-candidate pruning procedure rests on an unablated assumption. No experiment compares performance when all candidates are retained versus when the pruning rule is applied, leaving open the possibility that gains arise from maintaining multiples rather than from the pruning step itself.
- [Abstract and Results] The abstract asserts 'consistent and substantial gains' across four benchmarks, yet supplies no quantitative deltas, baseline details, statistical tests, or error analysis of false-negative discards during pruning. This absence makes it impossible to assess whether the pruning metric reliably separates signal from noise on health or reasoning tasks.
minor comments (1)
- [Abstract] The abstract states robustness across hyperparameters but does not enumerate the specific hyperparameters varied or the ranges tested.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments point-by-point below, and we will make the necessary revisions to strengthen the paper.
Point-by-point responses
-
Referee: [Training Procedure and Experiments] The headline claim that active information seeking yields gains only when paired with the multi-candidate pruning procedure rests on an unablated assumption. No experiment compares performance when all candidates are retained versus when the pruning rule is applied, leaving open the possibility that gains arise from maintaining multiples rather than from the pruning step itself.
Authors: We agree that an explicit ablation isolating the effect of the pruning rule from merely maintaining multiple candidates would provide clearer evidence for our claims. Our existing comparisons show that the naive tool addition (sequential optimization without multi-candidate maintenance) underperforms, while the full procedure with maintenance and pruning succeeds. To address this, we will add a new ablation experiment in the revised version that retains all candidates without pruning and compares it directly to the pruned version. revision: yes
-
Referee: [Abstract and Results] The abstract asserts 'consistent and substantial gains' across four benchmarks, yet supplies no quantitative deltas, baseline details, statistical tests, or error analysis of false-negative discards during pruning. This absence makes it impossible to assess whether the pruning metric reliably separates signal from noise on health or reasoning tasks.
Authors: We acknowledge the need for more quantitative detail and rigor in the abstract and results. We will update the abstract to include specific performance deltas (e.g., absolute and relative improvements on each benchmark) and baseline descriptions. In the results section, we will incorporate statistical tests such as significance testing across runs and an analysis of pruning errors, including false negative discards, to demonstrate the reliability of the pruning metric. These additions will be included in the main text or supplementary material as appropriate. revision: yes
Circularity Check
No significant circularity
Full rationale
The paper presents an empirical method: equipping context optimizers with external Wikipedia search and browser tools, then pairing them with a search-based training procedure that maintains and prunes multiple candidate contexts. The abstract and described claims contain no equations, fitted parameters, or derivations that reduce the reported gains to inputs by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear. Performance is evaluated on external benchmarks (Flores+, HealthBench, LiveCodeBench, Humanity's Last Exam), making the central claim dependent on those results rather than internal tautology. The pruning component is part of the proposed procedure, not derived from itself.