pith. machine review for the scientific record.

arxiv: 2605.06702 · v1 · submitted 2026-05-05 · 💻 cs.AI · cs.CL · cs.LG

Recognition: 2 theorem links


CASCADE: Case-Based Continual Adaptation for Large Language Models During Deployment

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:13 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG
keywords deployment-time learning · episodic memory · contextual bandit · LLM agents · continual adaptation · case-based reasoning · experience reuse · no-regret learning

The pith

LLM agents can learn from experience during deployment by building and querying an explicit episodic memory without changing their parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the rigid separation between training and deployment leaves LLMs unable to improve after release, unlike natural intelligence that keeps adapting through interaction. It introduces CASCADE as a framework that equips agents with a growing episodic memory of past cases and selects relevant ones for new tasks by treating selection as a contextual bandit problem. This setup supplies exploration-exploitation trade-offs along with no-regret guarantees, letting agents accumulate and refine task-relevant experiences into usable knowledge. The result is a 20.9 percent gain in macro-averaged success rate over zero-shot prompting across sixteen tasks in medicine, law, code, and other domains. A sympathetic reader would care because the work reframes deployment itself as an ongoing learning stage rather than a static endpoint.

Core claim

The central claim is that formalizing deployment-time learning as a distinct stage after training and fine-tuning, and equipping LLM agents with an explicit, evolving episodic memory whose case selection is cast as a contextual bandit problem, yields no-regret guarantees over long interactions, lets agents accumulate, select, and refine task-relevant cases, and raises macro-averaged success rates by 20.9 percent over zero-shot prompting while outperforming gradient-based and memory-based baselines on sixteen diverse tasks.

What carries the argument

An explicit evolving episodic memory whose case selection is formulated as a contextual bandit problem to balance exploration and exploitation while accumulating actionable knowledge.
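Read mechanically, the carrier is a selector over a growing case bank whose context is the query and whose reward is downstream task success. A minimal sketch, assuming fixed-dimension query embeddings and a disjoint LinUCB-style rule (the excerpt does not name the paper's exact bandit algorithm or reward definition):

```python
import numpy as np

class LinUCBCaseSelector:
    """Disjoint LinUCB over a growing case bank (illustrative sketch,
    not the paper's released implementation)."""

    def __init__(self, dim, alpha=1.0):
        self.dim, self.alpha = dim, alpha
        self.A = {}  # case_id -> d x d ridge matrix
        self.b = {}  # case_id -> reward-weighted context sum

    def _ensure(self, case_id):
        if case_id not in self.A:
            self.A[case_id] = np.eye(self.dim)
            self.b[case_id] = np.zeros(self.dim)

    def select(self, query_vec, case_ids):
        """Pick the case with the highest upper confidence bound for this query."""
        best, best_score = None, -np.inf
        for cid in case_ids:
            self._ensure(cid)
            A_inv = np.linalg.inv(self.A[cid])
            theta = A_inv @ self.b[cid]                        # ridge estimate of case quality
            bonus = self.alpha * np.sqrt(query_vec @ A_inv @ query_vec)
            if theta @ query_vec + bonus > best_score:         # exploit + explore
                best, best_score = cid, theta @ query_vec + bonus
        return best

    def update(self, case_id, query_vec, reward):
        """Reward is the observed task outcome after the LLM used the case."""
        self._ensure(case_id)
        self.A[case_id] += np.outer(query_vec, query_vec)
        self.b[case_id] += reward * query_vec
```

Retention would then append new cases to the bank after successful episodes, so the bandit's arm set grows with deployment.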

If this is right

  • Agents accumulate, select, and refine task-relevant cases from past interactions without parameter changes.
  • No-regret guarantees hold for long-term deployment interactions.
  • Macro-averaged success rate rises 20.9 percent over zero-shot prompting across sixteen tasks.
  • The approach outperforms both gradient-based and other memory-based baselines on medical, legal, coding, search, tool-use, and embodied tasks.
  • Deployment is reframed as a continual adaptive learning process rather than a fixed endpoint.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Memory systems of this kind could support personalized agents that retain user-specific interaction patterns across sessions.
  • The same case-selection logic might extend to teams of agents that share and query a joint memory store.
  • Longer real-world deployments would test whether the claimed no-regret property produces measurable gains beyond the reported sixteen-task suite.

Load-bearing premise

That casting experience reuse as a contextual bandit problem will actually deliver no-regret guarantees and convert accumulated cases into effective knowledge without any updates to the underlying model parameters.

What would settle it

A sequence of repeated interactions in which the agent's success rate stays flat at the zero-shot level, or in which average per-step regret fails to converge toward zero (equivalently, cumulative regret fails to stay sublinear) over time.
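That settling condition can be made operational: under no-regret learning, cumulative regret grows sublinearly, so average per-step regret should drift toward zero, and a curve that plateaus above zero is the falsifier. A minimal check (`best_rate` is a hypothetical stand-in for the best fixed case-selection policy's success rate, which a real deployment would have to estimate):

```python
def average_regret_curve(rewards, best_rate):
    """Average regret after each step: best_rate minus the running mean reward.
    No-regret learning predicts this sequence tends toward zero; a plateau
    above zero over long horizons would be the falsifying observation."""
    total, curve = 0.0, []
    for t, r in enumerate(rewards, start=1):
        total += r
        curve.append(best_rate - total / t)
    return curve
```

For rewards `[0, 1, 1, 1]` against `best_rate=1.0` the curve descends 1.0, 0.5, 0.33…, 0.25; a no-regret selector should keep descending as interactions accumulate.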

Figures

Figures reproduced from arXiv: 2605.06702 by Hechang Chen, Jun Wang, Siyuan Guo, Yali Du, Yi Chang.

Figure 1
Figure 1: The LLM Lifecycle. In the first stage, LLMs are pre-trained with next-token prediction tasks on a large-scale corpus. Then, LLMs are further finetuned using supervised finetuning (SFT) and reinforcement learning finetuning (RLFT) for alignment and to enhance reasoning capabilities. We consider deployment-time learning as the third stage, where LLMs learn from experience during deployment, enabling contin… view at source ↗
Figure 2
Figure 2: Overview of CASCADE. a, Given a query, CASCADE retrieves the case via the contextual bandit algorithm, reuses and revises it to generate the solution, and receives the reward. The retriever policy is updated accordingly, and successful cases are retained in the case bank. b, CASCADE exhibits the no-regret learning property: the coverage gap is controlled by the Retain step, while the retrieval regret is mi… view at source ↗
Figure 3
Figure 3: Main results on 12 single-turn tasks. All results are obtained using Qwen3-32B and are reported based on five different random seeds. a, Success rate improvement over the Zero-shot method during the deployment steps across different tasks. Solid lines represent mean values and the error bars are standard deviations. b, Table displaying the normalised scores (0-1 range) of all the methods across different tasks… view at source ↗
Figure 4
Figure 4: In-depth analyses on 12 single-turn tasks. view at source ↗
Figure 5
Figure 5: Results on embodied sequential decision-making tasks. view at source ↗
Figure 6
Figure 6: Results on two real-world tasks: web-based deep search and complex tabular… view at source ↗
read the original abstract

Large language models (LLMs) have become a central foundation of modern artificial intelligence, yet their lifecycle remains constrained by a rigid separation between training and deployment, after which learning effectively ceases. This limitation contrasts with natural intelligence, which continually adapts through interaction with its environment. In this paper, we formalise deployment-time learning (DTL) as the third stage in the LLM lifecycle that enables LLM agents to improve from experience during deployment without modifying model parameters. We present CASCADE (CASe-based Continual Adaptation during DEployment), a general and principled framework that equips LLM agents with an explicit, evolving episodic memory. CASCADE formulates experience reuse as a contextual bandit problem, enabling principled exploration-exploitation trade-offs and establishing no-regret guarantees over long-term interactions. This design allows agents to accumulate, select, and refine task-relevant cases, transforming past experience into actionable knowledge. Across 16 diverse tasks spanning medical diagnosis, legal analysis, code generation, web search, tool use, and embodied interaction, CASCADE improves macro-averaged success rate by 20.9% over zero-shot prompting while consistently outperforming gradient-based and memory-based baselines. By reframing deployment as an adaptive learning process, this work establishes a foundation for continually improving AI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CASCADE, a framework for deployment-time learning (DTL) in LLMs that equips agents with an evolving episodic memory. It formulates experience reuse as a contextual bandit problem to enable exploration-exploitation trade-offs and no-regret guarantees without parameter updates. The approach accumulates, selects, and refines task-relevant cases, with empirical evaluation across 16 tasks (medical diagnosis, legal analysis, code generation, web search, tool use, embodied interaction) showing a 20.9% macro-averaged success rate improvement over zero-shot prompting and consistent outperformance of gradient-based and memory-based baselines.

Significance. If the central claims hold, this work is significant for reframing LLM deployment as an adaptive process rather than a static endpoint. The parameter-free design via case-based memory and the scale of the 16-task evaluation are strengths that could influence practical agent systems. The attempt to import contextual bandit theory for principled long-term improvement is a clear contribution, though its applicability here requires careful validation.

major comments (2)
  1. [theoretical analysis section on contextual bandit formulation] Contextual bandit formulation (theoretical analysis section deriving no-regret guarantees): The claim that formulating case selection as a contextual bandit yields no-regret guarantees for the overall agent is not automatically supported. Standard bounds (e.g., for LinUCB) assume direct, observable rewards from the chosen arm, but here the reward is the stochastic success of the LLM-generated response after inserting the retrieved case; the bandit never observes the internal LLM computation. This indirect mapping means case-selection regret does not necessarily translate to performance guarantees for the agent, and a precise reduction or modified analysis is needed to support the assertion.
  2. [experimental evaluation and results] Experimental section (results on 16 tasks and baseline comparisons): The reported 20.9% macro-averaged gain and consistent outperformance are promising, but the manuscript must clarify controls for post-hoc task selection and whether the bandit algorithm's exploration is evaluated in a truly online, non-stationary deployment setting rather than offline replay. Without these, the empirical support for long-term knowledge accumulation remains incomplete.

minor comments (2)
  1. [abstract and introduction] The abstract and introduction should explicitly name the specific contextual bandit algorithm (e.g., LinUCB, Thompson sampling) and the exact reward definition used in the formulation.
  2. [framework description] Notation for the episodic memory and case retrieval process could be made more precise, including how cases are represented and updated over time.
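The reward structure at issue in major comment 1 can be made concrete: the bandit's action is the case choice, the LLM call sits inside the environment, and only a binary success signal flows back. A sketch under that reading (Thompson sampling here is illustrative, and `llm` and `judge` are hypothetical stand-ins for the agent's generator and the task's success check; neither is the paper's code):

```python
import random

def deployment_step(selector, case_ids, query, llm, judge):
    """One interaction: the bandit picks a case, the LLM answers with it,
    and only the 0/1 success signal is fed back as the bandit's reward."""
    case = selector.select(case_ids)
    answer = llm(query, case)            # internal LLM computation unobserved by the bandit
    reward = 1 if judge(query, answer) else 0
    selector.update(case, reward)
    return reward

class BetaSelector:
    """Thompson sampling with Beta(1, 1) priors over 0/1 success rewards."""

    def __init__(self, rng=None):
        self.rng = rng or random.Random()
        self.stats = {}  # case_id -> [alpha, beta]

    def select(self, case_ids):
        def draw(cid):
            a, b = self.stats.setdefault(cid, [1, 1])
            return self.rng.betavariate(a, b)
        return max(case_ids, key=draw)

    def update(self, case_id, reward):
        a, b = self.stats.setdefault(case_id, [1, 1])
        self.stats[case_id] = [a + reward, b + (1 - reward)]
```

The referee's point is that regret bounds for this selector constrain case choice, not the LLM call inside `llm`; the rebuttal's proposed reduction treats that call as part of the stochastic reward distribution.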

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. The comments on the theoretical grounding of the contextual bandit formulation and the need for clearer experimental controls are valuable and will help strengthen the manuscript. We address each point below, outlining the revisions we will make.

read point-by-point responses
  1. Referee: Contextual bandit formulation (theoretical analysis section deriving no-regret guarantees): The claim that formulating case selection as a contextual bandit yields no-regret guarantees for the overall agent is not automatically supported. Standard bounds (e.g., for LinUCB) assume direct, observable rewards from the chosen arm, but here the reward is the stochastic success of the LLM-generated response after inserting the retrieved case; the bandit never observes the internal LLM computation. This indirect mapping means case-selection regret does not necessarily translate to performance guarantees for the agent, and a precise reduction or modified analysis is needed to support the assertion.

    Authors: We appreciate this precise observation on the reward structure. In CASCADE, the contextual bandit treats case selection as the action, with the observed reward being the binary task success (0/1) after the LLM produces its response using the selected case. This reward is directly observable post-execution and follows the standard stochastic reward model in contextual bandits, where the distribution depends on context and arm but need not reveal internal mechanisms. The no-regret bound therefore applies to the case-selection policy relative to the optimal policy in hindsight, ensuring sublinear regret in cumulative reward (i.e., task successes) over long-term interactions. While LLM stochasticity means the bound does not yield a deterministic performance guarantee for every response, it does guarantee that the selection policy improves, which in turn drives the observed agent-level gains. We will revise the theoretical analysis section to include an explicit reduction: we map the problem to a standard contextual bandit instance by defining the reward as the observed success indicator, state the assumptions under which LinUCB-style bounds hold, and clarify that the guarantees concern regret of the bandit (not a direct bound on LLM internals). A new subsection will formalize this mapping. revision: yes

  2. Referee: Experimental section (results on 16 tasks and baseline comparisons): The reported 20.9% macro-averaged gain and consistent outperformance are promising, but the manuscript must clarify controls for post-hoc task selection and whether the bandit algorithm's exploration is evaluated in a truly online, non-stationary deployment setting rather than offline replay. Without these, the empirical support for long-term knowledge accumulation remains incomplete.

    Authors: We agree that explicit controls are required to substantiate the deployment-time claims. The current evaluation processes the 16 tasks sequentially in a single continuous stream, with the episodic memory and bandit updating after each interaction; task order is randomized across runs to induce non-stationarity, and exploration occurs online via the bandit algorithm at each step. No offline replay or post-hoc filtering of tasks is performed—all 16 tasks are included as predefined. To make this transparent, we will add a new subsection in the experimental evaluation that (i) details the online sequential protocol, (ii) confirms absence of post-hoc task selection, (iii) describes how non-stationarity is simulated through randomized ordering and evolving memory, and (iv) includes cumulative success-rate plots over the interaction sequence to visualize long-term accumulation. These additions will directly address the concern about empirical support for continual adaptation. revision: yes
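The protocol the authors commit to in this response can be pinned down in a few lines. A sketch of the online sequential loop as we read it (not released code; `act`, `check`, and `learn` are hypothetical interface names):

```python
import random

def run_deployment_stream(tasks, agent, seed=0):
    """Online protocol sketch: tasks arrive as one continuous stream in
    randomized order, and the agent's memory and selector update after
    every interaction, before the next task is seen."""
    rng = random.Random(seed)
    stream = list(tasks)
    rng.shuffle(stream)                    # randomized ordering induces non-stationarity
    rewards = []
    for task in stream:
        answer = agent.act(task)           # retrieve -> reuse/revise -> respond
        reward = task.check(answer)        # binary success observed online
        agent.learn(task, answer, reward)  # update case bank + bandit statistics
        rewards.append(reward)
    return rewards
```

The cumulative success-rate plots the rebuttal promises would be computed over exactly this reward sequence, with no offline replay.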

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper applies standard contextual bandit theory to experience reuse for no-regret guarantees, relying on external literature rather than self-referential definitions, fitted parameters renamed as predictions, or load-bearing self-citations. The central claims rest on empirical results across 16 tasks and the formalization of DTL, which does not reduce to its inputs by construction. No equations or steps in the provided text exhibit the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based on abstract only; the central claim rests on standard bandit no-regret properties and the assumption that episodic memory can be maintained and queried effectively by LLMs without parameter updates.

axioms (1)
  • domain assumption Contextual bandit formulation yields no-regret guarantees over long-term LLM agent interactions
    Abstract states this as enabling principled exploration-exploitation trade-offs.
invented entities (1)
  • Evolving episodic memory for LLMs · no independent evidence
    purpose: Stores and refines task-relevant cases for deployment-time adaptation
    New structure introduced to transform past experience into actionable knowledge without model changes.

pith-pipeline@v0.9.0 · 5528 in / 1202 out tokens · 37812 ms · 2026-05-11T01:13:04.064416+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

110 extracted references · 110 canonical work pages · 8 internal anchors

  1. [1]

    AlphaEvolve: A coding agent for scientific and algorithmic discovery

    Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131, 2025. URL https://arxiv.org/abs/2506.13131

  2. [2]

    Kolb-based experiential learning for generalist agents with human-level kaggle data science performance

    Haitham Bou-Ammar, Antoine Grosnit, Alexandre Maraval, Refinath SN, Zichao Zhao, James Doran, Giuseppe Paolo, Albert Thomas, Jonas Gonzalez, Abhineet Kumar, et al. Kolb-based experiential learning for generalist agents with human-level kaggle data science performance. URL https://doi.org/10.21203/rs.3.rs-7472642/v1

  4. [4]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025

  5. [5]

    Experience-dependent structural synaptic plasticity in the mammalian brain

    Anthony Holtmaat and Karel Svoboda. Experience-dependent structural synaptic plasticity in the mammalian brain. Nature Reviews Neuroscience, 10(9):647–658, 2009

  6. [6]

    Predictive processing: a canonical cortical computation

    Georg B Keller and Thomas D Mrsic-Flogel. Predictive processing: a canonical cortical computation. Neuron, 100(2):424–435, 2018

  7. [7]

    A survey on large language model based autonomous agents

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024

  8. [8]

    Reinforcement learning: An introduction, volume 1

    Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998

  9. [15]

    Reflexion: language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, volume 36, pages 8634–8652, 2023

  10. [18]

    GEPA: Reflective prompt evolution can outperform reinforcement learning

    Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alex Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. GEPA: Reflective prompt evolution can outperform reinforcement learning. In The Fourteenth Internationa...

  11. [21]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9

  12. [28]

    Alfworld: Aligning text and embodied environments for interactive learning

    Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=0IOX0YcCdTn

  13. [30]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=WE_vluYUL-X

  14. [31]

    Search-r1: Training LLMs to reason and leverage search engines with reinforcement learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan O Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training LLMs to reason and leverage search engines with reinforcement learning. In Second Conference on Language Modeling, 2025

  15. [33]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, 2020

  16. [34]

    Text Embeddings by Weakly-Supervised Contrastive Pre-training

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022

  17. [36]

    Hl7 fhir: An agile and restful approach to healthcare information exchange

    Duane Bender and Kamran Sartipi. Hl7 fhir: An agile and restful approach to healthcare information exchange. In Proceedings of the 26th IEEE international symposium on computer- based medical systems, pages 326–331. IEEE, 2013

  18. [38]

    Ehr-r1: A reasoning-enhanced foundational language model for electronic health record analysis

    Yusheng Liao, Chaoyi Wu, Junwei Liu, Shuyang Jiang, Pengcheng Qiu, Haowen Wang, Yun Yue, Shuai Zhen, Jian Wang, Qianrui Fan, et al. Ehr-r1: A reasoning-enhanced foundational language model for electronic health record analysis. arXiv preprint arXiv:2510.25628, 2025

  19. [43]

    Retrieval, reuse, revision and retention in case-based reasoning

    Ramon Lopez De Mantaras, David Mcsherry, Derek Bridge, David Leake, Barry Smyth, Susan Craw, Boi Faltings, Mary Lou Maher, Michael T Cox, Kenneth Forbus, et al. Retrieval, reuse, revision and retention in case-based reasoning. Knowledge Engineering Review, 20(3):215–240, 2005

  20. [47]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  21. [48]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019

  22. [49]

    Neural contextual bandits with ucb-based exploration

    Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural contextual bandits with ucb-based exploration. In International conference on machine learning, pages 11492–11502. PMLR, 2020

  26. [53]

    Welcome to the era of experience

    David Silver and Richard S. Sutton. Welcome to the era of experience, 2025. URL https://storage.googleapis.com/deepmind-media/Era-of-Experience%20/The%20Era%20of%20Experience%20Paper.pdf

  27. [54]

    The landscape of agentic reinforcement learning for llms: A survey, 2025

    Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhongzhi Li, Xiangyuan Xue, Yijiang Li, Yifan Zhou, Yang Chen, Chen Zhang, Yutao Fan, Zihu Wang, Songtao Huang, Francisco Piedrahita-Velez, Yue Liao, Hongru Wang, Mengyue Yang, Heng Ji, Jun Wang, Shuicheng Yan, Philip Torr, and Lei Bai. The landscape of agentic reinfor...

  28. [55]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  29. [56]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  30. [57]

    Group-in-group policy optimization for llm agent training

    Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  31. [58]

    Agent learning via early experience

    Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, et al. Agent learning via early experience. arXiv preprint arXiv:2510.08558, 2025

  32. [59]

    Gem: A gym for agentic llms

    Zichen Liu, Anya Sims, Keyu Duan, Changyu Chen, Simon Yu, Xiangxin Zhou, Haotian Xu, Shaopan Xiong, Bo Liu, Chenmien Tan, et al. Gem: A gym for agentic llms. arXiv preprint arXiv:2510.01051, 2025

  33. [60]

    A survey of context engineering for large language models, 2025

    Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, and Shenghua Liu. A survey of context engineering for large language models, 2025

  34. [61]

    Optimizing generative ai by backpropagating language model feedback

    Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Pan Lu, Zhi Huang, Carlos Guestrin, and James Zou. Optimizing generative ai by backpropagating language model feedback. Nature, 639(8055):609–616, 2025

  35. [62]

    Feedback descent: Open-ended text optimization via pairwise comparison

    Yoonho Lee, Joseph Boen, and Chelsea Finn. Feedback descent: Open-ended text optimization via pairwise comparison. arXiv preprint arXiv:2511.07919, 2025

  36. [63]

    DSPy: Compiling declarative language model calls into state-of-the-art pipelines

    Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan A, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling declarative language model calls into state-of-the-art pipelines. In The Twelfth International Conference on Learning Represen...

  37. [64]

    GEPA: Reflective prompt evolution can outperform reinforcement learning

    Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alex Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. GEPA: Reflective prompt evolution can outperform reinforcement learning. In The Fourteenth Internationa...

  38. [65]

    Llms are in-context bandit reinforcement learners

    Giovanni Monea, Antoine Bosselut, Kianté Brantley, and Yoav Artzi. Llms are in-context bandit reinforcement learners. In Second Conference on Language Modeling , 2025. URL https://openreview.net/forum?id=c0RsezY2D1

  39. [66]

    Agent kb: Leveraging cross-domain experience for agentic problem solving

    Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, et al. Agent kb: Leveraging cross-domain experience for agentic problem solving. arXiv preprint arXiv:2507.06229, 2025

  40. [67]

    Memento: Fine-tuning LLM agents without fine-tuning LLMs

    Huichi Zhou, Yihang Chen, Siyuan Guo, Xue Yan, Kin Hei Lee, Zihan Wang, Ka Yiu Lee, Guchun Zhang, Kun Shao, Linyi Yang, and Jun Wang. Memento: Fine-tuning llm agents without fine-tuning llms. arXiv preprint arXiv:2508.16153, 2025. URL https://arxiv.org/abs/2508.16153

  41. [68]

    Flex: Continuous agent evolution via forward learning from experience

    Zhicheng Cai, Xinyuan Guo, Yu Pei, JiangTao Feng, Jiangjie Chen, Ya-Qin Zhang, Wei-Ying Ma, Mingxuan Wang, and Hao Zhou. Flex: Continuous agent evolution via forward learning from experience. arXiv preprint arXiv:2511.06449, 2025

  42. [69]

    Dynamic cheatsheet: Test-time learning with adaptive memory

    Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, and James Zou. Dynamic cheatsheet: Test-time learning with adaptive memory. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7080–7106, 2026

  43. [70]

    Agentic context engineering: Learning comprehensive contexts for self-improving language models

    Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, et al. Agentic context engineering: Learning comprehensive contexts for self-improving language models. In The Fourteenth International Conference on Learning Representations, 2026

  44. [71]

    An introduction to case-based reasoning

    Janet L Kolodner. An introduction to case-based reasoning. Artificial intelligence review, 6(1): 3–34, 1992

  45. [72]

    Case-based reasoning: A review

    Ian Watson and Farhi Marir. Case-based reasoning: A review. The knowledge engineering review, 9(4):327–354, 1994

  46. [73]

    Case-based reasoning: Foundational issues, methodological variations, and system approaches

    Agnar Aamodt and Enric Plaza. Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI communications, 7(1):39–59, 1994

  47. [74]

    Case-based reasoning meets large language models: A research manifesto for open challenges

    Kerstin Bach, Ralph Bergmann, Florian Brand, Marta Caro-Martínez, Viktor Eisenstadt, Michael W. Floyd, Lasal Jayawardena, David Leake, Mirko Lenz, Lukas Malburg, David H. Ménager, Mirjam Minor, Brian Schack, Ian Watson, Kaitlynne Wilkerson, and Nirmalie Wiratunga. Case-based reasoning meets large language models: A research manifesto for open challenges ...

  48. [75]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  49. [76]

    A survey on in-context learning

    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, et al. A survey on in-context learning. In Proceedings of the 2024 conference on empirical methods in natural language processing, pages 1107–1128, 2024

  50. [77]

    Improving language models by retrieving from trillions of tokens

    Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. In International conference on machine learning, pages 2206–2240. PMLR, 2022

  51. [78]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2(1), 2023

  52. [79]

    Ds-agent: Automated data science by empowering large language models with case-based reasoning

    Siyuan Guo, Cheng Deng, Ying Wen, Hechang Chen, Yi Chang, and Jun Wang. Ds-agent: Automated data science by empowering large language models with case-based reasoning. In International Conference on Machine Learning, pages 16813–16848. PMLR, 2024

  53. [80]

    Optimizing case-based reasoning system for functional test script generation with large language models

    Siyuan Guo, Huiwu Liu, Xiaolong Chen, Yuming Xie, Liang Zhang, Tao Han, Hechang Chen, Yi Chang, and Jun Wang. Optimizing case-based reasoning system for functional test script generation with large language models. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, pages 4487–4498, 2025

  54. [81]

    Case-based reasoning enhances the predictive power of llms in drug-drug interaction

    Guangyi Liu, Yongqi Zhang, Xunyuan Liu, and Quanming Yao. Case-based reasoning enhances the predictive power of llms in drug-drug interaction. arXiv preprint arXiv:2505.23034, 2025

  55. [82]

    Memento-ii: Learning by stateful reflective memory

    Jun Wang. Memento-ii: Learning by stateful reflective memory. arXiv preprint arXiv:2512.22716, 2025

  56. [83]

    Memento-skills: Let agents design agents

    Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong, Bowen Zhao, Zhixun Chen, Menglong Zhang, Yihang Chen, Jinsong Li, et al. Memento-skills: Let agents design agents. arXiv preprint arXiv:2603.18743, 2026

  57. [84]

    Gaia: a benchmark for general ai assistants

    Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. In The Twelfth International Conference on Learning Representations, 2023

  58. [85]

    A contextual-bandit approach to personalized news article recommendation

    Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670, 2010

  59. [86]

    Scalable neural contextual bandit for recommender systems

    Zheqing Zhu and Benjamin Van Roy. Scalable neural contextual bandit for recommender systems. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 3636–3646, 2023

  60. [87]

    Neural contextual bandits for personalized recommendation

    Yikun Ban, Yunzhe Qi, and Jingrui He. Neural contextual bandits for personalized recommendation. In Companion Proceedings of the ACM Web Conference 2024, pages 1246–1249, 2024

  61. [88]

    Use your instinct: Instruction optimization for llms using neural bandits coupled with transformers

    Xiaoqiang Lin, Zhaoxuan Wu, Zhongxiang Dai, Wenyang Hu, Yao Shu, See-Kiong Ng, Patrick Jaillet, and Bryan Kian Hsiang Low. Use your instinct: Instruction optimization for llms using neural bandits coupled with transformers. In International Conference on Machine Learning, pages 30317–30345. PMLR, 2024

  62. [89]

    Neural contextual bandits with ucb-based exploration

    Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural contextual bandits with ucb-based exploration. In International conference on machine learning, pages 11492–11502. PMLR, 2020

  63. [90]

    Prompt optimization with ease? efficient ordering-aware automated selection of exemplars

    Zhaoxuan Wu, Xiaoqiang Lin, Zhongxiang Dai, Wenyang Hu, Yao Shu, See-Kiong Ng, Patrick Jaillet, and Bryan Kian Hsiang Low. Prompt optimization with ease? efficient ordering-aware automated selection of exemplars. Advances in Neural Information Processing Systems, 37:122706–122740, 2024

  64. [91]

    Adaptive llm routing under budget constraints

    Pranoy Panda, Raghav Magazine, Chaitanya Devaguptapu, Sho Takemori, and Vishal Sharma. Adaptive llm routing under budget constraints. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 23934–23949, 2025

  65. [92]

    Online multi-llm selection via contextual bandits under unstructured context evolution

    Manhin Poon, XiangXiang Dai, Xutong Liu, Fang Kong, John Lui, and Jinhang Zuo. Online multi-llm selection via contextual bandits under unstructured context evolution. arXiv preprint arXiv:2506.17670, 2025

  66. [93]

    Ddxplus: A new dataset for automatic medical diagnosis

    Arsene Fansi Tchango, Rishab Goel, Zhi Wen, Julien Martel, and Joumana Ghosn. Ddxplus: A new dataset for automatic medical diagnosis. Advances in neural information processing systems, 35:31306–31318, 2022

  67. [94]

    Streambench: Towards benchmarking continuous improvement of language agents

    Cheng-Kuang Wu, Zhi Rui Tam, Chieh-Yen Lin, Yun-Nung Vivian Chen, and Hung-yi Lee. Streambench: Towards benchmarking continuous improvement of language agents. Advances in Neural Information Processing Systems, 37:107039–107063, 2024

  68. [95]

    Mimic-iv, a freely accessible electronic health record dataset

    Alistair EW Johnson, Lucas Bulgarelli, Lu Shen, Alvin Gayles, Ayad Shammout, Steven Horng, Tom J Pollard, Sicheng Hao, Benjamin Moody, Brian Gow, et al. Mimic-iv, a freely accessible electronic health record dataset. Scientific data, 10(1):1, 2023

  69. [96]

    Large language model distilling medication recommendation model

    Qidong Liu, Xian Wu, Xiangyu Zhao, Yuanshao Zhu, Zijian Zhang, Feng Tian, and Yefeng Zheng. Large language model distilling medication recommendation model. arXiv preprint arXiv:2402.02803, 2024

  70. [97]

    Evaluating large language model workflows in clinical decision support for triage and referral and diagnosis

    Farieda Gaber, Maqsood Shaik, Fabio Allega, Agnes Julia Bilecz, Felix Busch, Kelsey Goon, Vedran Franke, and Altuna Akalin. Evaluating large language model workflows in clinical decision support for triage and referral and diagnosis. npj Digital Medicine, 8(1):263, 2025

  71. [98]

    Emergency severity index (esi): a triage tool for emergency department care, version 4

    Nicki Gilboy, Paula Tanabe, Debbie Travers, Alexander M Rosenau, et al. Emergency severity index (esi): a triage tool for emergency department care, version 4. Implementation handbook, 2012:12–0014, 2012

  72. [99]

    Through the mud: A multi-defendant charge prediction benchmark with linked crime elements

    Xiao Wei, Qi Xu, Hang Yu, Qian Liu, and Erik Cambria. Through the mud: A multi-defendant charge prediction benchmark with linked crime elements. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2864–2878, 2024

  73. [100]

    Cmdl: A large-scale chinese multi-defendant legal judgment prediction dataset

    Wanhong Huang, Yi Feng, Chuanyi Li, Honghan Wu, Jidong Ge, and Vincent Ng. Cmdl: A large-scale chinese multi-defendant legal judgment prediction dataset. In Findings of the Association for Computational Linguistics: ACL 2024, pages 5895–5906. Association for Computational Linguistics, 2024

  74. [101]

    Efficient intent detection with dual sentence encoders

    Iñigo Casanueva, Tadas Temčinas, Daniela Gerz, Matthew Henderson, and Ivan Vulić. Efficient intent detection with dual sentence encoders. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, pages 38–45, 2020

  75. [102]

    Sentfin 1.0: Entity-aware sentiment analysis for financial news

    Ankur Sinha, Satishwar Kedas, Rishu Kumar, and Pekka Malo. Sentfin 1.0: Entity-aware sentiment analysis for financial news. Journal of the Association for Information Science and Technology, 73(9):1314–1335, 2022

  76. [103]

    Logreasoner: Empowering llms with expert-like coarse-to-fine reasoning for log analysis tasks

    Lipeng Ma, Yixuan Li, Weidong Yang, Mingjie Zhou, Xinyi Liu, Ben Fei, Shuhao Li, Xiaoyan Sun, Sihang Jiang, and Yanghua Xiao. Logreasoner: Empowering llms with expert-like coarse-to-fine reasoning for log analysis tasks. arXiv preprint arXiv:2509.20798, 2025

  77. [104]

    Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task

    Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3911–3921, 2018

  78. [105]

    Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls

    Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. Advances in Neural Information Processing Systems, 36:42330–42357, 2023

  79. [106]

    Alfworld: Aligning text and embodied environments for interactive learning

    Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Cote, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=0IOX0YcCdTn

  80. [107]

    Scienceworld: Is your agent smarter than a 5th grader?

    Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, and Prithviraj Ammanabrolu. Scienceworld: Is your agent smarter than a 5th grader? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11279–11298, 2022

Showing first 80 references.