AdaMEM: Test-Time Adaptive Memory for Language Agents

Ali Payani; Lu Wang; Yiheng Li; Yunxiang Zhang

arxiv: 2606.05684 · v1 · pith:BGTU6CJPnew · submitted 2026-06-04 · 💻 cs.AI

AdaMEM: Test-Time Adaptive Memory for Language Agents

Yunxiang Zhang , Yiheng Li , Ali Payani , Lu Wang This is my paper

Pith reviewed 2026-06-28 01:34 UTC · model grok-4.3

classification 💻 cs.AI

keywords language agentstest-time adaptationagentic memoryhybrid memoryALFWorldWebShopHotpotQA

0 comments

The pith

Language agents adapt at test time by generating short-term strategy memory dynamically from stored trajectories, without any parameter updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the rigidity of static memory retrieval in language agents, which becomes misaligned during long-horizon tasks. AdaMEM keeps a fixed long-term memory of raw trajectories collected offline and generates fresh short-term strategy memory on the fly to steer each decision. This hybrid setup lets agents trade token cost against adaptability at inference time. Experiments show consistent gains over static baselines across ALFWorld, WebShop, and HotpotQA, plus further lifts from the STEP-MFT fine-tuning step. The approach opens a scaling route for post-deployment agent evolution.

Core claim

AdaMEM maintains a long-term trajectory memory of offline experiences and, at each step, synthesizes a short-term strategy memory from retrieved items to guide the current action, thereby achieving test-time adaptation solely through memory operations.

What carries the argument

Hybrid memory architecture that pairs static long-term trajectory memory with on-the-fly generated short-term strategy memory.

If this is right

Agents reach up to 13 percent relative improvement on ALFWorld and 11 percent on WebShop over static baselines.
Performance leadership extends to agentic search on HotpotQA.
STEP-MFT training of the policy to synthesize strategies yields additional gains.
The method supports continuous reasoning and self-evolution after deployment with no online parameter changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same memory-generation loop could be applied to other sequential decision domains where environment feedback arrives only after multiple steps.
Token budgets could be allocated dynamically by deciding how many strategies to generate per step based on observed task complexity.
The separation of long-term storage and short-term synthesis may reduce the frequency of full model retraining in production agent systems.

Load-bearing premise

Dynamically synthesized short-term strategies drawn from long-term experiences will stay aligned with task needs and improve decisions throughout an unfolding episode.

What would settle it

Running the same long-horizon tasks with AdaMEM versus a static memory baseline and finding no accuracy gain or a net loss on ALFWorld or WebShop.

Figures

Figures reproduced from arXiv: 2606.05684 by Ali Payani, Lu Wang, Yiheng Li, Yunxiang Zhang.

**Figure 1.** Figure 1: Overview of ADAMEM. Instead of relying on a static, episode-level strategy, the agent adapts to the current decision step by querying a long-term trajectory memory of raw experiences (1-2) and synthesizing them into a dynamic short-term strategy memory tailored to the current state (3). Conditioned on this testtime strategy, the agent adapts its next action (4) without requiring parameter updates. 1. Intr… view at source ↗

**Figure 2.** Figure 2: Comparison of test-time agent memory mechanisms. ReAct operates without external memory. Synapse and ReasoningBank employ static initialization, retrieving a trajectory or strategy (Z0) only at the episode start (S0), which forces the agent to rely on fixed guidance throughout the long-horizon task. In contrast, ADAMEM enables test-time adaptation via dynamic memory retrieval and synthesis. ADAMEM-LOW bala… view at source ↗

**Figure 3.** Figure 3: Overview of the STEP-MFT framework. We employ dual-filter rejection sampling to curate high-utility strategies for supervised fine-tuning (SFT). The process retains only successful trajectories where the strategy actually changes the proposed action (At ̸= A ′ t , green), while discarding redundant instances where the memory-free baseline yields the same action (At = A ′ t , blue). D = {(st, Eret, zt, at, … view at source ↗

**Figure 4.** Figure 4: Scalability with retrieval budget k on ALFWorld seen split (Qwen3-4B-Instruct, on-policy). Shaded bands show ±std over 3 runs. ADAMEM scales monotonically with k, while Synapse degrades as injecting additional raw trajectories into the context leads to context overflow and increased interference from irrelevant content. respectively. We attribute this deficit to their rigid retrieval timing at episode init… view at source ↗

**Figure 6.** Figure 6: Effectiveness vs. Efficiency trade-off. Following the on-policy long-term memory setup in [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Example of a successful trajectory produced by ADAMEM-LOW agent. For this ALFWorld task, the agent recovers from both information misalignment and physical constraints through step-wise strategy memory updates. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

**Figure 8.** Figure 8: Failure case study on ALFWorld. The agent attempts the task “put a clean butterknife in countertop”. After an exhaustive but unsuccessful search of all cabinets and drawers, the agent suffers from strategy inertia: it incorrectly assesses that the current strategy remains valid and that the failure is merely due to incomplete exploration. Consequently, it declines to refresh the strategy (drefresh = no),… view at source ↗

read the original abstract

A central challenge for language agents is utilizing past experience to adapt to dynamic test-time conditions. While recent work demonstrates the promise of agentic memory mechanisms, most systems restrict retrieval to episode initiation. Consequently, agents are forced to rely on static guidance that becomes increasingly misaligned as long-horizon tasks unfold. To address this rigidity, we propose the Adaptive Memory Agent (AdaMEM), a novel framework for agent test-time adaptation. Without updating model parameters online, AdaMEM adapts agent behavior via a hybrid memory architecture: it maintains a long-term trajectory memory of raw experiences collected offline while generating dynamic short-term strategy memory on-the-fly to guide decision-making. This mechanism enables the trade-off between token efficiency and adaptability across varying inference-time compute levels. Empirically, AdaMEM significantly outperforms static memory baselines, achieving relative gains of up to 13% on ALFWorld and 11% on WebShop, with consistent leading performance extending to agentic search on HotpotQA. To further enhance this adaptation, we develop STEP-MFT, a Step-wise Memory Fine-Tuning technique that trains the policy to synthesize high-quality strategies from retrieved experiences, yielding additional performance gains. Our work establishes a new scaling dimension for agentic memory, supporting continuous reasoning and self-evolution post-deployment in real-world environments. Our code is available at https://github.com/yunx-z/AdaMEM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AdaMEM gives agents a hybrid memory that pulls long-term trajectories into on-the-fly short-term strategies at test time, with reported gains on ALFWorld and WebShop, but the evaluation leaves open whether those gains come from the memory design or from extra LLM calls.

read the letter

The core idea is straightforward: keep raw past trajectories in long-term storage and synthesize fresh strategy notes during the episode instead of freezing guidance at the start. That setup plus the STEP-MFT fine-tuning step is the actual new piece; most prior agent memory work either stays static or updates parameters online. The paper shows the approach on three standard benchmarks and releases code, which is useful for anyone trying to make agents handle longer horizons without retraining.

The main soft spot is the comparison itself. The abstract flags a token-efficiency trade-off, yet the static baselines appear to receive only initial retrieval while AdaMEM keeps generating at each step. Without matched total forward passes or token budgets, it is hard to tell how much of the 13 % and 11 % relative lifts trace to the hybrid memory versus simply giving the agent more inference compute. The abstract also gives no error bars, ablation on the generation frequency, or details on how misalignment between long-term and short-term memory is handled, so the empirical claims stay provisional.

This is the kind of paper that belongs in a reading group focused on agent memory and test-time adaptation; the direction is real even if the current evidence is thin. A serious editor should send it to review rather than desk-reject, mainly to get the authors to run the compute-controlled experiments and tighten the method section. I would not cite it yet.

Referee Report

2 major / 2 minor

Summary. The paper proposes AdaMEM, a hybrid memory framework for language agents that maintains offline long-term trajectory memory while generating short-term strategy memory on-the-fly at test time, without online parameter updates. It claims this enables better adaptation to dynamic conditions in long-horizon tasks, with reported relative gains of up to 13% on ALFWorld and 11% on WebShop over static memory baselines, consistent leading results on HotpotQA agentic search, and further gains from the introduced STEP-MFT fine-tuning method. The work positions this as a new scaling dimension for post-deployment agent self-evolution and releases code.

Significance. If the empirical gains can be cleanly attributed to the adaptive hybrid memory rather than uncontrolled differences in inference compute, the approach would represent a meaningful advance in test-time adaptation for agents, supporting continuous reasoning without retraining. The public code release aids reproducibility and allows direct verification of the claimed trade-off between token efficiency and adaptability.

major comments (2)

[Experiments] Experiments section: The headline performance claims (up to 13% relative gain on ALFWorld, 11% on WebShop) compare AdaMEM against static memory baselines, but the setup does not equalize total LLM calls or tokens. AdaMEM's on-the-fly short-term strategy generation requires repeated LLM invocations per decision step, while static baselines use only initial retrieval; without this control, the delta cannot be attributed to the hybrid architecture versus extra test-time compute, undermining the central empirical claim.
[Method] Method and Abstract: The core assumption that on-the-fly short-term strategy memory generated from long-term trajectories will reliably improve decision-making lacks discussion of safeguards against misalignment or error propagation in long-horizon tasks; this is load-bearing for the adaptation claims but receives no analysis or ablation.

minor comments (2)

[Abstract] Abstract: Performance numbers are stated without any mention of number of runs, variance, or specific baselines, making initial soundness assessment difficult even before full experimental details.
[Abstract] The trade-off between token efficiency and adaptability is noted but not quantified with concrete token or latency measurements across the reported tasks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, indicating the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: [Experiments] Experiments section: The headline performance claims (up to 13% relative gain on ALFWorld, 11% on WebShop) compare AdaMEM against static memory baselines, but the setup does not equalize total LLM calls or tokens. AdaMEM's on-the-fly short-term strategy generation requires repeated LLM invocations per decision step, while static baselines use only initial retrieval; without this control, the delta cannot be attributed to the hybrid architecture versus extra test-time compute, undermining the central empirical claim.

Authors: We agree that the current experiments do not explicitly control for total LLM calls or tokens, and that AdaMEM's repeated invocations for short-term strategy generation constitute additional test-time compute. This is an inherent feature of the proposed adaptation mechanism rather than an uncontrolled variable. To address the concern directly, we will revise the experiments section to report average token usage and LLM call counts across methods, add compute-matched ablations (e.g., limiting strategy generation frequency or comparing against static baselines augmented with equivalent extra retrieval steps), and qualify the performance claims accordingly while preserving the core comparison to standard static baselines. revision: yes
Referee: [Method] Method and Abstract: The core assumption that on-the-fly short-term strategy memory generated from long-term trajectories will reliably improve decision-making lacks discussion of safeguards against misalignment or error propagation in long-horizon tasks; this is load-bearing for the adaptation claims but receives no analysis or ablation.

Authors: The referee is correct that the manuscript provides no explicit analysis or ablation of misalignment or error propagation risks in the generated short-term strategies. We will add a new subsection discussing these issues, including potential safeguards such as cross-verification of generated strategies against the long-term memory or confidence thresholding, and include a targeted ablation examining performance degradation under simulated noisy or misaligned memory retrieval. revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical claims

full rationale

The paper advances an empirical agent framework (AdaMEM with hybrid long-term/short-term memory and optional STEP-MFT) whose central claims are relative performance gains on ALFWorld, WebShop, and HotpotQA versus static baselines. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described content. All reported results rest on external benchmark comparisons rather than any self-referential construction, satisfying the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review based on abstract only; full methods, assumptions, and prior literature details unavailable.

axioms (1)

domain assumption Language agents can synthesize effective short-term strategies from retrieved long-term experiences at inference time.
Central premise enabling the hybrid memory mechanism.

pith-pipeline@v0.9.1-grok · 5778 in / 1084 out tokens · 23314 ms · 2026-06-28T01:34:46.752751+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 2 canonical work pages · 1 internal anchor

[1]

URL https://openreview.net/forum? id=nZeVKeeFYf9. Hu, Y ., Liu, S., Yue, Y ., Zhang, G., Liu, B., Zhu, F., Lin, J., Guo, H., Dou, S., Xi, Z., Jin, S., Tan, J., Yin, Y ., Liu, J., Zhang, Z., Sun, Z., Zhu, Y ., Sun, H., Peng, B., Cheng, Z., Fan, X., Guo, J., Yu, X., Zhou, Z., Hu, Z., Huo, J., Wang, J., Niu, Y ., Wang, Y ., Yin, Z., Hu, X., Liao, Y ., Li, Q....

Pith/arXiv arXiv 2026
[2]

On the Reliability of Large Language Models for Causal Discovery

Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long

work page doi:10.18653/v1/2025.acl-long 2025
[3]

acl-long.1464/

URL https://aclanthology.org/2025. acl-long.1464/. Jansen, P., C ˆot´e, M.-A., Khot, T., Bransom, E., Mishra, B. D., Majumder, B. P., Tafjord, O., and Clark, P. Discoveryworld: A virtual environment for developing and evaluating automated scientific discovery agents. InThe Thirty-eight Conference on Neural Information Processing Systems Datasets and Bench...

2025
[4]

Jia, Z., Li, J., Kang, Y ., Wang, Y ., Wu, T., Wang, Q., Wang, X., Zhang, S., Shen, J., Li, Q., Qi, S., Liang, Y ., He, D., Zheng, Z., and Zhu, S.-C

URL https://openreview.net/forum? id=cDYqckEt6d. Jia, Z., Li, J., Kang, Y ., Wang, Y ., Wu, T., Wang, Q., Wang, X., Zhang, S., Shen, J., Li, Q., Qi, S., Liang, Y ., He, D., Zheng, Z., and Zhu, S.-C. The ai hippocampus: How far are we from human memory?, 2026. URL https: //arxiv.org/abs/2601.09113. 10 Adaptive Memory Agent Jiang, Y ., Jiang, L., Teney, D.,...

arXiv 2026
[7]

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y

URL https://openreview.net/forum? id=oVKEAFjEqv. Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y . K., Wu, Y ., and Guo, D. Deepseekmath: Pushing the limits of mathemat- ical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300. Shridhar, M., Yuan, X., Cˆot´e, M., Bisk, Y ., Trischler, A., and Ha...

Pith/arXiv arXiv 2024
[9]

Wang, P., Li, L., Shao, Z., Xu, R

URL https://openreview.net/forum? id=ehfRiF0R3a. Wang, P., Li, L., Shao, Z., Xu, R. X., Dai, D., Li, Y ., Chen, D., Wu, Y ., and Sui, Z. Math-shepherd: Ver- ify and reinforce llms step-by-step without human an- notations, 2024b. URL https://arxiv.org/abs/ 2312.08935. Wang, Z. Z., Mao, J., Fried, D., and Neubig, G. Agent workflow memory. InForty-second Int...

Pith/arXiv arXiv 2025
[11]

Ye, S., Yu, C., Ke, K., Xu, C., and Wei, Y

URL https://openreview.net/forum? id=WE_vluYUL-X. Ye, S., Yu, C., Ke, K., Xu, C., and Wei, Y . H2r: Hierarchical hindsight reflection for multi-task llm agents, 2025. URL https://arxiv.org/abs/2509.12810. Yu, Y ., Yao, L., Xie, Y ., Tan, Q., Feng, J., Li, Y ., and Wu, L. Agentic memory: Learning unified long-term and short-term memory management for large...

arXiv 2025
[12]

Scaling Relationship on Learning Mathematical Reasoning with Large Language Models

doi: 10.48550/ARXIV .2308.01825. URL https: //doi.org/10.48550/arXiv.2308.01825. Zhang, S., Wang, J., Zhou, R., Liao, J., Feng, Y ., Zhang, W., Wen, Y ., Li, Z., Xiong, F., Qi, Y ., Tang, B., and Wen, M. Memrl: Self-evolving agents via runtime reinforcement learning on episodic memory, 2026a. URL https:// arxiv.org/abs/2601.03192. Zhang, Y ., Khalifa, M.,...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2026
[13]

Generate a **proposed action** based on the *current* strategy
[14]

success” or “failure

Decide if a refresh is strictly necessary based on the criteria above. **Output format:** <think>...reasoning...</think> <action>...your proposed action...</action> <refresh decision>yes or no</refresh decision> Prompt for the ADAMEM Agent: Strategy Mem- ory Synthesis You need to synthesize contextualized, non-generic strategies from retrieved past experi...
[15]

go to cabinet 1 open cabinet 1

Systematically explorecabinets and drawers inorder (starting from cabinet1) to locate the butterknife,opening closed containersas needed. go to cabinet 1 open cabinet 1 ... go to cabinet 13 open cabinet 13 Agent Thought: [...] I have already completed the systematic exploration ofcabinets 1-17 and drawers 1-7. The next logical step would be tocontinue exp...
[16]

Once found, pick up thebutterknife and clean it usingthe sinkbasin
[17]

put a clean butterknife in countertop

After cleaning, move thebutterknife to a countertop(e.g., countertop 1, 2, 3, or 4)to fulﬁll the goal. Figure 8.Failure case study on ALFWorld. The agent attempts the task “put a clean butterknife in countertop”. After an exhaustive but unsuccessful search of all cabinets and drawers, the agent suf- fers fromstrategy inertia: it incorrectly assesses that ...

[1] [1]

URL https://openreview.net/forum? id=nZeVKeeFYf9. Hu, Y ., Liu, S., Yue, Y ., Zhang, G., Liu, B., Zhu, F., Lin, J., Guo, H., Dou, S., Xi, Z., Jin, S., Tan, J., Yin, Y ., Liu, J., Zhang, Z., Sun, Z., Zhu, Y ., Sun, H., Peng, B., Cheng, Z., Fan, X., Guo, J., Yu, X., Zhou, Z., Hu, Z., Huo, J., Wang, J., Niu, Y ., Wang, Y ., Yin, Z., Hu, X., Liao, Y ., Li, Q....

Pith/arXiv arXiv 2026

[2] [2]

On the Reliability of Large Language Models for Causal Discovery

Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long

work page doi:10.18653/v1/2025.acl-long 2025

[3] [3]

acl-long.1464/

URL https://aclanthology.org/2025. acl-long.1464/. Jansen, P., C ˆot´e, M.-A., Khot, T., Bransom, E., Mishra, B. D., Majumder, B. P., Tafjord, O., and Clark, P. Discoveryworld: A virtual environment for developing and evaluating automated scientific discovery agents. InThe Thirty-eight Conference on Neural Information Processing Systems Datasets and Bench...

2025

[4] [4]

Jia, Z., Li, J., Kang, Y ., Wang, Y ., Wu, T., Wang, Q., Wang, X., Zhang, S., Shen, J., Li, Q., Qi, S., Liang, Y ., He, D., Zheng, Z., and Zhu, S.-C

URL https://openreview.net/forum? id=cDYqckEt6d. Jia, Z., Li, J., Kang, Y ., Wang, Y ., Wu, T., Wang, Q., Wang, X., Zhang, S., Shen, J., Li, Q., Qi, S., Liang, Y ., He, D., Zheng, Z., and Zhu, S.-C. The ai hippocampus: How far are we from human memory?, 2026. URL https: //arxiv.org/abs/2601.09113. 10 Adaptive Memory Agent Jiang, Y ., Jiang, L., Teney, D.,...

arXiv 2026

[5] [7]

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y

URL https://openreview.net/forum? id=oVKEAFjEqv. Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y . K., Wu, Y ., and Guo, D. Deepseekmath: Pushing the limits of mathemat- ical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300. Shridhar, M., Yuan, X., Cˆot´e, M., Bisk, Y ., Trischler, A., and Ha...

Pith/arXiv arXiv 2024

[6] [9]

Wang, P., Li, L., Shao, Z., Xu, R

URL https://openreview.net/forum? id=ehfRiF0R3a. Wang, P., Li, L., Shao, Z., Xu, R. X., Dai, D., Li, Y ., Chen, D., Wu, Y ., and Sui, Z. Math-shepherd: Ver- ify and reinforce llms step-by-step without human an- notations, 2024b. URL https://arxiv.org/abs/ 2312.08935. Wang, Z. Z., Mao, J., Fried, D., and Neubig, G. Agent workflow memory. InForty-second Int...

Pith/arXiv arXiv 2025

[7] [11]

Ye, S., Yu, C., Ke, K., Xu, C., and Wei, Y

URL https://openreview.net/forum? id=WE_vluYUL-X. Ye, S., Yu, C., Ke, K., Xu, C., and Wei, Y . H2r: Hierarchical hindsight reflection for multi-task llm agents, 2025. URL https://arxiv.org/abs/2509.12810. Yu, Y ., Yao, L., Xie, Y ., Tan, Q., Feng, J., Li, Y ., and Wu, L. Agentic memory: Learning unified long-term and short-term memory management for large...

arXiv 2025

[8] [12]

Scaling Relationship on Learning Mathematical Reasoning with Large Language Models

doi: 10.48550/ARXIV .2308.01825. URL https: //doi.org/10.48550/arXiv.2308.01825. Zhang, S., Wang, J., Zhou, R., Liao, J., Feng, Y ., Zhang, W., Wen, Y ., Li, Z., Xiong, F., Qi, Y ., Tang, B., and Wen, M. Memrl: Self-evolving agents via runtime reinforcement learning on episodic memory, 2026a. URL https:// arxiv.org/abs/2601.03192. Zhang, Y ., Khalifa, M.,...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2026

[9] [13]

Generate a **proposed action** based on the *current* strategy

[10] [14]

success” or “failure

Decide if a refresh is strictly necessary based on the criteria above. **Output format:** <think>...reasoning...</think> <action>...your proposed action...</action> <refresh decision>yes or no</refresh decision> Prompt for the ADAMEM Agent: Strategy Mem- ory Synthesis You need to synthesize contextualized, non-generic strategies from retrieved past experi...

[11] [15]

go to cabinet 1 open cabinet 1

Systematically explorecabinets and drawers inorder (starting from cabinet1) to locate the butterknife,opening closed containersas needed. go to cabinet 1 open cabinet 1 ... go to cabinet 13 open cabinet 13 Agent Thought: [...] I have already completed the systematic exploration ofcabinets 1-17 and drawers 1-7. The next logical step would be tocontinue exp...

[12] [16]

Once found, pick up thebutterknife and clean it usingthe sinkbasin

[13] [17]

put a clean butterknife in countertop

After cleaning, move thebutterknife to a countertop(e.g., countertop 1, 2, 3, or 4)to fulﬁll the goal. Figure 8.Failure case study on ALFWorld. The agent attempts the task “put a clean butterknife in countertop”. After an exhaustive but unsuccessful search of all cabinets and drawers, the agent suf- fers fromstrategy inertia: it incorrectly assesses that ...