Joint Agent Memory and Exploration Learning via Novelty Signals
Pith reviewed 2026-06-28 15:01 UTC · model grok-4.3
The pith
JAMEL jointly trains an agent's memory module and exploration policy using novelty signals from interactions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
JAMEL trains memory and exploration together through novelty-driven interaction, with deterministic novelty signals providing annotation-free supervision. The paper claims that this lets agents generalize to unseen environments, explore more effectively than open-weight baselines, and match the exploration depth of a closed-source model while consuming fewer tokens.
What carries the argument
The mutually dependent loop between memory and exploration, where sustained exploration requires memory to distinguish exhausted behaviors from unseen ones, and novelty-seeking provides supervision for memory.
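A minimal sketch of how a persistent novelty signal of this kind could be computed, assuming code coverage is reported as a set of covered unit identifiers after each action. The names `novelty_reward`, `run_episode`, and `TOY_COVERAGE` are hypothetical, and the plain Python set stands in for bookkeeping only; the paper's memory is a learned latent module, not a set.

```python
from typing import Callable, Iterable, List, Set, Tuple

# Hypothetical toy environment: each action covers a fixed set of code units.
TOY_COVERAGE = {
    "open_menu": {"L1", "L2"},
    "open_settings": {"L2", "L3"},
}


def novelty_reward(
    covered_before: Set[str], covered_now: Iterable[str]
) -> Tuple[int, Set[str]]:
    """Return the number of newly covered units and the updated coverage set.

    The signal is deterministic (same history and action give the same value)
    and persistent (a unit counts as novel only the first time it is covered).
    """
    now = set(covered_now)
    new_units = now - covered_before
    return len(new_units), covered_before | now


def run_episode(
    actions: List[str], coverage_of: Callable[[str], Iterable[str]]
) -> Tuple[List[int], Set[str]]:
    """Replay a fixed action sequence and collect the per-step novelty rewards."""
    covered: Set[str] = set()
    rewards: List[int] = []
    for action in actions:
        reward, covered = novelty_reward(covered, coverage_of(action))
        rewards.append(reward)
    return rewards, covered


if __name__ == "__main__":
    rewards, covered = run_episode(
        ["open_menu", "open_menu", "open_settings"], TOY_COVERAGE.__getitem__
    )
    print(rewards)          # [2, 0, 1]
    print(sorted(covered))  # ['L1', 'L2', 'L3']
```

Repeating `open_menu` earns no reward the second time, which is the sense in which an agent needs memory to tell exhausted behaviors from unseen ones.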
If this is right
- Memory compression becomes feasible without losing exploration utility.
- Agents can handle longer trajectories in open-ended settings.
- Exploration policies improve from the trained memory module.
- Generalization occurs without environment-specific annotations.
Where Pith is reading between the lines
- This could apply to other environments with measurable progress metrics beyond GUI code coverage.
- Reducing token consumption may allow deployment on resource-limited systems.
- The approach might inspire similar joint training in other agent components like planning.
- Testing in non-GUI domains would show if the novelty signal dependency holds broadly.
Load-bearing premise
That persistent novelty signals reliably indicate useful new behaviors for training the memory without external labels.
What would settle it
If, in a new environment, training on the code coverage signal failed to improve exploration performance over time, the utility of the joint learning loop would be falsified.
Original abstract
In open-ended environments, exploration is fundamental for autonomous agents, yet current language model agents struggle with this. Effective exploration requires memory, but retaining raw interaction histories is computationally expensive over long trajectories. While latent memory offers a solution to compress interaction histories, its training lacks reliable supervisory signals. We introduce Joint Agent Memory and Exploration Learning (JAMEL), a framework that trains agentic memory and exploration policy together through novelty-driven interaction. We observe that memory and exploration form a mutually dependent loop: sustained exploration requires memory to distinguish exhausted behaviors from unseen ones, while novelty-seeking interaction provides the supervision needed to make memory useful for future exploration. By utilizing deterministic and persistent novelty signals such as code coverage in the GUI domain, we provide natural, annotation-free supervision for the memory module. Empirical evaluations demonstrate that JAMEL successfully generalizes to unseen environments. Its exploration capability outperforms open-weight baselines and rivals the exploration depth of a closed-source model while reducing token consumption. Our code and model are open-sourced at https://github.com/MobileLLM/JAMEL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces JAMEL, a framework that jointly trains an agent's memory module and exploration policy via novelty-driven interactions. It exploits a mutually dependent loop in which memory distinguishes exhausted from novel behaviors while novelty-seeking interactions supply supervision for memory; deterministic signals such as code coverage in GUI domains are used as annotation-free supervision. The central empirical claim is that the resulting agents generalize to unseen environments, outperform open-weight baselines in exploration depth, rival a closed-source model, and reduce token consumption.
Significance. If the results hold and the novelty signal generalizes, the approach would supply a scalable, annotation-free route to training persistent memory for long-horizon LLM agents, addressing a recognized bottleneck in open-ended exploration. The open release of code and model would further strengthen reproducibility.
major comments (3)
- [Abstract] Abstract: the claim of successful generalization and performance gains is stated without any reported metrics, baselines, environment counts, or controls, so the support for the central empirical claim cannot be evaluated from the provided text.
- [Method] Method / novelty-signal description: the mutual-dependency loop is presented as an observation that the framework exploits, yet no quantitative result is shown demonstrating that the memory module receives independent supervision rather than a quantity defined by the same exploration loop; this leaves the annotation-free claim vulnerable to circularity.
- [Experiments] Experiments / domain discussion: the supervisory signal is explicitly tied to code coverage in the GUI domain; the manuscript supplies no evidence or argument that equivalent deterministic, persistent signals exist in arbitrary open-ended environments, so the extrapolation to general settings rests on an untested assumption.
minor comments (1)
- [Abstract] Abstract: the bolding of individual letters in JAMEL is typographically inconsistent and should be rendered uniformly.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of JAMEL. We respond to each major comment below and note planned revisions.
- Referee: [Abstract] Abstract: the claim of successful generalization and performance gains is stated without any reported metrics, baselines, environment counts, or controls, so the support for the central empirical claim cannot be evaluated from the provided text.
  Authors: We agree that the abstract would benefit from quantitative support. In the revised manuscript we will add concise metrics drawn from the experimental section, including the number of unseen environments evaluated, specific baseline comparisons, and reported token reductions, subject to length limits.
  Revision: yes
- Referee: [Method] Method / novelty-signal description: the mutual-dependency loop is presented as an observation that the framework exploits, yet no quantitative result is shown demonstrating that the memory module receives independent supervision rather than a quantity defined by the same exploration loop; this leaves the annotation-free claim vulnerable to circularity.
  Authors: The supervisory signal is code coverage, an external deterministic quantity computed by the GUI environment independently of the agent's policy or memory module. This breaks potential circularity. We will add an explicit quantitative analysis (e.g., an ablation comparing memory training with versus without the external coverage signal) to the revised method section.
  Revision: partial
- Referee: [Experiments] Experiments / domain discussion: the supervisory signal is explicitly tied to code coverage in the GUI domain; the manuscript supplies no evidence or argument that equivalent deterministic, persistent signals exist in arbitrary open-ended environments, so the extrapolation to general settings rests on an untested assumption.
  Authors: The current evaluation centers on the GUI domain, yet the framework is formulated to accept any deterministic, persistent novelty signal supplied by an environment. We will expand the discussion section with concrete examples of analogous signals in other domains and will qualify the generalization claims accordingly.
  Revision: yes
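The last response describes a framework that accepts any deterministic, persistent novelty signal. One hypothetical way to express that contract is a narrow interface that an environment implements. The sketch below is illustrative only: `NoveltySignal`, `PersistentSetSignal`, and `total_novelty` are assumed names, not part of the JAMEL codebase, and the text-game rooms are an invented non-GUI example.

```python
from typing import Hashable, Iterable, List, Protocol, Set


class NoveltySignal(Protocol):
    """Assumed contract: report how many never-before-seen units a step produced."""

    def observe(self, units: Iterable[Hashable]) -> int:
        ...


class PersistentSetSignal:
    """Counts units never observed before.

    Deterministic: the result depends only on the history of observed units.
    Persistent: once a unit has been seen, it never counts as novel again.
    """

    def __init__(self) -> None:
        self._seen: Set[Hashable] = set()

    def observe(self, units: Iterable[Hashable]) -> int:
        fresh = set(units) - self._seen
        self._seen |= fresh
        return len(fresh)


def total_novelty(
    signal: NoveltySignal, trajectory: List[Iterable[Hashable]]
) -> int:
    """Sum the per-step novelty over a trajectory of observed unit collections."""
    return sum(signal.observe(step) for step in trajectory)


if __name__ == "__main__":
    # GUI-style units: covered code lines per step.
    gui_steps = [["a.py:1", "a.py:2"], ["a.py:2"], ["b.py:7"]]
    print(total_novelty(PersistentSetSignal(), gui_steps))  # 3

    # Invented non-GUI units: rooms visited in a text game.
    room_steps = [["kitchen"], ["hallway"], ["kitchen"]]
    print(total_novelty(PersistentSetSignal(), room_steps))  # 2
```

Whether such a signal tracks useful behavior outside the GUI domain is exactly the open question the referee raises; the interface only shows that the bookkeeping itself is domain-agnostic.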
Circularity Check
No circularity: empirical framework with external supervision signal
The paper introduces JAMEL as a joint training framework that exploits an observed mutual dependency between memory and exploration, using an external deterministic signal (code coverage) for supervision. No equations, fitted parameters renamed as predictions, or self-citation chains are present that reduce any claimed result to its own inputs by construction. Generalization and performance claims rest on empirical evaluations rather than a closed derivation loop. The domain-specific nature of the signal is an assumption about applicability, not a circularity in the derivation.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Memory and exploration form a mutually dependent loop in which each enables the other.
- domain assumption: Deterministic novelty signals such as code coverage supply reliable supervision for memory without annotations.