What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA
Pith reviewed 2026-05-25 05:26 UTC · model grok-4.3
The pith
Curriculum composition for RL memory agents acts as a fine-grained lever on skill specialization rather than a uniform performance scaler.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Curriculum composition acts as a fine-grained lever on specialization rather than a uniform scaling factor on performance. The mixed curriculum yields the strongest overall F1 on both evaluation sets. Training on a narrow out-of-domain set transfers a targeted skill—temporal reasoning—despite weak aggregate performance. Per-type differences substantially exceed aggregate differences, indicating that single-number benchmark comparisons systematically underreport curriculum effects.
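A minimal sketch of how per-type gaps can exceed the aggregate gap. Every number below is invented for illustration and is not a result from the paper. The type names other than temporal reasoning, the question counts, and the helper `aggregate_f1` are assumptions; the aggregate is taken to be a count-weighted mean of per-type F1.

```python
def aggregate_f1(per_type_f1, counts):
    """Count-weighted mean of per-type F1 (assumed aggregation rule)."""
    total = sum(counts.values())
    return sum(per_type_f1[t] * counts[t] for t in counts) / total


# Toy numbers only: NOT taken from the paper.
counts = {"temporal": 100, "single_hop": 300, "multi_hop": 100}
curriculum_a = {"temporal": 0.30, "single_hop": 0.60, "multi_hop": 0.40}
curriculum_b = {"temporal": 0.55, "single_hop": 0.52, "multi_hop": 0.39}

agg_gap = abs(aggregate_f1(curriculum_a, counts) - aggregate_f1(curriculum_b, counts))
max_type_gap = max(abs(curriculum_a[t] - curriculum_b[t]) for t in counts)

print(f"aggregate gap: {agg_gap:.2f}")
print(f"largest per-type gap: {max_type_gap:.2f}")
```

In this toy setup the two curricula tie on the aggregate score while differing by 25 F1 points on one question type, which is the kind of effect a single-number comparison would hide.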
What carries the argument
Controlled variation of training curricula (in-domain LoCoMo, mixed LoCoMo+LongMemEval, out-of-domain LongMemEval only) while holding architecture, RL algorithm, and all hyperparameters fixed.
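The design above can be sketched as one frozen setup shared by three training sets. This is an illustrative sketch, not the authors' code: `FixedSetup`, `build_curricula`, and the example IDs are hypothetical, and the mixed condition is assumed to be a plain concatenation because the summary gives no mixing ratio. Only the model name, the algorithm, and G = 4 come from the source text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FixedSetup:
    # Only values named in the source; other hyperparameters are omitted.
    model: str = "Qwen-2.5-7B-Instruct"
    algorithm: str = "GRPO"
    group_size: int = 4


def build_curricula(locomo, longmemeval):
    """Return the three training conditions from two example pools.

    Assumption: 'mixed' is a plain concatenation of both pools.
    """
    return {
        "in_domain": list(locomo),
        "mixed": list(locomo) + list(longmemeval),
        "out_of_domain": list(longmemeval),
    }


setup = FixedSetup()
curricula = build_curricula(["locomo_q1", "locomo_q2"], ["lme_q1"])

# Every run reuses the same frozen setup; only the data differs.
runs = [(name, setup, data) for name, data in curricula.items()]
for name, cfg, data in runs:
    print(name, cfg.algorithm, cfg.group_size, len(data))
```

Freezing the dataclass makes accidental per-condition changes to the setup raise an error, which mirrors the isolation the study relies on.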
If this is right
- Mixed curriculum training produces the highest overall F1 on both in-domain and out-of-domain evaluation sets.
- Out-of-domain-only training specifically improves temporal reasoning performance despite lower aggregate scores.
- Performance differences broken down by question type exceed those shown by aggregate metrics.
- Cross-benchmark mixing requires filtering format-specific noise from memory banks to maintain training signal.
- Binary exact-match reward yields no learning signal at small group sizes, requiring continuous reward functions instead.
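The last point follows from how GRPO scores a group: each sampled answer's advantage is its reward minus the group mean, divided by the group standard deviation. The sketch below illustrates this under stated assumptions. Token-level F1 stands in for "a continuous reward" (the summary does not say which function the authors chose), the `eps` term is a common implementation detail rather than something from the source, and the answers are made up.

```python
import math
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """Binary reward: 1.0 only on a case-insensitive exact match."""
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction: str, reference: str) -> float:
    """Continuous reward: token-overlap F1 (an assumed choice)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantage: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Made-up example with G = 4 sampled answers, none an exact match.
reference = "blue hatchback"
group = ["blue", "a blue hatchback", "red sedan", "not mentioned"]

em_adv = group_advantages([exact_match(p, reference) for p in group])
f1_adv = group_advantages([token_f1(p, reference) for p in group])

print(em_adv)  # every entry is 0.0, so the policy gradient vanishes
print(f1_adv)  # partial matches are separated from misses
```

With only four samples per prompt, a group in which all answers miss (or all hit) exactly is likely, and such a group contributes nothing to the update. A graded reward still ranks the near misses.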
Where Pith is reading between the lines
- Agents trained this way could be further specialized by deliberately adding narrow out-of-domain subsets for skills like temporal reasoning.
- The observed per-type specialization patterns may generalize to other RL agent tasks that rely on long-context memory.
- Future work could test whether the same curriculum effects hold when scaling group size or switching to different continuous reward formulations.
Load-bearing premise
Fixing architecture, RL algorithm, and hyperparameters across curricula is sufficient to isolate curriculum composition as the sole cause of skill differences.
What would settle it
Observing identical per-question-type F1 scores across the three curricula when the same model and algorithm are retrained with altered hyperparameters would falsify the isolation of curriculum as the causal factor.
Original abstract
Reinforcement learning (RL) has emerged as a viable recipe for training LLM agents to reason over external memory banks in multi-session dialogue. Existing work trains exclusively on a single benchmark, leaving open how the composition of training data shapes the skills a memory agent acquires. We present a controlled empirical study that holds architecture, RL algorithm, and all hyperparameters fixed and varies only the training curriculum across three conditions: in-domain (LoCoMo), mixed-benchmark (LoCoMo + LongMemEval), and out-of-domain (LongMemEval only). Across two benchmarks and ten question types, curriculum composition acts as a fine-grained lever on specialization rather than a uniform scaling factor on performance. The mixed curriculum yields the strongest overall F1 on both evaluation sets. Training on a narrow out-of-domain set transfers a targeted skill - temporal reasoning - despite weak aggregate performance. Per-type differences substantially exceed aggregate differences, indicating that single-number benchmark comparisons systematically underreport curriculum effects. We further report two practical lessons from adapting GRPO to a single-GPU regime: cross-benchmark mixing requires filtering format-specific noise from memory banks to preserve training signal, and binary exact-match reward produces no learning signal at the small group sizes (G = 4) required on one GPU, motivating continuous reward functions in this regime.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts a controlled empirical study of curriculum effects in RL-trained memory-augmented QA agents. With architecture, GRPO algorithm, and all hyperparameters held fixed, it compares three curricula—in-domain (LoCoMo), mixed (LoCoMo + LongMemEval), and out-of-domain (LongMemEval only)—and reports that curriculum composition functions as a fine-grained lever on specialization rather than uniform scaling. The mixed curriculum achieves the highest aggregate F1 on both evaluation sets; the narrow out-of-domain curriculum transfers targeted temporal-reasoning skill despite low overall performance; and per-type F1 differences substantially exceed aggregate differences. The work also notes two practical adaptations required for single-GPU GRPO training.
Significance. If the isolation of curriculum composition holds, the result is significant for training memory agents: it shows that data mixing can produce non-uniform skill profiles and that single-number benchmarks understate curriculum impact. The per-type breakdown and the targeted transfer finding are particularly useful. The single-GPU GRPO lessons (filtering requirement and continuous rewards) are practical contributions. The study is purely empirical with no parameter-free derivations or machine-checked proofs.
major comments (2)
- [Abstract / Methods] Abstract and Methods: The central claim requires that the three curricula differ solely in data composition. The abstract states that 'cross-benchmark mixing requires filtering format-specific noise from memory banks,' but does not specify whether this filtering step (or an equivalent preprocessing) was applied uniformly to the pure in-domain and out-of-domain conditions. If filtering alters memory-bank statistics only in the mixed case, the reported superiority of the mixed curriculum and the targeted temporal-reasoning transfer could arise from filtering–GRPO interactions rather than composition alone. This directly undermines the isolation of the independent variable.
- [Results] Results section: All reported F1 scores are presented as point estimates with no error bars, standard deviations across runs, or statistical tests. The central claims rest on F1 differences (both aggregate and per-type) whose reliability cannot be assessed without this information; the abstract itself provides no details on the number of independent runs averaged.
minor comments (1)
- [Methods] The paper should explicitly state the exact preprocessing pipeline applied to each curriculum condition so readers can verify isolation of composition effects.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. We address each major comment below and describe the corresponding revisions.
- Referee: [Abstract / Methods] Abstract and Methods: The central claim requires that the three curricula differ solely in data composition. The abstract states that 'cross-benchmark mixing requires filtering format-specific noise from memory banks,' but does not specify whether this filtering step (or an equivalent preprocessing) was applied uniformly to the pure in-domain and out-of-domain conditions. If filtering alters memory-bank statistics only in the mixed case, the reported superiority of the mixed curriculum and the targeted temporal-reasoning transfer could arise from filtering–GRPO interactions rather than composition alone. This directly undermines the isolation of the independent variable.
  Authors: We appreciate this point on experimental isolation. The filtering step is required exclusively for the mixed curriculum because combining LoCoMo and LongMemEval introduces format-specific noise in the memory banks; the pure in-domain (LoCoMo only) and out-of-domain (LongMemEval only) conditions use homogeneous data and therefore receive no such filtering. To make this explicit and reinforce that the independent variable remains curriculum composition, we will revise the Methods section to document the preprocessing pipeline for each of the three conditions separately, confirming that filtering is applied only where format mismatch exists.
  Revision: yes
- Referee: [Results] Results section: All reported F1 scores are presented as point estimates with no error bars, standard deviations across runs, or statistical tests. The central claims rest on F1 differences (both aggregate and per-type) whose reliability cannot be assessed without this information; the abstract itself provides no details on the number of independent runs averaged.
  Authors: We agree that variability information would strengthen the presentation. The reported scores are point estimates obtained from single training runs per curriculum, reflecting the high compute cost of single-GPU GRPO. In the revision we will (i) state the number of runs explicitly in the Results section and (ii) add a limitations paragraph noting the absence of error bars and the desirability of multi-run statistics in follow-up work. This addresses the concern without misrepresenting the current experimental scope.
  Revision: yes
Circularity Check
No circularity: purely empirical comparisons with no derivations
The paper conducts a controlled empirical study varying only training curriculum while holding architecture, RL algorithm, and hyperparameters fixed. All reported results are direct measurements of F1 scores and per-type differences across conditions. No equations, fitted parameters presented as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes appear in the provided text. The central claims reduce to observed performance differences under stated experimental conditions, with no reduction to inputs by construction.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "holds architecture, RL algorithm, and all hyperparameters fixed and varies only the training curriculum across three conditions"
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "binary exact-match reward produces no learning signal at the small group sizes (G = 4)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Prateek Chhikara, Dhruv Khant, S. Aryan, T. Singh, et al. Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv preprint, 2025. arXiv:2504.19413.
- [2] W. Xu, K. Mei, H. Gao, J. Tan, Z. Liang, and Y. Zhang. A-Mem: Agentic Memory for LLM Agents. arXiv preprint, 2025. arXiv:2502.12110.
- [3] LangChain. LangMem SDK for Agent Long-Term Memory. Blog post, 2025. https://www.langchain.com/blog/langmem-sdk-launch. Accessed May 2026.
- [4] P. Rasmussen, P. Paliychuk, T. Beauvais, J. Ryan, and D. Chalef. Zep: A Temporal Knowledge Graph Architecture for Agent Memory. arXiv preprint, 2025. arXiv:2501.13956.
- [5] S. Yan, X. Yang, Z. Huang, E. Nie, Z. Ding, et al. Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning. arXiv preprint. arXiv:2508.19828.
- [7] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao. Reflexion: Language Agents with Verbal Reinforcement Learning. In Advances in Neural Information Processing Systems (NeurIPS), 2023. arXiv:2303.11366.
- [8] C. Packer, S. Wooders, K. Lin, V. Fang, S. G. Patil, I. Stoica, and J. E. Gonzalez. MemGPT: Towards LLMs as Operating Systems. arXiv preprint, 2023. arXiv:2310.08560.
- [9] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, et al. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint, 2024. arXiv:2402.03300.
- [10] A. Maharana, D.-H. Lee, S. Tulyakov, M. Bansal, F. Barbieri, and Y. Fang. Evaluating Very Long-Term Conversational Memory of LLM Agents. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2024. arXiv:2402.17753.
- [11] D. Wu, H. Wang, W. Yu, Y. Zhang, K.-W. Chang, and D. Yu. LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. In International Conference on Learning Representations (ICLR), 2025. arXiv:2410.10813.
- [12] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems (NeurIPS), 2020. arXiv:2005.11401.
- [13] V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W.-t. Yih. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020. arXiv:2004.04906.
- [14] N. Reimers and I. Gurevych. Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019. arXiv:1908.10084.
- [15] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum Learning. In International Conference on Machine Learning (ICML), 2009.
- [16] A. Graves, M. G. Bellemare, J. Menick, R. Munos, and K. Kavukcuoglu. Automated Curriculum Learning for Neural Networks. In International Conference on Machine Learning (ICML),
- [17] S. Parashar, S. Gui, X. Li, H. Ling, S. Vemuri, et al. Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning. arXiv preprint, 2025. arXiv:2506.06632.
- [18] G. Jiang, W. Feng, G. Quan, C. Hao, Y. Zhang, G. Liu, and H. Wang. VCRL: Variance-Based Curriculum Reinforcement Learning for Large Language Models. arXiv preprint, 2025. arXiv:2509.19803.
- [19]
- [20] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, et al. Training Language Models to Follow Instructions with Human Feedback. In Advances in Neural Information Processing Systems (NeurIPS), 2022. arXiv:2203.02155.
- [21] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal Policy Optimization Algorithms. arXiv preprint, 2017. arXiv:1707.06347.
- [22] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn. Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. In Advances in Neural Information Processing Systems (NeurIPS), 2023. arXiv:2305.18290.
- [23] DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint, 2025. arXiv:2501.12948.
- [24] Qwen Team. Qwen2.5 Technical Report. arXiv preprint, 2024. arXiv:2412.15115.
- [25] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations (ICLR), 2022. arXiv:2106.09685.
- [26] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. de Laroussilhe, et al. Parameter-Efficient Transfer Learning for NLP. In International Conference on Machine Learning (ICML), 2019. arXiv:1902.00751.
- [27]
- [28] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. In Advances in Neural Information Processing Systems (NeurIPS), 2023. arXiv:2305.14314.
- [29] J. Xu, A. Szlam, and J. Weston. Beyond Goldfish Memory: Long-Term Open-Domain Conversation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), Volume 1: Long Papers, pages 5180–5197, 2022. arXiv:2107.07567.
A Hyperparameters
Table 7: Full training hyperparameters.
Parameter | Value
Model | Qwen-2.5-7B-Instruct
L...
- [30] Select the most relevant memories for answering the question
- [31] Reason step-by-step using the selected memories
- [32] What is the name of the music streaming service I have been using lately?
Provide a concise, accurate answer.
Output format:
<selected_memories>[list the memory IDs or snippets]</selected_memories>
<reasoning>[your step-by-step reasoning]</reasoning>
<answer>[your final answer - be concise]</answer>
C LLM-as-Judge Scores
Table 8: Mean LLM-as-Judge scores (Claude 3 Haiku, 1–5 scale).
Model | LoCoMo | LongMemEval
Baseline | 3.27 | 3.39
Co...
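The output format quoted in the prompt fragment above uses three tagged fields. A minimal parsing sketch, assuming each tag appears at most once and is well formed; `parse_response` and the sample string are hypothetical and not taken from the paper.

```python
import re

# Tag names as they appear in the quoted output format.
TAGS = ("selected_memories", "reasoning", "answer")


def parse_response(text: str) -> dict:
    """Extract each tagged field, or None when the tag is missing."""
    fields = {}
    for tag in TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        fields[tag] = match.group(1).strip() if match else None
    return fields


# Made-up model output for illustration.
sample = (
    "<selected_memories>[m3, m7]</selected_memories>\n"
    "<reasoning>m7 is the most recent mention.</reasoning>\n"
    "<answer>example answer</answer>"
)
parsed = parse_response(sample)
print(parsed["answer"])
```

Returning None for a missing tag lets a reward function treat malformed outputs explicitly instead of failing on them.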