What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA

Jialun Wu; Qiyang Xie; Shuai Xiao; Su Liu; Weikai Zhou; Xinjie He; Zhiyuan Lin

arxiv: 2605.23067 · v1 · pith:7YEVGP25new · submitted 2026-05-21 · 💻 cs.CL

Pith reviewed 2026-05-25 05:26 UTC · model grok-4.3

classification 💻 cs.CL

keywords: curriculum effects, reinforcement learning, memory-augmented QA, RL agents, temporal reasoning, benchmark composition, GRPO algorithm

The pith

Curriculum composition for RL memory agents acts as a fine-grained lever on skill specialization rather than a uniform performance scaler.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how training data composition shapes the abilities of reinforcement learning agents that reason over external memory for question answering. It trains identical agent setups and RL methods on three curricula—in-domain only, out-of-domain only, and a mix—while measuring outcomes on two benchmarks across ten question types. Results show the mixed curriculum produces the highest overall F1 scores, while narrow out-of-domain training transfers targeted temporal reasoning skill even with weak aggregate results. Per-type performance gaps are larger than overall scores, so single-number benchmarks understate curriculum impact. The study also notes practical adaptations needed when running the GRPO algorithm on limited hardware.

Core claim

Curriculum composition acts as a fine-grained lever on specialization rather than a uniform scaling factor on performance. The mixed curriculum yields the strongest overall F1 on both evaluation sets. Training on a narrow out-of-domain set transfers a targeted skill—temporal reasoning—despite weak aggregate performance. Per-type differences substantially exceed aggregate differences, indicating that single-number benchmark comparisons systematically underreport curriculum effects.
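The gap between per-type and aggregate differences can be illustrated with a minimal sketch. The scores below are synthetic toy values, not results from the paper, and the question-type labels (`temporal`, `single_hop`) and the helper `summarize` are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def summarize(records):
    """Return (aggregate mean F1, per-type mean F1) for (question_type, f1) pairs."""
    by_type = defaultdict(list)
    for qtype, f1 in records:
        by_type[qtype].append(f1)
    all_scores = [s for scores in by_type.values() for s in scores]
    aggregate = sum(all_scores) / len(all_scores)
    per_type = {t: sum(s) / len(s) for t, s in by_type.items()}
    return aggregate, per_type


# Synthetic per-question F1 scores for two hypothetical models.
model_a = [("temporal", 0.2), ("temporal", 0.2), ("single_hop", 0.8), ("single_hop", 0.8)]
model_b = [("temporal", 0.6), ("temporal", 0.6), ("single_hop", 0.5), ("single_hop", 0.5)]

agg_a, types_a = summarize(model_a)
agg_b, types_b = summarize(model_b)

# The aggregate gap is small because gains and losses on different types cancel.
aggregate_gap = abs(agg_a - agg_b)
largest_type_gap = max(abs(types_a[t] - types_b[t]) for t in types_a)
print(f"aggregate gap: {aggregate_gap:.2f}, largest per-type gap: {largest_type_gap:.2f}")
```

In this toy case the two models look nearly tied in aggregate while differing sharply on one question type, which is the kind of pattern the core claim describes.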

What carries the argument

Controlled variation of training curricula (in-domain LoCoMo, mixed LoCoMo+LongMemEval, out-of-domain LongMemEval only) while holding architecture, RL algorithm, and all hyperparameters fixed.
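One way to picture this design is a shared run configuration in which only the curriculum fields change. The sketch below is hypothetical: the class `RunConfig`, the field names, and the learning-rate and seed values are placeholders, not the authors' settings. Only the benchmark names and the group size G = 4 come from the abstract.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class RunConfig:
    curriculum: str               # label of the training condition
    train_sets: tuple             # benchmarks included in the curriculum
    group_size: int = 4           # G = 4, as stated in the abstract
    learning_rate: float = 1e-5   # placeholder value, not from the paper
    seed: int = 0                 # placeholder value, not from the paper


BASE = RunConfig(curriculum="in_domain", train_sets=("LoCoMo",))
CONFIGS = [
    BASE,
    replace(BASE, curriculum="mixed", train_sets=("LoCoMo", "LongMemEval")),
    replace(BASE, curriculum="out_of_domain", train_sets=("LongMemEval",)),
]


def differing_fields(a, b):
    """Names of the fields on which two configurations disagree."""
    return {f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)}


for cfg in CONFIGS[1:]:
    print(cfg.curriculum, "differs from base in:", sorted(differing_fields(BASE, cfg)))
```

A check like `differing_fields` makes the isolation claim auditable: any preprocessing step applied to only one condition would need to appear as an extra field that differs.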

If this is right

Mixed curriculum training produces the highest overall F1 on both in-domain and out-of-domain evaluation sets.
Out-of-domain-only training specifically improves temporal reasoning performance despite lower aggregate scores.
Performance differences broken down by question type exceed those shown by aggregate metrics.
Cross-benchmark mixing requires filtering format-specific noise from memory banks to maintain training signal.
Binary exact-match reward yields no learning signal at small group sizes, requiring continuous reward functions instead.
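The last point follows from how GRPO computes advantages: each sampled answer's reward is normalized against the mean and standard deviation of its group, so a group whose rewards are all identical contributes zero advantage. The sketch below assumes the standard group normalization and uses token-level F1 only as one example of a continuous reward; the source does not say which continuous function the authors chose, and the gold answer and sampled answers are invented for illustration.

```python
import statistics
from collections import Counter


def exact_match(pred, gold):
    """Binary reward: 1.0 only if the normalized strings are identical."""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0


def token_f1(pred, gold):
    """Continuous reward: harmonic mean of token precision and recall."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps) within one group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical gold answer and a group of G = 4 sampled answers.
gold = "Spotify Premium"
group = ["Spotify", "I use Spotify Premium", "Apple Music", "Spotify premium plan"]

em_adv = group_advantages([exact_match(p, gold) for p in group])
f1_adv = group_advantages([token_f1(p, gold) for p in group])
print("exact-match advantages:", em_adv)
print("token-F1 advantages:", [round(a, 3) for a in f1_adv])
```

With only four samples, a group in which no answer matches exactly is common, and every such group yields all-zero advantages. A graded reward still separates partially correct answers from wrong ones within the same group.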

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Agents trained this way could be further specialized by deliberately adding narrow out-of-domain subsets for skills like temporal reasoning.
The observed per-type specialization patterns may generalize to other RL agent tasks that rely on long-context memory.
Future work could test whether the same curriculum effects hold when scaling group size or switching to different continuous reward formulations.

Load-bearing premise

Fixing architecture, RL algorithm, and hyperparameters across curricula is sufficient to isolate curriculum composition as the sole cause of skill differences.

What would settle it

Observing identical per-question-type F1 scores across the three curricula when the same model and algorithm are retrained with altered hyperparameters would falsify the isolation of curriculum as the causal factor.

Figures

Figures reproduced from arXiv: 2605.23067 by Jialun Wu, Qiyang Xie, Shuai Xiao, Su Liu, Weikai Zhou, Xinjie He, Zhiyuan Lin.

**Figure 1.** Experimental design. Three curricula, identical training recipe, evaluated on two benchmarks with per-type F1 breakdown.

**Figure 2.** Per-question-type F1 across all models and both benchmarks. Color intensity shows delta from baseline (green = improvement, red = regression). Bold values indicate the best RL-trained configuration per type. Curriculum effects concentrate in specific question types, with per-type differences several times larger than overall gaps.

Original abstract

Reinforcement learning (RL) has emerged as a viable recipe for training LLM agents to reason over external memory banks in multi-session dialogue. Existing work trains exclusively on a single benchmark, leaving open how the composition of training data shapes the skills a memory agent acquires. We present a controlled empirical study that holds architecture, RL algorithm, and all hyperparameters fixed and varies only the training curriculum across three conditions: in-domain (LoCoMo), mixed-benchmark (LoCoMo + LongMemEval), and out-of-domain (LongMemEval only). Across two benchmarks and ten question types, curriculum composition acts as a fine-grained lever on specialization rather than a uniform scaling factor on performance. The mixed curriculum yields the strongest overall F1 on both evaluation sets. Training on a narrow out-of-domain set transfers a targeted skill - temporal reasoning - despite weak aggregate performance. Per-type differences substantially exceed aggregate differences, indicating that single-number benchmark comparisons systematically underreport curriculum effects. We further report two practical lessons from adapting GRPO to a single-GPU regime: cross-benchmark mixing requires filtering format-specific noise from memory banks to preserve training signal, and binary exact-match reward produces no learning signal at the small group sizes (G = 4) required on one GPU, motivating continuous reward functions in this regime.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Mixed curricula beat single-benchmark training here and out-of-domain data transfers temporal reasoning, but the filtering step applied only to the mixed condition undercuts the isolation claim.

The core finding is that training curriculum acts as a lever for specific skills rather than just overall scaling in these GRPO-trained memory agents. Mixed in-domain plus out-of-domain data gives the highest F1 on both test sets, while pure out-of-domain training still boosts temporal reasoning questions despite lower aggregate scores. Per-type breakdowns reveal larger effects than the headline numbers suggest, which is a useful observation for anyone tuning these agents.

Referee Report

2 major / 1 minor

Summary. The paper conducts a controlled empirical study of curriculum effects in RL-trained memory-augmented QA agents. With architecture, GRPO algorithm, and all hyperparameters held fixed, it compares three curricula—in-domain (LoCoMo), mixed (LoCoMo + LongMemEval), and out-of-domain (LongMemEval only)—and reports that curriculum composition functions as a fine-grained lever on specialization rather than uniform scaling. The mixed curriculum achieves the highest aggregate F1 on both evaluation sets; the narrow out-of-domain curriculum transfers targeted temporal-reasoning skill despite low overall performance; and per-type F1 differences substantially exceed aggregate differences. The work also notes two practical adaptations required for single-GPU GRPO training.

Significance. If the isolation of curriculum composition holds, the result is significant for training memory agents: it shows that data mixing can produce non-uniform skill profiles and that single-number benchmarks understate curriculum impact. The per-type breakdown and the targeted transfer finding are particularly useful. The single-GPU GRPO lessons (filtering requirement and continuous rewards) are practical contributions. The study is purely empirical with no parameter-free derivations or machine-checked proofs.

major comments (2)

[Abstract / Methods] Abstract and Methods: The central claim requires that the three curricula differ solely in data composition. The abstract states that 'cross-benchmark mixing requires filtering format-specific noise from memory banks,' but does not specify whether this filtering step (or an equivalent preprocessing) was applied uniformly to the pure in-domain and out-of-domain conditions. If filtering alters memory-bank statistics only in the mixed case, the reported superiority of the mixed curriculum and the targeted temporal-reasoning transfer could arise from filtering–GRPO interactions rather than composition alone. This directly undermines the isolation of the independent variable.
[Results] Results section: All reported F1 scores are presented as point estimates with no error bars, standard deviations across runs, or statistical tests. The central claims rest on F1 differences (both aggregate and per-type) whose reliability cannot be assessed without this information; the abstract itself provides no details on the number of independent runs averaged.
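As an illustration of the kind of uncertainty estimate this comment asks for, a paired bootstrap over per-question F1 scores gives a confidence interval for the difference between two models on the same evaluation set. This is a minimal sketch with synthetic scores; the function name and defaults are hypothetical. It captures evaluation-set sampling variance only and does not replace repeated training runs with different seeds.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-question difference (a - b)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired per question")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample question indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    diffs.sort()
    low = diffs[int((alpha / 2) * n_resamples)]
    high = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Synthetic per-question F1 scores for two hypothetical models.
f1_mixed = [0.9, 0.4, 0.7, 0.6, 0.8, 0.5]
f1_in_domain = [0.7, 0.5, 0.6, 0.6, 0.5, 0.4]

low, high = paired_bootstrap_ci(f1_mixed, f1_in_domain)
print(f"95% CI for mean F1 difference: [{low:.3f}, {high:.3f}]")
```

An interval that excludes zero would indicate that the observed gap is unlikely to be an artifact of which questions happen to be in the evaluation set; per-type breakdowns, which have fewer questions each, would typically yield wider intervals.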

minor comments (1)

[Methods] The paper should explicitly state the exact preprocessing pipeline applied to each curriculum condition so readers can verify isolation of composition effects.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major comment below and describe the corresponding revisions.

Referee: [Abstract / Methods] Abstract and Methods: The central claim requires that the three curricula differ solely in data composition. The abstract states that 'cross-benchmark mixing requires filtering format-specific noise from memory banks,' but does not specify whether this filtering step (or an equivalent preprocessing) was applied uniformly to the pure in-domain and out-of-domain conditions. If filtering alters memory-bank statistics only in the mixed case, the reported superiority of the mixed curriculum and the targeted temporal-reasoning transfer could arise from filtering–GRPO interactions rather than composition alone. This directly undermines the isolation of the independent variable.

Authors: We appreciate this point on experimental isolation. The filtering step is required exclusively for the mixed curriculum because combining LoCoMo and LongMemEval introduces format-specific noise in the memory banks; the pure in-domain (LoCoMo only) and out-of-domain (LongMemEval only) conditions use homogeneous data and therefore receive no such filtering. To make this explicit and reinforce that the independent variable remains curriculum composition, we will revise the Methods section to document the preprocessing pipeline for each of the three conditions separately, confirming that filtering is applied only where format mismatch exists.
revision: yes

Referee: [Results] Results section: All reported F1 scores are presented as point estimates with no error bars, standard deviations across runs, or statistical tests. The central claims rest on F1 differences (both aggregate and per-type) whose reliability cannot be assessed without this information; the abstract itself provides no details on the number of independent runs averaged.

Authors: We agree that variability information would strengthen the presentation. The reported scores are point estimates obtained from single training runs per curriculum, reflecting the high compute cost of single-GPU GRPO. In the revision we will (i) state the number of runs explicitly in the Results section and (ii) add a limitations paragraph noting the absence of error bars and the desirability of multi-run statistics in follow-up work. This addresses the concern without misrepresenting the current experimental scope.
revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical comparisons with no derivations

The paper conducts a controlled empirical study varying only training curriculum while holding architecture, RL algorithm, and hyperparameters fixed. All reported results are direct measurements of F1 scores and per-type differences across conditions. No equations, fitted parameters presented as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes appear in the provided text. The central claims reduce to observed performance differences under stated experimental conditions, with no reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical machine-learning study; contains no mathematical derivations, fitted constants, or postulated entities. All claims derive from experimental measurements under fixed training conditions.

pith-pipeline@v0.9.0 · 5785 in / 1140 out tokens · 51507 ms · 2026-05-25T05:26:28.124201+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: "holds architecture, RL algorithm, and all hyperparameters fixed and varies only the training curriculum across three conditions"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "binary exact-match reward produces no learning signal at the small group sizes (G = 4)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

