pith. machine review for the scientific record. sign in

arxiv: 2604.16918 · v1 · submitted 2026-04-18 · 💻 cs.CL · cs.LG

Recognition: unknown

Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning

Jian Zhao, Mohamed Elhoseiny, Weiyu Ma, Xinyu Cui, Xuhui Liu, Yan Song, Yongcheng Zeng

Pith reviewed 2026-05-10 07:23 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords experience replayprioritized experience replayreinforcement learninglarge language modelssample efficiencypolicy stalenessagentic tasks
0
0 comments X

The pith

Freshness-Aware PER adds exponential age decay to experience priorities to overcome staleness in fast-evolving LLM policies and achieve major gains over on-policy RL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reinforcement learning for large language and vision-language models relies on on-policy methods that throw away every trajectory after a single update, wasting the expensive cost of multi-turn interactions. Directly using prioritized experience replay to reuse samples fails because the model's policy changes so quickly that old priorities no longer reflect current usefulness. The proposed Freshness-Aware PER multiplies every priority by an exponential decay factor that increases with the trajectory's age, based on effective sample size ideas. This change lets the agent sample informative past experiences without being dominated by outdated ones. Experiments on search, puzzle, and navigation tasks show consistent and sometimes dramatic improvements while plain PER hurts results.

Core claim

The central discovery is that augmenting prioritized experience replay with a multiplicative exponential age decay term, derived from effective sample size analysis, resolves the priority staleness issue caused by rapid policy updates in large models, allowing successful off-policy learning that significantly exceeds on-policy baselines.

What carries the argument

The age decay factor, a multiplicative term applied to priorities that exponentially reduces the weight of older trajectories as the policy evolves.

If this is right

  • Large performance improvements on tasks such as natural language search, Sokoban, and vision-language navigation.
  • Consistent degradation when using standard PER without age decay.
  • Effective across model sizes of 0.5B, 3B, and 7B parameters.
  • Applicable to a range of multi-step reasoning and agentic environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Age decay mechanisms like this may extend to other off-policy RL methods when applied to non-stationary policies in large models.
  • Reducing the discard rate of trajectories could lower the overall compute needed for post-training LLMs.
  • Further work might explore adaptive decay rates that depend on measured policy change speed.

Load-bearing premise

The specific form of multiplicative exponential age decay will accurately counteract staleness from policy changes without introducing harmful sampling biases or requiring extensive tuning.

What would settle it

If an ablation study removing the age decay or using a fixed decay rate results in performance no better than or worse than on-policy methods, the mechanism's effectiveness would be called into question.

Figures

Figures reproduced from arXiv: 2604.16918 by Jian Zhao, Mohamed Elhoseiny, Weiyu Ma, Xinyu Cui, Xuhui Liu, Yan Song, Yongcheng Zeng.

Figure 1
Figure 1. Figure 1: Overview of the FreshPER training pipeline. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: On-policy LLM RL algorithms (PPO, REINFORCE++, GRPO) use each trajectory for a [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Learning curves on LLM tasks. Blue ■: On-Policy; Yellow ▲: Standard PER; Red •: FreshPER (Ours). 0 25 50 75 100 125 150 175 200 Training Steps 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Validation Success Rate (a) VLM FrozenLake 0 50 100 150 200 250 300 350 400 Training Steps 0.20 0.25 0.30 0.35 0.40 0.45 0.50 Validation Success Rate (b) VLM GeoQA [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Learning curves on VLM tasks. Blue ■: On-Policy; Yellow ▲: Standard PER; Red •: FreshPER (Ours). The benefit of replay scales with task difficulty. Ablation on the age decay constant τ across Sokoban Simple and FrozenLake (fig. 5) reveals that τ is task-dependent: Sokoban, which exhibits rapid policy drift, requires aggressive decay (τ=500), while the slower-evolving FrozenLake ben￾efits from a gentler set… view at source ↗
Figure 5
Figure 5. Figure 5: Ablation of age decay constant τ . Blue ■: Baseline; Red •: τ=500; Orange ♦: τ=1000; Gray ▼: τ=1500. (a) Sokoban: τ=500 is optimal; τ=1500 fails completely. (b) FrozenLake: τ=1000 is optimal; all τ values outperform Baseline. 0 50 100 150 200 250 300 350 400 Training Steps 25 20 15 10 5 0 Validation Score (Mean) (a) CliffWalking 0 50 100 150 200 250 300 350 400 Training Steps 0.92 0.93 0.94 0.95 0.96 0.97 … view at source ↗
Figure 6
Figure 6. Figure 6: Control experiments. Blue ■: On-Policy; Yellow ▲: Standard PER; Red •: Fresh￾PER (Ours). (a) CliffWalking: all methods converge to optimal; replay adds transient instability. (b) GSM8K: initial performance already >93%; all methods saturate at ∼97%. to the current policy. The contrast between Standard PER and FreshPER isolates the contribution of age decay, confirming that it is essential for applying PER … view at source ↗
Figure 7
Figure 7. Figure 7: IS correction ablation on FrozenLake (LLM). [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Ablation of age decay constant τ . Blue ■: Baseline; Red •: τ=500; Orange ♦: τ=1000; Gray ▼: τ=1500. On Sokoban Simple (fig. 8a), τ=500 achieves a peak score of 2.30, τ=1000 reaches 1.50, and τ=1500 completely fails (−0.90), identical to Standard PER. This environment exhibits rapid policy drift, requiring aggressive decay. On FrozenLake (fig. 8b), the ranking shifts: τ=1000 achieves the highest peak (0.33… view at source ↗
Figure 9
Figure 9. Figure 9: IS correction ablation on FrozenLake (LLM). [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Control experiments. Blue ■: On-Policy; Yellow ▲: Standard PER; Red •: FreshPER (Ours). Too-Simple Environments (CliffWalking). All three methods converge to the optimal score of 0 (fig. 10a). On-Policy converges fastest and most stably, while Standard PER and FreshPER exhibit 16 [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗
read the original abstract

Reinforcement Learning (RL) has achieved impressive success in post-training Large Language Models (LLMs) and Vision-Language Models (VLMs), with on-policy algorithms such as PPO, GRPO, and REINFORCE++ serving as the dominant paradigm. However, these methods discard all collected trajectories after a single gradient update, resulting in poor sample efficiency, particularly wasteful for agentic tasks where multi-turn environment interactions are expensive. While Experience Replay drives sample efficiency in classic RL by allowing agents to reuse past trajectories and prioritize informative ones, directly applying Prioritized Experience Replay (PER) to LLMs fails. The rapid policy evolution of billion-parameter models renders stored priorities stale, causing old high-priority trajectories to dominate sampling long after they have become uninformative. We propose Freshness-Aware PER, which addresses this priority staleness problem by augmenting any PER-based priority with a multiplicative exponential age decay grounded in effective sample size analysis. To the best of our knowledge, Freshness-Aware PER is the first work to successfully apply PER to LLM/VLM reinforcement learning. We evaluate on eight multi-step agentic, reasoning, and math competition tasks with 0.5B, 3B, and 7B models. Freshness-Aware PER significantly outperforms on-policy baselines, achieving +46% on NQ Search, +367% on Sokoban, and +133% on VLM FrozenLake, while standard PER without age decay consistently degrades performance. Our code is publicly available at https://github.com/Vision-CAIR/Freshness-Aware-PER.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Freshness-Aware Prioritized Experience Replay (PER) to improve sample efficiency in LLM and VLM reinforcement learning. Standard on-policy methods like PPO discard trajectories after one update, while direct application of PER fails due to rapid policy evolution making stored priorities stale. The method augments PER priorities with a multiplicative exponential age decay derived from effective sample size analysis. Evaluations across eight agentic, reasoning, and math tasks using 0.5B–7B models report large gains over on-policy baselines (+46% on NQ Search, +367% on Sokoban, +133% on VLM FrozenLake) and consistent degradation when using vanilla PER without the decay. Code is released publicly.

Significance. If the central result holds, the work would provide a practical route to off-policy reuse in LLM/VLM post-training, where environment interactions are costly. The consistent degradation of standard PER and the public code release are notable strengths that would support adoption if the age-decay mechanism proves robust across tasks and model scales.

major comments (2)
  1. [Method (age decay derivation)] The grounding of the multiplicative exponential age decay in effective sample size analysis (described in the method) assumes priority distributions evolve on timescales comparable to classic RL. For billion-parameter models, a single gradient step can induce abrupt policy shifts; the manuscript must demonstrate that the chosen decay rate remains effective without per-task retuning or introducing new sampling bias under these conditions, as the reported gains (+46% NQ Search, etc.) could otherwise hinge on hyperparameter selection.
  2. [Experiments (main results table and ablation)] Table reporting main results and the PER ablation: the manuscript shows standard PER degrades performance, but does not report per-seed variance, statistical significance tests, or the exact replay-buffer size and priority normalization used. Without these, it is unclear whether the Freshness-Aware variant’s advantage is robust or sensitive to implementation details that interact with the decay term.
minor comments (2)
  1. [Related Work] The abstract states 'to the best of our knowledge' Freshness-Aware PER is the first successful application of PER to LLM/VLM RL; the related-work section should explicitly cite and differentiate from any prior attempts at experience replay in language-model RL to strengthen this claim.
  2. [Method] Notation for the age-decay factor and its integration into the priority formula should be introduced with a single equation early in the method section rather than scattered across paragraphs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, providing clarifications and committing to revisions that strengthen the presentation of robustness without altering the core claims.

read point-by-point responses
  1. Referee: [Method (age decay derivation)] The grounding of the multiplicative exponential age decay in effective sample size analysis (described in the method) assumes priority distributions evolve on timescales comparable to classic RL. For billion-parameter models, a single gradient step can induce abrupt policy shifts; the manuscript must demonstrate that the chosen decay rate remains effective without per-task retuning or introducing new sampling bias under these conditions, as the reported gains (+46% NQ Search, etc.) could otherwise hinge on hyperparameter selection.

    Authors: We thank the referee for this important observation. The effective sample size derivation yields a general exponential decay factor that depends on the observed rate of policy evolution (via KL divergence between consecutive policies), rather than assuming classic RL timescales. In the original experiments, a single decay hyperparameter was used uniformly across all eight tasks and model scales (0.5B–7B) with no per-task retuning, and the consistent gains plus vanilla PER degradation support that the rate is not overly sensitive. In the revised manuscript we add (i) per-step policy shift measurements confirming the decay remains effective under abrupt LLM updates and (ii) a sensitivity plot over a range of decay rates showing stable performance without new sampling bias (the multiplicative term is renormalized after application). These additions directly address the concern while preserving the reported gains. revision: yes

  2. Referee: [Experiments (main results table and ablation)] Table reporting main results and the PER ablation: the manuscript shows standard PER degrades performance, but does not report per-seed variance, statistical significance tests, or the exact replay-buffer size and priority normalization used. Without these, it is unclear whether the Freshness-Aware variant’s advantage is robust or sensitive to implementation details that interact with the decay term.

    Authors: We agree these reporting details are essential. The replay buffer size (10,000 trajectories) and priority normalization (sum-tree with freshness decay applied prior to normalization) are specified in Section 4.1 and the publicly released code. In the revised manuscript we expand the main results table to report mean ± standard deviation over five random seeds and include paired t-test p-values establishing statistical significance of the improvements over baselines. We also add a short paragraph clarifying that the decay term is applied before normalization, preventing stale-sample dominance and ensuring the observed advantage is not an artifact of implementation choices. revision: yes

Circularity Check

0 steps flagged

No circularity: age decay presented as derived from effective sample size analysis with independent empirical validation

full rationale

The paper's central proposal augments PER priorities with a multiplicative exponential age decay explicitly grounded in effective sample size analysis, rather than by fitting to the target performance metrics or by self-referential definition. No equations, claims, or citations in the provided text reduce this decay (or the overall Freshness-Aware PER method) to its inputs by construction, invoke load-bearing self-citations for uniqueness, or rename known results as novel derivations. Performance gains are reported as separate empirical evaluations across multiple tasks and model sizes, without the 'predictions' being statistically forced by the derivation itself. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that priority staleness in PER for rapidly evolving LLM policies can be addressed by an exponential age decay derived from effective sample size analysis. No new entities are postulated. Limited information is available from the abstract alone.

free parameters (1)
  • exponential decay rate
    The specific rate parameter for the age decay is part of the method but not detailed or shown as fitted in the abstract.
axioms (1)
  • domain assumption Effective sample size analysis provides a valid grounding for determining trajectory freshness in the context of fast policy updates.
    The method augments any PER priority with multiplicative exponential age decay based on this analysis.

pith-pipeline@v0.9.0 · 5603 in / 1517 out tokens · 77945 ms · 2026-05-10T07:23:09.997152+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

26 extracted references · 24 canonical work pages · 10 internal anchors

  1. [1]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

  2. [2]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    11 DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learn- ing.arXiv preprint arXiv:2501.12948,

  3. [3]

    Rlhf workflow: From reward modeling to online rlhf

    Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, and Tong Zhang. RLHF workflow: From reward modeling to online RLHF.arXiv preprint arXiv:2405.07863,

  4. [4]

    Prioritized replay for RL post-training.arXiv preprint arXiv:2601.02648,

    Mehdi Fatemi, Banafsheh Rafiee, Kavosh Asadi, Yaqiao Li, Yunhao Tang, Yongchao Zhou, and Dzmitry Bahdanau. Prioritized replay for RL post-training.arXiv preprint arXiv:2601.02648,

  5. [5]

    Revisiting fundamentals of experience replay.arXiv preprint arXiv:2007.06700,

    William Fedus, Prajit Ramachandran, Aravind Rajeswaran, Charles Blundell, Timothy Lillicrap, et al. Revisiting fundamentals of experience replay.arXiv preprint arXiv:2007.06700,

  6. [6]

    2505.24298 , archivePrefix=

    Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Siheng Li, Wanpeng Zhang, Yue Wu, Tianbao Xie, Yongfeng Zhang, Tao Yu, Zhiwei Jia, and Zhaoran Wang. AReaL: A large-scale asynchronous reinforcement learning system for language reasoning.arXiv preprint arXiv:2505.24298,

  7. [7]

    Openrlhf: An easy-to-use, scalable and high-performance rlhf framework.arXiv preprint arXiv:2405.11143, 2024

    Jian Hu, Xibin Wu, Wei Shen, Jason Klein Liu, Zilin Zhu, Weixun Wang, Songlin Jiang, Haoran Wang, Hao Chen, Bin Chen, et al. OpenRLHF: An easy-to-use, scalable and high-performance RLHF framework.arXiv preprint arXiv:2405.11143,

  8. [8]

    REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

    Jian Hu, Jason Klein Liu, Haotian Xu, and Wei Shen. REINFORCE++: Stabilizing critic-free policy optimization with global advantage normalization.arXiv preprint arXiv:2501.03262,

  9. [9]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, et al. Vision-R1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749,

  10. [10]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning.arXiv preprint arXiv:2503.09516,

  11. [11]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025a. Zichuan Liu, Jinyu Wang, Lei Song, and Jiang Bian. Sample-efficient LLM optimization with reset replay.arXiv preprint arXiv:2508.06412, 2025b. Yu Luo, Shuo Han,...

  12. [12]

    Asynchronous rlhf: Faster and more efficient off-policy rl for language models.arXiv preprint arXiv:2410.18252,

    Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hosseini, Rishabh Agarwal, and Aaron Courville. Asynchronous RLHF: Faster and more efficient off-policy RL for language models.arXiv preprint arXiv:2410.18252,

  13. [13]

    Qwen2.5 Technical Report

    Qwen Team. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115,

  14. [14]

    Qwen2.5-VL Technical Report

    Qwen Team. Qwen2.5-VL technical report.arXiv preprint arXiv:2502.13923,

  15. [15]

    Tapered off-policy reinforce: Stable and efficient reinforcement learning for llms.arXiv preprint arXiv:2503.14286, 2025

    Nicolas Le Roux, Marc G. Bellemare, Jonathan Lebensold, Arnaud Bergeron, Joshua Greaves, et al. Tapered off-policy REINFORCE: Stable and efficient reinforcement learning for LLMs.arXiv preprint arXiv:2503.14286,

  16. [16]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prajit Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347,

  17. [17]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y .K. Li, Y . Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300,

  18. [18]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    Haozhan Shen, Peng Liu, Jingcheng Li, et al. VLM-R1: A stable and generalizable R1-style large vision-language model.arXiv preprint arXiv:2504.07615,

  19. [19]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework.arXiv preprint arXiv:2409.19256,

  20. [20]

    Improving data efficiency for llm reinforcement fine-tuning through difficulty-targeted online data selection and rollout replay, 2026

    Yifan Sun, Jingyan Shen, Yibin Wang, Tianyu Chen, Zhendong Wang, et al. Improving data ef- ficiency for LLM reinforcement fine-tuning through difficulty-targeted online data selection and rollout replay.arXiv preprint arXiv:2506.05316,

  21. [21]

    Reinforcement learning optimization for large- scale learning: An efficient and user-friendly scaling library.arXiv preprint arXiv:2506.06122, 2025

    Weixun Wang, Shaopan Xiong, Gengru Chen, Wei Gao, Sheng Guo, Yancheng He, Ju Huang, Jiaheng Liu, Zhendong Li, Xiaoyang Li, Zichen Liu, Haizhou Zhao, et al. ROLL: Reinforcement learning optimization for large-scale learning: An efficient and user-friendly scaling library.arXiv preprint arXiv:2506.06122, 2025a. 13 Zihan Wang, Kangrui Wang, Qineng Wang, Ping...

  22. [22]

    arXiv preprint arXiv:2510.18927 , year=

    Zhiheng Xi, Xin Guo, Yang Nan, Enyu Zhou, Junrui Shen, et al. BAPO: Stabilizing off-policy reinforcement learning for LLMs via balanced policy optimization with adaptive clipping.arXiv preprint arXiv:2510.18927,

  23. [23]

    Part II: ROLL flash - accelerating RLVR and agentic training with asynchrony.CoRR, abs/2510.11345,

    Luo Yu, Zhiyuan Zeng, Jiaze Chen, Qiying Yu, Jian Li, Yuchi Zhang, Hang Yan, Bairen Yi, Ao Liu, Tao Ji, Zhipeng Chen, Dahua Lin, Junbo Zhao, and Zhi Zheng. ROLL Flash: Accelerating RLVR and agentic training with asynchrony.arXiv preprint arXiv:2510.11345, 2025a. Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohang C...

  24. [24]

    Prosperity before collapse: How far can off-policy rl reach with stale data on llms?arXiv preprint arXiv:2510.01161, 2025

    Haizhong Zheng, Jiawei Zhao, and Beidi Chen. Prosperity before collapse: How far can off-policy RL reach with stale data on LLMs?arXiv preprint arXiv:2510.01161,

  25. [25]

    Max Actions

    between distributionsPandQis defined as: Dα(P∥Q) = 1 α−1 logE Q P(x) Q(x) α .(22) Two standard properties are relevant here: Property 1: Connection toχ 2-divergence.Settingα= 2in Eq. (22): D2(P∥Q) = logE Q ρ2 = log 1 +χ 2(P∥Q) ,(23) where we usedE Q[ρ2] = 1 +χ 2 from Eq. (21). Rearranging: χ2(P∥Q) = exp(D 2(P∥Q))−1.(24) 17 Property 2: Monotonicity inα.Pre...

  26. [26]

    Con- fig A

    This serves as a control: the task is simple enough that on-policy training solves it quickly, and replay is not expected to help. GSM8K.Grade-school math word problems [Cobbe et al., 2021] requiring multi-step arithmetic reasoning. The base Qwen2.5-0.5B model already achieves>93% accuracy on this task, so it serves as a near-saturated control where repla...