pith. machine review for the scientific record.

arxiv: 2605.07331 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective

Changlong Yu, Chenlu Ye, Nan Jiang, Saurabh Sahu, Shuowei Jin, Wei Xiong, Yuheng Zhang

Pith reviewed 2026-05-11 01:14 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords importance sampling · policy optimization · LLM post-training · cumulative token ratio · variance reduction · reinforcement learning · off-policy estimation · mathematical reasoning

The pith

The cumulative token importance sampling ratio supplies unbiased prefix corrections with strictly lower variance than full-sequence ratios under token-level policy gradients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing importance sampling methods for LLM reinforcement learning create a bias-variance tradeoff: token-level ratios ignore prefix distribution shifts and introduce bias, while full-sequence ratios multiply every per-token factor and suffer high variance. The paper shows that the running product of per-token ratios up to the current position resolves the dilemma by delivering an exact unbiased correction for each token-level gradient term. This ratio is then paired with position-adaptive clipping whose bounds grow with the square root of token position to maintain consistent regularization strength across the sequence. The resulting CTPO algorithm is evaluated on tool-integrated mathematical reasoning benchmarks and outperforms both GRPO and GSPO baselines across model scales.
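To make the mechanism concrete, below is a minimal sketch, assuming per-token log-probabilities are available as tensors; the function name, the clip constant, and the tensor layout are illustrative assumptions, not the authors' implementation.

    import torch

    def cumulative_ratio_weights(logp_new, logp_old, clip_c=0.7):
        """Illustrative sketch: cumulative token IS ratios with
        position-adaptive, sqrt(t)-scaled log-space clipping.

        logp_new, logp_old: [batch, T] per-token log-probs of the sampled
        tokens under the current and behavior policies.
        clip_c: hypothetical base half-width of the log-space clip interval.
        """
        # Per-token log-ratios, then running (prefix) sums along the sequence:
        # log rho_cum_t = sum_{s <= t} [log pi_new(y_s) - log pi_old(y_s)].
        log_r = logp_new - logp_old                   # [batch, T]
        log_rho_cum = torch.cumsum(log_r, dim=-1)     # [batch, T]

        # Position-adaptive clipping: bounds grow like sqrt(t), matching the
        # natural sqrt(t) growth of the cumulative log-ratio's spread.
        t = torch.arange(1, log_r.shape[-1] + 1, device=log_r.device).float()
        bound = clip_c * torch.sqrt(t)                # [T]
        log_rho_clipped = torch.clamp(log_rho_cum, min=-bound, max=bound)

        # Exponentiate to obtain the weight applied to each token-level
        # policy-gradient term.
        return torch.exp(log_rho_clipped)

Each position t then reweights its own gradient term, so early tokens carry only a short prefix product rather than the full-sequence ratio.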

Core claim

Under the token-level policy-gradient formulation, the cumulative token IS ratio—the product of per-token importance sampling ratios up to position t—provides an unbiased prefix correction for each token-level gradient term and has strictly lower variance than the full sequence ratio. CTPO implements this ratio together with log-space clip bounds that scale proportionally to sqrt(t), yielding more uniform regularization across token positions and improved performance on mathematical reasoning tasks.
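In symbols, reconstructed from the abstract's description rather than the paper's own notation, the object and the clipping rule read roughly as follows.

    % Illustrative reconstruction; not the paper's exact notation.
    \rho^{\mathrm{cum}}_t
      \;=\; \prod_{s=1}^{t}
      \frac{\pi_\theta(y_s \mid x, y_{<s})}{\pi_{\mathrm{old}}(y_s \mid x, y_{<s})},
    \qquad
    g_t \;=\; \rho^{\mathrm{cum}}_t \,
      \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}) \, A_t,
    \qquad
    \bigl|\log \rho^{\mathrm{cum}}_t\bigr| \;\le\; c\sqrt{t}
    \;\;\text{(adaptive clip).}

Here the per-token ratio used by PPO/GRPO is the single factor at position t, and the full-sequence ratio is the cumulative product evaluated at t = T.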

What carries the argument

The cumulative token IS ratio, defined as the running product of per-token importance sampling ratios from sequence start to the current position t, which supplies prefix corrections for token-level gradients.

If this is right

  • Off-policy updates become feasible at the token level without the bias of simple token ratios or the variance explosion of full-sequence products.
  • Position-adaptive clipping maintains comparable regularization strength at every token index rather than over- or under-clipping later positions.
  • The method delivers higher average accuracy on challenging mathematical reasoning benchmarks than GRPO and GSPO across multiple model sizes.
  • Training stability improves because the importance weight no longer grows multiplicatively with full sequence length.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same cumulative-ratio construction could be applied to other autoregressive sequence models trained with reinforcement learning beyond language models.
  • If the variance reduction holds in practice, it may permit larger batch sizes or learning rates without additional gradient clipping.
  • The approach suggests examining whether other sequential RL settings with long trajectories benefit from prefix-product corrections instead of full-trajectory weights.

Load-bearing premise

The token-level policy-gradient formulation accurately captures the LLM post-training objective, and the sqrt(t)-scaled clipping does not introduce new bias or instability.

What would settle it

An experiment that computes gradient variance on held-out trajectories using both cumulative and full-sequence ratios while keeping the policy and data fixed; the cumulative ratio should show measurably lower variance without degrading final task performance.
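A minimal sketch of that check, assuming per-token log-probabilities under the fixed behavior policy and the current policy are available for held-out trajectories; names and the padding convention are illustrative, not from the paper.

    import torch

    def compare_ratio_variance(logp_new, logp_old, mask):
        """Empirical variance of the cumulative token ratio (per position)
        vs. the full-sequence ratio, on fixed held-out data.

        logp_new, logp_old: [N, T] per-token log-probs for N trajectories.
        mask: [N, T] float, 1.0 for real tokens and 0.0 for padding.
        """
        log_r = (logp_new - logp_old) * mask
        log_rho_cum = torch.cumsum(log_r, dim=-1)         # [N, T]

        # Full-sequence log-ratio = cumulative log-ratio at the last real token.
        lengths = mask.sum(dim=-1).long()                 # [N]
        idx = (lengths - 1).clamp(min=0).unsqueeze(-1)    # [N, 1]
        log_rho_seq = log_rho_cum.gather(-1, idx).squeeze(-1)

        # The paper's claim predicts lower variance for the cumulative ratio
        # at positions t < T than for the full-sequence ratio.
        var_cum_by_t = torch.exp(log_rho_cum).var(dim=0)  # [T]
        var_seq = torch.exp(log_rho_seq).var()
        return var_cum_by_t, var_seq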

Figures

Figures reproduced from arXiv: 2605.07331 by Changlong Yu, Chenlu Ye, Nan Jiang, Saurabh Sahu, Shuowei Jin, Wei Xiong, Yuheng Zhang.

Figure 1
Figure 1: Analysis of log ρ_t^cum across training steps 50, 100, and 150. Top row: empirical standard deviation of log ρ_t^cum vs. token position t, fitted with σ̂√t, confirming the log-space variance growth discussed in Section 3.3. Bottom row: clip rate vs. position under fixed clipping (ratio ∈ [0.5, 5]) and adaptive clipping. The fixed clip rate grows monotonically with t, while adaptive clipping maintains a s… view at source ↗
Figure 2
Figure 2: Training dynamics of GRPO, GSPO, and CTPO. view at source ↗
read the original abstract

Reinforcement learning, including reinforcement learning with verifiable rewards (RLVR), has emerged as a powerful approach for LLM post-training. Central to these approaches is the design of the importance sampling (IS) ratio used in off-policy policy-gradient estimation. Existing methods face a fundamental bias-variance dilemma: token-level IS ratios, as adopted by PPO (Schulman et al., 2017) and GRPO (Shao et al., 2024), introduce bias by ignoring prefix state distribution mismatch; full sequence ratios provide exact trajectory-level correction but suffer from high variance due to the multiplicative accumulation of per-token ratios, while GSPO (Zheng et al., 2025) improves numerical stability via length normalization at the cost of deviating from the exact full-sequence IS correction. In this work, we identify the cumulative token IS ratio, the product of per-token ratios up to position $t$, as a theoretically principled solution to this dilemma. We prove that, under the token-level policy-gradient formulation, this ratio provides an unbiased prefix correction for each token-level gradient term and has strictly lower variance than the full sequence ratio. Building on this insight, we propose CTPO (Cumulative Token Policy Optimization), which combines the cumulative token IS ratio with position-adaptive clipping that scales log-space clip bounds according to the natural $\sqrt{t}$ growth of the cumulative log-ratio. This yields more consistent regularization across token positions. We implement and evaluate CTPO in the tool-integrated reasoning setting on several challenging mathematical reasoning benchmarks, achieving the best average performance across both model scales compared with strong GRPO and GSPO baselines. Code will be available at https://github.com/horizon-llm/CTPO.
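For orientation, the importance weights contrasted in the abstract can be written side by side; this is a reconstruction in illustrative notation, and in particular GSPO's length normalization is shown here in a geometric-mean form that may differ in detail from the paper's definition.

    % Illustrative reconstruction of the ratios discussed in the abstract.
    r_t = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\mathrm{old}}(y_t \mid x, y_{<t})}
      \quad \text{(token-level, PPO/GRPO: ignores prefix shift)}
    \qquad
    \rho^{\mathrm{seq}} = \prod_{t=1}^{T} r_t
      \quad \text{(full sequence: exact, high variance)}
    \\
    \rho^{\mathrm{GSPO}} = \Bigl(\prod_{t=1}^{T} r_t\Bigr)^{1/T}
      \quad \text{(length-normalized: stable, not exact)}
    \qquad
    \rho^{\mathrm{cum}}_t = \prod_{s=1}^{t} r_s
      \quad \text{(cumulative: prefix-exact per token term).}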

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript identifies a bias-variance dilemma in importance sampling ratios for off-policy policy-gradient estimation in LLM post-training. It claims that the cumulative token IS ratio (product of per-token ratios up to position t) provides an unbiased prefix correction for each token-level gradient term and strictly lower variance than the full-sequence ratio, under the token-level policy-gradient formulation. The authors propose CTPO, which combines this ratio with position-adaptive clipping scaled by sqrt(t) growth of the cumulative log-ratio, and report that it achieves the best average performance on tool-integrated mathematical reasoning benchmarks relative to GRPO and GSPO baselines.

Significance. If the theoretical result holds under the stated formulation and the empirical gains are reproducible, the work could offer a principled middle path between biased token-level ratios and high-variance sequence-level ratios, improving stability in LLM reinforcement post-training. The position-adaptive clipping addresses a concrete practical issue in long trajectories. The evaluation on challenging math benchmarks provides initial evidence of utility, though quantitative variance measurements and ablations would strengthen the assessment.

major comments (3)
  1. [Abstract] Abstract: the central claim asserts a proof that the cumulative token IS ratio is unbiased and has strictly lower variance than the full sequence ratio under the token-level policy-gradient formulation, yet no derivation, explicit gradient expressions, variance formulas, or list of assumptions (e.g., on state distributions or return structure) is supplied. This is load-bearing for the entire contribution.
  2. [Abstract] Abstract: the unbiasedness and variance results are derived only for the token-level policy-gradient formulation, but standard RLVR/math-reasoning objectives use sequence-level rewards received after the full trajectory. The manuscript does not show whether the per-token prefix correction remains unbiased when the advantage is computed over the entire sequence, which directly affects applicability to the reported experiments.
  3. [CTPO] CTPO proposal: the position-adaptive clipping with sqrt(t) scaling is presented as yielding consistent regularization, but no analysis or ablation demonstrates that this scaling preserves unbiasedness or does not introduce instability; without such evidence the component remains heuristic and load-bearing for the claimed performance gains.
minor comments (2)
  1. The abstract states that CTPO achieves the best average performance but supplies no specific benchmark names, accuracy numbers, or variance measurements; adding these quantitative details would let readers gauge the magnitude of improvement.
  2. Notation for the cumulative token IS ratio and the adaptive clip bounds should be defined with an equation early in the text to improve readability before the proof is invoked.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help strengthen the theoretical and empirical aspects of the work. We address each major comment below and will revise the manuscript to incorporate clarifications, derivations, and additional analyses.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim asserts a proof that the cumulative token IS ratio is unbiased and has strictly lower variance than the full sequence ratio under the token-level policy-gradient formulation, yet no derivation, explicit gradient expressions, variance formulas, or list of assumptions (e.g., on state distributions or return structure) is supplied. This is load-bearing for the entire contribution.

    Authors: We agree that the proof is central and should be presented more explicitly for clarity. In the revised manuscript, we will move the full derivation of unbiasedness and the variance comparison into the main text (expanding Section 3), including explicit token-level gradient expressions, the variance formulas under the token-level formulation, and a clear list of assumptions regarding state distributions and return structure. This will make the theoretical claims self-contained without relying solely on the appendix. revision: yes

  2. Referee: [Abstract] Abstract: the unbiasedness and variance results are derived only for the token-level policy-gradient formulation, but standard RLVR/math-reasoning objectives use sequence-level rewards received after the full trajectory. The manuscript does not show whether the per-token prefix correction remains unbiased when the advantage is computed over the entire sequence, which directly affects applicability to the reported experiments.

    Authors: This point correctly identifies a gap in bridging the token-level theory to the sequence-level reward setting used in the experiments. While the cumulative token IS ratio provides prefix correction for the state distribution up to position t, we will add a dedicated subsection discussing its application with sequence-level advantages. We will show that the per-token gradient terms remain unbiased under the prefix correction even when the advantage is the full-trajectory return (as the IS ratio corrects the visitation up to t independently of the return structure). We will also include a note on limitations and empirical variance measurements on the math benchmarks to support applicability. revision: partial

  3. Referee: [CTPO] CTPO proposal: the position-adaptive clipping with sqrt(t) scaling is presented as yielding consistent regularization, but no analysis or ablation demonstrates that this scaling preserves unbiasedness or does not introduce instability; without such evidence the component remains heuristic and load-bearing for the claimed performance gains.

    Authors: We acknowledge that the sqrt(t) scaling for clipping bounds is a practical design choice motivated by the growth of the cumulative log-ratio and would benefit from further justification. In the revision, we will add an ablation study evaluating alternative scalings (constant, linear in t, and sqrt(t)) on both performance and training stability metrics. We will also include a short analysis showing that the scaling preserves the unbiasedness of the IS ratio (as clipping is applied after ratio computation) while reducing position-dependent variance in the effective regularization strength, supported by the new empirical results. revision: yes

Circularity Check

0 steps flagged

No circularity: proof is self-contained under explicit token-level assumption

full rationale

The paper's core derivation is a mathematical proof establishing unbiasedness and variance reduction for the cumulative token IS ratio, conditioned explicitly on the token-level policy-gradient formulation. This follows from standard importance-sampling identities applied to per-token terms and does not reduce to any fitted parameter, self-citation chain, or redefinition of the target quantity. Cited prior methods (GRPO, GSPO) supply context for the bias-variance dilemma but are not invoked as load-bearing uniqueness theorems or ansatzes. The position-adaptive clipping is introduced as a practical heuristic derived from the observed sqrt(t) scaling of cumulative log-ratios, without circular dependence on the result itself. The derivation therefore remains independent of its outputs and is self-contained within the stated modeling assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the token-level policy-gradient formulation as the setting for the proof; no free parameters, invented entities, or additional ad-hoc axioms are mentioned in the abstract.

axioms (1)
  • domain assumption: the token-level policy-gradient formulation holds for the LLM optimization setting
    The proof of unbiased prefix correction is stated to hold under this formulation.

pith-pipeline@v0.9.0 · 5625 in / 1146 out tokens · 41551 ms · 2026-05-11T01:14:49.998679+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 18 internal anchors

  1. [1] GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.
  2. [2] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948.
  3. [3] GPT-4o System Card. arXiv preprint arXiv:2410.21276.
  4. [4] Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv preprint arXiv:2204.05862.
  5. [5] Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
  6. [6] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300.
  7. [7] Shaotian Yan, Kaiyuan Liu, Chen Shen, Bing Wang, Sinan Fan, Jun Zhang, Yue Wu, Zheng Wang, and Jieping Ye. Learning to Reason under Off-Policy Guidance. arXiv preprint arXiv:2504.14945.
  8. [8] DAPO: An Open-Source LLM Reinforcement Learning System at Scale. arXiv preprint arXiv:2503.14476.
  9. [9] Human-Level Control through Deep Reinforcement Learning. Nature, 2015.
  10. [10] Enhancing Efficiency and Exploration in Reinforcement Learning for LLMs. arXiv preprint arXiv:2505.18573.
  11. [11] Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL. arXiv preprint arXiv:2505.02391.
  12. [12] Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration. arXiv preprint arXiv:2508.13755.
  13. [13] Improving Data Efficiency for LLM Reinforcement Fine-Tuning through Difficulty-Targeted Online Data Selection and Rollout Replay. arXiv preprint arXiv:2506.05316.
  14. [14] RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning. arXiv preprint arXiv:2507.07451.
  15. [15] Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems.
  16. [16] Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. Advances in Neural Information Processing Systems.
  17. [17] KTO: Model Alignment as Prospect Theoretic Optimization. arXiv preprint arXiv:2402.01306.
  18. [18] ORPO: Monolithic Preference Optimization without Reference Model. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 11170–11189 (arXiv:2403.07691).
  19. [19] SimPO: Simple Preference Optimization with a Reference-Free Reward. Advances in Neural Information Processing Systems.
  20. [20] RLHF Workflow: From Reward Modeling to Online RLHF. arXiv preprint arXiv:2405.07863.
  21. [21] Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF. arXiv preprint arXiv:2405.21046.
  22. [22] A General Theoretical Paradigm to Understand Learning from Human Preferences. International Conference on Artificial Intelligence and Statistics, 2024.
  23. [23] Nash Learning from Human Feedback. arXiv preprint arXiv:2312.00886.
  24. [24] Self-Play Preference Optimization for Language Model Alignment. arXiv preprint arXiv:2405.00675.
  25. [25] Online Iterative Reinforcement Learning from Human Feedback with General Preference Model. Advances in Neural Information Processing Systems.
  26. [26] Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning. arXiv preprint arXiv:2407.00617.
  27. [27] Improving LLM General Preference Alignment via Optimistic Online Mirror Descent. arXiv preprint arXiv:2502.16852.
  28. [28] OpenAI o1 System Card. arXiv preprint arXiv:2412.16720.
  29. [29] Group Sequence Policy Optimization. arXiv preprint arXiv:2507.18071.
  30. [30] Understanding R1-Zero-Like Training: A Critical Perspective. arXiv preprint arXiv:2503.20783.
  31. [31] Reinforcement Learning: An Introduction. 1998.
  32. [32] Kimi k1.5: Scaling Reinforcement Learning with LLMs. arXiv preprint arXiv:2501.12599.
  33. [33] GPQA: A Graduate-Level Google-Proof Q&A Benchmark. First Conference on Language Modeling.
  34. [34] Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374.
  35. [35] Measuring Mathematical Problem Solving with the MATH Dataset. arXiv preprint arXiv:2103.03874.
  36. [36] HybridFlow: A Flexible and Efficient RLHF Framework. Proceedings of the Twentieth European Conference on Computer Systems.
  37. [37] Solving Quantitative Reasoning Problems with Language Models. Advances in Neural Information Processing Systems.
  38. [38] OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems. arXiv preprint arXiv:2402.14008.
  39. [39] Qwen2 Technical Report. arXiv preprint arXiv:2407.10671.
  40. [40] The Llama 3 Herd of Models. arXiv e-prints.
  41. [41] Eligibility Traces for Off-Policy Policy Evaluation.
  42. [42] Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning. Journal of Machine Learning Research.
  43. [43] Doubly Robust Off-Policy Value Evaluation for Reinforcement Learning. International Conference on Machine Learning, 2016.
  44. [44] Safe Reinforcement Learning. 2015.
  45. [45] Soft Adaptive Policy Optimization. arXiv preprint arXiv:2511.20347.
  46. [46] SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning. arXiv preprint arXiv:2509.02479.
  47. [47] Qwen3 Technical Report. arXiv preprint arXiv:2505.09388.
  48. [48] DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL.
  49. [49] Improving Sampling Efficiency in RLVR through Adaptive Rollout and Response Reuse. arXiv preprint arXiv:2509.25808.