pith. machine review for the scientific record.

arxiv: 2605.07331 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective

Changlong Yu, Chenlu Ye, Nan Jiang, Saurabh Sahu, Shuowei Jin, Wei Xiong, Yuheng Zhang

Pith reviewed 2026-05-11 01:14 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords importance sampling · policy optimization · LLM post-training · cumulative token ratio · variance reduction · reinforcement learning · off-policy estimation · mathematical reasoning

The pith

The cumulative token importance sampling ratio supplies unbiased prefix corrections with strictly lower variance than full-sequence ratios under token-level policy gradients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing importance sampling methods for LLM reinforcement learning create a bias-variance tradeoff: token-level ratios ignore prefix distribution shifts and introduce bias, while full-sequence ratios multiply every per-token factor and suffer high variance. The paper shows that the running product of per-token ratios up to the current position resolves the dilemma by delivering an exact unbiased correction for each token-level gradient term. This ratio is then paired with position-adaptive clipping whose bounds grow with the square root of token position to maintain consistent regularization strength across the sequence. The resulting CTPO algorithm is evaluated on tool-integrated mathematical reasoning benchmarks and outperforms both GRPO and GSPO baselines across model scales.
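To make the mechanism concrete, below is a minimal sketch, assuming per-token log-probabilities are available as tensors; the function name, the clip constant, and the tensor layout are illustrative assumptions, not the authors' implementation.

    import torch

    def cumulative_ratio_weights(logp_new, logp_old, clip_c=0.7):
        """Illustrative sketch: cumulative token IS ratios with
        position-adaptive, sqrt(t)-scaled log-space clipping.

        logp_new, logp_old: [batch, T] per-token log-probs of the sampled
        tokens under the current and behavior policies.
        clip_c: hypothetical base half-width of the log-space clip interval.
        """
        # Per-token log-ratios, then running (prefix) sums along the sequence:
        # log rho_cum_t = sum_{s <= t} [log pi_new(y_s) - log pi_old(y_s)].
        log_r = logp_new - logp_old                   # [batch, T]
        log_rho_cum = torch.cumsum(log_r, dim=-1)     # [batch, T]

        # Position-adaptive clipping: bounds grow like sqrt(t), matching the
        # natural sqrt(t) growth of the cumulative log-ratio's spread.
        t = torch.arange(1, log_r.shape[-1] + 1, device=log_r.device).float()
        bound = clip_c * torch.sqrt(t)                # [T]
        log_rho_clipped = torch.clamp(log_rho_cum, min=-bound, max=bound)

        # Exponentiate to obtain the weight applied to each token-level
        # policy-gradient term.
        return torch.exp(log_rho_clipped)

Each position t then reweights its own gradient term, so early tokens carry only a short prefix product rather than the full-sequence ratio.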

Core claim

Under the token-level policy-gradient formulation, the cumulative token IS ratio—the product of per-token importance sampling ratios up to position t—provides an unbiased prefix correction for each token-level gradient term and has strictly lower variance than the full sequence ratio. CTPO implements this ratio together with log-space clip bounds that scale proportionally to sqrt(t), yielding more uniform regularization across token positions and improved performance on mathematical reasoning tasks.
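In symbols, reconstructed from the abstract's description rather than the paper's own notation, the object and the clipping rule read roughly as follows.

    % Illustrative reconstruction; not the paper's exact notation.
    \rho^{\mathrm{cum}}_t
      \;=\; \prod_{s=1}^{t}
      \frac{\pi_\theta(y_s \mid x, y_{<s})}{\pi_{\mathrm{old}}(y_s \mid x, y_{<s})},
    \qquad
    g_t \;=\; \rho^{\mathrm{cum}}_t \,
      \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}) \, A_t,
    \qquad
    \bigl|\log \rho^{\mathrm{cum}}_t\bigr| \;\le\; c\sqrt{t}
    \;\;\text{(adaptive clip).}

Here the per-token ratio used by PPO/GRPO is the single factor at position t, and the full-sequence ratio is the cumulative product evaluated at t = T.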

What carries the argument

The cumulative token IS ratio, defined as the running product of per-token importance sampling ratios from sequence start to the current position t, which supplies prefix corrections for token-level gradients.

If this is right

  • Off-policy updates become feasible at the token level without the bias of simple token ratios or the variance explosion of full-sequence products.
  • Position-adaptive clipping maintains comparable regularization strength at every token index rather than over- or under-clipping later positions.
  • The method delivers higher average accuracy on challenging mathematical reasoning benchmarks than GRPO and GSPO across multiple model sizes.
  • Training stability improves because the importance weight no longer grows multiplicatively with full sequence length.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same cumulative-ratio construction could be applied to other autoregressive sequence models trained with reinforcement learning beyond language models.
  • If the variance reduction holds in practice, it may permit larger batch sizes or learning rates without additional gradient clipping.
  • The approach suggests examining whether other sequential RL settings with long trajectories benefit from prefix-product corrections instead of full-trajectory weights.

Load-bearing premise

The token-level policy-gradient formulation accurately captures the LLM post-training objective, and the sqrt(t)-scaled clipping does not introduce new bias or instability.

What would settle it

An experiment that computes gradient variance on held-out trajectories using both cumulative and full-sequence ratios while keeping the policy and data fixed; the cumulative ratio should show measurably lower variance without degrading final task performance.
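A minimal sketch of that check, assuming per-token log-probabilities under the fixed behavior policy and the current policy are available for held-out trajectories; names and the padding convention are illustrative, not from the paper.

    import torch

    def compare_ratio_variance(logp_new, logp_old, mask):
        """Empirical variance of the cumulative token ratio (per position)
        vs. the full-sequence ratio, on fixed held-out data.

        logp_new, logp_old: [N, T] per-token log-probs for N trajectories.
        mask: [N, T] float, 1.0 for real tokens and 0.0 for padding.
        """
        log_r = (logp_new - logp_old) * mask
        log_rho_cum = torch.cumsum(log_r, dim=-1)         # [N, T]

        # Full-sequence log-ratio = cumulative log-ratio at the last real token.
        lengths = mask.sum(dim=-1).long()                 # [N]
        idx = (lengths - 1).clamp(min=0).unsqueeze(-1)    # [N, 1]
        log_rho_seq = log_rho_cum.gather(-1, idx).squeeze(-1)

        # The paper's claim predicts lower variance for the cumulative ratio
        # at positions t < T than for the full-sequence ratio.
        var_cum_by_t = torch.exp(log_rho_cum).var(dim=0)  # [T]
        var_seq = torch.exp(log_rho_seq).var()
        return var_cum_by_t, var_seq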

Figures

Figures reproduced from arXiv: 2605.07331 by Changlong Yu, Chenlu Ye, Nan Jiang, Saurabh Sahu, Shuowei Jin, Wei Xiong, Yuheng Zhang.

Figure 1
Figure 1: Analysis of log ρ_t^cum across training steps 50, 100, and 150. Top row: empirical standard deviation of log ρ_t^cum vs. token position t, fitted with σ̂√t, confirming the log-space variance growth discussed in Section 3.3. Bottom row: clip rate vs. position under fixed clipping (ratio ∈ [0.5, 5]) and adaptive clipping. The fixed clip rate grows monotonically with t, while adaptive clipping maintains a s… view at source ↗
Figure 2
Figure 2: Training dynamics of GRPO, GSPO, and CTPO. view at source ↗
read the original abstract

Reinforcement learning, including reinforcement learning with verifiable rewards (RLVR), has emerged as a powerful approach for LLM post-training. Central to these approaches is the design of the importance sampling (IS) ratio used in off-policy policy-gradient estimation. Existing methods face a fundamental bias-variance dilemma: token-level IS ratios, as adopted by PPO (Schulman et al., 2017) and GRPO (Shao et al., 2024), introduce bias by ignoring prefix state distribution mismatch; full sequence ratios provide exact trajectory-level correction but suffer from high variance due to the multiplicative accumulation of per-token ratios, while GSPO (Zheng et al., 2025) improves numerical stability via length normalization at the cost of deviating from the exact full-sequence IS correction. In this work, we identify the cumulative token IS ratio, the product of per-token ratios up to position $t$, as a theoretically principled solution to this dilemma. We prove that, under the token-level policy-gradient formulation, this ratio provides an unbiased prefix correction for each token-level gradient term and has strictly lower variance than the full sequence ratio. Building on this insight, we propose CTPO (Cumulative Token Policy Optimization), which combines the cumulative token IS ratio with position-adaptive clipping that scales log-space clip bounds according to the natural $\sqrt{t}$ growth of the cumulative log-ratio. This yields more consistent regularization across token positions. We implement and evaluate CTPO in the tool-integrated reasoning setting on several challenging mathematical reasoning benchmarks, achieving the best average performance across both model scales compared with strong GRPO and GSPO baselines. Code will be available at https://github.com/horizon-llm/CTPO.
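For orientation, the importance weights contrasted in the abstract can be written side by side; this is a reconstruction in illustrative notation, and in particular GSPO's length normalization is shown here in a geometric-mean form that may differ in detail from the paper's definition.

    % Illustrative reconstruction of the ratios discussed in the abstract.
    r_t = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\mathrm{old}}(y_t \mid x, y_{<t})}
      \quad \text{(token-level, PPO/GRPO: ignores prefix shift)}
    \qquad
    \rho^{\mathrm{seq}} = \prod_{t=1}^{T} r_t
      \quad \text{(full sequence: exact, high variance)}
    \\
    \rho^{\mathrm{GSPO}} = \Bigl(\prod_{t=1}^{T} r_t\Bigr)^{1/T}
      \quad \text{(length-normalized: stable, not exact)}
    \qquad
    \rho^{\mathrm{cum}}_t = \prod_{s=1}^{t} r_s
      \quad \text{(cumulative: prefix-exact per token term).}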

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript identifies a bias-variance dilemma in importance sampling ratios for off-policy policy-gradient estimation in LLM post-training. It claims that the cumulative token IS ratio (product of per-token ratios up to position t) provides an unbiased prefix correction for each token-level gradient term and strictly lower variance than the full-sequence ratio, under the token-level policy-gradient formulation. The authors propose CTPO, which combines this ratio with position-adaptive clipping scaled by sqrt(t) growth of the cumulative log-ratio, and report that it achieves the best average performance on tool-integrated mathematical reasoning benchmarks relative to GRPO and GSPO baselines.

Significance. If the theoretical result holds under the stated formulation and the empirical gains are reproducible, the work could offer a principled middle path between biased token-level ratios and high-variance sequence-level ratios, improving stability in LLM reinforcement post-training. The position-adaptive clipping addresses a concrete practical issue in long trajectories. The evaluation on challenging math benchmarks provides initial evidence of utility, though quantitative variance measurements and ablations would strengthen the assessment.

major comments (3)
  1. [Abstract] Abstract: the central claim asserts a proof that the cumulative token IS ratio is unbiased and has strictly lower variance than the full sequence ratio under the token-level policy-gradient formulation, yet no derivation, explicit gradient expressions, variance formulas, or list of assumptions (e.g., on state distributions or return structure) is supplied. This is load-bearing for the entire contribution.
  2. [Abstract] Abstract: the unbiasedness and variance results are derived only for the token-level policy-gradient formulation, but standard RLVR/math-reasoning objectives use sequence-level rewards received after the full trajectory. The manuscript does not show whether the per-token prefix correction remains unbiased when the advantage is computed over the entire sequence, which directly affects applicability to the reported experiments.
  3. [CTPO] CTPO proposal: the position-adaptive clipping with sqrt(t) scaling is presented as yielding consistent regularization, but no analysis or ablation demonstrates that this scaling preserves unbiasedness or does not introduce instability; without such evidence the component remains heuristic and load-bearing for the claimed performance gains.
minor comments (2)
  1. The abstract states that CTPO achieves the best average performance but supplies no specific benchmark names, accuracy numbers, or variance measurements; adding these quantitative details would let readers gauge the magnitude of improvement.
  2. Notation for the cumulative token IS ratio and the adaptive clip bounds should be defined with an equation early in the text to improve readability before the proof is invoked.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help strengthen the theoretical and empirical aspects of the work. We address each major comment below and will revise the manuscript to incorporate clarifications, derivations, and additional analyses.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim asserts a proof that the cumulative token IS ratio is unbiased and has strictly lower variance than the full sequence ratio under the token-level policy-gradient formulation, yet no derivation, explicit gradient expressions, variance formulas, or list of assumptions (e.g., on state distributions or return structure) is supplied. This is load-bearing for the entire contribution.

    Authors: We agree that the proof is central and should be presented more explicitly for clarity. In the revised manuscript, we will move the full derivation of unbiasedness and the variance comparison into the main text (expanding Section 3), including explicit token-level gradient expressions, the variance formulas under the token-level formulation, and a clear list of assumptions regarding state distributions and return structure. This will make the theoretical claims self-contained without relying solely on the appendix. revision: yes

  2. Referee: [Abstract] Abstract: the unbiasedness and variance results are derived only for the token-level policy-gradient formulation, but standard RLVR/math-reasoning objectives use sequence-level rewards received after the full trajectory. The manuscript does not show whether the per-token prefix correction remains unbiased when the advantage is computed over the entire sequence, which directly affects applicability to the reported experiments.

    Authors: This point correctly identifies a gap in bridging the token-level theory to the sequence-level reward setting used in the experiments. While the cumulative token IS ratio provides prefix correction for the state distribution up to position t, we will add a dedicated subsection discussing its application with sequence-level advantages. We will show that the per-token gradient terms remain unbiased under the prefix correction even when the advantage is the full-trajectory return (as the IS ratio corrects the visitation up to t independently of the return structure). We will also include a note on limitations and empirical variance measurements on the math benchmarks to support applicability. revision: partial

  3. Referee: [CTPO] CTPO proposal: the position-adaptive clipping with sqrt(t) scaling is presented as yielding consistent regularization, but no analysis or ablation demonstrates that this scaling preserves unbiasedness or does not introduce instability; without such evidence the component remains heuristic and load-bearing for the claimed performance gains.

    Authors: We acknowledge that the sqrt(t) scaling for clipping bounds is a practical design choice motivated by the growth of the cumulative log-ratio and would benefit from further justification. In the revision, we will add an ablation study evaluating alternative scalings (constant, linear in t, and sqrt(t)) on both performance and training stability metrics. We will also include a short analysis showing that the scaling preserves the unbiasedness of the IS ratio (as clipping is applied after ratio computation) while reducing position-dependent variance in the effective regularization strength, supported by the new empirical results. revision: yes

Circularity Check

0 steps flagged

No circularity: proof is self-contained under explicit token-level assumption

full rationale

The paper's core derivation is a mathematical proof establishing unbiasedness and variance reduction for the cumulative token IS ratio, conditioned explicitly on the token-level policy-gradient formulation. This follows from standard importance-sampling identities applied to per-token terms and does not reduce to any fitted parameter, self-citation chain, or redefinition of the target quantity. Cited prior methods (GRPO, GSPO) supply context for the bias-variance dilemma but are not invoked as load-bearing uniqueness theorems or ansatzes. The position-adaptive clipping is introduced as a practical heuristic derived from the observed sqrt(t) scaling of cumulative log-ratios, without circular dependence on the result itself. The derivation therefore remains independent of its outputs and is self-contained within the stated modeling assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the token-level policy-gradient formulation as the setting for the proof; no free parameters, invented entities, or additional ad-hoc axioms are mentioned in the abstract.

axioms (1)
  • domain assumption: the token-level policy-gradient formulation holds for the LLM optimization setting
    The proof of unbiased prefix correction is stated to hold under this formulation.

pith-pipeline@v0.9.0 · 5625 in / 1146 out tokens · 41551 ms · 2026-05-11T01:14:49.998679+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 18 internal anchors

  1. [1] GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.
  2. [2] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948.
  3. [3] GPT-4o System Card. arXiv preprint arXiv:2410.21276.
  4. [4] Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv preprint arXiv:2204.05862.
  5. [5] Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
  6. [6] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300.
  7. [7] Shaotian Yan, Kaiyuan Liu, Chen Shen, Bing Wang, Sinan Fan, Jun Zhang, Yue Wu, Zheng Wang, and Jieping Ye. Learning to Reason under Off-Policy Guidance. arXiv preprint arXiv:2504.14945.
  8. [8] DAPO: An Open-Source LLM Reinforcement Learning System at Scale. arXiv preprint arXiv:2503.14476.
  9. [9] Human-Level Control through Deep Reinforcement Learning. Nature, 2015.
  10. [10] Enhancing Efficiency and Exploration in Reinforcement Learning for LLMs. arXiv preprint arXiv:2505.18573.
  11. [11] Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL. arXiv preprint arXiv:2505.02391.
  12. [12] Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration. arXiv preprint arXiv:2508.13755.
  13. [13] Improving Data Efficiency for LLM Reinforcement Fine-Tuning through Difficulty-Targeted Online Data Selection and Rollout Replay. arXiv preprint arXiv:2506.05316.
  14. [14] RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning. arXiv preprint arXiv:2507.07451.
  15. [15] Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems.
  16. [16] Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. Advances in Neural Information Processing Systems.
  17. [17] KTO: Model Alignment as Prospect Theoretic Optimization. arXiv preprint arXiv:2402.01306.
  18. [18] ORPO: Monolithic Preference Optimization without Reference Model. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 11170–11189 (arXiv:2403.07691).
  19. [19] SimPO: Simple Preference Optimization with a Reference-Free Reward. Advances in Neural Information Processing Systems.
  20. [20] RLHF Workflow: From Reward Modeling to Online RLHF. arXiv preprint arXiv:2405.07863.
  21. [21] Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF. arXiv preprint arXiv:2405.21046.
  22. [22] A General Theoretical Paradigm to Understand Learning from Human Preferences. International Conference on Artificial Intelligence and Statistics, 2024.
  23. [23] Nash Learning from Human Feedback. arXiv preprint arXiv:2312.00886.
  24. [24] Self-Play Preference Optimization for Language Model Alignment. arXiv preprint arXiv:2405.00675.
  25. [25] Online Iterative Reinforcement Learning from Human Feedback with General Preference Model. Advances in Neural Information Processing Systems.
  26. [26] Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning. arXiv preprint arXiv:2407.00617.
  27. [27] Improving LLM General Preference Alignment via Optimistic Online Mirror Descent. arXiv preprint arXiv:2502.16852.
  28. [28] OpenAI o1 System Card. arXiv preprint arXiv:2412.16720.
  29. [29] Group Sequence Policy Optimization. arXiv preprint arXiv:2507.18071.
  30. [30] Understanding R1-Zero-Like Training: A Critical Perspective. arXiv preprint arXiv:2503.20783.
  31. [31] Reinforcement Learning: An Introduction. 1998.
  32. [32] Kimi k1.5: Scaling Reinforcement Learning with LLMs. arXiv preprint arXiv:2501.12599.
  33. [33] GPQA: A Graduate-Level Google-Proof Q&A Benchmark. First Conference on Language Modeling.
  34. [34] Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374.
  35. [35] Measuring Mathematical Problem Solving with the MATH Dataset. arXiv preprint arXiv:2103.03874.
  36. [36] HybridFlow: A Flexible and Efficient RLHF Framework. Proceedings of the Twentieth European Conference on Computer Systems.
  37. [37] Solving Quantitative Reasoning Problems with Language Models. Advances in Neural Information Processing Systems.
  38. [38] OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems. arXiv preprint arXiv:2402.14008.
  39. [39] Qwen2 Technical Report. arXiv preprint arXiv:2407.10671.
  40. [40] The Llama 3 Herd of Models. arXiv e-prints.
  41. [41] Eligibility Traces for Off-Policy Policy Evaluation.
  42. [42] Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning. Journal of Machine Learning Research.
  43. [43] Doubly Robust Off-Policy Value Evaluation for Reinforcement Learning. International Conference on Machine Learning, 2016.
  44. [44] Safe Reinforcement Learning. 2015.
  45. [45] Soft Adaptive Policy Optimization. arXiv preprint arXiv:2511.20347.
  46. [46] SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning. arXiv preprint arXiv:2509.02479.
  47. [47] Qwen3 Technical Report. arXiv preprint arXiv:2505.09388.
  48. [48] DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL.
  49. [49] Improving Sampling Efficiency in RLVR through Adaptive Rollout and Response Reuse. arXiv preprint arXiv:2509.25808.