VIMPO: Value-Implicit Policy Optimization for LLMs

Aosong Feng; Dawn Song; Sergey Levine; Xuandong Zhao; Zhewei Kang

arxiv: 2606.20008 · v1 · pith:6Z6SGOOZnew · submitted 2026-06-18 · 💻 cs.LG

Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song, Xuandong Zhao

Pith reviewed 2026-06-26 17:57 UTC · model grok-4.3

classification 💻 cs.LG

keywords VIMPO · policy optimization · large language models · reinforcement learning with verifiable rewards · critic-free · value function · GRPO · mathematical reasoning

The pith

VIMPO derives a policy-implied value function from KL-regularized RL optimality conditions to enable critic-free optimization for LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

VIMPO addresses the trade-off in reinforcement learning for LLM reasoning by avoiding both the coarse credit assignment of group-relative methods and the instability of learned critics. It derives an implicit value function directly from the optimality conditions of KL-regularized reinforcement learning. For autoregressive sequences, this value follows a recurrence expressed using the log-ratio between the current policy and a reference model, with the recurrence terminating at zero because no reward remains after the sequence ends. The resulting value loss incorporates outcome rewards without a separate value network, while a separate actor update handles policy improvement in a PPO-like manner. This leads to better results than GRPO on math reasoning benchmarks and greater robustness to reward noise.

Core claim

VIMPO derives a policy-implied value function from the optimality conditions of KL-regularized reinforcement learning. For autoregressive generation, the resulting value recurrence can be written in terms of policy-reference log-ratios and anchored by the terminal condition that no future reward remains at the end of a trajectory. This gives a simple value loss that incorporates outcome-level verifiable rewards without training a critic. The same derivation also yields a critic-free actor advantage, allowing VIMPO to separate reward incorporation through the value loss from policy improvement through a PPO-style actor update.
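
For orientation, the standard optimality conditions of KL-regularized RL can be written out as below. This is the textbook soft-Bellman form, not a transcription of the paper's equations: the symbols \beta (regularization strength), Q^*, V^*, s_t, a_t, r_t and R are introduced here as assumptions, along with deterministic token transitions, no discounting, and an outcome reward R paid only at the final token.

```latex
\begin{align}
\pi^*(a_t \mid s_t) &= \pi_{\mathrm{ref}}(a_t \mid s_t)\,
  \exp\!\Big(\tfrac{1}{\beta}\big(Q^*(s_t,a_t) - V^*(s_t)\big)\Big), \\
Q^*(s_t,a_t) &= r_t + V^*(s_{t+1}), \qquad V^*(s_T) = 0, \\
\beta \log\frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}
  &= r_t + V^*(s_{t+1}) - V^*(s_t), \\
\beta \sum_{t=0}^{T-1} \log\frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}
  &= R - V^*(s_0).
\end{align}
```

The last line follows by summing the third over t: the intermediate values telescope, the rewards sum to R, and the terminal condition removes V^*(s_T). How VIMPO turns this recurrence into its value loss and actor advantage is not specified in the text summarized here.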

What carries the argument

Policy-implied value function derived from KL-regularized optimality conditions, expressed as a log-ratio recurrence anchored at trajectory termination.

Load-bearing premise

The value recurrence for autoregressive generation can be written in terms of policy-reference log-ratios and anchored by the terminal condition that no future reward remains at the end of a trajectory.
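
To make the premise concrete, here is a minimal sketch of a backward value recurrence in policy-reference log-ratios, anchored at zero after the last token. It assumes the standard soft-Bellman form V_t = r_t + V_{t+1} - beta * (log pi_t - log pi_ref_t); the function name `implied_values`, the placement of the reward inside the recurrence, and the toy numbers are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def implied_values(logp, logp_ref, rewards, beta):
    """Backward recurrence over one sampled completion (hypothetical helper).

    V_t = r_t + V_{t+1} - beta * (logp_t - logp_ref_t), with V_T = 0.
    """
    logp = np.asarray(logp, dtype=float)
    logp_ref = np.asarray(logp_ref, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    T = len(logp)
    values = np.zeros(T + 1)  # values[T] = 0: no reward remains after the end
    for t in reversed(range(T)):
        values[t] = rewards[t] + values[t + 1] - beta * (logp[t] - logp_ref[t])
    return values

# Toy per-token log-probabilities and an outcome reward on the last token.
# These numbers are made up for illustration and do not come from the paper.
logp = [-0.5, -1.0, -0.2]
logp_ref = [-0.7, -1.0, -0.9]
rewards = [0.0, 0.0, 1.0]
values = implied_values(logp, logp_ref, rewards, beta=0.1)
print(values)
```

Under these assumptions the value at the prompt equals the outcome reward minus beta times the summed log-ratio, which is the telescoped form of the recurrence.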

What would settle it

Run VIMPO and GRPO on the same math benchmarks and ablate VIMPO's value loss. If the improvement over GRPO disappears without the value loss, the loss is carrying the gain; if it persists, the claimed mechanism is not what drives it.

Figures

Figures reproduced from arXiv: 2606.20008 by Aosong Feng, Dawn Song, Sergey Levine, Xuandong Zhao, Zhewei Kang.

**Figure 1.** Overview of VIMPO. Given a prompt q, the policy generates a completion o, which is scored to obtain an outcome reward r. VIMPO uses this reward to train the policy-implied value loss, while the policy and frozen reference model define a token-level TD signal used to form the actor advantage. This separates reward incorporation from policy improvement without training an explicit critic.

**Figure 3.** Main comparison among naive GRPO, GRPO and VIMPO under clean verifier rewards. We …

**Figure 4.** Noisy-reward stress test over the first 200 training steps. Solid curves show runs trained …

**Figure 5.** Ablation over VIMPO coefficients. We compare a value-only variant with …

**Figure 6.** Training entropy of naive GRPO, GRPO, and VIMPO under clean verifier rewards. Faint …

**Figure 7.** Response-length dynamics for the VIMPO coefficient ablation. The main VIMPO setting …

**Figure 8.** Token-aligned comparison between the VIMPO GAE actor signal and a Monte Carlo …

**Figure 9.** Token-level view of selected critical spans from the same combinatorics case using the …
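
The captions for Figures 1 and 8 refer to a token-level TD signal and a GAE actor signal. As background only, standard generalized advantage estimation over a sequence of TD residuals can be sketched as follows. The function name `gae_from_td` and the default `gamma` and `lam` values are assumptions for illustration; the captions do not state how VIMPO defines its residuals or which coefficients it uses.

```python
import numpy as np

def gae_from_td(deltas, gamma=1.0, lam=0.95):
    """Standard GAE: A_t = sum_l (gamma * lam)^l * delta_{t+l} (hypothetical helper)."""
    deltas = np.asarray(deltas, dtype=float)
    advantages = np.zeros_like(deltas)
    running = 0.0
    # Accumulate from the last token backwards so each step reuses the tail sum.
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

# With lam = 0 each token sees only its own residual; with lam = 1 it sees the full tail sum.
print(gae_from_td([1.0, 1.0, 1.0], gamma=1.0, lam=0.5))
```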

Original abstract

Reinforcement learning with verifiable rewards has become a central tool for improving the reasoning ability of large language models, but current methods face a trade-off between simplicity and credit assignment. Group-relative methods such as GRPO avoid training a critic, but typically assign a trajectory-level advantage to every token. Actor-critic methods provide denser learning signals, but require a learned value function with its own training instability. We introduce VIMPO, a critic-free policy optimization method that derives a policy-implied value function from the optimality conditions of KL-regularized reinforcement learning. For autoregressive generation, the resulting value recurrence can be written in terms of policy-reference log-ratios and anchored by the terminal condition that no future reward remains at the end of a trajectory. This gives a simple value loss that incorporates outcome-level verifiable rewards without training a critic. The same derivation also yields a critic-free actor advantage, allowing VIMPO to separate reward incorporation through the value loss from policy improvement through a PPO-style actor update. On mathematical RLVR benchmarks, VIMPO improves over GRPO across MATH-500, AIME 2024, AIME 2025, and OlympiadBench, with especially larger gains on competition-style evaluations. Under noisy rewards, VIMPO retains a consistent advantage over GRPO, suggesting that policy-implied value optimization can provide finer credit assignment while preserving the practical simplicity of critic-free training.
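
The abstract's remark that group-relative methods assign one trajectory-level advantage to every token can be illustrated with a short sketch. It uses the commonly described GRPO normalization (reward minus group mean, divided by group standard deviation); the function name, the `eps` stabilizer, and the toy rewards are assumptions for illustration rather than details taken from the paper.

```python
import numpy as np

def group_relative_advantages(rewards, lengths, eps=1e-6):
    """Broadcast one standardized advantage per completion to all of its tokens."""
    rewards = np.asarray(rewards, dtype=float)
    # Standardize outcome rewards within the group sampled for a single prompt.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Every token of completion i receives the same scalar adv[i].
    return [np.full(n, a) for a, n in zip(adv, lengths)]

# Four completions for one prompt: two judged correct, two incorrect (toy values).
token_advs = group_relative_advantages(rewards=[1.0, 0.0, 0.0, 1.0],
                                       lengths=[3, 2, 4, 1])
for a in token_advs:
    print(a)
```

Because each array is constant along the token axis, no token inside a completion is credited more than any other, which is the coarse credit assignment the abstract contrasts with VIMPO.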

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VIMPO pulls a critic-free value recurrence from KL-regularized optimality conditions for autoregressive generation and reports gains over GRPO on math benchmarks, but the derivation needs explicit checking for hidden bias under outcome rewards.

VIMPO tries to split the difference between GRPO's simplicity and actor-critic density by deriving an implicit value directly from the optimality conditions of KL-regularized RL. For autoregressive generation the value recurrence is written in terms of policy-reference log-ratios and closed by the terminal condition that no future reward remains. This value loss is then paired with a separate PPO-style actor update, keeping the whole thing critic-free.
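
For readers less familiar with the "PPO-style actor update" mentioned here, a minimal sketch of the standard clipped surrogate loss follows. It shows the generic PPO objective only; the function name, the `clip_eps` default, and the per-token inputs are assumptions, and the advantage VIMPO feeds into this update is not reproduced.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Generic PPO clipped surrogate, returned as a loss to minimize."""
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    adv = np.asarray(advantages, dtype=float)
    ratio = np.exp(logp_new - logp_old)  # importance ratio per token
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    # Taking the minimum removes any incentive to push the ratio outside the clip range.
    return -np.mean(np.minimum(unclipped, clipped))

# A token whose probability doubled with positive advantage is capped at 1 + clip_eps.
print(ppo_clip_loss([np.log(2.0)], [0.0], [1.0]))
```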

The experiments are the clearest positive. VIMPO beats GRPO on MATH-500, AIME 2024, AIME 2025, and OlympiadBench, with larger margins on the harder competition problems. It also keeps the edge when rewards are noisy, which is a practical test of whether the implicit value actually improves credit assignment.

The soft spot is the derivation itself. If the reference-policy continuation value does not cancel cleanly, or if the advantage cannot be separated without bias when rewards arrive only at the end of the trajectory, the method could reduce to a more complicated version of GRPO rather than a genuine improvement. The paper needs to lay out every step from the optimality conditions and show that the value loss remains unbiased. Until then, the central claim rests on an assumption that the abstract alone does not verify.

This is for people working on RLVR pipelines who want something between group-relative baselines and full critic training. A reader focused on math reasoning or noisy-reward settings would find the empirical comparison useful.

It deserves peer review because it offers a concrete alternative with benchmark numbers, even if the math requires close scrutiny in revision.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces VIMPO, a critic-free policy optimization method for LLMs under verifiable rewards. It derives a policy-implied value function directly from the optimality conditions of KL-regularized RL; for autoregressive generation this yields a value recurrence expressed via policy-reference log-ratios and anchored solely by the terminal condition of zero future reward. The same derivation supplies a critic-free actor advantage, enabling a value loss for reward incorporation and a PPO-style actor update for policy improvement. Experiments report consistent gains over GRPO on MATH-500, AIME 2024, AIME 2025 and OlympiadBench (larger on competition problems) together with retained advantage under noisy rewards.

Significance. If the central derivation is free of hidden assumptions on reference-policy cancellation or advantage separation, VIMPO would supply a practical middle path between the coarse credit assignment of group-relative methods and the instability of learned critics. The reported robustness to reward noise and the differential gains on harder benchmarks would then constitute a concrete empirical contribution to RLVR for reasoning models.

major comments (2)

[Methods section] Value recurrence derivation: the manuscript states that the recurrence follows directly from KL-regularized optimality conditions and is anchored only by the terminal condition. It must be shown explicitly that the reference policy continuation value cancels exactly under outcome-only rewards; otherwise the claimed separation of value loss from actor advantage may not hold and the method could reduce to a reparameterized form of GRPO rather than delivering distinct finer credit assignment.
[Results section] Experimental claims: the reported improvements over GRPO are presented without error bars, number of independent runs, or statistical significance tests. This information is load-bearing for the claim of “especially larger gains on competition-style evaluations” and for the assertion of consistent advantage under noisy rewards.

minor comments (1)

[Abstract] The acronym RLVR is used without an initial definition; a parenthetical expansion on first use would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the value recurrence derivation and the experimental reporting. We address each major comment below and will revise the manuscript accordingly.

Referee: [Methods section] Value recurrence derivation: the manuscript states that the recurrence follows directly from KL-regularized optimality conditions and is anchored only by the terminal condition. It must be shown explicitly that the reference policy continuation value cancels exactly under outcome-only rewards; otherwise the claimed separation of value loss from actor advantage may not hold and the method could reduce to a reparameterized form of GRPO rather than delivering distinct finer credit assignment.

Authors: The derivation starts from the KL-regularized optimality condition for the soft value function under a reference policy. For outcome-only rewards the terminal condition is V_T = 0. Expanding the soft Bellman equation yields a recurrence in which the reference-policy continuation value appears symmetrically in both the log-ratio term and the subtracted future-value term; these cancel exactly, leaving a value expressed solely in policy-reference log-ratios anchored at the terminal zero. This cancellation is what permits the separate value loss and actor advantage. We will add a fully expanded, line-by-line derivation in the Methods section to make the cancellation explicit.

Revision: yes

Referee: [Results section] Experimental claims: the reported improvements over GRPO are presented without error bars, number of independent runs, or statistical significance tests. This information is load-bearing for the claim of “especially larger gains on competition-style evaluations” and for the assertion of consistent advantage under noisy rewards.

Authors: We agree that the current results lack the statistical detail needed to support the strength of the claims. In the revision we will report means and standard deviations over multiple independent runs, include error bars on all figures, and add statistical significance tests (e.g., paired t-tests or Wilcoxon tests) comparing VIMPO against GRPO on each benchmark and under noisy rewards.

Revision: yes
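
As an illustration of the kind of paired comparison promised in this response, the sketch below computes a paired t statistic over matched runs using only the Python standard library. The function name and the placeholder scores are hypothetical and are not results from the paper; converting the statistic to a p-value would need a t-distribution table or a statistics package.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for paired samples (hypothetical helper)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired runs")
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("differences are constant; t statistic is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1

# Placeholder accuracies for three seeds, NOT taken from the paper.
method_a = [3.0, 5.0, 7.0]
method_b = [2.0, 3.0, 4.0]
t_stat, dof = paired_t_statistic(method_a, method_b)
print(t_stat, dof)
```

Pairing by seed or by benchmark removes run-to-run variation shared by both methods, which is why a paired test suits a VIMPO versus GRPO comparison on identical setups.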

Circularity Check

0 steps flagged

Derivation from standard KL-regularized optimality conditions is self-contained

full rationale

The central claim derives a policy-implied value recurrence directly from the optimality conditions of KL-regularized RL, expressed via policy-reference log-ratios and anchored only by the terminal condition of zero future reward. No self-citations, fitted parameters renamed as predictions, or self-definitional steps are present in the provided description. The separation of value loss from PPO-style actor update follows from the same optimality conditions without reducing to GRPO by construction. This is a standard theoretical derivation in RL and remains independent of the paper's own inputs or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract only; no explicit free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.1-grok · 5787 in / 1098 out tokens · 23091 ms · 2026-06-26T17:57:34.574707+00:00 · methodology

Reference graph

Works this paper leans on

17 extracted references · 11 linked inside Pith

[1] Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617.

[2] DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.

[3] Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008.

[4] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.

[5] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Alex Low, Alan Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720.

[6] Yang Li, Zhichen Dong, Yuhan Sun, Weixun Wang, Shaopan Xiong, Yijia Luo, Jiashun Liu, Han Lu, Jiamang Wang, Wenbo Su, Bo Zheng, and Junchi Yan. Attention illuminates LLM reasoning: The preplan-and-anchor rhythm enables fine-grained policy optimization. arXiv preprint arXiv:2510.13554.

[7] Chiyu Ma, Shuo Yang, Kexin Huang, Jinda Lu, Haoming Meng, Shangshang Wang, Bolin Ding, Soroush Vosoughi, Guoyin Wang, and Jingren Zhou. FIPO: Eliciting deep reasoning with future-KL influenced policy optimization. arXiv preprint arXiv:2603.19835.

[8] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

[9] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

[10] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256.

[11] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ... Qwen3 technical report. arXiv preprint arXiv:2505.09388.

[12] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.

[13] Yufeng Yuan, Yu Yue, Ruofei Zhu, Tiantian Fan, and Lin Yan. What’s behind PPO’s collapse in long-CoT? Value optimization holds the secret. arXiv preprint arXiv:2503.01491.

[14] Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, Tiantian Fan, Zhengyin Du, Xiangpeng Wei, Gaohong Liu, Juncai Liu, Lingjun Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Chi Zhang, Mofan Zhang, Wang Zhang, et al. VAPO: Efficient and reliable reinforcement learning for advanced reasoning tasks. arXiv preprint arXiv:2504.05118.

[15] Appendix A (Experimental Details), A.1 Shared hyperparameters. Table 2 lists the hyperparameters shared by GRPO and VIMPO in the main experiments. Method-specific VIMPO coefficients are reported in Section 4.1. Table 2 (Shared experimental hyperparameters for GRPO and VIMPO): base model Qwen3-4B-Base; training data Guru Math subset, 54.4K exampl...

[16] The naive GRPO baseline uses the same group-relative advantage as GRPO, but averages the token loss within each completion before averaging across completions. Appendix B (Additional Experimental Results), B.1 Training entropy. [Figure 6: Training entropy ... Axes: training step vs. entropy. Curves: naive GRPO, GRPO, VIMPO.]

[17] This supports the observation that VIMPO-specific coefficients affect not only optimization speed and policy movement, but also generation-length dynamics. [Figure 7: Response-length dynamics for the VIMPO coefficie... Axes: training step vs. response length. Curves: β = 5e−4, c_A = 0; β = 5e−4, c_A = 0.005; β = 0.05, c_A = 5.0.]
