Value-Gradient Hypothesis of RL for LLMs
Pith reviewed 2026-05-22 08:48 UTC · model grok-4.3
The pith
Under differentiable rollouts with additive noise, actor updates in critic-free RL for LLMs propagate costates whose expectation equals the value gradient.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under a differentiable rollout and additive-noise parameterization, the actor update is value-gradient-like in expectation: the backward pass propagates costates whose conditional expectation equals the value gradient. For discrete transformer policies, autodifferentiation through attention produces empirical costates that approximate this value signal, with an error controlled by the sampling gap and policy entropy. These results motivate a decomposition of RL impact into value gradient signal and reachable reward headroom, yielding a criterion for when RL should be most effective along a pretraining trajectory.
What carries the argument
Differentiable rollout combined with additive-noise policy parameterization, which allows the backward pass to produce costates whose expectation matches the value gradient.
Load-bearing premise
The rollout must be differentiable and the policy must be parameterized with additive noise.
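To make the premise concrete, here is a minimal Python sketch of the continuous case only. It assumes a hypothetical scalar closed-loop system s_{t+1} = C·s_t + SIGMA·ξ_t (a linear policy with additive Gaussian noise folded into the coefficient C) and reward r(s) = -s². None of these choices or constants come from the paper. For this toy system the value gradient has a closed form, so the Monte Carlo average of the backward-pass costate at s_0 can be compared against it.

```python
import random

# Hypothetical constants for this toy example (not from the paper).
GAMMA, C, SIGMA, T = 0.9, 0.8, 0.5, 10


def sampled_costate(s0, rng):
    """Roll out once, then run the backward pass and return lambda_0."""
    # Forward pass: closed-loop linear dynamics with additive noise.
    states = [s0]
    for _ in range(T):
        states.append(C * states[-1] + SIGMA * rng.gauss(0.0, 1.0))
    # Backward pass: lambda_t = dr/ds(s_t) + GAMMA * C * lambda_{t+1},
    # with r(s) = -s^2 and a zero costate beyond the horizon.
    lam = 0.0
    for s in reversed(states):
        lam = -2.0 * s + GAMMA * C * lam
    return lam


def exact_value_gradient(s0):
    """dV/ds at s0 for the same system: -2 * s0 * sum_k (GAMMA * C^2)^k."""
    return -2.0 * s0 * sum((GAMMA * C * C) ** k for k in range(T + 1))


rng = random.Random(0)
s0, n = 1.0, 20000
mc = sum(sampled_costate(s0, rng) for _ in range(n)) / n
exact = exact_value_gradient(s0)
print(mc, exact)
```

Each sampled costate is noisy, but the average should settle near the analytic value gradient. The sketch says nothing about the discrete transformer case, where the paper claims only an approximation.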
What would settle it
If empirical costates obtained by autodifferentiation through attention deviate from the true value gradient by more than the amount predicted by sampling gap and policy entropy, the central claim is falsified.
Original abstract
Reinforcement learning substantially improves pretrained language models, but it remains understudied why critic-free methods such as PPO and GRPO work as well as they do, and when they should provide the largest gains. We develop a value-gradient perspective of critic-free RL for LLM post-training. First, under a differentiable rollout and additive-noise parameterization, we show that the actor update is value-gradient-like in expectation: the backward pass propagates costates whose conditional expectation equals the value gradient. Second, for discrete transformer policies, we show that autodifferentiation through attention produces empirical costates that approximate this value signal, with an error controlled by the sampling gap and policy entropy. These results motivate a decomposition of RL impact into value gradient signal and reachable reward headroom, yielding a criterion for when RL should be most effective along a pretraining trajectory.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a value-gradient perspective on critic-free RL for LLM post-training. Under a differentiable rollout and additive-noise parameterization of the policy, it claims that the actor update is value-gradient-like in expectation: the backward pass propagates costates whose conditional expectation equals the value gradient. For discrete transformer policies, autodifferentiation through attention is shown to produce empirical costates that approximate this value signal, with approximation error controlled by sampling gap and policy entropy. These results motivate a decomposition of RL impact into value gradient signal and reachable reward headroom, yielding a criterion for when RL should be most effective along a pretraining trajectory.
Significance. If the central derivation and approximation hold, the work supplies a theoretical explanation for the empirical success of methods such as PPO and GRPO in LLM post-training and identifies conditions under which critic-free RL yields the largest gains. The explicit connection between backward-pass costates and value gradients, together with the proposed decomposition, could guide more principled choices of when and how to apply RL along the pretraining-to-post-training continuum.
major comments (1)
- [Results section (discrete transformer policies paragraph)] The claim that autodifferentiation through attention produces costates whose expectation approximates the value gradient rests on an error term controlled by sampling gap and policy entropy. No explicit bound, theorem, or quantitative estimate of this error is supplied for the high-entropy, large-vocabulary regime characteristic of standard LLM policies; because the continuous additive-noise case does not directly apply to categorical transformer outputs, this approximation is load-bearing for the central hypothesis yet remains unverified.
minor comments (2)
- [Abstract, Introduction] The term 'costates' is introduced without a brief definition or reference to its origin in optimal control, which may hinder accessibility for readers whose primary background is in language-model training rather than continuous-time RL.
- [Introduction] The manuscript would benefit from an explicit statement of the precise assumptions (e.g., differentiability of the rollout and the form of the additive noise) in a dedicated assumptions subsection or theorem statement.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for identifying this important point regarding the rigor of the approximation in the discrete case. We address the comment below and have revised the manuscript to strengthen the presentation.
Point-by-point responses
Referee: Results section (discrete transformer policies paragraph): The claim that autodifferentiation through attention produces costates whose expectation approximates the value gradient rests on an error term controlled by sampling gap and policy entropy. No explicit bound, theorem, or quantitative estimate of this error is supplied for the high-entropy, large-vocabulary regime characteristic of standard LLM policies; because the continuous additive-noise case does not directly apply to categorical transformer outputs, this approximation is load-bearing for the central hypothesis yet remains unverified.
Authors: We agree that the manuscript would benefit from greater rigor on this point. The current text states that the error is controlled by the sampling gap and policy entropy but does not supply an explicit theorem or quantitative bound tailored to the high-entropy, large-vocabulary regime of standard LLM policies. The continuous additive-noise derivation provides the core intuition, while the discrete transformer case is presented via the properties of autodifferentiation through attention. In the revised manuscript we have added a new proposition in the results section that supplies an explicit upper bound on the approximation error. The bound is expressed in terms of the total-variation distance induced by the sampling gap and a factor that scales with policy entropy; we also include a short discussion of its behavior under large vocabulary sizes. This addition makes the central hypothesis more self-contained without changing the overall claims or requiring new experiments.
Revision: yes
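The two quantities named in the response can be computed for a single categorical distribution, as in the Python sketch below. It is illustrative only: reading the "sampling gap" as the total-variation distance between the sampled one-hot token and the softmax distribution is an assumption made here, the logits are made up, and the sketch does not reproduce the proposition or its bound.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def tv_to_one_hot(p, i):
    """Total-variation distance between p and the one-hot at index i.

    This equals 1 - p[i].
    """
    return 0.5 * sum(abs((1.0 if j == i else 0.0) - q) for j, q in enumerate(p))


def expected_tv(p):
    """Expected TV distance when the token index is sampled from p."""
    return sum(q * tv_to_one_hot(p, i) for i, q in enumerate(p))


# Hypothetical logits: one confident distribution, one uniform.
peaked = softmax([8.0, 0.0, 0.0, 0.0])
flat = softmax([0.0, 0.0, 0.0, 0.0])
for name, p in (("peaked", peaked), ("flat", flat)):
    print(name, round(entropy(p), 4), round(expected_tv(p), 4))
```

Under this reading, both quantities grow as the distribution flattens, which is the regime the referee flags as unverified.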
Circularity Check
Derivation is conditional on explicit modeling assumptions with no reduction to inputs by construction
full rationale
The paper derives that under a differentiable rollout and additive-noise parameterization the backward pass produces costates whose conditional expectation equals the value gradient. This is presented as a mathematical consequence of the stated assumptions rather than a redefinition or statistical fit of the target quantity itself. No equations in the abstract or reader's summary reduce the claimed equality to a fitted parameter or self-referential normalization. The discrete-transformer extension is described as an approximation whose error is controlled by sampling gap and entropy, which is a standard modeling claim rather than circular. No self-citation chains or uniqueness theorems imported from prior author work are invoked as load-bearing. The result is therefore self-contained against the listed assumptions and receives a zero circularity score.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The rollout is differentiable and the policy uses an additive-noise parameterization.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: under a differentiable rollout and additive-noise parameterization, the actor update is value-gradient-like in expectation: the backward pass propagates costates whose conditional expectation equals the value gradient
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, theorem embed_strictMono_of_one_lt (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: λ_t := ∂R_t / ∂s_t ... λ_t = Dr(s_t, a_t) + γ (Df_θ(s_t, a_t))^T λ_{t+1}
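As a rough illustration of the recursion in the passage above, the following Python sketch runs it backward over one sampled rollout of a scalar system. It then compares λ_0 with a central finite difference of the discounted return, taken with the noise held fixed. The dynamics `step`, the reward, the linear policy with additive noise, and all constants are hypothetical choices for this example, not taken from the paper. D is read here as the total derivative with respect to the state, including the path through the policy, which is also an assumption.

```python
import math
import random

# Hypothetical constants for this toy example (not from the paper).
GAMMA, THETA, SIGMA = 0.9, -0.4, 0.3


def policy(s, xi):
    """Linear policy with additive noise: a = THETA * s + SIGMA * xi."""
    return THETA * s + SIGMA * xi


def step(s, a):
    """Hypothetical differentiable dynamics f(s, a)."""
    return 0.9 * s + 0.5 * math.sin(a)


def reward(s, a):
    return -(s * s + 0.1 * a * a)


def rollout(s0, noise):
    states, actions = [s0], []
    for xi in noise:
        a = policy(states[-1], xi)
        actions.append(a)
        states.append(step(states[-1], a))
    return states, actions


def discounted_return(s0, noise):
    states, actions = rollout(s0, noise)
    # zip stops at the last action, so the final state earns no reward.
    return sum(GAMMA**t * reward(s, a)
               for t, (s, a) in enumerate(zip(states, actions)))


def costates(s0, noise):
    """Backward recursion lambda_t = Dr_t + GAMMA * Df_t * lambda_{t+1}."""
    states, actions = rollout(s0, noise)
    lam, out = 0.0, [0.0] * len(actions)
    for t in reversed(range(len(actions))):
        s, a = states[t], actions[t]
        d_r = -2.0 * s - 0.2 * a * THETA       # total derivative of r w.r.t. s
        d_f = 0.9 + 0.5 * math.cos(a) * THETA  # total derivative of f w.r.t. s
        lam = d_r + GAMMA * d_f * lam
        out[t] = lam
    return out


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(20)]
s0, h = 1.0, 1e-5
lam0 = costates(s0, noise)[0]
fd = (discounted_return(s0 + h, noise) - discounted_return(s0 - h, noise)) / (2 * h)
print(lam0, fd)
```

The two printed numbers should agree to within finite-difference error. That confirms only the pathwise identity λ_0 = ∂R_0/∂s_0 for this toy system, not the expectation claim.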
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.