Value-Gradient Hypothesis of RL for LLMs

Arip Asadulaev; Daniil Ognev; Karim Salta; Martin Takac

arXiv: 2605.21654 · v1 · pith:5D7V4YGGnew · submitted 2026-05-20 · 💻 cs.LG · cs.AI · cs.CL

Pith reviewed 2026-05-22 08:48 UTC · model grok-4.3

classification: 💻 cs.LG · cs.AI · cs.CL

keywords: reinforcement learning, large language models, value gradients, PPO, GRPO, actor updates, post-training

The pith

Under differentiable rollouts with additive noise, actor updates in critic-free RL for LLMs propagate costates whose expectation equals the value gradient.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a value-gradient perspective on critic-free reinforcement learning for language model post-training. It shows that when rollouts are differentiable and policies use additive-noise parameterization, the backward pass through the actor produces signals whose conditional expectation matches the value gradient. For transformer policies this approximation holds with error bounded by sampling gap and policy entropy. The resulting decomposition into value-gradient strength and reachable reward headroom supplies a criterion for when RL post-training yields the largest gains along a pretraining trajectory.

Core claim

Under a differentiable rollout and additive-noise parameterization, the actor update is value-gradient-like in expectation: the backward pass propagates costates whose conditional expectation equals the value gradient. For discrete transformer policies, autodifferentiation through attention produces empirical costates that approximate this value signal, with an error controlled by the sampling gap and policy entropy. These results motivate a decomposition of RL impact into value gradient signal and reachable reward headroom, yielding a criterion for when RL should be most effective along a pretraining trajectory.
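
The expectation statement can be sketched in one line. What follows is a standard pathwise-derivative argument, not the paper's own proof. It assumes that the noise ε is drawn independently of the state (the additive-noise parameterization), that the return R_t is differentiable in s_t for each fixed noise realization, and that differentiation and expectation can be interchanged. The symbols V and ε are notation introduced here for illustration; λ_t := ∂R_t / ∂s_t is the costate definition quoted from the paper further down this page.

```latex
\begin{aligned}
V(s) &:= \mathbb{E}_{\varepsilon}\big[\, R_t(s, \varepsilon) \,\big|\, s_t = s \,\big], \\
\nabla_s V(s)
  &= \mathbb{E}_{\varepsilon}\big[\, \nabla_s R_t(s, \varepsilon) \,\big|\, s_t = s \,\big]
   = \mathbb{E}\big[\, \lambda_t \,\big|\, s_t = s \,\big],
\qquad \lambda_t := \frac{\partial R_t}{\partial s_t}.
\end{aligned}
```

The interchange in the second line is where differentiability of the rollout is used; for sampled discrete tokens it does not hold as written, which is why the discrete transformer case is stated as an approximation.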

What carries the argument

Differentiable rollout combined with additive-noise policy parameterization, which allows the backward pass to produce costates whose expectation matches the value gradient.
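
A minimal sketch of the backward costate recursion λ_t = Dr(s_t, a_t) + γ (Df_θ(s_t, a_t))^T λ_{t+1} on a toy problem. It assumes a scalar linear system with a quadratic reward, both chosen here purely for illustration; the paper's setting is an LLM policy. The constants `A`, `B`, `C`, `GAMMA`, the policy gain `theta`, and all function names are hypothetical, and the derivatives are taken along the closed loop (policy substituted into the dynamics). With the noise draws held fixed, the recursion should reproduce the derivative of the sampled return with respect to the initial state, which the last lines check by central finite differences.

```python
import random

A, B, C, GAMMA = 0.9, 0.5, 0.1, 0.95  # hypothetical toy constants


def rollout(s0, theta, noise):
    """Closed-loop rollout with additive-noise actions a_t = theta * s_t + eps_t."""
    states, actions = [s0], []
    for eps in noise:
        a = theta * states[-1] + eps
        actions.append(a)
        states.append(A * states[-1] + B * a)  # differentiable transition
    return states, actions


def sampled_return(s0, theta, noise):
    """Discounted return with reward r(s, a) = -(s^2 + C * a^2)."""
    states, actions = rollout(s0, theta, noise)
    return sum(GAMMA**t * -(s * s + C * a * a)
               for t, (s, a) in enumerate(zip(states, actions)))


def costates(s0, theta, noise):
    """Backward pass: lam_t = dr/ds + GAMMA * (df/ds) * lam_{t+1}, with lam_T = 0."""
    states, actions = rollout(s0, theta, noise)
    lam, lams = 0.0, [0.0] * len(actions)
    for t in reversed(range(len(actions))):
        # Total derivative of the reward in s_t, including the path through a_t.
        dr_ds = -2.0 * states[t] - 2.0 * C * actions[t] * theta
        df_ds = A + B * theta  # closed-loop Jacobian of the transition
        lam = dr_ds + GAMMA * df_ds * lam
        lams[t] = lam
    return lams


rng = random.Random(0)
noise = [rng.gauss(0.0, 0.1) for _ in range(20)]
s0, theta, h = 1.0, -0.4, 1e-4

lam0 = costates(s0, theta, noise)[0]
fd = (sampled_return(s0 + h, theta, noise)
      - sampled_return(s0 - h, theta, noise)) / (2 * h)
print(abs(lam0 - fd))  # small: the recursion matches the pathwise derivative
```

The finite-difference check reuses the same noise list for both perturbed rollouts, which is what makes the comparison a pathwise derivative rather than a noisy estimate.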

Load-bearing premise

The rollout must be differentiable and the policy must be parameterized with additive noise.

What would settle it

If empirical costates obtained by autodifferentiation through attention deviate from the true value gradient by more than the amount predicted by sampling gap and policy entropy, the central claim is falsified.
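
One way to make this test concrete on a toy problem, sketched under strong assumptions: a scalar linear system with quadratic reward and Gaussian additive noise, where the value gradient has a closed form. The system, the constants, and the function names are hypothetical and are not the paper's transformer setting; the sketch only shows the shape of the comparison between averaged empirical costates and a known value gradient.

```python
import random

# Hypothetical toy constants: dynamics, reward weight, discount, noise scale, horizon.
A, B, C, GAMMA, SIGMA, T = 0.9, 0.5, 0.1, 0.95, 0.1, 20


def initial_costate(s0, theta, rng):
    """Sample one rollout with a_t = theta * s_t + eps_t and return lam_0 = dR_0/ds_0."""
    s, traj = s0, []
    for _ in range(T):
        a = theta * s + rng.gauss(0.0, SIGMA)  # additive-noise policy
        traj.append((s, a))
        s = A * s + B * a
    lam = 0.0
    for s, a in reversed(traj):  # backward costate recursion
        lam = (-2.0 * s - 2.0 * C * a * theta) + GAMMA * (A + B * theta) * lam
    return lam


def value_gradient(s0, theta):
    """Closed-form dV/ds0 for reward -(s^2 + C * a^2) over T steps."""
    m = A + B * theta  # closed-loop transition coefficient
    geom = sum((GAMMA * m * m) ** t for t in range(T))
    return -2.0 * (1.0 + C * theta * theta) * s0 * geom


rng = random.Random(0)
s0, theta, n = 1.0, -0.4, 5000
mean_costate = sum(initial_costate(s0, theta, rng) for _ in range(n)) / n
gap = abs(mean_costate - value_gradient(s0, theta))
print(gap)  # Monte Carlo gap; expected to shrink as n grows
```

In this toy case the costate is linear in the noise, so the averaged costate converges to the value gradient exactly. The paper's falsification criterion concerns the discrete case, where the gap is instead compared against a bound in sampling gap and policy entropy.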

Figures

Figures reproduced from arXiv: 2605.21654 by Arip Asadulaev, Daniil Ognev, Karim Salta, Martin Takac.

**Figure 1.** Real RL gain vs. predicted one using value impact formula (Section 5, Eq. 29). Recently, Large Language Models (LLMs) achieve state-of-the-art reasoning using Group Relative Policy Optimization (GRPO) (Shao et al., 2024), which discards the critic entirely, yet classical Reinforcement Learning (RL) theory predicts that critic-free methods should fail at long-horizon credit assignment. Why don’t they? In t…

**Figure 2.** Bound inequality plot. Real value gradient gap vs proposed bound (Section 4, Prop. 2). Attention is doing the heavy lifting. In an RNN, h_{t+1} depends on h_t only through the recurrence h_{t+1} = f(h_t, e_{o_t}). The only pathway is through the discrete token, so the sampling gap blocks all temporal credit flow. In a transformer, h^{(L)}_{t+1} depends on h^{(L)}_t through attention, a direct, differentiable, content-bas…

**Figure 3.** Results of the RL upon various OLMo-2 pretraining checkpoints. The left image shows the …

**Figure 4.** Real RL gain vs. predicted one using value impact formula (Section 5, Eq. 29). Closed-form RL task. For the RL-impact experiment, we use OLMo-2 1B checkpoints from pretraining steps 50k to 1M in increments of 50k. The task is a controlled label-copying problem. Given a prompt containing a target label in {A, B, C, D}, the model must put probability mass on the matching answer token. The reward is R(θ; q…

**Figure 5.** Z-scores of the gain after RL vs. z-scores of the predicted RL impact. Correlation (left) and …

Original abstract

Reinforcement learning substantially improves pretrained language models, but it remains understudied why critic-free methods such as PPO and GRPO work as well as they do, and when they should provide the largest gains. We develop a value-gradient perspective of critic-free RL for LLM post-training. First, under a differentiable rollout and additive-noise parameterization, we show that the actor update is value-gradient-like in expectation: the backward pass propagates costates whose conditional expectation equals the value gradient. Second, for discrete transformer policies, we show that autodifferentiation through attention produces empirical costates that approximate this value signal, with an error controlled by the sampling gap and policy entropy. These results motivate a decomposition of RL impact into value gradient signal and reachable reward headroom, yielding a criterion for when RL should be most effective along a pretraining trajectory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a value-gradient framing for why critic-free RL works on LLMs and a signal-plus-headroom split for timing, but the additive-noise assumption for discrete policies is the part that needs the most checking.

Hi, the main point here is that the authors argue critic-free RL updates on LLMs can be read as value-gradient steps once you assume a differentiable rollout and an additive-noise policy. They then show that autodiff through attention gives a workable approximation for the usual discrete token case, with the error tied to sampling gap and entropy. From there they split RL gains into the strength of that value signal and the leftover reward headroom, which they say can indicate when post-training will move the needle most along a pretraining curve. That decomposition is the piece that could actually affect how labs schedule compute.

The framing itself is new enough; I do not recall the exact costate-expectation argument appearing in the PPO or GRPO literature they cite. They also do a reasonable job keeping the story tied to existing practice rather than inventing new algorithms.

The soft spot is the modeling choice. Standard LLM policies are categorical, not additive noise, so the discrete extension rests on how small the approximation error stays in high-entropy, large-vocabulary regimes. The abstract claims the error is controlled, but without seeing the explicit bounds or any ablation on real models it is difficult to judge whether the equality holds tightly enough to support the downstream claims about compute allocation. Minor gaps in the citation pattern around related work on differentiable relaxations would also be easy to fix.

This paper is for people who already think about RLHF theory or scaling decisions rather than practitioners looking for a new trick. A reader who wants a lens on why PPO succeeds and when to apply it would get something useful. It deserves a serious referee because the question is real and the perspective is fresh, even if the assumptions will need tightening in review.

Referee Report

1 major / 2 minor

Summary. The paper develops a value-gradient perspective on critic-free RL for LLM post-training. Under a differentiable rollout and additive-noise parameterization of the policy, it claims that the actor update is value-gradient-like in expectation: the backward pass propagates costates whose conditional expectation equals the value gradient. For discrete transformer policies, autodifferentiation through attention is shown to produce empirical costates that approximate this value signal, with approximation error controlled by sampling gap and policy entropy. These results motivate a decomposition of RL impact into value gradient signal and reachable reward headroom, yielding a criterion for when RL should be most effective along a pretraining trajectory.

Significance. If the central derivation and approximation hold, the work supplies a theoretical explanation for the empirical success of methods such as PPO and GRPO in LLM post-training and identifies conditions under which critic-free RL yields the largest gains. The explicit connection between backward-pass costates and value gradients, together with the proposed decomposition, could guide more principled choices of when and how to apply RL along the pretraining-to-post-training continuum.

major comments (1)

[Results section (discrete transformer policies paragraph)] The claim that autodifferentiation through attention produces costates whose expectation approximates the value gradient rests on an error term controlled by sampling gap and policy entropy. No explicit bound, theorem, or quantitative estimate of this error is supplied for the high-entropy, large-vocabulary regime characteristic of standard LLM policies; because the continuous additive-noise case does not directly apply to categorical transformer outputs, this approximation is load-bearing for the central hypothesis yet remains unverified.

minor comments (2)

[Abstract and introduction] The term 'costates' is introduced without a brief definition or reference to its origin in optimal control, which may hinder accessibility for readers whose primary background is in language-model training rather than continuous-time RL.
[Introduction] The manuscript would benefit from an explicit statement of the precise assumptions (e.g., differentiability of the rollout and the form of the additive noise) in a dedicated assumptions subsection or theorem statement.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for identifying this important point regarding the rigor of the approximation in the discrete case. We address the comment below and have revised the manuscript to strengthen the presentation.

Referee: Results section (discrete transformer policies paragraph): The claim that autodifferentiation through attention produces costates whose expectation approximates the value gradient rests on an error term controlled by sampling gap and policy entropy. No explicit bound, theorem, or quantitative estimate of this error is supplied for the high-entropy, large-vocabulary regime characteristic of standard LLM policies; because the continuous additive-noise case does not directly apply to categorical transformer outputs, this approximation is load-bearing for the central hypothesis yet remains unverified.

Authors: We agree that the manuscript would benefit from greater rigor on this point. The current text states that the error is controlled by the sampling gap and policy entropy but does not supply an explicit theorem or quantitative bound tailored to the high-entropy, large-vocabulary regime of standard LLM policies. The continuous additive-noise derivation provides the core intuition, while the discrete transformer case is presented via the properties of autodifferentiation through attention. In the revised manuscript we have added a new proposition in the results section that supplies an explicit upper bound on the approximation error. The bound is expressed in terms of the total-variation distance induced by the sampling gap and a factor that scales with policy entropy; we also include a short discussion of its behavior under large vocabulary sizes. This addition makes the central hypothesis more self-contained without changing the overall claims or requiring new experiments.

Revision: yes

Circularity Check

0 steps flagged

Derivation is conditional on explicit modeling assumptions with no reduction to inputs by construction

The paper derives that under a differentiable rollout and additive-noise parameterization the backward pass produces costates whose conditional expectation equals the value gradient. This is presented as a mathematical consequence of the stated assumptions rather than a redefinition or statistical fit of the target quantity itself. No equations in the abstract or reader's summary reduce the claimed equality to a fitted parameter or self-referential normalization. The discrete-transformer extension is described as an approximation whose error is controlled by sampling gap and entropy, which is a standard modeling claim rather than circular. No self-citation chains or uniqueness theorems imported from prior author work are invoked as load-bearing. The result is therefore self-contained against the listed assumptions and receives a zero circularity score.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claims rest on modeling assumptions of differentiability and additive noise that are introduced to enable the costate analysis; no free parameters or new entities are mentioned in the abstract.

axioms (1)

domain assumption: Rollout is differentiable and policy uses additive-noise parameterization.
Invoked to establish that the actor update propagates costates whose conditional expectation equals the value gradient.

pith-pipeline@v0.9.0 · 5675 in / 1267 out tokens · 44314 ms · 2026-05-22T08:48:55.488467+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Each entry gives the Lean file, the theorem name, and a tag for the relation between the paper passage and the cited Recognition theorem.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: under a differentiable rollout and additive-noise parameterization, the actor update is value-gradient-like in expectation: the backward pass propagates costates whose conditional expectation equals the value gradient

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · unclear
Paper passage: λ_t := ∂R_t / ∂s_t ... λ_t = Dr(s_t, a_t) + γ (Df_θ(s_t, a_t))^T λ_{t+1}

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 5 internal anchors

[1] Reasoning with Sampling: Your Base Model is Smarter Than You Think. arXiv preprint arXiv:2510.14901.
[2] Improving policies without measuring merits. Advances in Neural Information Processing Systems.
[3] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300.
[4] Value-gradient learning. The 2012 International Joint Conference on Neural Networks (IJCNN), 2012.
[5] Stochastic backpropagation and approximate inference in deep generative models. International Conference on Machine Learning, 2014.
[6] Gradient estimation using stochastic computation graphs. Advances in Neural Information Processing Systems.
[7] High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv preprint arXiv:1506.02438.
[8] Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems.
[9] Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
[10] Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
[11] Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems.
[12] Learning Without Critics? Revisiting GRPO in Classical Reinforcement Learning Environments. arXiv.
[13] Demystifying Group Relative Policy Optimization: Its Policy Gradient is a U-Statistic. arXiv preprint arXiv:2603.01162.
[14] Reinforcement pre-training. arXiv preprint arXiv:2506.08007.
[15] Soft-grpo: Surpassing discrete-token llm reinforcement learning via gumbel-reparameterized soft-thinking policy optimization. arXiv preprint arXiv:2511.06411.
[16] DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence.
[17] Reinforcement Learning via Value Gradient Flow. arXiv preprint arXiv:2604.14265.
