
arxiv: 2605.21654 · v1 · pith:5D7V4YGGnew · submitted 2026-05-20 · 💻 cs.LG · cs.AI · cs.CL

Value-Gradient Hypothesis of RL for LLMs

Pith reviewed 2026-05-22 08:48 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords reinforcement learning · large language models · value gradients · PPO · GRPO · actor updates · post-training

The pith

Under differentiable rollouts with additive noise, actor updates in critic-free RL for LLMs propagate costates whose expectation equals the value gradient.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a value-gradient perspective on critic-free reinforcement learning for language model post-training. It shows that when rollouts are differentiable and policies use additive-noise parameterization, the backward pass through the actor produces signals whose conditional expectation matches the value gradient. For transformer policies this approximation holds with error bounded by sampling gap and policy entropy. The resulting decomposition into value-gradient strength and reachable reward headroom supplies a criterion for when RL post-training yields the largest gains along a pretraining trajectory.

Core claim

Under a differentiable rollout and additive-noise parameterization, the actor update is value-gradient-like in expectation: the backward pass propagates costates whose conditional expectation equals the value gradient. For discrete transformer policies, autodifferentiation through attention produces empirical costates that approximate this value signal, with an error controlled by the sampling gap and policy entropy. These results motivate a decomposition of RL impact into value gradient signal and reachable reward headroom, yielding a criterion for when RL should be most effective along a pretraining trajectory.

What carries the argument

Differentiable rollout combined with additive-noise policy parameterization, which allows the backward pass to produce costates whose expectation matches the value gradient.
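
The mechanism can be checked numerically on a toy problem. The following is a minimal sketch, not the paper's setup: it assumes scalar linear dynamics, a linear policy with additive Gaussian noise, and a quadratic reward (all hypothetical choices, as are the names `rollout_costate` and `exact_value_gradient`), and it writes the backward pass by hand instead of calling an autodiff library. Because the toy problem is linear-quadratic, the value gradient has a closed form to compare against.

```python
import random


def rollout_costate(s0, a, b, k, sigma, horizon, rng):
    """Sample one rollout and return the costate at t = 0.

    Assumed toy model (not from the paper):
      dynamics  s_{t+1} = a * s_t + b * u_t
      policy    u_t = k * s_t + sigma * eps_t,  eps_t ~ N(0, 1)
      reward    r(s_t) = -s_t ** 2
    """
    # Forward pass: roll the state forward with additive policy noise.
    states = [s0]
    for _ in range(horizon):
        s = states[-1]
        u = k * s + sigma * rng.gauss(0.0, 1.0)
        states.append(a * s + b * u)

    # Backward pass with the noise held fixed:
    #   lambda_T = r'(s_T),  lambda_t = r'(s_t) + c * lambda_{t+1},
    # where c = d s_{t+1} / d s_t = a + b * k and r'(s) = -2 * s.
    c = a + b * k
    costate = 0.0
    for s in reversed(states):
        costate = -2.0 * s + c * costate
    return costate


def exact_value_gradient(s0, a, b, k, horizon):
    """Closed-form dV_0/ds at s0 for the linear-quadratic toy problem."""
    c = a + b * k
    return -2.0 * s0 * sum(c ** (2 * j) for j in range(horizon + 1))


rng = random.Random(0)
params = dict(a=0.9, b=0.5, k=-0.4)
n = 20000
mc = sum(
    rollout_costate(1.0, sigma=0.3, horizon=5, rng=rng, **params)
    for _ in range(n)
) / n
exact = exact_value_gradient(1.0, horizon=5, **params)
print(f"mean costate: {mc:.3f}, value gradient: {exact:.3f}")
```

Each individual rollout produces a different costate because the noise differs; only the average over rollouts started from the same state matches the value gradient, which is the sense of "in expectation" here.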

Load-bearing premise

The rollout must be differentiable and the policy must be parameterized with additive noise.
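
One way to write the premise and the claim in symbols is sketched below. The notation ($f$, $\mu_\theta$, $\sigma$, $\lambda_t$, $V_t$) and the Gaussian choice of noise are assumptions made here for illustration and may differ from the paper's own.

```latex
\begin{align}
a_t &= \mu_\theta(s_t) + \sigma\,\varepsilon_t,
    \qquad \varepsilon_t \sim \mathcal{N}(0, I)
    && \text{(additive-noise policy)} \\
s_{t+1} &= f(s_t, a_t), \qquad f \text{ differentiable}
    && \text{(differentiable rollout)} \\
\lambda_t &= \frac{\partial}{\partial s_t} \sum_{\tau \ge t} r(s_\tau, a_\tau)
    && \text{(costate from the backward pass)} \\
\mathbb{E}\left[\lambda_t \mid s_t\right] &= \nabla_s V_t(s_t)
    && \text{(claimed identity)}
\end{align}
```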

What would settle it

If empirical costates obtained by autodifferentiation through attention deviate from the true value gradient by more than the amount predicted by sampling gap and policy entropy, the central claim is falsified.

Figures

Figures reproduced from arXiv: 2605.21654 by Arip Asadulaev, Daniil Ognev, Karim Salta, Martin Takac.

Figure 1. Real RL gain vs. predicted gain using the value impact formula (Section 5, Eq. 29).
Recently, Large Language Models (LLMs) have achieved state-of-the-art reasoning using Group Relative Policy Optimization (GRPO) (Shao et al., 2024), which discards the critic entirely, yet classical Reinforcement Learning (RL) theory predicts that critic-free methods should fail at long-horizon credit assignment. Why don’t they? In t…
Figure 2. Bound inequality plot: real value gradient gap vs. proposed bound (Section 4, Prop. 2).
Attention is doing the heavy lifting. In an RNN, $h_{t+1}$ depends on $h_t$ only through the recurrence $h_{t+1} = f(h_t, e_{o_t})$. The only pathway is through the discrete token, so the sampling gap blocks all temporal credit flow. In a transformer, $h^{(L)}_{t+1}$ depends on $h^{(L)}_t$ through attention, a direct, differentiable, content-bas…
Figure 3. Results of RL on various OLMo-2 pretraining checkpoints. The left image shows the …
Figure 4. Real RL gain vs. predicted gain using the value impact formula (Section 5, Eq. 29).
Closed-form RL task. For the RL-impact experiment, we use OLMo-2 1B checkpoints from pretraining steps 50k to 1M in increments of 50k. The task is a controlled label-copying problem. Given a prompt containing a target label in {A, B, C, D}, the model must put probability mass on the matching answer token. The reward is R(θ; q…
Figure 5. Z-scores of the gain after RL vs. z-scores of the predicted RL impact. Correlation (left) and …
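
For readers who want to reproduce a comparison of this kind on their own measurements, the sketch below shows how z-scores and a Pearson correlation between predicted impact and observed gain could be computed. The helper names and the numbers in `predicted` and `observed` are placeholders invented for illustration; they are not values from the paper.

```python
import math


def zscores(xs):
    """Standardize a sequence to zero mean and unit (population) variance."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / std for x in xs]


def pearson(xs, ys):
    """Pearson correlation, computed as the mean product of z-scores."""
    zx, zy = zscores(xs), zscores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)


# Placeholder values, one entry per checkpoint (NOT data from the paper).
predicted = [0.10, 0.40, 0.35, 0.80, 0.60]
observed = [0.05, 0.30, 0.40, 0.70, 0.50]

r = pearson(predicted, observed)
print(f"Pearson r between predicted impact and observed gain: {r:.3f}")
```

Standardizing both axes puts predicted and observed values on a common scale without changing their Pearson correlation.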
Original abstract

Reinforcement learning substantially improves pretrained language models, but it remains understudied why critic-free methods such as PPO and GRPO work as well as they do, and when they should provide the largest gains. We develop a value-gradient perspective of critic-free RL for LLM post-training. First, under a differentiable rollout and additive-noise parameterization, we show that the actor update is value-gradient-like in expectation: the backward pass propagates costates whose conditional expectation equals the value gradient. Second, for discrete transformer policies, we show that autodifferentiation through attention produces empirical costates that approximate this value signal, with an error controlled by the sampling gap and policy entropy. These results motivate a decomposition of RL impact into value gradient signal and reachable reward headroom, yielding a criterion for when RL should be most effective along a pretraining trajectory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper develops a value-gradient perspective on critic-free RL for LLM post-training. Under a differentiable rollout and additive-noise parameterization of the policy, it claims that the actor update is value-gradient-like in expectation: the backward pass propagates costates whose conditional expectation equals the value gradient. For discrete transformer policies, autodifferentiation through attention is shown to produce empirical costates that approximate this value signal, with approximation error controlled by sampling gap and policy entropy. These results motivate a decomposition of RL impact into value gradient signal and reachable reward headroom, yielding a criterion for when RL should be most effective along a pretraining trajectory.

Significance. If the central derivation and approximation hold, the work supplies a theoretical explanation for the empirical success of methods such as PPO and GRPO in LLM post-training and identifies conditions under which critic-free RL yields the largest gains. The explicit connection between backward-pass costates and value gradients, together with the proposed decomposition, could guide more principled choices of when and how to apply RL along the pretraining-to-post-training continuum.

major comments (1)
  1. [Results section, discrete transformer policies paragraph] The claim that autodifferentiation through attention produces costates whose expectation approximates the value gradient rests on an error term controlled by sampling gap and policy entropy. No explicit bound, theorem, or quantitative estimate of this error is supplied for the high-entropy, large-vocabulary regime characteristic of standard LLM policies; because the continuous additive-noise case does not directly apply to categorical transformer outputs, this approximation is load-bearing for the central hypothesis yet remains unverified.
minor comments (2)
  1. [Abstract and introduction] The term 'costates' is introduced without a brief definition or reference to its origin in optimal control, which may hinder accessibility for readers whose primary background is in language-model training rather than continuous-time RL.
  2. [Introduction] The manuscript would benefit from an explicit statement of the precise assumptions (e.g., differentiability of the rollout and the form of the additive noise) in a dedicated assumptions subsection or theorem statement.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for identifying this important point regarding the rigor of the approximation in the discrete case. We address the comment below and have revised the manuscript to strengthen the presentation.

Point-by-point responses
  1. Referee: Results section (discrete transformer policies paragraph): The claim that autodifferentiation through attention produces costates whose expectation approximates the value gradient rests on an error term controlled by sampling gap and policy entropy. No explicit bound, theorem, or quantitative estimate of this error is supplied for the high-entropy, large-vocabulary regime characteristic of standard LLM policies; because the continuous additive-noise case does not directly apply to categorical transformer outputs, this approximation is load-bearing for the central hypothesis yet remains unverified.

    Authors: We agree that the manuscript would benefit from greater rigor on this point. The current text states that the error is controlled by the sampling gap and policy entropy but does not supply an explicit theorem or quantitative bound tailored to the high-entropy, large-vocabulary regime of standard LLM policies. The continuous additive-noise derivation provides the core intuition, while the discrete transformer case is presented via the properties of autodifferentiation through attention. In the revised manuscript we have added a new proposition in the results section that supplies an explicit upper bound on the approximation error. The bound is expressed in terms of the total-variation distance induced by the sampling gap and a factor that scales with policy entropy; we also include a short discussion of its behavior under large vocabulary sizes. This addition makes the central hypothesis more self-contained without changing the overall claims or requiring new experiments.

    Revision: yes
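
The two quantities named in this response can be computed for a categorical policy as sketched below. This is illustrative only: the rebuttal does not reproduce the proposition, so the code does not implement the bound itself. It assumes, as one possible reading, that the sampling gap is the difference between the policy's softmax distribution and the one-hot distribution of the sampled token; the logits and helper names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def total_variation(p, q):
    """Total-variation distance between two categorical distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def entropy(p):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


# Hypothetical next-token logits over a four-token vocabulary.
probs = softmax([2.0, 1.0, 0.0, -1.0])

# Assumed reading of the sampling gap: soft distribution vs. sampled one-hot.
sampled = 0
one_hot = [1.0 if i == sampled else 0.0 for i in range(len(probs))]

gap_tv = total_variation(probs, one_hot)  # equals 1 - probs[sampled]
policy_entropy = entropy(probs)           # at most log(vocabulary size)
print(f"TV distance: {gap_tv:.3f}, entropy: {policy_entropy:.3f} nats")
```

Under this reading both quantities shrink as the policy concentrates on a single token, and entropy can grow as large as the log of the vocabulary size, which is why the large-vocabulary regime is the one the referee singles out.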

Circularity Check

0 steps flagged

Derivation is conditional on explicit modeling assumptions with no reduction to inputs by construction

Full rationale

The paper derives that under a differentiable rollout and additive-noise parameterization the backward pass produces costates whose conditional expectation equals the value gradient. This is presented as a mathematical consequence of the stated assumptions rather than a redefinition or statistical fit of the target quantity itself. No equations in the abstract or reader's summary reduce the claimed equality to a fitted parameter or self-referential normalization. The discrete-transformer extension is described as an approximation whose error is controlled by sampling gap and entropy, which is a standard modeling claim rather than circular. No self-citation chains or uniqueness theorems imported from prior author work are invoked as load-bearing. The result is therefore self-contained against the listed assumptions and receives a zero circularity score.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claims rest on modeling assumptions of differentiability and additive noise that are introduced to enable the costate analysis; no free parameters or new entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Rollout is differentiable and the policy uses additive-noise parameterization.
    Invoked to establish that the actor update propagates costates whose conditional expectation equals the value gradient.




Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 5 internal anchors

  1. [1] Reasoning with Sampling: Your Base Model is Smarter Than You Think. arXiv preprint arXiv:2510.14901.
  2. [2] Improving policies without measuring merits. Advances in Neural Information Processing Systems.
  3. [3] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300.
  4. [4] Value-gradient learning. The 2012 International Joint Conference on Neural Networks (IJCNN), 2012.
  5. [5] Stochastic backpropagation and approximate inference in deep generative models. International Conference on Machine Learning, 2014.
  6. [6] Gradient estimation using stochastic computation graphs. Advances in Neural Information Processing Systems.
  7. [7] High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv preprint arXiv:1506.02438.
  8. [8] Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems.
  9. [9] Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
  10. [10] Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
  11. [11] Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems.
  12. [12] Learning Without Critics? Revisiting GRPO in Classical Reinforcement Learning Environments. arXiv.
  13. [13] Demystifying Group Relative Policy Optimization: Its Policy Gradient is a U-Statistic. arXiv preprint arXiv:2603.01162.
  14. [14] Reinforcement pre-training. arXiv preprint arXiv:2506.08007.
  15. [15] Soft-GRPO: Surpassing discrete-token LLM reinforcement learning via Gumbel-reparameterized soft-thinking policy optimization. arXiv preprint arXiv:2511.06411.
  16. [16] DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence.
  17. [17] Reinforcement Learning via Value Gradient Flow. arXiv preprint arXiv:2604.14265.