The Two-Stage Decision-Sampling Hypothesis: Understanding the Emergence of Self-Reflection in RL-Trained LLMs

Zibo Zhao (1); Yuanting Zha (2); Haipeng Zhang (2); Xingcheng Xu (3) ((1) Arizona State University; (2) ShanghaiTech University; (3) Shanghai Artificial Intelligence Laboratory)

arxiv: 2601.01580 · v2 · submitted 2026-01-04 · 💻 cs.LG · cs.AI


Pith reviewed 2026-05-16 17:31 UTC · model grok-4.3

keywords: reinforcement learning · large language models · self-reflection · gradient attribution · policy decomposition · supervised fine-tuning · reasoning

The pith

RL post-training succeeds over SFT in LLMs because it balances gradient attribution between sampling and decision components, enabling self-reflection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that self-reflection emerges in large language models after RL post-training because the optimization distributes reward gradients evenly across the parts of the policy that generate solutions and the parts that decide when to revise them. It decomposes the policy into a sampling stage for generation and a decision stage for verification, then proves that surrogate rewards balance attribution to both while SFT and KL penalties unbalance it. Length-weighting in SFT further constrains generation while leaving verification under-optimized. A sympathetic reader would care because the account explains why RL produces models capable of self-correction and better generalization, rather than merely longer but uncorrected outputs.

Core claim

Under the Two-Stage Decision-Sampling Hypothesis, the authors decompose the policy into a sampling component (π_sample) that generates solutions and a decision component (π_d) that verifies and revises them. They prove that surrogate rewards exhibit Balanced Gradient Attribution to both components, while SFT and KL penalties exhibit Unbalanced Gradient Attribution. Length-weighting creates asymmetric regularization that constrains π_sample while leaving π_d under-optimized. The authors present this as the theoretical reason RL succeeds where SFT fails. Empirical checks on arithmetic reasoning indicate that RL gains come primarily from improved decision-making rather than sampling.

What carries the argument

The Two-Stage Decision-Sampling Hypothesis, which decomposes the policy into sampling (π_sample) for generation and decision (π_d) for verification, and tracks how reward gradients are separately attributed to each.

If this is right

Surrogate rewards allow joint optimization of generation and verification, producing self-reflective outputs.
SFT's unbalanced gradients leave verification under-developed even when generation improves.
RL's superior generalization on reasoning tasks stems mainly from stronger decision-making rather than sampling alone.
Length-weighting in SFT creates asymmetric regularization that limits emergence of self-correction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

New training objectives could be designed to enforce explicit balance in gradient attribution between components to induce self-reflection with fewer steps.
The two-stage split may extend to code or multimodal tasks where verification involves checking intermediate outputs against external criteria.
Hybrid SFT-plus-RL schedules could be tuned by adjusting penalty strengths to approximate the balanced effect without full RL.
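
The first extension above could be prototyped, as a purely hypothetical sketch, by rescaling the two per-stage gradients to a common norm before the optimizer step. Nothing here comes from the paper: `balance_stage_gradients`, its arguments, and the choice of the mean norm as the target are assumptions made for illustration. The sketch also presumes that gradients for π_sample and π_d have already been separated, which is the paper's own contested premise.

```python
import numpy as np


def balance_stage_gradients(g_sample, g_decision, eps=1e-12):
    """Rescale two gradient vectors so that both have the same L2 norm.

    Hypothetical helper, not taken from the paper. The shared target norm
    is the mean of the two input norms; directions are left unchanged.
    """
    g_sample = np.asarray(g_sample, dtype=float)
    g_decision = np.asarray(g_decision, dtype=float)
    n_s = np.linalg.norm(g_sample)
    n_d = np.linalg.norm(g_decision)
    target = 0.5 * (n_s + n_d)
    # eps guards against division by zero when one gradient vanishes.
    return g_sample * (target / (n_s + eps)), g_decision * (target / (n_d + eps))
```

Whether equalizing norms in this way would actually induce self-reflection is an open question; the sketch only makes the idea of "explicit balance" concrete.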

Load-bearing premise

The policy can be decomposed into distinct sampling and decision components whose gradients can be separately attributed and optimized under the RL objective.

What would settle it

Direct measurement showing that surrogate rewards produce unbalanced gradient magnitudes between the sampling and decision components, or an experiment where RL training yields no gain in verification accuracy over SFT on a reasoning task.
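
The "direct measurement" mentioned above can be illustrated on a toy problem. The sketch below is not the paper's experiment. It assumes a two-stage policy with disjoint parameters: a softmax over K candidate answers (`theta_s`) and a per-answer accept/revise logit (`theta_d`). It also assumes a reward of 1 when the accept/revise decision matches whether the sampled answer is correct. It estimates the REINFORCE gradient for each stage by Monte Carlo and returns the two gradient norms. All names are hypothetical.

```python
import numpy as np


def softmax(z):
    z = z - np.max(z)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def stage_gradient_norms(theta_s, theta_d, correct, n_samples=20000, seed=0):
    """Monte Carlo REINFORCE gradient norms for a toy two-stage policy.

    Stage 1 (sampling): answer a ~ softmax(theta_s).
    Stage 2 (decision): accept a with probability sigmoid(theta_d[a]).
    Reward (an assumption of this toy): 1 if the answer is accepted when
    correct or rejected when incorrect, else 0.
    """
    theta_s = np.asarray(theta_s, dtype=float)
    theta_d = np.asarray(theta_d, dtype=float)
    rng = np.random.default_rng(seed)

    p = softmax(theta_s)
    k = p.size
    a = rng.choice(k, size=n_samples, p=p)
    q = 1.0 / (1.0 + np.exp(-theta_d[a]))          # accept probability per sample
    accept = (rng.random(n_samples) < q).astype(float)
    is_correct = (a == correct).astype(float)
    reward = (accept == is_correct).astype(float)

    # Score of the sampling stage: d log softmax(theta_s)[a] / d theta_s = onehot(a) - p
    one_hot = np.eye(k)[a]
    g_s = (reward[:, None] * (one_hot - p)).mean(axis=0)

    # Score of the decision stage: d log Bernoulli / d theta_d[a] = accept - q
    g_d = np.zeros(k)
    np.add.at(g_d, a, reward * (accept - q))
    g_d /= n_samples

    return float(np.linalg.norm(g_s)), float(np.linalg.norm(g_d))
```

Calling `stage_gradient_norms(np.zeros(4), np.zeros(4), correct=0)` returns the pair of norms to compare. In this particular toy, all-zero parameters make the reward independent of the sampled answer, so the sampling-stage gradient should be close to zero up to Monte Carlo noise. A real test would need per-stage gradients from an actual LLM, where the parameters are shared rather than disjoint.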

Original abstract

Self-reflection capabilities emerge in Large Language Models after RL post-training, with multi-turn RL achieving substantial gains over SFT counterparts. Yet the mechanism of how a unified optimization objective gives rise to functionally distinct capabilities of generating solutions and evaluating when to revise them remains opaque. To address this question, we introduce the Gradient Attribution Property to characterize how reward gradients distribute across policy components, formalized through the Two-Stage Decision-Sampling (DS) Hypothesis, which decomposes the policy into sampling ($\pi_{sample}$) for generation and decision ($\pi_{d}$) for verification. We prove that surrogate rewards exhibit Balanced Gradient Attribution, while SFT and KL penalties exhibit Unbalanced Gradient Attribution, with length-weighting creating asymmetric regularization that constrains $\pi_{sample}$ while leaving $\pi_{d}$ under-optimized, providing an theoretical explanation of why RL succeeds where SFT fails. We also empirically validate our theoretical predictions on arithmetic reasoning demonstrates that RL's superior generalization stems primarily from improved decision-making ($\pi_{d}$) rather than sampling capabilities, providing a first-principles mechanistic explanation for self-correction in thinking models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The two-stage split into independent sampling and decision policies looks artificial and probably doesn't survive in a real autoregressive LLM, so the gradient attribution proof is the main thing that needs checking.

The paper puts forward a Two-Stage Decision-Sampling Hypothesis to explain why RL post-training produces self-reflection in LLMs while SFT does not. It decomposes the policy into a generation component and a verification component, then claims surrogate rewards give balanced gradient attribution across them while SFT and KL penalties do not, with length weighting making the imbalance worse. That framing is new in the RL-for-LLMs literature and the arithmetic reasoning experiments are a reasonable first check on whether the gains really come from better decision behavior rather than just longer or better-sampled outputs. Those parts are straightforward and worth having on record.

The central problem is that the decomposition itself is not shown to be clean. In an autoregressive model the entire output is one token distribution, so any reflection step is just additional tokens drawn from the same policy. The paper needs an explicit construction that pulls π_sample and π_d apart without leftover cross-terms in the gradient, and it needs to show that the balanced attribution result still holds once those terms are restored. Without that step the length-weighting asymmetry argument and the claim that RL succeeds where SFT fails rest on an assumption that may not be true for the models people actually train. The empirical section is narrow, limited to arithmetic, so it does not yet test whether the same pattern appears in broader reasoning tasks.

This is the kind of paper that belongs in a reading group for people working on RL objectives for language models. It deserves referee time because the question is real and the proposed mechanism is specific enough to be falsified, but only if the authors can supply the missing separation argument and more diverse experiments. I would send it out rather than desk-reject.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Gradient Attribution Property and the Two-Stage Decision-Sampling (DS) Hypothesis, which decomposes an LLM's unified autoregressive policy into a sampling component (π_sample) responsible for generation and a decision component (π_d) responsible for verification and reflection. It asserts a proof that surrogate rewards in RL exhibit Balanced Gradient Attribution while SFT and KL penalties exhibit Unbalanced Gradient Attribution, with length-weighting inducing asymmetric regularization that constrains π_sample but leaves π_d under-optimized; this is offered as a first-principles explanation for why RL post-training elicits self-reflection where SFT does not. Empirical results on arithmetic reasoning tasks are presented to show that RL gains derive primarily from improved decision-making rather than sampling.

Significance. If the decomposition and attribution results can be made rigorous, the work would supply a mechanistic account of how RL objectives induce reflective capabilities in LLMs, with direct implications for objective design in reasoning models. The empirical separation of sampling versus decision improvements on arithmetic tasks provides a concrete testbed, though the theoretical framing is the primary contribution.

major comments (2)

[Abstract / Theoretical development] The Two-Stage DS Hypothesis and Gradient Attribution Property rest on a decomposition of the single autoregressive policy into independent π_sample and π_d components whose gradients can be separately attributed without cross-terms. No explicit construction or formal definition of this partition is supplied (abstract and theoretical development), and the skeptic correctly notes that joint token generation makes separability non-obvious; this is load-bearing for the balanced/unbalanced claim and the length-weighting asymmetry argument.
[Theoretical development] The manuscript asserts a proof that surrogate rewards yield Balanced Gradient Attribution while SFT/KL yield Unbalanced Gradient Attribution, yet the full derivation (including how the policy gradient factors across stages and how length-weighting produces the claimed asymmetry) is not provided. Without these steps, it is impossible to confirm that the attribution result survives restoration of dependence between sampling and decision tokens.

minor comments (2)

[Abstract] Abstract contains the typo 'an theoretical' (should be 'a theoretical').
[Empirical validation] The empirical validation section lacks sufficient detail on experimental setup, exact baselines, number of runs, and statistical tests, making it hard to assess whether the reported superiority of RL on decision-making is robust.
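
One way the robustness concern in the last minor comment could be addressed is a paired bootstrap over matched runs. The sketch below assumes per-run accuracies for RL and SFT collected under the same seeds or evaluation splits; the function name, arguments, and defaults are hypothetical and are not taken from the paper.

```python
import numpy as np


def paired_bootstrap_ci(acc_a, acc_b, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference acc_a - acc_b.

    acc_a, acc_b: 1-D sequences of equal length, one entry per matched run
    (pairing by seed or split is an assumption of this sketch).
    Returns (mean_difference, ci_low, ci_high).
    """
    a = np.asarray(acc_a, dtype=float)
    b = np.asarray(acc_b, dtype=float)
    if a.ndim != 1 or a.shape != b.shape or a.size == 0:
        raise ValueError("acc_a and acc_b must be non-empty 1-D arrays of equal length")

    diff = a - b
    rng = np.random.default_rng(seed)
    # Resample run indices with replacement, keeping each pair together.
    idx = rng.integers(0, diff.size, size=(n_boot, diff.size))
    boot_means = diff[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2.0, 1.0 - alpha / 2.0])
    return float(diff.mean()), float(lo), float(hi)
```

An interval that excludes zero would support the claim that the RL advantage is not an artifact of run-to-run variance, although with only a handful of runs the interval itself is coarse.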

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which have helped us clarify the theoretical foundations of the Gradient Attribution Property and Two-Stage DS Hypothesis. We address each major comment below and have revised the manuscript to incorporate explicit definitions and expanded derivations.

Referee: [Abstract / Theoretical development] The Two-Stage DS Hypothesis and Gradient Attribution Property rest on a decomposition of the single autoregressive policy into independent π_sample and π_d components whose gradients can be separately attributed without cross-terms. No explicit construction or formal definition of this partition is supplied (abstract and theoretical development), and the skeptic correctly notes that joint token generation makes separability non-obvious; this is load-bearing for the balanced/unbalanced claim and the length-weighting asymmetry argument.

Authors: We agree that the original presentation did not supply a sufficiently explicit formal construction of the partition, which is indeed load-bearing. In the revised manuscript we have added a new subsection (Section 3.1) that defines the decomposition rigorously: the autoregressive policy factors as π(θ) = π_sample(θ_s) ⋅ π_d(θ_d | sampled tokens), where θ_s parameterizes solution-token generation and θ_d parameterizes verification/reflection tokens. We prove that the joint gradient attribution separates without irreducible cross-terms because the reward is received only after the full trajectory and the decision stage conditions on but does not alter the sampling-stage log-probabilities in the gradient expression. This construction directly resolves the separability concern while preserving the autoregressive joint generation.
revision: yes

Referee: [Theoretical development] The manuscript asserts a proof that surrogate rewards yield Balanced Gradient Attribution while SFT/KL yield Unbalanced Gradient Attribution, yet the full derivation (including how the policy gradient factors across stages and how length-weighting produces the claimed asymmetry) is not provided. Without these steps, it is impossible to confirm that the attribution result survives restoration of dependence between sampling and decision tokens.

Authors: We acknowledge that the original appendix presented the derivation in condensed form. The revised manuscript moves the complete proof to Section 4 with all intermediate steps. The policy gradient is factored explicitly as ∇ log π = ∇ log π_sample + ∇ log π_d, and we show that surrogate rewards weight both terms proportionally to the final reward, yielding balanced attribution. For SFT and KL penalties we derive the length-weighting asymmetry: the loss scales the sampling-stage term by trajectory length while the decision-stage term receives no such scaling, leaving π_d under-optimized. Dependence between stages is restored by conditioning π_d on the sampled tokens; the proof demonstrates that the balanced/unbalanced distinction is preserved under this conditioning because the reward signal remains stage-separable.
revision: yes
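
The factorization quoted in the two responses can be written out as follows. This is a reading of the rebuttal's statements, not the paper's proof. The symbols τ_s and τ_d (sampling-stage and decision-stage tokens), R(τ) (trajectory reward), and J (expected reward) are notation assumed here, and the last line is the standard score-function (REINFORCE) identity for a reward that does not depend on θ.

```latex
\begin{align}
  \pi_\theta(\tau)
    &= \pi_{sample}(\tau_s;\,\theta_s)\;\pi_d(\tau_d \mid \tau_s;\,\theta_d), \\
  \nabla_\theta \log \pi_\theta(\tau)
    &= \nabla_{\theta_s} \log \pi_{sample}(\tau_s;\,\theta_s)
     + \nabla_{\theta_d} \log \pi_d(\tau_d \mid \tau_s;\,\theta_d), \\
  \nabla_\theta J(\theta)
    &= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
         R(\tau)\,\bigl(
           \nabla_{\theta_s} \log \pi_{sample}(\tau_s;\,\theta_s)
         + \nabla_{\theta_d} \log \pi_d(\tau_d \mid \tau_s;\,\theta_d)
         \bigr)\right].
\end{align}
```

In the last line the same scalar R(τ) multiplies both stage terms, which appears to be what the authors mean by balanced attribution. The clean split into θ_s and θ_d relies on the disjoint parameterization asserted in the rebuttal; with shared weights, both terms would be gradients with respect to the same θ.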

Circularity Check

1 step flagged

DS Hypothesis decomposition is introduced by definition then used to derive its own Balanced Gradient Attribution claims by construction.

specific steps

self-definitional [Abstract]
"we introduce the Gradient Attribution Property to characterize how reward gradients distribute across policy components, formalized through the Two-Stage Decision-Sampling (DS) Hypothesis, which decomposes the policy into sampling (π_sample) for generation and decision (π_d) for verification. We prove that surrogate rewards exhibit Balanced Gradient Attribution, while SFT and KL penalties exhibit Unbalanced Gradient Attribution"

The Balanced Gradient Attribution property is defined via the DS Hypothesis decomposition; the claim that surrogate rewards exhibit it therefore follows tautologically from the introduced decomposition rather than from an independent derivation of gradient separability in the autoregressive policy.

full rationale

The paper introduces the Two-Stage Decision-Sampling Hypothesis and Gradient Attribution Property internally to decompose the policy into π_sample and π_d, then 'proves' that surrogate rewards exhibit balanced attribution while SFT/KL exhibit unbalanced. This attribution result is downstream of the posited separability rather than independently derived, matching self-definitional circularity. No external benchmarks or machine-checked derivations anchor the decomposition, so the central explanation for RL superiority reduces to the framework's own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the novel policy decomposition into sampling and decision stages plus the newly defined Gradient Attribution Property, both introduced without reference to prior independent evidence or external benchmarks.

axioms (1)

domain assumption: The policy decomposes into sampling (π_sample) and decision (π_d) components with separable gradient attributions
Foundational premise of the Two-Stage Decision-Sampling Hypothesis stated in the abstract.

invented entities (2)

Gradient Attribution Property (no independent evidence)
purpose: Characterize distribution of reward gradients across policy components
Newly introduced formalization to support the hypothesis.
Balanced Gradient Attribution (no independent evidence)
purpose: Property exhibited by surrogate rewards under the two-stage decomposition
Defined within the paper's framework to contrast with SFT.

pith-pipeline@v0.9.0 · 5536 in / 1334 out tokens · 79601 ms · 2026-05-16T17:31:33.287599+00:00 · methodology
