Restoring Exploration after Post-Training: Latent Exploration Decoding for Large Reasoning Models

Enver Sangineto; Fiorenzo Parascandolo; Jian Luan; Jianzhong Ju; Qian Cao; Rita Cucchiara; Ruihua Song; Wenhui Tan; Zhenbo Luo

arxiv: 2602.01698 · v3 · submitted 2026-02-02 · 💻 cs.CL · cs.LG

Pith reviewed 2026-05-16 08:41 UTC · model grok-4.3

keywords: large reasoning models · exploration collapse · latent exploration decoding · post-training · reinforcement learning · entropy · decoding strategy · pass@k accuracy

The pith

Latent Exploration Decoding restores exploration in post-trained large reasoning models by selecting high-entropy intermediate posteriors, raising pass rates without added training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Post-training of large reasoning models with reinforcement learning produces an exploration collapse in which temperature sampling stops improving multi-sample accuracy because final-layer posteriors lose entropy. Intermediate layers retain substantially higher entropy, so the useful diversity for exploration still exists inside the model. Latent Exploration Decoding recovers this diversity by computing cumulative sums of posteriors across depths and decoding from the depth configuration that maximizes entropy. The method delivers average gains of 0.61 points on pass@1 and 1.03 points on pass@16 across benchmarks and models. When substituted for standard rollouts inside reinforcement learning loops, it also produces faster reward growth and higher final performance.
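
A small numeric illustration of why temperature scaling has little to work with once the final-layer distribution has collapsed. The logits are made-up values, not measurements from the paper, and `softmax` and `entropy` are helper names introduced here for the sketch.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# ASSUMPTION: toy logits chosen to mimic a collapsed vs. a spread-out posterior.
peaked_logits = [20.0, 0.0, 0.0, 0.0]
flatter_logits = [2.0, 1.0, 0.5, 0.0]

for temperature in (0.6, 1.0, 1.5):
    peaked_h = entropy(softmax(peaked_logits, temperature))
    flatter_h = entropy(softmax(flatter_logits, temperature))
    print(f"T={temperature}: peaked={peaked_h:.2e} nats, flatter={flatter_h:.3f} nats")
```

With a large enough logit gap, moderate temperatures leave the peaked distribution almost deterministic, while the flatter one keeps a usable spread.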

Core claim

The paper shows that RL post-training induces exploration collapse in large reasoning models by driving final-layer posterior entropy close to zero while intermediate layers stay entropic, and that Latent Exploration Decoding recovers performance by selecting, for each token, the depth configuration whose cumulative intermediate posterior has maximal entropy.

What carries the argument

Latent Exploration Decoding (LED), a depth-conditioned decoding procedure that forms cumulative sums of intermediate posteriors and decodes from the maximal-entropy depth configuration.
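
A minimal sketch of one possible reading of this procedure, in plain Python. It is not the authors' implementation; the project repository is the reference for that. Assumptions: each layer's posterior is already available as a probability vector over the vocabulary (this summary does not say how they are obtained), a "depth configuration" is taken to mean "aggregate the last d layers", and the cumulative sum is renormalized into a distribution. The names `cumulative_posteriors`, `select_max_entropy`, and `led_sample` are hypothetical.

```python
import math
import random

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def cumulative_posteriors(layer_probs):
    """Running, renormalized sums of per-layer posteriors.

    layer_probs[0] is the earliest layer considered, layer_probs[-1] the final
    layer. ASSUMPTION: configuration d aggregates the last d layers.
    """
    running = [0.0] * len(layer_probs[0])
    configs = []
    for probs in reversed(layer_probs):  # start from the final layer
        running = [r + p for r, p in zip(running, probs)]
        total = sum(running)
        configs.append([r / total for r in running])
    return configs  # configs[d - 1] uses the last d layers

def select_max_entropy(layer_probs):
    """Return (depth, distribution) for the highest-entropy configuration."""
    configs = cumulative_posteriors(layer_probs)
    best = max(range(len(configs)), key=lambda i: entropy(configs[i]))
    return best + 1, configs[best]

def led_sample(layer_probs, rng):
    """Sample one token id from the selected configuration."""
    _, dist = select_max_entropy(layer_probs)
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]

# Toy example: three layers over a 4-token vocabulary (made-up numbers).
toy_layers = [
    [0.30, 0.30, 0.20, 0.20],  # earlier layer, high entropy
    [0.60, 0.20, 0.10, 0.10],  # middle layer
    [0.97, 0.01, 0.01, 0.01],  # final layer, collapsed
]
depth, dist = select_max_entropy(toy_layers)
token = led_sample(toy_layers, random.Random(0))
```

In a real decoder this selection would run once per generated token; the sketch leaves out the model forward pass entirely.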

If this is right

LED raises pass@1 accuracy by 0.61 percentage points on average across models and benchmarks.
LED raises pass@16 accuracy by 1.03 percentage points on average.
Substituting LED for standard rollouts inside GRPO yields faster reward improvement and higher final model performance.
All gains are obtained with no additional training, parameters, or architectural changes.
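
The pass@1 and pass@16 figures above are multi-sample metrics. This summary does not state the paper's exact evaluation protocol; a common choice, shown here as an assumption, is the unbiased pass@k estimator: with n samples per problem of which c are correct, pass@k = 1 - C(n - c, k) / C(n, k). The function names are introduced for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples drawn, c: samples that were correct, k: evaluation budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

A gain reported in percentage points is the difference between two such averages multiplied by 100.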

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Post-training appears to over-constrain final-layer distributions at the expense of internal representational diversity.
Layer-wise entropy monitoring during post-training could serve as an early-warning signal for exploration collapse.
LED-style selection may transfer to other autoregressive tasks where temperature sampling currently underperforms.
Combining LED with existing techniques such as beam search or self-consistency could produce additive gains.

Load-bearing premise

That choosing depth configurations with highest cumulative entropy will reliably produce more diverse and correct reasoning traces than standard final-layer sampling.

What would settle it

An experiment on the same benchmarks where LED-selected depths yield equal or lower pass@1 and pass@16 accuracy than temperature sampling from the final layer, or where entropy of the chosen configurations shows no correlation with solution correctness.
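
The second test can be made concrete with a plain correlation between the entropy of each selected configuration and a 0/1 correctness label. The sketch below is one hedged way to compute it; the paper's own analysis, if any, may differ. `pearson` is a helper name introduced here, and returning 0.0 when one variable is constant is an illustration choice (the coefficient is undefined in that case).

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # ASSUMPTION: report "no correlation" for constant inputs
    return cov / math.sqrt(var_x * var_y)
```

Here `xs` would hold per-sample entropies and `ys` the matching correctness labels (1 for a correct solution, 0 otherwise).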

Original abstract

Large Reasoning Models (LRMs) have recently achieved strong mathematical and code reasoning performance through Reinforcement Learning (RL) post-training. However, we show that modern reasoning post-training induces an unintended exploration collapse: temperature-based sampling no longer increases pass@$n$ accuracy. Empirically, the final-layer posterior of post-trained LRMs exhibits sharply reduced entropy, while the entropy of intermediate layers remains relatively high. Motivated by this entropy asymmetry, we propose Latent Exploration Decoding (LED), a depth-conditioned decoding strategy. LED aggregates intermediate posteriors via cumulative sum and selects depth configurations with maximal entropy as exploration candidates. Without additional training or parameters, LED consistently improves pass@1 and pass@16 accuracy by 0.61 and 1.03 percentage points across multiple reasoning benchmarks and models. Furthermore, integrating LED into reinforcement learning, e.g., using GRPO as the rollout strategy, yields faster reward improvement and higher final performance, due to the efficient exploration capability of LED. Project page: https://github.com/AlbertTan404/LED.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LED spots entropy collapse in post-trained reasoning models and offers a simple depth-selection fix, but the gains look incremental and the entropy rule needs better controls.

The core observation here is that RL post-training on reasoning models shrinks entropy at the final layer, so temperature sampling stops helping exploration much. The authors respond with Latent Exploration Decoding: it sums intermediate posteriors cumulatively across depths and picks the highest-entropy configuration for the next token. No new parameters or training required. They report steady lifts of 0.61 points on pass@1 and 1.03 on pass@16 across several math and code benchmarks, plus faster reward curves when LED is plugged into the GRPO rollout stage. The GitHub link is a real plus for anyone wanting to check the numbers quickly.

The layer-wise entropy asymmetry is a clean empirical finding and the method stays lightweight, which is the main practical value. The soft spot is the lack of controls on the selection rule itself. Nothing in the write-up tests whether max-entropy depth is better than random depth sampling or other simple heuristics; if those alternatives produce similar lifts, the entropy motivation is not carrying the result. The accuracy deltas are also small enough that they could be sensitive to exact baseline choices or seed variation, and the abstract gives limited detail on statistical testing.

This is aimed at people already running inference or RL loops on math and code models who want cheap tricks to squeeze out a bit more exploration. A reader in that area could try it in an afternoon and see if it helps their setup.

I would send it for peer review. The problem it targets is common, the implementation cost is low, and referees can push for the missing ablations without much extra work from the authors.

Referee Report

1 major / 2 minor

Summary. The paper claims that RL post-training of Large Reasoning Models induces exploration collapse, evidenced by reduced final-layer entropy and the failure of temperature sampling to improve pass@n accuracy. Motivated by higher entropy in intermediate layers, it introduces Latent Exploration Decoding (LED): a training-free method that computes cumulative intermediate posteriors and selects depth configurations maximizing entropy. This yields average gains of 0.61 pp on pass@1 and 1.03 pp on pass@16 across reasoning benchmarks and models, and accelerates reward improvement when used as a rollout strategy in RL (e.g., GRPO).

Significance. If the causal link holds, the result is significant because it identifies a concrete mechanism (final-layer entropy collapse) behind post-training exploration loss and supplies a simple, parameter-free decoding fix that improves both inference and RL training efficiency without new parameters or data. The empirical consistency across models and benchmarks, plus the RL integration, would make it a practical contribution to reasoning model development.
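
To see where rollout diversity enters the RL loop, here is a sketch of the group-normalized advantage commonly described for GRPO: each rollout's reward is centered and scaled by the statistics of its own group. The population standard deviation, the `eps` constant, and the function name are illustration choices, not details taken from the paper. The reading that collapsed rollouts weaken the update signal is an editorial inference from this formula, not a claim the authors make.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group of rollouts for a prompt."""
    n = len(rewards)
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]

# If every rollout in the group gets the same reward, all advantages are zero.
collapsed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# One differing rollout is enough to produce non-zero advantages.
diverse = group_relative_advantages([0.0, 1.0, 0.0, 0.0])
```

Under this formulation a group of identical rollouts contributes no gradient signal, which is one plausible reason a more exploratory rollout strategy could speed up reward growth.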

major comments (1)

[Experiments] The central claim attributes the 0.61/1.03 pp gains specifically to LED's max-entropy selection from cumulative intermediate posteriors. The manuscript reports final-layer entropy collapse but provides no ablation comparing this rule to non-entropy depth heuristics (e.g., random depth selection or fixed-layer variation). This is load-bearing for the training-free claim; without it, the entropy criterion's necessity remains unestablished.

minor comments (2)

[Results] The abstract and results lack details on statistical significance testing, variance across runs, and exact baseline implementations (e.g., standard temperature sampling hyperparameters).
[Method] Notation for cumulative posterior aggregation and depth configuration selection could be clarified with an explicit algorithm or pseudocode box.
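
For the statistical-testing point, one lightweight option is a paired bootstrap over problems, sketched below. It assumes per-problem scores (for example 0/1 pass@1 outcomes) are available for both the baseline and LED on the same problems; the function name, the default resample count, and the percentile interval are assumptions of this sketch rather than anything the paper reports.

```python
import random

def paired_bootstrap_ci(baseline, method, n_resamples=2000, alpha=0.05, seed=0):
    """Mean per-problem difference (method - baseline) with a percentile CI."""
    if len(baseline) != len(method) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(baseline)
    diffs = [m - b for b, m in zip(baseline, method)]
    resampled_means = []
    for _ in range(n_resamples):
        # Resample problems with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(sum(sample) / n)
    resampled_means.sort()
    low_index = int((alpha / 2) * n_resamples)
    high_index = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return sum(diffs) / n, resampled_means[low_index], resampled_means[high_index]
```

An interval that excludes zero would suggest the gain is not explained by problem-level sampling noise alone; variance across decoding seeds would still need separate runs.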

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the single major comment below and will incorporate the requested ablation in the revised manuscript.

Referee: The central claim attributes the 0.61/1.03 pp gains specifically to LED's max-entropy selection from cumulative intermediate posteriors. The manuscript reports final-layer entropy collapse but provides no ablation comparing this rule to non-entropy depth heuristics (e.g., random depth selection or fixed-layer variation). This is load-bearing for the training-free claim; without it, the entropy criterion's necessity remains unestablished.

Authors: We agree that the current experiments do not fully isolate the contribution of the entropy-maximization rule. In the revised version we will add ablations that directly compare LED's max-entropy depth selection against (i) random depth sampling and (ii) fixed-layer baselines, using the same models and benchmarks. These results will quantify whether the entropy criterion is necessary for the reported gains or whether any depth variation suffices.

revision: yes
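
The promised ablation amounts to swapping the selection rule while holding everything else fixed. A hedged sketch of that setup: three interchangeable rules over a list of candidate distributions, one per depth configuration. All names are hypothetical, the candidates are toy numbers, and the entropy of the chosen candidate is only a diagnostic; the ablation itself would compare downstream pass@k.

```python
import math
import random

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def pick_max_entropy(configs, rng):
    """Entropy-maximizing rule: index of the highest-entropy candidate."""
    return max(range(len(configs)), key=lambda i: entropy(configs[i]))

def pick_random(configs, rng):
    """Control (i): a uniformly random candidate."""
    return rng.randrange(len(configs))

def pick_fixed(index):
    """Control (ii): always the same candidate, e.g. the final layer only."""
    def rule(configs, rng):
        return index
    return rule

RULES = {
    "max_entropy": pick_max_entropy,
    "random": pick_random,
    "fixed_final": pick_fixed(0),  # ASSUMPTION: index 0 is the final-layer-only candidate
}

def compare_rules(configs, seed=0):
    """Entropy of the candidate each rule selects, keyed by rule name."""
    rng = random.Random(seed)
    return {name: entropy(configs[rule(configs, rng)]) for name, rule in RULES.items()}

toy_configs = [
    [0.97, 0.01, 0.01, 0.01],
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]
report = compare_rules(toy_configs)
```

Keeping the rules behind one signature makes it straightforward to rerun the same benchmark with each rule and the same seeds.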

Circularity Check

0 steps flagged

No circularity: empirical observation directly motivates heuristic without definitional reduction

The paper's chain begins with an empirical observation of final-layer entropy collapse versus higher intermediate-layer entropy, then directly defines LED as the rule that aggregates cumulative posteriors and selects the maximum-entropy depth configuration. This selection rule is a straightforward heuristic derived from the stated observation; it is not obtained by fitting parameters to the target accuracy metric, nor does any equation equate the output accuracy gain to an input quantity by algebraic identity. No self-citations are invoked to justify uniqueness or to close a derivation loop, and the reported pass@1/pass@16 improvements are presented as measured outcomes rather than predictions forced by the construction itself. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation of entropy asymmetry and the assumption that maximal-entropy intermediate configurations improve exploration; no free parameters or new entities are introduced.

axioms (1)

domain assumption: Entropy asymmetry between final and intermediate layers is a general property of post-trained LRMs that causes exploration collapse. This is presented as an empirical finding that motivates the method.

pith-pipeline@v0.9.0 · 5512 in / 1134 out tokens · 44891 ms · 2026-05-16T08:41:00.950851+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: LED aggregates intermediate posteriors via cumulative sum and selects depth configurations with maximal entropy as exploration candidates

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: exploration depth d=8 ... top-k coverage ratio ... entropy decay

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Role of Generator Access in Autoregressive Post-Training
cs.LG · 2026-04 · unverdicted · novelty 5.0

Limited generator access in autoregressive post-training confines learners to root-start rollouts whose value is bounded by on-policy prefix probabilities, while weak prefix control unlocks richer observations and pro...