pith. machine review for the scientific record.

arxiv: 2602.01698 · v3 · submitted 2026-02-02 · 💻 cs.CL · cs.LG


Restoring Exploration after Post-Training: Latent Exploration Decoding for Large Reasoning Models

classification 💻 cs.CL cs.LG
keywords: exploration · reasoning · entropy · decoding · models · pass · post-training · accuracy
Abstract

Large Reasoning Models (LRMs) have recently achieved strong mathematical and code reasoning performance through Reinforcement Learning (RL) post-training. However, we show that modern reasoning post-training induces an unintended exploration collapse: temperature-based sampling no longer increases pass@$n$ accuracy. Empirically, the final-layer posteriors of post-trained LRMs exhibit sharply reduced entropy, while the entropy of intermediate layers remains relatively high. Motivated by this entropy asymmetry, we propose Latent Exploration Decoding (LED), a depth-conditioned decoding strategy. LED aggregates intermediate posteriors via cumulative sum and selects depth configurations with maximal entropy as exploration candidates. Without additional training or parameters, LED consistently improves pass@1 and pass@16 accuracy by 0.61 and 1.03 percentage points, respectively, across multiple reasoning benchmarks and models. Furthermore, integrating LED into reinforcement learning, e.g., using GRPO as the rollout strategy, yields faster reward improvement and higher final performance, owing to LED's efficient exploration. Project page: https://github.com/AlbertTan404/LED.
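The abstract's core mechanism can be sketched in a few lines. This is a minimal, hypothetical reading of LED based only on the abstract, not the paper's implementation: given next-token logits from each intermediate layer (e.g., obtained via a logit-lens-style projection, which is an assumption here), aggregate the per-layer posteriors with a cumulative sum over depth, renormalize, and pick the depth configuration whose aggregated posterior has maximal entropy.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def entropy(p, axis=-1, eps=1e-12):
    # Shannon entropy of a probability distribution (nats).
    return -(p * np.log(p + eps)).sum(axis=axis)

def led_next_token_distribution(layer_logits):
    """Sketch of LED's depth selection (names and interface are assumptions).

    layer_logits: array of shape (num_layers, vocab_size) holding each
    intermediate layer's next-token logits for the current decoding step.
    Returns the max-entropy aggregated posterior and its depth index.
    """
    posteriors = softmax(layer_logits, axis=-1)        # per-layer posteriors (L, V)
    cum = np.cumsum(posteriors, axis=0)                # cumulative sum over depth
    cum = cum / cum.sum(axis=-1, keepdims=True)        # renormalize each row to a distribution
    ents = entropy(cum, axis=-1)                       # one entropy value per depth config
    best = int(np.argmax(ents))                        # max-entropy depth = exploration candidate
    return cum[best], best
```

A decoder would then sample the next token from the returned distribution instead of the collapsed final-layer posterior; how LED mixes this with standard decoding across steps is not specified by the abstract.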

This paper has not been read by Pith yet.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Role of Generator Access in Autoregressive Post-Training

    cs.LG · 2026-04 · unverdicted · novelty 5.0

    Limited generator access in autoregressive post-training confines learners to root-start rollouts whose value is bounded by on-policy prefix probabilities, while weak prefix control unlocks richer observations and pro...