Reasoning with Sampling: Cutting at Decision Points

Anay Mehrotra; Felix Zhou; Quanquan C. Liu

arXiv: 2605.30327 · v1 · pith:EHXUG5PRnew · submitted 2026-05-28 · 💻 cs.LG · cs.AI · cs.CL · math.ST · stat.ML · stat.TH

Pith reviewed 2026-06-29 08:14 UTC · model grok-4.3

keywords: reasoning sampling · power distribution · metropolis-hastings · entropy cuts · decision points · language model reasoning · mixing time · trace resampling

The pith

Entropy jumps let samplers cut reasoning traces at decision points, scaling mixing time to the number of decisions instead of tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that sampling from a power distribution over a base language model's outputs can produce strong reasoning without reinforcement learning or curated data. Prior samplers cut traces uniformly at random, which mostly rewrites local details rather than revisiting the few key decisions that shape a solution. The Entropy-Cut Metropolis-Hastings method instead selects cut positions where next-token entropy jumps, treating these as proxies for consequential choices. In a stylized model of reasoning, this change makes the sampler's mixing time grow with the number of decisions rather than total trace length. The approach yields higher accuracy than both uniform-cut baselines and RL-trained models on MATH500, HumanEval, GPQA Diamond, and AIME26.
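The cut-selection idea can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `beta` and `eps` parameters, and the specific rule (cut probability proportional to the positive increase in next-token entropy, raised to a power) are assumptions made for illustration, since this summary does not state the paper's exact cut law.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_jump_cut_law(next_token_dists, beta=1.0, eps=1e-6):
    """Hypothetical cut law: P(cut at t) is proportional to
    (eps + max(0, H_t - H_{t-1})) ** beta, where H_t is the entropy of the
    next-token distribution at position t. H_{-1} is taken to be 0 (assumption).
    """
    H = [entropy(d) for d in next_token_dists]
    jumps = [H[0]] + [max(0.0, H[t] - H[t - 1]) for t in range(1, len(H))]
    weights = [(eps + j) ** beta for j in jumps]
    total = sum(weights)
    return [w / total for w in weights]

# Toy trace: three near-deterministic positions and one uncertain position (index 2).
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
toy_dists = [confident, confident, uncertain, confident]
toy_law = entropy_jump_cut_law(toy_dists)
```

In this toy trace most of the cut probability lands on the single uncertain position, whereas a uniform cut law would give each of the four positions probability 0.25.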

Core claim

The central claim is that entropy jumps in the base model's next-token predictions identify consequential decision points, so that an Entropy-Cut Metropolis-Hastings sampler mixes to the target power distribution in time proportional to the number of such points; this produces better reasoning performance than uniform random cuts or trained models on the listed benchmarks.
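For readers who want the objects in this claim written out, the following is a sketch in assumed notation, not notation taken from the paper. Here $p$ is the base model, $x$ a complete trace, $\alpha \ge 1$ the power exponent, $t$ the cut index, and $c(t \mid x)$ the probability of choosing cut $t$ given trace $x$. The proposal keeps the prefix $x_{<t}$ and resamples the suffix from $p$, so $x'_{<t} = x_{<t}$, and the cut index is treated as part of the proposal.

```latex
% Power distribution over complete traces (assumed notation).
\[
  \pi_\alpha(x) \;=\; \frac{p(x)^{\alpha}}{\sum_{y} p(y)^{\alpha}}
\]
% Metropolis-Hastings acceptance probability for a suffix-resampling proposal
% that cuts at index t and draws the new suffix from the base model p.
\[
  A(x \to x') \;=\; \min\left\{ 1,\;
    \left( \frac{p(x'_{\ge t} \mid x_{<t})}{p(x_{\ge t} \mid x_{<t})} \right)^{\alpha - 1}
    \cdot \frac{c(t \mid x')}{c(t \mid x)} \right\}
\]
```

The exponent is $\alpha - 1$ rather than $\alpha$ because one factor of the suffix probability cancels against the proposal density. For a uniform cut law over traces of equal length the factor $c(t \mid x') / c(t \mid x)$ equals 1; for a trace-dependent cut law such as an entropy-based one it generally does not.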

What carries the argument

Entropy-Cut Metropolis-Hastings algorithm that proposes cut positions according to jumps in the base model's next-token entropy and resamples the suffix from those positions.
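A single transition of such a sampler might look like the sketch below. It is a toy under stated assumptions, not the paper's algorithm: `next_token_dist` is a hypothetical two-token base model standing in for an LLM, `LENGTH` and `DECISIONS` are invented for the example, the cut law is the illustrative entropy-jump rule with exponent `beta`, and the acceptance rule is the standard Metropolis-Hastings ratio for a suffix-resampling proposal targeting p(x)^alpha.

```python
import math
import random

LENGTH = 8          # toy trace length (assumption)
DECISIONS = {0, 4}  # toy positions where the base model is genuinely uncertain

def next_token_dist(prefix):
    """Toy base model over tokens {0, 1}; stands in for an LLM's next-token softmax."""
    if len(prefix) in DECISIONS:
        return [0.5, 0.5]
    # Non-decision positions strongly prefer copying the previous token.
    return [0.9, 0.1] if prefix[-1] == 0 else [0.1, 0.9]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def cut_law(x, beta=1.0, eps=1e-3):
    """Hypothetical cut law: weight (eps + positive entropy jump) ** beta per position."""
    H = [entropy(next_token_dist(x[:t])) for t in range(len(x))]
    jumps = [H[0]] + [max(0.0, H[t] - H[t - 1]) for t in range(1, len(H))]
    w = [(eps + j) ** beta for j in jumps]
    z = sum(w)
    return [v / z for v in w]

def suffix_logprob(x, t):
    """Base-model log-probability of tokens x[t:], conditioned on x[:t]."""
    return sum(math.log(next_token_dist(x[:s])[x[s]]) for s in range(t, len(x)))

def mh_step(x, alpha, beta, rng):
    """One MH transition targeting p(x) ** alpha; returns (new_trace, accepted)."""
    c_x = cut_law(x, beta)
    t = rng.choices(range(len(x)), weights=c_x)[0]
    proposal = list(x[:t])
    while len(proposal) < LENGTH:  # resample the suffix from the base model
        proposal.append(rng.choices([0, 1], weights=next_token_dist(proposal))[0])
    c_new = cut_law(proposal, beta)
    # (alpha - 1) * suffix log-ratio, plus the cut-law correction c(t|x') / c(t|x).
    log_ratio = (alpha - 1.0) * (suffix_logprob(proposal, t) - suffix_logprob(x, t))
    log_ratio += math.log(c_new[t]) - math.log(c_x[t])
    accepted = rng.random() < math.exp(min(0.0, log_ratio))
    return (proposal, True) if accepted else (list(x), False)
```

A chain is run by calling `mh_step` repeatedly, for example starting from `[0] * LENGTH` with `rng = random.Random(0)`. Including the `c(t|x') / c(t|x)` term matters whenever the cut law depends on the trace; in this toy the entropies depend only on position, so the term is always 1.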

If this is right

Mixing time scales with the number of decisions rather than the number of tokens.
The sampler revisits and revises high-impact choices more often than local details.
Performance improves consistently over both uniform-cut sampling and RL-trained models on MATH500, HumanEval, GPQA Diamond, and AIME26.
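The intuition behind the first two points can be checked with a back-of-the-envelope calculation. The sketch below is not the paper's stylized model or its proof; it only compares how often a single proposed cut lands on a decision point under a uniform cut law versus a weighted one. The trace length, decision positions, and weights are invented for illustration.

```python
def hit_probability(weights, decision_positions):
    """Probability that one cut, drawn proportionally to `weights`, lands on a decision point."""
    total = sum(weights)
    return sum(weights[t] for t in decision_positions) / total

def expected_proposals_until_hit(p):
    """Mean of a geometric distribution: expected number of cuts until the first hit."""
    return 1.0 / p

# Hypothetical trace: 1000 tokens, 5 decision points.
n_tokens = 1000
decisions = [100, 300, 500, 700, 900]

uniform_weights = [1.0] * n_tokens
# Assumed entropy-style weights: decision points get weight 1.0, other tokens 0.01.
weighted = [0.01] * n_tokens
for t in decisions:
    weighted[t] = 1.0

p_uniform = hit_probability(uniform_weights, decisions)
p_weighted = hit_probability(weighted, decisions)
```

With these invented numbers a uniform cut hits a decision point with probability 5/1000, so about 200 proposals are needed on average before one does, while the weighted law hits one roughly a third of the time. How closely a real entropy signal approaches this idealized weighting is exactly the premise examined further down.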

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If entropy reliably flags decisions, models could be encouraged during pretraining to produce sharper entropy signals at those points.
The same cut-selection idea might apply to other sampling problems where a small number of choices dominate the output distribution.
Human studies that label decision points could be used to test or refine the entropy proxy on new domains.

Load-bearing premise

Jumps in the base model's next-token entropy reliably mark the consequential decision points in a reasoning trace rather than incidental local variations.

What would settle it

An experiment in which uniform random cuts achieve the same benchmark scores as entropy-selected cuts, or in which human-labeled decision points show no correlation with entropy jumps.
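One way the second test could be run is sketched below, assuming per-position entropy-jump scores and binary human labels (1 for a labeled decision point, 0 otherwise) are available. The choice of AUROC as the agreement measure and the function name `auroc` are assumptions for illustration; the paper may use a different analysis.

```python
def auroc(scores, labels):
    """Probability that a randomly chosen labeled decision point has a higher
    score than a randomly chosen non-decision position (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical data: entropy jumps at four positions and their human labels.
example_scores = [0.9, 0.1, 0.8, 0.2]
example_labels = [1, 0, 1, 0]
example_auc = auroc(example_scores, example_labels)
```

A value near 0.5 would indicate that entropy jumps carry no information about the labeled decision points, which is the outcome that would undercut the premise; a value near 1.0 would support it.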

Figures

Figures reproduced from arXiv: 2605.30327 by Anay Mehrotra, Felix Zhou, Quanquan C. Liu.

**Figure 1.** Entropy-Cut MH revises reasoning traces at decision points and improves accuracy. Left: uniform cuts can splice proposals inside a local calculation, producing suffixes that only rewrite nearby tokens. Entropy-Cut instead cuts near high-uncertainty reasoning steps, allowing the proposal to reconsider the underlying continuation. Right: on Qwen2.5-7B, this targeted proposal improves accuracy over standard s…

**Figure 3.** Illustration of a reasoning tree. Chain nodes (gray) are positions where the next token is …

**Figure 4.** Entropy-Cut MH (ours) samples from higher-probability regions than baselines. We plot Qwen2.5-Math-7B log-likelihoods for MATH500 responses across samplers. The 25th (Q1), 50th (Q2; median), and 75th (Q3) percentile values are marked with dashed, solid, and dotted vertical lines, respectively.

**Figure 6.** Entropy-Cut MH (ours) samples from high-probability regions. We plot Qwen2.5-Math-7B log-likelihoods for MATH500 responses across samplers.

**Figure 7.** Entropy-Cut MH (ours) samples from high-confidence regions. We plot Qwen2.5-Math-7B average confidence values for MATH500 responses across samplers.

**Figure 8.** Entropy-Cut MH is stable to the choice of power exponent α. We plot Qwen2.5-Math-7B scores for MATH500 across various values of α generated by Entropy-Cut MH.

**Figure 9.** Entropy-Cut MH is stable to the choice of cut law exponent β. We plot Qwen2.5-Math-7B scores for MATH500 across various values of β generated by Entropy-Cut MH.

**Figure 10.** Entropy-Cut MH improves with additional transition steps N_MCMC. We plot Qwen2.5-Math-7B scores for MATH500 across various values of N_MCMC generated by Entropy-Cut MH.

**Figure 11.** Entropy-Cut MH has running times comparable to other power sampling methods. We plot Qwen2.5-Math-7B average time to generate a solution per MATH500 question across various (power) sampling algorithms.

Original abstract

Frontier reasoning models are produced by posttraining base language models with reinforcement learning. Recent work has challenged this by showing that sampling from a sharpened version of the base model's distribution, a so-called power distribution, elicits comparable reasoning without additional training, curated datasets, or verifiers. However, making this method practical requires efficiently sampling from the power distribution. A sampler needs to "mix" to the power distribution, which necessitates moving between modes of the target distribution; intuitively, e.g., trying different reasoning strategies. The samplers proposed in prior works repeatedly select a "cut" position in the current reasoning trace uniformly at random and resample the suffix from that position onward. However, reasoning traces typically contain a few consequential decisions (e.g., the choice of proof strategy or algorithm), and we observe that a uniformly chosen cut tends to rewrite local details rather than revisit decision points. We introduce an algorithm (Entropy-Cut Metropolis-Hastings) that uses the base model's next-token entropy as a proxy to identify key decision points and resample from those positions. We empirically verify that entropy jumps are a useful proxy for decision points and, in a stylized model of reasoning, prove that our method's mixing time scales with the number of decisions in a trace rather than with the number of tokens, which can be much larger. Across MATH500, HumanEval, GPQA Diamond, and AIME26, our method consistently improves over baselines and RL-trained models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Entropy-cut MH sampling targets high-entropy positions to improve mixing on power distributions, but the gains depend on whether those positions actually mark consequential decisions.

The paper replaces uniform random cuts with entropy-driven ones inside Metropolis-Hastings sampling from the power distribution. The claim is that this hits the few real decision points in a reasoning trace instead of local rewrites, and they back it with a stylized-model proof that mixing time scales with the number of decisions rather than total tokens.

The new piece is the entropy proxy plus that decision-count scaling bound. The experiments report consistent lifts on MATH500, HumanEval, GPQA Diamond, and AIME26 over both uniform-cut baselines and some RL-trained models, which is direct evidence worth checking.

The soft spot is the proxy itself. If entropy jumps mostly flag local syntactic uncertainty instead of branch points that separate reasoning modes, the sampler will still waste steps on non-decision suffixes and the claimed O(#decisions) mixing will not appear. The abstract states they verified the proxy empirically and proved the bound, but the provided details on controls, statistical tests, and the exact stylized model are thin, so it is hard to judge how much the reported gains actually rely on the assumption holding.

This is for groups working on inference-time methods that try to extract reasoning from base models without RL. A reader focused on sampling algorithms or power-distribution approximations would find the cut-selection change and the mixing argument useful. It deserves a serious referee because the technical angle is distinct from the uniform-cut papers it cites and the empirical results are concrete enough to evaluate.

Referee Report

2 major / 2 minor

Summary. The paper claims that Entropy-Cut Metropolis-Hastings, which selects cut positions for suffix resampling using jumps in the base model's next-token entropy as a proxy for decision points, enables efficient sampling from the power distribution. In a stylized model it proves that mixing time scales with the number of decisions rather than tokens; empirically it reports consistent gains over uniform-cut baselines and RL-trained models on MATH500, HumanEval, GPQA Diamond, and AIME26.

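Significance. If the mixing-time result and the validity of the entropy proxy hold, the work supplies a training-free route to strong reasoning performance that directly challenges the necessity of RL post-training. It is not parameter-free: the power exponent α, the cut law exponent β, and the number of Metropolis-Hastings steps still have to be chosen, although the reported ablations suggest accuracy is stable across α and β. The explicit scaling proof and the multi-benchmark comparison are concrete strengths that would be of broad interest if substantiated.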

major comments (2)

[Stylized model section (proof of mixing time)] The stylized-model proof that mixing time scales with the number of decisions (rather than tokens) is load-bearing for the efficiency claim, yet the manuscript supplies no definition of the model, no statement of its assumptions, and no derivation steps. Without these, it is impossible to verify whether the claimed O(#decisions) bound actually follows or whether it relies on the very entropy-decision coincidence that the skeptic questions.
[Empirical verification of entropy proxy] The central modeling assumption—that entropy jumps reliably mark consequential branch points whose resampling mixes distinct reasoning modes—is required for both the theoretical scaling and the empirical gains. The abstract asserts empirical verification of the proxy, but the manuscript provides no quantitative analysis showing that high-entropy positions correspond to strategy-level changes rather than local syntactic variation; this gap directly affects whether the sampler outperforms uniform-cut baselines on the target distribution.

minor comments (2)

[Experimental results] The abstract and results sections should report the precise number of samples, temperature settings, and statistical significance tests used for the benchmark comparisons.
[Method description] Notation for the power distribution and the Metropolis-Hastings acceptance ratio should be introduced explicitly with equations rather than left implicit.
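As an illustration of the kind of significance reporting the first minor comment asks for, the sketch below computes a paired bootstrap confidence interval for the accuracy difference between two samplers evaluated on the same questions. This is one common choice, not something the paper is stated to use; the function name, the default of 2000 resamples, and the percentile method are assumptions.

```python
import random

def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for mean(correct_a) - mean(correct_b).

    correct_a and correct_b are 0/1 lists giving per-question correctness of two
    samplers on the same questions, in the same order.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both samplers must be scored on the same questions")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    means = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((1.0 - level) / 2.0 * n_resamples)
    hi_idx = int((1.0 + level) / 2.0 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]
```

If the resulting interval excludes 0, the accuracy gap between the two samplers is unlikely to be an artifact of which questions happened to be in the benchmark. Resampling pairs rather than each sampler independently accounts for the fact that both are scored on identical questions.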

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your thorough review and constructive comments. We will revise the manuscript to address the concerns regarding the stylized model and the empirical verification of the entropy proxy, as detailed in our point-by-point responses below.

Referee: [Stylized model section (proof of mixing time)] The stylized-model proof that mixing time scales with the number of decisions (rather than tokens) is load-bearing for the efficiency claim, yet the manuscript supplies no definition of the model, no statement of its assumptions, and no derivation steps. Without these, it is impossible to verify whether the claimed O(#decisions) bound actually follows or whether it relies on the very entropy-decision coincidence that the skeptic questions.

Authors: We agree that the current presentation of the stylized model lacks sufficient detail for independent verification. In the revised version, we will provide a complete definition of the stylized model, explicitly state all assumptions, and include full derivation steps for the mixing time result. This will demonstrate that the O(#decisions) scaling derives from the structure of decision points in the model and does not presuppose the entropy proxy. revision: yes
Referee: [Empirical verification of entropy proxy] The central modeling assumption—that entropy jumps reliably mark consequential branch points whose resampling mixes distinct reasoning modes—is required for both the theoretical scaling and the empirical gains. The abstract asserts empirical verification of the proxy, but the manuscript provides no quantitative analysis showing that high-entropy positions correspond to strategy-level changes rather than local syntactic variation; this gap directly affects whether the sampler outperforms uniform-cut baselines on the target distribution.

Authors: We acknowledge the need for more rigorous quantitative validation of the entropy proxy. While the manuscript includes some empirical verification, we will enhance this section in the revision by adding quantitative analyses, such as correlations between entropy jumps and annotated strategy changes, or comparisons of resampling outcomes at high- vs. low-entropy positions, to better distinguish strategy-level decisions from syntactic variations. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation and claims are self-contained

The paper introduces Entropy-Cut MH by using next-token entropy as a proxy for decision points, proves an O(#decisions) mixing-time bound in an explicit stylized model, and reports benchmark gains on MATH500/HumanEval/GPQA/AIME26. No equations, fitted parameters, or self-citations are shown that would make any claimed prediction or scaling result equivalent to its own inputs by construction; the entropy-proxy assumption is stated as such rather than derived from the target distribution, and the empirical results are measured against external baselines and RL models.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that next-token entropy reliably marks consequential decisions and that the stylized model captures the essential branching structure of real LLM reasoning traces.

axioms (1)

domain assumption: Next-token entropy serves as a useful proxy for consequential decision points in reasoning traces.
The Entropy-Cut algorithm selects cut positions on the basis of this proxy; the abstract states that the authors empirically verify the proxy but provides no further justification.

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages · 1 internal anchor

[1]

DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

arXiv: 2209.02001. URL: https://arxiv.org/abs/2209.02001 (cit. on p. 5). [CBIL+23] Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating Large Language Model Decoding with Speculative Sampling. 2023. arXiv: 2302.01318 [cs.CL]. URL: https://arxiv.org/abs/2302.01318 (cit. on p. 18). [CBSP+25] ...

work page doi:10.1038/s41586-025-09422-z 2023
[2]

Fast Inference from Transformers via Speculative Decoding

OpenReview.net, 2024. URL: https://openreview.net/forum?id=v8L0pN6EOi (cit. on p. 27). [LKC25] Marvin Li, Aayush Karan, and Sitan Chen. Blink of an Eye: A Simple Theory for Feature Local- ization in Generative Models. 2025. arXiv: 2502.00921. URL: https://arxiv.org/abs/ 2502.00921 (cit. on pp. 4, 19). [LKM23] Yaniv Leviathan, Matan Kalman, and Yossi Matia...

work page arXiv 2024
[3]

Spurious Rewards: Rethinking Training Signals in RLVR

arXiv: 2506.10947 [cs.AI] (cit. on pp. 1, 4). [WYGZ+25] Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, and Junyang Lin. “Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Lea...

work page internal anchor Pith review Pith/arXiv arXiv 2025
