Select to Think: Unlocking SLM Potential with Local Sufficiency
Pith reviewed 2026-05-07 08:37 UTC · model grok-4.3
The pith
The top-8 predictions of a 1.5B SLM contain the 32B LLM's preferred token 95 percent of the time at reasoning divergence points.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We address the gap between SLMs and LLMs by identifying local sufficiency: at divergence points, the LLM's preferred token consistently resides within the SLM's top-K next-token predictions. SELECT TO THINK reframes the LLM's role from open-ended generation to selection among the SLM's proposals, simplifying the supervision signal to discrete candidate rankings. Leveraging this, S2T-LOCAL distills the selection logic into the SLM, empowering it to perform autonomous re-ranking without inference-time LLM dependency.
What carries the argument
Local sufficiency: the observation that, at points of divergence, the LLM's preferred token lies within the SLM's top-K predictions. This reduces supervision to a discrete ranking over candidates and makes the selection logic distillable.
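To make the property concrete, here is a minimal sketch (not the authors' pipeline) of how the hit rate could be measured: teacher-force both models over the same text, treat top-1 disagreement as the divergence criterion, and check whether the LLM's choice falls in the SLM's top-8. The Qwen2.5 checkpoints are placeholders, and a shared tokenizer is assumed.

```python
# Minimal sketch of the local-sufficiency check. Divergence points are
# positions where the two models' top-1 choices differ; the hit rate is
# how often the LLM's top-1 token appears in the SLM's top-K there.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

K = 8
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
slm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-32B-Instruct")

@torch.no_grad()
def local_sufficiency_hit_rate(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    slm_logits = slm(ids).logits[0]          # (seq, vocab) next-token logits
    llm_logits = llm(ids).logits[0]
    slm_top1 = slm_logits.argmax(-1)
    llm_top1 = llm_logits.argmax(-1)
    diverge = slm_top1 != llm_top1           # divergence points
    if not diverge.any():
        return 1.0
    topk = slm_logits[diverge].topk(K, dim=-1).indices         # (n, K)
    hits = (topk == llm_top1[diverge].unsqueeze(-1)).any(-1)   # LLM token in top-K?
    return hits.float().mean().item()
```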
If this is right
- The 1.5B SLM with S2T-LOCAL improves greedy decoding performance by 24.1% on average across benchmarks.
- S2T-LOCAL matches the efficacy of 8-path self-consistency while using single-trajectory efficiency.
- Top-8 candidates from the 1.5B SLM capture the 32B LLM's choice with a 95% hit rate.
- SLMs can operate without external LLM calls at inference time after distillation (a decoding sketch follows this list).
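Per the last bullet, a hedged sketch of what single-trajectory decoding with a distilled selector could look like. Both `is_divergence` (the trigger) and `select_head` (the distilled re-ranker) are assumed interfaces standing in for the paper's modules, not its actual code.

```python
# Hedged sketch of S2T-LOCAL-style decoding: greedy by default, with the
# distilled selector re-ranking the SLM's own top-K proposals at steps the
# trigger flags as divergence points. No LLM is called anywhere.
import torch

@torch.no_grad()
def s2t_local_decode(model, select_head, is_divergence, ids, max_new=256, k=8):
    for _ in range(max_new):
        out = model(ids, output_hidden_states=True)
        logits = out.logits[0, -1]                   # next-token logits
        hidden = out.hidden_states[-1][0, -1]        # last-layer state
        if is_divergence(hidden):                    # trigger fires: re-rank
            cand = logits.topk(k).indices            # SLM's top-K proposals
            nxt = cand[select_head(hidden, cand).argmax()]
        else:                                        # ordinary greedy step
            nxt = logits.argmax()
        ids = torch.cat([ids, nxt.view(1, 1)], dim=-1)
        if nxt.item() == model.config.eos_token_id:  # assumes a scalar EOS id
            break
    return ids
```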
Where Pith is reading between the lines
- Many apparent reasoning failures in SLMs may actually be selection failures rather than knowledge gaps.
- Similar top-K sufficiency checks could be explored for other generation domains such as code completion or mathematical proof steps.
- Training data efficiency might increase if models are taught to re-rank their own candidates instead of full imitation.
Load-bearing premise
The local sufficiency property holds consistently across diverse tasks and model scales, allowing the distilled selection logic to generalize without degrading performance on unseen data.
What would settle it
An experiment on a new benchmark or different SLM-LLM pair where the top-K hit rate drops substantially below 95% or where S2T-LOCAL training yields no improvement or negative results.
Original abstract
Small language models (SLMs) offer computational efficiency for scalable deployment, yet they often fall short of the reasoning power exhibited by their larger counterparts (LLMs). To mitigate this gap, current approaches invoke an LLM to generate tokens at points of reasoning divergence, but these external calls introduce substantial latency and costs. Alternatively, standard distillation is often hindered by the capacity limitation, as SLMs struggle to accurately mimic the LLM's complex generative distribution. We address this dilemma by identifying local sufficiency: at divergence points, the LLM's preferred token consistently resides within the SLM's top-K next-token predictions, even when failing to emerge as the SLM top-1 choice. We therefore propose SELECT TO THINK (S2T), which reframes the LLM's role from open-ended generation to selection among the SLM's proposals, simplifying the supervision signal to discrete candidate rankings. Leveraging this, we introduce S2T-LOCAL, which distills the selection logic into the SLM, empowering it to perform autonomous re-ranking without inference-time LLM dependency. Empirically, we demonstrate that a 1.5B SLM's top-8 candidates capture the 32B LLM's choice with 95% hit rate. Translating this potential into performance, S2T-LOCAL improves greedy decoding by 24.1% on average across benchmarks, effectively matching the efficacy of 8-path self-consistency while operating with single-trajectory efficiency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies a 'local sufficiency' property: at reasoning divergence points, the token preferred by a 32B LLM lies within the top-K (K=8) predictions of a 1.5B SLM with 95% hit rate. It proposes Select to Think (S2T), in which the LLM is used only to rank among the SLM's candidate tokens rather than generate freely, and S2T-LOCAL, which distills the ranking logic into the SLM so that it can perform autonomous re-ranking at inference time. Empirically, S2T-LOCAL yields a 24.1% average improvement over greedy decoding across benchmarks while matching the accuracy of 8-path self-consistency at single-trajectory cost.
Significance. If the local-sufficiency property and the distilled selector prove robust, the work offers a practical route to close much of the capability gap between SLMs and LLMs without incurring repeated LLM calls or the latency of multi-path sampling. The single-trajectory efficiency claim is a concrete strength relative to self-consistency and similar ensembles. The result would be of interest to the efficiency and distillation communities provided the generalization evidence is strengthened.
major comments (3)
- [§4] §4 (Experiments): The reported 24.1% average gain and 95% hit rate are presented without per-task standard deviations, confidence intervals, or statistical significance tests against the greedy and self-consistency baselines. This information is load-bearing for the claim that S2T-LOCAL reliably matches 8-path self-consistency.
- [§3.3] §3.3 and §4.3: The distillation procedure for S2T-LOCAL is described at a high level, but the manuscript provides no out-of-distribution test sets, scaling curves across model sizes, or ablation on how divergence points are detected at inference time. These omissions directly affect the central generalization assumption that the distilled selector will remain effective on unseen tasks and scales.
- [§4.1] §4.1: The experimental setup does not specify the exact method used to identify divergence points during data collection for distillation, nor the precise definition of the training distribution over which the 95% hit rate was measured. Without these details the reproducibility of the local-sufficiency observation cannot be verified.
minor comments (2)
- [Abstract] The abstract introduces 'local sufficiency' without a concise formal statement; a one-sentence definition early in the introduction would improve readability.
- [§3] Notation for the top-K candidate set and the selection head should be introduced once and used consistently; occasional shifts between 'candidates' and 'proposals' are distracting.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments, which highlight important aspects of reproducibility and generalization. We address each major comment below with clarifications and commitments to strengthen the manuscript. All requested details and analyses can be incorporated in the revision without altering the core claims.
Point-by-point responses
- Referee: [§4] §4 (Experiments): The reported 24.1% average gain and 95% hit rate are presented without per-task standard deviations, confidence intervals, or statistical significance tests against the greedy and self-consistency baselines. This information is load-bearing for the claim that S2T-LOCAL reliably matches 8-path self-consistency.
  Authors: We agree that statistical rigor would improve the presentation of results. In the revised manuscript, we will report per-task standard deviations across multiple random seeds, include 95% confidence intervals, and add paired statistical significance tests (e.g., t-tests) comparing S2T-LOCAL against greedy decoding and 8-path self-consistency. These additions will directly support the reliability claims. Revision: yes.
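A minimal sketch of the committed analysis, assuming per-seed accuracies on a single task (the numbers below are placeholders, not results): a paired t-test across seeds plus a bootstrap 95% confidence interval on the per-seed gain.

```python
# Paired comparison sketch: S2T-LOCAL vs. 8-path self-consistency on one
# task, five seeds each. Values are illustrative placeholders only.
import numpy as np
from scipy import stats

s2t = np.array([0.81, 0.77, 0.84, 0.79, 0.82])    # per-seed accuracy
base = np.array([0.74, 0.73, 0.78, 0.72, 0.75])   # per-seed accuracy

t, p = stats.ttest_rel(s2t, base)                 # paired t-test across seeds
gain = s2t - base
rng = np.random.default_rng(0)
boot = rng.choice(gain, (10_000, gain.size)).mean(axis=1)   # resample seeds
lo, hi = np.percentile(boot, [2.5, 97.5])         # bootstrap 95% CI of gain
print(f"t = {t:.2f}, p = {p:.4f}, 95% CI of gain: [{lo:.3f}, {hi:.3f}]")
```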
- Referee: [§3.3] §3.3 and §4.3: The distillation procedure for S2T-LOCAL is described at a high level, but the manuscript provides no out-of-distribution test sets, scaling curves across model sizes, or ablation on how divergence points are detected at inference time. These omissions directly affect the central generalization assumption that the distilled selector will remain effective on unseen tasks and scales.
  Authors: We acknowledge that additional evidence on generalization would strengthen the central claims. In the revision, we will add experiments on held-out out-of-distribution tasks, scaling curves for SLM sizes from 1B to 7B, and an ablation study isolating the impact of different divergence detection heuristics at inference time. These will be presented in an expanded §4.3. Revision: yes.
- Referee: [§4.1] §4.1: The experimental setup does not specify the exact method used to identify divergence points during data collection for distillation, nor the precise definition of the training distribution over which the 95% hit rate was measured. Without these details the reproducibility of the local-sufficiency observation cannot be verified.
  Authors: We agree that precise methodological details are essential for reproducibility. The revised manuscript will include a complete description of the divergence point detection algorithm (including exact thresholds and token probability comparisons used during data collection), along with the precise definition of the training distribution, dataset splits, and sampling procedure underlying the 95% hit rate measurement. Revision: yes.
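On the trigger itself, the paper's appendix describes applying a sigmoid to a criticality logit and thresholding at 0.7 (configurable) to make binary triggering decisions, avoiding a cross-model KL computation at every step. A hedged sketch of that mechanism, with the head itself as a placeholder:

```python
# Sketch of the appendix-described trigger: a learned criticality logit,
# sigmoid-activated and thresholded, decides when to invoke re-ranking.
import torch

class CriticalityTrigger(torch.nn.Module):
    def __init__(self, hidden_dim: int, threshold: float = 0.7):
        super().__init__()
        self.head = torch.nn.Linear(hidden_dim, 1)   # criticality logit
        self.threshold = threshold

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (..., hidden_dim) last-layer state at the current step
        prob = torch.sigmoid(self.head(hidden)).squeeze(-1)
        return prob > self.threshold                 # binary trigger decision
```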
Circularity Check
Empirical hit-rate measurements and accuracy gains derive from direct observation, not tautological construction
full rationale
The paper's core claims rest on two direct empirical steps: (1) counting the frequency with which the 32B LLM's chosen token appears inside the 1.5B SLM's top-8 logits at divergence points (reported as a 95% hit rate), and (2) measuring downstream benchmark accuracy after distilling a selection head into the SLM and comparing against greedy and self-consistency baselines. Neither step invokes an equation whose output is algebraically identical to its fitted input, nor does any load-bearing premise reduce to a self-citation whose own justification is unverified. The local-sufficiency property is defined by the observed token-overlap statistic rather than presupposed; the distillation objective is a standard ranking loss whose success is evaluated on held-out task accuracy. No uniqueness theorem, ansatz smuggling, or renaming of prior results is required to reach the reported 24.1% average improvement. The evidential chain is therefore anchored to external benchmarks rather than closed on itself.
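To illustrate the "standard ranking loss" characterization: if the LLM-preferred token occupies some index within the SLM's top-K candidate list, distillation reduces to a K-way cross-entropy over the selector's scores. The scorer below is a placeholder interface, not the paper's architecture.

```python
# K-way selection distillation sketch: score the top-K candidates, then
# cross-entropy against the index of the LLM-preferred token in that list.
import torch
import torch.nn.functional as F

class SelectionHead(torch.nn.Module):
    """Placeholder scorer over the SLM's top-K candidates."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = torch.nn.Linear(hidden_dim, hidden_dim)

    def forward(self, hidden: torch.Tensor, cand_emb: torch.Tensor):
        # hidden: (B, H) step state; cand_emb: (B, K, H) candidate embeddings
        return torch.einsum("bh,bkh->bk", self.proj(hidden), cand_emb)

def selection_loss(scores: torch.Tensor, llm_idx: torch.Tensor) -> torch.Tensor:
    # scores: (B, K); llm_idx: (B,) position of the LLM's token in the top-K
    return F.cross_entropy(scores, llm_idx)
```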
Axiom & Free-Parameter Ledger
free parameters (1)
- top-K value = 8
invented entities (1)
- local sufficiency (no independent evidence)
Reference graph
Works this paper leans on
- [1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [2] Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023.
- [3] Chen, X., Sun, Z., Wenjin, G., Zhang, M., Chen, Y., Sun, Y., Su, H., Pan, Y., Klakow, D., Li, W., and Shen, X. Unveiling the key factors for distilling chain-of-thought reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 15094–15119, 2025.
- [4] Chuang, Y.-S., Xie, Y., Luo, H., Kim, Y., Glass, J., and He, P. DoLa: Decoding by contrasting layers improves factuality in large language models. arXiv preprint arXiv:2309.03883, 2023.
- [5] Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- [6] Fu, T., Ge, Y., You, Y., Liu, E., Yuan, Z., Dai, G., Yan, S., Yang, H., and Wang, Y. R2R: Efficiently navigating divergent reasoning paths with small-large model token routing. arXiv preprint arXiv:2505.21600, 2025.
- [7] Huang, C., Zheng, T., Huang, L., Li, J., Liu, H., and Huang, J. RelayLLM: Efficient reasoning via collaborative decoding. arXiv preprint arXiv:2601.05167.
- [8] Kang, Z., Zhao, X., and Song, D. Scalable best-of-N selection for large language models via self-certainty. arXiv preprint arXiv:2502.18581, 2025.
- [9] Kim, Y., Yi, E., Kim, M., Yun, S.-Y., and Kim, T. Guiding reasoning in small language models with LLM assistance. arXiv preprint arXiv:2504.09923, 2025.
- [10] Li, Y., Yue, X., Xu, Z., Jiang, F., Niu, L., Lin, B. Y., Ramasubramanian, B., and Poovendran, R. Small models struggle to learn from strong reasoners. arXiv preprint arXiv:2502.12143, 2025.
- [11] Lin, Z., Liang, T., Xu, J., Lin, Q., Wang, X., Luo, R., Shi, C., Li, S., Yang, Y., and Tu, Z. Critical tokens matter: Token-level contrastive estimation enhances LLM's reasoning capability. arXiv preprint arXiv:2411.19943, 2024.
- [12] Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.
- [13] Manvi, R., Hong, J., Seyde, T., Labonne, M., Lechner, M., and Levine, S. Zip-RC: Zero-overhead inference-time prediction of reward and cost for adaptive and interpretable generation. arXiv preprint arXiv:2512.01457.
- [14]
- [15] Pan, R., Dai, Y., Zhang, Z., Oliaro, G., Jia, Z., and Netravali, R. SpecReason: Fast and accurate inference-time compute via speculative reasoning. arXiv preprint arXiv:2504.07891, 2025.
- [16] Qwen. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
- [17] Schick, T. and Schütze, H. It's not just size that matters: Small language models are also few-shot learners. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2339–2352, 2021.
- [18] Shi, W., Han, X., Lewis, M., Tsvetkov, Y., Zettlemoyer, L., and Yih, W.-t. Trusting your evidence: Hallucinate less with context-aware decoding. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pp. 783–791, 2024.
- [19] Shnitzer, T., Ou, A., Silva, M., Soule, K., Sun, Y., Solomon, J., Thompson, N., and Yurochkin, M. Large language model routing with benchmark datasets. arXiv preprint arXiv:2309.15789, 2023.
- [20] Tang, Y., Dong, L., Hao, Y., Dong, Q., Wei, F., and Gu, J. Multiplex thinking: Reasoning via token-wise branch-and-merge. arXiv preprint arXiv:2601.08808.
- [21] Team, G., Riviere, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ramé, A., et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024.
- [22] Varshney, N. and Baral, C. Model cascading: Towards jointly improving efficiency and accuracy of NLP systems. arXiv preprint arXiv:2210.05528, 2022.
- [23] Yang, A., Zhang, B., Hui, B., Gao, B., Yu, B., Li, C., Liu, D., Tu, J., Zhou, J., Lin, J., et al. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024.
- [24] Zhang, Y. and Math-AI, T. American Invitational Mathematics Examination (AIME) 2025, 2025.