
arxiv: 2605.25745 · v1 · pith:I3MJLGEXnew · submitted 2026-05-25 · 💻 cs.CL

Selective Latent Thinking: Adaptive Compression of LLM Reasoning Chains

Pith reviewed 2026-06-29 21:39 UTC · model grok-4.3

classification 💻 cs.CL
keywords latent reasoning · chain-of-thought · LLM efficiency · selective compression · reasoning chains · mathematical reasoning · adaptive compression

The pith

Selective Latent Thinking compresses only redundant spans in LLM reasoning chains into latent form while preserving precision-critical steps as explicit text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Selective Latent Thinking to improve the efficiency of LLM reasoning without the accuracy losses seen in uniform compression methods. Explicit chain-of-thought traces deliver strong performance on math tasks but grow long and expensive during inference. Uniform latent approaches shorten traces but often degrade results by compressing steps that need precision. SLT instead anticipates upcoming spans with a lightweight decoder, applies confidence gating to pick the longest reliably compressible segment, and encodes only those segments into compact latent representations. A three-stage training process teaches the model when to compress and when to stay explicit.
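As a rough illustration of why interleaving shortens a trace, the sketch below counts decoding steps for a trajectory split into spans. It assumes each compressed span costs a single latent block (as the Figure 2 caption describes) and each explicit token costs one step. The `Span` type, the example spans, and this cost model are illustrative assumptions, not the paper's own accounting.

```python
from dataclasses import dataclass


@dataclass
class Span:
    num_tokens: int   # length of the span if written out as explicit tokens
    compressed: bool  # True if the span is encoded as one latent block


def reasoning_steps(spans):
    """Count decoding steps: 1 per latent block, 1 per explicit token."""
    return sum(1 if s.compressed else s.num_tokens for s in spans)


def length_reduction(spans):
    """Fraction of steps saved relative to a fully explicit trace."""
    explicit = sum(s.num_tokens for s in spans)
    return 1.0 - reasoning_steps(spans) / explicit


# Hypothetical trajectory: two redundant spans compressed, one precise span kept.
trajectory = [Span(6, True), Span(4, False), Span(8, True)]
print(reasoning_steps(trajectory))             # 1 + 4 + 1 = 6
print(round(length_reduction(trajectory), 3))  # 1 - 6/18 -> 0.667
```

Under this toy model, savings come only from spans the gate accepts, so a trajectory dominated by precision-critical spans would see little reduction.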

Core claim

SLT shows that reasoning trajectories contain a mix of redundant spans that can be safely encoded as latent vectors and precision-critical spans that must remain in explicit form. The framework learns a selective policy through span-level compression training, reliability-aware future prediction, and trajectory-level reinforcement learning that optimizes the joint objective of answer correctness and reduced reasoning cost.

What carries the argument

Confidence-based gating after lightweight decoder span anticipation, which selects the longest safe span for latent encoding at each step.
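A minimal sketch of the prefix-selection idea, assuming the lightweight decoder yields one confidence value per anticipated token and that a fixed threshold decides acceptance. The function names, the threshold of 0.9, and the `min_span` fallback are hypothetical choices for illustration; the paper's actual gate may differ.

```python
def longest_reliable_prefix(confidences, threshold=0.9):
    """Return how many leading anticipated tokens clear the threshold."""
    accepted = 0
    for c in confidences:
        if c < threshold:
            break  # the first unreliable token ends the compressible prefix
        accepted += 1
    return accepted


def next_action(confidences, threshold=0.9, min_span=2):
    """Compress the accepted prefix, or fall back to one explicit token."""
    n = longest_reliable_prefix(confidences, threshold)
    if n >= min_span:
        return ("latent", n)
    return ("explicit", 1)


print(next_action([0.97, 0.95, 0.92, 0.60, 0.99]))  # ('latent', 3)
print(next_action([0.50, 0.99]))                    # ('explicit', 1)
```

Only a contiguous prefix is accepted here: a confident token after an uncertain one (the 0.99 above) is not compressed, because the span it belongs to depends on the uncertain step before it.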

Load-bearing premise

The lightweight decoder and confidence gate can distinguish compressible redundant spans from precision-critical ones without selection bias or dataset-specific retuning.

What would settle it

On a new mathematical reasoning benchmark the accuracy gain over uniform latent baselines drops below 10 percentage points or the accuracy drop relative to explicit CoT exceeds 5 percent at the reported compression levels.
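The criterion above can be written as a small check. This sketch assumes all accuracies are given in percent and reads both thresholds as percentage points; the text says "5 percent" without specifying relative or absolute, so that reading is an assumption. The function name and the example numbers are hypothetical.

```python
def claim_survives(slt_acc, latent_acc, cot_acc,
                   min_gain_pts=10.0, max_drop_pts=5.0):
    """Return True if neither falsification condition is triggered.

    slt_acc, latent_acc, cot_acc: accuracies in percent on the new benchmark.
    """
    gain_over_latent = slt_acc - latent_acc  # must stay at or above 10 points
    drop_vs_cot = cot_acc - slt_acc          # must stay at or below 5 points
    return gain_over_latent >= min_gain_pts and drop_vs_cot <= max_drop_pts


# Hypothetical numbers, not results from the paper.
print(claim_survives(slt_acc=45.0, latent_acc=30.0, cot_acc=47.0))  # True
print(claim_survives(slt_acc=38.0, latent_acc=30.0, cot_acc=47.0))  # False
```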

Figures

Figures reproduced from arXiv: 2605.25745 by Hui Xie, Jie Liu, Joaquin Vanschore, Ziyue Qiao.

Figure 1: Comparison of reasoning paradigms and efficiency analysis. (a) Explicit CoT relies on verbose textual generation, whereas implicit latent CoT compresses the reasoning into latent space. Our Selective Latent Thinking (SLT) instead dynamically interleaves explicit CoT and latent reasoning, preserving precision-critical tokens in explicit form while selectively compressing redundant reasoning into latent. (b)…
Figure 2: Overview of the proposed Selective Latent Thinking (SLT). (a) During inference, a lightweight decoder D predicts a short future reasoning trajectory, a confidence gate selects the reliable prefix, and the latent compressor C maps the accepted span into a single latent block for efficient LLM state updates. (b) The compressor is trained to replace multi-step reasoning spans with latent blocks while preservi…
Figure 3
Figure 4: Compression length distribution. Bars show the RL policy, while dashed lines show the supervised policy before RL. SLT adaptively uses shorter or longer compression lengths depending on dataset difficulty.
[Adjacent plot: GSM Accuracy vs. CoT Length; x-axis CoT Length (# Tokens), y-axis Pass@1 Accuracy (%); legend includes SFT-CoT (Baseline), colar-2, coconut, …]
Figure 6: Average inference time comparison. Inference times for all models are measured on the …
Figure 7: Accuracy-efficiency Pareto frontier across different test datasets. By modulating the gating …
Figure 8: Visualization of SLT generation trajectories. The timeline arrows illustrate the causal …
Original abstract

Explicit chain-of-thought (CoT) reasoning substantially improves the reasoning ability of large language models (LLMs), but incurs high inference cost due to lengthy autoregressive traces. Existing latent reasoning methods offer a promising alternative, yet they often treat reasoning as uniformly compressible, causing precision-critical intermediate steps to be overly compressed and thereby degrading reasoning accuracy. In this work, we propose Selective Latent Thinking (SLT), a framework that selectively compresses redundant reasoning spans into latent representations while preserving precision-critical spans as explicit CoT within the same reasoning trajectory. Specifically, SLT first uses a lightweight decoder to anticipate a short upcoming reasoning span, and then applies confidence-based gating to determine the longest span that can be reliably compressed. The accepted span is encoded into a compact latent representation to improve reasoning efficiency, while uncertain or precision-critical reasoning remains in explicit CoT form to preserve accuracy. To learn this selective compression policy, SLT adopts a three-stage training strategy that combines span-level latent compression, reliability-aware future reasoning prediction, and trajectory-level reinforcement learning to optimize the trade-off between answer correctness and reasoning cost. Extensive experiments across four mathematical reasoning benchmarks demonstrate that SLT achieves 22.7% higher accuracy than latent reasoning baselines at comparable compression ratios, while reducing reasoning chain length by 58.4% with only 2.8% accuracy degradation compared to explicit CoT,Our code can be found in https://github.com/hunshi34/SLT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Selective Latent Thinking (SLT), a framework that selectively compresses redundant reasoning spans into latent representations while preserving precision-critical spans as explicit CoT. It uses a lightweight decoder for span anticipation, confidence-based gating to select compressible spans, and a three-stage training process (span-level compression, reliability-aware prediction, trajectory-level RL) to optimize the accuracy-cost trade-off. On four mathematical reasoning benchmarks, SLT is claimed to achieve 22.7% higher accuracy than latent reasoning baselines at comparable compression ratios, with 58.4% shorter reasoning chains and only 2.8% accuracy degradation relative to explicit CoT.

Significance. If the selective gating mechanism reliably distinguishes compressible from precision-critical spans without introducing bias or requiring per-benchmark tuning, the approach could meaningfully advance efficient LLM reasoning by improving the compression-accuracy frontier over uniform latent methods. The reported empirical trade-offs on standard math benchmarks indicate potential practical value for reducing inference costs while maintaining performance.

major comments (2)
  1. [Abstract] Abstract: The reported gains (22.7% accuracy improvement over latent baselines, 58.4% length reduction, 2.8% degradation vs. CoT) are presented without error bars, number of runs, statistical significance tests, or ablations that isolate the confidence-based gate from the RL objective, making it impossible to verify that the selective mechanism—not post-hoc threshold tuning or dataset-specific fitting—drives the results.
  2. [Abstract] Abstract: The three-stage training strategy is described only at a high level with no details on the lightweight decoder architecture, the exact formulation of the confidence gate for determining maximum compressible span length, the RL reward function balancing correctness and cost, or any cross-benchmark policy transfer experiments; these omissions are load-bearing for the central claim that selective compression avoids precision loss.
minor comments (1)
  1. [Abstract] Abstract: Typographical error with missing space ('CoT,Our code').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback emphasizing statistical rigor and methodological clarity. We will revise the manuscript to strengthen the presentation of results and provide additional details on the training components.

  1. Referee: [Abstract] Abstract: The reported gains (22.7% accuracy improvement over latent baselines, 58.4% length reduction, 2.8% accuracy degradation vs. CoT) are presented without error bars, number of runs, statistical significance tests, or ablations that isolate the confidence-based gate from the RL objective, making it impossible to verify that the selective mechanism—not post-hoc threshold tuning or dataset-specific fitting—drives the results.

    Authors: We agree that the abstract and main results would benefit from explicit statistical reporting. In the revision we will add error bars from 5 independent runs, report the number of seeds, and include paired t-test p-values for the key comparisons. We will also insert a dedicated ablation that disables the confidence gate (replacing it with a fixed threshold) while keeping the RL stage fixed, to isolate its contribution. revision: yes

  2. Referee: [Abstract] Abstract: The three-stage training strategy is described only at a high level with no details on the lightweight decoder architecture, the exact formulation of the confidence gate for determining maximum compressible span length, the RL reward function balancing correctness and cost, or any cross-benchmark policy transfer experiments; these omissions are load-bearing for the central claim that selective compression avoids precision loss.

    Authors: The full manuscript already specifies the decoder as a 2-layer Transformer in Section 3.1, the gate as a sigmoid over span-level entropy in Equation (4), and the RL reward as accuracy minus λ·length in Section 4.3. To make these elements immediately visible, we will add one-sentence summaries of each component to the abstract and include a short paragraph on the absence of cross-benchmark transfer experiments (noting it as a limitation for future work). revision: partial
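The two formulas named in the second response can be sketched as follows. The forms (a sigmoid over span-level entropy, and accuracy minus λ·length) are taken from the simulated rebuttal text only; the midpoint `tau`, the slope `scale`, the value of `lam`, and all function names are assumptions made for illustration, not values from the paper.

```python
import math


def span_entropy(token_dists):
    """Mean Shannon entropy (in nats) over per-token probability lists."""
    total = 0.0
    for dist in token_dists:
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(token_dists)


def gate_score(token_dists, tau=0.5, scale=10.0):
    """Sigmoid of scale * (tau - entropy): near 1 when predictions are peaked."""
    h = span_entropy(token_dists)
    return 1.0 / (1.0 + math.exp(-scale * (tau - h)))


def trajectory_reward(correct, length, lam=0.01):
    """Accuracy term minus a length penalty weighted by lam."""
    return float(correct) - lam * length


peaked = [[1.0, 0.0, 0.0, 0.0]]       # decoder is certain about the token
uniform = [[0.25, 0.25, 0.25, 0.25]]  # decoder is maximally uncertain
print(gate_score(peaked) > 0.9)       # True
print(gate_score(uniform) < 0.1)      # True
print(round(trajectory_reward(True, 20), 2))  # 0.8
```

With this reward shape, a correct but long trajectory can score lower than a correct short one, which is the pressure toward compression; the size of `lam` decides how much accuracy the policy is willing to trade for length.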

Circularity Check

0 steps flagged

No significant circularity: empirical results on benchmarks


The paper describes an empirical framework (SLT) with a three-stage training process and reports performance numbers from experiments on four mathematical reasoning benchmarks. No equations, mathematical derivations, or load-bearing self-citations appear in the provided text. Performance claims (accuracy lifts, length reductions) are framed as direct experimental outcomes compared to baselines rather than any quantity that reduces by construction to fitted inputs or prior author work. The method is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; standard LLM training assumptions (e.g., existence of useful latent representations for text spans) are implicit but not itemized.

pith-pipeline@v0.9.1-grok · 5791 in / 1179 out tokens · 27191 ms · 2026-06-29T21:39:56.979705+00:00 · methodology


Reference graph

Works this paper leans on

25 extracted references · 21 canonical work pages · 12 internal anchors

  1. [1]

    L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

    Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning. arXiv preprint arXiv:2503.04697, 2025

  2. [2]

    Sketch-of-thought: Efficient llm reasoning with adaptive cognitive-inspired sketching

    Simon A Aytes, Jinheon Baek, and Sung Ju Hwang. Sketch-of-thought: Efficient llm reasoning with adaptive cognitive-inspired sketching. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24307–24331, 2025

  3. [3]

    Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774, 2024

  4. [4]

    Efficient reasoning models: A survey

    Sicheng Feng, Gongfan Fang, Xinyin Ma, and Xinchao Wang. Efficient reasoning models: A survey. arXiv preprint arXiv:2504.10903, 2025

  5. [5]

    Better & Faster Large Language Models via Multi-token Prediction

    Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, and Gabriel Synnaeve. Better & faster large language models via multi-token prediction. arXiv preprint arXiv:2404.19737, 2024

  6. [6]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  7. [7]

    Token-budget-aware llm reasoning

    Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu Zhao, Shiqing Ma, and Zhenyu Chen. Token-budget-aware llm reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, pages 24842–24855, 2025

  8. [8]

    Training Large Language Models to Reason in a Continuous Latent Space

    Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769, 2024

  9. [9]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

  10. [10]

    EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-3: Scaling up inference acceleration of large language models via training-time test. arXiv preprint arXiv:2503.01840, 2025

  11. [11]

    O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning

    Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, and Dacheng Tao. O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning. arXiv preprint arXiv:2501.12570, 2025

  12. [12]

    Onelatent: Single-token compression for visual latent reasoning

    Bo Lv, Yasheng Sun, Junjie Wang, and Haoxiang Shi. Onelatent: Single-token compression for visual latent reasoning. arXiv preprint arXiv:2602.13738, 2026

  13. [13]

    Your llm knows the future: Uncovering its multi-token prediction potential

    Mohammad Samragh, Arnav Kundu, David Harrison, Kumari Nishu, Devang Naik, Minsik Cho, and Mehrdad Farajtabar. Your llm knows the future: Uncovering its multi-token prediction potential. arXiv preprint arXiv:2507.11851, 2025

  14. [14]

    Efficient Reasoning with Hidden Thinking

    Xuan Shen, Yizhou Wang, Xiangxi Shi, Yanzhi Wang, Pu Zhao, and Jiuxiang Gu. Efficient reasoning with hidden thinking. arXiv preprint arXiv:2501.19201, 2025

  15. [15]

    CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation

    Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, and Yulan He. Codi: Compressing chain-of-thought into continuous space via self-distillation. arXiv preprint arXiv:2502.21074, 2025

  16. [16]

    Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

    Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Na Zou, et al. Stop overthinking: A survey on efficient reasoning for large language models. arXiv preprint arXiv:2503.16419, 2025

  17. [17]

    Think silently, think fast: Dynamic latent compression of llm reasoning chains

    Wenhui Tan, Jiaze Li, Jianzhong Ju, Zhenbo Luo, Ruihua Song, and Jian Luan. Think silently, think fast: Dynamic latent compression of llm reasoning chains. arXiv preprint arXiv:2505.16552, 2025

  18. [18]

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025

  19. [19]

    Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning

    Yifan Wang, Shiyu Li, Peiming Li, Xiaochen Yang, Yang Tang, and Zheng Wei. Render-of-thought: Rendering textual chain-of-thought as images for visual latent reasoning. arXiv preprint arXiv:2601.14750, 2026

  20. [20]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  21. [21]

    Tokenskip: Controllable chain-of-thought compression in llms

    Heming Xia, Chak Tou Leong, Wenjie Wang, Yongqi Li, and Wenjie Li. Tokenskip: Controllable chain-of-thought compression in llms. arXiv preprint arXiv:2502.12067, 2025

  22. [22]

    Towards system 2 reasoning in llms: Learning how to think with meta chain-of-thought

    Violet Xiang, Charlie Snell, Kanishk Gandhi, Alon Albalak, Anikait Singh, Chase Blagden, Duy Phung, Rafael Rafailov, Nathan Lile, Dakota Mahan, et al. Towards system 2 reasoning in llms: Learning how to think with meta chain-of-thought. arXiv preprint arXiv:2501.04682, 2025

  23. [23]

    Chain of draft: Thinking faster by writing less

    Silei Xu, Wenhao Xie, Lingxiao Zhao, and Pengcheng He. Chain of draft: Thinking faster by writing less. arXiv preprint arXiv:2502.18600, 2025

  24. [24]

    Dynamic early exit in reasoning models

    Chenxu Yang, Qingyi Si, Yongjie Duan, Zheliang Zhu, Chenyu Zhu, Qiaowei Li, Minghui Chen, Zheng Lin, and Weiping Wang. Dynamic early exit in reasoning models. arXiv preprint arXiv:2504.15895, 2025

  25. [25]

    Lightthinker: Thinking step-by-step compression

    Jintian Zhang, Yuqi Zhu, Mengshu Sun, Yujie Luo, Shuofei Qiao, Lun Du, Da Zheng, Huajun Chen, and Ningyu Zhang. Lightthinker: Thinking step-by-step compression. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 13318–13339, 2025.

A Inference Efficiency Analysis

To evaluate practical efficiency, we compare ...