Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning

Wooil Jung

arxiv: 2606.10184 · v1 · pith:PZIESAFJnew · submitted 2026-06-08 · 💻 cs.LG · cs.AI


Pith reviewed 2026-06-27 17:16 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords: dropout, GRPO, latent reasoning, variational inference, continuous hidden states, Bayesian model averaging, reinforcement learning, GSM8K


The pith

A fixed Bernoulli mask per rollout supplies the diversity GRPO needs by sampling from a variational posterior over parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that continuous latent reasoning models produce identical trajectories across rollouts, causing GRPO's group-mean advantage to collapse to zero. It introduces structured dropout that applies one Bernoulli mask unchanged across all recurrence steps within a rollout. This mask creates trajectory variance that can be interpreted as draws from a variational distribution over model parameters, so GRPO optimizes the expected reward of the corresponding Bayesian model-average policy. Theoretical arguments establish that the resulting advantage estimator remains unbiased with well-defined latent gradients. Experiments on GSM8K lift a Coconut baseline from 27.29 percent to 29.01 percent pass@1.

Core claim

By applying a single Bernoulli mask held constant across all latent recurrence steps for a given rollout, the method generates essential trajectory variance. This shared mask effectively treats each rollout as a posterior sample from a variational distribution over parameters, allowing GRPO to optimize the expected reward of a Bayesian model-average policy while preserving unbiasedness, variance reduction, and well-defined latent gradients.

What carries the argument

A single Bernoulli dropout mask held constant across all latent recurrence steps within each rollout, which induces trajectory variance interpretable as variational posterior sampling over parameters.
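
The mechanism can be illustrated with a toy recurrence. This is a minimal sketch, not the paper's implementation: the update `h = tanh(W @ (mask * h))`, the helper names `latent_rollout`, `sample_masks`, and `spread`, and all sizes are assumptions chosen for illustration. It shows that a deterministic recurrence yields identical rollouts, while one fixed mask per rollout makes them differ.

```python
import numpy as np

def latent_rollout(h0, W, mask, steps):
    """Toy latent recurrence (hypothetical): the SAME mask is applied at every step."""
    h = h0
    for _ in range(steps):
        h = np.tanh(W @ (mask * h))
    return h

def sample_masks(rng, num_rollouts, dim, keep_prob):
    """One Bernoulli(keep_prob) mask per rollout, rescaled as in inverted dropout."""
    return rng.binomial(1, keep_prob, size=(num_rollouts, dim)) / keep_prob

def spread(states):
    """Largest per-coordinate gap between rollouts; 0.0 means identical trajectories."""
    return float((states.max(axis=0) - states.min(axis=0)).max())

rng = np.random.default_rng(0)
dim, steps, K = 16, 4, 4          # illustrative sizes, not from the paper
W = rng.normal(size=(dim, dim)) / np.sqrt(dim)
h0 = rng.normal(size=dim)

# No dropout: every rollout is the same deterministic function of (W, h0).
plain = np.stack([latent_rollout(h0, W, np.ones(dim), steps) for _ in range(K)])

# One fixed mask per rollout: final latent states differ across the group.
masks = sample_masks(rng, K, dim, keep_prob=0.5)
masked = np.stack([latent_rollout(h0, W, m, steps) for m in masks])

print("spread without mask:", spread(plain))
print("spread with one mask per rollout:", spread(masked))
```

In a real latent-reasoning model the mask would act on hidden activations inside the network; the toy recurrence only demonstrates where the variance comes from.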

If this is right

GRPO becomes applicable to deterministic latent-reasoning models without trajectory collapse.
The estimator for the group-relative advantage remains unbiased under the variational interpretation.
Post-training of latent-reasoning LLMs can now use group-relative reinforcement learning.
The same masking supplies both stochasticity and a well-defined gradient through the latent phase.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same constant-mask construction could be tested on other recurrent latent architectures to check whether the variational interpretation generalizes.
If the mask truly approximates posterior sampling, combining it with other policy-gradient methods beyond GRPO becomes a direct next step.
Varying the mask probability across groups rather than fixing it might further reduce variance in the advantage estimates.

Load-bearing premise

That a shared Bernoulli mask across recurrence steps produces samples whose distribution matches a variational posterior over parameters well enough for the GRPO advantage estimator to stay unbiased and useful.

What would settle it

An experiment in which removing the shared-mask constraint (or replacing it with independent per-step masks) causes the observed performance gain on GSM8K to disappear while GRPO advantages become near-zero.
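
The two conditions of such an ablation could be set up as follows. This is a sketch under assumptions: `make_masks`, `rollout`, and the toy `tanh` recurrence are hypothetical stand-ins, and the code only constructs the shared-mask and per-step-mask variants. It does not reproduce or predict any result from the paper.

```python
import numpy as np

def make_masks(rng, steps, dim, keep_prob, shared):
    """Return a (steps, dim) array of inverted-dropout masks.

    shared=True  -> one Bernoulli draw repeated at every recurrence step.
    shared=False -> an independent Bernoulli draw at each step (the control).
    """
    if shared:
        one = rng.binomial(1, keep_prob, size=dim)
        masks = np.tile(one, (steps, 1))
    else:
        masks = rng.binomial(1, keep_prob, size=(steps, dim))
    return masks / keep_prob

def rollout(h0, W, masks):
    """Toy recurrence (hypothetical) that consumes one mask row per step."""
    h = h0
    for m in masks:
        h = np.tanh(W @ (m * h))
    return h

rng = np.random.default_rng(1)
steps, dim = 6, 16                # illustrative sizes
W = rng.normal(size=(dim, dim)) / np.sqrt(dim)
h0 = rng.normal(size=dim)

shared_masks = make_masks(rng, steps, dim, keep_prob=0.5, shared=True)
per_step_masks = make_masks(rng, steps, dim, keep_prob=0.5, shared=False)

h_shared = rollout(h0, W, shared_masks)
h_per_step = rollout(h0, W, per_step_masks)
```

Keeping both variants behind a single `shared` flag means the training loop stays the same and only the mask schedule changes between runs.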

Figures

Figures reproduced from arXiv: 2606.10184 by Wooil Jung.

**Figure 1.** Dropout-GRPO pipeline. A single prompt $x$ is broadcast to $K$ parallel rollouts; each rollout draws an independent mask $\xi^{(k)}$ that is held constant across all $T$ latent steps (denoted by $f \times T$). The resulting answers $y^{(k)}$ receive verifier rewards $r^{(k)}$, the group mean $\mu_r$ is computed, and each rollout's advantage is calculated as $A^{(k)} = r^{(k)} - \mu_r$. Mask replay at update time ensures the policy gradient is computed a…
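
The advantage step in the caption is simple enough to write out. The sketch below assumes binary verifier rewards and uses the unnormalized form $A^{(k)} = r^{(k)} - \mu_r$ given in the caption; the function name `group_advantages` is a hypothetical helper, and any further normalization or clipping the paper may apply is not shown.

```python
import numpy as np

def group_advantages(rewards):
    """A_k = r_k - mean(r) within one group of K rollouts."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean()

# Deterministic latent phase: all K rollouts give the same answer, so the same reward.
collapsed = group_advantages([1.0, 1.0, 1.0, 1.0])

# Mask-induced diversity: some rollouts are correct, some are not.
diverse = group_advantages([1.0, 0.0, 0.0, 1.0])

print(collapsed)  # every entry is zero, so the update carries no signal
print(diverse)    # +0.5 for correct rollouts, -0.5 for incorrect ones
```

The first case is the collapse described in the abstract: identical rewards leave nothing for the group baseline to separate.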

**Figure 2.** Evaluation accuracy across training steps.


Group Relative Policy Optimization (GRPO) relies on the diversity of $K$ rollouts within each group; otherwise, the group-mean advantage $A^{(k)} = r^{(k)} - \mu_r$ collapses to zero. This presents a structural challenge for latent-reasoning models like Coconut, which feed continuous hidden states recurrently in place of discrete chain-of-thought tokens. Because the latent phase is inherently deterministic given the parameters and prompt, multiple rollouts produce identical trajectories, stalling GRPO's progress. Consequently, applying group-relative reinforcement learning to continuous latent reasoning has proven difficult. To address this, we propose sourcing the necessary stochasticity through structured dropout. By applying a single Bernoulli mask held constant across all latent recurrence steps for a given rollout, we generate essential trajectory variance. This shared mask effectively treats each rollout as a posterior sample from a variational distribution over parameters, allowing GRPO to optimize the expected reward of a Bayesian model-average policy. We provide both theoretical justification for this method -- including unbiasedness, variance reduction, and the well-definedness of the latent gradient -- and empirical validation. On GSM8K, dropout-GRPO improves a Coconut baseline from $27.29\%$ to $29.01\%$ pass@1, demonstrating the viability of GRPO learning for latent-reasoning models. Our work positions this as a practical, theoretically grounded approach for post-training latent-reasoning LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Shared-mask dropout gives GRPO a workable variance source for Coconut-style latent models, but the unbiasedness argument is asserted rather than shown and the empirical lift is modest.


The core move is to hold one Bernoulli dropout mask fixed across all recurrent latent steps in a rollout. That single mask creates different trajectories across the K group members, so the group-mean advantage in GRPO does not collapse to zero. The authors frame each masked rollout as a draw from a variational posterior over parameters and claim this keeps the advantage estimator unbiased while also reducing variance and keeping the latent gradient well-defined.

What is actually new is the specific structured-dropout construction that turns the deterministic latent recurrence into something GRPO can use. Prior Coconut work had no stochasticity mechanism for group-relative updates, and standard GRPO assumes token-level diversity that latent models lack. The paper therefore fills a narrow but real gap for anyone trying to run GRPO on continuous-state reasoning models.

The empirical result is a 1.72-point pass@1 gain on GSM8K over the Coconut baseline. That is positive but small, and the abstract gives no error bars, no ablation on mask probability or number of groups, and no comparison to other variance-injection methods. The theoretical claims (unbiasedness, variance reduction, well-defined gradients) are stated without derivation steps or explicit handling of the covariance that the recurrence might introduce between mask and reward.

The paper is aimed at researchers already working on latent-reasoning post-training who need a concrete way to apply group-relative RL. A reader who wants a fully worked-out policy-gradient identity or multi-task results with statistical detail will find the current version thin. Still, the problem it attacks is genuine and the proposed fix is simple enough to test quickly.

I would send it to referees. The idea is narrow but practical, and the missing pieces (full derivation, ablations, error bars) are exactly what review can surface.

Referee Report

2 major / 1 minor

Summary. The paper proposes Dropout-GRPO to enable Group Relative Policy Optimization (GRPO) for continuous latent-reasoning models such as Coconut. It introduces structured dropout via a single Bernoulli mask held fixed across all latent recurrence steps within a rollout, claiming this generates trajectory diversity by treating each rollout as a sample from a variational posterior over parameters. This is asserted to allow GRPO to optimize the expected reward of a Bayesian model-average policy, with theoretical properties of unbiasedness, variance reduction, and well-defined latent gradients. Empirically, the method improves pass@1 on GSM8K from 27.29% to 29.01%.

Significance. If the claimed theoretical properties hold, the approach would address a structural barrier to applying group-relative RL to deterministic latent-reasoning models and could support post-training of such architectures. The reported gain is modest and limited to a single task, so broader significance would depend on validation across additional benchmarks and confirmation that the variational equivalence supports unbiased advantage estimation.

major comments (2)

[Theory section] Theory section (asserted justification for unbiasedness): The manuscript states that the shared-mask construction yields samples from a variational posterior q(θ) such that the group-relative advantage A^(k) = r^(k) − μ_r remains an unbiased estimator of the gradient of expected reward under the Bayesian model average, yet provides no derivation showing that the policy-gradient identity is preserved after marginalization over the mask. The recurrence couples the mask to the hidden-state trajectory, which can introduce an extra covariance term not cancelled by the group baseline; this step must be shown explicitly for the central claim to hold.
[Experiments] Empirical results (GSM8K evaluation): The reported 1.72-point gain is presented without error bars, multiple random seeds, or ablations isolating the contribution of the shared-mask variational interpretation versus generic dropout; given that the central claim rests on the specific equivalence to a variational posterior, these controls are required to establish that the improvement is not attributable to other factors.
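
For orientation, the kind of identity the first major comment asks for can be sketched under simplifying assumptions that are not taken from the manuscript: the mask distribution $p(\xi)$ does not depend on $\theta$, the $K$ pairs $(\xi^{(k)}, y^{(k)})$ are drawn i.i.d., answers are sampled as $y^{(k)} \sim \pi_\theta(\cdot \mid x, \xi^{(k)})$, and $g^{(k)} = \nabla_\theta \log \pi_\theta(y^{(k)} \mid x, \xi^{(k)})$ is evaluated with the same mask. Importance ratios and clipping are ignored.

```latex
\begin{align}
J(\theta) &= \mathbb{E}_{\xi \sim p(\xi)}\, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x, \xi)}\big[ r(y) \big], \\
\nabla_\theta J(\theta) &= \mathbb{E}_{\xi}\, \mathbb{E}_{y}\big[ r(y)\, \nabla_\theta \log \pi_\theta(y \mid x, \xi) \big]
  \quad \text{(since $p(\xi)$ is free of $\theta$)}, \\
\mathbb{E}\big[ (r^{(k)} - \mu_r)\, g^{(k)} \big]
  &= \Big(1 - \tfrac{1}{K}\Big)\, \mathbb{E}\big[ r^{(k)} g^{(k)} \big]
   - \tfrac{1}{K} \sum_{j \neq k} \mathbb{E}\big[ r^{(j)} \big]\, \mathbb{E}\big[ g^{(k)} \big]
   = \tfrac{K-1}{K}\, \nabla_\theta J(\theta).
\end{align}
```

The last step uses $\mathbb{E}[g^{(k)}] = 0$ and independence across rollouts. Under these assumptions the dependence between mask and reward within a rollout stays inside $\mathbb{E}[r^{(k)} g^{(k)}]$, which is the target term, and the group-mean baseline only rescales the gradient by $(K-1)/K$. Whether the manuscript's own argument follows this route, and whether its assumptions match these, is exactly what the comment asks the authors to show.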

minor comments (1)

[Abstract] The abstract and introduction use the phrase 'well-definedness of the latent gradient' without defining the precise quantity being differentiated or the measure under which the gradient is taken.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important areas for strengthening the theoretical and empirical foundations of our work. We address each major comment below and indicate the revisions we will make.


Referee: [Theory section] Theory section (asserted justification for unbiasedness): The manuscript states that the shared-mask construction yields samples from a variational posterior q(θ) such that the group-relative advantage A^(k) = r^(k) − μ_r remains an unbiased estimator of the gradient of expected reward under the Bayesian model average, yet provides no derivation showing that the policy-gradient identity is preserved after marginalization over the mask. The recurrence couples the mask to the hidden-state trajectory, which can introduce an extra covariance term not cancelled by the group baseline; this step must be shown explicitly for the central claim to hold.

Authors: We agree that the manuscript currently asserts the unbiasedness property at a high level without providing the explicit derivation requested. The potential covariance term arising from the recurrent coupling of the mask to the hidden state is a valid concern that requires careful treatment. In the revised manuscript, we will add a detailed derivation in the Theory section that explicitly shows preservation of the policy-gradient identity after marginalization over the mask, demonstrating how the group baseline cancels the relevant covariance terms.

revision: yes

Referee: [Experiments] Empirical results (GSM8K evaluation): The reported 1.72-point gain is presented without error bars, multiple random seeds, or ablations isolating the contribution of the shared-mask variational interpretation versus generic dropout; given that the central claim rests on the specific equivalence to a variational posterior, these controls are required to establish that the improvement is not attributable to other factors.

Authors: We acknowledge that the current results lack error bars, multi-seed statistics, and targeted ablations, which are necessary to substantiate that the gains stem from the variational posterior equivalence rather than generic dropout effects. In the revision, we will report results over multiple random seeds with standard error bars and include ablations that compare the shared-mask construction against standard per-step dropout to isolate the contribution of the proposed mechanism.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; modeling choice presented as independent step with claimed external justification

full rationale

The abstract and provided text introduce the shared Bernoulli mask as a deliberate modeling choice to generate trajectory variance, explicitly framing it as treating rollouts as variational posterior samples to enable GRPO on a Bayesian model-average policy. The paper asserts it supplies separate theoretical justification (unbiasedness, variance reduction, latent gradient well-definedness) plus empirical results on GSM8K. No equations are shown that reduce the advantage estimator or the variational equivalence to a fitted parameter defined by the same data, nor is any load-bearing premise justified solely by self-citation. The derivation chain therefore remains self-contained rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that fixed-mask dropout across recurrence steps yields trajectories that can be treated as variational posterior samples; no free parameters or invented entities are stated in the abstract.

axioms (1)

domain assumption: A single Bernoulli mask held constant across latent recurrence steps produces samples from a variational distribution over parameters that supports unbiased GRPO updates.
This premise is required for the method to convert deterministic rollouts into usable group-relative advantages.




[1] [1]

Weight uncertainty in neural network

Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. InProceedings of the 32nd International Conference on Machine Learning, volume 37 ofProceedings of Machine Learning Research, pages 1613–1622. PMLR, 2015. URL https://proceedings.mlr.press/ v37/blundell15.html

2015

[2] [2]

Inner thinking transformer: Leveraging dynamic depth scaling to foster adaptive internal thinking.arXiv preprint arXiv:2502.13842, 2025

Yilong Chen, Junyuan Shang, Zhenyu Zhang, Yanxi Xie, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, and Haifeng Wang. Inner thinking transformer: Leveraging dynamic depth scaling to foster adaptive internal thinking.arXiv preprint arXiv:2502.13842, 2025. URLhttps://arxiv.org/abs/2502.13842

arXiv 2025

[3] [3]

Compressed chain of thought: Efficient reasoning through dense representations.arXiv preprint arXiv:2412.13171, 2024

Jeffrey Cheng and Benjamin Van Durme. Compressed chain of thought: Efficient reasoning through dense representations.arXiv preprint arXiv:2412.13171, 2024. URLhttps://arxiv.org/abs/2412.13171

Pith/arXiv arXiv 2024

[4] [4]

Llm latent reasoning as chain of superposition.arXiv preprint arXiv:2510.15522, 2025

Jingcheng Deng, Liang Pang, Zihao Wei, Shicheng Xu, Zenghao Duan, Kun Xu, Yang Song, Huawei Shen, and Xueqi Cheng. Llm latent reasoning as chain of superposition.arXiv preprint arXiv:2510.15522, 2025. doi: 10.48550/arXiv.2510.15522. URLhttps://arxiv.org/abs/2510.15522

work page doi:10.48550/arxiv.2510.15522 2025

[5] [5]

Implicit chain of thought reasoning via knowledge distillation.arXiv preprint arXiv:2311.01460, 2023

Yuntian Deng, Kiran Prasad, Roland Fernandez, Paul Smolensky, Vishrav Chaudhary, and Stuart Shieber. Implicit chain of thought reasoning via knowledge distillation.arXiv preprint arXiv:2311.01460, 2023. URL https: //arxiv.org/abs/2311.01460

arXiv 2023

[6] [6]

From explicit cot to implicit cot: Learning to internalize cot step by step.arXiv preprint arXiv:2405.14838, 2024

Yuntian Deng, Yejin Choi, and Stuart Shieber. From explicit cot to implicit cot: Learning to internalize cot step by step.arXiv preprint arXiv:2405.14838, 2024. URLhttps://arxiv.org/abs/2405.14838

Pith/arXiv arXiv 2024

[7] [7]

Latent thinking optimization: Your latent reasoning language model secretly encodes reward signals in its latent thoughts.arXiv preprint arXiv:2509.26314, 2025

Hanwen Du, Yuxin Dong, and Xia Ning. Latent thinking optimization: Your latent reasoning language model secretly encodes reward signals in its latent thoughts.arXiv preprint arXiv:2509.26314, 2025. URL https: //arxiv.org/abs/2509.26314

arXiv 2025

[8] [8]

Towards revealing the mystery behind chain of thought: A theoretical perspective

Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards revealing the mystery behind chain of thought: A theoretical perspective. InAdvances in Neural Information Processing Systems (NeurIPS), volume 36, pages 70757–70798, 2023

2023

[9] [9]

Dropout as a bayesian approximation: Representing model uncertainty in deep learning

Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. InProceedings of the 33rd International Conference on Machine Learning, volume 48 ofProceedings of Machine Learning Research, pages 1050–1059. PMLR, 2016. URL https://proceedings.mlr.press/ v48/gal16.html

2016

[10] Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. arXiv preprint arXiv:2211.10435, 2022. URL https://arxiv.org/abs/2211.10435

[11] Paul Glasserman and David D. Yao. Some guidelines and guarantees for common random numbers. Management Science, 38(6):884–908, 1992. doi: 10.1287/mnsc.38.6.884

[12] Halil Alperen Gozeten, M. Emrullah Ildiz, Xuechen Zhang, Hrayr Harutyunyan, Ankit Singh Rawat, and Samet Oymak. Continuous chain of thought enables parallel exploration and reasoning. arXiv preprint arXiv:2505.23648, 2025. doi: 10.48550/arXiv.2505.23648. URL https://arxiv.org/abs/2505.23648

[14] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769, 2024. URL https://arxiv.org/abs/2412.06769

[15] Cheng-Yu Hsieh, Chun-Liang Li, Chih-kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alex Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. In Findings of the Association for Computational Linguistics (ACL), pages 8003–8017, 2023. doi: 10.18653/v1/2023.findings-acl.507

[16] Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Llmlingua: Compressing prompts for accelerated inference of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13358–13376, Singapore, 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.825

[17] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, pages 22199–22213, 2022. URL https://arxiv.org/abs/2205.11916

[18] Houjun Liu, John Bauer, and Christopher D. Manning. Drop dropout on single epoch language model pretraining. In Findings of the Association for Computational Linguistics: ACL 2025, pages 2157–2166, Vienna, Austria, 2025. Association for Computational Linguistics. doi: 10.18653/v1/2025.findings-acl.111. URL https://aclanthology.org/2025.findings-acl.111/

[20] Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025. URL https://arxiv.org/abs/2503.20783

[21] Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094, Online, 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.168

[22] Subhro Roy and Dan Roth. Solving general arithmetic word problems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1743–1752. Association for Computational Linguistics, 2015. doi: 10.18653/v1/D15-1202. URL https://aclanthology.org/D15-1202/

[24] Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, and Sashank J. Reddi. Reasoning with latent thoughts: On the power of looped transformers. arXiv preprint arXiv:2502.17416, 2025. URL https://arxiv.org/abs/2502.17416

[25] John Schulman. Approximating kl divergence. Blog post, 2020. URL http://joschu.net/blog/kl-approx.html

[26] John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, 2015

[27] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

[28] Zhihong Shao et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

[29] Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, and Yulan He. Codi: Compressing chain-of-thought into continuous space via self-distillation. arXiv preprint arXiv:2502.21074, 2025. doi: 10.48550/arXiv.2502.21074. URL https://arxiv.org/abs/2502.21074

[30] Shubham Toshniwal, Wei Du, Ivan Moshkov, Branislav Kisacanin, Alexan Ayrapetyan, and Igor Gitman. Openmathinstruct-2: Accelerating ai for math with massive open-source instruction data. arXiv preprint arXiv:2410.01560, 2024. doi: 10.48550/arXiv.2410.01560. URL https://arxiv.org/abs/2410.01560

[31] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, pages 24824–24837, 2022. URL https://proceedings.neurips.cc/paper_fil...

[32] Yeming Wen, Paul Vicol, Jimmy Ba, Dustin Tran, and Roger Grosse. Flipout: Efficient pseudo-independent weight perturbations on mini-batches. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJNpifWAb

[33] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992. doi: 10.1007/BF00992696

[34] An Yang et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024. doi: 10.48550/arXiv.2412.15115. URL https://arxiv.org/abs/2412.15115

[35] Qiying Yu et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025. URL https://arxiv.org/abs/2503.14476

[36] Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu Shang, Jiawei Han, and Dong Wang. Hybrid latent reasoning via reinforcement learning. arXiv preprint arXiv:2505.18454, 2025. URL https://arxiv.org/abs/2505.18454

[37] Zhen Zhang, Xuehai He, Weixiang Yan, Ao Shen, Chenyang Zhao, Shuohang Wang, Yelong Shen, and Xin Eric Wang. Soft thinking: Unlocking the reasoning potential of llms in continuous concept space. arXiv, 2025. doi: 10.48550/arxiv.2505.15778

[38] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022. URL https://arxiv.org/abs/2205.10625

[39] Yuyan Zhou, Jiarui Yu, Hande Dong, Zhezheng Hao, Hong Wang, Jianqing Zhang, and Qiang Lin. Lepo: Latent reasoning policy optimization for large language models. arXiv preprint arXiv:2604.17892, 2026. URL https://arxiv.org/abs/2604.17892

[40] Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, and Yuandong Tian. Reasoning by superposition: A theoretical perspective on chain of continuous thought. arXiv, 2025. doi: 10.48550/arxiv.2505.12514

A Extended Discussion and Future Work

A.1 Interpreting the performance gain from Dropout-GRPO

The empirical improvement of 2.03 percentage poin...