arxiv: 2606.10184 · v1 · pith:PZIESAFJnew · submitted 2026-06-08 · 💻 cs.LG · cs.AI

Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning

Pith reviewed 2026-06-27 17:16 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords dropout · GRPO · latent reasoning · variational inference · continuous hidden states · Bayesian model averaging · reinforcement learning · GSM8K

The pith

A fixed Bernoulli mask per rollout supplies the diversity GRPO needs by sampling from a variational posterior over parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that continuous latent reasoning models produce identical trajectories across rollouts, causing GRPO's group-mean advantage to collapse to zero. It introduces structured dropout that applies one Bernoulli mask unchanged across all recurrence steps within a rollout. This mask creates trajectory variance that can be interpreted as draws from a variational distribution over model parameters, so GRPO optimizes the expected reward of the corresponding Bayesian model-average policy. Theoretical arguments establish that the resulting advantage estimator remains unbiased with well-defined latent gradients. Experiments on GSM8K lift a Coconut baseline from 27.29 percent to 29.01 percent pass@1.
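
The collapse described above is easy to see in a few lines. The following is a minimal sketch, not code from the paper: `group_advantages` is a hypothetical helper, and the reward lists are placeholder 0/1 verifier outcomes chosen only to illustrate the arithmetic.

```python
from statistics import fmean


def group_advantages(rewards):
    """Group-relative advantage A_k = r_k - mean(r) for one prompt's K rollouts."""
    mu = fmean(rewards)
    return [r - mu for r in rewards]


# Deterministic latent phase: all K rollouts give the same answer, hence the same reward.
collapsed = group_advantages([1.0, 1.0, 1.0, 1.0])

# Rollouts that differ (for example through per-rollout masks) can receive different rewards.
diverse = group_advantages([1.0, 0.0, 1.0, 0.0])
```

With identical rewards every advantage is zero, so the policy-gradient update carries no signal for that prompt; any difference between rollouts restores a nonzero advantage.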

Core claim

By applying a single Bernoulli mask held constant across all latent recurrence steps for a given rollout, the method generates essential trajectory variance. This shared mask effectively treats each rollout as a posterior sample from a variational distribution over parameters, allowing GRPO to optimize the expected reward of a Bayesian model-average policy while preserving unbiasedness, variance reduction, and well-defined latent gradients.

What carries the argument

A single Bernoulli dropout mask held constant across all latent recurrence steps within each rollout, which induces trajectory variance interpretable as variational posterior sampling over parameters.
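
A toy recurrence makes the shared-mask idea concrete. This is a sketch under stated assumptions: the update rule `h <- tanh(W (mask * h))`, the placement of the mask on hidden units, and all names (`sample_mask`, `latent_rollout`, `weights`) are illustrative and not taken from the paper.

```python
import math
import random


def sample_mask(dim, keep_prob, rng):
    """Draw one Bernoulli keep-mask; 1.0 keeps a unit, 0.0 drops it."""
    return [1.0 if rng.random() < keep_prob else 0.0 for _ in range(dim)]


def latent_rollout(h0, weights, mask, steps):
    """Toy recurrence h <- tanh(W (mask * h)); the same mask is reused at every step."""
    h = list(h0)
    for _ in range(steps):
        masked = [m * x for m, x in zip(mask, h)]
        h = [math.tanh(sum(w * x for w, x in zip(row, masked))) for row in weights]
    return h


rng = random.Random(0)
dim, steps = 4, 3
weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
h0 = [0.5, -0.3, 0.8, 0.1]

# Without dropout, two rollouts of the same prompt are identical.
ones = [1.0] * dim
no_dropout_a = latent_rollout(h0, weights, ones, steps)
no_dropout_b = latent_rollout(h0, weights, ones, steps)

# With one mask per rollout, held fixed over all steps, the trajectories can diverge.
# The two masks are hard-coded here so the contrast is visible; in practice each
# rollout would draw its own mask, as in `drawn` below.
rollout_a = latent_rollout(h0, weights, [1.0, 0.0, 1.0, 1.0], steps)
rollout_b = latent_rollout(h0, weights, [1.0, 1.0, 0.0, 1.0], steps)
drawn = sample_mask(dim, keep_prob=0.9, rng=rng)
```

Because the mask is an argument of the whole rollout rather than something resampled inside the loop, each rollout behaves like a forward pass through one fixed, thinned network, which is what the variational reading relies on.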

If this is right

  • GRPO becomes applicable to deterministic latent-reasoning models without trajectory collapse.
  • The estimator for the group-relative advantage remains unbiased under the variational interpretation.
  • Post-training of latent-reasoning LLMs can now use group-relative reinforcement learning.
  • The same masking supplies both stochasticity and a well-defined gradient through the latent phase.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same constant-mask construction could be tested on other recurrent latent architectures to check whether the variational interpretation generalizes.
  • If the mask truly approximates posterior sampling, combining it with other policy-gradient methods beyond GRPO becomes a direct next step.
  • Varying the mask probability across groups rather than fixing it might further reduce variance in the advantage estimates.

Load-bearing premise

That a shared Bernoulli mask across recurrence steps produces samples whose distribution matches a variational posterior over parameters well enough for the GRPO advantage estimator to stay unbiased and useful.

What would settle it

An experiment in which removing the shared-mask constraint (or replacing it with independent per-step masks) causes the observed performance gain on GSM8K to disappear while GRPO advantages become near-zero.
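
One quantity such an ablation could track is the share of groups whose advantages are all near zero. The sketch below is an assumed diagnostic, not something the paper reports; `zero_advantage_fraction` and the reward values are hypothetical placeholders.

```python
def zero_advantage_fraction(reward_groups, tol=1e-8):
    """Fraction of groups whose advantages are all (near) zero, i.e. no learning signal."""
    collapsed = 0
    for rewards in reward_groups:
        mu = sum(rewards) / len(rewards)
        if all(abs(r - mu) <= tol for r in rewards):
            collapsed += 1
    return collapsed / len(reward_groups)


# Placeholder 0/1 verifier rewards, not results from the paper.
example_groups = [
    [1.0, 1.0, 1.0, 1.0],  # identical rollouts: collapsed
    [0.0, 0.0, 0.0, 0.0],  # identical rollouts: collapsed
    [1.0, 0.0, 1.0, 1.0],  # diverse rollouts: usable signal
    [0.0, 1.0, 0.0, 0.0],  # diverse rollouts: usable signal
]
fraction = zero_advantage_fraction(example_groups)
```

Logging this fraction for the shared-mask, per-step-mask, and no-dropout variants would show directly which of them leaves GRPO without a gradient signal.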

Figures

Figures reproduced from arXiv: 2606.10184 by Wooil Jung.

Figure 1: Dropout-GRPO pipeline. A single prompt $x$ is broadcast to $K$ parallel rollouts; each rollout draws an independent mask $\xi^{(k)}$ that is held constant across all $T$ latent steps (denoted by f ×T). The resulting answers $y^{(k)}$ receive verifier rewards $r^{(k)}$, the group mean $\mu_r$ is computed, and each rollout's advantage is calculated as $A^{(k)} = r^{(k)} - \mu_r$. Mask replay at update time ensures the policy gradient is computed a…
Figure 2: Evaluation accuracy across training steps.
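
The Figure 1 pipeline, including mask replay, can be sketched as follows. This is an illustration under assumptions, not the author's implementation: storing a per-rollout seed is just one way to make a mask replayable, and `mask_from_seed`, `run_group`, `toy_rollout`, and `toy_reward` are hypothetical stand-ins for the model and the verifier.

```python
import random


def mask_from_seed(seed, dim, keep_prob):
    """Regenerate a rollout's Bernoulli mask deterministically from its stored seed."""
    rng = random.Random(seed)
    return [1.0 if rng.random() < keep_prob else 0.0 for _ in range(dim)]


def run_group(prompt, num_rollouts, dim, keep_prob, rollout_fn, reward_fn, base_seed=0):
    """Roll out K answers for one prompt and attach group-relative advantages."""
    records = []
    for k in range(num_rollouts):
        seed = base_seed + k
        mask = mask_from_seed(seed, dim, keep_prob)
        answer = rollout_fn(prompt, mask)  # rollout_fn reuses `mask` at every latent step
        records.append({"seed": seed, "answer": answer, "reward": reward_fn(answer)})
    mean_reward = sum(rec["reward"] for rec in records) / num_rollouts
    for rec in records:
        rec["advantage"] = rec["reward"] - mean_reward
    return records


def toy_rollout(prompt, mask):
    """Stand-in for the model: the 'answer' is the number of kept units."""
    return int(sum(mask))


def toy_reward(answer):
    """Stand-in for the verifier: reward 1.0 for even answers, else 0.0."""
    return 1.0 if answer % 2 == 0 else 0.0


group = run_group("2+2=?", num_rollouts=4, dim=8, keep_prob=0.5,
                  rollout_fn=toy_rollout, reward_fn=toy_reward)

# Mask replay: at update time, rebuild each rollout's mask from its stored seed.
replayed = [mask_from_seed(rec["seed"], 8, 0.5) for rec in group]
```

Replaying the stored mask means the update-time forward pass sees the same thinned network that produced the answer, so the log-probability being differentiated belongs to the rollout that earned the reward.
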

Original abstract

Group Relative Policy Optimization (GRPO) relies on the diversity of $K$ rollouts within each group; otherwise, the group-mean advantage $A^{(k)} = r^{(k)} - \mu_r$ collapses to zero. This presents a structural challenge for latent-reasoning models like Coconut, which feed continuous hidden states recurrently in place of discrete chain-of-thought tokens. Because the latent phase is inherently deterministic given the parameters and prompt, multiple rollouts produce identical trajectories, stalling GRPO's progress. Consequently, applying group-relative reinforcement learning to continuous latent reasoning has proven difficult. To address this, we propose sourcing the necessary stochasticity through structured dropout. By applying a single Bernoulli mask held constant across all latent recurrence steps for a given rollout, we generate essential trajectory variance. This shared mask effectively treats each rollout as a posterior sample from a variational distribution over parameters, allowing GRPO to optimize the expected reward of a Bayesian model-average policy. We provide both theoretical justification for this method -- including unbiasedness, variance reduction, and the well-definedness of the latent gradient -- and empirical validation. On GSM8K, dropout-GRPO improves a Coconut baseline from $27.29\%$ to $29.01\%$ pass@1, demonstrating the viability of GRPO learning for latent-reasoning models. Our work positions this as a practical, theoretically grounded approach for post-training latent-reasoning LLMs.
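
The objective the abstract describes in words can be written out explicitly. This is a notational sketch rather than the paper's own equation: $\xi$ follows the Figure 1 caption, while $J$, $q$, and $\pi_\theta(\cdot \mid x, \xi)$ are names assumed here for the expected reward, the Bernoulli mask distribution, and the policy run under a fixed mask.

```latex
J(\theta) \;=\; \mathbb{E}_{\xi \sim q}\,
               \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x, \xi)}
               \big[\, r(x, y) \,\big],
\qquad
\mu_r \;=\; \frac{1}{K} \sum_{k=1}^{K} r^{(k)}
\quad \text{with } \xi^{(k)} \sim q,\;\; y^{(k)} \sim \pi_\theta(\cdot \mid x, \xi^{(k)}).
```

Under this reading the group mean $\mu_r$ is a $K$-sample Monte Carlo estimate of $J(\theta)$, and each advantage $A^{(k)} = r^{(k)} - \mu_r$ measures how one masked model's answer compares with the model-average reward.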

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Dropout-GRPO to enable Group Relative Policy Optimization (GRPO) for continuous latent-reasoning models such as Coconut. It introduces structured dropout via a single Bernoulli mask held fixed across all latent recurrence steps within a rollout, claiming this generates trajectory diversity by treating each rollout as a sample from a variational posterior over parameters. This is asserted to allow GRPO to optimize the expected reward of a Bayesian model-average policy, with theoretical properties of unbiasedness, variance reduction, and well-defined latent gradients. Empirically, the method improves pass@1 on GSM8K from 27.29% to 29.01%.

Significance. If the claimed theoretical properties hold, the approach would address a structural barrier to applying group-relative RL to deterministic latent-reasoning models and could support post-training of such architectures. The reported gain is modest and limited to a single task, so broader significance would depend on validation across additional benchmarks and confirmation that the variational equivalence supports unbiased advantage estimation.

major comments (2)
  1. [Theory section] Asserted justification for unbiasedness: The manuscript states that the shared-mask construction yields samples from a variational posterior $q(\theta)$ such that the policy-gradient estimator built on the group-relative advantage $A^{(k)} = r^{(k)} - \mu_r$ remains unbiased for the gradient of expected reward under the Bayesian model average, yet provides no derivation showing that the policy-gradient identity is preserved after marginalization over the mask. The recurrence couples the mask to the hidden-state trajectory, which can introduce an extra covariance term not cancelled by the group baseline; this step must be shown explicitly for the central claim to hold.
  2. [Experiments] Empirical results (GSM8K evaluation): The reported 1.72-point gain is presented without error bars, multiple random seeds, or ablations isolating the contribution of the shared-mask variational interpretation versus generic dropout; given that the central claim rests on the specific equivalence to a variational posterior, these controls are required to establish that the improvement is not attributable to other factors.
minor comments (1)
  1. [Abstract] The abstract and introduction use the phrase 'well-definedness of the latent gradient' without defining the precise quantity being differentiated or the measure under which the gradient is taken.
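
For orientation, the standard argument that the first major comment asks the authors to spell out would run roughly as follows. This is a sketch under assumptions, not the paper's derivation: it takes the mask distribution $q$ to be independent of $\theta$ (a fixed dropout rate), treats $\pi_\theta(\cdot \mid x, \xi)$ as a proper distribution over answers for each fixed mask $\xi$, and assumes the $K$ rollouts are drawn independently.

```latex
\begin{align}
\nabla_\theta J(\theta)
  &= \mathbb{E}_{\xi \sim q}\,\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x, \xi)}
     \big[\, r(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x, \xi) \,\big], \\
\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x, \xi)}
     \big[\, \nabla_\theta \log \pi_\theta(y \mid x, \xi) \,\big]
  &= 0 \quad \text{for every fixed } \xi, \\
r^{(k)} - \mu_r
  &= \frac{K-1}{K} \Big( r^{(k)} - \frac{1}{K-1} \sum_{j \neq k} r^{(j)} \Big).
\end{align}
```

The leave-one-out mean in the last line does not depend on rollout $k$, so by the second line it drops out in expectation, and the advantage-weighted estimator recovers $\nabla_\theta J$ up to the constant factor $(K-1)/K$. Whether the score term is evaluated under the same mask that generated the rollout is exactly where the recurrence could break this argument, which is why the explicit derivation matters.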

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important areas for strengthening the theoretical and empirical foundations of our work. We address each major comment below and indicate the revisions we will make.

  1. Referee: [Theory section] Asserted justification for unbiasedness: The manuscript states that the shared-mask construction yields samples from a variational posterior $q(\theta)$ such that the policy-gradient estimator built on the group-relative advantage $A^{(k)} = r^{(k)} - \mu_r$ remains unbiased for the gradient of expected reward under the Bayesian model average, yet provides no derivation showing that the policy-gradient identity is preserved after marginalization over the mask. The recurrence couples the mask to the hidden-state trajectory, which can introduce an extra covariance term not cancelled by the group baseline; this step must be shown explicitly for the central claim to hold.

    Authors: We agree that the manuscript currently asserts the unbiasedness property at a high level without providing the explicit derivation requested. The potential covariance term arising from the recurrent coupling of the mask to the hidden state is a valid concern that requires careful treatment. In the revised manuscript, we will add a detailed derivation in the Theory section that explicitly shows preservation of the policy-gradient identity after marginalization over the mask, demonstrating how the group baseline cancels the relevant covariance terms.
    revision: yes

  2. Referee: [Experiments] Empirical results (GSM8K evaluation): The reported 1.72-point gain is presented without error bars, multiple random seeds, or ablations isolating the contribution of the shared-mask variational interpretation versus generic dropout; given that the central claim rests on the specific equivalence to a variational posterior, these controls are required to establish that the improvement is not attributable to other factors.

    Authors: We acknowledge that the current results lack error bars, multi-seed statistics, and targeted ablations, which are necessary to substantiate that the gains stem from the variational posterior equivalence rather than generic dropout effects. In the revision, we will report results over multiple random seeds with standard error bars and include ablations that compare the shared-mask construction against standard per-step dropout to isolate the contribution of the proposed mechanism.
    revision: yes
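
The multi-seed reporting promised here amounts to a mean and a standard error per configuration. A minimal sketch, assuming one pass@1 score per seed: `mean_and_standard_error` is a hypothetical helper and the input values are placeholders, not measurements from the paper.

```python
import math
from statistics import fmean, stdev


def mean_and_standard_error(values):
    """Return (mean, standard error of the mean) for per-seed scores."""
    if len(values) < 2:
        raise ValueError("need at least two seeds to estimate a standard error")
    # stdev is the sample standard deviation (n - 1 in the denominator).
    return fmean(values), stdev(values) / math.sqrt(len(values))


# Placeholder values only; substitute one pass@1 score per random seed.
placeholder_scores = [1.0, 2.0, 3.0]
mean, sem = mean_and_standard_error(placeholder_scores)
```

A gain of 1.72 points is only interpretable next to this standard error: if the per-seed spread of the baseline is of similar size, the shared-mask and per-step-mask variants cannot be told apart.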

Circularity Check

0 steps flagged

No significant circularity; modeling choice presented as independent step with claimed external justification

full rationale

The abstract and provided text introduce the shared Bernoulli mask as a deliberate modeling choice to generate trajectory variance, explicitly framing it as treating rollouts as variational posterior samples to enable GRPO on a Bayesian model-average policy. The paper asserts it supplies separate theoretical justification (unbiasedness, variance reduction, latent gradient well-definedness) plus empirical results on GSM8K. No equations are shown that reduce the advantage estimator or the variational equivalence to a fitted parameter defined by the same data, nor is any load-bearing premise justified solely by self-citation. The derivation chain therefore remains self-contained rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that fixed-mask dropout across recurrence steps yields trajectories that can be treated as variational posterior samples; no free parameters or invented entities are stated in the abstract.

axioms (1)
  • domain assumption: A single Bernoulli mask held constant across latent recurrence steps produces samples from a variational distribution over parameters that supports unbiased GRPO updates.
    This premise is required for the method to convert deterministic rollouts into usable group-relative advantages.

pith-pipeline@v0.9.1-grok · 5785 in / 1224 out tokens · 19226 ms · 2026-06-27T17:16:52.827882+00:00 · methodology
