Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs
Pith reviewed 2026-06-28 18:46 UTC · model grok-4.3
The pith
Prompting can force LLMs to expose detailed reasoning traces even when designed to hide them.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reasoning Exposure Prompting (REP) is a lightweight in-context elicitation method that uses shadow-model-generated demonstrations wrapped in auxiliary code-like formats to raise user-visible reasoning traces from a victim model. Across the common reasoning dataset, different victim models, and different student model distillation, REP substantially increases similarity between exposed and REP-conditioned internal traces while preserving useful reasoning signals.
What carries the argument
Reasoning Exposure Prompting (REP), an in-context elicitation technique that wraps shadow-model demonstrations in auxiliary code-like formats to surface internal reasoning traces.
If this is right
- Interface-level hiding of reasoning traces fails to block extraction of usable supervision signals.
- Reasoning distillation can occur using only user-visible prompted outputs without direct internal access.
- Deployed reasoning systems may leak their thinking processes through ordinary prompting interactions.
- Effective trace protection requires defenses beyond simple output summarization or hiding.
Where Pith is reading between the lines
- Models may need output filtering or adversarial training against elicitation to keep reasoning private.
- REP-style methods could be evaluated against models with explicit safety alignments to test bypass potential.
- The similarity measure could be validated by checking whether REP traces yield measurable downstream task improvements.
- This suggests that reasoning capability and extractability may be tightly coupled in current model architectures.
Load-bearing premise
That increased similarity between elicited traces and internal traces reflects extraction of useful transferable reasoning supervision rather than superficial format matching.
What would settle it
An experiment in which student models trained on REP-elicited traces show no accuracy gains over those trained on non-REP prompted outputs or answer-only summaries across standard reasoning benchmarks.
Figures
read the original abstract
Reasoning traces have become a valuable form of learning signals for improving and transferring the capabilities of large language models. In particular, detailed traces can help distill reasoning behavior from stronger teacher models into weaker student models. The value of capability transfer has motivated many deployed systems with reasoning models to hide raw internal traces and expose at most summaries and answers to users. As a result, we ask whether such interface-level trace hiding prevents users from obtaining useful reasoning supervision through prompting. We study this question with Reasoning Exposure Prompting (REP), a lightweight in-context elicitation method that uses shadow-model-generated demonstrations wrapped in auxiliary code-like formats to raise user-visible reasoning traces from a victim model. Across the common reasoning dataset, different victim models, and different student model distillation, REP substantially increases similarity between exposed and REP-conditioned internal traces while preserving useful reasoning signals.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Reasoning Exposure Prompting (REP), an in-context elicitation technique that wraps shadow-model demonstrations in auxiliary code-like formats to surface reasoning traces from LLMs that otherwise expose only summaries or answers. It claims that, across common reasoning datasets, multiple victim models, and student-model distillation settings, REP substantially raises similarity between the elicited traces and the models' internal traces while preserving the utility of those traces as supervision signals.
Significance. If the empirical results are supported by quantitative metrics and appropriate controls, the finding would show that interface-level trace hiding does not block extraction of usable reasoning supervision. This would bear on the design of deployed reasoning systems, the feasibility of trace-based distillation, and the limits of output-side obfuscation.
major comments (2)
- [Abstract] Abstract: the central claim that REP 'substantially increases similarity between exposed and REP-conditioned internal traces while preserving useful reasoning signals' is stated without any reported metrics, baselines, statistical tests, or quantitative results, so the support for the claim cannot be evaluated from the provided text.
- The experimental design does not describe ablations that use structurally identical but content-free or randomized traces; without such controls it remains unclear whether observed similarity gains reflect extraction of internal reasoning or simply format priming induced by the code-like demonstration style.
minor comments (1)
- The phrase 'shadow-model-generated demonstrations' is used without an explicit definition or pointer to the precise construction of the shadow models.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below and indicate planned revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that REP 'substantially increases similarity between exposed and REP-conditioned internal traces while preserving useful reasoning signals' is stated without any reported metrics, baselines, statistical tests, or quantitative results, so the support for the claim cannot be evaluated from the provided text.
Authors: We agree that the abstract would benefit from quantitative support. The body of the manuscript reports specific metrics (trace similarity scores, distillation performance deltas, baselines, and statistical tests), but these are not summarized in the abstract. We will revise the abstract to include key quantitative results, such as the reported similarity gains and distillation utility metrics, while remaining within length constraints. revision: yes
-
Referee: The experimental design does not describe ablations that use structurally identical but content-free or randomized traces; without such controls it remains unclear whether observed similarity gains reflect extraction of internal reasoning or simply format priming induced by the code-like demonstration style.
Authors: This is a fair point about potential format confounds. Our experiments include comparisons to standard prompting and other elicitation methods, but do not explicitly include content-free or randomized-trace controls. We will add these ablations in the revised manuscript, using structurally matched but semantically null or randomized demonstrations, to isolate the contribution of content versus format. revision: yes
Circularity Check
No circularity: purely empirical comparisons with no self-referential definitions or fitted predictions
full rationale
The paper describes an empirical method (REP) and reports similarity increases across datasets and models. No equations, parameters fitted to subsets then re-predicted, or self-citations that bear the central load appear in the provided text. The claim rests on observable trace similarity metrics and distillation utility, which are externally measurable and not defined in terms of the result itself. This matches the default expectation of a non-circular empirical paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Detailed reasoning traces serve as effective learning signals for distilling capabilities from teacher to student models
invented entities (1)
-
Reasoning Exposure Prompting (REP)
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Aho and Jeffrey D
Alfred V. Aho and Jeffrey D. Ullman , title =. 1972
1972
-
[2]
Publications Manual , year = "1983", publisher =
1983
-
[3]
Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243
-
[4]
Scalable training of
Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of. 2007 , url=
2007
-
[5]
Dan Gusfield , title =. 1997
1997
-
[6]
Tetreault , title =
Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =
2015
-
[7]
A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =
Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =. 2005 , url=
2005
-
[8]
and Tukey, John W
Cooley, James W. and Tukey, John W. , journal=. An algorithm for the machine calculation of complex. 1965 , url=
1965
-
[9]
Advances in neural information processing systems , volume=
Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=
-
[10]
Advances in neural information processing systems , volume=
Large language models are zero-shot reasoners , author=. Advances in neural information processing systems , volume=
-
[11]
and Chi, Ed H
Wang, Xuezhi and Wei, Jason and Schuurmans, Dale and Le, Quoc V. and Chi, Ed H. and Narang, Sharan and Chowdhery, Aakanksha and Zhou, Denny , title =. International Conference on Learning Representations , year =
-
[12]
arXiv preprint arXiv:2207.00747 , year =
Rationale-Augmented Ensembles in Language Models , author =. arXiv preprint arXiv:2207.00747 , year =
-
[13]
Advances in Neural Information Processing Systems , volume=
Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting , author=. Advances in Neural Information Processing Systems , volume=
-
[14]
Measuring Faithfulness in Chain-of-Thought Reasoning
Measuring faithfulness in chain-of-thought reasoning , author=. arXiv preprint arXiv:2307.13702 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=
Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning , author=. Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=
2024
-
[16]
How to Steal Reasoning Without Reasoning Traces
How to Steal Reasoning Without Reasoning Traces , author=. arXiv preprint arXiv:2603.07267 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
Yang, An and others , title =. arXiv preprint arXiv:2505.09388 , year =. 2505.09388 , archivePrefix =
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
Qwen2.5 Technical Report , journal =. 2025 , month = jan, day =. doi:10.48550/arXiv.2412.15115 , url =. 2412.15115 , archivePrefix =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.15115 2025
-
[19]
OpenThoughts: Data Recipes for Reasoning Models
OpenThoughts: Data Recipes for Reasoning Models , author =. arXiv preprint arXiv:2506.04178 , year =
work page internal anchor Pith review Pith/arXiv arXiv
-
[20]
2024 , month = sep, day =
Learning to Reason with. 2024 , month = sep, day =
2024
-
[21]
2026 , month = apr, day =
Claude Mythos Preview System Card , institution =. 2026 , month = apr, day =
2026
-
[22]
2024 , howpublished =
American Invitational Mathematics Examination (. 2024 , howpublished =
2024
-
[23]
2025 , howpublished =
American Invitational Mathematics Examination (. 2025 , howpublished =
2025
-
[24]
International Conference on Learning Representations , volume=
Livecodebench: Holistic and contamination free evaluation of large language models for code , author=. International Conference on Learning Representations , volume=
-
[25]
Training Verifiers to Solve Math Word Problems
Training verifiers to solve math word problems , author=. arXiv preprint arXiv:2110.14168 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[26]
NeurIPS , year =
Measuring Mathematical Problem Solving With the MATH Dataset , author =. NeurIPS , year =
-
[27]
2025 , howpublished =
OpenThoughts-114k , author =. 2025 , howpublished =
2025
-
[28]
Gemini Thinking , year =
-
[29]
Building with Extended Thinking , author =
-
[30]
Terms of Use , year =
-
[31]
Can I Use My Outputs to Train an AI Model? , year =
-
[32]
Gemini API Additional Terms of Service , year =
-
[33]
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=
s1: Simple test-time scaling , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=
2025
-
[34]
International Conference on Learning Representations , volume=
Let's verify step by step , author=. International Conference on Learning Representations , volume=
-
[35]
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
Monitoring reasoning models for misbehavior and the risks of promoting obfuscation , author=. arXiv preprint arXiv:2503.11926 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[36]
Advances in Neural Information Processing Systems , volume=
Star: Bootstrapping reasoning with reasoning , author=. Advances in Neural Information Processing Systems , volume=
-
[37]
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) , pages=
Teaching small language models to reason , author=. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) , pages=
-
[38]
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
Symbolic chain-of-thought distillation: Small models can also “think” step-by-step , author=. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
-
[39]
Findings of the Association for Computational Linguistics: ACL 2023 , pages=
Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes , author=. Findings of the Association for Computational Linguistics: ACL 2023 , pages=
2023
-
[40]
Orca: Progressive Learning from Complex Explanation Traces of GPT-4
Orca: Progressive learning from complex explanation traces of gpt-4 , author=. arXiv preprint arXiv:2306.02707 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[41]
Nature , volume=
DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning , author=. Nature , volume=. 2025 , publisher=
2025
-
[42]
Detecting and Preventing Distillation Attacks , year =
-
[43]
2026 , month = feb, day =
2026
-
[44]
2026 , month = feb, day =
Updated Stakes for American-Led, Democratic. 2026 , month = feb, day =
2026
-
[45]
2026 , month = mar, day =
Bearman, Theo , title =. 2026 , month = mar, day =
2026
-
[46]
Reasoning Models Don't Always Say What They Think
Reasoning models don't always say what they think , author=. arXiv preprint arXiv:2505.05410 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[47]
arXiv preprint arXiv:2601.05076 , year=
Chain-of-Sanitized-Thoughts: Plugging PII Leakage in CoT of Large Reasoning Models , author=. arXiv preprint arXiv:2601.05076 , year=
-
[48]
arXiv preprint arXiv:2511.07772 , year=
SALT: Steering Activations towards Leakage-free Thinking in Chain of Thought , author=. arXiv preprint arXiv:2511.07772 , year=
-
[49]
arXiv preprint arXiv:2603.05618 , year=
Safer Reasoning Traces: Measuring and Mitigating Chain-of-Thought Leakage in LLMs , author=. arXiv preprint arXiv:2603.05618 , year=
-
[50]
2025 , month = nov, day =
Gehlot, Abhishek , title =. 2025 , month = nov, day =
2025
-
[51]
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=
Have llms advanced enough? a challenging problem solving benchmark for large language models , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.