Revisiting Chain-of-Thought Reasoning under Limited Supervision: Semi-supervised Chain-of-Thought Learning
Pith reviewed 2026-07-03 20:02 UTC · model grok-4.3
The pith
Answer-level semantic entropy selects high-precision pseudo chain-of-thought (pseudo-CoT) demonstrations from unlabeled questions for semi-supervised training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Semi-CoT samples multiple pseudo-CoTs for each unlabeled question, estimates answer-level semantic entropy, and selects low-entropy reasoning chains as reliable pseudo-CoT demonstrations. The reported pseudo-answer precision ranges from 91.36% to 100% across AQuA, SVAMP, GSM8K, and MultiArith, which the paper takes as evidence that unlabeled questions can supply usable reasoning supervision under this filter.
What carries the argument
The entropy gate: sampling multiple pseudo-CoTs per unlabeled question and retaining only those with low answer-level semantic entropy as pseudo-supervision.
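A minimal sketch of such a gate, under assumptions the summary does not fix: answers are grouped by a normalized numeric or string key (a stand-in for semantic equivalence), entropy is the Shannon entropy of the group frequencies, and the names `normalize`, `answer_entropy`, `entropy_gate` and the threshold `max_entropy` are hypothetical.

```python
import math
from collections import Counter


def normalize(answer: str) -> str:
    """Map surface variants of an answer to one key (assumed equivalence rule)."""
    text = answer.strip().lower().replace(",", "").rstrip(".")
    try:
        return f"{float(text):g}"  # "18", "18.0" and "18.00" share one key
    except ValueError:
        return text  # non-numeric answers, e.g. multiple-choice letters


def answer_entropy(answers: list[str]) -> float:
    """Shannon entropy (nats) of the empirical distribution over answer groups."""
    counts = Counter(normalize(a) for a in answers)
    total = sum(counts.values())
    return sum((c / total) * math.log(total / c) for c in counts.values())


def entropy_gate(samples: list[tuple[str, str]], max_entropy: float = 0.5) -> list[str]:
    """Keep chains that agree with the majority answer, but only if entropy is low.

    `samples` holds (reasoning_chain, final_answer) pairs for one question.
    The default threshold is an arbitrary illustrative value.
    """
    answers = [normalize(ans) for _, ans in samples]
    if not answers or answer_entropy(answers) > max_entropy:
        return []  # question rejected: sampled answers disagree too much
    majority, _ = Counter(answers).most_common(1)[0]
    return [chain for chain, ans in samples if normalize(ans) == majority]
```

Note that the gate inspects only the final answers; the reasoning text in each chain is passed through unchecked, which is exactly the premise examined next.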
Load-bearing premise
Low answer-level semantic entropy on the final answer serves as a reliable proxy that the full reasoning chain is correct.
What would settle it
An audit that finds many low-entropy pseudo-CoTs containing incorrect intermediate steps despite correct final answers would falsify the gate's reliability.
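A first pass at such an audit could be automated for explicit arithmetic statements. The sketch below is illustrative only: `find_bad_steps` and `flawed_but_correct_rate` are hypothetical helpers, the regular expression recognizes only simple `a op b = c` statements, and any step it cannot parse is left unverified rather than counted as valid.

```python
import re

# Matches simple statements such as "3 * 4 = 12" or "10 - 2.5 = 7.5".
STEP = re.compile(
    r"(-?\d+(?:\.\d+)?)\s*([+\-*/])\s*(-?\d+(?:\.\d+)?)\s*=\s*(-?\d+(?:\.\d+)?)"
)

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b if b != 0 else float("nan"),
}


def find_bad_steps(chain: str, tol: float = 1e-6) -> list[str]:
    """Return the arithmetic statements in `chain` whose stated result is wrong."""
    bad = []
    for match in STEP.finditer(chain):
        left, op, right, stated = match.groups()
        actual = OPS[op](float(left), float(right))
        if not abs(actual - float(stated)) <= tol:  # NaN (division by zero) counts as bad
            bad.append(match.group(0))
    return bad


def flawed_but_correct_rate(chains: list[tuple[str, bool]]) -> float:
    """Share of answer-correct chains that still contain a wrong arithmetic step.

    `chains` holds (reasoning_chain, final_answer_is_correct) pairs.
    """
    correct = [chain for chain, answer_ok in chains if answer_ok]
    if not correct:
        return 0.0
    return sum(bool(find_bad_steps(c)) for c in correct) / len(correct)
```

A high rate on the retained chains would count against the gate; a low rate would be only weak support, since logical errors that are not arithmetic slips are invisible to this check.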
Original abstract
Chain-of-thought (CoT) reasoning has emerged as an effective approach for activating latent reasoning capabilities in large language models. However, most existing CoT methods use reasoning chains mainly as inference-time prompts, while the generated reasoning traces are rarely reused as semi-supervised learning signals. In this report, we define Semi-supervised Chain-of-Thought Learning and propose Semi-CoT, a simple framework that uses unlabeled questions to construct pseudo reasoning supervision. Semi-CoT samples multiple pseudo-CoTs for each unlabeled question, estimates answer-level semantic entropy, and selects low-entropy reasoning chains as reliable pseudo-CoT demonstrations. This extends the self-training view of CoT from inference-time refinement to semi-supervised pseudo-supervision. Pilot experiments on AQuA, SVAMP, GSM8K, and MultiArith show that the entropy gate selects high-precision pseudo-CoTs, with pseudo-answer precision ranging from 91.36% to 100%. Semi-CoT also gives small gains on SVAMP and GSM8K, while AQuA shows negative transfer and MultiArith reaches a ceiling. These results suggest that unlabeled questions can provide reliable pseudo reasoning signals, but their effective use still requires stronger demonstration selection or student training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper defines Semi-supervised Chain-of-Thought Learning and proposes Semi-CoT, which samples multiple pseudo-CoTs per unlabeled question, thresholds on answer-level semantic entropy to select low-entropy chains as pseudo-supervision, and reports pilot results on AQuA, SVAMP, GSM8K and MultiArith showing 91.36–100% pseudo-answer precision together with small or mixed downstream gains.
Significance. If the selected chains supply verifiably high-quality reasoning traces rather than merely correct final answers, the framework would offer a practical route to semi-supervised CoT training. The current evidence remains preliminary and the significance is therefore limited until the reasoning-step quality assumption is directly tested.
major comments (2)
- [method and pilot experiments] The selection procedure thresholds answer-level semantic entropy, which only certifies agreement on the final answer. No experiment or analysis checks whether the intermediate reasoning steps in the retained chains are valid; this assumption is load-bearing for the claim that the selected chains constitute reliable pseudo-CoT supervision (see the entropy-gate description and the pilot-experiment paragraph).
- [pilot experiments] Downstream results are reported without error bars, statistical tests, or comparisons against standard self-training or CoT baselines; the small gains on SVAMP/GSM8K and negative transfer on AQuA therefore cannot be interpreted as evidence that the pseudo-CoT signal is effective.
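The uncertainty estimate requested in the second major comment could, for example, come from a paired bootstrap over test questions. The sketch below is illustrative only: `paired_bootstrap_ci` is a hypothetical helper, the inputs are per-question 0/1 correctness vectors for the same questions under two systems, and the resample count, confidence level, and seed are arbitrary choices rather than values from the paper.

```python
import random


def paired_bootstrap_ci(baseline, treated, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile interval for accuracy(treated) - accuracy(baseline) on paired items."""
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("need two non-empty, equally long 0/1 vectors")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(baseline)
    diffs = []
    for _ in range(n_resamples):
        # Resample question indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(treated[i] - baseline[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi
```

Under this reading, a reported gain would be treated as distinguishable from resampling noise only when the interval excludes zero.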
minor comments (2)
- [method] The exact procedure for computing semantic entropy (number of samples, clustering method, temperature) is not specified.
- [abstract and experiments] The manuscript repeatedly refers to 'pilot experiments' yet presents the precision numbers as the headline result; clarify the scope and limitations of these runs.
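For concreteness, one plausible form of the quantity left unspecified in the first minor comment is the plug-in entropy over answer clusters. This is an assumed formulation, not the paper's stated definition: the $K$ sampled answers $a_1, \dots, a_K$ for a question $q$, the set of equivalence clusters $\mathcal{C}(q)$, and the threshold $\tau$ are illustrative symbols.

```latex
\hat{p}(c \mid q) = \frac{1}{K} \sum_{k=1}^{K} \mathbf{1}\!\left[a_k \in c\right],
\qquad
\hat{H}(q) = -\sum_{c \in \mathcal{C}(q)} \hat{p}(c \mid q)\,\log \hat{p}(c \mid q),
\qquad
\text{keep } q \text{ if } \hat{H}(q) \le \tau .
```

Under this form, $\hat{H}(q) = 0$ when all $K$ samples agree and $\hat{H}(q) = \log K$ when every sample falls in its own cluster, so $K$, the clustering rule, the sampling temperature, and $\tau$ are the values a reader would need in order to reproduce the gate.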
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our pilot study. We address each major comment below and indicate planned revisions.
Point-by-point responses
- Referee: [method and pilot experiments] The selection procedure thresholds answer-level semantic entropy, which only certifies agreement on the final answer. No experiment or analysis checks whether the intermediate reasoning steps in the retained chains are valid; this assumption is load-bearing for the claim that the selected chains constitute reliable pseudo-CoT supervision (see the entropy-gate description and the pilot-experiment paragraph).
  Authors: We agree that answer-level semantic entropy certifies final-answer agreement rather than step-by-step validity. Our pilot reports high pseudo-answer precision as evidence of selection quality, but does not include direct checks on reasoning-step correctness. We will revise the manuscript to explicitly state this assumption as a limitation and clarify that pseudo-CoT reliability is inferred from answer consistency.
  Revision: yes
- Referee: [pilot experiments] Downstream results are reported without error bars, statistical tests, or comparisons against standard self-training or CoT baselines; the small gains on SVAMP/GSM8K and negative transfer on AQuA therefore cannot be interpreted as evidence that the pseudo-CoT signal is effective.
  Authors: The downstream numbers are from a small-scale pilot and lack error bars or statistical tests, limiting interpretability of the mixed gains. We will revise the text to emphasize the preliminary character of these results, focus primary claims on the observed pseudo-answer precision, and note that rigorous baseline comparisons are left for future work.
  Revision: partial
Circularity Check
No circularity; entropy selection and precision measurement are independent
Full rationale
The paper defines Semi-CoT by sampling multiple CoTs per unlabeled question, computing answer-level semantic entropy from those samples, and thresholding to select low-entropy chains. Precision is then measured by comparing the selected pseudo-answers to ground-truth labels on the evaluation sets. Because the entropy computation uses only model samples and the precision metric uses external labels never seen during selection, no equation or procedure reduces the reported 91.36–100% figures to a quantity fitted on the same data. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The derivation is therefore a self-contained empirical observation rather than a tautology.
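The independence argument can be made concrete: labels enter only at the measurement step, after selection has finished. The sketch below is a hypothetical helper (`pseudo_answer_precision`), assuming pseudo-answers and gold labels are stored as dictionaries keyed by question ID and compared by exact string match.

```python
def pseudo_answer_precision(selected: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of gate-selected pseudo-answers that match the held-out gold label.

    `selected` maps question ID -> pseudo-answer for questions that passed the gate.
    `gold` maps question ID -> ground-truth answer and is never used for selection.
    """
    if not selected:
        raise ValueError("no questions passed the gate")
    hits = sum(1 for qid, answer in selected.items() if gold[qid] == answer)
    return hits / len(selected)
```

Because the denominator counts only questions that passed the gate, this number says nothing about how many questions were rejected.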
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Low answer-level semantic entropy indicates high-quality reasoning chains suitable for use as pseudo-supervision.