On the Cost and Benefit of Chain of Thought: A Learning-Theoretic Perspective
Pith reviewed 2026-05-21 05:06 UTC · model grok-4.3
The pith
Chain-of-thought reasoning risk decomposes into oracle-trajectory benefit and trajectory-mismatch cost
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We model CoT as the interaction between an answer map and a chain rule that generates intermediate questions autoregressively, and define the reasoning risk of a hypothesis under this interaction. Our first result is a tight canonical decomposition of this risk into two terms with opposing roles: an oracle-trajectory risk (OTR), which captures the benefit of CoT and reduces to a target-domain risk in a domain adaptation problem, and a trajectory-mismatch risk (TMR), which captures the cost of CoT through error accumulation along mismatched reasoning trajectories. Under stability, we prove a tight upper bound on the TMR governed by an exact amplification factor that identifies bounded, linear, and exponential error-growth regimes.
What carries the argument
The canonical decomposition of reasoning risk into oracle-trajectory risk (OTR) and trajectory-mismatch risk (TMR), which separates CoT's benefit from its cost of error accumulation on mismatched trajectories.
Load-bearing premise
The modeling of CoT as the interaction between an answer map and a chain rule that generates intermediate questions autoregressively.
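To make the premise concrete, here is a minimal sketch of how an answer map and a chain rule could interact autoregressively. It is an illustration only: the names `rollout`, `truth`, `hypothesis`, and `next_question`, the scalar question/answer types, and the specific toy maps are all assumptions introduced here, not definitions taken from the paper.

```python
from typing import Callable, List, Tuple

# Hypothetical types: questions and answers are plain floats in this toy.
AnswerMap = Callable[[float], float]
ChainRule = Callable[[List[float], List[float]], float]


def rollout(
    answer_map: AnswerMap,
    chain_rule: ChainRule,
    first_question: float,
    steps: int,
) -> Tuple[List[float], List[float]]:
    """Generate `steps` question/answer pairs autoregressively.

    Each new question is produced by the chain rule from the full
    history of earlier questions and answers.
    """
    questions = [first_question]
    answers = [answer_map(first_question)]
    for _ in range(steps - 1):
        q = chain_rule(questions, answers)
        questions.append(q)
        answers.append(answer_map(q))
    return questions, answers


# Toy instances (assumptions, chosen only so the arithmetic is easy to follow).
def truth(q: float) -> float:
    return 2.0 * q


def hypothesis(q: float) -> float:
    return 2.0 * q + 0.1  # uniformly within 0.1 of the truth


def next_question(questions: List[float], answers: List[float]) -> float:
    return answers[-1] + 1.0  # depends only on the latest answer


# Oracle trajectory: questions are driven by ground-truth answers.
oracle_q, oracle_a = rollout(truth, next_question, 1.0, 3)
# Hypothesis trajectory: questions are driven by the hypothesis's own answers.
hyp_q, hyp_a = rollout(hypothesis, next_question, 1.0, 3)
```

In this toy, both rollouts start from the same question, yet the hypothesis trajectory drifts away from the oracle trajectory because each slightly wrong answer feeds into the next question. That drift is the kind of mismatch the trajectory-mismatch term is described as capturing.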
What would settle it
Construct an unstable chain rule and an accurate hypothesis with zero oracle-trajectory risk, then measure whether the trajectory-mismatch risk grows without bound.
Original abstract
We develop a learning-theoretic framework for understanding Chain of Thought (CoT). We model CoT as the interaction between an answer map and a chain rule that generates intermediate questions autoregressively, and define the reasoning risk of a hypothesis under this interaction. Our first result is a tight canonical decomposition of this risk into two terms with opposing roles: an oracle-trajectory risk (OTR), which captures the benefit of CoT and reduces to a target-domain risk in a domain adaptation problem, and a trajectory-mismatch risk (TMR), which captures the cost of CoT through error accumulation along mismatched reasoning trajectories. We then show that this cost is unavoidable without structure: if any one of the loss, the hypothesis answer map, or the chain rule lacks stability, the TMR can be arbitrarily large even when the OTR is zero and the hypothesis is uniformly close to the ground truth. Conversely, under stability, we prove a tight upper bound on the TMR governed by an exact amplification factor that identifies bounded, linear, and exponential error-growth regimes. Together, these results give a precise theory of when CoT helps, when it hurts, and what controls the transition between the two.
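The three error-growth regimes can be illustrated with a toy scalar recursion. The sketch below is not the paper's amplification factor, whose exact form is not given on this page. It assumes a ground-truth answer map f(q) = q, a hypothesis h(q) = q + delta, and a linear chain rule that scales the latest answer by a constant `gain`. Under those assumptions the final-answer gap after K steps is delta times the geometric sum 1 + gain + ... + gain^(K-1). All function and parameter names are hypothetical.

```python
def final_answer_gap(gain: float, delta: float, steps: int, x: float = 1.0) -> float:
    """Gap between hypothesis and ground-truth final answers in a toy chain.

    Assumed toy model (not from the paper):
      ground truth  f(q) = q
      hypothesis    h(q) = q + delta
      chain rule    next question = gain * latest answer
    """
    q_true = q_hyp = x
    a_true = a_hyp = 0.0
    for _ in range(steps):
        a_true = q_true          # ground-truth answer
        a_hyp = q_hyp + delta    # hypothesis answer, off by delta per step
        q_true = gain * a_true   # next question on the oracle trajectory
        q_hyp = gain * a_hyp     # next question on the hypothesis trajectory
    return abs(a_hyp - a_true)


def geometric_factor(gain: float, steps: int) -> float:
    """Closed-form multiplier on delta for the toy model above."""
    return sum(gain ** j for j in range(steps))


delta = 0.01
contractive = final_answer_gap(0.5, delta, 50)  # stays below delta / (1 - 0.5)
neutral = final_answer_gap(1.0, delta, 10)      # grows like steps * delta
expansive = final_answer_gap(2.0, delta, 10)    # grows like 2**steps * delta
```

With a gain below 1 the gap saturates, with a gain of exactly 1 it grows linearly in the number of steps, and with a gain above 1 it grows geometrically. This mirrors the bounded, linear, and exponential regimes named in the abstract, though only within this simplified model.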
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a learning-theoretic framework for understanding Chain of Thought (CoT). It models CoT as the interaction between an answer map and a chain rule that generates intermediate questions autoregressively, and defines the reasoning risk of a hypothesis under this interaction. The main results are a tight canonical decomposition of this risk into oracle-trajectory risk (OTR) capturing the benefit of CoT (reducing to target-domain risk in domain adaptation) and trajectory-mismatch risk (TMR) capturing the cost through error accumulation. It shows that without stability, TMR can be arbitrarily large even when OTR is zero, and under stability provides a tight upper bound governed by an amplification factor identifying bounded, linear, and exponential error-growth regimes.
Significance. This framework offers a precise theory of when CoT helps or hurts by identifying stability as the key factor. The OTR/TMR decomposition provides clear separation of benefits and costs, with the domain adaptation analogy adding interpretability. The error growth regimes could help predict and mitigate issues in long reasoning chains. These results, if verified, contribute to the theoretical foundations of reasoning in large models.
major comments (1)
- [Modeling of CoT and definition of reasoning risk] The canonical decomposition into OTR and TMR is a direct algebraic consequence of the modeling choice where CoT is the interaction between a fixed answer map and an independent autoregressive chain rule. This modeling enables the split but may not reflect the joint training dynamics typical in CoT, where a single model generates both the chain and the answer without explicit separation. Since this is the load-bearing assumption for defining the reasoning risk and enabling the subsequent analysis, the paper would benefit from additional discussion on how the results translate to jointly optimized models or why the separation is a reasonable abstraction.
minor comments (1)
- [Notation and definitions] Ensure that all symbols, such as the amplification factor, are clearly defined upon first use and that any assumptions on the hypothesis class are explicitly stated to aid readability.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive comments, as well as for recognizing the potential of the OTR/TMR framework. We address the single major comment below and will revise the manuscript accordingly to strengthen the discussion of modeling assumptions.
Point-by-point responses
Referee: [Modeling of CoT and definition of reasoning risk] The canonical decomposition into OTR and TMR is a direct algebraic consequence of the modeling choice where CoT is the interaction between a fixed answer map and an independent autoregressive chain rule. This modeling enables the split but may not reflect the joint training dynamics typical in CoT, where a single model generates both the chain and the answer without explicit separation. Since this is the load-bearing assumption for defining the reasoning risk and enabling the subsequent analysis, the paper would benefit from additional discussion on how the results translate to jointly optimized models or why the separation is a reasonable abstraction.
Authors: We agree that the separation of the answer map and chain rule is a deliberate modeling abstraction chosen to enable the clean algebraic decomposition of reasoning risk. This choice is reasonable because the autoregressive generation of intermediate steps followed by a final answer map is the functional form of CoT even when a single model is trained end-to-end; the parameters may be shared, but the roles remain distinct and the risk decomposition continues to hold formally under the same interaction. The framework thereby isolates the benefit (OTR, which reduces to target risk in a domain-adaptation view) from the cost (TMR due to trajectory mismatch), providing insight that remains relevant for jointly optimized models. To address the referee's suggestion, we will add a new paragraph in the Discussion section explaining this rationale, noting that the stability conditions and error-growth regimes apply directly to the composed hypothesis regardless of training procedure, and briefly relating the abstraction to modular versus monolithic reasoning architectures in the literature. This revision clarifies scope without changing any theorems or proofs.
Revision: yes
Circularity Check
No significant circularity detected in derivation chain
Full rationale
The paper defines a reasoning risk via the composition of an answer map and an autoregressive chain rule for generating intermediate questions. It then algebraically decomposes this defined quantity into an oracle-trajectory risk (OTR) term and a trajectory-mismatch risk (TMR) term. Subsequent stability-based bounds on TMR follow from additional assumptions on the loss, hypothesis, and chain rule rather than from any fitted parameters, self-citations, or reductions of the target claims to the inputs by construction. The framework remains self-contained as a sequence of definitions, identities, and conditional theorems without load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Existence of probability distributions over input sequences and output labels for defining expectations in the risk terms.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
We model CoT as the interaction between an answer map and a chain rule that generates intermediate questions autoregressively, and define the reasoning risk... tight canonical decomposition... trajectory-mismatch risk (TMR)... oracle-trajectory risk (OTR)... amplification factor α_K(ϕ, δ)
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean: alpha_pin_under_high_calibration (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
under stability... exact amplification factor that identifies bounded, linear, and exponential error-growth regimes
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.