Recognition: 2 theorem links
A Theory of Online Learning with Autoregressive Chain-of-Thought Reasoning
Pith reviewed 2026-05-11 00:47 UTC · model grok-4.3
The pith
In online autoregressive learning, end-to-end feedback permits mistake bounds that grow as fast as logarithmically in the generation horizon M, while full chain-of-thought feedback removes the dependence on M altogether.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We prove a taxonomy of possible mistake-bound growth rates in the generation horizon M for the end-to-end model: essentially any rate between constant and logarithmic can arise. We further show that this logarithmic ceiling is unavoidable. In the chain-of-thought model, we show that access to the full generated trajectory eliminates the dependence on M altogether. We also analyze autoregressive linear threshold classes and prove optimal mistake bounds, as well as a new lower bound for the statistical setting.
What carries the argument
The online mistake bound for the final output of M repeated applications of an unknown next-token generator, under either final-token-only feedback or full-trajectory feedback.
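The setup can be made concrete with a small sketch (toy notation of my own, not the paper's formalism): an unknown next-token generator is applied for M autoregressive steps, and the two feedback models differ only in how much of the resulting trajectory the learner observes.

```python
def generate(g, prompt, M):
    """Apply the next-token generator g for M autoregressive steps.

    g maps the current token string (a tuple) to the next token; the
    final output of the process is the last generated token.
    """
    s = tuple(prompt)
    for _ in range(M):
        s = s + (g(s),)
    return s

# Toy generator (illustrative only): next bit is the XOR of the last two bits.
g = lambda s: s[-1] ^ s[-2]

traj = generate(g, (1, 0), M=4)   # traj == (1, 0, 1, 1, 0, 1)

end_to_end_feedback = traj[-1]    # End-to-End model: only the final token
chain_of_thought_feedback = traj  # Chain-of-Thought model: the whole trajectory
```

The question the paper studies is how the optimal number of online mistakes on the final token depends on M under each of the two feedback signals.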
If this is right
- Any growth rate between constant and logarithmic in M is achievable for some autoregressive classes under end-to-end feedback.
- Logarithmic growth in M is the worst-case rate that can be forced in the end-to-end model.
- Full trajectory observation makes the optimal mistake bound independent of M.
- Optimal mistake bounds are established for autoregressive linear threshold classes.
- Several open questions from the prior PAC analysis of autoregressive maps are resolved.
Where Pith is reading between the lines
- Training with full reasoning traces could remove the sample-complexity penalty associated with long generation horizons.
- The taxonomy suggests that the scaling of online regret with sequence length can be controlled by the choice of function class.
- The same qualitative distinction between end-to-end and trajectory feedback may appear in non-realizable or noisy online settings.
- Empirical tests on concrete next-token classes could check whether real language-model families attain the logarithmic worst case.
Load-bearing premise
The target function is realized exactly by repeated application of some fixed but unknown next-token generator for a fixed horizon M, in a realizable online setting with feedback after each prediction.
What would settle it
A concrete family of next-token generators for which the minimal end-to-end mistake bound grows faster than any logarithmic function of M, or a chain-of-thought setting in which the bound still scales with M.
Original abstract
Autoregressive generation lies at the heart of the mechanism of large language models. It can be viewed as the repeated application of a next-token generator: starting from an input string (prompt), the generator is applied for $M$ steps, and the last generated token is taken as the final output. [Joshi et al., 2025] proposed a PAC model for studying the learnability of the input-output maps arising from this process. We develop an online analogue of this framework, focusing on the mistake bound of learning the final output induced by an unknown next-token generator. We distinguish between two forms of feedback. In the End-to-End model, after each round the learner observes only the final token produced after $M$ autoregressive steps. In the Chain-of-Thought model, the learner is additionally shown the entire $M$-step trajectory. Our goal is to understand how the optimal mistake bound depends on the generation horizon $M$, and to what extent observing intermediate tokens can reduce this dependence. Our main results show that the online theory of autoregressive learning exhibits a qualitative picture analogous to the statistical one found by [Hanneke et al., 2026], but with a different scale of dependence on the generation horizon. In the End-to-End model, we prove a taxonomy of possible mistake-bound growth rates in the generation horizon $M$: essentially any rate between constant and logarithmic can arise. We further show that this logarithmic ceiling is unavoidable. In the Chain-of-Thought model, we show that access to the full generated trajectory eliminates the dependence on $M$ altogether. We also analyze autoregressive linear threshold classes, and prove optimal mistake bounds, as well as a new lower bound for the statistical setting. Along the way, our results resolve several questions left open by [Joshi et al., 2025].
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops an online learning framework for autoregressive generation, viewing it as repeated application of an unknown next-token generator over a fixed horizon M. It distinguishes the End-to-End model (feedback only on the final token) from the Chain-of-Thought model (full trajectory feedback). Main results establish a taxonomy of possible mistake-bound growth rates in M for End-to-End (any rate between constant and logarithmic), prove the logarithmic ceiling is tight, and show that Chain-of-Thought feedback removes all M-dependence. Additional contributions include optimal mistake bounds for autoregressive linear threshold classes, a new lower bound in the statistical setting, and resolution of open questions from Joshi et al. (2025). The analysis assumes a realizable setting with online feedback after each prediction.
Significance. If the derivations hold, the work provides a valuable theoretical foundation for online learnability of autoregressive models, extending prior PAC results with a different scale of M-dependence. The taxonomy and tightness results, combined with the demonstration that full trajectory access eliminates M-dependence, offer clear insights into feedback mechanisms relevant to LLM training. Credit is due for the parameter-free nature of the bounds under standard realizable assumptions and for resolving multiple open questions from the 2025 reference. This could inform algorithm design for chain-of-thought prompting and online adaptation.
minor comments (2)
- [Abstract and Introduction] The abstract and introduction would benefit from an explicit enumeration of the open questions from Joshi et al. (2025) that are resolved, to make the contribution more immediately verifiable.
- [Section 2 (Preliminaries)] Notation for the next-token generator and the induced target function is introduced clearly in the abstract but should be cross-referenced with a formal definition in the first technical section for consistency.
Simulated Author's Rebuttal
We thank the referee for the positive review and the recommendation for minor revision. We appreciate the accurate summary of our contributions, including the taxonomy of mistake-bound growth rates in the End-to-End model, the tightness of the logarithmic ceiling, the elimination of M-dependence under Chain-of-Thought feedback, the optimal bounds for linear threshold classes, the new statistical lower bound, and the resolution of open questions from Joshi et al. (2025).
Circularity Check
No significant circularity identified
full rationale
The derivation establishes mistake bounds for autoregressive online learning under standard realizable assumptions with fixed unknown next-token generator and horizon M. End-to-End results prove a taxonomy of growth rates (constant to logarithmic in M) with a tight log ceiling via direct analysis of the feedback model; Chain-of-Thought results show M-independence from full trajectory access. These follow from applying online learning theory to the autoregressive process without reducing to fitted parameters, self-definitions, or load-bearing self-citations. The linear threshold analysis and resolutions of prior open questions are likewise independent of the target claims. The framework is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The target concept is induced by M repeated applications of a fixed unknown next-token generator.
- domain assumption Feedback is received after each online prediction, either end-to-end or full trajectory.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
unclear: relation between the paper passage and the cited Recognition theorem.
Theorem 2.1: L(F_CoT-M) ≤ L(F) for every base class F and every M; Theorem 2.3: L(F_e2e-M) = O(L(F) log(L(F)M)); taxonomy of well-behaved sub-logarithmic rates r(M) realized by classes with L(F)=1.
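Rendered in display form (my typesetting of the quoted statements; the superscript placement on the induced classes is an assumption, and L(·) denotes the optimal online mistake bound):

```latex
% Display rendering of the quoted theorem statements.
\begin{align*}
  \text{Theorem 2.1:}\quad & L\bigl(\mathcal{F}_{\mathrm{CoT}}^{M}\bigr) \le L(\mathcal{F})
      \quad \text{for every base class } \mathcal{F} \text{ and every } M,\\
  \text{Theorem 2.3:}\quad & L\bigl(\mathcal{F}_{\mathrm{e2e}}^{M}\bigr)
      = O\bigl(L(\mathcal{F})\,\log(L(\mathcal{F})\,M)\bigr).
\end{align*}
```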
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: relation between the paper passage and the cited Recognition theorem.
Littlestone trees, generation trees T_M(x), SOA learner maintaining version space V_t, halving arguments for sub-logarithmic rates.
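The halving-style machinery named above can be illustrated with a toy sketch (my construction, not the paper's): over a finite class of next-token generators, a learner that predicts by majority vote over its version space and, using chain-of-thought feedback, prunes against the full trajectory makes at most log2 of the class size mistakes, for any horizon M.

```python
import itertools

def trajectory(g, x, M):
    """Run the next-token generator g (a dict token -> token) for M steps."""
    traj = [x]
    for _ in range(M):
        traj.append(g[traj[-1]])
    return traj

def halving_cot(cls, target, prompts, M):
    """Halving learner with chain-of-thought feedback.

    Predicts the final token by majority vote over the version space V_t;
    the full trajectory then removes every generator inconsistent with
    any intermediate token. Each mistake at least halves V_t, so the
    mistake count is at most log2(|cls|), independent of M.
    """
    version_space = list(cls)
    mistakes = 0
    for x in prompts:
        true_traj = trajectory(target, x, M)
        votes = [trajectory(g, x, M)[-1] for g in version_space]
        pred = max(set(votes), key=votes.count)  # majority vote
        if pred != true_traj[-1]:
            mistakes += 1
        version_space = [g for g in version_space
                        if trajectory(g, x, M) == true_traj]
    return mistakes

# Toy class: all 4 next-token generators that look only at the last bit.
cls = [dict(zip((0, 1), vals)) for vals in itertools.product((0, 1), repeat=2)]
target = {0: 1, 1: 0}  # the unknown generator: bit flip

# At most log2(4) = 2 mistakes, no matter how large the horizon M is.
for M in (1, 10, 1000):
    assert halving_cot(cls, target, prompts=[0, 1, 0, 1], M=M) <= 2
```

This only illustrates the easy direction (M-independence under full-trajectory feedback); the paper's harder content is the taxonomy and the logarithmic lower bound in the end-to-end model, which need the tree constructions listed above.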
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] [ABCM25] Benny Applebaum, Dung Bui, Geoffroy Couteau, and Nikolas Melissaris. Structured-seed local pseudorandom generators and their applications. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2025), volume 353, pages 63–1. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2025.
- [2] [AML25] Awni Altabaa, Omar Montasser, and John Lafferty. CoT information: Improved sample complexity under chain-of-thought supervision. arXiv preprint arXiv:2505.15927, 2025.
- [3] [BMR+20] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateu… 2020.
- [4] [CM24] Zachary Chase and Idan Mehalel. Deterministic apple tasting. arXiv preprint arXiv:2410.10404, 2024.
- [5] [DGY+25] DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan… URL: https://doi.org/10.48550/arXiv.2501.12948, arXiv:2501.12948, doi:10.48550/ARXIV.2501.12948.
- [6] [DJP+24] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hins… URL: https://doi.org/10.48550/arXiv.2407.21783, arXiv:2407.21783, doi:10.48550/ARXIV.2407.21783.
- [7] [DSBDSS15] Amit Daniely, Sivan Sabato, Shai Ben-David, and Shai Shalev-Shwartz. Multiclass learnability and the ERM principle. J. Mach. Learn. Res., 16(1):2377–2404, 2015.
- [8] Private communication with the authors. [HMM26] Steve Hanneke, Idan Mehalel, and Shay Moran. Sample complexity of autoregressive reasoning: Chain-of-thought vs. end-to-end. arXiv preprint arXiv:2604.12013, doi:10.48550/arXiv.2604.12013.
- [9] [HWS+25] Yu Huang, Zixin Wen, Aarti Singh, Yuejie Chi, and Yuxin Chen. Transformers provably learn chain-of-thought reasoning with length generalization. arXiv preprint arXiv:2511.07378, 2025.
- [10] [JKL+24] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondri… doi:10.48550/arXiv.2412.16720, 2024.
- [11] [JVB+25] Nirmit Joshi, Gal Vardi, Adam Block, Surbhi Goel, Zhiyuan Li, Theodor Misiakiewicz, and Nathan Srebro. A theory of learning with autoregressive chain of thought. arXiv preprint arXiv:2503.07932, 2025.
- [12] [KS24] Juno Kim and Taiji Suzuki. Transformers provably solve parity efficiently with chain of thought. arXiv preprint arXiv:2410.08633, 2024.
- [13] [Lit89] Nick Littlestone. From on-line to batch learning. In Proceedings of the Second Annual Workshop on Computational Learning Theory (COLT 1989), pages 269–284, San Francisco, CA, USA, 1989. Morgan Kaufmann Publishers Inc. URL: https://dl.acm.org/doi/10.5555/93335.93365, doi:10.5555/93335.93365.
- [14] [LLZM24] Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers to solve inherently serial problems. In The Twelfth International Conference on Learning Representations, 2024.
- [15] [Mal23] Eran Malach. Auto-regressive next-token predictors are universal learners. arXiv preprint arXiv:2309.06979, 2023.
- [16] [MS23] William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of thought. arXiv preprint arXiv:2310.07923, 2023.
- [18] Show Your Work: Scratchpads for Intermediate Computation with Language Models. URL: https://arxiv.org/abs/2112.00114, arXiv:2112.00114. [RWC+19] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI, 2019.
- [19] [Ser25] Rocco A. Servedio. The probably approximately correct learning model in computational learning theory. arXiv preprint arXiv:2511.08791, 2025.
- [20] [TMUK25] Nikolaos Tsilivis, Eran Malach, Karen Ullrich, and Julia Kempe. How reinforcement learning after next-token prediction facilitates learning. arXiv preprint arXiv:2510.11495, 2025.
- [21] [VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30, 2017.
- [22] [WBZ+22] Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25–29, 2022.