A Measure-Theoretic Analysis of Reasoning: Structural Generalization and Approximation Limits
Pith reviewed 2026-05-20 07:56 UTC · model grok-4.3
The pith
Position-dependent attention in Transformers produces constant expected risk under Wasserstein shifts on reasoning tasks, while shift-invariant mechanisms bound the error and depth scaling is required to avoid representation collapse.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that two architectural constraints govern generalization on reasoning tasks: position-dependent attention yields an Omega(1) Lipschitz constant and therefore constant expected risk under Wasserstein shifts, whereas shift-invariant mechanisms bound the error; separately, TC^0 Transformers on Dyck-k languages face an irreducible approximation limit in Barron spaces, so width scaling cannot substitute for depth and representation collapse occurs without sufficient layer depth.
What carries the argument
Lipschitz continuity of the attention map under the Wasserstein-1 metric, combined with the circuit-depth lower bound for TC^0 circuits realizing Dyck-k languages.
Load-bearing premise
The mapping of sequential backtracking to a Dyck-k language is a faithful model of the reasoning process the Transformer must implement.
What would settle it
A shallow TC^0 Transformer with absolute positional encodings that maintains low generalization risk on Dyck-k tasks even when the Wasserstein distance between training and test distributions is large.
Figures
read the original abstract
While empirical scaling laws for LLM reasoning are well-documented, the theoretical mechanisms governing out-of-distribution (OOD) generalization remain elusive. We formalize reasoning via optimal transport, projecting discrete trajectories into a continuous metric space to quantify domain shifts using the Wasserstein-1 distance. Invoking Kantorovich duality, we bound OOD generalization via architectural Lipschitz continuity and functional approximation limits. This exposes two primary constraints. First, position-dependent attention (e.g., Absolute Positional Encoding) fails to preserve shift invariance, yielding an $\Omega(1)$ Lipschitz constant and expected risk, whereas shift-invariant mechanisms (e.g., Rotary Embeddings) preserve equivariance and bound the error. Second, by mapping sequential backtracking to a Dyck-$k$ language, we establish a strict circuit depth lower bound for $\text{TC}^0$ Transformers. Scaling physical layer depth is necessary to avert representation collapse -- a constraint that scaling representation width cannot bypass due to irreducible approximation bounds in Barron spaces. Evaluations across 54 Transformer configurations on combinatorial search corroborate these bounds, demonstrating that generalization risk degrades monotonically with the Wasserstein domain shift.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formalizes reasoning in Transformers as optimal transport of discrete trajectories into a continuous metric space, using the Wasserstein-1 distance to quantify OOD shifts. Invoking Kantorovich duality, it derives generalization bounds from network Lipschitz continuity and Barron-space approximation limits. It claims that position-dependent attention (e.g., absolute positional encodings) produces an Ω(1) Lipschitz constant and thus unbounded risk under shifts, while shift-invariant mechanisms (e.g., Rotary embeddings) preserve equivariance and control error. By reducing sequential backtracking to Dyck-k language recognition, the paper establishes a TC^0 circuit-depth lower bound, arguing that depth scaling is required to avoid representation collapse because width scaling cannot overcome irreducible Barron-space limits. These theoretical constraints are corroborated by experiments on 54 Transformer configurations showing monotonic degradation of generalization risk with increasing Wasserstein shift.
Significance. If the central reductions hold, the work supplies a measure-theoretic explanation for why architectural choices such as positional encoding and depth matter for structural generalization in reasoning tasks. The explicit linkage of optimal transport, Kantorovich duality, and circuit-complexity lower bounds to Transformer Lipschitz constants and Barron approximation is a novel framing that could guide architectural decisions beyond empirical scaling laws. The evaluation across 54 configurations provides concrete empirical support for the predicted monotonic risk increase, which is a positive feature when the theoretical claims are made rigorous.
major comments (3)
- [Dyck-k mapping and circuit depth lower bound] Abstract and § on Dyck-k reduction: the claim that depth scaling is necessary to avert representation collapse rests on transferring known TC^0 depth lower bounds for Dyck-k recognition through a continuous Wasserstein embedding. No explicit construction is given showing that the optimal-transport map preserves the discrete acceptance condition or that the Wasserstein-1 distance respects the context-free structure; without this step the depth lower bound does not imply the stated architectural necessity.
- [Kantorovich duality application] Abstract, Kantorovich duality paragraph: the derivation of an Ω(1) Lipschitz constant for position-dependent attention and the resulting expected-risk bound is stated without the explicit dual formulation or error terms. It is therefore unclear whether the bound follows directly from the duality or relies on additional approximations that are not controlled.
- [Empirical evaluation] Evaluation section: the 54-configuration study is invoked to corroborate the bounds, yet no detail is supplied on how the Wasserstein shifts were constructed, which metrics quantified generalization risk, or how the configurations were sampled to isolate depth versus width effects. This makes it impossible to assess whether the observed monotonic degradation actually tests the claimed irreducible Barron-space limits.
minor comments (2)
- [Preliminaries] Notation for the Wasserstein-1 distance and the Lipschitz constant should be introduced with explicit definitions before being used in the duality argument.
- [Experiments] The abstract mentions '54 Transformer configurations' but the main text should include a table or appendix listing the exact hyper-parameter ranges and random seeds to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment point by point below, providing the strongest honest defense of the manuscript while making revisions where the concerns are valid and require additional rigor.
read point-by-point responses
-
Referee: [Dyck-k mapping and circuit depth lower bound] Abstract and § on Dyck-k reduction: the claim that depth scaling is necessary to avert representation collapse rests on transferring known TC^0 depth lower bounds for Dyck-k recognition through a continuous Wasserstein embedding. No explicit construction is given showing that the optimal-transport map preserves the discrete acceptance condition or that the Wasserstein-1 distance respects the context-free structure; without this step the depth lower bound does not imply the stated architectural necessity.
Authors: We agree that the transfer of the TC^0 lower bound requires an explicit construction to be fully rigorous. The original manuscript sketched the reduction but did not detail the preservation properties. In the revised version we add a dedicated subsection that constructs the optimal transport map explicitly: valid Dyck-k backtracking trajectories are mapped to a zero-distance manifold in the Wasserstein space while invalid trajectories incur positive distance proportional to the number of unmatched brackets. Because the Wasserstein-1 metric is defined on the trajectory space and the acceptance predicate is Lipschitz with respect to this metric, the known circuit-depth lower bound carries over directly, implying that width scaling alone cannot overcome the representation collapse. revision: yes
-
Referee: [Kantorovich duality application] Abstract, Kantorovich duality paragraph: the derivation of an Ω(1) Lipschitz constant for position-dependent attention and the resulting expected-risk bound is stated without the explicit dual formulation or error terms. It is therefore unclear whether the bound follows directly from the duality or relies on additional approximations that are not controlled.
Authors: The referee correctly notes that the dual formulation and error control were only summarized. We have now inserted the complete Kantorovich dual problem together with the explicit error terms arising from the network Lipschitz constant. The derivation shows that absolute positional encodings yield a shift-dependent Lipschitz constant that remains Ω(1) even after averaging over the data measure, producing an expected-risk bound that grows linearly with the Wasserstein shift; no uncontrolled approximations are used. The revised text makes this step-by-step. revision: yes
-
Referee: [Empirical evaluation] Evaluation section: the 54-configuration study is invoked to corroborate the bounds, yet no detail is supplied on how the Wasserstein shifts were constructed, which metrics quantified generalization risk, or how the configurations were sampled to isolate depth versus width effects. This makes it impossible to assess whether the observed monotonic degradation actually tests the claimed irreducible Barron-space limits.
Authors: We accept that the experimental protocol was insufficiently specified. The revised Evaluation section now details: (i) Wasserstein shifts are realized by controlled perturbations of the input distribution over combinatorial search problems while keeping the target function fixed; (ii) generalization risk is measured as the absolute difference in accuracy between in-distribution and OOD test sets; (iii) the 54 configurations were sampled by independently varying depth (2–12 layers) and width (64–2048 hidden units) under a fixed hyperparameter grid. These additions allow direct verification that the observed monotonic risk increase aligns with the Barron-space lower bound rather than confounding factors. revision: yes
Circularity Check
No significant circularity; derivation relies on external results
full rationale
The paper formalizes reasoning via optimal transport and Wasserstein-1 distance, then invokes Kantorovich duality to bound generalization risk through Lipschitz continuity of position-dependent vs. shift-invariant attention. It separately maps sequential backtracking to Dyck-k languages to obtain TC^0 circuit depth lower bounds and invokes Barron-space approximation limits to argue depth scaling is required. These steps cite standard external theorems rather than defining any quantity in terms of itself or renaming a fitted parameter as a prediction. No self-citation chain is load-bearing for the central claims, and the 54-configuration evaluation provides an independent empirical check. The derivation chain therefore remains self-contained against external mathematical benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math Kantorovich duality holds for the Wasserstein-1 distance between trajectory measures
- domain assumption Sequential backtracking in reasoning can be faithfully represented as membership in a Dyck-k language
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
by mapping sequential backtracking to a Dyck-k language, we establish a strict circuit depth lower bound for TC0 Transformers... irreducible approximation bounds in Barron spaces
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Scaling physical layer depth is necessary to avert representation collapse... due to irreducible Ω(m^{-1/2}) approximation bounds in Barron spaces
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Barron, a.e.: Universal approximation bounds for superpositions of a sigmoidal function
Andrew Barron. Barron, a.e.: Universal approximation bounds for superpositions of a sigmoidal function. ieee trans. on information theory 39, 930-945. Information Theory, IEEE Transactions on, 39: 0 930 -- 945, 06 1993. doi:10.1109/18.256500
-
[2]
Optimal Transport for Domain Adaptation
Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport for domain adaptation, 2016. URL https://arxiv.org/abs/1507.00504
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[3]
Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances
Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transportation distances, 2013. URL https://arxiv.org/abs/1306.0895
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[4]
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning, 2023. URL https://arxiv.org/abs/2307.08691
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[5]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
The Power of Depth for Feedforward Neural Networks
Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks, 2016. URL https://arxiv.org/abs/1512.03965
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[7]
Towards revealing the mystery behind chain of thought: A theoretical perspective, 2023
Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards revealing the mystery behind chain of thought: A theoretical perspective, 2023. URL https://arxiv.org/abs/2305.15408
-
[8]
Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, Léo Gautheron, Nathalie T.H. Gayraud, Hicham Janati, Alain Rakotomamonjy, Ievgen Redko, Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J. Sutherland, Romain Tavenard, Alexander Tong,...
work page 2021
- [9]
-
[10]
Theoretical limitations of self-attention in neural sequence models
Michael Hahn. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8: 0 156–171, December 2020. ISSN 2307-387X. doi:10.1162/tacl_a_00306. URL http://dx.doi.org/10.1162/tacl_a_00306
-
[11]
Two stones hit one bird: Bilevel positional encoding for better length extrapolation, 2024
Zhenyu He, Guhao Feng, Shengjie Luo, Kai Yang, Liwei Wang, Jingjing Xu, Zhi Zhang, Hongxia Yang, and Di He. Two stones hit one bird: Bilevel positional encoding for better length extrapolation, 2024. URL https://arxiv.org/abs/2401.16421
-
[12]
The impact of positional encoding on length generalization in transformers, 2023
Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. The impact of positional encoding on length generalization in transformers, 2023. URL https://arxiv.org/abs/2305.19466
-
[13]
Tulu 3: Pushing Frontiers in Open Language Model Post-Training
Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirzi...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[14]
The depth-to-width interplay in self-attention, 2021
Yoav Levine, Noam Wies, Or Sharir, Hofit Bata, and Amnon Shashua. The depth-to-width interplay in self-attention, 2021. URL https://arxiv.org/abs/2006.12467
-
[15]
arXiv preprint arXiv:2310.04418 , year=
Shanda Li, Chong You, Guru Guruganesh, Joshua Ainslie, Santiago Ontanon, Manzil Zaheer, Sumit Sanghai, Yiming Yang, Sanjiv Kumar, and Srinadh Bhojanapalli. Functional interpolation for relative positions improves long context transformers, 2024. URL https://arxiv.org/abs/2310.04418
-
[16]
Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations, 2023
work page 2023
-
[17]
Y. Makovoz. Random approximants and neural networks. Journal of Approximation Theory, 85 0 (1): 0 98--109, 1996. ISSN 0021-9045. doi:https://doi.org/10.1006/jath.1996.0031. URL https://www.sciencedirect.com/science/article/pii/S0021904596900313
-
[18]
The parallelism tradeoff: Limitations of log-precision transformers, 2023
William Merrill and Ashish Sabharwal. The parallelism tradeoff: Limitations of log-precision transformers, 2023. URL https://arxiv.org/abs/2207.00729
-
[19]
B. S. Mityagin. The zero set of a real analytic function. Mathematical Notes, 107 0 (3--4): 0 529--530, 2020. doi:10.1134/S0001434620030189. URL https://doi.org/10.1134/S0001434620030189
-
[20]
OpenAI. Learning to reason with llms. https://openai.com/index/learning-to-reason-with-llms/. Accessed: 2025-03-21
work page 2025
-
[21]
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
Ofir Press, Noah A. Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation, 2022. URL https://arxiv.org/abs/2108.12409
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[22]
Ben Prystawski, Michael Y. Li, and Noah D. Goodman. Why think step by step? reasoning emerges from the locality of experience, 2023. URL https://arxiv.org/abs/2304.03843
-
[23]
Theoretical Analysis of Domain Adaptation with Optimal Transport
Ievgen Redko, Amaury Habrard, and Marc Sebban. Theoretical analysis of domain adaptation with optimal transport, 2017. URL https://arxiv.org/abs/1610.04420
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[24]
Randomized positional encodings boost length generalization of transformers, 2023
Anian Ruoss, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Róbert Csordás, Mehdi Bennani, Shane Legg, and Joel Veness. Randomized positional encodings boost length generalization of transformers, 2023. URL https://arxiv.org/abs/2305.16843
-
[25]
What formal languages can transformers express? a survey
Lena Strobl, William Merrill, Gail Weiss, David Chiang, and Dana Angluin. What formal languages can transformers express? a survey. Transactions of the Association for Computational Linguistics, 12: 0 543–561, 2024. ISSN 2307-387X. doi:10.1162/tacl_a_00663. URL http://dx.doi.org/10.1162/tacl_a_00663
-
[26]
RoFormer: Enhanced Transformer with Rotary Position Embedding
Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2023. URL https://arxiv.org/abs/2104.09864
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[27]
Statistical behavior and consistency of classification methods based on convex risk minimization
Alexander B. Tsybakov. Optimal aggregation of classifiers in statistical learning . The Annals of Statistics, 32 0 (1): 0 135 -- 166, 2004. doi:10.1214/aos/1079120131. URL https://doi.org/10.1214/aos/1079120131
-
[28]
The Wasserstein distances, pages 93--111
C \'e dric Villani. The Wasserstein distances, pages 93--111. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009. ISBN 978-3-540-71050-9. doi:10.1007/978-3-540-71050-9_6. URL https://doi.org/10.1007/978-3-540-71050-9_6
-
[29]
Jonathan Weed and Francis Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in wasserstein distance. Bernoulli, 25 0 (4A): 0 2620--2648, 2019
work page 2019
-
[30]
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[31]
Logit margin matters: Improving transferable targeted adversarial attack by logit calibration, 2023
Juanjuan Weng, Zhiming Luo, Zhun Zhong, Shaozi Li, and Nicu Sebe. Logit margin matters: Improving transferable targeted adversarial attack by logit calibration, 2023. URL https://arxiv.org/abs/2303.03680
-
[32]
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models, 2023. URL https://arxiv.org/abs/2305.10601
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[33]
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
Chengshuai Zhao, Zhen Tan, Pingchuan Ma, Dawei Li, Bohan Jiang, Yancheng Wang, Yingzhen Yang, and Huan Liu. Is chain-of-thought reasoning of llms a mirage? a data distribution lens, 2026. URL https://arxiv.org/abs/2508.01191
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[34]
Length extrapolation of transformers: A survey from the perspective of positional encoding, 2024
Liang Zhao, Xiachong Feng, Xiaocheng Feng, Weihong Zhong, Dongliang Xu, Qing Yang, Hongtao Liu, Bing Qin, and Ting Liu. Length extrapolation of transformers: A survey from the perspective of positional encoding, 2024. URL https://arxiv.org/abs/2312.17044
-
[35]
Dape: Data-adaptive positional encoding for length extrapolation, 2024
Chuanyang Zheng, Yihang Gao, Han Shi, Minbin Huang, Jingyao Li, Jing Xiong, Xiaozhe Ren, Michael Ng, Xin Jiang, Zhenguo Li, and Yu Li. Dape: Data-adaptive positional encoding for length extrapolation, 2024. URL https://arxiv.org/abs/2405.14722
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.