A Measure-Theoretic Analysis of Reasoning: Structural Generalization and Approximation Limits

Xiaoyin Chen; Xuehai Zhou; Yifu Zhang; Yuyang Zhang

arxiv: 2605.19944 · v1 · pith:PWT7IR4Anew · submitted 2026-05-19 · 💻 cs.LG · cs.AI· cs.CC· cs.CL

A Measure-Theoretic Analysis of Reasoning: Structural Generalization and Approximation Limits

Yuyang Zhang , Yifu Zhang , Xuehai Zhou , Xiaoyin Chen This is my paper

Pith reviewed 2026-05-20 07:56 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CCcs.CL

keywords TransformersOut-of-distribution generalizationWasserstein distanceLipschitz continuityDyck languagesCircuit complexityReasoningBarron spaces

0 comments

The pith

Position-dependent attention in Transformers produces constant expected risk under Wasserstein shifts on reasoning tasks, while shift-invariant mechanisms bound the error and depth scaling is required to avoid representation collapse.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper uses optimal transport to project reasoning trajectories into a metric space and applies Kantorovich duality to bound out-of-distribution generalization by the network's Lipschitz constant and its approximation power. It shows that absolute positional encodings destroy shift invariance and therefore fix an Omega(1) Lipschitz constant, while rotary embeddings preserve equivariance and keep the bound finite. Mapping sequential backtracking to Dyck-k languages yields a circuit-depth lower bound for TC^0 Transformers; because approximation error in Barron spaces cannot be removed by width alone, physical depth must increase to prevent collapse. Experiments across 54 configurations confirm that risk rises monotonically with the measured Wasserstein domain shift.

Core claim

The central claim is that two architectural constraints govern generalization on reasoning tasks: position-dependent attention yields an Omega(1) Lipschitz constant and therefore constant expected risk under Wasserstein shifts, whereas shift-invariant mechanisms bound the error; separately, TC^0 Transformers on Dyck-k languages face an irreducible approximation limit in Barron spaces, so width scaling cannot substitute for depth and representation collapse occurs without sufficient layer depth.

What carries the argument

Lipschitz continuity of the attention map under the Wasserstein-1 metric, combined with the circuit-depth lower bound for TC^0 circuits realizing Dyck-k languages.

Load-bearing premise

The mapping of sequential backtracking to a Dyck-k language is a faithful model of the reasoning process the Transformer must implement.

What would settle it

A shallow TC^0 Transformer with absolute positional encodings that maintains low generalization risk on Dyck-k tasks even when the Wasserstein distance between training and test distributions is large.

Figures

Figures reproduced from arXiv: 2605.19944 by Xiaoyin Chen, Xuehai Zhou, Yifu Zhang, Yuyang Zhang.

**Figure 2.** Figure 2: Empirical verification of structural regularization via translation equivariance on [PITH_FULL_IMAGE:figures/full_fig_p026_2.png] view at source ↗

**Figure 3.** Figure 3: Comprehensive baseline performance and structural variance across the experimental grid. [PITH_FULL_IMAGE:figures/full_fig_p027_3.png] view at source ↗

read the original abstract

While empirical scaling laws for LLM reasoning are well-documented, the theoretical mechanisms governing out-of-distribution (OOD) generalization remain elusive. We formalize reasoning via optimal transport, projecting discrete trajectories into a continuous metric space to quantify domain shifts using the Wasserstein-1 distance. Invoking Kantorovich duality, we bound OOD generalization via architectural Lipschitz continuity and functional approximation limits. This exposes two primary constraints. First, position-dependent attention (e.g., Absolute Positional Encoding) fails to preserve shift invariance, yielding an $\Omega(1)$ Lipschitz constant and expected risk, whereas shift-invariant mechanisms (e.g., Rotary Embeddings) preserve equivariance and bound the error. Second, by mapping sequential backtracking to a Dyck-$k$ language, we establish a strict circuit depth lower bound for $\text{TC}^0$ Transformers. Scaling physical layer depth is necessary to avert representation collapse -- a constraint that scaling representation width cannot bypass due to irreducible approximation bounds in Barron spaces. Evaluations across 54 Transformer configurations on combinatorial search corroborate these bounds, demonstrating that generalization risk degrades monotonically with the Wasserstein domain shift.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper frames reasoning generalization through optimal transport and Wasserstein shifts, then ties architectural Lipschitz constants and Dyck-k circuit depth to OOD risk, but the transfer from discrete backtracking to continuous bounds is the part that needs the most checking.

read the letter

The core move here is to treat reasoning trajectories as points in a metric space and measure domain shifts with Wasserstein-1 distance. From there the authors use Kantorovich duality to link the shift size to the network's Lipschitz constant, which immediately separates absolute positional encodings (constant factor, bad) from rotary-style shift-invariant ones (bounded error). They then reduce sequential backtracking to Dyck-k recognition to import a TC^0 depth lower bound and conclude that width scaling cannot overcome the Barron-space approximation limit, so depth must grow. That combination of optimal transport with explicit positional-encoding Lipschitz arguments and a circuit-depth claim is the genuinely new piece; most prior work either stays empirical or uses different complexity tools.

Referee Report

3 major / 2 minor

Summary. The paper formalizes reasoning in Transformers as optimal transport of discrete trajectories into a continuous metric space, using the Wasserstein-1 distance to quantify OOD shifts. Invoking Kantorovich duality, it derives generalization bounds from network Lipschitz continuity and Barron-space approximation limits. It claims that position-dependent attention (e.g., absolute positional encodings) produces an Ω(1) Lipschitz constant and thus unbounded risk under shifts, while shift-invariant mechanisms (e.g., Rotary embeddings) preserve equivariance and control error. By reducing sequential backtracking to Dyck-k language recognition, the paper establishes a TC^0 circuit-depth lower bound, arguing that depth scaling is required to avoid representation collapse because width scaling cannot overcome irreducible Barron-space limits. These theoretical constraints are corroborated by experiments on 54 Transformer configurations showing monotonic degradation of generalization risk with increasing Wasserstein shift.

Significance. If the central reductions hold, the work supplies a measure-theoretic explanation for why architectural choices such as positional encoding and depth matter for structural generalization in reasoning tasks. The explicit linkage of optimal transport, Kantorovich duality, and circuit-complexity lower bounds to Transformer Lipschitz constants and Barron approximation is a novel framing that could guide architectural decisions beyond empirical scaling laws. The evaluation across 54 configurations provides concrete empirical support for the predicted monotonic risk increase, which is a positive feature when the theoretical claims are made rigorous.

major comments (3)

[Dyck-k mapping and circuit depth lower bound] Abstract and § on Dyck-k reduction: the claim that depth scaling is necessary to avert representation collapse rests on transferring known TC^0 depth lower bounds for Dyck-k recognition through a continuous Wasserstein embedding. No explicit construction is given showing that the optimal-transport map preserves the discrete acceptance condition or that the Wasserstein-1 distance respects the context-free structure; without this step the depth lower bound does not imply the stated architectural necessity.
[Kantorovich duality application] Abstract, Kantorovich duality paragraph: the derivation of an Ω(1) Lipschitz constant for position-dependent attention and the resulting expected-risk bound is stated without the explicit dual formulation or error terms. It is therefore unclear whether the bound follows directly from the duality or relies on additional approximations that are not controlled.
[Empirical evaluation] Evaluation section: the 54-configuration study is invoked to corroborate the bounds, yet no detail is supplied on how the Wasserstein shifts were constructed, which metrics quantified generalization risk, or how the configurations were sampled to isolate depth versus width effects. This makes it impossible to assess whether the observed monotonic degradation actually tests the claimed irreducible Barron-space limits.

minor comments (2)

[Preliminaries] Notation for the Wasserstein-1 distance and the Lipschitz constant should be introduced with explicit definitions before being used in the duality argument.
[Experiments] The abstract mentions '54 Transformer configurations' but the main text should include a table or appendix listing the exact hyper-parameter ranges and random seeds to support reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment point by point below, providing the strongest honest defense of the manuscript while making revisions where the concerns are valid and require additional rigor.

read point-by-point responses

Referee: [Dyck-k mapping and circuit depth lower bound] Abstract and § on Dyck-k reduction: the claim that depth scaling is necessary to avert representation collapse rests on transferring known TC^0 depth lower bounds for Dyck-k recognition through a continuous Wasserstein embedding. No explicit construction is given showing that the optimal-transport map preserves the discrete acceptance condition or that the Wasserstein-1 distance respects the context-free structure; without this step the depth lower bound does not imply the stated architectural necessity.

Authors: We agree that the transfer of the TC^0 lower bound requires an explicit construction to be fully rigorous. The original manuscript sketched the reduction but did not detail the preservation properties. In the revised version we add a dedicated subsection that constructs the optimal transport map explicitly: valid Dyck-k backtracking trajectories are mapped to a zero-distance manifold in the Wasserstein space while invalid trajectories incur positive distance proportional to the number of unmatched brackets. Because the Wasserstein-1 metric is defined on the trajectory space and the acceptance predicate is Lipschitz with respect to this metric, the known circuit-depth lower bound carries over directly, implying that width scaling alone cannot overcome the representation collapse. revision: yes
Referee: [Kantorovich duality application] Abstract, Kantorovich duality paragraph: the derivation of an Ω(1) Lipschitz constant for position-dependent attention and the resulting expected-risk bound is stated without the explicit dual formulation or error terms. It is therefore unclear whether the bound follows directly from the duality or relies on additional approximations that are not controlled.

Authors: The referee correctly notes that the dual formulation and error control were only summarized. We have now inserted the complete Kantorovich dual problem together with the explicit error terms arising from the network Lipschitz constant. The derivation shows that absolute positional encodings yield a shift-dependent Lipschitz constant that remains Ω(1) even after averaging over the data measure, producing an expected-risk bound that grows linearly with the Wasserstein shift; no uncontrolled approximations are used. The revised text makes this step-by-step. revision: yes
Referee: [Empirical evaluation] Evaluation section: the 54-configuration study is invoked to corroborate the bounds, yet no detail is supplied on how the Wasserstein shifts were constructed, which metrics quantified generalization risk, or how the configurations were sampled to isolate depth versus width effects. This makes it impossible to assess whether the observed monotonic degradation actually tests the claimed irreducible Barron-space limits.

Authors: We accept that the experimental protocol was insufficiently specified. The revised Evaluation section now details: (i) Wasserstein shifts are realized by controlled perturbations of the input distribution over combinatorial search problems while keeping the target function fixed; (ii) generalization risk is measured as the absolute difference in accuracy between in-distribution and OOD test sets; (iii) the 54 configurations were sampled by independently varying depth (2–12 layers) and width (64–2048 hidden units) under a fixed hyperparameter grid. These additions allow direct verification that the observed monotonic risk increase aligns with the Barron-space lower bound rather than confounding factors. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external results

full rationale

The paper formalizes reasoning via optimal transport and Wasserstein-1 distance, then invokes Kantorovich duality to bound generalization risk through Lipschitz continuity of position-dependent vs. shift-invariant attention. It separately maps sequential backtracking to Dyck-k languages to obtain TC^0 circuit depth lower bounds and invokes Barron-space approximation limits to argue depth scaling is required. These steps cite standard external theorems rather than defining any quantity in terms of itself or renaming a fitted parameter as a prediction. No self-citation chain is load-bearing for the central claims, and the 54-configuration evaluation provides an independent empirical check. The derivation chain therefore remains self-contained against external mathematical benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard results from optimal transport and circuit complexity plus the modeling choice that backtracking equals Dyck-k recognition; no new free parameters or invented entities are introduced in the abstract.

axioms (2)

standard math Kantorovich duality holds for the Wasserstein-1 distance between trajectory measures
Invoked to bound generalization error via Lipschitz constants
domain assumption Sequential backtracking in reasoning can be faithfully represented as membership in a Dyck-k language
Used to establish the TC^0 circuit depth lower bound

pith-pipeline@v0.9.0 · 5738 in / 1405 out tokens · 37592 ms · 2026-05-20T07:56:17.521939+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

by mapping sequential backtracking to a Dyck-k language, we establish a strict circuit depth lower bound for TC0 Transformers... irreducible approximation bounds in Barron spaces
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Scaling physical layer depth is necessary to avert representation collapse... due to irreducible Ω(m^{-1/2}) approximation bounds in Barron spaces

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 12 internal anchors

[1]

Barron, a.e.: Universal approximation bounds for superpositions of a sigmoidal function

Andrew Barron. Barron, a.e.: Universal approximation bounds for superpositions of a sigmoidal function. ieee trans. on information theory 39, 930-945. Information Theory, IEEE Transactions on, 39: 0 930 -- 945, 06 1993. doi:10.1109/18.256500

work page doi:10.1109/18.256500 1993
[2]

Optimal Transport for Domain Adaptation

Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport for domain adaptation, 2016. URL https://arxiv.org/abs/1507.00504

work page internal anchor Pith review Pith/arXiv arXiv 2016
[3]

Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances

Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transportation distances, 2013. URL https://arxiv.org/abs/1306.0895

work page internal anchor Pith review Pith/arXiv arXiv 2013
[4]

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning, 2023. URL https://arxiv.org/abs/2307.08691

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

The Power of Depth for Feedforward Neural Networks

Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks, 2016. URL https://arxiv.org/abs/1512.03965

work page internal anchor Pith review Pith/arXiv arXiv 2016
[7]

Towards revealing the mystery behind chain of thought: A theoretical perspective, 2023

Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards revealing the mystery behind chain of thought: A theoretical perspective, 2023. URL https://arxiv.org/abs/2305.15408

work page arXiv 2023
[8]

Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, Léo Gautheron, Nathalie T.H

Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, Léo Gautheron, Nathalie T.H. Gayraud, Hicham Janati, Alain Rakotomamonjy, Ievgen Redko, Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J. Sutherland, Romain Tavenard, Alexander Tong,...

work page 2021
[9]

Kanishk Gandhi, Denise Lee, Gabriel Grand, Muxin Liu, Winson Cheng, Archit Sharma, and Noah D. Goodman. Stream of search (sos): Learning to search in language, 2024. URL https://arxiv.org/abs/2404.03683

work page arXiv 2024
[10]

Theoretical limitations of self-attention in neural sequence models

Michael Hahn. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8: 0 156–171, December 2020. ISSN 2307-387X. doi:10.1162/tacl_a_00306. URL http://dx.doi.org/10.1162/tacl_a_00306

work page doi:10.1162/tacl_a_00306 2020
[11]

Two stones hit one bird: Bilevel positional encoding for better length extrapolation, 2024

Zhenyu He, Guhao Feng, Shengjie Luo, Kai Yang, Liwei Wang, Jingjing Xu, Zhi Zhang, Hongxia Yang, and Di He. Two stones hit one bird: Bilevel positional encoding for better length extrapolation, 2024. URL https://arxiv.org/abs/2401.16421

work page arXiv 2024
[12]

The impact of positional encoding on length generalization in transformers, 2023

Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. The impact of positional encoding on length generalization in transformers, 2023. URL https://arxiv.org/abs/2305.19466

work page arXiv 2023
[13]

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirzi...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

The depth-to-width interplay in self-attention, 2021

Yoav Levine, Noam Wies, Or Sharir, Hofit Bata, and Amnon Shashua. The depth-to-width interplay in self-attention, 2021. URL https://arxiv.org/abs/2006.12467

work page arXiv 2021
[15]

arXiv preprint arXiv:2310.04418 , year=

Shanda Li, Chong You, Guru Guruganesh, Joshua Ainslie, Santiago Ontanon, Manzil Zaheer, Sumit Sanghai, Yiming Yang, Sanjiv Kumar, and Srinadh Bhojanapalli. Functional interpolation for relative positions improves long context transformers, 2024. URL https://arxiv.org/abs/2310.04418

work page arXiv 2024
[16]

Let's verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations, 2023

work page 2023
[17]

Y. Makovoz. Random approximants and neural networks. Journal of Approximation Theory, 85 0 (1): 0 98--109, 1996. ISSN 0021-9045. doi:https://doi.org/10.1006/jath.1996.0031. URL https://www.sciencedirect.com/science/article/pii/S0021904596900313

work page doi:10.1006/jath.1996.0031 1996
[18]

The parallelism tradeoff: Limitations of log-precision transformers, 2023

William Merrill and Ashish Sabharwal. The parallelism tradeoff: Limitations of log-precision transformers, 2023. URL https://arxiv.org/abs/2207.00729

work page arXiv 2023
[19]

B. S. Mityagin. The zero set of a real analytic function. Mathematical Notes, 107 0 (3--4): 0 529--530, 2020. doi:10.1134/S0001434620030189. URL https://doi.org/10.1134/S0001434620030189

work page doi:10.1134/s0001434620030189 2020
[20]

Learning to reason with llms

OpenAI. Learning to reason with llms. https://openai.com/index/learning-to-reason-with-llms/. Accessed: 2025-03-21

work page 2025
[21]

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

Ofir Press, Noah A. Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation, 2022. URL https://arxiv.org/abs/2108.12409

work page internal anchor Pith review Pith/arXiv arXiv 2022
[22]

Li, and Noah D

Ben Prystawski, Michael Y. Li, and Noah D. Goodman. Why think step by step? reasoning emerges from the locality of experience, 2023. URL https://arxiv.org/abs/2304.03843

work page arXiv 2023
[23]

Theoretical Analysis of Domain Adaptation with Optimal Transport

Ievgen Redko, Amaury Habrard, and Marc Sebban. Theoretical analysis of domain adaptation with optimal transport, 2017. URL https://arxiv.org/abs/1610.04420

work page internal anchor Pith review Pith/arXiv arXiv 2017
[24]

Randomized positional encodings boost length generalization of transformers, 2023

Anian Ruoss, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Róbert Csordás, Mehdi Bennani, Shane Legg, and Joel Veness. Randomized positional encodings boost length generalization of transformers, 2023. URL https://arxiv.org/abs/2305.16843

work page arXiv 2023
[25]

What formal languages can transformers express? a survey

Lena Strobl, William Merrill, Gail Weiss, David Chiang, and Dana Angluin. What formal languages can transformers express? a survey. Transactions of the Association for Computational Linguistics, 12: 0 543–561, 2024. ISSN 2307-387X. doi:10.1162/tacl_a_00663. URL http://dx.doi.org/10.1162/tacl_a_00663

work page doi:10.1162/tacl_a_00663 2024
[26]

RoFormer: Enhanced Transformer with Rotary Position Embedding

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2023. URL https://arxiv.org/abs/2104.09864

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

Statistical behavior and consistency of classification methods based on convex risk minimization

Alexander B. Tsybakov. Optimal aggregation of classifiers in statistical learning . The Annals of Statistics, 32 0 (1): 0 135 -- 166, 2004. doi:10.1214/aos/1079120131. URL https://doi.org/10.1214/aos/1079120131

work page doi:10.1214/aos/1079120131 2004
[28]

The Wasserstein distances, pages 93--111

C \'e dric Villani. The Wasserstein distances, pages 93--111. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009. ISBN 978-3-540-71050-9. doi:10.1007/978-3-540-71050-9_6. URL https://doi.org/10.1007/978-3-540-71050-9_6

work page doi:10.1007/978-3-540-71050-9_6 2009
[29]

Sharp asymptotic and finite-sample rates of convergence of empirical measures in wasserstein distance

Jonathan Weed and Francis Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in wasserstein distance. Bernoulli, 25 0 (4A): 0 2620--2648, 2019

work page 2019
[30]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903

work page internal anchor Pith review Pith/arXiv arXiv 2023
[31]

Logit margin matters: Improving transferable targeted adversarial attack by logit calibration, 2023

Juanjuan Weng, Zhiming Luo, Zhun Zhong, Shaozi Li, and Nicu Sebe. Logit margin matters: Improving transferable targeted adversarial attack by logit calibration, 2023. URL https://arxiv.org/abs/2303.03680

work page arXiv 2023
[32]

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models, 2023. URL https://arxiv.org/abs/2305.10601

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens

Chengshuai Zhao, Zhen Tan, Pingchuan Ma, Dawei Li, Bohan Jiang, Yancheng Wang, Yingzhen Yang, and Huan Liu. Is chain-of-thought reasoning of llms a mirage? a data distribution lens, 2026. URL https://arxiv.org/abs/2508.01191

work page internal anchor Pith review Pith/arXiv arXiv 2026
[34]

Length extrapolation of transformers: A survey from the perspective of positional encoding, 2024

Liang Zhao, Xiachong Feng, Xiaocheng Feng, Weihong Zhong, Dongliang Xu, Qing Yang, Hongtao Liu, Bing Qin, and Ting Liu. Length extrapolation of transformers: A survey from the perspective of positional encoding, 2024. URL https://arxiv.org/abs/2312.17044

work page arXiv 2024
[35]

Dape: Data-adaptive positional encoding for length extrapolation, 2024

Chuanyang Zheng, Yihang Gao, Han Shi, Minbin Huang, Jingyao Li, Jing Xiong, Xiaozhe Ren, Michael Ng, Xin Jiang, Zhenguo Li, and Yu Li. Dape: Data-adaptive positional encoding for length extrapolation, 2024. URL https://arxiv.org/abs/2405.14722

work page arXiv 2024

[1] [1]

Barron, a.e.: Universal approximation bounds for superpositions of a sigmoidal function

Andrew Barron. Barron, a.e.: Universal approximation bounds for superpositions of a sigmoidal function. ieee trans. on information theory 39, 930-945. Information Theory, IEEE Transactions on, 39: 0 930 -- 945, 06 1993. doi:10.1109/18.256500

work page doi:10.1109/18.256500 1993

[2] [2]

Optimal Transport for Domain Adaptation

Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport for domain adaptation, 2016. URL https://arxiv.org/abs/1507.00504

work page internal anchor Pith review Pith/arXiv arXiv 2016

[3] [3]

Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances

Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transportation distances, 2013. URL https://arxiv.org/abs/1306.0895

work page internal anchor Pith review Pith/arXiv arXiv 2013

[4] [4]

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning, 2023. URL https://arxiv.org/abs/2307.08691

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

The Power of Depth for Feedforward Neural Networks

Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks, 2016. URL https://arxiv.org/abs/1512.03965

work page internal anchor Pith review Pith/arXiv arXiv 2016

[7] [7]

Towards revealing the mystery behind chain of thought: A theoretical perspective, 2023

Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards revealing the mystery behind chain of thought: A theoretical perspective, 2023. URL https://arxiv.org/abs/2305.15408

work page arXiv 2023

[8] [8]

Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, Léo Gautheron, Nathalie T.H

Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, Léo Gautheron, Nathalie T.H. Gayraud, Hicham Janati, Alain Rakotomamonjy, Ievgen Redko, Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J. Sutherland, Romain Tavenard, Alexander Tong,...

work page 2021

[9] [9]

Kanishk Gandhi, Denise Lee, Gabriel Grand, Muxin Liu, Winson Cheng, Archit Sharma, and Noah D. Goodman. Stream of search (sos): Learning to search in language, 2024. URL https://arxiv.org/abs/2404.03683

work page arXiv 2024

[10] [10]

Theoretical limitations of self-attention in neural sequence models

Michael Hahn. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8: 0 156–171, December 2020. ISSN 2307-387X. doi:10.1162/tacl_a_00306. URL http://dx.doi.org/10.1162/tacl_a_00306

work page doi:10.1162/tacl_a_00306 2020

[11] [11]

Two stones hit one bird: Bilevel positional encoding for better length extrapolation, 2024

Zhenyu He, Guhao Feng, Shengjie Luo, Kai Yang, Liwei Wang, Jingjing Xu, Zhi Zhang, Hongxia Yang, and Di He. Two stones hit one bird: Bilevel positional encoding for better length extrapolation, 2024. URL https://arxiv.org/abs/2401.16421

work page arXiv 2024

[12] [12]

The impact of positional encoding on length generalization in transformers, 2023

Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. The impact of positional encoding on length generalization in transformers, 2023. URL https://arxiv.org/abs/2305.19466

work page arXiv 2023

[13] [13]

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirzi...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

The depth-to-width interplay in self-attention, 2021

Yoav Levine, Noam Wies, Or Sharir, Hofit Bata, and Amnon Shashua. The depth-to-width interplay in self-attention, 2021. URL https://arxiv.org/abs/2006.12467

work page arXiv 2021

[15] [15]

arXiv preprint arXiv:2310.04418 , year=

Shanda Li, Chong You, Guru Guruganesh, Joshua Ainslie, Santiago Ontanon, Manzil Zaheer, Sumit Sanghai, Yiming Yang, Sanjiv Kumar, and Srinadh Bhojanapalli. Functional interpolation for relative positions improves long context transformers, 2024. URL https://arxiv.org/abs/2310.04418

work page arXiv 2024

[16] [16]

Let's verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations, 2023

work page 2023

[17] [17]

Y. Makovoz. Random approximants and neural networks. Journal of Approximation Theory, 85 0 (1): 0 98--109, 1996. ISSN 0021-9045. doi:https://doi.org/10.1006/jath.1996.0031. URL https://www.sciencedirect.com/science/article/pii/S0021904596900313

work page doi:10.1006/jath.1996.0031 1996

[18] [18]

The parallelism tradeoff: Limitations of log-precision transformers, 2023

William Merrill and Ashish Sabharwal. The parallelism tradeoff: Limitations of log-precision transformers, 2023. URL https://arxiv.org/abs/2207.00729

work page arXiv 2023

[19] [19]

B. S. Mityagin. The zero set of a real analytic function. Mathematical Notes, 107 0 (3--4): 0 529--530, 2020. doi:10.1134/S0001434620030189. URL https://doi.org/10.1134/S0001434620030189

work page doi:10.1134/s0001434620030189 2020

[20] [20]

Learning to reason with llms

OpenAI. Learning to reason with llms. https://openai.com/index/learning-to-reason-with-llms/. Accessed: 2025-03-21

work page 2025

[21] [21]

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

Ofir Press, Noah A. Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation, 2022. URL https://arxiv.org/abs/2108.12409

work page internal anchor Pith review Pith/arXiv arXiv 2022

[22] [22]

Li, and Noah D

Ben Prystawski, Michael Y. Li, and Noah D. Goodman. Why think step by step? reasoning emerges from the locality of experience, 2023. URL https://arxiv.org/abs/2304.03843

work page arXiv 2023

[23] [23]

Theoretical Analysis of Domain Adaptation with Optimal Transport

Ievgen Redko, Amaury Habrard, and Marc Sebban. Theoretical analysis of domain adaptation with optimal transport, 2017. URL https://arxiv.org/abs/1610.04420

work page internal anchor Pith review Pith/arXiv arXiv 2017

[24] [24]

Randomized positional encodings boost length generalization of transformers, 2023

Anian Ruoss, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Róbert Csordás, Mehdi Bennani, Shane Legg, and Joel Veness. Randomized positional encodings boost length generalization of transformers, 2023. URL https://arxiv.org/abs/2305.16843

work page arXiv 2023

[25] [25]

What formal languages can transformers express? a survey

Lena Strobl, William Merrill, Gail Weiss, David Chiang, and Dana Angluin. What formal languages can transformers express? a survey. Transactions of the Association for Computational Linguistics, 12: 0 543–561, 2024. ISSN 2307-387X. doi:10.1162/tacl_a_00663. URL http://dx.doi.org/10.1162/tacl_a_00663

work page doi:10.1162/tacl_a_00663 2024

[26] [26]

RoFormer: Enhanced Transformer with Rotary Position Embedding

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2023. URL https://arxiv.org/abs/2104.09864

work page internal anchor Pith review Pith/arXiv arXiv 2023

[27] [27]

Statistical behavior and consistency of classification methods based on convex risk minimization

Alexander B. Tsybakov. Optimal aggregation of classifiers in statistical learning . The Annals of Statistics, 32 0 (1): 0 135 -- 166, 2004. doi:10.1214/aos/1079120131. URL https://doi.org/10.1214/aos/1079120131

work page doi:10.1214/aos/1079120131 2004

[28] [28]

The Wasserstein distances, pages 93--111

C \'e dric Villani. The Wasserstein distances, pages 93--111. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009. ISBN 978-3-540-71050-9. doi:10.1007/978-3-540-71050-9_6. URL https://doi.org/10.1007/978-3-540-71050-9_6

work page doi:10.1007/978-3-540-71050-9_6 2009

[29] [29]

Sharp asymptotic and finite-sample rates of convergence of empirical measures in wasserstein distance

Jonathan Weed and Francis Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in wasserstein distance. Bernoulli, 25 0 (4A): 0 2620--2648, 2019

work page 2019

[30] [30]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903

work page internal anchor Pith review Pith/arXiv arXiv 2023

[31] [31]

Logit margin matters: Improving transferable targeted adversarial attack by logit calibration, 2023

Juanjuan Weng, Zhiming Luo, Zhun Zhong, Shaozi Li, and Nicu Sebe. Logit margin matters: Improving transferable targeted adversarial attack by logit calibration, 2023. URL https://arxiv.org/abs/2303.03680

work page arXiv 2023

[32] [32]

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models, 2023. URL https://arxiv.org/abs/2305.10601

work page internal anchor Pith review Pith/arXiv arXiv 2023

[33] [33]

Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens

Chengshuai Zhao, Zhen Tan, Pingchuan Ma, Dawei Li, Bohan Jiang, Yancheng Wang, Yingzhen Yang, and Huan Liu. Is chain-of-thought reasoning of llms a mirage? a data distribution lens, 2026. URL https://arxiv.org/abs/2508.01191

work page internal anchor Pith review Pith/arXiv arXiv 2026

[34] [34]

Length extrapolation of transformers: A survey from the perspective of positional encoding, 2024

Liang Zhao, Xiachong Feng, Xiaocheng Feng, Weihong Zhong, Dongliang Xu, Qing Yang, Hongtao Liu, Bing Qin, and Ting Liu. Length extrapolation of transformers: A survey from the perspective of positional encoding, 2024. URL https://arxiv.org/abs/2312.17044

work page arXiv 2024

[35] [35]

Dape: Data-adaptive positional encoding for length extrapolation, 2024

Chuanyang Zheng, Yihang Gao, Han Shi, Minbin Huang, Jingyao Li, Jing Xiong, Xiaozhe Ren, Michael Ng, Xin Jiang, Zhenguo Li, and Yu Li. Dape: Data-adaptive positional encoding for length extrapolation, 2024. URL https://arxiv.org/abs/2405.14722

work page arXiv 2024