Recognition: no theorem link
On the Expressive Power of Contextual Relations in Transformers
Pith reviewed 2026-05-14 23:58 UTC · model grok-4.3
The pith
Transformers can approximate any contextual relation by treating it as a probability distribution or coupling.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Contextual relations are modeled as probabilistic objects, either conditional distributions or joint distributions (couplings). Standard softmax attention corresponds to entropy-regularized optimal transport, giving a unified view of attention as normalization of an underlying affinity function. The central result is a universal approximation theorem: transformer architectures equipped with standard softmax attention or Sinkhorn normalization can approximate arbitrary contextual relation rules, and the specific normalization determines how the relations are represented.
What carries the argument
Measure-theoretic framework that treats contextual relations as conditional distributions or couplings, with attention realized as normalization of an affinity function equivalent to entropy-regularized optimal transport.
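A compressed rendering of that machinery (our notation, not the paper's): softmax attention normalizes the affinity kernel row by row, while the entropic OT plan normalizes the same kernel against both marginals.

```latex
% Sketch of the normalization view; notation is illustrative.
% Affinity kernel: G_{ij} = exp(<W_Q x_i, W_K x_j> / sqrt(d)).
\[
  \text{softmax attention:}\qquad
  A_{ij} = \frac{G_{ij}}{\sum_{j'} G_{ij'}}
  \quad\text{(each row is a conditional distribution)}
\]
\[
  \text{entropic OT:}\qquad
  \pi^{\varepsilon} = \operatorname*{arg\,min}_{\pi \in \Pi(\mu,\nu)}
  \langle C, \pi\rangle - \varepsilon H(\pi),
  \qquad
  \pi^{\varepsilon}_{ij} = u_i \, e^{-C_{ij}/\varepsilon} \, v_j,
\]
% where the scalings u, v are fixed by the marginal constraints (Sinkhorn).
% With C_{ij} = -<W_Q x_i, W_K x_j> / sqrt(d) and eps = 1, the same kernel G
% appears, normalized against both marginals rather than rows alone.
```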
If this is right
- Any contextual rule expressible as a conditional distribution or coupling can be approximated arbitrarily well by a sufficiently large transformer.
- Switching between softmax and Sinkhorn normalization changes whether the model represents relations primarily as conditionals or as full joints (see the sketch after this list).
- The same architecture can therefore serve different contextual modeling needs simply by altering the normalization step.
- Empirical effectiveness of transformers on context-heavy tasks follows directly from the universal approximation property rather than from architecture-specific heuristics.
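A minimal numerical sketch of the contrast in the second bullet, assuming uniform marginals and a dot-product affinity (both our choices, not the paper's):

```python
import numpy as np

def affinity(X, W_q, W_k):
    """Unnormalized kernel G_ij = exp(<W_q x_i, W_k x_j> / sqrt(d))."""
    d = W_q.shape[0]
    scores = (X @ W_q.T) @ (X @ W_k.T).T / np.sqrt(d)
    return np.exp(scores - scores.max())  # global shift, harmless for either normalization

def softmax_rows(G):
    """Row normalization: row i is a conditional distribution p(j | i)."""
    return G / G.sum(axis=1, keepdims=True)

def sinkhorn(G, n_iters=500, tol=1e-9):
    """Alternating row/column scaling toward a doubly stochastic matrix,
    then rescaled to a coupling with (near-)uniform marginals."""
    P = np.array(G, dtype=float)
    for _ in range(n_iters):
        P /= P.sum(axis=1, keepdims=True)   # rows sum to 1
        P /= P.sum(axis=0, keepdims=True)   # columns sum to 1
        if np.abs(P.sum(axis=1) - 1.0).max() < tol:
            break
    return P / P.sum()                      # entries sum to 1: a joint distribution

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))
G = affinity(X, rng.normal(size=(d, d)), rng.normal(size=(d, d)))

A = softmax_rows(G)   # conditionals: each row of A sums to 1
P = sinkhorn(G)       # coupling: P sums to 1, both marginals close to 1/n
```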
Where Pith is reading between the lines
- Different normalizations could be chosen deliberately to match the probabilistic structure of a target domain, such as using couplings for symmetric relations.
- The optimal-transport view suggests deriving new attention variants by replacing the affinity function with other cost measures suited to particular data types (see the sketch after this list).
- Initialization or regularization schemes could be designed around the entropy-regularized transport objective to improve training stability.
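One way to read the second bullet in code (illustrative only; the paper prescribes no particular cost): swap the dot-product cost for any pairwise cost and keep the normalization step unchanged.

```python
import numpy as np

def cost_attention(X, cost_fn, eps=1.0):
    """Attention from an arbitrary pairwise cost: row-normalize exp(-C / eps).
    With cost_fn(x, y) = -<x, y>, this reduces to ordinary softmax attention."""
    n = len(X)
    C = np.array([[cost_fn(X[i], X[j]) for j in range(n)] for i in range(n)])
    G = np.exp(-(C - C.min()) / eps)        # shift for numerical stability
    return G / G.sum(axis=1, keepdims=True)

X = np.random.default_rng(1).normal(size=(6, 4))
A_gauss = cost_attention(X, lambda x, y: np.sum((x - y) ** 2))  # Gaussian-kernel attention
A_dot = cost_attention(X, lambda x, y: -(x @ y))                # standard softmax attention
```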
Load-bearing premise
Contextual relations of interest can be fully and faithfully captured by modeling them as conditional distributions or as joint distributions between the relevant variables.
What would settle it
A specific contextual relation, such as a deterministic non-probabilistic ordering rule on sequences, that no transformer of any finite width and depth using softmax attention can approximate within a fixed error tolerance on a chosen test set of contexts.
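A minimal harness for this kind of test (entirely illustrative: the ordering rule, the toy one-head model, and the error metric are ours, not the paper's):

```python
import numpy as np

def ordering_rule(x):
    """Deterministic contextual rule: each token attends entirely to the
    next token in sorted order (a hard, non-probabilistic relation)."""
    order = np.argsort(x)
    T = np.zeros((len(x), len(x)))
    for r in range(len(x)):
        T[order[r], order[(r + 1) % len(x)]] = 1.0
    return T

def toy_softmax_attention(x, w_q=1.0, w_k=1.0):
    """A single scalar-featured softmax head, the weakest possible candidate."""
    s = np.outer(w_q * x, w_k * x)
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Worst-case per-row total-variation error over sampled contexts; a real
# test would sweep widths and depths against a fixed tolerance.
rng = np.random.default_rng(2)
worst = max(
    0.5 * np.abs(toy_softmax_attention(x) - ordering_rule(x)).sum(axis=1).max()
    for x in (rng.normal(size=8) for _ in range(100))
)
print(f"worst-case TV error of the toy head: {worst:.3f}")
```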
Original abstract
Transformer architectures have achieved remarkable empirical success in modeling contextual relations, yet a clear understanding of their expressive power is still lacking. In this work, we introduce a measure-theoretic framework in which contextual relations are modeled as probabilistic objects, either as conditional distributions or as joint distributions (couplings). This perspective reveals a natural connection between standard softmax attention and entropy-regularized optimal transport, providing a unified view of attention as a normalization of an underlying affinity function. Within this framework, we establish a universal approximation theorem for contextual systems using standard Softmax Attention and alternately Sinkhorn normalization. These results show that Transformer architectures can approximate arbitrary contextual relation rules, and that the choice of normalization determines how these relations are represented. Moreover, they provide a principled explanation for why Transformers are effective at modeling contextual relations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a measure-theoretic framework in which contextual relations are modeled as conditional distributions or joint distributions (couplings). It connects standard softmax attention to entropy-regularized optimal transport via normalization of an affinity function, and establishes a universal approximation theorem asserting that Transformers using standard Softmax Attention (and alternately Sinkhorn normalization) can approximate arbitrary contextual relation rules.
Significance. If the universal approximation result holds under the actual restrictions of dot-product attention, the work would supply a principled theoretical account of why Transformers succeed at contextual modeling and would unify attention with optimal transport. The framework itself is a clean conceptual contribution even if the density claim requires additional justification.
major comments (1)
- [Abstract] Universal approximation theorem: the claim that standard Softmax Attention suffices rests on the density of the realized affinities exp((W_Q x_i)^T (W_K x_j)/sqrt(d)) (or multi-head variants) in the space of measurable affinity functions. The manuscript does not demonstrate that this restricted parametric class remains dense on general measure spaces when the embedding dimension is finite; this is load-bearing for the central claim.
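To make the restriction concrete (our rendering of the reviewer's point, not a result from the manuscript): at fixed embedding dimension, the realizable log-affinities are bilinear forms of bounded rank.

```latex
% Why fixed d is restrictive: the log-affinity is a bilinear form whose
% rank is capped by the embedding dimension.
\[
  \log G(x_i, x_j)
  = \frac{(W_Q x_i)^{\top} (W_K x_j)}{\sqrt{d}}
  = \frac{x_i^{\top} \big(W_Q^{\top} W_K\big)\, x_j}{\sqrt{d}},
  \qquad
  \operatorname{rank}\big(W_Q^{\top} W_K\big) \le d,
\]
% so the class realized at fixed d is a rank-d family of kernels, and
% density in all measurable affinities can only plausibly hold as d grows.
```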
minor comments (1)
- The abstract refers to 'alternately Sinkhorn normalization' without specifying the precise algorithmic steps or convergence criteria used in the Transformer setting; the full manuscript should include an explicit definition or pseudocode.
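For reference, one common formulation of what such pseudocode could specify: a log-domain Sinkhorn loop with an explicit marginal-error stopping rule (uniform marginals and the tolerance are our assumptions, not the manuscript's).

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_log(scores, eps=1.0, max_iters=500, tol=1e-6):
    """Log-domain Sinkhorn toward a coupling with uniform marginals.
    scores[i, j] is the pre-normalization attention score; eps is the
    entropic regularization strength."""
    n, m = scores.shape
    log_mu = np.full(n, -np.log(n))          # uniform row marginal
    log_nu = np.full(m, -np.log(m))          # uniform column marginal
    S = scores / eps
    a, b = np.zeros(n), np.zeros(m)
    for _ in range(max_iters):
        a = log_mu - logsumexp(S + b[None, :], axis=1)  # enforce row marginals
        b = log_nu - logsumexp(S + a[:, None], axis=0)  # enforce column marginals
        pi = np.exp(a[:, None] + S + b[None, :])
        if np.abs(pi.sum(axis=1) - np.exp(log_mu)).sum() < tol:  # explicit stopping rule
            break
    return pi
```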
Simulated Author's Rebuttal
Thank you for your thorough review and valuable comments on our paper. We appreciate your recognition of the measure-theoretic framework and its connection to entropy-regularized optimal transport as a conceptual contribution. We address the major comment below and will make the corresponding revisions to strengthen the central claim.
Point-by-point responses
Referee: [Abstract] Universal approximation theorem: the claim that standard Softmax Attention suffices rests on the density of the realized affinities exp((W_Q x_i)^T (W_K x_j)/sqrt(d)) (or multi-head variants) in the space of measurable affinity functions. The manuscript does not demonstrate that this restricted parametric class remains dense on general measure spaces when the embedding dimension is finite; this is load-bearing for the central claim.
Authors: We agree that an explicit demonstration of density for the parametric affinities is necessary and load-bearing. The original manuscript states the universal approximation result assuming the affinity can be realized via the dot-product form but does not include a dedicated density argument for finite embedding dimension on general measure spaces. In the revised version we will add a lemma showing that, under standard assumptions (e.g., compactly supported measures on metric spaces), the class of functions of the form exp((W_Q x)^T (W_K y)/sqrt(d)) becomes dense in the space of continuous affinity functions as the embedding dimension d increases (with d chosen depending on the target accuracy). This follows from the fact that finite sums of separable products g_l(x) h_l(y) are dense in C(X × Y) by the Stone-Weierstrass theorem, and multi-head attention further enlarges the effective rank. We will also clarify that the result holds in the limit of sufficiently large (but finite) d, consistent with other universal-approximation statements in neural networks, and note the limitations for strictly fixed d on arbitrary measure spaces. revision: yes
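A compressed version of the density argument promised here (our rendering, under the rebuttal's compactness assumption and with nonlinear feature maps phi, psi supplied by the embedding layers, which are our addition):

```latex
% Stone-Weierstrass: finite sums of separable products are dense in C(X x Y),
% since {sum_l g_l(x) h_l(y)} is a point-separating subalgebra containing 1.
\[
  \forall\, f \in C(X \times Y),\ \forall\, \delta > 0:\quad
  \Big\| f - \sum_{l=1}^{L} g_l(x)\, h_l(y) \Big\|_{\infty} < \delta
  \quad\text{for some } L,\ g_l \in C(X),\ h_l \in C(Y).
\]
% The dot-product exponent is exactly such a separable sum of rank d,
\[
  \frac{\langle W_Q \phi(x),\, W_K \psi(y)\rangle}{\sqrt{d}}
  = \frac{1}{\sqrt{d}} \sum_{l=1}^{d} \big(W_Q \phi(x)\big)_l \big(W_K \psi(y)\big)_l ,
\]
% so if the learned features approximate the factors g_l, h_l, the exponent
% approximates any continuous log-affinity as d grows, and continuity of
% exp transfers the approximation to the affinity itself.
```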
Circularity Check
No circularity: derivation proceeds from standard definitions to approximation theorem without self-referential reductions.
full rationale
The paper defines contextual relations measure-theoretically as conditionals or couplings, identifies softmax attention as normalization of an affinity matrix (a standard construction), and invokes density of entropic OT plans to obtain a universal approximation result. No equation reduces to a fitted parameter renamed as a prediction, no self-citation supplies a load-bearing uniqueness theorem, and no ansatz is smuggled in via prior work. The framework is checked against external measure-theoretic definitions rather than its own outputs; the referee's density concern addresses restriction of the model class relative to unrestricted affinities but does not exhibit a definitional loop or construction-by-fit inside the given derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Contextual relations can be modeled as probabilistic objects, either as conditional distributions or as joint distributions (couplings).
Forward citations
Cited by 2 Pith papers
- One Operator for Many Densities: Amortized Approximation of Conditioning by Neural Operators. A single neural operator can approximate the map from arbitrary joint densities to their conditionals, backed by new continuity results and illustrated on Gaussian mixtures.
- One Operator for Many Densities: Amortized Approximation of Conditioning by Neural Operators. A single neural operator can approximate the map from joint densities to conditional densities to arbitrary accuracy, with a proof based on continuity of the conditioning operator and a demonstration on Gaussian mixtures.