Recognition: no theorem link
On the Expressive Power of Contextual Relations in Transformers
Pith reviewed 2026-05-14 23:58 UTC · model grok-4.3
The pith
Transformers can approximate any contextual relation by treating it as a probability distribution or coupling.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Contextual relations are modeled as probabilistic objects, either conditional distributions or joint distributions (couplings). Standard softmax attention corresponds to entropy-regularized optimal transport, giving a unified view of attention as normalization of an underlying affinity function. The central result is a universal approximation theorem: transformer architectures equipped with standard softmax attention or Sinkhorn normalization can approximate arbitrary contextual relation rules, and the specific normalization determines how the relations are represented.
What carries the argument
Measure-theoretic framework that treats contextual relations as conditional distributions or couplings, with attention realized as normalization of an affinity function equivalent to entropy-regularized optimal transport.
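A compressed rendering of that machinery (our notation, not the paper's): softmax attention normalizes the affinity kernel row by row, while the entropic OT plan normalizes the same kernel against both marginals.

```latex
% Sketch of the normalization view; notation is illustrative.
% Affinity kernel: G_{ij} = exp(<W_Q x_i, W_K x_j> / sqrt(d)).
\[
  \text{softmax attention:}\qquad
  A_{ij} = \frac{G_{ij}}{\sum_{j'} G_{ij'}}
  \quad\text{(each row is a conditional distribution)}
\]
\[
  \text{entropic OT:}\qquad
  \pi^{\varepsilon} = \operatorname*{arg\,min}_{\pi \in \Pi(\mu,\nu)}
  \langle C, \pi\rangle - \varepsilon H(\pi),
  \qquad
  \pi^{\varepsilon}_{ij} = u_i \, e^{-C_{ij}/\varepsilon} \, v_j,
\]
% where the scalings u, v are fixed by the marginal constraints (Sinkhorn).
% With C_{ij} = -<W_Q x_i, W_K x_j> / sqrt(d) and eps = 1, the same kernel G
% appears, normalized against both marginals rather than rows alone.
```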
If this is right
- Any contextual rule expressible as a conditional distribution or coupling can be approximated arbitrarily well by a sufficiently large transformer.
- Switching between softmax and Sinkhorn normalization changes whether the model represents relations primarily as conditionals or as full joints (see the sketch after this list).
- The same architecture can therefore serve different contextual modeling needs simply by altering the normalization step.
- Empirical effectiveness of transformers on context-heavy tasks follows directly from the universal approximation property rather than from architecture-specific heuristics.
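A minimal numerical sketch of the contrast in the second bullet, assuming uniform marginals and a dot-product affinity (both our choices, not the paper's):

```python
import numpy as np

def affinity(X, W_q, W_k):
    """Unnormalized kernel G_ij = exp(<W_q x_i, W_k x_j> / sqrt(d))."""
    d = W_q.shape[0]
    scores = (X @ W_q.T) @ (X @ W_k.T).T / np.sqrt(d)
    return np.exp(scores - scores.max())  # global shift, harmless for either normalization

def softmax_rows(G):
    """Row normalization: row i is a conditional distribution p(j | i)."""
    return G / G.sum(axis=1, keepdims=True)

def sinkhorn(G, n_iters=500, tol=1e-9):
    """Alternating row/column scaling toward a doubly stochastic matrix,
    then rescaled to a coupling with (near-)uniform marginals."""
    P = np.array(G, dtype=float)
    for _ in range(n_iters):
        P /= P.sum(axis=1, keepdims=True)   # rows sum to 1
        P /= P.sum(axis=0, keepdims=True)   # columns sum to 1
        if np.abs(P.sum(axis=1) - 1.0).max() < tol:
            break
    return P / P.sum()                      # entries sum to 1: a joint distribution

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))
G = affinity(X, rng.normal(size=(d, d)), rng.normal(size=(d, d)))

A = softmax_rows(G)   # conditionals: each row of A sums to 1
P = sinkhorn(G)       # coupling: P sums to 1, both marginals close to 1/n
```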
Where Pith is reading between the lines
- Different normalizations could be chosen deliberately to match the probabilistic structure of a target domain, such as using couplings for symmetric relations.
- The optimal-transport view suggests deriving new attention variants by replacing the affinity function with other cost measures suited to particular data types (see the sketch after this list).
- Initialization or regularization schemes could be designed around the entropy-regularized transport objective to improve training stability.
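One way to read the second bullet in code (illustrative only; the paper prescribes no particular cost): swap the dot-product cost for any pairwise cost and keep the normalization step unchanged.

```python
import numpy as np

def cost_attention(X, cost_fn, eps=1.0):
    """Attention from an arbitrary pairwise cost: row-normalize exp(-C / eps).
    With cost_fn(x, y) = -<x, y>, this reduces to ordinary softmax attention."""
    n = len(X)
    C = np.array([[cost_fn(X[i], X[j]) for j in range(n)] for i in range(n)])
    G = np.exp(-(C - C.min()) / eps)        # shift for numerical stability
    return G / G.sum(axis=1, keepdims=True)

X = np.random.default_rng(1).normal(size=(6, 4))
A_gauss = cost_attention(X, lambda x, y: np.sum((x - y) ** 2))  # Gaussian-kernel attention
A_dot = cost_attention(X, lambda x, y: -(x @ y))                # standard softmax attention
```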
Load-bearing premise
Contextual relations of interest can be fully and faithfully captured by modeling them as conditional distributions or as joint distributions between the relevant variables.
What would settle it
A specific contextual relation, such as a deterministic non-probabilistic ordering rule on sequences, that no transformer of any finite width and depth using softmax attention can approximate within a fixed error tolerance on a chosen test set of contexts.
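A minimal harness for this kind of test (entirely illustrative: the ordering rule, the toy one-head model, and the error metric are ours, not the paper's):

```python
import numpy as np

def ordering_rule(x):
    """Deterministic contextual rule: each token attends entirely to the
    next token in sorted order (a hard, non-probabilistic relation)."""
    order = np.argsort(x)
    T = np.zeros((len(x), len(x)))
    for r in range(len(x)):
        T[order[r], order[(r + 1) % len(x)]] = 1.0
    return T

def toy_softmax_attention(x, w_q=1.0, w_k=1.0):
    """A single scalar-featured softmax head, the weakest possible candidate."""
    s = np.outer(w_q * x, w_k * x)
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Worst-case per-row total-variation error over sampled contexts; a real
# test would sweep widths and depths against a fixed tolerance.
rng = np.random.default_rng(2)
worst = max(
    0.5 * np.abs(toy_softmax_attention(x) - ordering_rule(x)).sum(axis=1).max()
    for x in (rng.normal(size=8) for _ in range(100))
)
print(f"worst-case TV error of the toy head: {worst:.3f}")
```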
Original abstract
Transformer architectures have achieved remarkable empirical success in modeling contextual relations, yet a clear understanding of their expressive power is still lacking. In this work, we introduce a measure-theoretic framework in which contextual relations are modeled as probabilistic objects, either as conditional distributions or as joint distributions (couplings). This perspective reveals a natural connection between standard softmax attention and entropy-regularized optimal transport, providing a unified view of attention as a normalization of an underlying affinity function. Within this framework, we establish a universal approximation theorem for contextual systems using standard Softmax Attention and alternately Sinkhorn normalization. These results show that Transformer architectures can approximate arbitrary contextual relation rules, and that the choice of normalization determines how these relations are represented. Moreover, they provide a principled explanation for why Transformers are effective at modeling contextual relations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a measure-theoretic framework in which contextual relations are modeled as conditional distributions or joint distributions (couplings). It connects standard softmax attention to entropy-regularized optimal transport via normalization of an affinity function, and establishes a universal approximation theorem asserting that Transformers using standard Softmax Attention (and alternately Sinkhorn normalization) can approximate arbitrary contextual relation rules.
Significance. If the universal approximation result holds under the actual restrictions of dot-product attention, the work would supply a principled theoretical account of why Transformers succeed at contextual modeling and would unify attention with optimal transport. The framework itself is a clean conceptual contribution even if the density claim requires additional justification.
major comments (1)
- [Abstract] Universal approximation theorem: the claim that standard Softmax Attention suffices rests on the density of the realized affinities exp((W_Q x_i)^T (W_K x_j)/sqrt(d)) (or multi-head variants) in the space of measurable affinity functions. The manuscript does not demonstrate that this restricted parametric class remains dense on general measure spaces when the embedding dimension is finite; this is load-bearing for the central claim.
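To make the restriction concrete (our rendering of the reviewer's point, not a result from the manuscript): at fixed embedding dimension, the realizable log-affinities are bilinear forms of bounded rank.

```latex
% Why fixed d is restrictive: the log-affinity is a bilinear form whose
% rank is capped by the embedding dimension.
\[
  \log G(x_i, x_j)
  = \frac{(W_Q x_i)^{\top} (W_K x_j)}{\sqrt{d}}
  = \frac{x_i^{\top} \big(W_Q^{\top} W_K\big)\, x_j}{\sqrt{d}},
  \qquad
  \operatorname{rank}\big(W_Q^{\top} W_K\big) \le d,
\]
% so the class realized at fixed d is a rank-d family of kernels, and
% density in all measurable affinities can only plausibly hold as d grows.
```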
minor comments (1)
- The abstract refers to 'alternately Sinkhorn normalization' without specifying the precise algorithmic steps or convergence criteria used in the Transformer setting; the full manuscript should include an explicit definition or pseudocode.
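For reference, one common formulation of what such pseudocode could specify: a log-domain Sinkhorn loop with an explicit marginal-error stopping rule (uniform marginals and the tolerance are our assumptions, not the manuscript's).

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_log(scores, eps=1.0, max_iters=500, tol=1e-6):
    """Log-domain Sinkhorn toward a coupling with uniform marginals.
    scores[i, j] is the pre-normalization attention score; eps is the
    entropic regularization strength."""
    n, m = scores.shape
    log_mu = np.full(n, -np.log(n))          # uniform row marginal
    log_nu = np.full(m, -np.log(m))          # uniform column marginal
    S = scores / eps
    a, b = np.zeros(n), np.zeros(m)
    for _ in range(max_iters):
        a = log_mu - logsumexp(S + b[None, :], axis=1)  # enforce row marginals
        b = log_nu - logsumexp(S + a[:, None], axis=0)  # enforce column marginals
        pi = np.exp(a[:, None] + S + b[None, :])
        if np.abs(pi.sum(axis=1) - np.exp(log_mu)).sum() < tol:  # explicit stopping rule
            break
    return pi
```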
Simulated Author's Rebuttal
Thank you for your thorough review and valuable comments on our paper. We appreciate your recognition of the measure-theoretic framework and its connection to entropy-regularized optimal transport as a conceptual contribution. We address the major comment below and will make the corresponding revisions to strengthen the central claim.
Point-by-point responses
Referee: [Abstract] Universal approximation theorem: the claim that standard Softmax Attention suffices rests on the density of the realized affinities exp((W_Q x_i)^T (W_K x_j)/sqrt(d)) (or multi-head variants) in the space of measurable affinity functions. The manuscript does not demonstrate that this restricted parametric class remains dense on general measure spaces when the embedding dimension is finite; this is load-bearing for the central claim.
Authors: We agree that an explicit demonstration of density for the parametric affinities is necessary and load-bearing. The original manuscript states the universal approximation result assuming the affinity can be realized via the dot-product form but does not include a dedicated density argument for finite embedding dimension on general measure spaces. In the revised version we will add a lemma showing that, under standard assumptions (e.g., compactly supported measures on metric spaces), the class of functions of the form exp((W_Q x)^T (W_K y)/sqrt(d)) becomes dense in the space of continuous affinity functions as the embedding dimension d increases (with d chosen depending on the target accuracy). This follows from the fact that finite sums of separable products g_l(x) h_l(y) are dense in C(X × Y) by the Stone-Weierstrass theorem, and multi-head attention further enlarges the effective rank. We will also clarify that the result holds in the limit of sufficiently large (but finite) d, consistent with other universal-approximation statements in neural networks, and note the limitations for strictly fixed d on arbitrary measure spaces. revision: yes
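A compressed version of the density argument promised here (our rendering, under the rebuttal's compactness assumption and with nonlinear feature maps phi, psi supplied by the embedding layers, which are our addition):

```latex
% Stone-Weierstrass: finite sums of separable products are dense in C(X x Y),
% since {sum_l g_l(x) h_l(y)} is a point-separating subalgebra containing 1.
\[
  \forall\, f \in C(X \times Y),\ \forall\, \delta > 0:\quad
  \Big\| f - \sum_{l=1}^{L} g_l(x)\, h_l(y) \Big\|_{\infty} < \delta
  \quad\text{for some } L,\ g_l \in C(X),\ h_l \in C(Y).
\]
% The dot-product exponent is exactly such a separable sum of rank d,
\[
  \frac{\langle W_Q \phi(x),\, W_K \psi(y)\rangle}{\sqrt{d}}
  = \frac{1}{\sqrt{d}} \sum_{l=1}^{d} \big(W_Q \phi(x)\big)_l \big(W_K \psi(y)\big)_l ,
\]
% so if the learned features approximate the factors g_l, h_l, the exponent
% approximates any continuous log-affinity as d grows, and continuity of
% exp transfers the approximation to the affinity itself.
```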
Circularity Check
No circularity: derivation proceeds from standard definitions to approximation theorem without self-referential reductions.
full rationale
The paper defines contextual relations measure-theoretically as conditionals or couplings, identifies softmax attention as normalization of an affinity matrix (a standard construction), and invokes density of entropic OT plans to obtain a universal approximation result. No equation reduces to a fitted parameter renamed as a prediction, no self-citation supplies a load-bearing uniqueness theorem, and no ansatz is smuggled in via prior work. The framework is checked against external measure-theoretic definitions rather than its own outputs; the referee's density concern addresses restriction of the model class relative to unrestricted affinities but does not exhibit a definitional loop or construction-by-fit inside the given derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Contextual relations can be modeled as probabilistic objects, either as conditional distributions or as joint distributions (couplings).
Forward citations
Cited by 2 Pith papers
- One Operator for Many Densities: Amortized Approximation of Conditioning by Neural Operators. A single neural operator can approximate the map from arbitrary joint densities to their conditionals, backed by new continuity results and illustrated on Gaussian mixtures.
- One Operator for Many Densities: Amortized Approximation of Conditioning by Neural Operators. A single neural operator can approximate the map from joint densities to conditional densities to arbitrary accuracy, with a proof based on continuity of the conditioning operator and a demonstration on Gaussian mixtures.