pith. sign in

arxiv: 2606.20547 · v1 · pith:XIYZ6OIYnew · submitted 2026-06-18 · 💻 cs.LG · cs.CV· cs.GR· cs.RO· math.DG

The Token Is a Group Element: On Lie-Algebra Attention over Matrix Lie Groups

Pith reviewed 2026-06-26 17:28 UTC · model grok-4.3

classification 💻 cs.LG cs.CVcs.GRcs.ROmath.DG
keywords lie algebra attentionmatrix lie groupsgroup equivarianceclosed-form kernelSE(2)affine groupstoken representation
0
0 comments X

The pith

Tokens defined as bare matrix Lie group elements yield a closed-form attention score from the relative pose logarithm that applies to affine groups and uses 50-80 times fewer parameters than learned kernels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper places each attention token directly as an element g_i of a matrix Lie group G, without any feature vector or external representation. The pairwise score is then the negative squared norm of the Lie algebra element log(g_i inverse times g_j) under a block-weighted Frobenius inner product. This construction is automatically equivariant under the diagonal group action and requires no irreducible representations, spherical harmonics, or learned kernel function. It reaches non-compact affine groups that previous vector-token or irrep-based methods exclude. Sequence completion experiments on SE(2), SO(3), and Aff(2) show the closed-form score matches or exceeds an MLP kernel on the same invariant while using far fewer parameters and preserving exact invariance.

Core claim

By defining tokens as bare elements g_i of a matrix Lie group G, the attention score becomes the negative squared algebra norm s_ij = −‖log(g_i^{-1} g_j)‖_λ² / τ under a block-weighted Frobenius inner product. This score is intrinsic to the group, automatically satisfies equivariance and the cocycle condition, and applies to any matrix Lie group admitting a suitable logarithm chart, including the affine groups that exclude irrep-based or surjective-exp methods.

What carries the argument

The Lie algebra element w_ij = log(g_i^{-1} g_j) obtained from the relative group element, whose block-weighted Frobenius norm supplies the closed-form attention score without learned parameters or representation theory.

If this is right

  • The attention construction reaches the non-compact affine group Aff(2) that irrep-based and surjective-exp methods must exclude.
  • On SE(2) the closed-form score matches or outperforms a learned MLP kernel on the same invariant while using 50 to 80 times fewer score parameters.
  • A vector-token baseline breaks invariance by five to twelve orders of magnitude on the tested sequence-completion tasks.
  • Equivariance under the diagonal G-action and the cocycle condition hold automatically from the group structure without additional design.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same relative-pose construction could be tested on SE(3) or Aff(3) to check whether the parameter advantage scales to higher-dimensional groups.
  • Hybrid models could attach feature payloads to the group-element tokens to combine geometric invariance with learned content.

Load-bearing premise

A logarithm chart exists that contains all relevant relative poses g_i^{-1} g_j for the chosen matrix Lie groups and that the block-weighted Frobenius inner product supplies a suitable metric on the algebra.

What would settle it

Applying a fixed group element to all tokens via left multiplication and checking whether the resulting attention scores remain exactly unchanged; any measurable deviation would show the claimed automatic equivariance does not hold.

Figures

Figures reproduced from arXiv: 2606.20547 by Przemyslaw Musialski.

Figure 1
Figure 1. Figure 1: SE(2) sequence completion: pose error and equivariance error per model (log scale). G is 34% better than C at 54× fewer score parameters; both vastly outperform the equivariance-blind A, whose equivariance error is eleven orders of magnitude larger. G uses 36 score parameters and reaches pose error 0.003; C uses 1,932 (54× more) and reaches 0.005. The closed-form algebra-norm score is 34% better than the l… view at source ↗
Figure 2
Figure 2. Figure 2: SO(3) sequence completion: pose error and equivariance error per model (log scale). G with 24 score parameters matches C at 80× fewer parameters (both at the 10−4 floor); G’s and C’s equivariance errors sit at ∼ 10−14 in float32, while A’s is ∼ 10−1 . Both G and C reach pose error ∼10−4 (statistically indistinguishable within seed noise); C’s third seed lands lower and accounts for the wider C-std, but the… view at source ↗
Figure 3
Figure 3. Figure 3: Aff(2) sequence completion: pose error and equivariance error per model (log scale). G with 60 score parameters matches C at 51× fewer parameters; A’s pose error is 117× worse than G’s, and its equivariance error is 1.29 — five orders of magnitude above G’s and nine above C’s. G and C reach pose error ≈ 0.007, indistinguishable within seed noise, despite C using 51× more score parameters. A collapses. Its … view at source ↗
read the original abstract

We place the attention token on the group: a token is an element $g_i$ of a matrix Lie group $G$ -- a bare transformation, with no feature payload and no external action $\rho(g)$ carrying it. To our knowledge this is the first attention construction whose tokens are bare matrix Lie group elements: their score is the closed-form algebra norm of the relative pose rather than a learned kernel, and it reaches the affine full-frame groups that every irrep- or surjective-exp-based method must exclude. We call it Lie-Algebra Attention. Once tokens are group elements, the rest follows with none of the usual representation-theoretic machinery. The relative geometry of a pair is canonical, $g_i^{-1} g_j$, so the pairwise invariant $w_{ij} = \log(g_i^{-1} g_j)$ is intrinsic rather than designed; equivariance under the diagonal $G$-action is tautological, and the cocycle condition holds automatically. The attention score is the negative squared algebra norm, $s_{ij} = -\|\log(g_i^{-1} g_j)\|_\lambda^2/\tau$: the canonical proximity kernel under a block-weighted Frobenius inner product, with no irreducible representations, spherical harmonics, Clebsch-Gordan products, or learned kernel. The construction applies to any matrix Lie group on a chosen logarithm chart containing the relative poses, including the non-compact non-abelian affine groups with scale and shear that no vector-token attention method reaches: neither the irrep tradition nor surjective-exp methods. Three sequence-completion experiments, on SE(2), SO(3), and Aff(2), bear this out: the closed-form score matches a learned MLP kernel on the same invariant and outperforms it on SE(2), using 50 to 80x fewer score parameters, while a vector-token baseline breaks invariance by five to twelve orders of magnitude.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Lie-Algebra Attention in which tokens are bare elements g_i of a matrix Lie group G (no feature vectors or external representations). The pairwise score is the closed-form s_ij = -||log(g_i^{-1} g_j)||_λ²/τ under a block-weighted Frobenius inner product on the Lie algebra; equivariance is automatic. The construction is claimed to apply to any matrix Lie group on a chosen logarithm chart containing the relative poses, including non-compact affine groups excluded by irrep- or surjective-exp-based methods. Three sequence-completion experiments (SE(2), SO(3), Aff(2)) are reported to show that the closed-form score matches or outperforms a learned MLP kernel on the same invariant while using 50-80× fewer score parameters, and that a vector-token baseline violates invariance by 5-12 orders of magnitude.

Significance. If the central claims hold, the work supplies a parameter-efficient, representation-free attention mechanism whose equivariance and cocycle properties follow directly from the group structure. The closed-form score and explicit applicability to Aff(2) and similar groups constitute a genuine extension beyond existing geometric attention constructions. The reported parameter reduction and invariance preservation, if reproducible, would be a concrete practical advantage.

major comments (2)
  1. [Abstract and Aff(2) experiment description] Abstract and Aff(2) experiment: the score s_ij = -||log(g_i^{-1} g_j)||_λ²/τ is defined only when every relative pose lies in the image of the chosen logarithm chart. For Aff(2) the exponential map is not surjective (a matrix [[A b]; [0 1]] admits a real logarithm only under specific eigenvalue and Jordan conditions on A). The manuscript states that the construction applies “on a chosen logarithm chart containing the relative poses” but supplies no verification that the Aff(2) sequence-completion dataset satisfies this condition for all pairs. This verification is load-bearing for the Aff(2) results and for the claim that the method reaches affine groups.
  2. [Experiments section] Experiments: the quantitative claims (matching/outperforming the MLP, 50-80× parameter reduction, invariance violation by 5-12 orders of magnitude) are presented without data splits, error bars, training details, or code. Because the soundness of these outcomes cannot be assessed from the given text, the experimental support for the central efficiency and invariance claims remains unverifiable.
minor comments (2)
  1. The choice and tuning procedure for the block weights λ and temperature τ should be stated explicitly, including whether they are fixed across groups or selected per experiment.
  2. The exact architecture, hidden dimension, and training protocol of the MLP baseline (and of the vector-token baseline) should be given so that the parameter-count comparison can be reproduced.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. We respond to each major comment below.

read point-by-point responses
  1. Referee: [Abstract and Aff(2) experiment description] Abstract and Aff(2) experiment: the score s_ij = -||log(g_i^{-1} g_j)||_λ²/τ is defined only when every relative pose lies in the image of the chosen logarithm chart. For Aff(2) the exponential map is not surjective (a matrix [[A b]; [0 1]] admits a real logarithm only under specific eigenvalue and Jordan conditions on A). The manuscript states that the construction applies “on a chosen logarithm chart containing the relative poses” but supplies no verification that the Aff(2) sequence-completion dataset satisfies this condition for all pairs. This verification is load-bearing for the Aff(2) results and for the claim that the method reaches affine groups.

    Authors: We agree that explicit verification is required for the Aff(2) claim. The Aff(2) sequences were generated so that all relative poses satisfy the eigenvalue and Jordan conditions for a real matrix logarithm to exist under the standard chart. In the revision we will add a dedicated paragraph in the Aff(2) experiment subsection that (i) states the precise conditions used, (ii) reports that every generated relative pose meets them, and (iii) confirms the chosen logarithm chart therefore contains the entire dataset. This makes the load-bearing assumption verifiable. revision: yes

  2. Referee: [Experiments section] Experiments: the quantitative claims (matching/outperforming the MLP, 50-80× parameter reduction, invariance violation by 5-12 orders of magnitude) are presented without data splits, error bars, training details, or code. Because the soundness of these outcomes cannot be assessed from the given text, the experimental support for the central efficiency and invariance claims remains unverifiable.

    Authors: We accept that the current experimental description is insufficient for reproducibility. In the revised manuscript we will expand the Experiments section with: (a) explicit train/validation/test splits and generation procedure, (b) results averaged over multiple random seeds together with standard deviations, (c) full hyper-parameter tables and optimizer settings, and (d) a public code repository link (to be activated upon acceptance). These additions will allow direct assessment of the reported efficiency and invariance numbers. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is explicitly definitional and externally validated

full rationale

The paper's core construction defines the token as a bare group element g_i and the score directly as s_ij = -||log(g_i^{-1} g_j)||_λ²/τ from the relative pose and block-weighted Frobenius norm, with equivariance called tautological by construction. No step claims a derived prediction that reduces to a fitted parameter or self-citation chain; the MLP kernel comparison uses the same invariant as an external baseline, and experiments on SE(2), SO(3), Aff(2) provide independent performance checks. The logarithm-chart assumption is stated explicitly but does not create a definitional loop. The result is self-contained against the supplied group-theoretic definitions.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central construction rests on standard Lie-group properties plus two tunable scalars and one domain assumption about the logarithm chart.

free parameters (2)
  • λ (block weights)
    Weights inside the block-weighted Frobenius inner product that defines the algebra norm.
  • τ (temperature)
    Scalar that scales the squared norm in the attention score.
axioms (2)
  • domain assumption A logarithm chart exists containing the relative poses g_i^{-1} g_j for the groups under study.
    Invoked when defining w_ij = log(g_i^{-1} g_j) and the score s_ij.
  • standard math Matrix Lie groups are closed under inversion and the group operation.
    Background fact used to form the relative pose.

pith-pipeline@v0.9.1-grok · 5899 in / 1362 out tokens · 39296 ms · 2026-06-26T17:28:30.577451+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

29 extracted references · 5 linked inside Pith

  1. [1]

    Thomas et al.Tensor Field Networks: Rotation- and Translation-Equivariant Neural Networks for 3D Point Clouds

    N. Thomas et al.Tensor Field Networks: Rotation- and Translation-Equivariant Neural Networks for 3D Point Clouds. arXiv:1802.08219, 2018

  2. [2]

    F. B. Fuchs, D. E. Worrall, V. Fischer, M. Welling.SE(3)-Transformers: 3D Roto-Translation Equivariant At- tention Networks. NeurIPS 2020. 18

  3. [3]

    Y.-L. Liao, T. Smidt.Equiformer: Equivariant Graph Attention Transformer for 3D Atomistic Graphs. arXiv:2206.11990, 2022

  4. [4]

    Y.-L. Liao, A. J. Hoffman, S. C. Shen, A. Duval, S. W. Norwood, T. Smidt.EquiformerV3: Scaling Efficient, Expressive, and General SE(3)-Equivariant Graph Attention Transformers. arXiv:2604.09130, 2026

  5. [5]

    Batatia et al.MACE: Higher Order Equivariant Message Passing Neural Networks for Fast and Accurate Force Fields

    I. Batatia et al.MACE: Higher Order Equivariant Message Passing Neural Networks for Fast and Accurate Force Fields. NeurIPS 2022

  6. [6]

    Cohen, M

    T. Cohen, M. Welling.Group Equivariant Convolutional Networks. ICML 2016

  7. [7]

    Weiler, G

    M. Weiler, G. Cesa.General E(2)-Equivariant Steerable CNNs. NeurIPS 2019

  8. [8]

    V. G. Satorras, E. Hoogeboom, M. Welling.E(n) Equivariant Graph Neural Networks. ICML 2021

  9. [9]

    Brehmer, P

    J. Brehmer, P. de Haan, S. Behrends, T. Cohen.Geometric Algebra Transformer. NeurIPS 2023

  10. [10]

    de Haan, T

    P. de Haan, T. Cohen, J. Brehmer.Euclidean, Projective, Conformal: Choosing a Geometric Algebra for Equiv- ariant Transformers. AISTATS 2024

  11. [11]

    S. Xu, D. Chen, K. Wong, C. Zhang, K. Fallah, R. Urtasun.Efficient Equivariant Transformer for Self-Driving Agent Modeling(DriveGATr). arXiv:2604.01466, 2026

  12. [12]

    G. E. Hinton, S. Sabour, N. Frosst.Matrix Capsules with EM Routing. ICLR 2018

  13. [13]

    Jaderberg, K

    M. Jaderberg, K. Simonyan, A. Zisserman, K. Kavukcuoglu.Spatial Transformer Networks. NeurIPS 2015

  14. [14]

    Jumper et al.Highly Accurate Protein Structure Prediction with AlphaFold

    J. Jumper et al.Highly Accurate Protein Structure Prediction with AlphaFold. Nature 596, 583–589, 2021

  15. [15]

    N. Funk, J. Urain, J. Carvalho, V. Prasad, G. Chalvatzaki, J. Peters.ActionFlow: Equivariant, Accurate, and Efficient Policies with Spatially Symmetric Flow Matching. arXiv:2409.04576, 2024

  16. [16]

    Hutchinson, C

    M. Hutchinson, C. Le Lan, S. Zaidi, E. Dupont, Y. W. Teh, H. Kim.LieTransformer: Equivariant Self-Attention for Lie Groups. ICML 2021

  17. [17]

    Finzi, S

    M. Finzi, S. Stanton, P. Izmailov, A. G. Wilson.Generalizing Convolutional Neural Networks for Equivariance to Lie Groups on Arbitrary Continuous Data. ICML 2020. arXiv:2002.12880

  18. [18]

    E. J. Bekkers, S. Vadgama, R. D. Hesselink et al.Fast, Expressive SE(n) Equivariant Networks Through Weight- Sharing in Position-Orientation Space. ICLR 2024

  19. [19]

    Mironenco, P

    M. Mironenco, P. Forr ´e.Lie Group Decompositions for Equivariant Neural Networks. ICLR 2024

  20. [20]

    J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, Y. Liu.RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864, 2021

  21. [21]

    Ji.RiemannFormer: A Framework for Attention in Curved Spaces

    Z. Ji.RiemannFormer: A Framework for Attention in Curved Spaces. arXiv:2506.07405, 2025

  22. [22]

    Ostmeier et al.LieRE: Lie Rotational Positional Encodings

    S. Ostmeier et al.LieRE: Lie Rotational Positional Encodings. arXiv:2406.10322, 2024

  23. [23]

    Miyato, B

    T. Miyato, B. Jaeger, M. Welling, A. Geiger.GTA: A Geometry-A ware Attention Mechanism for Multi-View Transformers. arXiv:2310.10375, 2024

  24. [24]

    Howell et al.Clebsch–Gordan Transformer: Fast and Global Equivariant Attention

    O. Howell et al.Clebsch–Gordan Transformer: Fast and Global Equivariant Attention. arXiv:2509.24093, 2025

  25. [25]

    J. Fu, Q. Xie, D. Meng, Z. Xu.Vanilla Group Equivariant Vision Transformer: Simple and Effective. arXiv:2602.08047, 2026

  26. [26]

    Zhang, Z

    Y. Zhang, Z. Chen, Y. Liu, Z. Qin, H. Yuan, K. Xu, Y. Yuan, Q. Gu, A. C.-C. Yao.Group Representational Position Encoding. arXiv:2512.07805, 2025

  27. [27]

    Franc ¸ois, L

    J. Franc ¸ois, L. Ravera.Toward Manifest Relationality in Transformers via Symmetry Reduction. arXiv:2602.18948, 2026

  28. [28]

    C. Kim, S. Zhao, M. Zhu, T.-Y. Lin, M. Ghaffari.Equivariant Neural Networks for General Linear Symmetries on Lie Algebras(Reductive Lie Neurons). arXiv:2510.22984, 2025

  29. [29]

    Hoyos, S

    P. Hoyos, S. Ubaru, D. Huh, V. Kalantzis, K. L. Clarkson, M. Kilmer, H. Avron, L. Horesh.Group-Algebraic Tensors: Provably-optimal Equivariant Learning and Physical Symmetry Discovery. arXiv:2605.20440, 2026. 19