Beyond Sinusoids: A Morlet Wavelet Framework for Transformer Positional Encoding

Athanasios Zeris

arXiv: 2606.01258 · v1 · pith:NRNM47ZNnew · submitted 2026-05-31 · 💻 cs.LG · cs.CL · eess.SP

Pith reviewed 2026-06-28 17:32 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · eess.SP

keywords positional encoding · transformer · Morlet wavelet · RoPE · sinusoidal encoding · locality kernel · attention mechanism

The pith

Morlet wavelets unify sinusoidal and rotary positional encodings as limiting cases while adding per-dimension learnable locality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Morlet Positional Encoding (MoPE), in which each embedding dimension learns its own frequency and locality bandwidth. It shows that standard sinusoidal encodings and the RoPE correlation kernel both arise as limiting cases when the locality bandwidth is taken to infinity. The phase component matches the RoPE rotation angle, while the amplitude supplies a Gaussian locality kernel that prior methods do not include. On TinyShakespeare, MoPE combined with Energy-Gated Attention is reported to improve on standard attention by 0.119 and to outperform either component alone, and the learned parameters consistently reach the wavelet admissibility boundary.

Core claim

The central claim is that the Morlet wavelet supplies the natural basis for positional encoding because it minimizes joint uncertainty in position and frequency. MoPE therefore generalizes both sinusoidal PE and RoPE: when every locality bandwidth sigma_i tends to infinity the correlation kernel of MoPE reduces to the RoPE kernel and the phase recovers the exact rotation angle of RoPE, while the amplitude term supplies an additional learned Gaussian locality envelope absent from the earlier encodings.
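The limiting statement can be written compactly using the kernel form quoted in the Figure 2 caption (Eq. 17 of the paper). The symbol K below is shorthand introduced here for the per-dimension cross-correlation at lag τ; it is not necessarily the paper's notation.

```latex
K_i^{\mathrm{MoPE}}(\tau) = \cos(\omega_i \tau)\, e^{-\tau^2 / 4\sigma_i^2},
\qquad
\lim_{\sigma_i \to \infty} K_i^{\mathrm{MoPE}}(\tau) = \cos(\omega_i \tau) = K_i^{\mathrm{RoPE}}(\tau)
\quad \text{for every fixed } \tau .
```

For any finite σ_i the Gaussian factor is strictly below 1 at nonzero lag, which is the locality term that the sinusoidal and RoPE kernels lack.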

What carries the argument

Morlet Positional Encoding (MoPE), in which each embedding dimension independently learns a frequency and a locality bandwidth parameter from data.

If this is right

Sinusoidal positional encodings and the RoPE correlation kernel both emerge as exact limiting cases of MoPE.
MoPE supplies an extra learned Gaussian locality kernel that neither sinusoidal nor RoPE encodings contain.
MoPE combined with Energy-Gated Attention yields a 0.119 improvement over standard attention on TinyShakespeare and outperforms either component used alone.
All 128 learned frequency-bandwidth pairs converge to the wavelet admissibility boundary on character-level language data.
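The admissibility boundary in the last point is stated in the Figure 3 and Figure 4 captions as ω_i σ_i = 5, with learned bandwidths in the band σ ∈ [1.49, 4.50] tokens. The following is a minimal sketch of what that constraint implies numerically. The evenly spaced `sigma` array is a hypothetical stand-in for the learned values, which this page does not list, and the helper names are illustrative.

```python
import numpy as np

ADMISSIBILITY_PRODUCT = 5.0  # omega * sigma = 5, as quoted in the figure captions


def boundary_frequency(sigma):
    """Frequency on the boundary hyperbola omega = 5 / sigma."""
    return ADMISSIBILITY_PRODUCT / np.asarray(sigma, dtype=float)


def on_boundary(omega, sigma, rtol=1e-6):
    """True where a (sigma, omega) pair satisfies omega * sigma = 5."""
    product = np.asarray(omega, dtype=float) * np.asarray(sigma, dtype=float)
    return np.isclose(product, ADMISSIBILITY_PRODUCT, rtol=rtol)


# Hypothetical stand-in for the 128 learned bandwidths (not the paper's data):
# evenly spaced over the band reported in the Figure 4 caption.
sigma = np.linspace(1.49, 4.50, 128)
omega = boundary_frequency(sigma)

# Ratio of widest to narrowest bandwidth in the reported band.
band_ratio = sigma.max() / sigma.min()

print(f"omega range on the boundary: [{omega.min():.3f}, {omega.max():.3f}]")
print(f"sigma band ratio: {band_ratio:.2f}x")
print(f"all pairs on boundary: {bool(on_boundary(omega, sigma).all())}")
```

The band ratio computed this way is the "3× learned band" mentioned in the Figure 4 caption; the sketch says nothing about why training ends up on the boundary.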

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The convergence of parameters to the admissibility boundary may indicate a reproducible spectral property of character-level sequences that could be tested on other modalities.
Because MoPE interpolates continuously between fully local and fully global positional effects, it could be used to diagnose how much locality a given task actually requires.
The unification suggests that future positional schemes could be derived by choosing different admissible wavelets instead of fixing the Morlet form.
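One way to make the "how much locality" reading concrete is to convert a bandwidth into a lag range. This sketch assumes the Gaussian envelope exp(−τ²/(4σ²)) from the Figure 2 caption and uses the lag at which it falls to 1/e as an illustrative measure; that choice of threshold is an assumption made here, not something the paper proposes.

```python
import math


def envelope(tau, sigma):
    """Gaussian locality envelope exp(-tau^2 / (4 sigma^2)) from the Figure 2 caption."""
    return math.exp(-(tau ** 2) / (4.0 * sigma ** 2))


def e_folding_lag(sigma):
    """Lag at which the envelope drops to 1/e. Solving tau^2 / (4 sigma^2) = 1 gives tau = 2 sigma."""
    return 2.0 * sigma


# Endpoints of the learned band reported in the Figure 4 caption.
for sigma in (1.49, 4.50):
    lag = e_folding_lag(sigma)
    print(f"sigma = {sigma:.2f} tokens -> envelope reaches 1/e at |tau| = {lag:.2f} tokens")
```

Under this measure the reported band corresponds to positional influence that fades over roughly 3 to 9 tokens, which is short compared with the 256 positions shown in Figure 1.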

Load-bearing premise

The Morlet wavelet is the natural basis for positional encoding because it simultaneously minimizes uncertainty in position and frequency.

What would settle it

Running the identical transformer architecture with MoPE but forcing every locality bandwidth to infinity and verifying that both the learned weights and the task performance become statistically indistinguishable from a standard RoPE baseline.
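The kernel-level half of this test can be checked without training anything. The sketch below assumes the cross-correlation form given in the Figure 2 caption (RoPE cosine times a Gaussian envelope); the function names and the example frequency are illustrative. It does not address the trained-model comparison, which requires the actual experiment.

```python
import numpy as np


def rope_kernel(tau, omega):
    """RoPE cross-correlation at lag tau: a pure cosine with equal amplitude at all lags."""
    return np.cos(omega * tau)


def mope_kernel(tau, omega, sigma):
    """MoPE cross-correlation: RoPE cosine times the envelope exp(-tau^2 / (4 sigma^2))."""
    return rope_kernel(tau, omega) * np.exp(-(tau ** 2) / (4.0 * sigma ** 2))


tau = np.arange(-256, 257, dtype=float)  # lags across a 256-token window
omega = 5.0 / 3.0                        # illustrative frequency (assumption)

local = mope_kernel(tau, omega, sigma=3.0)           # finite bandwidth: decays with |tau|
nearly_global = mope_kernel(tau, omega, sigma=1e6)   # very large bandwidth
exact_limit = mope_kernel(tau, omega, sigma=np.inf)  # locality switched off

max_gap = np.max(np.abs(nearly_global - rope_kernel(tau, omega)))
print(f"max |MoPE - RoPE| at sigma = 1e6: {max_gap:.2e}")
print(f"identical at sigma = inf: {np.array_equal(exact_limit, rope_kernel(tau, omega))}")
print(f"|local kernel| at lag 256: {abs(local[-1]):.2e}")
```

With `sigma=np.inf` the envelope evaluates to exactly 1 at every lag, so the two kernels coincide; with a finite bandwidth the kernel vanishes at long lags, which is the behaviour RoPE cannot express.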

Figures

Figures reproduced from arXiv: 2606.01258 by Athanasios Zeris.

**Figure 1.** Complex-plane geometry of MoPE vs RoPE and sin/cos PE at three representative embedding dimensions (learned EGA-Morlet parameters after 5000 training steps on TinyShakespeare). Each panel shows token positions b = 0, …, 256 as a trajectory in the complex plane (cos(ω_i b), sin(ω_i b)). RoPE / sin/cos (dashed unit circle): every position lies on the circle at constant magnitude; only the angle encodes po…

**Figure 2.** Proposition 1 visualised: MoPE cross-correlation = RoPE × Gaussian locality kernel (Eq. 17), shown for two learned dimensions from the EGA-Morlet model. Blue dashed: the RoPE cross-correlation cos(ω_i τ), a pure cosine at center frequency ω_i, with equal amplitude at all lags τ; RoPE has no notion of “nearby” vs “distant.” Orange shaded: the Gaussian envelope e^(−τ²/4σ_i²), the locality kernel contributed b…

**Figure 3.** Learned MoPE parameters after 5000 training steps (EGA-Morlet, TinyShakespeare). Left: Uncertainty plane (σ_i, ω_i); each point is one embedding dimension pair, coloured by dimension index (fine → coarse). Red dashed line: admissibility boundary ωσ = 5. All 128 learned pairs lie exactly on the boundary (see …

**Figure 4.** All 128 learned MoPE dimension pairs (σ_i, ω_i) converge to the admissibility boundary ω_i σ_i = 5 after 5000 training steps (blue dots, log–log axes). The red curve is the full constraint hyperbola ω = 5/σ; the blue cluster occupies only a narrow band σ ∈ [1.49, 4.50] tokens, confirming the scale compression from the ∼9300× dyadic initialization range to a 3× learned band. The optimizer consistently pushes ev…

Original abstract

Standard positional encodings for transformers - sinusoidal and rotary (RoPE) - treat every position as equally local: they encode where a token is, but not how far its positional influence should extend. We propose that the Morlet wavelet, which simultaneously minimises uncertainty in position and frequency, is the natural basis for positional encoding, and introduce Morlet Positional Encoding (MoPE): each embedding dimension learns its own frequency and locality bandwidth from data. The main theoretical result is a unification: sinusoidal PE and the RoPE correlation kernel both emerge as limiting cases of MoPE when locality is switched off (sigma_i -> infinity). The phase of MoPE recovers the RoPE rotation angle exactly; the amplitude adds a learned Gaussian locality kernel that standard encodings lack. Empirically, MoPE combined with Energy-Gated Attention achieves +0.119 improvement over standard attention on TinyShakespeare, outperforming either component alone. Analysis of the learned parameters reveals that all 128 frequency-bandwidth pairs converge to the wavelet admissibility boundary - an empirical observation consistent with a companion result on energy gating, suggesting a reproducible property of character-level language signals that warrants further investigation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MoPE unifies prior encodings as limits but its experiments do not isolate the contribution of the new positional encoding.

The main thing to know is that this paper gives a Morlet wavelet positional encoding where each dimension learns frequency and bandwidth, recovering sinusoidal and RoPE as exact limits when locality is turned off. The phase matches the rotary angle by construction in that limit, and the amplitude supplies a learned Gaussian locality kernel missing from prior encodings.

The construction is new in allowing per-dimension control over locality via the Gaussian envelope. The math checks out for the phase recovery and the limit cases. The paper also reports that the learned parameters all reach the admissibility boundary in their run on character data.

The experiments are the weaker part. They test MoPE only in combination with energy-gated attention on TinyShakespeare and report a gain of 0.119, but without separating the two changes it's unclear what drives the result. The motivation for picking the Morlet wavelet specifically because it minimizes time-frequency uncertainty is stated up front but not tested against alternatives.

The unification is definitional once the wavelet is chosen, so it does not provide an independent reason to prefer this form over others. Still, the added flexibility in learning locality per dimension is a concrete difference from fixed encodings.

This paper is for people tinkering with transformer positional encodings. A reader working on sequence models could extract the parameterization and try it, though the practical payoff is not yet demonstrated on larger tasks or with isolated ablations.

I would recommend peer review to get the derivations checked and to see if the idea can be evaluated more cleanly on bigger tasks.

Referee Report

2 major / 1 minor

Summary. The paper proposes Morlet Positional Encoding (MoPE) based on the Morlet wavelet, with each embedding dimension learning its own frequency and locality bandwidth (sigma_i). The central theoretical claim is a unification result: standard sinusoidal positional encodings and the RoPE correlation kernel both emerge exactly as limiting cases of MoPE when sigma_i -> infinity, with the phase of MoPE recovering the RoPE rotation angle exactly. The amplitude term supplies an additional learned Gaussian locality kernel absent from prior encodings. Empirically, MoPE combined with Energy-Gated Attention yields a +0.119 gain over standard attention on TinyShakespeare; learned parameters are observed to converge to the wavelet admissibility boundary.

Significance. If the unification holds, the framework supplies a single parametric family that recovers two widely used encodings as special cases while adding learnable locality, which is a clean generalization. The observation that all 128 learned (frequency, bandwidth) pairs converge to the admissibility boundary is presented as potentially reproducible for character-level signals. These strengths are offset by the fact that the reported numeric gain cannot be attributed to the positional encoding in isolation.

major comments (2)

[Abstract] Abstract / empirical evaluation: the reported +0.119 improvement is obtained only for the combination of MoPE with Energy-Gated Attention on a single small dataset (TinyShakespeare); no ablation isolating the positional-encoding contribution is provided, so the performance claim for MoPE itself is not load-bearing.
[Theoretical result] Theoretical unification result: the claimed exact recovery of the RoPE angle and the emergence of both sinusoidal PE and the RoPE kernel are obtained by substituting the limit sigma_i -> infinity directly into the MoPE definition; the correspondence is therefore definitional once the wavelet form is chosen rather than an independent derivation.

minor comments (1)

[Abstract] The abstract states that the Morlet wavelet 'simultaneously minimises uncertainty in position and frequency' as motivation; a brief reference to the Heisenberg uncertainty principle or the wavelet admissibility condition would clarify this for readers unfamiliar with the property.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and indicate where revisions will be made to the manuscript.

Referee: [Abstract] Abstract / empirical evaluation: the reported +0.119 improvement is obtained only for the combination of MoPE with Energy-Gated Attention on a single small dataset (TinyShakespeare); no ablation isolating the positional-encoding contribution is provided, so the performance claim for MoPE itself is not load-bearing.

Authors: We agree that the reported gain is for the combined MoPE + Energy-Gated Attention system and that isolating the positional-encoding contribution would make the empirical claims more robust. In the revised manuscript we will add ablation experiments that compare (i) MoPE versus sinusoidal encodings under standard attention and (ii) MoPE versus the baseline under the same energy-gated attention, all on TinyShakespeare. These results will be reported in a new table and discussed in the experimental section. revision: yes

Referee: [Theoretical result] Theoretical unification result: the claimed exact recovery of the RoPE angle and the emergence of both sinusoidal PE and the RoPE kernel are obtained by substituting the limit sigma_i -> infinity directly into the MoPE definition; the correspondence is therefore definitional once the wavelet form is chosen rather than an independent derivation.

Authors: We respectfully disagree that the result is merely definitional. The contribution consists in selecting the Morlet wavelet on the basis of its joint time-frequency localization properties and then demonstrating that this specific parametric family recovers both sinusoidal positional encodings and the RoPE kernel exactly in the infinite-bandwidth limit while simultaneously introducing a learnable Gaussian locality term absent from prior work. The exact phase recovery and the emergence of the two prior methods are therefore consequences of a principled choice rather than an arbitrary substitution. We will add a short paragraph in the theoretical section clarifying this motivation. revision: no

Circularity Check

0 steps flagged

No significant circularity identified

The paper defines MoPE using the standard Morlet wavelet form (complex exponential times Gaussian envelope) with per-dimension parameters, then states that the sigma_i -> infinity limit recovers sinusoidal PE and the RoPE kernel with exact phase match. This is a direct mathematical consequence of the chosen functional form and standard wavelet properties, not a reduction of an independent derivation to its inputs. No self-citation chain, fitted parameter renamed as prediction, or ansatz smuggled via prior work is present in the unification claim. The empirical results on TinyShakespeare and learned parameter analysis are separate and do not rely on the limit statement. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on treating the Morlet wavelet as the appropriate basis and on the existence of 128 independent learnable frequency-bandwidth pairs whose values are fitted to the training signal.

free parameters (1)

per-dimension frequency and locality bandwidth (sigma_i)
Each of the 128 embedding dimensions learns its own frequency and sigma from data; these are the quantities that converge to the admissibility boundary.

axioms (1)

domain assumption: The Morlet wavelet simultaneously minimises uncertainty in position and frequency and is therefore the natural basis for positional encoding.
Invoked in the second sentence of the abstract to justify the choice of wavelet.

pith-pipeline@v0.9.1-grok · 5732 in / 1482 out tokens · 28695 ms · 2026-06-28T17:32:05.541798+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Multiscale POD of Transformer Attention Fields: Scale-Selective Analysis via Morlet Scalogram
physics.flu-dyn · 2026-06 · unverdicted · novelty 4.0

Applies multiscale POD with Morlet scalograms to transformer attention fields to extract dominant modes per scale and reports layer-dependent scale organisation.

Reference graph

Works this paper leans on

12 extracted references · 3 linked inside Pith · cited by 1 Pith paper

[1] Zeris, A. Energy-Gated Attention: Spectral Salience as an Inductive Bias for Transformer Attention. arXiv preprint arXiv:2605.21842v1, 2026. (Pith/arXiv)

[2] Zeris, A. Energy-Gated Attention and Wavelet Positional Encoding: Complementary Inductive Biases for Transformer Attention. arXiv preprint arXiv:2605.26355v1, 2026. (Pith/arXiv)

[3] Lee-Thorp, J., Ainslie, J., Eckstein, I., and Ontanon, S. FNet: Mixing Tokens with Fourier Transforms. arXiv preprint arXiv:2105.03824, 2021. (arXiv)

[4] Mallat, S. A Wavelet Tour of Signal Processing. Academic Press, 1999.

[5] Press, O., Smith, N. A., and Lewis, M. Train short, test long: Attention with linear biases enables input length extrapolation. In ICLR, 2022.

[6] Sainath, T. N., Vinyals, O., Senior, A., and Sak, H. Convolutional, long short-term memory, fully connected deep neural networks. In ICASSP, pp. 4580–4584, 2015.

[7] Shaw, P., Uszkoreit, J., and Vaswani, A. Self-attention with relative position representations. In NAACL, 2018.

[8] Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., and Liu, Y. RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv preprint arXiv:2104.09864, 2021. (Pith/arXiv)

[9] Tamkin, A., Jurafsky, D., and Goodman, N. Language through a prism: A spectral approach for multiscale language representations. In NeurIPS, volume 33, 2020.

[10] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In NeurIPS, volume 30, 2017.

[11] Verma, P. and Pilanci, M. Towards signal processing in large language models. arXiv preprint arXiv:2406.10254, 2024. (arXiv)

[12] Zeghidour, N., Teboul, O., de Chaumont Quitry, F., and Tagliasacchi, M. LEAF: A Learnable Frontend for Audio Classification. In ICLR, 2021.
