
arxiv: 2606.04106 · v1 · pith:VAOTGGQTnew · submitted 2026-06-02 · 💻 cs.LG · cs.AI

Building The Ph(ysical)AI Layer Of Machine Intelligence

Pith reviewed 2026-06-28 10:34 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: cross-modal transfer · RF data · principle-driven models · frozen encoder · signal principles · foundation models · physical AI

The pith

A model trained only on radio-frequency data with embedded physical principles transfers to audio, images, text, and video using a frozen encoder and linear probes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that foundation models can achieve cross-domain generalization by encoding signal-theoretic principles such as Fourier decomposition, energy conservation, and symmetry rather than learning purely statistical correlations from diverse paired data. It tests the idea by training exclusively on RF signals and then applying the frozen representations to tasks in other modalities. The resulting 1.99M-parameter encoder reaches 77.7 percent average accuracy across 15 tasks, with higher performance on physically grounded problems than on semantic ones. This positions principle-driven training as a route to efficient transfer that does not require fine-tuning or domain-specific paired examples.

Core claim

Training exclusively on radio-frequency data with an architecture and losses that incorporate Fourier decomposition, energy conservation, and symmetry produces representations that transfer to audio, images, text, and video domains using only a frozen encoder and linear probing, without any fine-tuning on the target domains.
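
The frozen-encoder-plus-linear-probe protocol in the core claim can be sketched on toy data. This is an illustration only, not the paper's setup: `frozen_encoder` is a hypothetical fixed feature map standing in for the RF-pretrained encoder, the inputs and labels are synthetic, and the probe is a plain logistic regression fitted by gradient descent. Only the probe weights `w` and `b` are ever updated.

```python
import math
import random

def frozen_encoder(x):
    """Hypothetical stand-in for a pretrained encoder: fixed, never trained here."""
    return [x[0], x[1], x[0] * x[1]]

def train_linear_probe(features, labels, epochs=200, lr=0.5):
    """Fit logistic-regression weights on frozen features (binary labels 0/1)."""
    dim = len(features[0])
    n = len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for f, y in zip(features, labels):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y
            for i in range(dim):
                grad_w[i] += err * f[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_accuracy(w, b, features, labels):
    """Fraction of examples whose predicted class matches the label."""
    correct = 0
    for f, y in zip(features, labels):
        z = sum(wi * fi for wi, fi in zip(w, f)) + b
        correct += int((z > 0) == (y == 1))
    return correct / len(labels)

# Synthetic XOR-like task: not linearly separable in the raw inputs,
# but separable in the (assumed) frozen feature space.
rng = random.Random(0)
inputs = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(400)]
labels = [1 if x0 * x1 > 0 else 0 for x0, x1 in inputs]
features = [frozen_encoder(x) for x in inputs]  # encoder parameters are untouched

w, b = train_linear_probe(features[:300], labels[:300])
accuracy = probe_accuracy(w, b, features[300:], labels[300:])
print(f"held-out linear-probe accuracy: {accuracy:.2f}")
```

The design point the sketch makes is that probe accuracy measures what the frozen features already expose linearly; nothing about the target task can flow back into the encoder.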

What carries the argument

The principle-driven RF encoder that embeds Fourier decomposition, energy conservation, and symmetry to support learnable transformations across modalities.
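
For readers unfamiliar with the energy-conservation principle named here, its standard signal-processing form is Parseval's theorem: a signal's energy is the same whether summed over time samples or over Fourier coefficients. The check below is a minimal sketch of that identity on random complex samples; it uses a naive DFT written out by hand and is not taken from the paper's architecture or losses.

```python
import cmath
import random

def dft(x):
    """Naive O(N^2) discrete Fourier transform (unnormalized)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def energy(values):
    """Sum of squared magnitudes."""
    return sum(abs(v) ** 2 for v in values)

rng = random.Random(0)
# Toy complex (IQ-style) samples; purely synthetic.
signal = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(64)]
spectrum = dft(signal)

time_energy = energy(signal)
freq_energy = energy(spectrum) / len(signal)  # 1/N factor for the unnormalized DFT
relative_gap = abs(time_energy - freq_energy) / time_energy
print(f"relative energy gap: {relative_gap:.2e}")
```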

Load-bearing premise

Domains differ not in fundamental physics but in learnable transformations in time, frequency, magnitude, or phase.

What would settle it

If linear probes applied to the frozen RF encoder yield accuracy no better than chance on a new modality such as video classification, the cross-modal transfer result would not hold.
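
One way to make "no better than chance" operational is a one-sided binomial test against the uniform-guess rate. The sketch below is an assumption about how such a check could be run, not a procedure from the paper. The 36.9% and 11.1% figures come from the Figure 3 caption; the 9-class count is inferred from the 11.1% chance rate, and the 1,000-example test set is hypothetical because the page does not state a test-set size.

```python
import math

def chance_accuracy(num_classes):
    """Accuracy of uniform random guessing."""
    return 1.0 / num_classes

def log_binom_pmf(k, n, p):
    """Log of the Binomial(n, p) probability mass at k, computed in log space."""
    return (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
    )

def binomial_tail(num_correct, num_trials, p):
    """P(X >= num_correct) for X ~ Binomial(num_trials, p)."""
    total = sum(
        math.exp(log_binom_pmf(k, num_trials, p))
        for k in range(num_correct, num_trials + 1)
    )
    return min(1.0, total)

def beats_chance(num_correct, num_trials, num_classes, alpha=0.01):
    """True if the observed accuracy is significantly above chance."""
    p_value = binomial_tail(num_correct, num_trials, chance_accuracy(num_classes))
    return p_value < alpha

# Hypothetical test-set size of 1,000 examples (not stated in the source).
above = beats_chance(num_correct=369, num_trials=1000, num_classes=9)
at_chance = beats_chance(num_correct=111, num_trials=1000, num_classes=9)
print(above, at_chance)
```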

Figures

Figures reproduced from arXiv: 2606.04106 by Brooks Olney, Daniel Capecci, Liam Smith, Pooya Khorrami, Sage Trudeau, Steven Kusiak, Ulbert Jose Botero, Watson Jia.

Figure 1: Multi-Head Parseval Focus architecture. (a) Multi-Head Orthogonal Parseval Focus fuses in-domain (time-time, freq-freq) and cross-domain (time-freq, freq-time) focus via gated linear units for comprehensive signal analysis. (b) Parseval Scaled Covariance Focus: Cross-domain transformations (FFT/IFFT) with covariance-based attention. JSD between bidirectional distributions enforces Parseval consistent gatin…
Figure 2: Time-Frequency Joint Embedding Predictive Architecture for Invariant and Equivariant
Figure 3: Learned representation quality. RF modulation t-SNE and confusion matrix show tight clustering for physical task. FashionMNIST shows structural clustering with within-category confusion. Zero-shot image reconstruction: reconstructions (middle) preserve structural coherence and spatial relationships from originals (top). structure. We achieve 36.9% top-1 accuracy (vs. 11.1% random chance) and 71.8% top-3 a…
Figure 4: Full Convolutional Tokenizer Block Diagram
Figure 5: Construction of the Noise Sink: (1) Noise Estimation (2) Noise Subtraction (3) Regularization
Figure 6: Causal Cross Window Focus Block
Figure 7: High Level Block Diagram of PlanFormer Encoder: Starting with domain specific trans
Figure 8: Random weight baseline for reconstruction. Same architecture as trained model, but with random weight initialization. Left: Original image. Right: Reconstruction with random weights: complete failure producing uniform gray field. This validates that learned representations (not architectural biases or skip connections) drive reconstruction quality. Compare to
Figure 9: Learned representation structure: Physical vs. Semantic tasks. t-SNE visualizations of frozen encoder representations across six representative tasks. Left Column (Physical tasks): (a) RF Fingerprinting shows well structured emitter representations (b) Bilingual speaker recognition demonstrates clear speaker-specific clusters consistent across languages. (c) Instrument family classification shows distinct …
Figure 10: Interpretable confusion patterns across semantic spectrum. Confusion matrices for three representative 1D tasks. Top: Music genre classification shows confusion between physically similar genres (rock/metal, classical/jazz) that share instrumentation but differ in cultural context. Middle: Individual instrument classification shows within-family confusion (brass instruments cluster, strings cluster), demo…
Figure 11: Zero-shot image reconstructions from encoder-decoder system trained exclusively on
Original abstract

Foundation models achieve generalization through massive-scale training on diverse data, but have limitations with transfer to truly unseen domains without paired training data. We propose principle-driven foundation models that encode signal-theoretic principles (Fourier decomposition, energy conservation, symmetry) rather than learn untethered statistical correlations. We hypothesize that domains differ not in fundamental physics, but in learnable transformations in time, frequency, magnitude, or phase. Training exclusively on radio-frequency (RF) data with co-designed architecture and losses incorporating these principles, we achieve cross-modal transfer to audio, images, text, and video using only frozen representations learned from RF data, requiring no fine-tuning of the encoder on target domains. Our 1.99M parameter frozen encoder achieves 77.7% average accuracy (91.9% top-3) across 15 diverse tasks via linear probing, with systematic variation: 84.5% on physically-grounded tasks (speaker recognition, seismology, RF fingerprinting) versus 70.0% on semantic tasks (music genre, language recognition). This reveals that principle-driven and scale-driven approaches offer complementary paths: physical principles enable efficient cross-modal transfer while naturally establishing the boundary between physical and semantic understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes principle-driven foundation models that encode signal-theoretic principles (Fourier decomposition, energy conservation, symmetry) rather than untethered statistical correlations. Training a 1.99M-parameter encoder exclusively on radio-frequency (RF) data with co-designed architecture and losses, the authors claim cross-modal transfer to audio, images, text, and video using only frozen RF representations (no encoder fine-tuning on target domains). They report 77.7% average accuracy (91.9% top-3) across 15 tasks via linear probing, with 84.5% on physically-grounded tasks versus 70.0% on semantic tasks, arguing this reveals complementary physical-principle and scale-driven paths.
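
The three headline numbers can be checked against each other with simple arithmetic. The sketch below assumes, without confirmation from the source, that the 77.7% figure is an unweighted mean over the 15 tasks and that the physical and semantic groups partition those tasks; under those assumptions it searches for the integer split that best reproduces the reported average.

```python
def implied_split(total_tasks, overall, group_a, group_b):
    """Return (n_a, n_b, mean) for the integer split whose unweighted mean
    of per-group accuracies is closest to the reported overall accuracy."""
    best = None
    for n_a in range(total_tasks + 1):
        n_b = total_tasks - n_a
        mean = (n_a * group_a + n_b * group_b) / total_tasks
        gap = abs(mean - overall)
        if best is None or gap < best[0]:
            best = (gap, n_a, n_b, mean)
    _, n_a, n_b, mean = best
    return n_a, n_b, mean

# Reported values: 15 tasks, 77.7% overall, 84.5% physical, 70.0% semantic.
n_physical, n_semantic, implied_mean = implied_split(15, 77.7, 84.5, 70.0)
print(n_physical, n_semantic, round(implied_mean, 2))
```

Under these assumptions the figures are mutually consistent with 8 physical and 7 semantic tasks, since (8 × 84.5 + 7 × 70.0) / 15 ≈ 77.73. The page itself does not enumerate the split, so this is an inference rather than a reported fact.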

Significance. If the central claim holds, the work would demonstrate that embedding physical principles enables efficient cross-modal generalization from a single small-scale modality (RF) without paired data or fine-tuning, while naturally separating physical from semantic understanding. The small parameter count, explicit accuracies, and systematic task variation are concrete strengths that could be reproduced and tested.

major comments (2)
  1. [Abstract / Methods] Abstract and methods (input mapping description): the central claim that representations are 'only frozen representations learned from RF data' with 'no fine-tuning of the encoder on target domains' is load-bearing. The required cross-modal input transformations (time/frequency/magnitude/phase mappings) must be shown to derive exclusively from RF pretraining or physics-only rules; any hand-crafted or target-tuned component would violate the 'RF-only' condition and the hypothesis that domains differ only in learnable transformations.
  2. [Results] Results section (task breakdown and ablations): the reported 77.7% average and the 84.5% vs 70.0% split between physical and semantic tasks support the boundary claim only if ablations isolate the contribution of each principle (Fourier, energy conservation, symmetry) and confirm that linear-probe performance does not rely on modality-specific preprocessing that leaks target-domain information.
minor comments (2)
  1. [Title] Title: the parenthetical 'Ph(ysical)AI' is informal; expand or remove for a formal journal submission.
  2. [Results] The 15 tasks and exact linear-probe protocol should be enumerated with references to standard benchmarks to allow direct comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below with clarifications drawn directly from the manuscript's methodology and commit to revisions that strengthen the presentation without altering the core claims.

Point-by-point responses
  1. Referee: [Abstract / Methods] Abstract and methods (input mapping description): the central claim that representations are 'only frozen representations learned from RF data' with 'no fine-tuning of the encoder on target domains' is load-bearing. The required cross-modal input transformations (time/frequency/magnitude/phase mappings) must be shown to derive exclusively from RF pretraining or physics-only rules; any hand-crafted or target-tuned component would violate the 'RF-only' condition and the hypothesis that domains differ only in learnable transformations.

    Authors: The transformations are derived from the signal-theoretic principles (Fourier decomposition, energy conservation, symmetry) encoded in the co-designed architecture and losses during exclusive RF pretraining. These are universal physics rules applied uniformly rather than modality-specific or target-tuned components. We will revise the methods section to include an explicit derivation and pseudocode showing that the mappings follow fixed RF-physics rules with no per-target adjustment, thereby reinforcing that the encoder remains strictly frozen and RF-only.
    Revision: partial

  2. Referee: [Results] Results section (task breakdown and ablations): the reported 77.7% average and the 84.5% vs 70.0% split between physical and semantic tasks support the boundary claim only if ablations isolate the contribution of each principle (Fourier, energy conservation, symmetry) and confirm that linear-probe performance does not rely on modality-specific preprocessing that leaks target-domain information.

    Authors: We agree that explicit ablations isolating each principle's contribution would provide stronger support for the boundary claim. The manuscript already reports the physical-versus-semantic performance gap as evidence of complementary understanding, but we will add the requested ablation studies and preprocessing-consistency analysis in the revised results section to confirm that no target-domain information is leaked and that performance derives from the RF-learned principles.
    Revision: yes

Circularity Check

0 steps flagged

Empirical performance result with no derivation chain reducing to inputs

full rationale

The paper reports a measured outcome: training a 1.99M-parameter encoder exclusively on RF data with co-designed architecture and losses, then obtaining 77.7% average accuracy via linear probing on 15 tasks across audio/images/text/video without encoder fine-tuning. No equations, uniqueness theorems, or fitted-parameter renamings are presented that equate any claimed prediction to its own inputs by construction. The hypothesis that domains differ only in learnable time/frequency/magnitude/phase transformations is stated as an assumption but is not invoked to force the reported accuracies; the result remains an external benchmark measurement rather than a self-referential identity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that physical principles suffice to bridge modalities once transformations are learned; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • Domain assumption: Domains differ not in fundamental physics, but in learnable transformations in time, frequency, magnitude, or phase.
    Presented as the motivating hypothesis that justifies training exclusively on RF data.




Appendix A: Encoder Architecture

A.1 PlanFormer Encoder Architecture Overview: The PlanFormer encoder is designed to embed signal-theoretic principl...

  • Parseval Transformer blocks: Physics-informed attention mechanisms that we term "Focus" based on our dynamic attention sharpening machinery.
  • Attentional pooling: Fixed-size latent representations per domain via pooling over variable token sequence lengths.

Key architectural parameters:
  • Input length: 5120 total samples (minimum), 1024 samples per window, 5 windows total
  • Embedding dimension: 128 after IQ interleaving for complex tokenization
  • Number of transformer blocks: 1
  • Number of attention he...

This enables seamless extension of our architecture originally developed for complex valued domains to real valued domains.

  63. [63]

    A feature learned at position t in one window transfers to positiontin other windows, reducing redundancy and parameter requirements

    Translational equivariance:Convolution’s translational equivariance property minimizes the need for overlapping windows. A feature learned at position t in one window transfers to positiontin other windows, reducing redundancy and parameter requirements

  64. [64]

    By processing windows of length W, attention cost per window is O(W²), and total cost across N_w = N/W windows is O(N_w · W²) = O(N · W), which is linear in sequence length

    Computational efficiency: Attention mechanisms scale quadratically with sequence length (O(N²)). By processing windows of length W, attention cost per window is O(W²), and total cost across N_w = N/W windows is O(N_w · W²) = O(N · W), which is linear in sequence length. For W ≪ N, this provides substantial savings
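    As a worked instance of this cost argument, the sketch below counts pairwise attention scores for the input length and window size listed under the key architectural parameters (N = 5120, W = 1024). The function names are hypothetical, and the count ignores constant factors such as the number of heads and the embedding dimension.

```python
def full_attention_cost(n: int) -> int:
    """Number of pairwise scores for self-attention over n positions."""
    return n * n


def windowed_attention_cost(n: int, w: int) -> int:
    """Number of pairwise scores when attention runs independently inside
    each of the n // w non-overlapping windows of length w."""
    if n % w != 0:
        raise ValueError("this sketch assumes w divides n")
    num_windows = n // w
    return num_windows * w * w  # N_w * W^2, which equals N * W


n, w = 5120, 1024  # input length and window size quoted in the text
full = full_attention_cost(n)
windowed = windowed_attention_cost(n, w)
print(full, windowed, full // windowed)
```

    The saving factor is N/W, so it grows as the sequence gets longer relative to the window.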

  65. [65]

    Reconstruction efficiency: In the decoder, operating on windowed representations enables localized reconstruction with manageable memory footprints, particularly important for long sequences (N > 10,000).

    Addressing Long-Range Dependencies: The primary drawback of windowed processing is that convolutions operate in isolation within each window, potentially m...

  66. [66]

    Domain-specific positional encodings enable the network to learn causal phase relationships (time domain) and time-varying spectral evolution (frequency domain)

    Causal Cross-Window Focus (Section A.4.2.3): Explicit attention between consecutive windows’ tokenized representations models inter-window dependencies. Domain-specific positional encodings enable the network to learn causal phase relationships (time domain) and time-varying spectral evolution (frequency domain)

  67. [67]

    This captures long-range dependencies that span multiple windows while maintaining the benefits of localized spectral analysis

    Parseval Transformer (Section A.6): After tokenization, all window tokens are processed jointly through transformer blocks, enabling global attention across the entire sequence. This captures long-range dependencies that span multiple windows while maintaining the benefits of localized spectral analysis. In summary, windowed processing provides the best of...

  68. [68]

    We reshape the frequency-domain representation [B, F, T] (where F represents frequency bins) into token format [B, T/2, 2F], where each token encodes a complex-valued frequency representation

  69. [69]

    We apply a learned 1×1 convolution with stride s in token space, which effectively learns to compress the frequency spectrum by selecting which frequency bins to preserve before downsampling

  70. [70]

    when the 100 Hz component is strong in the previous window, the 200 Hz component tends to be strong in the current window,

    For time-domain processing, we convert back via IFFT, ensuring the temporal signal has reduced bandwidth appropriate for the lower sampling rate. This approach implements a learned, adaptive low-pass filter in the frequency domain, preventing aliasing while preserving task-relevant spectral components.

    $$T_{\downarrow} = \mathrm{Conv}^{(b)}_{1\times 1,\,s}(T) \in \mathbb{R}^{B \times N_{\mathrm{down}} \times (2C)} \tag{22}$$

    where $N_{\mathrm{down}}$...

  71. [71]

    Causality: Information flows strictly from past to present, enabling online/streaming processing

  72. [72]

    Efficiency: We avoid the O(L²) complexity of full-sequence self-focus, instead computing O((L/s)²) focus within compressed windows

  73. [73]

    Interpretability: Focus weights reveal how the model uses past context to inform current predictions, facilitating analysis of learned temporal dependencies.

    A.4.2.4 Spectral Compression Via Frequency Domain Pooling

    Pooling operations are ubiquitous in modern deep learning architectures, serving to reduce computational costs while learning abstract featur...

  74. [74]

    De-interleave: Convert the real-valued interleaved representation (where adjacent elements per channel represent real and imaginary components) to explicit complex format

  75. [75]

    FFT (if time-domain): If the input is in the time domain, transform to frequency domain via FFT

  76. [76]

    IFFT (if time-domain): If the original input was time-domain, transform back via IFFT

    Average Pool: Apply average pooling along the frequency axis, reducing the sequence length by a factor of r (typically r ∈ {2, 4}). 4. IFFT (if time-domain): If the original input was time-domain, transform back via IFFT

  77. [77]

    frequency hop at 100 Hz in this window

    Re-Interleave: Convert the complex-valued sequence back to a real-valued interleaved sequence for subsequent real-valued processing blocks. This produces a representation that retains the spectral envelope of the higher-resolution signal but at reduced sequence length. Crucially, this operation preserves both magnitude and phase information in a coarsened fo...
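    The five steps (de-interleave, FFT, average pool, IFFT, re-interleave) can be sketched for a single channel as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: the function names are hypothetical, a naive O(n²) DFT stands in for a library FFT so the example needs only the standard library, and the bin ordering and normalization used in the actual encoder are not specified in the text above.

```python
import cmath


def dft(x, inverse=False):
    """Naive DFT / inverse DFT; a stand-in for a library FFT in this sketch."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
               for t in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out


def deinterleave(v):
    """[I0, Q0, I1, Q1, ...] -> [I0 + jQ0, I1 + jQ1, ...]."""
    return [complex(v[i], v[i + 1]) for i in range(0, len(v), 2)]


def reinterleave(z):
    """[I0 + jQ0, ...] -> [I0, Q0, ...]."""
    out = []
    for c in z:
        out.extend((c.real, c.imag))
    return out


def spectral_pool(interleaved, r=2, time_domain=True):
    """Reduce the sequence length by a factor r via frequency-axis averaging."""
    z = deinterleave(interleaved)                     # step 1
    if len(z) % r != 0:
        raise ValueError("this sketch assumes r divides the sequence length")
    spec = dft(z) if time_domain else z               # step 2
    pooled = [sum(spec[i:i + r]) / r                  # step 3: mean of r bins
              for i in range(0, len(spec), r)]
    out = dft(pooled, inverse=True) if time_domain else pooled  # step 4
    return reinterleave(out)                          # step 5


# Hypothetical input: 8 complex samples stored as 16 interleaved reals.
signal = [1.0, 0.0] * 8
pooled_signal = spectral_pool(signal, r=2)
print(len(signal), len(pooled_signal))
```

    Averaging is applied to the complex bins rather than to their magnitudes, which is why both magnitude and phase survive in coarsened form.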

  78. [78]

    Complex-valued structure preservation: Each token position corresponds to a complex sample (I/Q pair), enabling the transformer to model relationships between complex samples rather than treating I and Q components as independent entities

  79. [79]

    A.5.2 Cross-Domain Information Fusion Once tokenized, we leverage the complementary nature of time and frequency domain representations

    Computational efficiency: Halving the sequence length reduces the quadratic complexity of self-attention from O(L²) to O((L/2)²) = O(L²/4), providing a 4× reduction in attention computation cost.

    A.5.2 Cross-Domain Information Fusion

    Once tokenized, we leverage the complementary nature of time and frequency domain representations. Comprehensive signal anal...

  80. [80]

    Inter-block: Between each Parseval Transformer block (currently one block in our architecture)
