pith. machine review for the scientific record.

arxiv: 2604.06129 · v1 · submitted 2026-04-07 · 💻 cs.CV · cs.AI

Recognition: 1 theorem link · Lean Theorem

PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer

Arijit Ghosh, Corentin Sautier, Davide Allegro, David Picard, Elliot Vincent, Fei Meng, Loic Landrieu, Lucas Degeorge, Marta López-Rauhut, Nicolas Dufour, Raphael Baena, Ségolène Albouy, Syrine Kalleli, Thibaut Loiseau, Tom Ravaud, Yohann Perron, Zeynep Sonat Baltaci

Pith reviewed 2026-05-10 19:16 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI
keywords Polynomial Mixer · token mixing · self-attention replacement · linear complexity · transformer architecture · contextual mapping · universal approximation · sequence modeling

The pith

A learned polynomial aggregates tokens into a compact representation from which each token retrieves context, replacing self-attention at linear cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Polynomial Mixer as a drop-in replacement for self-attention in transformers. It works by using a learned polynomial to combine all input tokens into one compact form, then letting each token pull the contextual information it needs from that form. The authors prove that this mechanism still satisfies the contextual mapping property, so transformers built with it remain universal sequence-to-sequence approximators. Experiments in text generation, handwritten text recognition, image generation, 3D modeling, and Earth observation show performance matching standard attention while cutting computational cost for long sequences.

Core claim

PoM aggregates input tokens into a compact representation through a learned polynomial function, from which each token retrieves contextual information. This satisfies the contextual mapping property and therefore preserves the universal approximation capability of the transformer. When substituted for self-attention, PoM produces models that match attention-based performance across five domains while running in linear time.

What carries the argument

The Polynomial Mixer, a token-mixing operation that aggregates inputs via a learned polynomial into a compact representation and lets tokens retrieve context from it.
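To make the mechanism concrete, here is a minimal sketch of the general idea in PyTorch. This is an illustration of polynomial token mixing, not the authors' exact formulation: the module name, the monomial pooling, and the gated read-out are all assumptions made for exposition; the official implementation lives at https://github.com/davidpicard/pom.

```python
# Illustrative sketch only: pool per-token monomials into a compact
# "moment" vector, then let every token read context back out of it.
# Cost is O(seq_len * dim) because the sequence is touched once.
import torch
import torch.nn as nn


class PolynomialMixerSketch(nn.Module):  # hypothetical name
    def __init__(self, dim: int, degree: int = 2):
        super().__init__()
        self.degree = degree
        self.proj_in = nn.Linear(dim, dim)            # features to aggregate
        self.proj_read = nn.Linear(dim, dim)          # per-token retrieval gate
        self.proj_out = nn.Linear(degree * dim, dim)  # mix pooled moments

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        h = self.proj_in(x)
        # Elementwise monomials h, h**2, ..., h**degree, each averaged over
        # the sequence: the learned-polynomial aggregation step.
        moments = torch.cat(
            [(h ** k).mean(dim=1) for k in range(1, self.degree + 1)], dim=-1
        )                                              # (batch, degree * dim)
        context = self.proj_out(moments)               # compact representation
        # Retrieval step: each token gates the shared context by its own state.
        gate = torch.sigmoid(self.proj_read(x))
        return gate * context.unsqueeze(1)             # (batch, seq_len, dim)
```

Because the pooled moments have a fixed size regardless of sequence length, the compute and memory of the mixing step grow linearly with the number of tokens, which is the property the paper's efficiency figures measure.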

If this is right

  • Transformers can process longer sequences without the quadratic cost of attention.
  • The same model architectures and training procedures remain valid when attention is replaced by PoM.
  • Performance parity holds across text, image, 3D, and remote-sensing domains.
  • The universality guarantee means no fundamental loss of modeling capacity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same aggregation idea could be tested in other sequence models that currently rely on attention for mixing.
  • If the polynomial degree can be kept small while retaining performance, further speed-ups become possible on hardware optimized for dense operations.
  • Replacing only the mixing step leaves the rest of the transformer unchanged, so existing scaling recipes and optimizers transfer directly (a drop-in sketch follows this list).
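On that last point, a hedged sketch of what the drop-in swap looks like: a generic pre-norm transformer block where only the token-mixing sublayer is a parameter. The block structure below is standard boilerplate, not code from the paper; any module mapping (batch, seq_len, dim) to (batch, seq_len, dim), such as the PolynomialMixerSketch above, would slot into `mixer`.

```python
# Generic pre-norm transformer block: only the mixing sublayer changes
# when attention is swapped for a PoM-style module; norms, MLP, and
# residual connections stay exactly as they were.
import torch.nn as nn


class BlockWithMixer(nn.Module):  # illustrative, not the paper's code
    def __init__(self, dim: int, mixer: nn.Module):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mixer = mixer  # self-attention or a linear-time replacement
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))  # the only sublayer that changes
        x = x + self.mlp(self.norm2(x))
        return x
```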

Load-bearing premise

The learned polynomial can aggregate and retrieve the necessary contextual information across tasks without losing the expressive power that the universality proof assumes.

What would settle it

A sequence-to-sequence task where a transformer equipped with PoM fails to approximate the target mapping that a standard attention transformer can learn, or a long-sequence benchmark where PoM runtime grows quadratically instead of linearly.
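As a rough version of the runtime half of that test (an assumption-laden sketch, not the paper's benchmark), one can time a quadratic baseline at doubling sequence lengths and inspect the growth factor: a linear-time mixer timed the same way should roughly double per doubling, while attention should roughly quadruple, as in the Figure 1 measurements below.

```python
# Crude scaling check: wall-clock time versus sequence length for
# standard scaled-dot-product attention. Swap in a linear-time mixer
# and compare growth factors. On GPU, add torch.cuda.synchronize()
# before reading the clock for accurate timings.
import time
import torch
import torch.nn.functional as F


def avg_time(fn, x, reps: int = 10) -> float:
    fn(x)  # warm-up
    t0 = time.perf_counter()
    for _ in range(reps):
        fn(x)
    return (time.perf_counter() - t0) / reps


dim = 64
for n in (1024, 2048, 4096, 8192):
    x = torch.randn(1, n, dim)
    t = avg_time(lambda q: F.scaled_dot_product_attention(q, q, q), x)
    print(f"seq_len={n:5d}: attention {t * 1e3:8.2f} ms")
```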

Figures

Figures reproduced from arXiv: 2604.06129 by Arijit Ghosh, Corentin Sautier, Davide Allegro, David Picard, Elliot Vincent, Fei Meng, Loic Landrieu, Lucas Degeorge, Marta López-Rauhut, Nicolas Dufour, Raphael Baena, Ségolène Albouy, Syrine Kalleli, Thibaut Loiseau, Tom Ravaud, Yohann Perron, Zeynep Sonat Baltaci.

Figure 1
Figure 1: Polynomial Mixer (PoM). Inference time on an H100 as a function of sequence length, shown for the PoM module alone (a) and within a full Transformer pipeline (b). PoM scales nearly linearly and is significantly faster than self-attention for long sequences, even with FlashAttention. As measured for NLP, OCR, 3D point cloud segmentation, Earth Observation analysis, and image generation (c), replacing atten… view at source ↗
Figure 3
Figure 3: Evaluation Across Domains. We evaluate PoM on various tasks from multiple domains, simply replacing some or all of the self-attention blocks in SOTA Transformer-based models. Left: For OCR, given a page of handwritten text (top) split into lines, the goal is to recognize each character and group them into words (bottom). Middle: For Earth observation, given a time series of satellite images (top), the goal is… view at source ↗
Figure 4
Figure 4: Computational Efficiency. We report the speed at generating images for various resolutions for SiT-XL/2 and its PoM counterpart SiPoM-XL/2. … a limit imposed by the quadratic cost of attention and GPU memory constraints. We investigate two PoM-based replacements for the attention modules: (i) PPoMv3, which replaces all MHA layers with PoM while keeping the same context length, and (ii) PPoMv3-Hybrid, which… view at source ↗
Figure 5
Figure 5: Qualitative Results on Class-Conditional Generation. Images sampled from SiPoM-XL/2 with different classes at 256 resolution. We use classifier-free guidance with ω = 6. view at source ↗

  Model        #Param (×10⁶)   FID↓   Throughput (img/s)
  SiT-L/2      458             18.8   9842
  SiPoM-L/2    414             17.6   11,954
  SiT-XL/2     675             17.2   4041
  SiPoM-XL/2   609             17.2   8146
Figure 6
Figure 6: Uncurated 256² images for the class loggerhead, loggerhead turtle, Caretta caretta (33). view at source ↗
Figure 7
Figure 7: Uncurated 256² images for the class macaw (88). view at source ↗
Figure 8
Figure 8: Uncurated 256² images for the class otter (360). view at source ↗
Figure 9
Figure 9: Uncurated 256² images for the class volcano (980). view at source ↗
Original abstract

This paper introduces the Polynomial Mixer (PoM), a novel token mixing mechanism with linear complexity that serves as a drop-in replacement for self-attention. PoM aggregates input tokens into a compact representation through a learned polynomial function, from which each token retrieves contextual information. We prove that PoM satisfies the contextual mapping property, ensuring that transformers equipped with PoM remain universal sequence-to-sequence approximators. We replace standard self-attention with PoM across five diverse domains: text generation, handwritten text recognition, image generation, 3D modeling, and Earth observation. PoM matches the performance of attention-based models while drastically reducing computational cost when working with long sequences. The code is available at https://github.com/davidpicard/pom.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces the Polynomial Mixer (PoM) as a linear-complexity drop-in replacement for self-attention. PoM aggregates input tokens into a compact representation via a learned polynomial function, from which each token retrieves contextual information. The authors prove that PoM satisfies the contextual mapping property, ensuring transformers using PoM remain universal sequence-to-sequence approximators. They evaluate PoM by replacing attention in five domains (text generation, handwritten text recognition, image generation, 3D modeling, Earth observation), reporting performance parity with attention models at linear cost for long sequences. Code is released.

Significance. If the proof and results hold, this is a significant contribution to efficient transformer design, offering a scalable alternative for long-sequence tasks while preserving theoretical universality. The explicit proof of the contextual mapping property and multi-domain validation are notable strengths; open-sourced code aids reproducibility.

major comments (2)
  1. [Theoretical analysis / contextual mapping proof] Proof of contextual mapping property: the central universality claim depends on this property holding for the learned polynomial aggregation and retrieval. The manuscript should provide an explicit derivation or bounds showing that finite-degree polynomials do not restrict the representable contextual mappings, particularly addressing how coefficient learning interacts with the property.
  2. [Experiments / results across domains] Empirical evaluation sections: performance is reported to match attention baselines across five domains, but without details on run counts, variance, hyperparameter search, or ablations on polynomial degree, it is difficult to confirm that the linear-cost replacement incurs no hidden accuracy loss, undermining the 'matches performance' claim.
minor comments (3)
  1. [Related work] Related work: add explicit positioning against other linear-time attention variants (e.g., Performer, Linformer) to clarify novelty.
  2. [Method / PoM definition] Notation and definitions: the polynomial function and aggregation/retrieval steps should be formalized with explicit equations early in the main text for reader clarity.
  3. [Figures] Figure captions: ensure visual diagrams of the PoM mechanism include labels for polynomial degree and token flow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Theoretical analysis / contextual mapping proof] Proof of contextual mapping property: the central universality claim depends on this property holding for the learned polynomial aggregation and retrieval. The manuscript should provide an explicit derivation or bounds showing that finite-degree polynomials do not restrict the representable contextual mappings, particularly addressing how coefficient learning interacts with the property.

    Authors: We appreciate the referee's emphasis on making the theoretical guarantees fully explicit. Section 3.2 of the manuscript already contains a proof that PoM satisfies the contextual mapping property for any finite polynomial degree, relying on the fact that polynomials are dense in continuous functions on compact sets (via Stone-Weierstrass; the classical statement is restated after these responses) and that coefficient learning allows the aggregation and retrieval steps to realize arbitrary mappings. To address the request for explicit derivation and bounds, we will add a corollary with approximation-error bounds and a step-by-step derivation of how learned coefficients interact with the property in the revised version. revision: yes

  2. Referee: [Experiments / results across domains] Empirical evaluation sections: performance is reported to match attention baselines across five domains, but without details on run counts, variance, hyperparameter search, or ablations on polynomial degree, it is difficult to confirm that the linear-cost replacement incurs no hidden accuracy loss, undermining the 'matches performance' claim.

    Authors: We agree that additional experimental details are necessary to fully substantiate the performance-parity claim. The revised manuscript will report the number of independent runs per experiment, include standard deviations or confidence intervals, describe the hyperparameter search protocol, and add an ablation study on polynomial degree (in an appendix) to demonstrate that accuracy remains stable across reasonable degrees without hidden costs. revision: yes
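For reference, the density fact invoked in response 1 is the classical Weierstrass approximation theorem (the polynomial case of Stone-Weierstrass), stated here in standard textbook form rather than quoted from the manuscript:

```latex
% Polynomial case of Stone--Weierstrass: polynomials are uniformly
% dense in the continuous functions on any compact set.
\begin{theorem}[Weierstrass approximation]
Let $K \subset \mathbb{R}^d$ be compact and let $f \colon K \to \mathbb{R}$
be continuous. For every $\varepsilon > 0$ there exists a polynomial $p$
such that
\[
  \sup_{x \in K} \lvert f(x) - p(x) \rvert < \varepsilon .
\]
\end{theorem}
```

Note that this theorem guarantees density only as the degree is allowed to grow, so the rebuttal's claim about a fixed finite degree is not settled by it alone; that gap is precisely what the promised corollary with approximation-error bounds would need to close.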

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper introduces PoM as a new token-mixing mechanism defined via a learned polynomial aggregation followed by per-token retrieval. It supplies an explicit proof that this construction satisfies the contextual mapping property, from which universality of the resulting transformer follows directly. No step reduces a claimed prediction or first-principles result to a fitted parameter, self-citation, or definitional renaming of the target quantity. The five-domain empirical replacements are presented as independent validation rather than forced outputs of the same fit. The derivation chain therefore remains non-circular.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that a learned polynomial can serve as a sufficient aggregator for contextual information while preserving universality; this is treated as a domain assumption rather than derived from first principles.

free parameters (1)
  • polynomial coefficients
    Coefficients of the learned polynomial function are fitted during training to enable token aggregation and retrieval.
axioms (1)
  • domain assumption: A polynomial function can encode sufficient contextual mappings for sequence-to-sequence tasks
    Invoked to support the universality claim and drop-in replacement property.

pith-pipeline@v0.9.0 · 5499 in / 1344 out tokens · 67175 ms · 2026-05-10T19:16:55.919967+00:00 · methodology


Reference graph

Works this paper leans on

83 extracted references · 17 canonical work pages · 11 internal anchors

  1. [1] GPT-4 Technical Report
     Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv:2303.08774, 2023

  2. [2] wav2vec 2.0: A framework for self-supervised learning of speech representations
     Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. NeurIPS, 2020

  3. [3] Fixed point diffusion models
     Xingjian Bai and Luke Melas-Kyriazi. Fixed point diffusion models. In CVPR, 2024

  4. [4] eDiff-I: Text-to-image diffusion models with ensemble of expert denoisers
     Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, and Ming-Yu Liu. eDiff-I: Text-to-image diffusion models with ensemble of expert denoisers. arXiv:2211.01324, 2022

  5. [5] SemanticKITTI: A dataset for semantic scene understanding of LiDAR sequences
     Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Jurgen Gall. SemanticKITTI: A dataset for semantic scene understanding of LiDAR sequences. In ICCV, 2019

  6. [6] Reducing transformer key-value cache size with cross-layer attention
     William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, and Jonathan Ragan-Kelley. Reducing transformer key-value cache size with cross-layer attention. NeurIPS, 2024

  7. [7] The LAM dataset: A novel benchmark for line-level handwritten text recognition
     Silvia Cascianelli, Vittorio Pippi, Martin Maarand, Marcella Cornia, Lorenzo Baraldi, Christopher Kermorvant, and Rita Cucchiara. The LAM dataset: A novel benchmark for line-level handwritten text recognition. In ICPR, 2022

  8. [8] PixArt-Σ: Weak-to-strong training of diffusion transformer for 4K text-to-image generation
     Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-Σ: Weak-to-strong training of diffusion transformer for 4K text-to-image generation. In ECCV, 2024

  9. [9] FIT: Far-reaching interleaved transformers
     Ting Chen and Lala Li. FIT: Far-reaching interleaved transformers. arXiv:2305.12689, 2023

  10. [10] Generating long sequences with sparse transformers
      Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv:1904.10509, 2019

  11. [11] Self-supervised learning with random-projection quantizer for speech recognition
      Chung-Cheng Chiu, James Qin, Yu Zhang, Jiahui Yu, and Yonghui Wu. Self-supervised learning with random-projection quantizer for speech recognition. In International Conference on Machine Learning. PMLR, 2022

  12. [12] Think you have solved question answering? Try ARC, the AI2 reasoning challenge
      Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv:1803.05457, 2018

  13. [13] Scalable high-resolution pixel-space image synthesis with hourglass diffusion transformers
      Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z Kaplan, and Enrico Shippole. Scalable high-resolution pixel-space image synthesis with hourglass diffusion transformers. In Int. Conf. Mach. Learn., 2024

  14. [14] ScanNet: Richly-annotated 3D reconstructions of indoor scenes
      Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Niessner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In CVPR, 2017

  15. [15] FlashAttention-2: Faster attention with better parallelism and work partitioning
      Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. ICLR, 2024

  16. [16] Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality
      Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In Int. Conf. Mach. Learn., 2024

  17. [17] FlashAttention: Fast and memory-efficient exact attention with IO-awareness
      Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In NeurIPS, 2022

  18. [18] FlashAttention: Fast and memory-efficient exact attention with IO-awareness
      Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In NeurIPS, 2022

  19. [19] Moshi: A speech-text foundation model for real-time dialogue
      Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: A speech-text foundation model for real-time dialogue. arXiv:2410.00037, 2024

  20. [20] An image is worth 16x16 words: Transformers for image recognition at scale
      Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020

  21. [21] The Llama 3 herd of models
      Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv:2407.21783, 2024

  22. [22] Don't drop your samples! Coherence-aware training benefits conditional diffusion
      Nicolas Dufour, Victor Besnier, Vicky Kalogeiton, and David Picard. Don't drop your samples! Coherence-aware training benefits conditional diffusion. In CVPR, 2024

  23. [23] Scaling rectified flow transformers for high-resolution image synthesis
      Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Int. Conf. Mach. Learn., 2024

  24. [24] Lumina-T2X: Transforming text into any modality, resolution, and duration via flow-based large diffusion transformers
      Peng Gao, Le Zhuo, Ziyi Lin, Chris Liu, Junsong Chen, Ruoyi Du, Enze Xie, Xu Luo, Longtian Qiu, Yuhang Zhang, et al. Lumina-T2X: Transforming text into any modality, resolution, and duration via flow-based large diffusion transformers. arXiv:2405.05945, 2024

  25. [25] Zamba: A compact 7B SSM hybrid model
      Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, and Beren Millidge. Zamba: A compact 7B SSM hybrid model. arXiv:2405.16712, 2024

  26. [26] CommonCanvas: Open diffusion models trained on creative-commons images
      Aaron Gokaslan, A. Feder Cooper, Jasmine Collins, Landan Seguin, Austin Jacobson, Mihir Patel, Jonathan Frankle, Cory Stephenson, and Volodymyr Kuleshov. CommonCanvas: Open diffusion models trained on creative-commons images. In CVPR, 2024

  27. [27] Efficiently modeling long sequences with structured state spaces
      Albert Gu, Karan Goel, and Christopher Re. Efficiently modeling long sequences with structured state spaces. In ICLR, 2021

  28. [28] Combining recurrent, convolutional, and continuous-time models with linear state space layers
      Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers. In NeurIPS, 2021

  29. [29] Matryoshka diffusion models
      Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Joshua M Susskind, and Navdeep Jaitly. Matryoshka diffusion models. In ICLR, 2023

  30. [30] Improved noise schedule for diffusion training
      Tiankai Hang and Shuyang Gu. Improved noise schedule for diffusion training. ICCV, 2024

  31. [31] DiffiT: Diffusion vision transformers for image generation
      Ali Hatamizadeh, Jiaming Song, Guilin Liu, Jan Kautz, and Arash Vahdat. DiffiT: Diffusion vision transformers for image generation. In ECCV, 2024

  32. [32] Measuring massive multitask language understanding
      Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In ICLR, 2021

  33. [33] Denoising diffusion probabilistic models
      Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020

  34. [34] Long short-term memory
      Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997

  35. [35] ZigMa: A DiT-style zigzag Mamba diffusion model
      Vincent Tao Hu, Stefan Andreas Baumann, Ming Gui, Olga Grebenkova, Pingchuan Ma, Johannes Fischer, and Björn Ommer. ZigMa: A DiT-style zigzag Mamba diffusion model. In ECCV, 2024

  36. [36] Scalable adaptive computation for iterative generation
      Allan Jabri, David J Fleet, and Ting Chen. Scalable adaptive computation for iterative generation. In Int. Conf. Mach. Learn., 2023

  37. [37] Metric learning with HORDE: High-order regularizer for deep embeddings
      Pierre Jacob, David Picard, Aymeric Histace, and Edouard Klein. Metric learning with HORDE: High-order regularizer for deep embeddings. In ICCV, pages 6539–6548, 2019

  38. [38] Perceiver IO: A general architecture for structured inputs & outputs
      Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. Perceiver IO: A general architecture for structured inputs & outputs. In ICLR, 2022

  39. [39] Scaling laws for neural language models
      Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv:2001.08361, 2020

  40. [40] NanoGPT
      Andrej Karpathy. NanoGPT. https://github.com/karpathy/nanoGPT, 2022

  41. [42] Analyzing and improving the training dynamics of diffusion models
      Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. In CVPR, 2024

  42. [43] Reformer: The efficient transformer
      Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In ICLR, 2020

  43. [44] Parrot: Pareto-optimal multi-reward reinforcement learning framework for text-to-image generation
      Seung Hyun Lee, Yinxiao Li, Junjie Ke, Innfarn Yoo, Han Zhang, Jiahui Yu, Qifei Wang, Fei Deng, Glenn Entis, Junfeng He, et al. Parrot: Pareto-optimal multi-reward reinforcement learning framework for text-to-image generation. In ECCV, 2025

  44. [45] HTR-VT: Handwritten text recognition with vision transformer
      Yuting Li, Dexiong Chen, Tinglong Tang, and Xi Shen. HTR-VT: Handwritten text recognition with vision transformer. Pattern Recognition, 2025

  45. [46] Jamba: A hybrid transformer-Mamba language model
      Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid transformer-Mamba language model. ICLR, 2025

  46. [47] Flow matching for generative modeling
      Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In ICLR, 2022

  47. [48] Flow straight and fast: Learning to generate and transfer data with rectified flow
      Xingchao Liu, Chengyue Gong, et al. Flow straight and fast: Learning to generate and transfer data with rectified flow. In ICLR, 2023

  48. [49] VMamba: Visual state space model
      Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, and Yunfan Liu. VMamba: Visual state space model, 2024

  49. [50] Correcting diffusion generation through resampling
      Yujian Liu, Yang Zhang, Tommi Jaakkola, and Shiyu Chang. Correcting diffusion generation through resampling. In CVPR, 2024

  50. [51] Keep the cost down: A review on methods to optimize LLM's KV-cache consumption
      Shi Luohe, Zhang Hongyi, Yao Yao, Li Zuchao, and Zhao Hai. Keep the cost down: A review on methods to optimize LLM's KV-cache consumption. COLM, 2024

  51. [52] SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers
      Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In ECCV, 2024

  52. [53] Improved denoising diffusion probabilistic models
      Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In Int. Conf. Mach. Learn., 2021

  53. [54] Scalable diffusion models with transformers
      William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023

  54. [55] EfficientVMamba: Atrous selective scan for light weight visual Mamba
      Xiaohuan Pei, Tao Huang, and Chang Xu. EfficientVMamba: Atrous selective scan for light weight visual Mamba, 2025

  55. [56] The FineWeb datasets: Decanting the web for the finest text data at scale
      Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Margaret Mitchell, Colin A Raffel, Leandro von Werra, Thomas Wolf, et al. The FineWeb datasets: Decanting the web for the finest text data at scale. NeurIPS, 2024

  56. [57] Improving image similarity with vectors of locally aggregated tensors
      David Picard and Philippe-Henri Gosselin. Improving image similarity with vectors of locally aggregated tensors. In ICIP, pages 669–672, 2011

  57. [58] Language models are unsupervised multitask learners
      Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 2019

  58. [59] High-resolution image synthesis with latent diffusion models
      Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022

  59. [60] Panoptic segmentation of satellite image time series with convolutional temporal attention networks
      Vivien Sainte Fare Garnot and Loic Landrieu. Panoptic segmentation of satellite image time series with convolutional temporal attention networks. ICCV, 2021

  60. [61] WinoGrande: An adversarial Winograd schema challenge at scale
      Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 2021

  61. [62] Diffusion Schrödinger bridge matching
      Yuyang Shi, Valentin De Bortoli, Andrew Campbell, and Arnaud Doucet. Diffusion Schrödinger bridge matching. In NeurIPS, 2024

  62. [63] FreeU: Free lunch in diffusion U-Net
      Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. FreeU: Free lunch in diffusion U-Net. In CVPR, 2024

  63. [64] Score-based generative modeling through stochastic differential equations
      Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021

  64. [65] ViTs for SITS: Vision transformers for satellite image time series
      Michail Tarasiou, Erik Chavez, and Stefanos Zafeiriou. ViTs for SITS: Vision transformers for satellite image time series. In CVPR, 2023

  65. [66] Gemini: A family of highly capable multimodal models
      Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv:2312.11805, 2023

  66. [67] Gemma 2: Improving open language models at a practical size
      Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv:2408.00118, 2024

  67. [68] MLP-Mixer: An all-MLP architecture for vision
      Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. MLP-Mixer: An all-MLP architecture for vision. In NeurIPS, 2021

  68. [69] ResMLP: Feedforward networks for image classification with data-efficient training
      Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, et al. ResMLP: Feedforward networks for image classification with data-efficient training. IEEE TPAMI, 2022

  69. [70] Attention is all you need
      Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017

  70. [71] Diffusion model alignment using direct preference optimization
      Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In CVPR, 2024

  71. [72] Linformer: Self-attention with linear complexity
      Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv:2006.04768, 2020

  72. [73] Powerful and flexible: Personalized text-to-image generation via reinforcement learning
      Fanyue Wei, Wei Zeng, Zhenyang Li, Dawei Yin, Lixin Duan, and Wen Li. Powerful and flexible: Personalized text-to-image generation via reinforcement learning. In ECCV, 2024

  73. [74] Point Transformer V3: Simpler, faster, stronger
      Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xihui Liu, Yu Qiao, Wanli Ouyang, Tong He, and Hengshuang Zhao. Point Transformer V3: Simpler, faster, stronger. In CVPR, 2024

  74. [75] Diffusion models without attention
      Jing Nathan Yan, Jiatao Gu, and Alexander M Rush. Diffusion models without attention. In CVPR, 2024

  75. [76] CogVideoX: Text-to-video diffusion models with an expert transformer
      Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. ICLR, 2025

  76. [77] Are transformers universal approximators of sequence-to-sequence functions?
      Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. Are transformers universal approximators of sequence-to-sequence functions? In ICLR, 2020

  77. [78] HellaSwag: Can a machine really finish your sentence?
      Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? arXiv:1905.07830, 2019

  78. [79] Scaling vision transformers
      Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In CVPR, 2022

  79. [80] Google USM: Scaling automatic speech recognition beyond 100 languages
      Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, et al. Google USM: Scaling automatic speech recognition beyond 100 languages. arXiv:2303.01037, 2023

  80. [81] MobileDiffusion: Instant text-to-image generation on mobile devices
      Yang Zhao, Yanwu Xu, Zhisheng Xiao, Haolin Jia, and Tingbo Hou. MobileDiffusion: Instant text-to-image generation on mobile devices. In ECCV, 2024

Showing first 80 references.