pith. machine review for the scientific record.

arxiv: 2604.06129 · v1 · submitted 2026-04-07 · 💻 cs.CV · cs.AI

Recognition: 1 theorem link · Lean Theorem

PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer

Arijit Ghosh, Corentin Sautier, Davide Allegro, David Picard, Elliot Vincent, Fei Meng, Loic Landrieu, Lucas Degeorge, Marta López-Rauhut, Nicolas Dufour, Raphael Baena, Ségolène Albouy, Syrine Kalleli, Thibaut Loiseau, Tom Ravaud, Yohann Perron, Zeynep Sonat Baltaci

Pith reviewed 2026-05-10 19:16 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI
keywords Polynomial Mixer · token mixing · self-attention replacement · linear complexity · transformer architecture · contextual mapping · universal approximation · sequence modeling

The pith

A learned polynomial aggregates tokens into a compact representation from which each token retrieves context, replacing self-attention at linear cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Polynomial Mixer as a drop-in replacement for self-attention in transformers. It works by using a learned polynomial to combine all input tokens into one compact form, then letting each token pull the contextual information it needs from that form. The authors prove that this mechanism still satisfies the contextual mapping property, so transformers built with it remain universal sequence-to-sequence approximators. Experiments in text generation, handwritten text recognition, image generation, 3D modeling, and Earth observation show performance matching standard attention while cutting computational cost for long sequences.

Core claim

PoM aggregates input tokens into a compact representation through a learned polynomial function, from which each token retrieves contextual information. This satisfies the contextual mapping property and therefore preserves the universal approximation capability of the transformer. When substituted for self-attention, PoM produces models that match attention-based performance across five domains while running in linear time.

What carries the argument

The Polynomial Mixer, a token-mixing operation that aggregates inputs via a learned polynomial into a compact representation and lets tokens retrieve context from it.
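To make the mechanism concrete, here is a minimal sketch of the general idea in PyTorch. This is an illustration of polynomial token mixing, not the authors' exact formulation: the module name, the monomial pooling, and the gated read-out are all assumptions made for exposition; the official implementation lives at https://github.com/davidpicard/pom.

```python
# Illustrative sketch only: pool per-token monomials into a compact
# "moment" vector, then let every token read context back out of it.
# Cost is O(seq_len * dim) because the sequence is touched once.
import torch
import torch.nn as nn


class PolynomialMixerSketch(nn.Module):  # hypothetical name
    def __init__(self, dim: int, degree: int = 2):
        super().__init__()
        self.degree = degree
        self.proj_in = nn.Linear(dim, dim)            # features to aggregate
        self.proj_read = nn.Linear(dim, dim)          # per-token retrieval gate
        self.proj_out = nn.Linear(degree * dim, dim)  # mix pooled moments

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        h = self.proj_in(x)
        # Elementwise monomials h, h**2, ..., h**degree, each averaged over
        # the sequence: the learned-polynomial aggregation step.
        moments = torch.cat(
            [(h ** k).mean(dim=1) for k in range(1, self.degree + 1)], dim=-1
        )                                              # (batch, degree * dim)
        context = self.proj_out(moments)               # compact representation
        # Retrieval step: each token gates the shared context by its own state.
        gate = torch.sigmoid(self.proj_read(x))
        return gate * context.unsqueeze(1)             # (batch, seq_len, dim)
```

Because the pooled moments have a fixed size regardless of sequence length, the compute and memory of the mixing step grow linearly with the number of tokens, which is the property the paper's efficiency figures measure.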

If this is right

  • Transformers can process longer sequences without the quadratic cost of attention.
  • The same model architectures and training procedures remain valid when attention is replaced by PoM.
  • Performance parity holds across text, image, 3D, and remote-sensing domains.
  • The universality guarantee means no fundamental loss of modeling capacity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same aggregation idea could be tested in other sequence models that currently rely on attention for mixing.
  • If the polynomial degree can be kept small while retaining performance, further speed-ups become possible on hardware optimized for dense operations.
  • Replacing only the mixing step leaves the rest of the transformer unchanged, so existing scaling recipes and optimizers transfer directly (a drop-in sketch follows this list).
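On that last point, a hedged sketch of what the drop-in swap looks like: a generic pre-norm transformer block where only the token-mixing sublayer is a parameter. The block structure below is standard boilerplate, not code from the paper; any module mapping (batch, seq_len, dim) to (batch, seq_len, dim), such as the PolynomialMixerSketch above, would slot into `mixer`.

```python
# Generic pre-norm transformer block: only the mixing sublayer changes
# when attention is swapped for a PoM-style module; norms, MLP, and
# residual connections stay exactly as they were.
import torch.nn as nn


class BlockWithMixer(nn.Module):  # illustrative, not the paper's code
    def __init__(self, dim: int, mixer: nn.Module):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mixer = mixer  # self-attention or a linear-time replacement
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))  # the only sublayer that changes
        x = x + self.mlp(self.norm2(x))
        return x
```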

Load-bearing premise

The learned polynomial can aggregate and retrieve the necessary contextual information across tasks without losing the expressive power that the universality proof assumes.

What would settle it

A sequence-to-sequence task where a transformer equipped with PoM fails to approximate the target mapping that a standard attention transformer can learn, or a long-sequence benchmark where PoM runtime grows quadratically instead of linearly.
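As a rough version of the runtime half of that test (an assumption-laden sketch, not the paper's benchmark), one can time a quadratic baseline at doubling sequence lengths and inspect the growth factor: a linear-time mixer timed the same way should roughly double per doubling, while attention should roughly quadruple, as in the Figure 1 measurements below.

```python
# Crude scaling check: wall-clock time versus sequence length for
# standard scaled-dot-product attention. Swap in a linear-time mixer
# and compare growth factors. On GPU, add torch.cuda.synchronize()
# before reading the clock for accurate timings.
import time
import torch
import torch.nn.functional as F


def avg_time(fn, x, reps: int = 10) -> float:
    fn(x)  # warm-up
    t0 = time.perf_counter()
    for _ in range(reps):
        fn(x)
    return (time.perf_counter() - t0) / reps


dim = 64
for n in (1024, 2048, 4096, 8192):
    x = torch.randn(1, n, dim)
    t = avg_time(lambda q: F.scaled_dot_product_attention(q, q, q), x)
    print(f"seq_len={n:5d}: attention {t * 1e3:8.2f} ms")
```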

Figures

Figures reproduced from arXiv: 2604.06129 by Arijit Ghosh, Corentin Sautier, Davide Allegro, David Picard, Elliot Vincent, Fei Meng, Loic Landrieu, Lucas Degeorge, Marta López-Rauhut, Nicolas Dufour, Raphael Baena, Ségolène Albouy, Syrine Kalleli, Thibaut Loiseau, Tom Ravaud, Yohann Perron, Zeynep Sonat Baltaci.

Figure 1
Figure 1: Polynomial Mixer (PoM). Inference time on an H100 as a function of sequence length, shown for the PoM module alone (a) and within a full Transformer pipeline (b). PoM scales nearly linearly and is significantly faster than self-attention for long sequences, even with FlashAttention. As measured for NLP, OCR, 3D point cloud segmentation, Earth Observation analysis, and image generation (c), replacing atten… view at source ↗
Figure 3
Figure 3: Evaluation Across Domains. We evaluate PoM on various tasks from multiple domains, simply replacing some or all of the self-attention blocks in SOTA Transformer-based models. Left: For OCR, given a page of handwritten text (top) split into lines, the goal is to recognize each character and group them into words (bottom). Middle: For Earth observation, given a time series of satellite images (top), the goal is… view at source ↗
Figure 4
Figure 4: Computational Efficiency. We report the speed at generating images for various resolutions for SiT-XL/2 and its PoM counterpart SiPoM-XL/2. … a limit imposed by the quadratic cost of attention and GPU memory constraints. We investigate two PoM-based replacements for the attention modules: (i) PPoMv3, which replaces all MHA layers with PoM while keeping the same context length, and (ii) PPoMv3-Hybrid, which… view at source ↗
Figure 5
Figure 5: Qualitative Results on Class-Conditional Generation. Images sampled from SiPoM-XL/2 with different classes at 256 resolution. We use classifier-free guidance with ω = 6. view at source ↗

  Model        #Param (×10⁶)   FID↓   Throughput (img/s)
  SiT-L/2      458             18.8   9842
  SiPoM-L/2    414             17.6   11,954
  SiT-XL/2     675             17.2   4041
  SiPoM-XL/2   609             17.2   8146
Figure 6
Figure 6: Uncurated 256² images for the class loggerhead, loggerhead turtle, Caretta caretta (33). view at source ↗
Figure 7
Figure 7: Uncurated 256² images for the class macaw (88). view at source ↗
Figure 8
Figure 8: Uncurated 256² images for the class otter (360). view at source ↗
Figure 9
Figure 9: Uncurated 256² images for the class volcano (980). view at source ↗
Original abstract

This paper introduces the Polynomial Mixer (PoM), a novel token mixing mechanism with linear complexity that serves as a drop-in replacement for self-attention. PoM aggregates input tokens into a compact representation through a learned polynomial function, from which each token retrieves contextual information. We prove that PoM satisfies the contextual mapping property, ensuring that transformers equipped with PoM remain universal sequence-to-sequence approximators. We replace standard self-attention with PoM across five diverse domains: text generation, handwritten text recognition, image generation, 3D modeling, and Earth observation. PoM matches the performance of attention-based models while drastically reducing computational cost when working with long sequences. The code is available at https://github.com/davidpicard/pom.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces the Polynomial Mixer (PoM) as a linear-complexity drop-in replacement for self-attention. PoM aggregates input tokens into a compact representation via a learned polynomial function, from which each token retrieves contextual information. The authors prove that PoM satisfies the contextual mapping property, ensuring transformers using PoM remain universal sequence-to-sequence approximators. They evaluate PoM by replacing attention in five domains (text generation, handwritten text recognition, image generation, 3D modeling, Earth observation), reporting performance parity with attention models at linear cost for long sequences. Code is released.

Significance. If the proof and results hold, this is a significant contribution to efficient transformer design, offering a scalable alternative for long-sequence tasks while preserving theoretical universality. The explicit proof of the contextual mapping property and multi-domain validation are notable strengths; open-sourced code aids reproducibility.

major comments (2)
  1. [Theoretical analysis / contextual mapping proof] Proof of contextual mapping property: the central universality claim depends on this property holding for the learned polynomial aggregation and retrieval. The manuscript should provide an explicit derivation or bounds showing that finite-degree polynomials do not restrict the representable contextual mappings, particularly addressing how coefficient learning interacts with the property.
  2. [Experiments / results across domains] Empirical evaluation sections: performance is reported to match attention baselines across five domains, but without details on run counts, variance, hyperparameter search, or ablations on polynomial degree, it is difficult to confirm that the linear-cost replacement incurs no hidden accuracy loss, undermining the 'matches performance' claim.
minor comments (3)
  1. [Related work] Related work: add explicit positioning against other linear-time attention variants (e.g., Performer, Linformer) to clarify novelty.
  2. [Method / PoM definition] Notation and definitions: the polynomial function and aggregation/retrieval steps should be formalized with explicit equations early in the main text for reader clarity.
  3. [Figures] Figure captions: ensure visual diagrams of the PoM mechanism include labels for polynomial degree and token flow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Theoretical analysis / contextual mapping proof] Proof of contextual mapping property: the central universality claim depends on this property holding for the learned polynomial aggregation and retrieval. The manuscript should provide an explicit derivation or bounds showing that finite-degree polynomials do not restrict the representable contextual mappings, particularly addressing how coefficient learning interacts with the property.

    Authors: We appreciate the referee's emphasis on making the theoretical guarantees fully explicit. Section 3.2 of the manuscript already contains a proof that PoM satisfies the contextual mapping property for any finite polynomial degree, relying on the fact that polynomials are dense in continuous functions on compact sets (via Stone-Weierstrass; the classical statement is restated after these responses) and that coefficient learning allows the aggregation and retrieval steps to realize arbitrary mappings. To address the request for explicit derivation and bounds, we will add a corollary with approximation-error bounds and a step-by-step derivation of how learned coefficients interact with the property in the revised version. revision: yes

  2. Referee: [Experiments / results across domains] Empirical evaluation sections: performance is reported to match attention baselines across five domains, but without details on run counts, variance, hyperparameter search, or ablations on polynomial degree, it is difficult to confirm that the linear-cost replacement incurs no hidden accuracy loss, undermining the 'matches performance' claim.

    Authors: We agree that additional experimental details are necessary to fully substantiate the performance-parity claim. The revised manuscript will report the number of independent runs per experiment, include standard deviations or confidence intervals, describe the hyperparameter search protocol, and add an ablation study on polynomial degree (in an appendix) to demonstrate that accuracy remains stable across reasonable degrees without hidden costs. revision: yes
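For reference, the density fact invoked in response 1 is the classical Weierstrass approximation theorem (the polynomial case of Stone-Weierstrass), stated here in standard textbook form rather than quoted from the manuscript:

```latex
% Polynomial case of Stone--Weierstrass: polynomials are uniformly
% dense in the continuous functions on any compact set.
\begin{theorem}[Weierstrass approximation]
Let $K \subset \mathbb{R}^d$ be compact and let $f \colon K \to \mathbb{R}$
be continuous. For every $\varepsilon > 0$ there exists a polynomial $p$
such that
\[
  \sup_{x \in K} \lvert f(x) - p(x) \rvert < \varepsilon .
\]
\end{theorem}
```

Note that this theorem guarantees density only as the degree is allowed to grow, so the rebuttal's claim about a fixed finite degree is not settled by it alone; that gap is precisely what the promised corollary with approximation-error bounds would need to close.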

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper introduces PoM as a new token-mixing mechanism defined via a learned polynomial aggregation followed by per-token retrieval. It supplies an explicit proof that this construction satisfies the contextual mapping property, from which universality of the resulting transformer follows directly. No step reduces a claimed prediction or first-principles result to a fitted parameter, self-citation, or definitional renaming of the target quantity. The five-domain empirical replacements are presented as independent validation rather than forced outputs of the same fit. The derivation chain therefore remains non-circular.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that a learned polynomial can serve as a sufficient aggregator for contextual information while preserving universality; this is treated as a domain assumption rather than derived from first principles.

free parameters (1)
  • polynomial coefficients
    Coefficients of the learned polynomial function are fitted during training to enable token aggregation and retrieval.
axioms (1)
  • domain assumption: A polynomial function can encode sufficient contextual mappings for sequence-to-sequence tasks
    Invoked to support the universality claim and drop-in replacement property.

pith-pipeline@v0.9.0 · 5499 in / 1344 out tokens · 67175 ms · 2026-05-10T19:16:55.919967+00:00 · methodology


Reference graph

Works this paper leans on

83 extracted references · 17 canonical work pages · 11 internal anchors

  1. [1] GPT-4 Technical Report
     Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv:2303.08774, 2023

  2. [2] wav2vec 2.0: A framework for self-supervised learning of speech representations
     Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. NeurIPS, 2020

  3. [3] Fixed point diffusion models
     Xingjian Bai and Luke Melas-Kyriazi. Fixed point diffusion models. In CVPR, 2024

  4. [4] eDiff-I: Text-to-image diffusion models with ensemble of expert denoisers
     Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, and Ming-Yu Liu. eDiff-I: Text-to-image diffusion models with ensemble of expert denoisers. arXiv:2211.01324, 2022

  5. [5] SemanticKITTI: A dataset for semantic scene understanding of LiDAR sequences
     Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Jurgen Gall. SemanticKITTI: A dataset for semantic scene understanding of LiDAR sequences. In ICCV, 2019

  6. [6] Reducing transformer key-value cache size with cross-layer attention
     William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, and Jonathan Ragan-Kelley. Reducing transformer key-value cache size with cross-layer attention. NeurIPS, 2024

  7. [7] The LAM dataset: A novel benchmark for line-level handwritten text recognition
     Silvia Cascianelli, Vittorio Pippi, Martin Maarand, Marcella Cornia, Lorenzo Baraldi, Christopher Kermorvant, and Rita Cucchiara. The LAM dataset: A novel benchmark for line-level handwritten text recognition. In ICPR, 2022

  8. [8] PixArt-Σ: Weak-to-strong training of diffusion transformer for 4K text-to-image generation
     Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-Σ: Weak-to-strong training of diffusion transformer for 4K text-to-image generation. In ECCV, 2024

  9. [9] FIT: Far-reaching interleaved transformers
     Ting Chen and Lala Li. FIT: Far-reaching interleaved transformers. arXiv:2305.12689, 2023

  10. [10] Generating long sequences with sparse transformers
      Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv:1904.10509, 2019

  11. [11] Self-supervised learning with random-projection quantizer for speech recognition
      Chung-Cheng Chiu, James Qin, Yu Zhang, Jiahui Yu, and Yonghui Wu. Self-supervised learning with random-projection quantizer for speech recognition. In International Conference on Machine Learning. PMLR, 2022

  12. [12] Think you have solved question answering? Try ARC, the AI2 reasoning challenge
      Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv:1803.05457, 2018

  13. [13] Scalable high-resolution pixel-space image synthesis with hourglass diffusion transformers
      Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z Kaplan, and Enrico Shippole. Scalable high-resolution pixel-space image synthesis with hourglass diffusion transformers. In Int. Conf. Mach. Learn., 2024

  14. [14] ScanNet: Richly-annotated 3D reconstructions of indoor scenes
      Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Niessner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In CVPR, 2017

  15. [15] FlashAttention-2: Faster attention with better parallelism and work partitioning
      Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. ICLR, 2024

  16. [16] Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality
      Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In Int. Conf. Mach. Learn., 2024

  17. [17] FlashAttention: Fast and memory-efficient exact attention with IO-awareness
      Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In NeurIPS, 2022

  18. [18] FlashAttention: Fast and memory-efficient exact attention with IO-awareness
      Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In NeurIPS, 2022

  19. [19] Moshi: A speech-text foundation model for real-time dialogue
      Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: A speech-text foundation model for real-time dialogue. arXiv:2410.00037, 2024

  20. [20] An image is worth 16x16 words: Transformers for image recognition at scale
      Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020

  21. [21] The Llama 3 herd of models
      Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv:2407.21783, 2024

  22. [22] Don't drop your samples! Coherence-aware training benefits conditional diffusion
      Nicolas Dufour, Victor Besnier, Vicky Kalogeiton, and David Picard. Don't drop your samples! Coherence-aware training benefits conditional diffusion. In CVPR, 2024

  23. [23] Scaling rectified flow transformers for high-resolution image synthesis
      Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Int. Conf. Mach. Learn., 2024

  24. [24] Lumina-T2X: Transforming text into any modality, resolution, and duration via flow-based large diffusion transformers
      Peng Gao, Le Zhuo, Ziyi Lin, Chris Liu, Junsong Chen, Ruoyi Du, Enze Xie, Xu Luo, Longtian Qiu, Yuhang Zhang, et al. Lumina-T2X: Transforming text into any modality, resolution, and duration via flow-based large diffusion transformers. arXiv:2405.05945, 2024

  25. [25] Zamba: A compact 7B SSM hybrid model
      Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, and Beren Millidge. Zamba: A compact 7B SSM hybrid model. arXiv:2405.16712, 2024

  26. [26] CommonCanvas: Open diffusion models trained on creative-commons images
      Aaron Gokaslan, A. Feder Cooper, Jasmine Collins, Landan Seguin, Austin Jacobson, Mihir Patel, Jonathan Frankle, Cory Stephenson, and Volodymyr Kuleshov. CommonCanvas: Open diffusion models trained on creative-commons images. In CVPR, 2024

  27. [27] Efficiently modeling long sequences with structured state spaces
      Albert Gu, Karan Goel, and Christopher Re. Efficiently modeling long sequences with structured state spaces. In ICLR, 2021

  28. [28] Combining recurrent, convolutional, and continuous-time models with linear state space layers
      Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers. In NeurIPS, 2021

  29. [29] Matryoshka diffusion models
      Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Joshua M Susskind, and Navdeep Jaitly. Matryoshka diffusion models. In ICLR, 2023

  30. [30] Improved noise schedule for diffusion training
      Tiankai Hang and Shuyang Gu. Improved noise schedule for diffusion training. ICCV, 2024

  31. [31] DiffiT: Diffusion vision transformers for image generation
      Ali Hatamizadeh, Jiaming Song, Guilin Liu, Jan Kautz, and Arash Vahdat. DiffiT: Diffusion vision transformers for image generation. In ECCV, 2024

  32. [32] Measuring massive multitask language understanding
      Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In ICLR, 2021

  33. [33] Denoising diffusion probabilistic models
      Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020

  34. [34] Long short-term memory
      Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997

  35. [35] ZigMa: A DiT-style zigzag Mamba diffusion model
      Vincent Tao Hu, Stefan Andreas Baumann, Ming Gui, Olga Grebenkova, Pingchuan Ma, Johannes Fischer, and Björn Ommer. ZigMa: A DiT-style zigzag Mamba diffusion model. In ECCV, 2024

  36. [36] Scalable adaptive computation for iterative generation
      Allan Jabri, David J Fleet, and Ting Chen. Scalable adaptive computation for iterative generation. In Int. Conf. Mach. Learn., 2023

  37. [37] Metric learning with HORDE: High-order regularizer for deep embeddings
      Pierre Jacob, David Picard, Aymeric Histace, and Edouard Klein. Metric learning with HORDE: High-order regularizer for deep embeddings. In ICCV, pages 6539–6548, 2019

  38. [38] Perceiver IO: A general architecture for structured inputs & outputs
      Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. Perceiver IO: A general architecture for structured inputs & outputs. In ICLR, 2022

  39. [39] Scaling laws for neural language models
      Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv:2001.08361, 2020

  40. [40] NanoGPT
      Andrej Karpathy. NanoGPT. https://github.com/karpathy/nanoGPT, 2022

  41. [42] Analyzing and improving the training dynamics of diffusion models
      Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. In CVPR, 2024

  42. [43] Reformer: The efficient transformer
      Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In ICLR, 2020

  43. [44] Parrot: Pareto-optimal multi-reward reinforcement learning framework for text-to-image generation
      Seung Hyun Lee, Yinxiao Li, Junjie Ke, Innfarn Yoo, Han Zhang, Jiahui Yu, Qifei Wang, Fei Deng, Glenn Entis, Junfeng He, et al. Parrot: Pareto-optimal multi-reward reinforcement learning framework for text-to-image generation. In ECCV, 2025

  44. [45] HTR-VT: Handwritten text recognition with vision transformer
      Yuting Li, Dexiong Chen, Tinglong Tang, and Xi Shen. HTR-VT: Handwritten text recognition with vision transformer. Pattern Recognition, 2025

  45. [46] Jamba: A hybrid transformer-Mamba language model
      Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid transformer-Mamba language model. ICLR, 2025

  46. [47] Flow matching for generative modeling
      Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In ICLR, 2022

  47. [48] Flow straight and fast: Learning to generate and transfer data with rectified flow
      Xingchao Liu, Chengyue Gong, et al. Flow straight and fast: Learning to generate and transfer data with rectified flow. In ICLR, 2023

  48. [49] VMamba: Visual state space model
      Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, and Yunfan Liu. VMamba: Visual state space model, 2024

  49. [50] Correcting diffusion generation through resampling
      Yujian Liu, Yang Zhang, Tommi Jaakkola, and Shiyu Chang. Correcting diffusion generation through resampling. In CVPR, 2024

  50. [51] Keep the cost down: A review on methods to optimize LLM's KV-cache consumption
      Shi Luohe, Zhang Hongyi, Yao Yao, Li Zuchao, and Zhao Hai. Keep the cost down: A review on methods to optimize LLM's KV-cache consumption. COLM, 2024

  51. [52] SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers
      Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In ECCV, 2024

  52. [53] Improved denoising diffusion probabilistic models
      Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In Int. Conf. Mach. Learn., 2021

  53. [54] Scalable diffusion models with transformers
      William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023

  54. [55] EfficientVMamba: Atrous selective scan for light weight visual Mamba
      Xiaohuan Pei, Tao Huang, and Chang Xu. EfficientVMamba: Atrous selective scan for light weight visual Mamba, 2025

  55. [56] The FineWeb datasets: Decanting the web for the finest text data at scale
      Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Margaret Mitchell, Colin A Raffel, Leandro von Werra, Thomas Wolf, et al. The FineWeb datasets: Decanting the web for the finest text data at scale. NeurIPS, 2024

  56. [57] Improving image similarity with vectors of locally aggregated tensors
      David Picard and Philippe-Henri Gosselin. Improving image similarity with vectors of locally aggregated tensors. In ICIP, pages 669–672, 2011

  57. [58] Language models are unsupervised multitask learners
      Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 2019

  58. [59] High-resolution image synthesis with latent diffusion models
      Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022

  59. [60] Panoptic segmentation of satellite image time series with convolutional temporal attention networks
      Vivien Sainte Fare Garnot and Loic Landrieu. Panoptic segmentation of satellite image time series with convolutional temporal attention networks. ICCV, 2021

  60. [61] WinoGrande: An adversarial Winograd schema challenge at scale
      Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 2021

  61. [62] Diffusion Schrödinger bridge matching
      Yuyang Shi, Valentin De Bortoli, Andrew Campbell, and Arnaud Doucet. Diffusion Schrödinger bridge matching. In NeurIPS, 2024

  62. [63] FreeU: Free lunch in diffusion U-Net
      Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. FreeU: Free lunch in diffusion U-Net. In CVPR, 2024

  63. [64] Score-based generative modeling through stochastic differential equations
      Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021

  64. [65] ViTs for SITS: Vision transformers for satellite image time series
      Michail Tarasiou, Erik Chavez, and Stefanos Zafeiriou. ViTs for SITS: Vision transformers for satellite image time series. In CVPR, 2023

  65. [66] Gemini: A family of highly capable multimodal models
      Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv:2312.11805, 2023

  66. [67] Gemma 2: Improving open language models at a practical size
      Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv:2408.00118, 2024

  67. [68] MLP-Mixer: An all-MLP architecture for vision
      Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. MLP-Mixer: An all-MLP architecture for vision. In NeurIPS, 2021

  68. [69] ResMLP: Feedforward networks for image classification with data-efficient training
      Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, et al. ResMLP: Feedforward networks for image classification with data-efficient training. IEEE TPAMI, 2022

  69. [70] Attention is all you need
      Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017

  70. [71] Diffusion model alignment using direct preference optimization
      Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In CVPR, 2024

  71. [72] Linformer: Self-attention with linear complexity
      Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv:2006.04768, 2020

  72. [73] Powerful and flexible: Personalized text-to-image generation via reinforcement learning
      Fanyue Wei, Wei Zeng, Zhenyang Li, Dawei Yin, Lixin Duan, and Wen Li. Powerful and flexible: Personalized text-to-image generation via reinforcement learning. In ECCV, 2024

  73. [74] Point Transformer V3: Simpler, faster, stronger
      Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xihui Liu, Yu Qiao, Wanli Ouyang, Tong He, and Hengshuang Zhao. Point Transformer V3: Simpler, faster, stronger. In CVPR, 2024

  74. [75] Diffusion models without attention
      Jing Nathan Yan, Jiatao Gu, and Alexander M Rush. Diffusion models without attention. In CVPR, 2024

  75. [76] CogVideoX: Text-to-video diffusion models with an expert transformer
      Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. ICLR, 2025

  76. [77] Are transformers universal approximators of sequence-to-sequence functions?
      Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. Are transformers universal approximators of sequence-to-sequence functions? In ICLR, 2020

  77. [78] HellaSwag: Can a machine really finish your sentence?
      Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? arXiv:1905.07830, 2019

  78. [79] Scaling vision transformers
      Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In CVPR, 2022

  79. [80] Google USM: Scaling automatic speech recognition beyond 100 languages
      Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, et al. Google USM: Scaling automatic speech recognition beyond 100 languages. arXiv:2303.01037, 2023

  80. [81] MobileDiffusion: Instant text-to-image generation on mobile devices
      Yang Zhao, Yanwu Xu, Zhisheng Xiao, Haolin Jia, and Tingbo Hou. MobileDiffusion: Instant text-to-image generation on mobile devices. In ECCV, 2024

Showing first 80 references.