pith. machine review for the scientific record.

arxiv: 2604.15521 · v1 · submitted 2026-04-16 · 💻 cs.CV

Recognition: unknown

Frequency-Aware Flow Matching for High-Quality Image Generation

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 10:54 UTC · model grok-4.3

classification 💻 cs.CV
keywords flow matching · image generation · frequency conditioning · two-branch architecture · generative models · ImageNet · computer vision

The pith

Flow matching generates sharper images when low- and high-frequency components receive separate time-dependent weighting and dedicated processing branches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard flow matching reverses a noise addition process in which noise affects frequencies unevenly, causing global structure to form early and fine details to appear only late. The paper introduces explicit frequency-aware conditioning through time-dependent adaptive weighting together with a two-branch network: one branch processes low- and high-frequency components while the second branch performs spatial synthesis guided by the frequency output. This separation allows the model to strengthen large-scale coherence and refine textures and edges at the appropriate stages. A reader would care because the change addresses a structural limitation inside an existing generative framework and produces measurably higher-quality output on standard image benchmarks.
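To make the decomposition concrete, here is a minimal sketch, not the authors' code, of the kind of split the paragraph describes: a noisy latent separated into low- and high-frequency parts with a radial Fourier mask. The `cutoff` ratio and the mask shape are hypothetical choices; the paper's actual filters are not specified in the material above.

```python
import torch

def frequency_split(x: torch.Tensor, cutoff: float = 0.25):
    """Split x (B, C, H, W) into (low, high) with low + high == x.

    Hypothetical radial low-pass in Fourier space; the paper's filter
    design may differ.
    """
    _, _, H, W = x.shape
    X = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    # Normalized frequency grid; zero frequency sits at the center after fftshift.
    fy = torch.linspace(-0.5, 0.5, H, device=x.device).view(H, 1)
    fx = torch.linspace(-0.5, 0.5, W, device=x.device).view(1, W)
    mask = ((fy**2 + fx**2).sqrt() <= cutoff).to(x.dtype)
    low = torch.fft.ifft2(torch.fft.ifftshift(X * mask, dim=(-2, -1))).real
    return low, x - low  # the high-frequency part is the residual
```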

Core claim

Flow matching models learn to reverse a corruption process that adds Gaussian noise, yet this noise affects frequency components non-uniformly: low-frequency content emerges early in the reverse process and high-frequency content only later. Adding frequency-aware conditioning via time-dependent adaptive weighting, together with a two-branch architecture (a frequency branch that separately processes low- and high-frequency components to capture structure and refine details, and a spatial branch that synthesizes images in the latent domain, guided by the frequency branch's output), lets the model capture both large-scale coherence and fine-grained detail at each step of the reverse process.
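Read literally, the claim implies a forward pass shaped like the sketch below. The module sizes, the 4-channel latent, and the sigmoid gate producing ω_t are hypothetical stand-ins, since the abstract names the branches but not their internals; `frequency_split` is the sketch above.

```python
import torch
import torch.nn as nn

class FreqFlowSketch(nn.Module):
    """Hedged two-branch sketch: frequency branch plus guided spatial branch."""

    def __init__(self, latent_ch: int = 4, dim: int = 256):
        super().__init__()
        # Placeholder frequency branch: consumes [weighted low, weighted high].
        self.freq_branch = nn.Conv2d(2 * latent_ch, dim, 3, padding=1)
        # Placeholder spatial branch: synthesizes the velocity field in the
        # latent domain, conditioned on the frequency branch's features.
        self.spatial_branch = nn.Conv2d(latent_ch + dim, latent_ch, 3, padding=1)
        self.gate = nn.Linear(1, 1)  # time-dependent weight omega_t

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        low, high = frequency_split(x_t)
        omega = torch.sigmoid(self.gate(t.view(-1, 1))).view(-1, 1, 1, 1)
        freq_in = torch.cat([omega * low, (1 - omega) * high], dim=1)
        f = self.freq_branch(freq_in)            # frequency features
        return self.spatial_branch(torch.cat([x_t, f], dim=1))
```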

What carries the argument

Time-dependent adaptive weighting applied to a two-branch frequency-spatial architecture that separates explicit low- and high-frequency processing from latent-domain spatial synthesis.

If this is right

  • Low-frequency conditioning reinforces global structure in the generated images.
  • High-frequency conditioning enhances texture fidelity and detail sharpness.
  • Both large-scale coherence and fine-grained details are modeled more effectively than in standard flow matching.
  • The approach yields state-of-the-art FID performance on class-conditional ImageNet-256 generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same frequency imbalance likely appears in other noise-based generative models, so the weighting and branching idea may transfer beyond flow matching.
  • Explicit timing of frequency emphasis could allow the generation process to reach acceptable quality in fewer steps.
  • The method may show larger gains on higher-resolution images where fine-detail fidelity matters more.

Load-bearing premise

Separating frequency processing into its own branch and weighting it adaptively over time will improve both global structure and fine details without causing branch interference or training instability.

What would settle it

Train the same two-branch architecture without the time-dependent frequency weighting and check whether the FID on class-conditional ImageNet-256 falls back to no better than the baseline flow-matching result; if it does, the adaptive weighting, not merely the added branch capacity, carries the gain.
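A minimal sketch of that experiment, reusing the hypothetical modules from the sketches above: the same network with the gate frozen to a constant, trained otherwise identically, so any FID gap is attributable to the adaptive schedule rather than the extra capacity.

```python
import torch

def forward_fixed_weight(model: FreqFlowSketch, x_t, t, omega_const: float = 0.5):
    """Ablation variant: constant frequency weighting in place of omega_t.

    `t` is accepted but deliberately unused, so both variants share a signature.
    """
    low, high = frequency_split(x_t)
    freq_in = torch.cat([omega_const * low, (1 - omega_const) * high], dim=1)
    f = model.freq_branch(freq_in)
    return model.spatial_branch(torch.cat([x_t, f], dim=1))

# Train both variants with the same recipe, sample class-conditional
# ImageNet-256 images from each, and compare FID against the SiT baseline.
```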

Figures

Figures reproduced from arXiv: 2604.15521 by Alan Yuille, Ju He, Liang-Chieh Chen, Qihang Yu, Sucheng Ren, Xiaohui Shen.

Figure 1
Figure 1: Flow matching in the spatial domain vs. frequency-aware flow matching. Unlike previous flow matching models such as SiT [34], which operate purely in the spatial domain, our FreqFlow explicitly incorporates frequency information into the spatial branch. This enhances local detail refinement while preserving structural consistency, leading to improved image quality. view at source ↗
Figure 2
Figure 2: Parameters vs. FID. Our FreqFlow-L outperforms DiT-XL [35] and SiT-XL [34] by 0.73 and 0.52 FID, respectively, while using fewer parameters. Under comparable parameter budgets, FreqFlow-H surpasses DiMR-G [30] and MAR-H [28] by 0.15 and 0.07 FID, demonstrating superior efficiency and performance. view at source ↗
Figure 3
Figure 3: Relative log amplitudes of frequency across time steps from 1000 (pure Gaussian noise) to 0 (clean image). Flow matching models introduce low-frequency components in the early stages and high-frequency components in the later stages of the reverse process. Compared to SiT [34], our FreqFlow constructs global structures (low-frequency information) more efficiently, reaching the lowest log amplitude earlier … view at source ↗
Figure 4
Figure 4: Overview of FreqFlow. FreqFlow features a two-branch design: (1) a frequency branch that captures the low-frequency global structure and high-frequency details (e.g., edges), and (2) a spatial branch that synthesizes images in the pixel or latent domain, guided by the frequency branch's output. During training, the input noisy image is decomposed into low- and high-frequency components using low-pass and h… view at source ↗
Figure 5
Figure 5: Visualization of adaptive frequency integration during the reverse process from time step 1000 (pure Gaussian noise) to 0 (clean image). The learned integration weights of low- (ω_t) and high- (1 − ω_t) frequency components demonstrate that FreqFlow prioritizes low-frequency structure in the early stages (i.e., large time steps) and progressively shifts focus to high-frequency details toward the end (i.e., … view at source ↗
Figure 6
Figure 6: Visualization of generated low-, high-frequency and final outputs. The final output from the spatial branch is enhanced by the low- and high-frequency information provided by the frequency branch. view at source ↗
Figure 7
Figure 7: Generations. FreqFlow produces high-quality 512×512 (1st and 2nd columns) and 256×256 images (remaining columns). Accompanying results table (model · #params · FID w/o CFG ↓): LDM-4 [46] · 400M · 10.56; DiT-XL/2 [35] · 675M · 9.62; ADM-U [8] · 608M · 7.49; U-ViT-H/2 [2] · 501M · 6.58; DiMR-XL/2R [30] · 505M · 4.50; DiMR-G/2R [30] · 1.06B · 3.56; FreqFlow-L · 507M · 3.12; FreqFlow-H · 1.08B · 2.45. view at source ↗
Figure 8
Figure 8: Visualization of generated low-, high-frequency and final outputs. The final output from the spatial branch is enhanced by the low- and high-frequency information provided by the frequency branch. view at source ↗
Figure 9
Figure 9: Visualization of generated low-, high-frequency and final outputs. The final output from the spatial branch is enhanced by the low- and high-frequency information provided by the frequency branch. view at source ↗
Figure 10
Figure 10: Visualization of generated low-, high-frequency and final outputs. The final output from the spatial branch is enhanced by the low- and high-frequency information provided by the frequency branch. view at source ↗
Figure 11
Figure 11: Generated Samples from FreqFlow. FreqFlow is able to generate high-quality golden retriever (88) images. view at source ↗
Figure 12
Figure 12: Generated Samples from FreqFlow. FreqFlow is able to generate high-quality golden retriever (207) images. view at source ↗
read the original abstract

Flow matching models have emerged as a powerful framework for realistic image generation by learning to reverse a corruption process that progressively adds Gaussian noise. However, because noise is injected in the latent domain, its impact on different frequency components is non-uniform. As a result, during inference, flow matching models tend to generate low-frequency components (global structure) in the early stages, while high-frequency components (fine details) emerge only later in the reverse process. Building on this insight, we propose Frequency-Aware Flow Matching (FreqFlow), a novel approach that explicitly incorporates frequency-aware conditioning into the flow matching framework via time-dependent adaptive weighting. We introduce a two-branch architecture: (1) a frequency branch that separately processes low- and high-frequency components to capture global structure and refine textures and edges, and (2) a spatial branch that synthesizes images in the latent domain, guided by the frequency branch's output. By explicitly integrating frequency information into the generation process, FreqFlow ensures that both large-scale coherence and fine-grained details are effectively modeled: low-frequency conditioning reinforces global structure, while high-frequency conditioning enhances texture fidelity and detail sharpness. On the class-conditional ImageNet-256 generation benchmark, our method achieves state-of-the-art performance with an FID of 1.38, surpassing the prior diffusion model DiT and flow matching model SiT by 0.79 and 0.58 FID, respectively. Code is available at https://github.com/OliverRensu/FreqFlow.
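For orientation, a minimal sketch of the standard flow matching objective the abstract builds on: a linear (rectified) interpolant between data and Gaussian noise, with the network regressing the constant velocity along the path. This is the textbook recipe, not FreqFlow's full training procedure, and it uses t in [0, 1] with t = 1 at pure noise rather than the 1000-step axis shown in the figures.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """x0: clean latents, shape (B, C, H, W)."""
    noise = torch.randn_like(x0)                  # endpoint x1 ~ N(0, I)
    t = torch.rand(x0.size(0), device=x0.device)  # uniform time in [0, 1]
    tb = t.view(-1, 1, 1, 1)
    x_t = (1 - tb) * x0 + tb * noise              # linear path: data -> noise
    v_target = noise - x0                         # constant velocity target
    v_pred = model(x_t, t)                        # e.g., FreqFlowSketch above
    return F.mse_loss(v_pred, v_target)
```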

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Frequency-Aware Flow Matching (FreqFlow) as an extension to standard flow matching for image generation. It adds time-dependent adaptive weighting to incorporate frequency-aware conditioning and proposes a two-branch architecture consisting of a frequency branch that processes low- and high-frequency components separately and a spatial branch that synthesizes the image in the latent domain. The central empirical claim is state-of-the-art performance on class-conditional ImageNet-256 generation, with an FID of 1.38 that improves over DiT by 0.79 and over SiT by 0.58.

Significance. If the reported FID improvement is reproducible and attributable to the proposed components, the work would constitute a useful architectural refinement for flow-matching models by explicitly handling the non-uniform frequency impact of the corruption process. The public release of code is a clear strength that supports verification and follow-on research.

major comments (1)
  1. [Results / Experiments] The central claim that the 1.38 FID is attributable to the time-dependent adaptive weighting and the two-branch architecture is not yet supported by evidence that isolates those components. The manuscript should include ablations that remove each addition in turn (e.g., the frequency branch or the adaptive weighting) and report the resulting FID on the same ImageNet-256 benchmark.
minor comments (2)
  1. [Abstract] The abstract states that 'low-frequency conditioning reinforces global structure, while high-frequency conditioning enhances texture fidelity' but does not reference a specific figure or equation that illustrates this separation; adding such a pointer would improve clarity.
  2. [Method] Notation for the frequency decomposition (low- vs. high-frequency components) and how it is applied consistently to both training targets and conditioning should be defined explicitly in the method section to avoid ambiguity during re-implementation.

Simulated Authors' Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address the major comment point-by-point below.

read point-by-point responses
  1. Referee: [Results / Experiments] The central claim that the 1.38 FID is attributable to the time-dependent adaptive weighting and the two-branch architecture is not yet supported by evidence that isolates those components. The manuscript should include ablations that remove each addition in turn (e.g., the frequency branch or the adaptive weighting) and report the resulting FID on the same ImageNet-256 benchmark.

    Authors: We agree that the manuscript would benefit from explicit ablations isolating the contributions of the time-dependent adaptive weighting and the two-branch architecture. In the revised version, we will add these experiments on the ImageNet-256 benchmark, including a baseline without the frequency branch and a variant without the adaptive weighting, and report the corresponding FID scores to quantify the performance degradation. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical architectural proposal

full rationale

The paper presents FreqFlow as an architectural extension to existing flow-matching frameworks, adding time-dependent adaptive weighting and a two-branch frequency-spatial network. Its central claim is an empirical FID result (1.38) on the standard class-conditional ImageNet-256 benchmark, with code released for direct reproduction. No derivation chain, first-principles prediction, or fitted quantity is shown to reduce by construction to its own inputs; the method is described as an explicit addition whose components can be implemented consistently with flow-matching ODEs. Self-citations, if present, are not load-bearing for the performance claim, which rests on external benchmark comparison rather than internal redefinition.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The abstract relies on the domain assumption that noise in the latent domain affects frequency components non-uniformly and introduces a new architectural component whose parameters are not specified.

free parameters (1)
  • time-dependent adaptive weighting
    The weighting scheme is described as time-dependent, but no functional form or fitting procedure is given in the abstract; one plausible parameterization is sketched after this ledger.
axioms (1)
  • domain assumption Noise injection in the latent domain has non-uniform impact on different frequency components
    This insight is stated as the foundation for the frequency-aware approach.
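
Since the abstract gives ω_t no functional form, here is one purely illustrative parameterization that could produce the Figure 5 behavior (low-frequency weight high at large t, decaying toward t = 0): a small MLP over a sinusoidal timestep embedding, ending in a sigmoid. Everything here is an assumption, not the paper's construction.

```python
import math
import torch
import torch.nn as nn

class OmegaT(nn.Module):
    """Hypothetical omega_t head; the paper's actual form is unspecified."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.dim = dim
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, 1))

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # Sinusoidal embedding of t in [0, 1], then squash to (0, 1).
        half = self.dim // 2
        freqs = torch.exp(-math.log(1e4) * torch.arange(half, device=t.device) / half)
        emb = torch.cat([torch.sin(t[:, None] * freqs),
                         torch.cos(t[:, None] * freqs)], dim=-1)
        return torch.sigmoid(self.mlp(emb))  # omega_t, shape (B, 1)
```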

pith-pipeline@v0.9.0 · 5582 in / 1153 out tokens · 38252 ms · 2026-05-10T10:54:50.139960+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

63 extracted references · 15 canonical work pages · 8 internal anchors

  1. [1]

    Building Normalizing Flows with Stochastic Interpolants

    Michael S Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571, 2022.

  2. [2]

    All are worth words: A vit backbone for diffusion models

    Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. In CVPR, 2023.

  3. [3]

    Large Scale GAN Training for High Fidelity Natural Image Synthesis

    Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

  4. [4]

    Maskgit: Masked generative image transformer

    Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In CVPR, 2022.

  5. [5]

    Semantic image segmentation with deep convolutional nets and fully connected crfs

    Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015.

  6. [6]

    Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs

    Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 2017.

  7. [7]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.

  8. [8]

    Diffusion models beat gans on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. NeurIPS, 2021.

  9. [9]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.

  10. [10]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, 2021.

  11. [11]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024.

  12. [12]

    Masked diffusion transformer is a strong image synthesizer

    Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Mdtv2: Masked diffusion transformer is a strong image synthesizer. arXiv preprint arXiv:2303.14389, 2023.

  13. [13]

    Generative adversarial nets

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. NeurIPS, 2014.

  14. [14]

    Diffit: Diffusion vision transformers for image generation

    Ali Hatamizadeh, Jiaming Song, Guilin Liu, Jan Kautz, and Arash Vahdat. Diffit: Diffusion vision transformers for image generation. In ECCV, 2024.

  15. [15]

    Flowtok: Flowing seamlessly across text and image tokens

    Ju He, Qihang Yu, Qihao Liu, and Liang-Chieh Chen. Flowtok: Flowing seamlessly across text and image tokens. In ICCV, 2025.

  16. [16]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. NeurIPS, 30, 2017.

  17. [17]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.

  18. [18]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 2020.

  19. [19]

    Cascaded diffusion models for high fidelity image generation

    Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. JMLR, 23(47), 2022.

  20. [20]

    simple diffusion: End-to-end diffusion for high resolution images

    Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. In ICML, 2023.

  21. [21]

    Fouriscale: A frequency perspective on training-free high-resolution image synthesis

    Linjiang Huang, Rongyao Fang, Aiping Zhang, Guanglu Song, Si Liu, Yu Liu, and Hongsheng Li. Fouriscale: A frequency perspective on training-free high-resolution image synthesis. In ECCV, 2024.

  22. [22]

    Scaling up gans for text-to-image synthesis

    Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up gans for text-to-image synthesis. In CVPR, 2023.

  23. [23]

    Adam: A method for stochastic optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

  24. [24]

    Auto-encoding variational bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.

  25. [25]

    https://blackforestlabs.ai/announcements/

    Black Forest Labs. https://blackforestlabs.ai/announcements/

  26. [26]

    Autoregressive image generation using residual quantization

    Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization. In CVPR, 2022.

  27. [27]

    Return of unconditional generation: A self-supervised representation generation method

    Tianhong Li, Dina Katabi, and Kaiming He. Return of unconditional generation: A self-supervised representation generation method. NeurIPS, 2024.

  28. [28]

    Autoregressive image generation without vector quantization

    Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. NeurIPS, 2024.

  29. [29]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.

  30. [30]

    Alleviating distortion in image generation via multi-resolution diffusion models and time-dependent layer normalization

    Qihao Liu, Zhanpeng Zeng, Ju He, Qihang Yu, Xiaohui Shen, and Liang-Chieh Chen. Alleviating distortion in image generation via multi-resolution diffusion models and time-dependent layer normalization. NeurIPS, 2024.

  31. [31]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.

  32. [32]

    A convnet for the 2020s

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In CVPR, 2022.

  33. [33]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

  34. [34]

    Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers

    Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In ECCV, 2024.

  35. [35]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023.

  36. [36]

    Ultrapixel: Advancing ultra high-resolution image synthesis to new peaks

    Jingjing Ren, Wenbo Li, Haoyu Chen, Renjing Pei, Bin Shao, Yong Guo, Long Peng, Fenglong Song, and Lei Zhu. Ultrapixel: Advancing ultra high-resolution image synthesis to new peaks. NeurIPS, 2024.

  37. [37]

    Turbo2k: Towards ultra-efficient and high-quality 2k video synthesis

    Jingjing Ren, Wenbo Li, Zhongdao Wang, Haoze Sun, Bangzhen Liu, Haoyu Chen, Jiaqi Xu, Aoxue Li, Shifeng Zhang, Bin Shao, et al. Turbo2k: Towards ultra-efficient and high-quality 2k video synthesis. In ICCV, 2025.

  38. [38]

    Co-advise: Cross inductive bias distillation

    Sucheng Ren, Zhengqi Gao, Tianyu Hua, Zihui Xue, Yonglong Tian, Shengfeng He, and Hang Zhao. Co-advise: Cross inductive bias distillation. In CVPR, 2022.

  39. [39]

    Shunted self-attention via multi-scale token aggregation

    Sucheng Ren, Daquan Zhou, Shengfeng He, Jiashi Feng, and Xinchao Wang. Shunted self-attention via multi-scale token aggregation. In CVPR, 2022.

  40. [40]

    Tinymim: An empirical study of distilling mim pre-trained models

    Sucheng Ren, Fangyun Wei, Zheng Zhang, and Han Hu. Tinymim: An empirical study of distilling mim pre-trained models. In CVPR, 2023.

  41. [41]

    Sg-former: Self-guided transformer with evolving token reallocation

    Sucheng Ren, Xingyi Yang, Songhua Liu, and Xinchao Wang. Sg-former: Self-guided transformer with evolving token reallocation. In ICCV, 2023.

  42. [42]

    M-var: Decoupled scale-wise autoregressive modeling for high-quality image generation

    Sucheng Ren, Yaodong Yu, Nataniel Ruiz, Feng Wang, Alan Yuille, and Cihang Xie. M-var: Decoupled scale-wise autoregressive modeling for high-quality image generation. arXiv preprint arXiv:2411.10433, 2024.

  43. [43]

    Flowar: Scale-wise autoregressive image generation meets flow matching

    Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, and Liang-Chieh Chen. Flowar: Scale-wise autoregressive image generation meets flow matching. In ICML, 2025.

  44. [44]

    Beyond next-token: Next-x prediction for autoregressive visual generation

    Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, and Liang-Chieh Chen. Beyond next-token: Next-x prediction for autoregressive visual generation. In ICCV, 2025.

  45. [45]

    Grouping first, attending smartly: Training-free acceleration for diffusion transformers

    Sucheng Ren, Qihang Yu, Ju He, Alan Yuille, and Liang-Chieh Chen. Grouping first, attending smartly: Training-free acceleration for diffusion transformers. arXiv preprint arXiv:2505.14687, 2025.

  46. [46]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.

  47. [47]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.

  48. [48]

    Improved techniques for training gans

    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. NeurIPS, 29, 2016.

  49. [49]

    Stylegan-xl: Scaling stylegan to large diverse datasets

    Axel Sauer, Katja Schwarz, and Andreas Geiger. Stylegan-xl: Scaling stylegan to large diverse datasets. In SIGGRAPH, 2022.

  50. [50]

    On the frequency bias of generative models

    Katja Schwarz, Yiyi Liao, and Andreas Geiger. On the frequency bias of generative models. NeurIPS, 2021.

  51. [51]

    Deeply supervised flow-based generative models

    Inkyu Shin, Chenglin Yang, and Liang-Chieh Chen. Deeply supervised flow-based generative models. In ICCV, 2025.

  52. [52]

    Freeu: Free lunch in diffusion u-net

    Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. Freeu: Free lunch in diffusion u-net. In CVPR, 2024.

  53. [53]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.

  54. [54]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

  55. [55]

    Visual autoregressive modeling: Scalable image generation via next-scale prediction

    Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. NeurIPS, 2024.

  56. [56]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 2017.

  57. [57]

    Maskbit: Embedding-free image generation via bit tokens

    Mark Weber, Lijun Yu, Qihang Yu, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. Maskbit: Embedding-free image generation via bit tokens. arXiv preprint arXiv:2409.16211, 2024.

  58. [58]

    Diffusion models without attention

    Jing Nathan Yan, Jiatao Gu, and Alexander M Rush. Diffusion models without attention. In CVPR, 2024.

  59. [59]

    1.58-bit FLUX

    Chenglin Yang, Celong Liu, Xueqing Deng, Dongwon Kim, Xing Mei, Xiaohui Shen, and Liang-Chieh Chen. 1.58-bit flux. arXiv preprint arXiv:2412.18653, 2024.

  60. [60]

    Frag: Frequency adapting group for diffusion video editing

    Sunjae Yoon, Gwanhyeong Koo, Geonwoo Kim, and Chang D Yoo. Frag: Frequency adapting group for diffusion video editing. arXiv preprint arXiv:2406.06044, 2024.

  61. [61]

    Vector-quantized image modeling with improved vqgan

    Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan. arXiv preprint arXiv:2110.04627, 2021.

  62. [62]

    An image is worth 32 tokens for reconstruction and generation

    Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. An image is worth 32 tokens for reconstruction and generation. NeurIPS, 2024.

  63. [63]

    Randomized autoregressive visual generation

    Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, and Liang-Chieh Chen. Randomized autoregressive visual generation. In ICCV, 2025.