PermuQuant: Lowering Per-Group Quantization Error by Reordering Channels for Diffusion Models
Pith reviewed 2026-05-12 04:17 UTC · model grok-4.3
The pith
Reordering channels to group similar statistics reduces per-group quantization error in diffusion models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that selecting a channel permutation on calibration data via a joint second-moment criterion places channels with similar activation and weight statistics into the same per-group quantization block, thereby lowering the shared scale's sensitivity to outliers and reducing overall quantization error for diffusion models. The method applies the permutation only when it reduces calibration error and absorbs the reordering into adjacent modules or weights, leaving inference speed unchanged. This yields lower error than existing PTQ baselines across multiple large diffusion models.
What carries the argument
The joint second-moment criterion used to sort channels before grouping, paired with a calibration-data acceptance rule that applies the permutation only if it reduces measured quantization error.
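A minimal sketch of this machinery, assuming a multiplicative activation-RMS × weight-RMS sort key and squared activation error; the paper's exact criterion, error metric, and group size are not given above, so every function and parameter here is hypothetical:

```python
import numpy as np

def group_quant_error(x, group_size=32, bits=4):
    """Per-group symmetric quantization error: each contiguous block
    of `group_size` channels shares a single scale."""
    qmax = 2 ** (bits - 1) - 1
    err = 0.0
    for g in range(0, x.shape[1], group_size):
        blk = x[:, g:g + group_size]
        scale = max(np.abs(blk).max() / qmax, 1e-12)  # shared per-group scale
        deq = np.clip(np.round(blk / scale), -qmax - 1, qmax) * scale
        err += float(((blk - deq) ** 2).sum())
    return err

def select_permutation(acts, weights, group_size=32):
    """Sort channels by a joint second-moment key, then keep the permutation
    only if it lowers error on the calibration batch (acceptance rule).
    acts: (samples, channels); weights: (out_features, channels)."""
    key = np.sqrt((acts ** 2).mean(0)) * np.sqrt((weights ** 2).mean(0))
    perm = np.argsort(key)            # similar statistics become neighbours
    if group_quant_error(acts[:, perm], group_size) < group_quant_error(acts, group_size):
        return perm
    return np.arange(acts.shape[1])   # reject: fall back to identity order
```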
If this is right
- Quantization error falls under W4A4 and similar low-bit regimes for diffusion models.
- Models achieve up to a 1.8× single-step speedup on hardware such as the RTX 5090.
- The DiT memory footprint shrinks by 3.5× under W4A4 NVFP4 quantization.
- The approach outperforms prior post-training quantization baselines without requiring retraining.
- Permutations can be applied offline to weights or absorbed into adjacent layers with no runtime overhead.
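The last item is concrete enough to sketch. For two standard linear layers in sequence, the permutation folds into the weights offline; this is a sketch under that assumption, and the paper's actual absorption rules for normalization and modulation modules may differ:

```python
import torch

@torch.no_grad()
def absorb_permutation(producer: torch.nn.Linear,
                       consumer: torch.nn.Linear,
                       perm: torch.Tensor) -> None:
    """Fold a channel permutation into adjacent weights offline.

    Assumes producer.out_features == consumer.in_features == len(perm).
    Afterwards the producer emits channels already in permuted order and
    the consumer expects that order, so inference needs no gather op.
    """
    producer.weight.copy_(producer.weight[perm])     # reorder output rows
    if producer.bias is not None:
        producer.bias.copy_(producer.bias[perm])
    consumer.weight.copy_(consumer.weight[:, perm])  # reorder input columns
```

The consumer's column-permuted weight is then what gets per-group quantized, which is how similar columns end up sharing a scale.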
Where Pith is reading between the lines
- The same reordering principle could extend to per-group quantization in non-diffusion transformers if the second-moment criterion remains predictive.
- If calibration sets are chosen to cover edge-case prompts, the method might support even lower bit widths such as 2-bit without visible degradation.
- Applying the technique after weight pruning or alongside activation-aware scaling could compound compression gains.
- The offline absorption step suggests similar reordering tactics are feasible in other compression pipelines where runtime cost must stay zero.
Load-bearing premise
A permutation chosen on calibration data will generalize to the full input distribution seen at inference time without creating new artifacts in generated images.
What would settle it
Compare quantization error or generation metrics such as FID on a held-out prompt set between the quantized model with the selected permutation and the same model without permutation; an increase in error or drop in quality falsifies the claim.
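With the stand-in helpers from the earlier sketch (`select_permutation` and `group_quant_error`, both hypothetical), the test is a few lines; the synthetic arrays below only stand in for real held-out activations:

```python
import numpy as np
rng = np.random.default_rng(0)

chan_spread = rng.gamma(1.0, 1.0, 128)            # heavy-tailed per-channel scale
calib   = rng.standard_normal((256, 128)) * chan_spread
heldout = rng.standard_normal((256, 128)) * chan_spread
weights = rng.standard_normal((64, 128))

perm = select_permutation(calib, weights)          # chosen on calibration only
e_id = group_quant_error(heldout)                  # identity channel order
e_pm = group_quant_error(heldout[:, perm])         # selected permutation
print(f"held-out error: identity={e_id:.2f}, permuted={e_pm:.2f}")
# The claim is falsified if e_pm consistently exceeds e_id on held-out data.
```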
Original abstract
Large-scale visual generative models have achieved remarkable performance. However, their high computational and memory costs make deployment challenging in resource-constrained scenarios, such as interactive applications and personal single-GPU usage. Post-training quantization (PTQ) offers a practical solution by compressing pretrained models without expensive retraining. However, existing PTQ methods still suffer from severe quality degradation under extremely low-bit settings. In this paper, we identify channel ordering as an important but underexplored factor in per-group quantization. In this setting, each contiguous group shares one quantization scale. When channels with very different statistics are placed in the same group, the scale can be dominated by outliers and cause large quantization errors. Based on this observation, we propose PermuQuant, a simple and effective PTQ framework for low-bit diffusion models. PermuQuant sorts channels by a joint second-moment criterion before per-group quantization, placing channels with similar activation and weight statistics into the same group. It further uses a calibration-based acceptance rule to apply reordering only when the selected permutation reduces quantization error on calibration data. The selected permutations are absorbed into adjacent modules or applied to weights offline, avoiding explicit runtime permutation operations. Extensive experiments on multiple large diffusion models show that PermuQuant consistently reduces quantization error and outperforms existing PTQ baselines. On FLUX.1-dev with an RTX 5090, PermuQuant achieves up to a 1.8× single-step speedup and reduces the DiT memory footprint by 3.5× under W4A4 NVFP4 quantization. Code will be available at https://github.com/yscheng04/PermuQuant.
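The outlier mechanism the abstract describes is easy to reproduce numerically; this 4-bit example is constructed for illustration and does not come from the paper:

```python
import numpy as np

def quantize_group(blk, bits=4):
    """Quantize a group of channels with one shared symmetric scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(blk).max() / qmax
    return np.clip(np.round(blk / scale), -qmax - 1, qmax) * scale

small = np.array([0.02, -0.03, 0.01, 0.025])
print(quantize_group(small))                  # scale ~0.004: values survive
print(quantize_group(np.append(small, 8.0)))  # one outlier -> scale ~1.14:
                                              # every small channel rounds to 0
```

Sorting by second moments keeps the large-magnitude channel out of the small channels' group, which is the effect the method exploits.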
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PermuQuant, a post-training quantization (PTQ) method for diffusion models that identifies channel ordering as a key factor in per-group quantization error. It proposes sorting channels by a joint second-moment criterion computed on calibration data to group channels with similar activation/weight statistics, applies a calibration-based acceptance rule to retain only error-reducing permutations, and absorbs the selected permutations offline into adjacent modules or weights to avoid runtime cost. The central claim is that this yields consistent quantization error reduction and outperforms existing PTQ baselines, with concrete gains reported on FLUX.1-dev (up to 1.8× single-step speedup and 3.5× DiT memory reduction under W4A4 NVFP4).
Significance. If the empirical claims hold, the work provides a lightweight, training-free improvement to low-bit quantization of large generative models, directly addressing deployment barriers on single-GPU or resource-constrained settings. The offline absorption of permutations is a practical strength that incurs no inference overhead. The focus on an underexplored aspect of per-group quantization (channel ordering) could influence future PTQ designs for diffusion and other generative architectures.
major comments (2)
- [Method (permutation selection and acceptance rule)] The central claim of consistent error reduction and outperformance rests on the assumption that a permutation chosen via the joint second-moment criterion on calibration data will generalize across the full range of activations encountered during diffusion sampling. Diffusion models exhibit substantial distribution shifts from noisy to clean latents; the manuscript should therefore report quantization error or generation quality (e.g., FID) measured on held-out activations spanning multiple timesteps, or provide an ablation comparing calibration-only vs. full-trajectory error to substantiate generalization.
- [Experiments] The abstract and introduction assert 'consistent' outperformance and error reduction with specific speedup/memory numbers, yet the manuscript text provides no quantitative tables, per-layer or per-model error comparisons, baseline implementation details, ablation studies on the acceptance rule, or error-bar statistics. Without these, the magnitude and reliability of the claimed improvements cannot be verified.
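The ablation requested in the first major comment is mechanical: compute the same error at each sampling timestep and check that it stays flat outside the calibration range. A schematic helper, reusing the hypothetical `group_quant_error` from the earlier sketch:

```python
def trajectory_errors(acts_by_timestep, perm, group_size=32):
    """acts_by_timestep: {t: (samples, channels) array} gathered along the
    full sampling trajectory; returns the per-timestep error under `perm`."""
    return {t: group_quant_error(a[:, perm], group_size)
            for t, a in sorted(acts_by_timestep.items())}
```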
minor comments (2)
- [Method] Clarify the exact definition of the joint second-moment criterion (e.g., the mathematical formulation combining activation and weight statistics) and whether it is computed per-layer or globally.
- [Abstract] The statement 'Code will be available' should be accompanied by a concrete repository link or a note on reproducibility artifacts (e.g., calibration data splits).
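On the first minor comment: the material above never states the formula, so the following is only one plausible reading of "joint second-moment", not the paper's definition:

```latex
% Hypothetical sort key for channel j, given N calibration activations x
% and a weight matrix W; \pi orders channels so similar keys are adjacent.
s_j = \sqrt{\frac{1}{N}\sum_{n=1}^{N} x_{n,j}^{2}}
      \cdot
      \sqrt{\frac{1}{d_{\text{out}}}\sum_{i=1}^{d_{\text{out}}} W_{i,j}^{2}},
\qquad
\pi = \operatorname{argsort}(s_1, \dots, s_C)
```

The sketch assumes the key is computed per-layer; whether it is per-layer or global, as the comment asks, is left open by the excerpt.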
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below, explaining our approach and planned revisions.
Point-by-point responses
Referee: [Method (permutation selection and acceptance rule)] The central claim of consistent error reduction and outperformance rests on the assumption that a permutation chosen via the joint second-moment criterion on calibration data will generalize across the full range of activations encountered during diffusion sampling. Diffusion models exhibit substantial distribution shifts from noisy to clean latents; the manuscript should therefore report quantization error or generation quality (e.g., FID) measured on held-out activations spanning multiple timesteps, or provide an ablation comparing calibration-only vs. full-trajectory error to substantiate generalization.
Authors: We agree that distribution shifts across timesteps in diffusion models warrant explicit verification of generalization. Our calibration set is constructed by sampling activations from multiple timesteps along the diffusion trajectory to capture a representative range of statistics. The acceptance rule further filters permutations to those that reduce error on this calibration data. To directly address the concern, we will add an ablation in the revised manuscript comparing per-group quantization error (and FID where feasible) on the calibration set versus held-out timesteps from the full sampling trajectory. revision: yes
Referee: [Experiments] The abstract and introduction assert 'consistent' outperformance and error reduction with specific speedup/memory numbers, yet the manuscript text provides no quantitative tables, per-layer or per-model error comparisons, baseline implementation details, ablation studies on the acceptance rule, or error-bar statistics. Without these, the magnitude and reliability of the claimed improvements cannot be verified.
Authors: We acknowledge that the initial manuscript text did not include the requested quantitative tables, per-layer breakdowns, or statistical details, which limits verifiability. The abstract reports aggregate gains, but we will revise the experiments section to add comprehensive tables showing per-layer and per-model quantization error reductions, full baseline implementation details, dedicated ablations on the acceptance rule, and error bars computed over multiple independent calibration runs. revision: yes
Circularity Check
No circularity: PermuQuant selects permutations via an explicit external criterion on calibration data
Full rationale
The derivation defines a joint second-moment sorting criterion and a calibration-based acceptance rule that are applied offline to weights or adjacent modules. These steps are independent of the final inference-time error reduction claims; the paper measures improvement on separate test sets and models rather than re-using the selection criterion as its own output. No equations reduce claimed gains to a fitted parameter defined by the result itself, and no self-citation chain bears the central premise. The approach is externally falsifiable on held-out diffusion sampling trajectories.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: channels with similar activation and weight second-moment statistics benefit from sharing a quantization scale.
Fused kernel detail
Where the permutation is absorbed into an adjacent normalization module rather than applied to weights, it is handled by a fused kernel that:
- loads one activation row as x[π(1)], …, x[π(K)] from global memory;
- computes the RMS statistic or the mean and variance on the loaded values;
- applies the corresponding channel-wise scale or modulation;
- writes the reordered normalized output contiguously.
In this way, the reordering is absorbed into the mandatory input-read stage of normalization, and the fused kernel avoids a standalone reorder pass over the activation tensor. The only additional work is reading the permutation indices and generating indexed memory addresses, which is much cheaper than materializing an explicitly permuted copy of the activation tensor.
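A toy rendering of that fused read stage in Python; the real implementation is presumably a GPU kernel, and this only shows where the permutation cost hides:

```python
import numpy as np

def fused_permuted_rmsnorm(x, perm, gamma, eps=1e-6):
    """RMS-normalize while gathering channels in permuted order.

    The gather happens inside the load the normalization performs anyway,
    so no standalone reorder pass over the tensor is needed; the RMS value
    itself is permutation-invariant.
    """
    xp = x[..., perm]                # indexed read of one row's channels
    rms = np.sqrt((xp ** 2).mean(axis=-1, keepdims=True) + eps)
    return (xp / rms) * gamma[perm]  # channel-wise scale follows its channel
```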