PermuQuant: Lowering Per-Group Quantization Error by Reordering Channels for Diffusion Models
Pith reviewed 2026-05-12 04:17 UTC · model grok-4.3
The pith
Reordering channels to group similar statistics reduces per-group quantization error in diffusion models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that selecting a channel permutation on calibration data via a joint second-moment criterion places channels with similar activation and weight statistics into the same per-group quantization block, thereby lowering the shared scale's sensitivity to outliers and reducing overall quantization error for diffusion models. The method applies the permutation only when it reduces calibration error and absorbs the reordering into adjacent modules or weights, leaving inference speed unchanged. This yields lower error than existing PTQ baselines across multiple large diffusion models.
What carries the argument
The joint second-moment criterion used to sort channels before grouping, paired with a calibration-data acceptance rule that applies the permutation only if it reduces measured quantization error.
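A minimal sketch of this machinery, assuming a multiplicative activation-RMS × weight-RMS sort key and squared activation error; the paper's exact criterion, error metric, and group size are not given above, so every function and parameter here is hypothetical:

```python
import numpy as np

def group_quant_error(x, group_size=32, bits=4):
    """Per-group symmetric quantization error: each contiguous block
    of `group_size` channels shares a single scale."""
    qmax = 2 ** (bits - 1) - 1
    err = 0.0
    for g in range(0, x.shape[1], group_size):
        blk = x[:, g:g + group_size]
        scale = max(np.abs(blk).max() / qmax, 1e-12)  # shared per-group scale
        deq = np.clip(np.round(blk / scale), -qmax - 1, qmax) * scale
        err += float(((blk - deq) ** 2).sum())
    return err

def select_permutation(acts, weights, group_size=32):
    """Sort channels by a joint second-moment key, then keep the permutation
    only if it lowers error on the calibration batch (acceptance rule).
    acts: (samples, channels); weights: (out_features, channels)."""
    key = np.sqrt((acts ** 2).mean(0)) * np.sqrt((weights ** 2).mean(0))
    perm = np.argsort(key)            # similar statistics become neighbours
    if group_quant_error(acts[:, perm], group_size) < group_quant_error(acts, group_size):
        return perm
    return np.arange(acts.shape[1])   # reject: fall back to identity order
```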
If this is right
- Quantization error falls under W4A4 and similar low-bit regimes for diffusion models.
- Models achieve up to a 1.8× single-step speedup on hardware such as the RTX 5090.
- The DiT memory footprint shrinks by 3.5× under W4A4 NVFP4 quantization.
- The approach outperforms prior post-training quantization baselines without requiring retraining.
- Permutations can be applied offline to weights or absorbed into adjacent layers with no runtime overhead.
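The last item is concrete enough to sketch. For two standard linear layers in sequence, the permutation folds into the weights offline; this is a sketch under that assumption, and the paper's actual absorption rules for normalization and modulation modules may differ:

```python
import torch

@torch.no_grad()
def absorb_permutation(producer: torch.nn.Linear,
                       consumer: torch.nn.Linear,
                       perm: torch.Tensor) -> None:
    """Fold a channel permutation into adjacent weights offline.

    Assumes producer.out_features == consumer.in_features == len(perm).
    Afterwards the producer emits channels already in permuted order and
    the consumer expects that order, so inference needs no gather op.
    """
    producer.weight.copy_(producer.weight[perm])     # reorder output rows
    if producer.bias is not None:
        producer.bias.copy_(producer.bias[perm])
    consumer.weight.copy_(consumer.weight[:, perm])  # reorder input columns
```

The consumer's column-permuted weight is then what gets per-group quantized, which is how similar columns end up sharing a scale.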
Where Pith is reading between the lines
- The same reordering principle could extend to per-group quantization in non-diffusion transformers if the second-moment criterion remains predictive.
- If calibration sets are chosen to cover edge-case prompts, the method might support even lower bit widths such as 2-bit without visible degradation.
- Applying the technique after weight pruning or alongside activation-aware scaling could compound compression gains.
- The offline absorption step suggests similar reordering tactics are feasible in other compression pipelines where runtime cost must stay zero.
Load-bearing premise
A permutation chosen on calibration data will generalize to the full input distribution seen at inference time without creating new artifacts in generated images.
What would settle it
Compare quantization error or generation metrics such as FID on a held-out prompt set between the quantized model with the selected permutation and the same model without permutation; an increase in error or drop in quality falsifies the claim.
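With the stand-in helpers from the earlier sketch (`select_permutation` and `group_quant_error`, both hypothetical), the test is a few lines; the synthetic arrays below only stand in for real held-out activations:

```python
import numpy as np
rng = np.random.default_rng(0)

chan_spread = rng.gamma(1.0, 1.0, 128)            # heavy-tailed per-channel scale
calib   = rng.standard_normal((256, 128)) * chan_spread
heldout = rng.standard_normal((256, 128)) * chan_spread
weights = rng.standard_normal((64, 128))

perm = select_permutation(calib, weights)          # chosen on calibration only
e_id = group_quant_error(heldout)                  # identity channel order
e_pm = group_quant_error(heldout[:, perm])         # selected permutation
print(f"held-out error: identity={e_id:.2f}, permuted={e_pm:.2f}")
# The claim is falsified if e_pm consistently exceeds e_id on held-out data.
```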
Original abstract
Large-scale visual generative models have achieved remarkable performance. However, their high computational and memory costs make deployment challenging in resource-constrained scenarios, such as interactive applications and personal single-GPU usage. Post-training quantization (PTQ) offers a practical solution by compressing pretrained models without expensive retraining. However, existing PTQ methods still suffer from severe quality degradation under extremely low-bit settings. In this paper, we identify channel ordering as an important but underexplored factor in per-group quantization. In this setting, each contiguous group shares one quantization scale. When channels with very different statistics are placed in the same group, the scale can be dominated by outliers and cause large quantization errors. Based on this observation, we propose PermuQuant, a simple and effective PTQ framework for low-bit diffusion models. PermuQuant sorts channels by a joint second-moment criterion before per-group quantization, placing channels with similar activation and weight statistics into the same group. It further uses a calibration-based acceptance rule to apply reordering only when the selected permutation reduces quantization error on calibration data. The selected permutations are absorbed into adjacent modules or applied to weights offline, avoiding explicit runtime permutation operations. Extensive experiments on multiple large diffusion models show that PermuQuant consistently reduces quantization error and outperforms existing PTQ baselines. On FLUX.1-dev with an RTX 5090, PermuQuant achieves up to a 1.8× single-step speedup and reduces the DiT memory footprint by 3.5× under W4A4 NVFP4 quantization. Code will be available at https://github.com/yscheng04/PermuQuant.
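The outlier mechanism the abstract describes is easy to reproduce numerically; this 4-bit example is constructed for illustration and does not come from the paper:

```python
import numpy as np

def quantize_group(blk, bits=4):
    """Quantize a group of channels with one shared symmetric scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(blk).max() / qmax
    return np.clip(np.round(blk / scale), -qmax - 1, qmax) * scale

small = np.array([0.02, -0.03, 0.01, 0.025])
print(quantize_group(small))                  # scale ~0.004: values survive
print(quantize_group(np.append(small, 8.0)))  # one outlier -> scale ~1.14:
                                              # every small channel rounds to 0
```

Sorting by second moments keeps the large-magnitude channel out of the small channels' group, which is the effect the method exploits.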
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PermuQuant, a post-training quantization (PTQ) method for diffusion models that identifies channel ordering as a key factor in per-group quantization error. It proposes sorting channels by a joint second-moment criterion computed on calibration data to group channels with similar activation/weight statistics, applies a calibration-based acceptance rule to retain only error-reducing permutations, and absorbs the selected permutations offline into adjacent modules or weights to avoid runtime cost. The central claim is that this yields consistent quantization error reduction and outperforms existing PTQ baselines, with concrete gains reported on FLUX.1-dev (up to 1.8× single-step speedup and 3.5× DiT memory reduction under W4A4 NVFP4).
Significance. If the empirical claims hold, the work provides a lightweight, training-free improvement to low-bit quantization of large generative models, directly addressing deployment barriers on single-GPU or resource-constrained settings. The offline absorption of permutations is a practical strength that incurs no inference overhead. The focus on an underexplored aspect of per-group quantization (channel ordering) could influence future PTQ designs for diffusion and other generative architectures.
major comments (2)
- [Method (permutation selection and acceptance rule)] The central claim of consistent error reduction and outperformance rests on the assumption that a permutation chosen via the joint second-moment criterion on calibration data will generalize across the full range of activations encountered during diffusion sampling. Diffusion models exhibit substantial distribution shifts from noisy to clean latents; the manuscript should therefore report quantization error or generation quality (e.g., FID) measured on held-out activations spanning multiple timesteps, or provide an ablation comparing calibration-only vs. full-trajectory error to substantiate generalization.
- [Experiments] The abstract and introduction assert 'consistent' outperformance and error reduction with specific speedup/memory numbers, yet the manuscript text provides no quantitative tables, per-layer or per-model error comparisons, baseline implementation details, ablation studies on the acceptance rule, or error-bar statistics. Without these, the magnitude and reliability of the claimed improvements cannot be verified.
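The ablation requested in the first major comment is mechanical: compute the same error at each sampling timestep and check that it stays flat outside the calibration range. A schematic helper, reusing the hypothetical `group_quant_error` from the earlier sketch:

```python
def trajectory_errors(acts_by_timestep, perm, group_size=32):
    """acts_by_timestep: {t: (samples, channels) array} gathered along the
    full sampling trajectory; returns the per-timestep error under `perm`."""
    return {t: group_quant_error(a[:, perm], group_size)
            for t, a in sorted(acts_by_timestep.items())}
```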
minor comments (2)
- [Method] Clarify the exact definition of the joint second-moment criterion (e.g., the mathematical formulation combining activation and weight statistics) and whether it is computed per-layer or globally.
- [Abstract] The statement 'Code will be available' should be accompanied by a concrete repository link or a note on reproducibility artifacts (e.g., calibration data splits).
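On the first minor comment: the material above never states the formula, so the following is only one plausible reading of "joint second-moment", not the paper's definition:

```latex
% Hypothetical sort key for channel j, given N calibration activations x
% and a weight matrix W; \pi orders channels so similar keys are adjacent.
s_j = \sqrt{\frac{1}{N}\sum_{n=1}^{N} x_{n,j}^{2}}
      \cdot
      \sqrt{\frac{1}{d_{\text{out}}}\sum_{i=1}^{d_{\text{out}}} W_{i,j}^{2}},
\qquad
\pi = \operatorname{argsort}(s_1, \dots, s_C)
```

The sketch assumes the key is computed per-layer; whether it is per-layer or global, as the comment asks, is left open by the excerpt.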
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below, explaining our approach and planned revisions.
Point-by-point responses
Referee: [Method (permutation selection and acceptance rule)] The central claim of consistent error reduction and outperformance rests on the assumption that a permutation chosen via the joint second-moment criterion on calibration data will generalize across the full range of activations encountered during diffusion sampling. Diffusion models exhibit substantial distribution shifts from noisy to clean latents; the manuscript should therefore report quantization error or generation quality (e.g., FID) measured on held-out activations spanning multiple timesteps, or provide an ablation comparing calibration-only vs. full-trajectory error to substantiate generalization.
Authors: We agree that distribution shifts across timesteps in diffusion models warrant explicit verification of generalization. Our calibration set is constructed by sampling activations from multiple timesteps along the diffusion trajectory to capture a representative range of statistics. The acceptance rule further filters permutations to those that reduce error on this calibration data. To directly address the concern, we will add an ablation in the revised manuscript comparing per-group quantization error (and FID where feasible) on the calibration set versus held-out timesteps from the full sampling trajectory. revision: yes
Referee: [Experiments] The abstract and introduction assert 'consistent' outperformance and error reduction with specific speedup/memory numbers, yet the manuscript text provides no quantitative tables, per-layer or per-model error comparisons, baseline implementation details, ablation studies on the acceptance rule, or error-bar statistics. Without these, the magnitude and reliability of the claimed improvements cannot be verified.
Authors: We acknowledge that the initial manuscript text did not include the requested quantitative tables, per-layer breakdowns, or statistical details, which limits verifiability. The abstract reports aggregate gains, but we will revise the experiments section to add comprehensive tables showing per-layer and per-model quantization error reductions, full baseline implementation details, dedicated ablations on the acceptance rule, and error bars computed over multiple independent calibration runs. revision: yes
Circularity Check
No circularity: PermuQuant selects permutations via an explicit external criterion on calibration data
Full rationale
The derivation defines a joint second-moment sorting criterion and a calibration-based acceptance rule that are applied offline to weights or adjacent modules. These steps are independent of the final inference-time error reduction claims; the paper measures improvement on separate test sets and models rather than re-using the selection criterion as its own output. No equations reduce claimed gains to a fitted parameter defined by the result itself, and no self-citation chain bears the central premise. The approach is externally falsifiable on held-out diffusion sampling trajectories.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: channels with similar activation and weight second-moment statistics benefit from sharing a quantization scale.
Fused kernel detail
Where the permutation is absorbed into an adjacent normalization module rather than applied to weights, it is handled by a fused kernel that:
- loads one activation row as x[π(1)], …, x[π(K)] from global memory;
- computes the RMS statistic or the mean and variance on the loaded values;
- applies the corresponding channel-wise scale or modulation;
- writes the reordered normalized output contiguously.
In this way, the reordering is absorbed into the mandatory input-read stage of normalization, and the fused kernel avoids a standalone reorder pass over the activation tensor. The only additional work is reading the permutation indices and generating indexed memory addresses, which is much cheaper than materializing an explicitly permuted copy of the activation tensor.
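A toy rendering of that fused read stage in Python; the real implementation is presumably a GPU kernel, and this only shows where the permutation cost hides:

```python
import numpy as np

def fused_permuted_rmsnorm(x, perm, gamma, eps=1e-6):
    """RMS-normalize while gathering channels in permuted order.

    The gather happens inside the load the normalization performs anyway,
    so no standalone reorder pass over the tensor is needed; the RMS value
    itself is permutation-invariant.
    """
    xp = x[..., perm]                # indexed read of one row's channels
    rms = np.sqrt((xp ** 2).mean(axis=-1, keepdims=True) + eps)
    return (xp / rms) * gamma[perm]  # channel-wise scale follows its channel
```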