Holding the FP8 Quality Ceiling at 8-Bit Weights and Activations: INT8 and GGUF Post-Training Quantization of Ideogram 4.0 for Consumer GPUs

Ali Asaria; Deep Gandhi; Tony Salomone

arxiv: 2606.12280 · v1 · pith:5SSMJBCKnew · submitted 2026-06-10 · 💻 cs.LG

Pith reviewed 2026-06-27 10:03 UTC · model grok-4.3

classification 💻 cs.LG

keywords post-training quantization · INT8 · diffusion transformer · text-to-image model · consumer GPUs · GGUF · FP8 · Ideogram

The pith

INT8 W8A8 with targeted layer protection matches FP8 quality for the 9.3B Ideogram 4.0 diffusion model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that an 8-bit post-training quantization recipe can match the output quality of FP8 on a large flow-matching diffusion transformer without needing FP8 hardware. The method combines per-channel weight scaling, per-token dynamic activation scaling, SmoothQuant, and higher-precision fallback on a small set of fragile layers. On a 200-prompt benchmark the INT8 version produces Pick and CLIP scores statistically indistinguishable from FP8 while beating NF4 by +1.9 CLIP (95% CI [+1.21, +2.64]). GGUF Q4_K beats NF4 at equal on-disk size and is the Pareto winner on the quality-memory frontier. An ablation isolates protection of the FFN down-projections as the dominant quality lever, and a per-category OCR analysis shows that text rendering remains legible.
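To make the W8A8 part of the recipe concrete, here is a minimal pure-Python sketch of a linear layer with per-channel weight scales and per-token dynamic activation scales. It is an illustration under stated assumptions, not the authors' implementation: symmetric quantization to the range [-127, 127] is assumed, the helper names `quantize_rows` and `w8a8_linear` are made up for this example, and a real deployment would use a fused GPU kernel rather than Python loops.

```python
def quantize_rows(matrix):
    """Symmetric INT8 quantization with one scale per row (assumed scheme)."""
    q_rows, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a8_linear(x, w):
    """Approximate x @ w.T using INT8 operands and integer accumulation.

    x: list of tokens, each a list of in_features floats.
    w: list of output channels, each a list of in_features floats.
    """
    xq, x_scales = quantize_rows(x)  # per-token scales, computed at run time
    wq, w_scales = quantize_rows(w)  # per-channel scales, could be precomputed
    out = []
    for x_row, sx in zip(xq, x_scales):
        out_row = []
        for w_row, sw in zip(wq, w_scales):
            acc = sum(a * b for a, b in zip(x_row, w_row))  # exact integer dot product
            out_row.append(acc * sx * sw)  # rescale the accumulator back to float
        out.append(out_row)
    return out


def float_linear(x, w):
    """Reference full-precision result for comparison."""
    return [[sum(a * b for a, b in zip(xr, wr)) for wr in w] for xr in x]


# Toy inputs: the second token is about 100x smaller than the first.
x = [[0.5, -1.0, 2.0], [0.01, 0.02, -0.03]]
w = [[1.0, 0.5, -0.25], [10.0, -20.0, 5.0]]
approx = w8a8_linear(x, w)
exact = float_linear(x, w)
```

Because each token and each output channel gets its own scale, the small-magnitude second token is quantized on its own range instead of being crushed by the larger first token.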

Core claim

The INT8 W8A8 recipe (per-channel weights, per-token dynamic activations, SmoothQuant, and mixed-precision protection of a small high-fragility layer set) holds the FP8 quality ceiling: on a 200-prompt benchmark the paired same-seed bootstrap CI for INT8-FP8 includes zero on both Pick and CLIP, while INT8 improves on NF4 by +1.9 CLIP (95% CI [+1.21,+2.64], excluding zero).
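The statistical test behind this claim can be sketched as a percentile bootstrap over per-prompt score differences. The code below is a hedged illustration: the resample count, the percentile method, and the function name `paired_bootstrap_ci` are assumptions, and the scores are synthetic placeholders rather than data from the paper.

```python
import random
import statistics


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-prompt difference a - b."""
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired per prompt")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample prompts with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lo, hi


# Synthetic per-prompt scores for 200 prompts, NOT results from the paper.
reference = [20.0 + (i % 7) for i in range(200)]
tied = [r + (0.1 if i % 2 else -0.1) for i, r in enumerate(reference)]
better = [r + 1.0 + (0.1 if i % 2 else -0.1) for i, r in enumerate(reference)]

tied_mean, tied_lo, tied_hi = paired_bootstrap_ci(tied, reference)
gain_mean, gain_lo, gain_hi = paired_bootstrap_ci(better, reference)
print(f"tied:   mean={tied_mean:+.3f}  CI=[{tied_lo:+.3f}, {tied_hi:+.3f}]")
print(f"better: mean={gain_mean:+.3f}  CI=[{gain_lo:+.3f}, {gain_hi:+.3f}]")
```

Pairing by prompt (and, in the paper's setup, by seed) removes prompt-to-prompt variance from the comparison, which is why an interval that includes zero is read as "no detectable difference on this set" rather than as proof of equivalence.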

What carries the argument

The mixed-precision protection of FFN down-projections and other high-fragility layers inside an otherwise uniform W8A8 pipeline that uses per-channel weights and per-token dynamic activations.
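One way to picture mixed-precision protection is as a per-layer precision plan: most layers go to INT8 and layers matching a protected pattern stay in a higher-precision format. The sketch below is illustrative only. The layer names, the pattern `blocks.*.ffn.down_proj`, and the `bf16` fallback are hypothetical, since the exact protected-layer list and fallback precision are not given on this page.

```python
import fnmatch

# Hypothetical pattern; the paper's actual protected set is not listed here.
PROTECTED_PATTERNS = ["blocks.*.ffn.down_proj"]


def precision_plan(layer_names, protected_patterns, default="int8", protected="bf16"):
    """Map each layer name to the precision it should be stored and run in."""
    plan = {}
    for name in layer_names:
        is_protected = any(fnmatch.fnmatchcase(name, p) for p in protected_patterns)
        plan[name] = protected if is_protected else default
    return plan


# Hypothetical module names for a 34-block single-stream backbone.
SUBLAYERS = ["attn.qkv", "attn.out", "ffn.up_proj", "ffn.down_proj"]
layer_names = [f"blocks.{i}.{sub}" for i in range(34) for sub in SUBLAYERS]

plan = precision_plan(layer_names, PROTECTED_PATTERNS)
n_protected = sum(1 for p in plan.values() if p == "bf16")
print(f"{n_protected} of {len(plan)} layers kept in higher precision")
```

Keeping the plan as data makes an ablation straightforward: swapping the pattern list and regenerating images is enough to compare alternative protection sets.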

If this is right

INT8 improves CLIP score over NF4 by +1.9 with a confidence interval that excludes zero.
GGUF Q4_K beats NF4 at equal on-disk size and is the Pareto winner on the quality-memory frontier.
Q8_0 quantization is quality neutral relative to the FP8 baseline.
Per-category OCR confirms text legibility is preserved under the INT8 recipe.
No on-disk size reduction occurs versus FP8, so speed gains on Ampere hardware require a fused INT8 kernel.
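The last point follows from simple arithmetic: FP8 and INT8 both store one byte per weight. A rough sketch of that calculation is below, assuming the quoted 9.3B parameter count and nominal bit widths. It ignores per-block scale overhead, which real 4-bit and GGUF formats add, and the page does not say whether 9.3B covers one or both classifier-free-guidance weight copies.

```python
def weight_footprint_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (10**9 bytes), scales excluded."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 9.3e9  # parameter count quoted for the model

# Nominal bit widths only; block-scale metadata is deliberately left out.
footprints = {
    name: weight_footprint_gb(N_PARAMS, bits)
    for name, bits in [("FP8", 8), ("INT8", 8), ("nominal 4-bit", 4)]
}
for name, gb in footprints.items():
    print(f"{name}: {gb:.2f} GB")
```

Since the 8-bit footprints are identical, any benefit of INT8 over FP8 on Ampere has to come from compatibility or from faster kernels, not from memory savings.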

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same protection set for FFN down-projections may generalize to other flow-matching DiTs if fragility concentrates in the same block types.
On hardware that already supports FP8, the INT8 path mainly offers broader compatibility rather than memory savings.
Ablating protection on additional layer types could reveal whether the current small set is minimal or whether further quality headroom exists.
The Pareto dominance of GGUF Q4_K suggests that hybrid GGUF formats may be worth testing on other diffusion backbones at similar bit widths.

Load-bearing premise

The 200-prompt benchmark and the particular choice of which layers receive mixed-precision protection are representative of real user prompts and model behavior.

What would settle it

Running the identical INT8 and FP8 models on a fresh 200-prompt set drawn from a different distribution and checking whether the paired bootstrap CI for the Pick or CLIP difference still contains zero.

Figures

Figures reproduced from arXiv: 2606.12280 by Ali Asaria, Deep Gandhi, Tony Salomone.

**Figure 2.** General scenes across FP8, INT8 (ours), NF4, and Q4_K (ours) at fixed seed. INT8 tracks ...

Original abstract

Post-training quantization lets large text-to-image diffusion transformers run on consumer GPUs, yet the hardware-specific trade-offs are seldom measured directly. We quantize Ideogram 4.0 - a 9.3B flow-matching diffusion transformer (DiT), shipped as two separate-weight copies of a single-stream 34-layer backbone for classifier-free guidance and conditioned by a Qwen3-VL-8B encoder - for Ampere RTX 3090 GPUs, which lack FP8 tensor cores. Our INT8 W8A8 recipe (per-channel weights, per-token dynamic activations, SmoothQuant, and mixed-precision protection of a small high-fragility layer set) holds the FP8 quality ceiling: on a 200-prompt benchmark the paired same-seed bootstrap CI for INT8-FP8 includes zero on both Pick and CLIP, while INT8 improves on NF4 by $+1.9$ CLIP (95% CI $[+1.21,+2.64]$, excluding zero). A per-category OCR analysis, to our knowledge unreported for this model class, confirms text legibility is preserved, and an ablation isolates protection of the FFN down-projections as the dominant quality lever. Our GGUF Q4_K quantization beats NF4 at equal on-disk size and is the Pareto winner on the quality-memory frontier, with paired confidence intervals excluding zero (Q8_0 is quality neutral). Finally, we characterize where 8-bit quantization helps and where it does not: INT8's weights match FP8's footprint rather than shrink it, so a speed gain on Ampere awaits a fused INT8 kernel.
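The SmoothQuant step named in the abstract can be sketched as follows. SmoothQuant (Xiao et al., 2022) moves activation outliers into the weights with a per-input-channel factor s_j = max|X_j|^α / max|W_j|^(1-α), dividing activations by s_j and multiplying the matching weight column by s_j so the float output is unchanged. This is a toy illustration: α = 0.5 and the tiny calibration batch are assumptions, and the helper names are made up for the example.

```python
def smoothquant_scales(x_calib, w, alpha=0.5, eps=1e-8):
    """Per-input-channel factors s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    n_in = len(w[0])
    scales = []
    for j in range(n_in):
        act_max = max(abs(row[j]) for row in x_calib)
        wgt_max = max(abs(row[j]) for row in w)
        scales.append(max(act_max, eps) ** alpha / max(wgt_max, eps) ** (1 - alpha))
    return scales


def apply_smoothing(x, w, scales):
    """Divide activations and multiply weights by the same per-channel factor."""
    x_s = [[v / s for v, s in zip(row, scales)] for row in x]
    w_s = [[v * s for v, s in zip(row, scales)] for row in w]
    return x_s, w_s


def float_linear(x, w):
    """x @ w.T, with w stored as out_features x in_features."""
    return [[sum(a * b for a, b in zip(xr, wr)) for wr in w] for xr in x]


# Toy data: input channel 1 carries a large activation outlier.
x = [[0.1, 40.0, -0.2], [-0.3, -35.0, 0.1]]
w = [[0.5, 0.02, -0.4], [0.3, -0.01, 0.6]]

scales = smoothquant_scales(x, w)
x_s, w_s = apply_smoothing(x, w, scales)
before = max(abs(v) for row in x for v in row)
after = max(abs(v) for row in x_s for v in row)
print(f"max |activation| before={before:.2f} after={after:.3f}")
```

In the original SmoothQuant method the division of activations is folded offline into the preceding layer, so the smoothing adds no run-time cost; the explicit division here is only for clarity.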

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

INT8 matches FP8 on their 200-prompt set for Ideogram 4.0 after tuning a small layer protection set, but that equivalence is not shown to hold outside the benchmark.

The paper applies known quantization tricks—per-channel weights, per-token activations, SmoothQuant, and mixed-precision on a few fragile layers—to Ideogram 4.0 and reports that the resulting INT8 W8A8 version lands inside the FP8 bootstrap CI on Pick and CLIP scores for their 200-prompt test. They also show GGUF Q4_K beating NF4 at the same file size and include an ablation that points to FFN down-projections as the main lever, plus a per-category OCR check on text legibility.

What stands out is the use of paired same-seed bootstrap intervals and the explicit ablation; that is more disciplined than most post-training quantization notes. The OCR breakdown is a small but useful addition for this model class.

The soft spot is exactly the one the stress-test flags. The protection set was identified via ablation on the same 200 prompts, and no separate held-out prompt distribution is reported. That makes the central claim—that INT8 holds the FP8 ceiling—tied to a benchmark that may have influenced the recipe. The paper does not show the same layer choices work on other prompt sets or other models, and the speed claim is left for future kernels. The 200-prompt size is modest for a diffusion model.

This is useful reading for anyone shipping local DiT inference on Ampere cards who needs concrete numbers rather than theory. It is not a foundational result, but the measurements are honest enough that a referee should see it. I would send it to review and ask the authors to either enlarge the prompt set or demonstrate that the protection set transfers.

Referee Report

2 major / 2 minor

Summary. The paper claims that an INT8 W8A8 post-training quantization recipe (per-channel weights, per-token dynamic activations, SmoothQuant, plus mixed-precision protection of a small high-fragility layer set) for the 9.3B Ideogram 4.0 DiT holds the FP8 quality ceiling. On a fixed 200-prompt benchmark, paired same-seed bootstrap CIs for the INT8-FP8 difference include zero on both Pick and CLIP scores; INT8 also improves over NF4 (+1.9 CLIP, CI excluding zero). GGUF Q4_K is reported as Pareto-optimal on the quality-memory frontier, an ablation isolates FFN down-projections as the dominant protection target, and a per-category OCR analysis shows preserved text legibility. The work targets Ampere GPUs lacking FP8 tensor cores.

Significance. If the statistical equivalence and ablation results hold under broader conditions, the recipe would enable practical deployment of large flow-matching DiTs on consumer hardware without FP8 support. The paired bootstrap CIs and explicit ablation constitute reproducible empirical strengths; the GGUF comparison and OCR analysis add practical value. The central limitation is that all claims rest on a single 200-prompt set and a benchmark-tuned protection set whose transferability is untested.

major comments (2)

[Abstract / Results] Abstract and Results section: the claim that the INT8 recipe 'holds the FP8 quality ceiling' is load-bearing on the 200-prompt benchmark and the specific fragility-layer protection set. No description is given of how the 200 prompts were sampled, whether any data exclusion rules were applied, or whether the protection set was chosen independently of this benchmark (e.g., via a separate validation split). This directly affects whether the zero-inclusive CI can be interpreted as general rather than benchmark-specific.
[Results (ablation)] Ablation paragraph (Results): while the ablation isolates FFN down-projections as dominant, the manuscript does not report the exact list of protected layers, the sensitivity of the Pick/CLIP CIs to alternative protection choices, or any test of whether the same set remains effective under prompt distribution shift. These omissions make the 'dominant quality lever' claim difficult to assess for robustness.

minor comments (2)

[Abstract] The abstract states the OCR analysis is 'to our knowledge unreported for this model class'; a brief literature pointer or explicit search statement would strengthen this claim.
[Methods] Notation for quantization formats (W8A8, Q4_K, NF4, Q8_0) would benefit from a short summary table early in the methods to improve readability.

Simulated Author's Rebuttal

2 responses · 2 unresolved

We thank the referee for the constructive feedback on reproducibility and robustness. We respond point-by-point to the major comments below.

Referee: [Abstract / Results] Abstract and Results section: the claim that the INT8 recipe 'holds the FP8 quality ceiling' is load-bearing on the 200-prompt benchmark and the specific fragility-layer protection set. No description is given of how the 200 prompts were sampled, whether any data exclusion rules were applied, or whether the protection set was chosen independently of this benchmark (e.g., via a separate validation split). This directly affects whether the zero-inclusive CI can be interpreted as general rather than benchmark-specific.

Authors: We agree that the manuscript provides no description of prompt sampling, exclusion rules, or whether the protection set was selected independently of the benchmark. In revision we will add explicit details on prompt selection and state that the protection set was tuned on this benchmark, so the equivalence result is benchmark-specific. The paired bootstrap CIs still demonstrate no detectable difference on the evaluated set. revision: yes

Referee: [Results (ablation)] Ablation paragraph (Results): while the ablation isolates FFN down-projections as dominant, the manuscript does not report the exact list of protected layers, the sensitivity of the Pick/CLIP CIs to alternative protection choices, or any test of whether the same set remains effective under prompt distribution shift. These omissions make the 'dominant quality lever' claim difficult to assess for robustness.

Authors: We will add the exact list of protected layers in the revision. The ablation shows FFN down-projections as the main lever on this benchmark. We did not perform sensitivity tests on alternative layer sets or evaluate under prompt distribution shift; these would require new experiments. revision: partial

standing simulated objections not resolved

Effectiveness of the protection set under prompt distribution shift
Sensitivity of Pick/CLIP CIs to alternative protection choices

Circularity Check

0 steps flagged

No circularity; empirical benchmark results with no self-referential derivations

The paper reports post-training quantization experiments, benchmark metrics (Pick, CLIP, OCR), bootstrap CIs, and ablations on a fixed 200-prompt set. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. Central claims rest on direct measurements rather than any derivation that reduces to its own inputs by construction. The 200-prompt benchmark and layer-protection choice are empirical choices whose generalization is a separate validity concern, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that the 200-prompt benchmark is sufficient to detect quality differences and that the chosen fragile-layer set is stable across seeds and prompts. No free parameters or invented entities are described in the abstract.

Reference graph

Works this paper leans on

25 extracted references · 11 linked inside Pith

[1]

CharGen: High Accurate Character-Level Visual Text Generation Model with MultiModal Encoder

Lichen Ma, Tiezhu Yue, Pei Fu, Yujie Zhong, Kai Zhou, Xiaoming Wei, and Jie Hu. CharGen: High Accurate Character-Level Visual Text Generation Model with MultiModal Encoder. arXiv:2412.17225 [cs.CV], 2024

arXiv 2024
[2]

FonTS: Text Rendering with Typography and Style Controls

Wenda Shi, Yiren Song, Dengming Zhang, Jiaming Liu, and Xingxing Zou. FonTS: Text Rendering with Typography and Style Controls. arXiv:2412.00136 [cs.CV], 2024

arXiv 2024
[3]

Q-Sched: Pushing the Boundaries of Few-Step Diffusion Models with Quantization-Aware Scheduling

Natalia Frumkin and Diana Marculescu. Q-Sched: Pushing the Boundaries of Few-Step Diffusion Models with Quantization-Aware Scheduling. arXiv:2509.01624 [cs.CV], 2025

arXiv 2025
[4]

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. arXiv:2211.10438 [cs.CL], 2022

arXiv 2022
[5]

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. arXiv:2208.07339 [cs.LG], 2022

Pith/arXiv arXiv 2022
[6]

ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation

Tianchen Zhao, Tongcheng Fang, Haofeng Huang, Enshu Liu, Rui Wan, Widyadewi Soedarmadji, Shiyao Li, Zinan Lin, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, and Yu Wang. ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation. arXiv:2406.02540 [cs.CV], 2024

arXiv 2024
[7]

SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, and Song Han. SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models. arXiv:2411.05007 [cs.CV], 2024

arXiv 2024
[8]

FP4DiT: Towards Effective Floating Point Quantization for Diffusion Transformers

Ruichen Chen, Keith G. Mills, and Di Niu. FP4DiT: Towards Effective Floating Point Quantization for Diffusion Transformers. arXiv:2503.15465 [cs.CV], 2025

arXiv 2025
[9]

PTQD: Accurate Post-Training Quantization for Diffusion Models

Yefei He, Luping Liu, Jing Liu, Weijia Wu, Hong Zhou, and Bohan Zhuang. PTQD: Accurate Post-Training Quantization for Diffusion Models. arXiv:2305.10657 [cs.CV], 2023

arXiv 2023
[10]

PQD: Post-training Quantization for Efficient Diffusion Models

Jiaojiao Ye, Zhen Wang, and Linnan Jiang. PQD: Post-training Quantization for Efficient Diffusion Models. arXiv:2501.00124 [cs.CV], 2024

arXiv 2024
[11]

Efficiency Meets Fidelity: A Novel Quantization Framework for Stable Diffusion

Shuaiting Li, Juncan Deng, Zeyu Wang, Kedong Xu, Rongtao Deng, Hong Gu, Haibin Shen, and Kejie Huang. Efficiency Meets Fidelity: A Novel Quantization Framework for Stable Diffusion. arXiv:2412.06661 [cs.CV], 2024

arXiv 2024
[12]

DiRotQ: Rotation-Aware Quantization for 4-bit Diffusion Transformers

Sayeh Sharify, Mahsa Salmani, and Hesham Mostafa. DiRotQ: Rotation-Aware Quantization for 4-bit Diffusion Transformers. arXiv:2605.16732 [cs.CV], 2026

Pith/arXiv arXiv 2026
[13]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv:2210.17323 [cs.LG], 2022

Pith/arXiv arXiv 2022
[14]

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv:2306.00978 [cs.CL], 2023

Pith/arXiv arXiv 2023
[15]

QLoRA: Efficient Finetuning of Quantized LLMs

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314 [cs.LG], 2023

Pith/arXiv arXiv 2023
[16]

Accurate Compression of Text-to-Image Diffusion Models via Vector Quantization

Vage Egiazarian, Denis Kuznedelev, Anton Voronov, Ruslan Svirschevski, Michael Goin, Daniil Pavlov, Dan Alistarh, and Dmitry Baranchuk. Accurate Compression of Text-to-Image Diffusion Models via Vector Quantization. arXiv:2409.00492 [cs.CV], 2024

arXiv 2024
[17]

Scalable Diffusion Models with Transformers

William Peebles and Saining Xie. Scalable Diffusion Models with Transformers. arXiv:2212.09748 [cs.CV], 2022

Pith/arXiv arXiv 2022
[18]

Timestep-Aware SVDQuant-GPTQ for W4A4 Quantization of Wan2.2-I2V

Junhao Wu, Dezhong Yao, and Hai Jin. Timestep-Aware SVDQuant-GPTQ for W4A4 Quantization of Wan2.2-I2V. arXiv:2605.27003 [cs.CV], 2026

Pith/arXiv arXiv 2026
[19]

Boundary-Protection W8A8 HiFloat8 Quantization for Large-Scale Text-to-Video Diffusion Transformers

Yiming Zhao. Boundary-Protection W8A8 HiFloat8 Quantization for Large-Scale Text-to-Video Diffusion Transformers. arXiv:2606.00957 [cs.CV], 2026

Pith/arXiv arXiv 2026
[20]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow Matching for Generative Modeling. arXiv:2210.02747 [cs.LG], 2022

Pith/arXiv arXiv 2022
[21]

Timestep Embedding Tells: It’s Time to Cache for Video Diffusion Model

Feng Liu, Shiwei Zhang, Xiaofeng Wang, Yujie Wei, Haonan Qiu, Yuzhong Zhao, Yingya Zhang, Qixiang Ye, and Fang Wan. Timestep Embedding Tells: It’s Time to Cache for Video Diffusion Model. arXiv:2411.19108 [cs.CV], 2024

arXiv 2024
[22]

Accelerating Rectified Flow Models via Trajectory-Aware Caching

Xiao Liu, Kai Liu, Naiyang Guan, Hongliang Lu, Zhixin Wang, Zhikai Chen, Renjing Pei, and Yulun Zhang. Accelerating Rectified Flow Models via Trajectory-Aware Caching. arXiv:2605.16789 [cs.CV], 2026

Pith/arXiv arXiv 2026
[23]

DeepCache: Accelerating Diffusion Models for Free

Xinyin Ma, Gongfan Fang, and Xinchao Wang. DeepCache: Accelerating Diffusion Models for Free. arXiv:2312.00858 [cs.CV], 2023

arXiv 2023
[24]

Q-DiT4SR: Exploration of Detail-Preserving Diffusion Transformer Quantization for Real-World Image Super-Resolution

Xun Zhang, Kaicheng Yang, Hongliang Lu, Haotong Qin, Yong Guo, and Yulun Zhang. Q-DiT4SR: Exploration of Detail-Preserving Diffusion Transformer Quantization for Real-World Image Super-Resolution. arXiv:2602.01273 [cs.CV], 2026

Pith/arXiv arXiv 2026
[25]

HDGlyph: A Hierarchical Disentangled Glyph-Based Framework for Long-Tail Text Rendering in Diffusion Models

Shuhan Zhuang, Mengqi Huang, Fengyi Fu, Nan Chen, Bohan Lei, and Zhendong Mao. HDGlyph: A Hierarchical Disentangled Glyph-Based Framework for Long-Tail Text Rendering in Diffusion Models. arXiv:2505.06543 [cs.CV], 2025

arXiv 2025
