Boundary-Protection W8A8 HiFloat8 Quantization for Large-Scale Text-to-Video Diffusion Transformers

Yiming Zhao

arxiv: 2606.00957 · v1 · pith:QMUATVLBnew · submitted 2026-05-31 · 💻 cs.CV

Boundary-Protection W8A8 HiFloat8 Quantization for Large-Scale Text-to-Video Diffusion Transformers

Yiming Zhao This is my paper

Pith reviewed 2026-06-28 17:46 UTC · model grok-4.3

classification 💻 cs.CV

keywords post-training quantizationtext-to-video diffusionboundary protectionHiFloat8diffusion transformersWan2.1-T2Vactivation analysis

0 comments

The pith

Protecting boundary blocks in full precision while quantizing the middle ones enables W8A8 HiFloat8 inference on a 14B text-to-video diffusion transformer with no measurable accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that activation statistics vary sharply across the 40 blocks of a large video diffusion transformer, with the first and last blocks differing from the middle ones. By keeping the first two and last three blocks in BF16 and applying W8A8 HiFloat8 only to the central 35 blocks, the method matches or slightly exceeds the full-precision baseline on all five VBench dimensions. An ablation across protection schemes confirms that the full boundary set gives the best average score. The work also notes conditions where adding quantization-aware training does not improve results on single-card hardware.

Core claim

A boundary-protection post-training quantization strategy retains the first two and last three WanAttentionBlocks in BF16 while quantizing the remaining 35 blocks to W8A8 HiFloat8; this configuration matches or marginally exceeds the BF16 baseline across all VBench dimensions on the 14B Wan2.1-T2V-14B model within the 5-prompt test set, and an ablation verifies that this exact protection choice outperforms the other three configurations examined.

What carries the argument

Boundary-protection strategy that uses per-block activation analysis to select which transformer blocks remain in BF16.

If this is right

Quantizing 35 of 40 blocks reduces memory footprint and inference cost on NPUs while preserving output quality.
The same per-block analysis can be repeated on other large DiT models to identify their own boundary blocks.
QAT provides no consistent gain over this PTQ recipe on single-card hardware.
Full boundary protection is the configuration that maximizes average VBench score among the four tested options.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The heterogeneous block statistics may appear in other diffusion transformer families, suggesting the protection pattern could transfer after similar analysis.
Extending the evaluation to hundreds of prompts would test whether the no-loss result holds outside the current small set.
The approach could be combined with other compression techniques such as pruning or distillation for further efficiency gains.

Load-bearing premise

The activation statistics measured on the 40 blocks remain stable enough that protecting exactly the first two and last three blocks will continue to work for other prompts and model configurations.

What would settle it

Running the same W8A8 model on a larger prompt set or different video tasks and measuring a clear drop below the BF16 VBench scores would show the protection choice does not generalize.

Figures

Figures reproduced from arXiv: 2606.00957 by Yiming Zhao.

**Figure 1.** Figure 1: Boundary-protection W8A8 HiFloat8 quantization for Wan2.1-T2V-14B. (a) The 40 attention blocks of the model: 5 boundary blocks {0, 1, 37, 38, 39} are kept in BF16 (red), and the remaining 35 are quantized to W8A8 HiF8 (blue). (b) Per-block max activation magnitude under BF16 inference: entry blocks sit at the residual root (any quantization noise propagates to all 39 downstream blocks), and exit blocks fee… view at source ↗

**Figure 2.** Figure 2: Per-block activation statistics across all 40 WanAttentionBlocks. Red [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Per-video VBench scores for all three methods. PTQ (blue) matches [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative comparison across all five prompts (rows) and three methods (columns). Each cell shows frame 24 (mid-sequence) of the generated video. [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: Ablation qualitative comparison (frame 24). Columns: four protection configurations. Full boundary protection (rightmost) produces the sharpest details [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

read the original abstract

We present a post-training quantization (PTQ) approach for Wan2.1-T2V-14B, a 14-billion-parameter text-to-video diffusion transformer, targeting the W8A8 HiFloat8 (HiF8) format on Ascend 910B NPUs. A central challenge in quantizing video DiT models is the heterogeneous activation distribution across transformer blocks: boundary blocks (the first and last few blocks) exhibit fundamentally different statistical properties from middle blocks, making uniform quantization ineffective. We conduct a systematic per-block activation analysis across all 40 WanAttentionBlocks and use the findings to motivate a boundary-protection strategy that retains the first two and last three blocks in BF16 while quantizing the remaining 35 blocks with W8A8 HiF8. The proposed PTQ method matches or marginally exceeds the BF16 baseline on all five VBench dimensions evaluated, indicating no measurable accuracy loss within the 5-prompt evaluation set. An ablation study over four protection configurations confirms that full boundary protection yields the highest average VBench score, validating the data-driven block selection. We additionally investigate quantization-aware training (QAT) as a complementary fine-tuning stage and analyze the conditions under which it fails to outperform plain PTQ on single-card hardware.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper shows a data-driven way to quantize most blocks of a 14B video DiT to W8A8 HiF8 while keeping boundary blocks in BF16, and it matches the baseline on their five-prompt test.

read the letter

The core result is that a per-block activation scan on the 40 WanAttentionBlocks revealed clear differences at the start and end, so they protect blocks 1-2 and 38-40 in full precision and quantize the middle 35. That choice, backed by an ablation of four protection setups, produces VBench scores that match or slightly beat BF16 on the five prompts they tried.

What stands out is the systematic activation analysis that led to the boundary rule instead of a blanket approach. They also checked QAT as a follow-up and noted when it does not beat plain PTQ on single-card hardware. That kind of targeted engineering detail is useful for people who actually ship these models.

The soft spot is the evaluation size. Five prompts with no error bars or statistical checks leaves open the question of whether the same blocks stay the outliers when the prompt, timestep, or text embedding changes. The stress-test note is on point here: if activation ranges shift, the fixed protection mask could lose its edge. The work is also tied to one model and one NPU platform, so the numbers are not meant to generalize broadly.

This is for engineers who need to run large video diffusion models under tight memory on Ascend hardware and want a concrete recipe plus the analysis that produced it. A reader already working on DiT quantization or deployment will pick up the per-block method and the ablation results.

It is worth sending to peer review. The empirical claim is narrow but cleanly supported by the ablation, and the activation analysis gives other groups something to test on their own models.

Referee Report

2 major / 2 minor

Summary. The paper introduces a post-training quantization (PTQ) technique for the 14B-parameter Wan2.1-T2V text-to-video diffusion transformer targeting W8A8 HiFloat8 on Ascend NPUs. Through per-block activation analysis on the 40 WanAttentionBlocks, it identifies that boundary blocks have distinct statistics and proposes protecting the first two and last three blocks in BF16 while quantizing the middle 35 blocks. The method is shown to match or exceed BF16 performance on all five VBench dimensions using five prompts, supported by an ablation study on protection configurations and analysis of QAT.

Significance. Should the boundary-protection strategy prove robust across diverse prompts and models, this work would be significant for efficient deployment of large-scale video generation models on specialized hardware. It provides an empirical solution to the heterogeneous activation issue in DiTs without full model retraining. The systematic analysis and ablation offer practical insights, though the small evaluation scale limits broader claims.

major comments (2)

[Abstract and Evaluation] Abstract and Evaluation: The central claim of matching BF16 performance 'indicating no measurable accuracy loss' rests on results from only five prompts (Abstract). Since the per-block activation analysis used to select exactly the first two and last three blocks for protection was performed on the same prompt set, the evaluation does not test generalization; prompt-dependent shifts in activation ranges could render the fixed protection mask suboptimal.
[Per-block analysis section] Per-block analysis section: The manuscript does not clarify if the activation statistics for block selection were computed on a held-out calibration set distinct from the five evaluation prompts. This detail is load-bearing for assessing whether the observed parity is due to the method or to the data-driven choice being tuned to the test prompts.

minor comments (2)

[Notation] The term 'HiFloat8 (HiF8)' should be defined more explicitly with its format details upon first use to aid readers unfamiliar with the Ascend-specific format.
[Figures] The per-block activation range plots (presumably in the analysis section) would benefit from error bars across multiple runs or prompts to show variability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the evaluation scale and calibration details. We address each point below and will revise the manuscript for greater clarity on these aspects.

read point-by-point responses

Referee: [Abstract and Evaluation] Abstract and Evaluation: The central claim of matching BF16 performance 'indicating no measurable accuracy loss' rests on results from only five prompts (Abstract). Since the per-block activation analysis used to select exactly the first two and last three blocks for protection was performed on the same prompt set, the evaluation does not test generalization; prompt-dependent shifts in activation ranges could render the fixed protection mask suboptimal.

Authors: We agree the evaluation uses only five prompts and that block selection derives from analysis on this same set. The abstract already qualifies results as holding 'within the 5-prompt evaluation set.' This reflects standard PTQ practice where calibration data matches the target distribution. The ablation over protection configurations shows the selected boundaries yield the highest score on these prompts. We will revise to explicitly discuss the limited scope and lack of generalization testing, without claiming broader robustness. revision: partial
Referee: [Per-block analysis section] Per-block analysis section: The manuscript does not clarify if the activation statistics for block selection were computed on a held-out calibration set distinct from the five evaluation prompts. This detail is load-bearing for assessing whether the observed parity is due to the method or to the data-driven choice being tuned to the test prompts.

Authors: The activation statistics were computed on the same five evaluation prompts; the manuscript does not describe a distinct held-out calibration set. We will revise the per-block analysis section to state this explicitly. The data-driven selection is thus tuned to the evaluated prompts, which is a limitation. However, the systematic per-block analysis and ablation still provide evidence that boundary blocks exhibit distinct statistics warranting protection. The revision will clarify this without overstating generalization. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical PTQ selection is data-driven but not self-referential by construction

full rationale

The paper performs per-block activation analysis on the 5-prompt set to select the boundary-protection mask (first two and last three of 40 blocks kept in BF16), then directly measures VBench scores on the identical set and runs an ablation over four masks. No derivation chain, equation, or first-principles claim reduces to its own inputs; the reported parity with BF16 is a measured outcome rather than a fitted or self-defined prediction. No self-citations, uniqueness theorems, or ansatzes appear in the abstract or described method. The approach is self-contained empirical engineering whose generalization risk is a separate validity question, not circularity.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

The central claim rests on empirical per-block statistics used to select which blocks to protect; no explicit free parameters beyond the data-driven choice of boundaries are stated.

free parameters (1)

number of boundary blocks to protect
First two and last three blocks selected after systematic activation analysis across 40 blocks; this choice is fitted to observed statistics rather than derived from first principles.

pith-pipeline@v0.9.1-grok · 5754 in / 1186 out tokens · 19138 ms · 2026-06-28T17:46:51.685339+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Holding the FP8 Quality Ceiling at 8-Bit Weights and Activations: INT8 and GGUF Post-Training Quantization of Ideogram 4.0 for Consumer GPUs
cs.LG 2026-06 unverdicted novelty 4.0

INT8 W8A8 post-training quantization of Ideogram 4.0 preserves FP8 quality on a 200-prompt benchmark while outperforming NF4 on CLIP score and offering a favorable quality-memory trade-off via GGUF Q4_K.

Reference graph

Works this paper leans on

9 extracted references · 1 linked inside Pith · cited by 1 Pith paper

[1]

Scalable diffusion models with transformers,

W. Peebles and S. Xie, “Scalable diffusion models with transformers,” inProc. ICCV, 2023, pp. 4195–4205

2023
[2]

Wan: Open and advanced large-scale video generative models,

Wan Team, “Wan: Open and advanced large-scale video generative models,”arXiv:2503.20314, 2025

Pith/arXiv arXiv 2025
[3]

GPTQ: Accurate post-training quantization for gen- erative pre-trained transformers,

E. Frantaret al., “GPTQ: Accurate post-training quantization for gen- erative pre-trained transformers,” inProc. ICLR, 2023

2023
[4]

SmoothQuant: Accurate and efficient post-training quantization for large language models,

G. Xiaoet al., “SmoothQuant: Accurate and efficient post-training quantization for large language models,” inProc. ICML, 2023

2023
[5]

Q-Diffusion: Quantizing diffusion models,

X. Liet al., “Q-Diffusion: Quantizing diffusion models,” inProc. ICCV, 2023

2023
[6]

Q-DiT: Accurate post-training quantization for diffusion transformers,

Y . Chenet al., “Q-DiT: Accurate post-training quantization for diffusion transformers,”arXiv:2406.17343, 2024

arXiv 2024
[7]

LLM.int8(): 8-bit matrix multiplication for trans- formers at scale,

T. Dettmerset al., “LLM.int8(): 8-bit matrix multiplication for trans- formers at scale,” inProc. NeurIPS, 2022

2022
[8]

HiFloat8: An 8-bit floating point format with tapered precision encoding,

L. Luoet al., “HiFloat8: An 8-bit floating point format with tapered precision encoding,”arXiv:2409.16626, 2024

arXiv 2024
[9]

VBench: Comprehensive benchmark suite for video generative models,

Z. Huanget al., “VBench: Comprehensive benchmark suite for video generative models,” inProc. CVPR, 2024. Fig. 5. Ablation qualitative comparison (frame 24). Columns: four protection configurations. Full boundary protection (rightmost) produces the sharpest details and most vivid colors

2024

[1] [1]

Scalable diffusion models with transformers,

W. Peebles and S. Xie, “Scalable diffusion models with transformers,” inProc. ICCV, 2023, pp. 4195–4205

2023

[2] [2]

Wan: Open and advanced large-scale video generative models,

Wan Team, “Wan: Open and advanced large-scale video generative models,”arXiv:2503.20314, 2025

Pith/arXiv arXiv 2025

[3] [3]

GPTQ: Accurate post-training quantization for gen- erative pre-trained transformers,

E. Frantaret al., “GPTQ: Accurate post-training quantization for gen- erative pre-trained transformers,” inProc. ICLR, 2023

2023

[4] [4]

SmoothQuant: Accurate and efficient post-training quantization for large language models,

G. Xiaoet al., “SmoothQuant: Accurate and efficient post-training quantization for large language models,” inProc. ICML, 2023

2023

[5] [5]

Q-Diffusion: Quantizing diffusion models,

X. Liet al., “Q-Diffusion: Quantizing diffusion models,” inProc. ICCV, 2023

2023

[6] [6]

Q-DiT: Accurate post-training quantization for diffusion transformers,

Y . Chenet al., “Q-DiT: Accurate post-training quantization for diffusion transformers,”arXiv:2406.17343, 2024

arXiv 2024

[7] [7]

LLM.int8(): 8-bit matrix multiplication for trans- formers at scale,

T. Dettmerset al., “LLM.int8(): 8-bit matrix multiplication for trans- formers at scale,” inProc. NeurIPS, 2022

2022

[8] [8]

HiFloat8: An 8-bit floating point format with tapered precision encoding,

L. Luoet al., “HiFloat8: An 8-bit floating point format with tapered precision encoding,”arXiv:2409.16626, 2024

arXiv 2024

[9] [9]

VBench: Comprehensive benchmark suite for video generative models,

Z. Huanget al., “VBench: Comprehensive benchmark suite for video generative models,” inProc. CVPR, 2024. Fig. 5. Ablation qualitative comparison (frame 24). Columns: four protection configurations. Full boundary protection (rightmost) produces the sharpest details and most vivid colors

2024