Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

2025 · cs.CL · arXiv 2512.02010

12 Pith papers cite this work. Polarity classification is still indexing.

abstract

As large language models have grown larger, interest has grown in low-precision numerical formats such as NVFP4 as a way to improve speed and reduce memory usage. However, quantizing models to NVFP4 remains challenging as the lack of precision generally degrades model performance. In this work, we address this issue with Four Over Six (4/6), a modification to the block-scaled NVFP4 quantization algorithm that yields reduced quantization error. Unlike integer formats, floating point formats have non-uniform step sizes which create larger quantization error on larger values. 4/6 takes advantage of this by adaptively scaling some blocks to smaller FP4 values, making the distribution of representable values more uniform and reducing quantization error for near-maximal values. We show that 4/6 can be implemented efficiently on modern hardware accelerators, resulting in performance gains during both pre-training and inference with minimal computational overhead. In pre-training experiments with the Nemotron 3 Nano 30B-A3B model architecture, we find that 4/6 brings training loss closer to BF16 compared to models trained with current state-of-the-art NVFP4 training recipes. Our code is available at https://github.com/mit-han-lab/fouroversix.
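
The adaptive block scaling described in the abstract can be illustrated with a small sketch. It is not the authors' implementation (see the linked repository for that). It assumes the standard FP4 E2M1 magnitudes (0, 0.5, 1, 1.5, 2, 3, 4, 6), takes the two candidate block maxima 6 and 4 from the method's name, and picks between them by mean squared error, which is an assumed selection criterion. The scale is kept as a plain Python float, so the FP8 block-scale rounding used by NVFP4 is omitted. All function names are hypothetical.

```python
# Magnitudes representable in FP4 E2M1. The gap between 4 and 6 is the largest step.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def round_to_fp4(v):
    """Round v to the nearest signed FP4 value (ties go to the smaller magnitude)."""
    mag = min(FP4_GRID, key=lambda g: abs(abs(v) - g))
    return mag if v >= 0 else -mag


def quantize_block(block, target_max):
    """Scale the block so its largest magnitude maps to target_max, then round
    each element to FP4 and scale back (fake quantization)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / target_max  # assumption: kept in full precision, no FP8 rounding
    return [round_to_fp4(v / scale) * scale for v in block]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def four_over_six(block):
    """Quantize with block maximum 6 and with block maximum 4, and keep whichever
    reconstruction has the lower error (6 wins ties)."""
    candidates = {m: quantize_block(block, m) for m in (6.0, 4.0)}
    best = min(candidates, key=lambda m: mse(block, candidates[m]))
    return best, candidates[best]


if __name__ == "__main__":
    # 5.0 sits in the wide gap between 4 and 6 when the block is scaled to 6.
    print(four_over_six([6.0, 5.0, 1.0, 0.0]))
    # Every value here is already on the FP4 grid when the block is scaled to 6.
    print(four_over_six([6.0, 3.0, 1.0, 0.5]))
```

In the first block, scaling to 4 places the near-maximal value 5.0 between grid points that are closer together, so that candidate has the lower error. In the second block, scaling to 6 reproduces every value exactly and is kept.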

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

GPTQ-intrinsic LoRA: A Near-optimal Algorithm for Low-precision Quantization with Low-rank Adaptation

cs.LG · 2026-05-31 · unverdicted · novelty 8.0

GPTQ-intrinsic LoRA augments GPTQ with intrinsic low-rank compensation via Hessian modification to achieve layer-wise reconstruction bounds that match information-theoretic lower bounds under structural assumptions.

Grid Games: The Power of Multiple Grids for Quantizing Large Language Models

cs.LG · 2026-05-12 · accept · novelty 8.0

Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.

LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell and other GPUs.

Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor

cs.LG · 2026-05-19 · unverdicted · novelty 6.0 · 3 refs

MXFP4 quantization error decomposes into scale bias, deadzone truncation, and grid noise; mode-targeted corrections recover BF16 accuracy within 0.7% on Qwen2.5-3B and exceed it by 1.0% on Qwen3-30B-A3B.

SOAR: Scale Optimization for Accurate Reconstruction in NVFP4 Quantization

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

SOAR improves NVFP4 post-training quantization accuracy for LLMs by analytically solving joint scale optimization and searching decoupled scales.

Normalized Architectures are Natively 4-Bit

cs.LG · 2026-05-07 · conditional · novelty 6.0

nGPT's hypersphere constraint makes dot-product signal accumulate constructively under 4-bit quantization while noise averages out, enabling native low-precision training.

QuantClaw: Precision Where It Matters for OpenClaw

cs.AI · 2026-04-24 · unverdicted · novelty 6.0

QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.

MixFP4: Enhancing NVFP4 with Adaptive FP4/INT4 Block Representations

cs.AR · 2026-05-29 · unverdicted · novelty 5.0

MixFP4 extends NVFP4 by adaptively selecting between two FP4 micro-formats per block using repurposed scale sign bits and a unified E2M2 compute path, claiming better accuracy than standard NVFP4 at 3.1% area and 1.5% power overhead.

Finer is Better (with the Right Scaling)

cs.LG · 2026-05-08 · unverdicted · novelty 5.0 · 2 refs

The block-size paradox in LLM microscaling is caused by underflow in subnormal E4M3 scaling factors; preventing underflow and using 4-over-6 selection resolves it, with brute-force confirming MSE strictly improves as blocks get finer.

A Hardware-Aware, Per-Layer Methodology for Post-Training Quantization of Large Language Models

cs.LG · 2026-05-14 · unverdicted · novelty 4.0

SOP post-training quantization for LLMs reports lower weight reconstruction error than per-layer FP8 at 1.5 bpw lower cost using per-layer codebook search and hardware-aware formats.

DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization

cs.CV · 2026-04-20 · unverdicted · novelty 4.0

DuQuant++ adapts outlier-aware fine-grained rotation to MXFP4 by matching block size to the 32-element microscaling group, enabling a single rotation that smooths distributions and achieves SOTA performance on LLaMA-3 with lower cost.

HiFloat4 Format for Language Model Pre-training on Ascend NPUs

cs.LG · 2026-04-09 · unverdicted · novelty 4.0

HiFloat4 FP4 with stabilization techniques trains dense and MoE language models on Ascend NPUs at relative error within 1% of full-precision baselines.
