Pith: machine review for the scientific record.

arxiv: 2510.27527 · v3 · submitted 2025-10-31 · 💻 cs.LG · cs.AI


TetraJet-v2: Accurate NVFP4 Training for Large Language Models with Oscillation Suppression and Outlier Control

keywords: training, NVFP4, LLMs, models, oscillation, TetraJet-v2, algorithm, end-to-end
Abstract

Training Large Language Models (LLMs) is prohibitively expensive, driving interest in low-precision fully-quantized training (FQT). While novel 4-bit formats like NVFP4 offer substantial efficiency gains, achieving near-lossless training at such low precision remains challenging. We introduce TetraJet-v2, an end-to-end 4-bit FQT method that leverages NVFP4 for activations, weights, and gradients in all linear layers. We identify two critical issues hindering low-precision LLM training: weight oscillation and outliers. To address these, we propose: 1) an unbiased double-block quantization method for NVFP4 linear layers with practically optimal convergence in LLM training, 2) OsciReset, the first effective algorithm to suppress LLMs' weight oscillation bottleneck, and 3) OutControl, a mixed-precision algorithm to retain outlier accuracy. TetraJet-v2 outperforms prior methods on FP4 pre-training for LLMs across models up to 370M parameters trained on up to 212B tokens, reducing the performance gap to BF16 by an average of 51.3% while enabling a 1.67x end-to-end speedup over FP8. The code is available at https://github.com/thu-ml/TetraJet-v2-NVFP4Training.
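To make the block-scaled FP4 idea in the abstract concrete, here is a minimal NumPy sketch of an NVFP4-style quantize-dequantize pass: values are grouped into 16-element micro-blocks, each block is scaled by its absolute maximum, and scaled values are rounded to the nearest FP4 (E2M1) magnitude. This is our own illustration, not the paper's implementation; real NVFP4 additionally stores the per-block scales in FP8 under a per-tensor FP32 scale (the "double-block" structure), which is omitted here for brevity.

```python
import numpy as np

# Representable non-negative FP4 (E2M1) magnitudes.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def fake_quant_nvfp4(x, block=16):
    """Quantize-dequantize a 1-D array with per-block absmax scaling (sketch).

    Simplification: per-block scales stay in FP32; NVFP4 proper quantizes
    them to FP8 under a second-level per-tensor scale.
    """
    x = np.asarray(x, dtype=np.float32)
    pad = (-x.size) % block
    blocks = np.pad(x, (0, pad)).reshape(-1, block)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    # Map each block's absmax onto the largest FP4 magnitude (6.0).
    scale = np.where(amax == 0, 1.0, amax / FP4_GRID[-1])
    scaled = blocks / scale
    # Round-to-nearest onto the FP4 magnitude grid, preserving sign.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return (q * scale).reshape(-1)[: x.size]
```

Values that already land on the scaled grid round-trip exactly (e.g. `[1.0, 2.0, 3.0, 6.0]` with block absmax 6.0); for arbitrary inputs, the per-block rounding error is bounded by half the widest grid gap times the block scale, i.e. at most `amax / 6` per block.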

This paper has not been read by Pith yet.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OSC: Hardware Efficient W4A4 Quantization via Outlier Separation in Channel Dimension

    cs.LG · 2026-04 · unverdicted · novelty 5.0

    OSC separates token-persistent outlier channels in activations into a compact high-precision tensor for dual-path 4-bit GEMM computation, limiting accuracy loss to roughly 1-2 points on Qwen3 models while delivering u...