Mixed Precision Training
33 Pith papers cite this work.
abstract
Deep neural networks have enabled progress in a wide variety of applications. Growing the size of the neural network typically results in improved accuracy. As model sizes grow, the memory and compute requirements for training these models also increase. We introduce a technique to train deep neural networks using half-precision floating point numbers. In our technique, weights, activations and gradients are stored in IEEE half-precision format. Half-precision floating point numbers have a limited numerical range compared to single-precision numbers. We propose two techniques to handle this loss of information. Firstly, we recommend maintaining a single-precision copy of the weights that accumulates the gradients after each optimizer step; this single-precision copy is rounded to half-precision for the forward and backward passes during training. Secondly, we propose scaling the loss appropriately to handle the loss of information with half-precision gradients. We demonstrate that this approach works for a wide variety of models including convolutional neural networks, recurrent neural networks and generative adversarial networks. This technique works for large-scale models with more than 100 million parameters trained on large datasets. Using this approach, we can reduce the memory consumption of deep learning models by nearly 2x. On future processors, we can also expect a significant computation speedup from half-precision hardware units.
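To make the abstract's two techniques concrete, here is a minimal sketch of a mixed precision training loop with an FP32 master weight copy and constant loss scaling. It assumes PyTorch on a CUDA GPU (FP16 kernel coverage on CPU is spotty); the toy linear model, learning rate, and loss scale of 1024 are illustrative placeholders rather than the paper's actual configuration.

```python
import torch

device = "cuda"  # assumes a CUDA GPU; FP16 ops are not uniformly supported on CPU
torch.manual_seed(0)

model = torch.nn.Linear(16, 1).to(device)  # toy stand-in for a real network

# Technique 1: keep an FP32 "master" copy of the weights; the optimizer owns it.
master_params = [p.detach().clone().float().requires_grad_(True)
                 for p in model.parameters()]
optimizer = torch.optim.SGD(master_params, lr=1e-2)

model.half()         # weights, activations, and gradients now live in FP16
loss_scale = 1024.0  # Technique 2: constant loss scale (illustrative value)

for step in range(10):
    x = torch.randn(8, 16, device=device, dtype=torch.float16)
    y = torch.randn(8, 1, device=device, dtype=torch.float16)

    loss = torch.nn.functional.mse_loss(model(x), y)
    # Scaling the loss keeps small FP16 gradients above the underflow threshold.
    (loss * loss_scale).backward()

    optimizer.zero_grad()
    for p_half, p_master in zip(model.parameters(), master_params):
        p_master.grad = p_half.grad.float() / loss_scale  # unscale in FP32
        p_half.grad = None
    optimizer.step()  # the update accumulates in FP32, so tiny deltas are not lost

    # Round the master weights to FP16 for the next forward/backward pass.
    with torch.no_grad():
        for p_half, p_master in zip(model.parameters(), master_params):
            p_half.copy_(p_master.half())
```

In real workloads the scale is typically chosen empirically or adjusted dynamically when overflows are detected; modern frameworks bundle this pattern (for example torch.cuda.amp.GradScaler), but the manual loop above maps one-to-one onto the abstract's two techniques.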
citing papers
- Efficient Training on Multiple Consumer GPUs with RoundPipe
RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.
- TransDot: An Area-efficient Reconfigurable Floating-Point Unit for Trans-Precision Dot-Product Accumulation for FPGA AI Engines
TransDot unifies SIMD FMA and trans-precision DPA in one reconfigurable FPU, achieving 2x FP16, 4x FP8, and 8x FP4 throughput with FP32 accumulation plus 1.46x to 2.92x area efficiency gains over the FPnew baseline.
- Mass Matrix Assembly on Tensor Cores for Implicit Particle-In-Cell Methods
Mass matrix assembly for implicit PIC methods can be exactly reformulated cell-by-cell as tensor-core matrix products, delivering up to 3x kernel speedup and 15% end-to-end runtime reduction in ECSIM simulations.
- From Characterization to Microarchitecture: Designing an Elegant and Reliable BFP-Based NPU
A BFP NPU microarchitecture using row/column blocking and per-path protections achieves near-DMR reliability at 3.55% geometric mean performance overhead and under 2% hardware cost.
- Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings
Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and showing strong masked language modeling results with or without positional embeddings.
- Multimodal Reasoning with LLM for Encrypted Traffic Interpretation: A Benchmark
Creates the BGTD benchmark and mmTraffic architecture to enable explainable multimodal interpretation of encrypted network traffic using LLMs.
- Hidden Reliability Risks in Large Language Models: Systematic Identification of Precision-Induced Output Disagreements
PrecisionDiff is a differential testing framework that uncovers widespread precision-induced behavioral disagreements in aligned LLMs, including safety-critical jailbreak divergences across precision formats.
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
- OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
- GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
A 3.5-billion-parameter diffusion model with classifier-free guidance generates images preferred over DALL-E by human raters and can be fine-tuned for text-guided inpainting.
- Diffusion Models Beat GANs on Image Synthesis
Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
- Generating Long Sequences with Sparse Transformers
Sparse Transformers factorize attention to handle sequences tens of thousands of timesteps long, achieving new SOTA density modeling on Enwik8, CIFAR-10, and ImageNet-64.
- LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
LoKA enables practical FP8 use in numerically sensitive large recommendation models via profiling, model adaptations, and runtime kernel orchestration.
- CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification
CuraView detects sentence-level faithfulness hallucinations in medical discharge summaries via GraphRAG knowledge graphs and multi-agent evidence grading, achieving 0.831 F1 on critical contradictions with a fine-tuned Qwen3-14B model and 50% relative improvement over baselines.
- Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study
Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.
- Training Time Prediction for Mixed Precision-based Distributed Training
A precision-aware predictor for distributed training time achieves 9.8% MAPE across precision settings, compared to errors up to 147.85% when precision is ignored.
- The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference
FP16 KV caching in transformers causes deterministic token divergence versus cache-free inference due to non-associative floating-point accumulation orderings.
- SHARE: Social-Humanities AI for Research and Education
SHARE models are the first causal LMs pretrained exclusively for the social sciences and humanities (SSH) and match general models like Phi-4 on SSH texts despite using 100 times fewer tokens, paired with a non-generative MIRROR interface to support scholarly review.
- LLM-PRISM: Characterizing Silent Data Corruption from Permanent GPU Faults in LLM Training
LLMs resist low-frequency permanent GPU faults but certain datapaths and precision formats trigger catastrophic training divergence even at moderate fault rates.
- MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning
MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter and memory usage.
- FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control
FlashSAC scales up Soft Actor-Critic with fewer updates, larger models, higher data throughput, and norm bounds to deliver faster, more stable training than PPO on high-dimensional robot control tasks across dozens of simulators.
- Neural Harmonic Textures for High-Quality Primitive Based Neural Reconstruction
Neural Harmonic Textures add periodic feature interpolation and deferred neural decoding to primitive representations, achieving state-of-the-art real-time novel-view synthesis and bridging primitive and neural-field methods.
- PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
PyTorch Fully Sharded Data Parallel enables training of significantly larger models than Distributed Data Parallel with comparable speed and near-linear TFLOPS scaling.
- ST-MoE: Designing Stable and Transferable Sparse Expert Models
ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost of a 32B dense model.
- Linformer: Self-Attention with Linear Complexity
Linformer approximates self-attention with a low-rank projection to achieve O(n) time and space complexity while matching Transformer accuracy on standard NLP tasks.
- Probing Routing-Conditional Calibration in Attention-Residual Transformers
Routing summaries and auxiliary features do not provide stable evidence of conditional miscalibration in attention-residual transformers once confidence-matched baselines, capacity controls, and permutation nulls are applied.
- Colinearity Decay: Training Quantization-Friendly ViTs with Outlier Decay
The Colinearity Decay regularizer trains ViTs that maintain or improve full-precision accuracy while delivering higher accuracy after low-bit quantization on ImageNet and COCO tasks.
- TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training
TACO compresses tensor-parallel intermediate tensors with an adaptive FP8 scheme and fused kernels, yielding up to 1.87x throughput gains on GPT and Qwen models with near-lossless accuracy.
- PINNACLE: An Open-Source Computational Framework for Classical and Quantum PINNs
PINNACLE is an open-source framework for classical and quantum PINNs that supplies modular training methods and benchmarks showing high sensitivity to architecture choices plus parameter-efficiency gains in some hybrid quantum regimes.
- BAAI Cardiac Agent: An intelligent multimodal agent for automated reasoning and diagnosis of cardiovascular diseases from cardiac magnetic resonance imaging
BAAI Cardiac Agent automates end-to-end cardiac MRI analysis for seven cardiovascular diseases, achieving AUC >0.93 internally and >0.81 externally with high correlation to expert measurements.
- Assessing Performance and Porting Strategies for Gravitational N-Body Simulations on the RISC-V-Based Tenstorrent Wormhole™
Three scaling strategies for an N-body code on Tenstorrent Wormhole accelerators are compared via execution time and energy measurements, identifying the configuration with the best efficiency-performance balance.
- CurEvo: Curriculum-Guided Self-Evolution for Video Understanding
CurEvo integrates curriculum guidance into self-evolution to structure autonomous improvement of video understanding models, yielding gains on VideoQA benchmarks.