Mixed Precision Training
33 Pith papers cite this work.
abstract
Deep neural networks have enabled progress in a wide variety of applications. Growing the size of the neural network typically results in improved accuracy. As model sizes grow, the memory and compute requirements for training these models also increase. We introduce a technique to train deep neural networks using half-precision floating point numbers. In our technique, weights, activations and gradients are stored in IEEE half-precision format. Half-precision floating point numbers have a limited numerical range compared to single-precision numbers. We propose two techniques to handle this loss of information. Firstly, we recommend maintaining a single-precision copy of the weights that accumulates the gradients after each optimizer step; this single-precision copy is rounded to half-precision for the forward and backward passes during training. Secondly, we propose scaling the loss appropriately to handle the loss of information with half-precision gradients. We demonstrate that this approach works for a wide variety of models including convolutional neural networks, recurrent neural networks and generative adversarial networks. This technique works for large-scale models with more than 100 million parameters trained on large datasets. Using this approach, we can reduce the memory consumption of deep learning models by nearly 2x. On future processors, we can also expect a significant computation speedup from half-precision hardware units.
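To make the abstract's two techniques concrete, here is a minimal sketch of a mixed precision training loop with an FP32 master weight copy and constant loss scaling. It assumes PyTorch on a CUDA GPU (FP16 kernel coverage on CPU is spotty); the toy linear model, learning rate, and loss scale of 1024 are illustrative placeholders rather than the paper's actual configuration.

```python
import torch

device = "cuda"  # assumes a CUDA GPU; FP16 ops are not uniformly supported on CPU
torch.manual_seed(0)

model = torch.nn.Linear(16, 1).to(device)  # toy stand-in for a real network

# Technique 1: keep an FP32 "master" copy of the weights; the optimizer owns it.
master_params = [p.detach().clone().float().requires_grad_(True)
                 for p in model.parameters()]
optimizer = torch.optim.SGD(master_params, lr=1e-2)

model.half()         # weights, activations, and gradients now live in FP16
loss_scale = 1024.0  # Technique 2: constant loss scale (illustrative value)

for step in range(10):
    x = torch.randn(8, 16, device=device, dtype=torch.float16)
    y = torch.randn(8, 1, device=device, dtype=torch.float16)

    loss = torch.nn.functional.mse_loss(model(x), y)
    # Scaling the loss keeps small FP16 gradients above the underflow threshold.
    (loss * loss_scale).backward()

    optimizer.zero_grad()
    for p_half, p_master in zip(model.parameters(), master_params):
        p_master.grad = p_half.grad.float() / loss_scale  # unscale in FP32
        p_half.grad = None
    optimizer.step()  # the update accumulates in FP32, so tiny deltas are not lost

    # Round the master weights to FP16 for the next forward/backward pass.
    with torch.no_grad():
        for p_half, p_master in zip(model.parameters(), master_params):
            p_half.copy_(p_master.half())
```

In real workloads the scale is typically chosen empirically or adjusted dynamically when overflows are detected; modern frameworks bundle this pattern (for example torch.cuda.amp.GradScaler), but the manual loop above maps one-to-one onto the abstract's two techniques.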
citing papers
- Efficient Training on Multiple Consumer GPUs with RoundPipe
RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.
- TransDot: An Area-efficient Reconfigurable Floating-Point Unit for Trans-Precision Dot-Product Accumulation for FPGA AI Engines
TransDot unifies SIMD FMA and trans-precision DPA in one reconfigurable FPU, achieving 2x FP16, 4x FP8, and 8x FP4 throughput with FP32 accumulation plus 1.46x to 2.92x area efficiency gains over the FPnew baseline.
- Mass Matrix Assembly on Tensor Cores for Implicit Particle-In-Cell Methods
Mass matrix assembly for implicit PIC methods can be exactly reformulated cell-by-cell as tensor-core matrix products, delivering up to 3x kernel speedup and 15% end-to-end runtime reduction in ECSIM simulations.
- From Characterization to Microarchitecture: Designing an Elegant and Reliable BFP-Based NPU
A BFP NPU microarchitecture using row/column blocking and per-path protections achieves near-DMR reliability at 3.55% geometric mean performance overhead and under 2% hardware cost.
- Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings
Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and showing strong masked language modeling results with or without positional embeddings.
- Multimodal Reasoning with LLM for Encrypted Traffic Interpretation: A Benchmark
Creates the BGTD benchmark and mmTraffic architecture to enable explainable multimodal interpretation of encrypted network traffic using LLMs.
- Hidden Reliability Risks in Large Language Models: Systematic Identification of Precision-Induced Output Disagreements
PrecisionDiff is a differential testing framework that uncovers widespread precision-induced behavioral disagreements in aligned LLMs, including safety-critical jailbreak divergences across precision formats.
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
- OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
- GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
A 3.5-billion-parameter diffusion model with classifier-free guidance generates images preferred over DALL-E by human raters and can be fine-tuned for text-guided inpainting.
- Diffusion Models Beat GANs on Image Synthesis
Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
- Generating Long Sequences with Sparse Transformers
Sparse Transformers factorize attention to handle sequences tens of thousands of timesteps long, achieving new SOTA density modeling on Enwik8, CIFAR-10, and ImageNet-64.
- LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
LoKA enables practical FP8 use in numerically sensitive large recommendation models via profiling, model adaptations, and runtime kernel orchestration.
- CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification
CuraView detects sentence-level faithfulness hallucinations in medical discharge summaries via GraphRAG knowledge graphs and multi-agent evidence grading, achieving 0.831 F1 on critical contradictions with a fine-tuned Qwen3-14B model and 50% relative improvement over baselines.
- Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study
Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.
- Training Time Prediction for Mixed Precision-based Distributed Training
A precision-aware predictor for distributed training time achieves 9.8% MAPE across precision settings, compared to errors up to 147.85% when precision is ignored.
- The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference
FP16 KV caching in transformers causes deterministic token divergence versus cache-free inference due to non-associative floating-point accumulation orderings.
- SHARE: Social-Humanities AI for Research and Education
SHARE models are the first causal LMs pretrained exclusively for the social sciences and humanities (SSH) and match general models like Phi-4 on SSH texts despite using 100 times fewer tokens, paired with a non-generative MIRROR interface to support scholarly review.
- LLM-PRISM: Characterizing Silent Data Corruption from Permanent GPU Faults in LLM Training
LLMs resist low-frequency permanent GPU faults but certain datapaths and precision formats trigger catastrophic training divergence even at moderate fault rates.
- MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning
MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter and memory usage.
- FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control
FlashSAC scales up Soft Actor-Critic with fewer updates, larger models, higher data throughput, and norm bounds to deliver faster, more stable training than PPO on high-dimensional robot control tasks across dozens of simulators.
- Neural Harmonic Textures for High-Quality Primitive Based Neural Reconstruction
Neural Harmonic Textures add periodic feature interpolation and deferred neural decoding to primitive representations, achieving state-of-the-art real-time novel-view synthesis and bridging primitive and neural-field methods.
- PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
PyTorch Fully Sharded Data Parallel enables training of significantly larger models than Distributed Data Parallel with comparable speed and near-linear TFLOPS scaling.
- ST-MoE: Designing Stable and Transferable Sparse Expert Models
ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost of a 32B dense model.
- Linformer: Self-Attention with Linear Complexity
Linformer approximates self-attention with a low-rank projection to achieve O(n) time and space complexity while matching Transformer accuracy on standard NLP tasks.
- Probing Routing-Conditional Calibration in Attention-Residual Transformers
Routing summaries and auxiliary features do not provide stable evidence of conditional miscalibration in attention-residual transformers once confidence-matched baselines, capacity controls, and permutation nulls are applied.
- Colinearity Decay: Training Quantization-Friendly ViTs with Outlier Decay
The Colinearity Decay regularizer trains ViTs that maintain or improve full-precision accuracy while delivering higher accuracy after low-bit quantization on ImageNet and COCO tasks.
- TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training
TACO compresses tensor-parallel intermediate tensors with an adaptive FP8 scheme and fused kernels, yielding up to 1.87x throughput gains on GPT and Qwen models with near-lossless accuracy.
- PINNACLE: An Open-Source Computational Framework for Classical and Quantum PINNs
PINNACLE is an open-source framework for classical and quantum PINNs that supplies modular training methods and benchmarks showing high sensitivity to architecture choices plus parameter-efficiency gains in some hybrid quantum regimes.
- BAAI Cardiac Agent: An intelligent multimodal agent for automated reasoning and diagnosis of cardiovascular diseases from cardiac magnetic resonance imaging
BAAI Cardiac Agent automates end-to-end cardiac MRI analysis for seven cardiovascular diseases, achieving AUC >0.93 internally and >0.81 externally with high correlation to expert measurements.
- Assessing Performance and Porting Strategies for Gravitational N-Body Simulations on the RISC-V-Based Tenstorrent Wormhole™
Three scaling strategies for an N-body code on Tenstorrent Wormhole accelerators are compared via execution time and energy measurements, identifying the configuration with the best efficiency-performance balance.
- CurEvo: Curriculum-Guided Self-Evolution for Video Understanding
CurEvo integrates curriculum guidance into self-evolution to structure autonomous improvement of video understanding models, yielding gains on VideoQA benchmarks.