Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe
Pith reviewed 2026-06-26 16:57 UTC · model grok-4.3
The pith
The non-uniform E2M1 FP4 format produces Shrinkage Bias, a systematic negative rounding error caused by bin asymmetry. The bias accumulates across layers and drives training instability, while uniform grids avoid it.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the geometric asymmetry of E2M1's representable bins produces Shrinkage Bias, a systematic negative rounding error that accumulates multiplicatively across layers and is further amplified by the Random Hadamard Transform (RHT). This bias supplies a unified explanation for the training instability observed in existing E2M1-based FP4 recipes. Uniform grids (E1M2/INT4) bypass the grid-geometry error and better convert RHT's improved bucket utilization into higher quantization quality. UFP4 is presented as the practical uniform 4-bit recipe that realizes these advantages: it applies RHT to all three training GEMMs while restricting stochastic rounding to dY.
What carries the argument
Shrinkage Bias, the systematic negative rounding error caused by the geometric asymmetry of non-uniform E2M1 representable bins; it accumulates multiplicatively across layers and is amplified by RHT.
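One way to make the bin asymmetry concrete is to look at the round-to-nearest cell around each grid point. The sketch below is an illustration, not the paper's definition: it assumes the standard E2M1 magnitudes, uses the integers 0..7 as a stand-in for an INT4-style uniform grid, and assumes inputs are uniformly distributed inside each cell. The function names are hypothetical.

```python
# Illustrative sketch; the grids and the "flat density inside each cell"
# assumption are ours and are not taken from the paper.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]   # E2M1 magnitudes
UNIFORM = [float(k) for k in range(8)]            # INT4-style magnitudes 0..7

def rounding_cells(grid):
    """Return (value, lo, hi) for each interior grid point under round-to-nearest."""
    cells = []
    for i in range(1, len(grid) - 1):
        lo = (grid[i - 1] + grid[i]) / 2   # midpoint to the left neighbour
        hi = (grid[i] + grid[i + 1]) / 2   # midpoint to the right neighbour
        cells.append((grid[i], lo, hi))
    return cells

def cell_bias(grid):
    """Mean signed error (q - x) per interior cell, for x uniform within the cell."""
    return {q: q - (lo + hi) / 2 for q, lo, hi in rounding_cells(grid)}

if __name__ == "__main__":
    print("E2M1   :", cell_bias(E2M1))     # 2.0 -> -0.125, 4.0 -> -0.25, others 0.0
    print("uniform:", cell_bias(UNIFORM))  # every interior cell -> 0.0
```

Under these assumptions the cells around 2.0 (covering 1.75 to 2.5) and 4.0 (covering 3.5 to 5.0) extend further to the right than to the left, so values rounded to them are on average larger than the grid point. Every interior cell of the uniform grid is symmetric. The edge cells (0 and the largest magnitude) are left out because their extent depends on scaling and clipping.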
If this is right
- UFP4 achieves lower BF16-relative loss degradation than E2M1 baselines on Dense 1.5B, MoE 7.9B, and MoE 124B long-run pretraining.
- Uniform grids convert the improved bucket utilization from RHT into higher quantization quality without introducing grid-geometry error.
- The bias accumulates multiplicatively across layers, so its effect grows with network depth.
- Future accelerators should treat E1M2/INT4-style uniform 4-bit grids as first-class training primitives alongside E2M1.
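The depth dependence in the third bullet can be illustrated with a toy model. It assumes, purely for illustration, that every quantized layer scales the signal magnitude by the same factor $(1-\epsilon)$ for a small per-layer shrinkage $\epsilon > 0$. The symbols $s_\ell$, $\epsilon$, and $L$ are notation introduced here, not the paper's, and real layers need not shrink by a constant factor.

```latex
s_L \;=\; s_0 \prod_{\ell=1}^{L} (1-\epsilon) \;=\; s_0\,(1-\epsilon)^{L} \;\approx\; s_0\, e^{-\epsilon L}
\qquad (\epsilon \ll 1)
```

For example, with $\epsilon = 0.01$ and $L = 32$ this gives $s_L / s_0 = 0.99^{32} \approx 0.72$, so a 1% per-layer shrinkage compounds to roughly a 28% loss of magnitude at depth 32.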
Where Pith is reading between the lines
- The same geometric bias mechanism may appear in any quantization format whose bin boundaries are asymmetrically spaced.
- Hardware vendors could prioritize uniform 4-bit support to enable more stable low-precision training at scale.
- Selective application of stochastic rounding only to dY may generalize as a stability technique beyond the UFP4 recipe.
- Scaling-law studies of FP4 pretraining may need explicit correction terms for cumulative rounding bias.
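The stochastic rounding mentioned in the list above can be sketched as follows. This is the textbook definition on an arbitrary grid, not UFP4's kernel, and it does not address why the recipe restricts it to dY. The helper names and the choice of the E2M1 magnitudes as the example grid are assumptions for illustration.

```python
import random

E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # example grid (assumption)

def stochastic_round(x, grid, rng):
    """Round x to one of its two neighbouring grid points.

    The upper neighbour is chosen with probability proportional to the
    distance of x from the lower neighbour; out-of-range values are clipped.
    """
    grid = sorted(grid)
    if x <= grid[0]:
        return grid[0]
    if x >= grid[-1]:
        return grid[-1]
    for lo, hi in zip(grid, grid[1:]):
        if lo <= x <= hi:
            p_up = (x - lo) / (hi - lo)
            return hi if rng.random() < p_up else lo

def expected_value(x, grid):
    """Exact expectation of stochastic_round(x, grid) for x inside the grid range."""
    grid = sorted(grid)
    for lo, hi in zip(grid, grid[1:]):
        if lo <= x <= hi:
            p_up = (x - lo) / (hi - lo)
            return lo * (1 - p_up) + hi * p_up
    raise ValueError("x is outside the grid range")

if __name__ == "__main__":
    rng = random.Random(0)
    print(stochastic_round(2.25, E2M1, rng))  # either 2.0 or 3.0
    print(expected_value(2.25, E2M1))         # 2.25: unbiased in expectation
```

Round-to-nearest would map 2.25 to 2.0 every time, whereas stochastic rounding returns 2.0 with probability 0.75 and 3.0 with probability 0.25, so its expectation equals the input at the cost of higher variance.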
Load-bearing premise
The geometric asymmetry of E2M1 bins is the primary driver of observed training instability rather than optimizer interactions, hardware rounding, or data-dependent effects.
What would settle it
A controlled multi-layer forward-pass experiment that isolates rounding error in E2M1 versus an otherwise identical uniform grid, or a training run in which E2M1 rounding is forced to be symmetric and instability disappears.
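A minimal harness for the first kind of experiment might look like the sketch below. It is a toy under stated assumptions and not the paper's setup: a purely linear stack of random Gaussian layers, forward pass only, per-block absmax scaling with a block size of 16, round-to-nearest, and the integers 0..7 standing in for a uniform grid. All names and parameters are hypothetical. It only shows how the grid can be the single varying factor and makes no claim about what the numbers will be.

```python
import numpy as np

E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
UNIFORM = np.arange(8, dtype=float)                        # stand-in uniform grid

def quantize(x, grid, block=16):
    """Per-block absmax scaling, then round-to-nearest onto a signed grid."""
    flat = x.reshape(-1, block)
    scale = np.abs(flat).max(axis=1, keepdims=True) / grid[-1]
    scale[scale == 0] = 1.0                      # avoid division by zero
    mag = np.abs(flat) / scale
    # Nearest grid magnitude; exact ties go to the smaller magnitude.
    idx = np.abs(mag[..., None] - grid).argmin(axis=-1)
    return (np.sign(flat) * grid[idx] * scale).reshape(x.shape)

def norm_ratio_per_layer(grid, depth=8, width=64, batch=32, seed=0):
    """Ratio of quantized to unquantized activation norm after each layer.

    The same seed yields the same inputs and weights for every grid, so the
    grid is the only thing that differs between two calls.
    """
    rng = np.random.default_rng(seed)
    x_ref = rng.standard_normal((batch, width))
    x_q = x_ref.copy()
    ratios = []
    for _ in range(depth):
        w = rng.standard_normal((width, width)) / np.sqrt(width)
        x_ref = x_ref @ w
        x_q = quantize(x_q, grid) @ w            # quantize activations only
        ratios.append(np.linalg.norm(x_q) / np.linalg.norm(x_ref))
    return np.array(ratios)

if __name__ == "__main__":
    print("E2M1   :", norm_ratio_per_layer(E2M1))
    print("uniform:", norm_ratio_per_layer(UNIFORM))
```

A ratio that drifts below 1 with depth for one grid but not the other would be the kind of signal the experiment is looking for. A real test would also need nonlinearities, weight and gradient quantization, and the backward pass.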
Original abstract
FP4 training promises substantial reductions in memory and computation cost for LLM pretraining, yet current FP4 hardware paths and recipes, including NVIDIA Blackwell/Rubin-class systems and AMD MI350-series GPUs, remain centered on E2M1 data elements. In this study, we identify a fundamental limitation of that choice: non-uniform formats such as E2M1 inherently suffer from Shrinkage Bias, a systematic negative rounding error caused by the geometric asymmetry of their representable bins. We show that this bias accumulates multiplicatively across layers and is amplified by the Random Hadamard Transform (RHT), providing a unified explanation for the training instability observed in existing E2M1-based FP4 recipes. In contrast, uniform grids (E1M2/INT4) bypass this grid-geometry error and better convert the improved bucket utilization from RHT into higher quantization quality. Based on this finding, we propose UFP4, a uniform 4-bit training recipe that applies RHT to all three training GEMMs while restricting stochastic rounding to dY alone. On Dense 1.5B, MoE 7.9B, and MoE 124B long-run pretraining, UFP4 consistently achieves lower BF16-relative loss degradation than strong E2M1-based baselines, supported by scaling-law analysis and ablation studies. Our results suggest that future accelerators should support E1M2/INT4-style uniform 4-bit grids as first-class training primitives alongside E2M1.
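The role of the Random Hadamard Transform can be sketched in a few lines. The example assumes a common construction, a Sylvester Hadamard matrix with random column signs normalized to be orthogonal, applied along the contraction dimension of both GEMM operands. The abstract does not specify the block size or placement UFP4 uses, so the dimensions and names below are illustrative.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def random_hadamard(n, rng):
    """Orthogonal matrix R = H @ diag(signs) / sqrt(n) with random +/-1 signs."""
    signs = rng.choice([-1.0, 1.0], size=n)
    return hadamard(n) * signs / np.sqrt(n)   # broadcasting scales column j by signs[j]

rng = np.random.default_rng(0)
n = 16
R = random_hadamard(n, rng)

X = rng.standard_normal((4, n))
X[0, 3] = 50.0                     # inject a single outlier into the first row
W = rng.standard_normal((n, 8))

Y_ref = X @ W
Y_rot = (X @ R) @ (R.T @ W)        # the rotation cancels because R @ R.T = I

if __name__ == "__main__":
    print("max |Y_ref - Y_rot| :", np.abs(Y_ref - Y_rot).max())
    print("row 0 absmax before :", np.abs(X[0]).max())
    print("row 0 absmax after  :", np.abs(X @ R)[0].max())
```

In exact arithmetic the rotated product equals the original, so the transform only changes what the quantizer sees. The outlier's magnitude is spread across the whole row, which lowers the block maximum and lets more grid levels be used. This is the "bucket utilization" effect the abstract refers to.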
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that E2M1 FP4 formats suffer from Shrinkage Bias due to geometric asymmetry in representable bins, which accumulates multiplicatively across layers and is amplified by Random Hadamard Transform (RHT), explaining training instability in existing FP4 recipes. It proposes UFP4, a uniform 4-bit recipe applying RHT to all three training GEMMs but restricting stochastic rounding to dY, and reports lower BF16-relative loss degradation than E2M1 baselines on Dense 1.5B, MoE 7.9B, and MoE 124B models, backed by scaling-law analysis and ablations. The work suggests that future hardware should support uniform grids such as E1M2/INT4 as first-class training primitives alongside E2M1.
Significance. If the geometric origin of Shrinkage Bias is confirmed as the dominant mechanism and UFP4's improvements hold under controlled conditions, the result would provide a concrete rationale for shifting FP4 training hardware toward uniform formats, potentially improving stability and quantization quality in large-scale LLM pretraining without additional memory overhead.
Major comments (2)
- [Experiments / Ablations] Experiments section (and associated ablation tables): the comparison of full E2M1 vs. UFP4 recipes does not isolate bin uniformity as the causal factor. The reported loss gap could be driven by differences in stochastic-rounding schedule (dY-only in UFP4), GEMM ordering, or hardware-specific rounding rather than grid geometry; a controlled contrast holding optimizer, RHT pattern, and rounding implementation fixed while varying only E2M1 vs. uniform bins is required to support the 'geometric origin' claim.
- [Bias Analysis] § on bias accumulation and RHT amplification: the multiplicative accumulation argument relies on scaling-law fits, but without explicit derivation showing how the negative rounding error from asymmetric bins propagates through the forward/backward passes (e.g., via a closed-form expression or layer-wise error model), it remains unclear whether the observed instability is primarily geometric or confounded by data-dependent or optimizer effects.
Minor comments (2)
- [Introduction / Geometric Analysis] Clarify the exact definition of 'Shrinkage Bias' with a small numerical example of E2M1 bin boundaries and the resulting rounding error distribution.
- [Related Work] Add a reference to prior work on uniform vs. non-uniform quantization error analysis in low-precision training if not already present.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below, clarifying the controls in our experiments and the support for the geometric bias analysis.
Point-by-point responses
- Referee: [Experiments / Ablations] Experiments section (and associated ablation tables): the comparison of full E2M1 vs. UFP4 recipes does not isolate bin uniformity as the causal factor. The reported loss gap could be driven by differences in stochastic-rounding schedule (dY-only in UFP4), GEMM ordering, or hardware-specific rounding rather than grid geometry; a controlled contrast holding optimizer, RHT pattern, and rounding implementation fixed while varying only E2M1 vs. uniform bins is required to support the 'geometric origin' claim.
  Authors: We agree that a fully isolated contrast, varying only the quantization grid while holding stochastic rounding schedule, RHT pattern, optimizer, and rounding implementation fixed, would provide stronger causal evidence for the geometric origin. The current UFP4 vs. E2M1 comparison does vary both grid uniformity and rounding schedule. Our existing ablations vary grid type while controlling other factors to the extent hardware permits, but we will add a new controlled experiment in the revision that directly compares E2M1 and E1M2 grids under identical rounding and RHT settings.
  Revision: yes
- Referee: [Bias Analysis] § on bias accumulation and RHT amplification: the multiplicative accumulation argument relies on scaling-law fits, but without explicit derivation showing how the negative rounding error from asymmetric bins propagates through the forward/backward passes (e.g., via a closed-form expression or layer-wise error model), it remains unclear whether the observed instability is primarily geometric or confounded by data-dependent or optimizer effects.
  Authors: The scaling-law fits and layer-wise bias measurements provide empirical evidence that the negative rounding error accumulates and is amplified by RHT. We acknowledge that an explicit propagation model would strengthen the geometric claim over potential confounds. In the revision we will add a simplified layer-wise error model to the bias analysis section to illustrate the multiplicative effect more formally, while noting that a full closed-form derivation across optimizer and data-dependent dynamics is beyond the current scope.
  Revision: yes
Circularity Check
No circularity: the claims rest on empirical scaling-law analysis and ablations, without reduction to fitted inputs or self-referential definitions.
Full rationale
The paper identifies Shrinkage Bias via geometric analysis of E2M1 bins and demonstrates its impact through scaling-law fits and ablation studies comparing full training recipes (E2M1 baselines vs. UFP4) on Dense 1.5B, MoE 7.9B, and MoE 124B models. No equations are presented that define a quantity in terms of itself, rename a fitted parameter as a prediction, or rely on self-citations for load-bearing uniqueness or ansatz. The central claims (bias accumulation, RHT amplification, UFP4 superiority) are framed as outcomes of the reported experiments rather than derivations that collapse to their inputs by construction. This is the common case of an empirical paper whose results stand or fall on the quality of its controls and measurements, not on definitional circularity.