pith. machine review for the scientific record.

arxiv: 2208.07339 · v2 · submitted 2022-08-15 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links · Lean Theorem

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer

Pith reviewed 2026-05-13 13:28 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords 8-bit quantization · transformer inference · large language models · matrix multiplication · mixed precision · outlier features · memory reduction

The pith

LLM.int8() performs 8-bit matrix multiplication for transformers up to 175B parameters with no performance degradation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops LLM.int8(), a procedure that halves the memory needed for transformer inference by converting matrix multiplications to 8-bit while preserving full accuracy. It applies vector-wise quantization to most values and isolates a small set of systematic outlier dimensions for separate 16-bit handling. The result lets 175B parameter models run inference immediately after conversion from 16/32-bit checkpoints, making them practical on a single server with consumer GPUs instead of requiring multiple high-memory datacenter GPUs.
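
As a rough check on the memory claim, a back-of-envelope sketch in Python; the 2-bytes-per-FP16-parameter figure is standard, but the GPU configuration is our illustrative assumption, not a measurement from the paper.

    # Back-of-envelope memory arithmetic for 175B-parameter inference.
    params = 175e9
    fp16_gb = params * 2 / 1e9   # 2 bytes per parameter in FP16
    int8_gb = params * 1 / 1e9   # 1 byte per parameter in Int8
    print(f"FP16 weights: {fp16_gb:.0f} GB")   # ~350 GB
    print(f"Int8 weights: {int8_gb:.0f} GB")   # ~175 GB
    # Eight hypothetical 24 GB consumer cards give 192 GB: enough for the
    # Int8 weights, not the FP16 ones (activations and KV cache excluded).
    print(f"8 x 24 GB GPUs: {8 * 24} GB")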

Core claim

LLM.int8() performs Int8 matrix multiplication for feed-forward and attention projection layers in transformers. It first uses vector-wise quantization, with separate normalization constants for each inner product, to quantize most features. For emergent outliers, it applies a mixed-precision decomposition that isolates those dimensions into a 16-bit matrix multiplication, while more than 99.9% of values are still multiplied in 8-bit. Using this approach, inference in LLMs with up to 175B parameters shows no performance degradation.
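
To make the two parts concrete, a minimal NumPy sketch of the decomposition as described; the function name and shapes are our own, the 6.0 magnitude threshold follows the paper's reported default, and this stands in for, rather than reproduces, the released CUDA kernels.

    import numpy as np

    def llm_int8_matmul(X, W, threshold=6.0):
        # X: (tokens, hidden) activations; W: (hidden, out) weights.
        # Part 1: mixed-precision decomposition. Hidden dimensions holding
        # any outlier (|x| >= threshold) are kept at full input precision
        # (16-bit in the paper) and multiplied separately.
        outlier = np.any(np.abs(X) >= threshold, axis=0)
        C_outlier = X[:, outlier] @ W[outlier, :]

        # Part 2: vector-wise quantization for the remaining dimensions:
        # one scale per row of X and per column of W, i.e. per inner product.
        Xr, Wr = X[:, ~outlier], W[~outlier, :]
        cx = np.abs(Xr).max(axis=1, keepdims=True)  # (tokens, 1)
        cw = np.abs(Wr).max(axis=0, keepdims=True)  # (1, out)
        cx[cx == 0] = 1.0
        cw[cw == 0] = 1.0
        Xq = np.round(127.0 * Xr / cx).astype(np.int8)
        Wq = np.round(127.0 * Wr / cw).astype(np.int8)

        # Int8 matmul accumulated in int32, then dequantized by the outer
        # product of the scaling constants.
        C_i32 = Xq.astype(np.int32) @ Wq.astype(np.int32)
        C_regular = C_i32 * (cx * cw) / (127.0 * 127.0)
        return C_regular + C_outlier

On well-behaved inputs the result closely tracks X @ W; the decomposition exists because a few large-magnitude dimensions would otherwise inflate cx and crush the quantization resolution of everything else.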

What carries the argument

LLM.int8(), a two-part quantization procedure consisting of vector-wise quantization for the bulk of features and mixed-precision decomposition that isolates outlier dimensions for 16-bit computation.

If this is right

  • A 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation.
  • Memory required for inference is reduced by half.
  • Models such as OPT-175B or BLOOM can run on a single server with consumer GPUs (a loading sketch follows this list).
  • More than 99.9% of the values in the matrix multiplications are handled in 8-bit.
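
If these consequences hold, using the method is a one-flag change at load time. A sketch against the open-source integration (the released code lives in bitsandbytes and is surfaced through Hugging Face transformers; the model name is a small stand-in, and the exact flag spelling varies across library versions):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "facebook/opt-1.3b"  # small stand-in; the same call is the route to OPT-175B
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(
        name,
        device_map="auto",   # shard layers across available GPUs
        load_in_8bit=True,   # convert Linear layers to LLM.int8() at load time
    )

    inputs = tokenizer("The memory footprint of a 175B model", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0], skip_special_tokens=True))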

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same outlier-handling logic might extend to other matrix-heavy operations in neural networks beyond transformers.
  • If the outliers prove stable across training runs, the method could reduce memory during fine-tuning as well.
  • Widespread adoption would shift the practical limit of model size from available GPU memory to available compute.

Load-bearing premise

The emergent outlier features in transformers are highly systematic, limited to a small number of dimensions, and can be isolated via mixed-precision decomposition without introducing errors in the remaining 8-bit computations.
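
The premise is directly checkable from activations. A minimal probe, assuming the criterion the paper reports (magnitude at least 6.0, present in at least 25% of layers and 6% of token positions); the function and argument names are ours:

    import numpy as np

    def find_outlier_dims(hidden_states, magnitude=6.0,
                          layer_frac=0.25, token_frac=0.06):
        # hidden_states: list of (tokens, hidden) arrays, one per layer.
        # Fraction of layers in which each dim carries a large value.
        layer_hits = np.mean(
            [np.any(np.abs(h) >= magnitude, axis=0) for h in hidden_states],
            axis=0)
        # Fraction of token positions with a large value, pooled over layers.
        token_hits = np.mean(
            [np.mean(np.abs(h) >= magnitude, axis=0) for h in hidden_states],
            axis=0)
        systematic = (layer_hits >= layer_frac) & (token_hits >= token_frac)
        return np.where(systematic)[0]

If the returned set is small (under 0.1% of dimensions) and stable across inputs, the premise stands; a large or input-dependent set would undermine the mixed-precision split.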

What would settle it

Applying LLM.int8() to a 175B parameter model and checking whether perplexity or zero-shot task accuracy drops relative to the original 16/32-bit version; any measurable drop would falsify the zero-degradation claim.
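
A minimal version of that settling experiment, sketched with PyTorch against a Hugging Face-style causal LM; the helper is ours, and a real replication would use WikiText-2/C4 perplexity plus the paper's zero-shot suites rather than a single text.

    import math
    import torch

    @torch.no_grad()
    def perplexity(model, tokenizer, text):
        # Exponentiated next-token cross-entropy of the model on `text`.
        ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
        return math.exp(model(ids, labels=ids).loss.item())

    # ppl_fp16 = perplexity(model_fp16, tokenizer, held_out_text)
    # ppl_int8 = perplexity(model_int8, tokenizer, held_out_text)
    # The zero-degradation claim survives only if ppl_int8 matches ppl_fp16
    # within run-to-run noise; any systematic gap falsifies it.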

read the original abstract

Large language models have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cut the memory needed for inference by half while retaining full precision performance. With our method, a 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation. This is made possible by understanding and working around properties of highly systematic emergent features in transformer language models that dominate attention and transformer predictive performance. To cope with these features, we develop a two-part quantization procedure, LLM.int8(). We first use vector-wise quantization with separate normalization constants for each inner product in the matrix multiplication, to quantize most of the features. However, for the emergent outliers, we also include a new mixed-precision decomposition scheme, which isolates the outlier feature dimensions into a 16-bit matrix multiplication while still more than 99.9% of values are multiplied in 8-bit. Using LLM.int8(), we show empirically it is possible to perform inference in LLMs with up to 175B parameters without any performance degradation. This result makes such models much more accessible, for example making it possible to use OPT-175B/BLOOM on a single server with consumer GPUs. We open-source our software.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces LLM.int8(), a two-part quantization procedure for Int8 matrix multiplication in transformer feed-forward and attention layers. Vector-wise quantization with per-inner-product normalization constants handles most features, while a mixed-precision decomposition isolates a small set of emergent outlier dimensions into 16-bit computation (with >99.9% of values remaining in 8-bit). The central empirical claim is that this enables inference on models up to 175B parameters (e.g., OPT-175B, BLOOM) with no performance degradation relative to FP16, halving memory requirements and allowing deployment on consumer GPUs.

Significance. If the zero-degradation result holds under rigorous validation, the work has substantial practical significance: it removes a major hardware barrier for state-of-the-art LLMs, making 175B-scale models runnable on single servers with consumer GPUs while preserving zero-shot performance. The open-sourcing of the implementation further strengthens its potential impact.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (empirical results): the claim of 'without any performance degradation' on 175B-parameter models is load-bearing but rests on matching FP16 zero-shot numbers without reported sensitivity analysis on the outlier threshold, variance across random seeds or prompt distributions, or statistical tests; this leaves the no-residual-error assumption in the 8-bit path untested.
  2. [§3.2] §3.2 (mixed-precision decomposition): the procedure assumes emergent outliers are stable, limited to <0.1% dimensions, and perfectly isolatable with no leakage into the 8-bit path; the manuscript provides no ablation removing the 16-bit component on the largest model to confirm that the remaining 99.9% of the matmul truly incurs zero accuracy loss.
  3. [§4] §4 (experiments on OPT-175B): the reported matching of FP16 numbers is presented without baselines for alternative quantization schemes or ablations of the vector-wise normalization constants, making it impossible to isolate the contribution of the mixed-precision step to the zero-degradation result.
minor comments (2)
  1. The manuscript mentions open-sourcing the software but does not include a repository URL, commit hash, or reproduction instructions.
  2. [§3.1] Notation for the normalization constants (e.g., per-vector scaling factors) could be clarified with an explicit equation in §3.1 to avoid ambiguity when implementing the vector-wise quantization (one explicit form is sketched after this list).
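
For reference, one explicit way to write the vector-wise scheme consistent with the manuscript's description (our notation: c_x holds one constant per row of X, c_w one per column of W, with their outer product scaling the int32 result elementwise):

    \[
      \mathbf{A}_{\mathrm{i8}} = \big\lfloor \mathbf{c}_x \odot \mathbf{X}_{\mathrm{f16}} \big\rceil ,
      \quad c_{x,i} = \frac{127}{\max_j |X_{ij}|} ,
      \qquad
      \mathbf{B}_{\mathrm{i8}} = \big\lfloor \mathbf{c}_w \odot \mathbf{W}_{\mathrm{f16}} \big\rceil ,
      \quad c_{w,j} = \frac{127}{\max_i |W_{ij}|} ,
    \]
    \[
      \mathbf{C}_{\mathrm{f16}} \;\approx\;
      \frac{1}{\mathbf{c}_x \otimes \mathbf{c}_w} \odot
      \big( \mathbf{A}_{\mathrm{i8}} \, \mathbf{B}_{\mathrm{i8}} \big) .
    \]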

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful review and for highlighting areas where the empirical validation can be strengthened. We respond to each major comment below, providing clarifications from the manuscript and indicating where revisions will be made.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (empirical results): the claim of 'without any performance degradation' on 175B-parameter models is load-bearing but rests on matching FP16 zero-shot numbers without reported sensitivity analysis on the outlier threshold, variance across random seeds or prompt distributions, or statistical tests; this leaves the no-residual-error assumption in the 8-bit path untested.

    Authors: We agree that the manuscript would benefit from additional analysis on sensitivity and variance. The outlier threshold is determined based on the distribution of activation magnitudes observed in smaller models, as detailed in §3.2, and we observed no degradation across the reported zero-shot tasks for OPT-175B and BLOOM. In the revised version, we will include a sensitivity analysis on the threshold for smaller models and report results on additional prompt sets to address variance across distributions. Full statistical tests on 175B models are computationally intensive, but the consistent matching across benchmarks supports the claim. revision: partial

  2. Referee: [§3.2] §3.2 (mixed-precision decomposition): the procedure assumes emergent outliers are stable, limited to <0.1% dimensions, and perfectly isolatable with no leakage into the 8-bit path; the manuscript provides no ablation removing the 16-bit component on the largest model to confirm that the remaining 99.9% of the matmul truly incurs zero accuracy loss.

    Authors: The analysis in §3.2 shows that the outlier dimensions are consistent across layers and inputs for a given model, with less than 0.1% of dimensions affected. We provide ablations in §4 on models up to 13B parameters demonstrating that the 8-bit path alone leads to accuracy loss when outliers are not handled in 16-bit. Performing the full ablation on 175B would require significant additional compute resources beyond what was available for the original experiments. We will expand the discussion in the revision to emphasize the scaling behavior observed. revision: no

  3. Referee: [§4] §4 (experiments on OPT-175B): the reported matching of FP16 numbers is presented without baselines for alternative quantization schemes or ablations of the vector-wise normalization constants, making it impossible to isolate the contribution of the mixed-precision step to the zero-degradation result.

    Authors: Section 4 includes direct comparisons to per-tensor quantization and other baselines, which show substantial degradation on large models. Ablations of the vector-wise quantization are presented for smaller models in §4, isolating its contribution to reducing quantization error. To further isolate the mixed-precision step, we will add explicit results with and without it for models up to 6.7B in the revision, and note that the zero-degradation on 175B relies on both components working together as shown in the scaling experiments. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical quantization procedure validated directly on large models

full rationale

The paper introduces LLM.int8() as a two-part procedure (vector-wise 8-bit quantization plus mixed-precision outlier decomposition) whose normalization constants are computed on-the-fly from each input activation vector. The central claim—that this yields zero degradation on downstream tasks for models up to 175B—is supported by direct empirical comparison to FP16 baselines on OPT-175B and BLOOM, with no intermediate derivation that reduces to a fitted parameter, self-definition, or self-citation chain. The method is externally falsifiable via the reported zero-shot accuracy numbers and open-source implementation; no load-bearing step collapses to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the observed existence of systematic emergent outlier features that dominate transformer performance; these are treated as an empirical property of the models rather than derived quantities.

axioms (1)
  • domain assumption Transformer language models contain highly systematic emergent outlier features in attention and feed-forward layers that dominate predictive performance.
    This property is invoked to justify the need for the mixed-precision decomposition step.

pith-pipeline@v0.9.0 · 5547 in / 1222 out tokens · 64890 ms · 2026-05-13T13:28:57.607036+00:00 · methodology

discussion (0)


Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon

    cs.PF 2026-05 unverdicted novelty 7.0

    A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.

  2. Chronos: Learning the Language of Time Series

    cs.LG 2024-03 conditional novelty 7.0

    Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.

  3. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    cs.LG 2024-01 conditional novelty 7.0

    Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

  4. Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    cs.CV 2023-10 unverdicted novelty 7.0

    A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.

  5. Accelerating Large Language Model Decoding with Speculative Sampling

    cs.CL 2023-02 accept novelty 7.0

    Speculative sampling accelerates LLM decoding 2-2.5x by letting a draft model propose short sequences that the target model scores in parallel, then applies modified rejection sampling to keep the exact target distribution.

  6. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    cs.LG 2022-10 unverdicted novelty 7.0

    GPTQ quantizes 175B-parameter GPT models to 3-4 bits per weight in one shot using approximate second-order information, achieving negligible accuracy degradation and 3-4x inference speedups.

  7. LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

    cs.LG 2026-05 unverdicted novelty 6.0

    LoKA enables practical FP8 use in numerically sensitive large recommendation models via profiling, model adaptations, and runtime kernel orchestration.

  8. OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization

    cs.LG 2026-05 unverdicted novelty 6.0

    OSAQ suppresses weight outliers in LLMs via a closed-form additive transformation from the Hessian's stable null space, improving 2-bit quantization perplexity by over 40% versus vanilla GPTQ with no inference overhead.

  9. OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization

    cs.LG 2026-05 unverdicted novelty 6.0

    OSAQ uses the low-rank structure of the Hessian to construct a closed-form additive weight transformation that suppresses outliers without changing task loss, enabling better low-bit LLM quantization.

  10. Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting

    cs.LG 2026-05 unverdicted novelty 6.0

    Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.

  11. Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study

    cs.SE 2026-04 unverdicted novelty 6.0

    Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.

  12. GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling

    cs.CL 2026-04 unverdicted novelty 6.0

    GSQ applies a Gumbel-Softmax relaxation to learn discrete grid assignments in scalar quantization, closing most of the accuracy gap to vector methods like QTIP on Llama-3.1 models at 2-3 bits while using only symmetri...

  13. Depth Registers Unlock W4A4 on SwiGLU: A Reader/Generator Decomposition

    cs.CL 2026-04 unverdicted novelty 6.0

    Depth Registers plus hinge loss cut W4A4-induced perplexity collapse from 1727 to 119 in a 300M SwiGLU model by selectively taming reader-layer activations while leaving bilinear generator tails largely untouched.

  14. Rethinking Residual Errors in Compensation-based LLM Quantization

    cs.LG 2026-04 conditional novelty 6.0

    Redefining residual errors to include compensation-aware discrepancies and realigning calibration to full-precision outputs improves GPTQ and GPTAQ performance on LLMs.

  15. FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

    cs.LG 2026-04 unverdicted novelty 6.0

    Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.

  16. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

    cs.CL 2024-02 conditional novelty 6.0

    KIVI applies asymmetric 2-bit quantization to KV cache with per-channel keys and per-token values, reducing memory 2.6x and boosting throughput up to 3.47x with near-identical quality on Llama, Falcon, and Mistral.

  17. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    cs.CL 2023-05 unverdicted novelty 6.0

    Uptraining multi-head transformer checkpoints to grouped-query attention models achieves near multi-head quality at multi-query inference speeds using 5% additional compute.

  18. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    cs.CL 2022-11 unverdicted novelty 6.0

    BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.

  19. BitCal-TTS: Bit-Calibrated Test-Time Scaling for Quantized Reasoning Models

    cs.AI 2026-05 unverdicted novelty 5.0

    BitCal-TTS raises exact-match accuracy by 3.7 points (7B) and 2.8 points (14B) on small GSM8K shards for 4-bit Qwen2.5 models while cutting premature-stop rates and retaining token savings versus fixed-budget decoding.

  20. LoRaQ: Optimized Low Rank Approximation for 4-bit Quantization

    cs.LG 2026-04 unverdicted novelty 5.0

    LoRaQ enables fully sub-16-bit quantized diffusion models by optimizing low-rank error compensation in a data-free way, outperforming prior methods at equal memory cost on Pixart-Σ and SANA while supporting mixed low-...

  21. A KL Lens on Quantization: Fast, Forward-Only Sensitivity for Mixed-Precision SSM-Transformer Models

    cs.LG 2026-04 unverdicted novelty 5.0

    KL divergence provides a superior forward-only metric for identifying quantization-sensitive parts in SSM-Transformer hybrids, outperforming MSE and SQNR and supporting practical mixed-precision deployment on edge devices.

  22. SEPTQ: A Simple and Effective Post-Training Quantization Paradigm for Large Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    SEPTQ simplifies LLM post-training quantization to two steps via static global importance scoring and mask-guided column-wise weight updates, claiming superior results over baselines in low-bit settings.

  23. Yi: Open Foundation Models by 01.AI

    cs.CL 2024-03 unverdicted novelty 4.0

    Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.

Reference graph

Works this paper leans on

145 extracted references · 145 canonical work pages · cited by 22 Pith papers · 24 internal anchors
