pith. machine review for the scientific record.

arxiv: 2208.07339 · v2 · submitted 2022-08-15 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links · Lean Theorem

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer

Pith reviewed 2026-05-13 13:28 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords 8-bit quantization · transformer inference · large language models · matrix multiplication · mixed precision · outlier features · memory reduction

The pith

LLM.int8() performs 8-bit matrix multiplication for transformers up to 175B parameters with no performance degradation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops LLM.int8(), a procedure that halves the memory needed for transformer inference by converting matrix multiplications to 8-bit while preserving full accuracy. It applies vector-wise quantization to most values and isolates a small set of systematic outlier dimensions for separate 16-bit handling. The result lets 175B parameter models run inference immediately after conversion from 16/32-bit checkpoints, making them practical on a single server with consumer GPUs instead of requiring multiple high-memory datacenter GPUs.
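
As a rough check on the memory claim, a back-of-envelope sketch in Python; the 2-bytes-per-FP16-parameter figure is standard, but the GPU configuration is our illustrative assumption, not a measurement from the paper.

    # Back-of-envelope memory arithmetic for 175B-parameter inference.
    params = 175e9
    fp16_gb = params * 2 / 1e9   # 2 bytes per parameter in FP16
    int8_gb = params * 1 / 1e9   # 1 byte per parameter in Int8
    print(f"FP16 weights: {fp16_gb:.0f} GB")   # ~350 GB
    print(f"Int8 weights: {int8_gb:.0f} GB")   # ~175 GB
    # Eight hypothetical 24 GB consumer cards give 192 GB: enough for the
    # Int8 weights, not the FP16 ones (activations and KV cache excluded).
    print(f"8 x 24 GB GPUs: {8 * 24} GB")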

Core claim

LLM.int8() performs Int8 matrix multiplication for feed-forward and attention projection layers in transformers. It first uses vector-wise quantization, with separate normalization constants for each inner product, to quantize most features. For emergent outliers, it applies a mixed-precision decomposition that isolates those dimensions into a 16-bit matrix multiplication, while more than 99.9% of values are still multiplied in 8-bit. Using this approach, inference in LLMs with up to 175B parameters shows no performance degradation.
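
To make the two parts concrete, a minimal NumPy sketch of the decomposition as described; the function name and shapes are our own, the 6.0 magnitude threshold follows the paper's reported default, and this stands in for, rather than reproduces, the released CUDA kernels.

    import numpy as np

    def llm_int8_matmul(X, W, threshold=6.0):
        # X: (tokens, hidden) activations; W: (hidden, out) weights.
        # Part 1: mixed-precision decomposition. Hidden dimensions holding
        # any outlier (|x| >= threshold) are kept at full input precision
        # (16-bit in the paper) and multiplied separately.
        outlier = np.any(np.abs(X) >= threshold, axis=0)
        C_outlier = X[:, outlier] @ W[outlier, :]

        # Part 2: vector-wise quantization for the remaining dimensions:
        # one scale per row of X and per column of W, i.e. per inner product.
        Xr, Wr = X[:, ~outlier], W[~outlier, :]
        cx = np.abs(Xr).max(axis=1, keepdims=True)  # (tokens, 1)
        cw = np.abs(Wr).max(axis=0, keepdims=True)  # (1, out)
        cx[cx == 0] = 1.0
        cw[cw == 0] = 1.0
        Xq = np.round(127.0 * Xr / cx).astype(np.int8)
        Wq = np.round(127.0 * Wr / cw).astype(np.int8)

        # Int8 matmul accumulated in int32, then dequantized by the outer
        # product of the scaling constants.
        C_i32 = Xq.astype(np.int32) @ Wq.astype(np.int32)
        C_regular = C_i32 * (cx * cw) / (127.0 * 127.0)
        return C_regular + C_outlier

On well-behaved inputs the result closely tracks X @ W; the decomposition exists because a few large-magnitude dimensions would otherwise inflate cx and crush the quantization resolution of everything else.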

What carries the argument

LLM.int8(), a two-part quantization procedure consisting of vector-wise quantization for the bulk of features and mixed-precision decomposition that isolates outlier dimensions for 16-bit computation.

If this is right

  • A 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation.
  • Memory required for inference is reduced by half.
  • Models such as OPT-175B or BLOOM can run on a single server with consumer GPUs (a loading sketch follows this list).
  • More than 99.9% of the values in the matrix multiplications are handled in 8-bit.
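
If these consequences hold, using the method is a one-flag change at load time. A sketch against the open-source integration (the released code lives in bitsandbytes and is surfaced through Hugging Face transformers; the model name is a small stand-in, and the exact flag spelling varies across library versions):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "facebook/opt-1.3b"  # small stand-in; the same call is the route to OPT-175B
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(
        name,
        device_map="auto",   # shard layers across available GPUs
        load_in_8bit=True,   # convert Linear layers to LLM.int8() at load time
    )

    inputs = tokenizer("The memory footprint of a 175B model", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0], skip_special_tokens=True))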

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same outlier-handling logic might extend to other matrix-heavy operations in neural networks beyond transformers.
  • If the outliers prove stable across training runs, the method could reduce memory during fine-tuning as well.
  • Widespread adoption would shift the practical limit of model size from available GPU memory to available compute.

Load-bearing premise

The emergent outlier features in transformers are highly systematic, limited to a small number of dimensions, and can be isolated via mixed-precision decomposition without introducing errors in the remaining 8-bit computations.
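
The premise is directly checkable from activations. A minimal probe, assuming the criterion the paper reports (magnitude at least 6.0, present in at least 25% of layers and 6% of token positions); the function and argument names are ours:

    import numpy as np

    def find_outlier_dims(hidden_states, magnitude=6.0,
                          layer_frac=0.25, token_frac=0.06):
        # hidden_states: list of (tokens, hidden) arrays, one per layer.
        # Fraction of layers in which each dim carries a large value.
        layer_hits = np.mean(
            [np.any(np.abs(h) >= magnitude, axis=0) for h in hidden_states],
            axis=0)
        # Fraction of token positions with a large value, pooled over layers.
        token_hits = np.mean(
            [np.mean(np.abs(h) >= magnitude, axis=0) for h in hidden_states],
            axis=0)
        systematic = (layer_hits >= layer_frac) & (token_hits >= token_frac)
        return np.where(systematic)[0]

If the returned set is small (under 0.1% of dimensions) and stable across inputs, the premise stands; a large or input-dependent set would undermine the mixed-precision split.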

What would settle it

Applying LLM.int8() to a 175B parameter model and checking whether perplexity or zero-shot task accuracy drops relative to the original 16/32-bit version; any measurable drop would falsify the zero-degradation claim.
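
A minimal version of that settling experiment, sketched with PyTorch against a Hugging Face-style causal LM; the helper is ours, and a real replication would use WikiText-2/C4 perplexity plus the paper's zero-shot suites rather than a single text.

    import math
    import torch

    @torch.no_grad()
    def perplexity(model, tokenizer, text):
        # Exponentiated next-token cross-entropy of the model on `text`.
        ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
        return math.exp(model(ids, labels=ids).loss.item())

    # ppl_fp16 = perplexity(model_fp16, tokenizer, held_out_text)
    # ppl_int8 = perplexity(model_int8, tokenizer, held_out_text)
    # The zero-degradation claim survives only if ppl_int8 matches ppl_fp16
    # within run-to-run noise; any systematic gap falsifies it.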

read the original abstract

Large language models have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cut the memory needed for inference by half while retaining full precision performance. With our method, a 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation. This is made possible by understanding and working around properties of highly systematic emergent features in transformer language models that dominate attention and transformer predictive performance. To cope with these features, we develop a two-part quantization procedure, LLM.int8(). We first use vector-wise quantization with separate normalization constants for each inner product in the matrix multiplication, to quantize most of the features. However, for the emergent outliers, we also include a new mixed-precision decomposition scheme, which isolates the outlier feature dimensions into a 16-bit matrix multiplication while still more than 99.9% of values are multiplied in 8-bit. Using LLM.int8(), we show empirically it is possible to perform inference in LLMs with up to 175B parameters without any performance degradation. This result makes such models much more accessible, for example making it possible to use OPT-175B/BLOOM on a single server with consumer GPUs. We open-source our software.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces LLM.int8(), a two-part quantization procedure for Int8 matrix multiplication in transformer feed-forward and attention layers. Vector-wise quantization with per-inner-product normalization constants handles most features, while a mixed-precision decomposition isolates a small set of emergent outlier dimensions into 16-bit computation (with >99.9% of values remaining in 8-bit). The central empirical claim is that this enables inference on models up to 175B parameters (e.g., OPT-175B, BLOOM) with no performance degradation relative to FP16, halving memory requirements and allowing deployment on consumer GPUs.

Significance. If the zero-degradation result holds under rigorous validation, the work has substantial practical significance: it removes a major hardware barrier for state-of-the-art LLMs, making 175B-scale models runnable on single servers with consumer GPUs while preserving zero-shot performance. The open-sourcing of the implementation further strengthens its potential impact.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (empirical results): the claim of 'without any performance degradation' on 175B-parameter models is load-bearing but rests on matching FP16 zero-shot numbers without reported sensitivity analysis on the outlier threshold, variance across random seeds or prompt distributions, or statistical tests; this leaves the no-residual-error assumption in the 8-bit path untested.
  2. [§3.2] §3.2 (mixed-precision decomposition): the procedure assumes emergent outliers are stable, limited to <0.1% dimensions, and perfectly isolatable with no leakage into the 8-bit path; the manuscript provides no ablation removing the 16-bit component on the largest model to confirm that the remaining 99.9% of the matmul truly incurs zero accuracy loss.
  3. [§4] §4 (experiments on OPT-175B): the reported matching of FP16 numbers is presented without baselines for alternative quantization schemes or ablations of the vector-wise normalization constants, making it impossible to isolate the contribution of the mixed-precision step to the zero-degradation result.
minor comments (2)
  1. The manuscript mentions open-sourcing the software but does not include a repository URL, commit hash, or reproduction instructions.
  2. [§3.1] Notation for the normalization constants (e.g., per-vector scaling factors) could be clarified with an explicit equation in §3.1 to avoid ambiguity when implementing the vector-wise quantization (one explicit form is sketched after this list).
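
For reference, one explicit way to write the vector-wise scheme consistent with the manuscript's description (our notation: c_x holds one constant per row of X, c_w one per column of W, with their outer product scaling the int32 result elementwise):

    \[
      \mathbf{A}_{\mathrm{i8}} = \big\lfloor \mathbf{c}_x \odot \mathbf{X}_{\mathrm{f16}} \big\rceil ,
      \quad c_{x,i} = \frac{127}{\max_j |X_{ij}|} ,
      \qquad
      \mathbf{B}_{\mathrm{i8}} = \big\lfloor \mathbf{c}_w \odot \mathbf{W}_{\mathrm{f16}} \big\rceil ,
      \quad c_{w,j} = \frac{127}{\max_i |W_{ij}|} ,
    \]
    \[
      \mathbf{C}_{\mathrm{f16}} \;\approx\;
      \frac{1}{\mathbf{c}_x \otimes \mathbf{c}_w} \odot
      \big( \mathbf{A}_{\mathrm{i8}} \, \mathbf{B}_{\mathrm{i8}} \big) .
    \]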

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful review and for highlighting areas where the empirical validation can be strengthened. We respond to each major comment below, providing clarifications from the manuscript and indicating where revisions will be made.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (empirical results): the claim of 'without any performance degradation' on 175B-parameter models is load-bearing but rests on matching FP16 zero-shot numbers without reported sensitivity analysis on the outlier threshold, variance across random seeds or prompt distributions, or statistical tests; this leaves the no-residual-error assumption in the 8-bit path untested.

    Authors: We agree that the manuscript would benefit from additional analysis on sensitivity and variance. The outlier threshold is determined based on the distribution of activation magnitudes observed in smaller models, as detailed in §3.2, and we observed no degradation across the reported zero-shot tasks for OPT-175B and BLOOM. In the revised version, we will include a sensitivity analysis on the threshold for smaller models and report results on additional prompt sets to address variance across distributions. Full statistical tests on 175B models are computationally intensive, but the consistent matching across benchmarks supports the claim. revision: partial

  2. Referee: [§3.2] §3.2 (mixed-precision decomposition): the procedure assumes emergent outliers are stable, limited to <0.1% dimensions, and perfectly isolatable with no leakage into the 8-bit path; the manuscript provides no ablation removing the 16-bit component on the largest model to confirm that the remaining 99.9% of the matmul truly incurs zero accuracy loss.

    Authors: The analysis in §3.2 shows that the outlier dimensions are consistent across layers and inputs for a given model, with less than 0.1% of dimensions affected. We provide ablations in §4 on models up to 13B parameters demonstrating that the 8-bit path alone leads to accuracy loss when outliers are not handled in 16-bit. Performing the full ablation on 175B would require significant additional compute resources beyond what was available for the original experiments. We will expand the discussion in the revision to emphasize the scaling behavior observed. revision: no

  3. Referee: [§4] §4 (experiments on OPT-175B): the reported matching of FP16 numbers is presented without baselines for alternative quantization schemes or ablations of the vector-wise normalization constants, making it impossible to isolate the contribution of the mixed-precision step to the zero-degradation result.

    Authors: Section 4 includes direct comparisons to per-tensor quantization and other baselines, which show substantial degradation on large models. Ablations of the vector-wise quantization are presented for smaller models in §4, isolating its contribution to reducing quantization error. To further isolate the mixed-precision step, we will add explicit results with and without it for models up to 6.7B in the revision, and note that the zero-degradation on 175B relies on both components working together as shown in the scaling experiments. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical quantization procedure validated directly on large models

full rationale

The paper introduces LLM.int8() as a two-part procedure (vector-wise 8-bit quantization plus mixed-precision outlier decomposition) whose normalization constants are computed on-the-fly from each input activation vector. The central claim—that this yields zero degradation on downstream tasks for models up to 175B—is supported by direct empirical comparison to FP16 baselines on OPT-175B and BLOOM, with no intermediate derivation that reduces to a fitted parameter, self-definition, or self-citation chain. The method is externally falsifiable via the reported zero-shot accuracy numbers and open-source implementation; no load-bearing step collapses to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the observed existence of systematic emergent outlier features that dominate transformer performance; these are treated as an empirical property of the models rather than derived quantities.

axioms (1)
  • domain assumption Transformer language models contain highly systematic emergent outlier features in attention and feed-forward layers that dominate predictive performance.
    This property is invoked to justify the need for the mixed-precision decomposition step.

pith-pipeline@v0.9.0 · 5547 in / 1222 out tokens · 64890 ms · 2026-05-13T13:28:57.607036+00:00 · methodology

discussion (0)


Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon

    cs.PF 2026-05 unverdicted novelty 7.0

    A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.

  2. Chronos: Learning the Language of Time Series

    cs.LG 2024-03 conditional novelty 7.0

    Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.

  3. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    cs.LG 2024-01 conditional novelty 7.0

    Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

  4. Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    cs.CV 2023-10 unverdicted novelty 7.0

    A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.

  5. Accelerating Large Language Model Decoding with Speculative Sampling

    cs.CL 2023-02 accept novelty 7.0

    Speculative sampling accelerates LLM decoding 2-2.5x by letting a draft model propose short sequences that the target model scores in parallel, then applies modified rejection sampling to keep the exact target distribution.

  6. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    cs.LG 2022-10 unverdicted novelty 7.0

    GPTQ quantizes 175B-parameter GPT models to 3-4 bits per weight in one shot using approximate second-order information, achieving negligible accuracy degradation and 3-4x inference speedups.

  7. LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

    cs.LG 2026-05 unverdicted novelty 6.0

    LoKA enables practical FP8 use in numerically sensitive large recommendation models via profiling, model adaptations, and runtime kernel orchestration.

  8. OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization

    cs.LG 2026-05 unverdicted novelty 6.0

    OSAQ suppresses weight outliers in LLMs via a closed-form additive transformation from the Hessian's stable null space, improving 2-bit quantization perplexity by over 40% versus vanilla GPTQ with no inference overhead.

  9. OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization

    cs.LG 2026-05 unverdicted novelty 6.0

    OSAQ uses the low-rank structure of the Hessian to construct a closed-form additive weight transformation that suppresses outliers without changing task loss, enabling better low-bit LLM quantization.

  10. Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting

    cs.LG 2026-05 unverdicted novelty 6.0

    Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.

  11. Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study

    cs.SE 2026-04 unverdicted novelty 6.0

    Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.

  12. GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling

    cs.CL 2026-04 unverdicted novelty 6.0

    GSQ applies a Gumbel-Softmax relaxation to learn discrete grid assignments in scalar quantization, closing most of the accuracy gap to vector methods like QTIP on Llama-3.1 models at 2-3 bits while using only symmetri...

  13. Depth Registers Unlock W4A4 on SwiGLU: A Reader/Generator Decomposition

    cs.CL 2026-04 unverdicted novelty 6.0

    Depth Registers plus hinge loss cut W4A4-induced perplexity collapse from 1727 to 119 in a 300M SwiGLU model by selectively taming reader-layer activations while leaving bilinear generator tails largely untouched.

  14. Rethinking Residual Errors in Compensation-based LLM Quantization

    cs.LG 2026-04 conditional novelty 6.0

    Redefining residual errors to include compensation-aware discrepancies and realigning calibration to full-precision outputs improves GPTQ and GPTAQ performance on LLMs.

  15. FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

    cs.LG 2026-04 unverdicted novelty 6.0

    Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.

  16. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

    cs.CL 2024-02 conditional novelty 6.0

    KIVI applies asymmetric 2-bit quantization to KV cache with per-channel keys and per-token values, reducing memory 2.6x and boosting throughput up to 3.47x with near-identical quality on Llama, Falcon, and Mistral.

  17. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    cs.CL 2023-05 unverdicted novelty 6.0

    Uptraining multi-head transformer checkpoints to grouped-query attention models achieves near multi-head quality at multi-query inference speeds using 5% additional compute.

  18. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    cs.CL 2022-11 unverdicted novelty 6.0

    BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.

  19. BitCal-TTS: Bit-Calibrated Test-Time Scaling for Quantized Reasoning Models

    cs.AI 2026-05 unverdicted novelty 5.0

    BitCal-TTS raises exact-match accuracy by 3.7 points (7B) and 2.8 points (14B) on small GSM8K shards for 4-bit Qwen2.5 models while cutting premature-stop rates and retaining token savings versus fixed-budget decoding.

  20. LoRaQ: Optimized Low Rank Approximation for 4-bit Quantization

    cs.LG 2026-04 unverdicted novelty 5.0

    LoRaQ enables fully sub-16-bit quantized diffusion models by optimizing low-rank error compensation in a data-free way, outperforming prior methods at equal memory cost on Pixart-Σ and SANA while supporting mixed low-...

  21. A KL Lens on Quantization: Fast, Forward-Only Sensitivity for Mixed-Precision SSM-Transformer Models

    cs.LG 2026-04 unverdicted novelty 5.0

    KL divergence provides a superior forward-only metric for identifying quantization-sensitive parts in SSM-Transformer hybrids, outperforming MSE and SQNR and supporting practical mixed-precision deployment on edge devices.

  22. SEPTQ: A Simple and Effective Post-Training Quantization Paradigm for Large Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    SEPTQ simplifies LLM post-training quantization to two steps via static global importance scoring and mask-guided column-wise weight updates, claiming superior results over baselines in low-bit settings.

  23. Yi: Open Foundation Models by 01.AI

    cs.CL 2024-03 unverdicted novelty 4.0

    Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.

Reference graph

Works this paper leans on

145 extracted references · 145 canonical work pages · cited by 22 Pith papers · 24 internal anchors
