AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs
Pith reviewed 2026-05-12 05:12 UTC · model grok-4.3
The pith
Layer-aware quantization stores LLM activations at near 4 bits and gradients at 8 bits to cut memory use and speed up training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that bit-width allocation for activations can be made layer-specific and stage-specific so that near-4-bit activation storage works reliably in pipeline-parallel setups, and that 8-bit gradient storage together with precision-preserving 8-bit All-Reduce reduces both memory footprint and communication time, allowing memory-efficient training of 8B–32B LLaMA models at full convergence.
What carries the argument
Layer-aware activation quantization algorithm that assigns bit widths according to layer type and pipeline stage, combined with 8-bit gradient storage and a precision-preserving 8-bit All-Reduce communication primitive.
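As a concrete illustration of the allocation idea, here is a minimal Python sketch of a layer-aware, stage-aware bit-width rule. The sensitivity ordering, the layer names, and the first-stage safety margin are all illustrative assumptions, not the paper's published rules.

```python
# Hypothetical layer-aware bit-width allocation. The sensitivity
# ordering (softmax/normalization > projections/MLP) and the
# first-stage margin are assumptions, not AGoQ's actual rules.

def allocate_activation_bits(layer_type: str, stage: int) -> int:
    """Return an activation storage bit-width for one layer."""
    base_bits = {
        "attn_proj": 4,     # QKV/output projections: assumed robust
        "mlp": 4,           # feed-forward activations: assumed robust
        "attn_softmax": 8,  # attention probabilities: assumed sensitive
        "layernorm": 8,     # normalization statistics: assumed sensitive
    }[layer_type]
    # Assumed stage rule: errors in early pipeline stages propagate
    # through every later stage, so grant one extra bit of margin.
    if stage == 0 and base_bits < 8:
        base_bits += 1
    return base_bits

# Example: a 4-stage pipeline; most layers sit near 4 bits on average.
plan = {(t, s): allocate_activation_bits(t, s)
        for s in range(4)
        for t in ("attn_proj", "attn_softmax", "mlp", "layernorm")}
```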
If this is right
- GPU memory required to train 8B–32B parameter models falls by up to 52 percent.
- End-to-end training throughput improves by up to 1.34× relative to current distributed systems.
- Pretraining reaches the same loss value as full-precision runs.
- Accuracy on downstream benchmarks stays statistically comparable to unquantized baselines.
Where Pith is reading between the lines
- The same layer-aware allocation idea could be tested on other model families such as Mistral or Qwen to check transferability.
- Combining AGoQ with existing sharding methods might allow training of models beyond 32B parameters on current GPU clusters.
- Measuring communication volume at larger scale would show how much the 8-bit All-Reduce step contributes to the reported speed-up.
- Extending the allocation rules to include weight quantization during training is a natural next measurement.
Load-bearing premise
The chosen rules for assigning bit widths to different layers and pipeline stages will continue to avoid accuracy loss when the method is applied to model sizes, tasks, or hardware setups outside the 8B–32B LLaMA experiments.
What would settle it
Apply AGoQ to a 70B-parameter model or a non-LLaMA architecture and measure whether pretraining loss diverges from the full-precision baseline or downstream task accuracy drops by more than a few percent.
Original abstract
Quantization is a key method for reducing the GPU memory requirement of training large language models (LLMs). Yet, current approaches are ineffective for 4-bit activations and 8-bit gradients, which would easily cause slow convergence or accuracy loss. To address this, we introduce AGoQ, incorporating two new techniques: 1) a layer-aware activation quantization algorithm that allocates appropriate bit-widths for activations of various layers based on their types and pipeline stages to achieve near 4-bit activation storage, and 2) a gradient quantization algorithm that reduces memory usage and shortens communication time by employing 8-bit gradient storage and precision-preserving 8-bit All-Reduce communication. We conduct extensive experiments using different sizes of LLMs on two GPU clusters (up to 64 GPUs), and the experimental results show that our AGoQ reduces the memory by up to 52% and achieves up to 1.34× improvement of training speed compared to state-of-the-art training systems Megatron-LM (w/ or w/o ZeRO), COAT and DeepSpeed with 8B to 32B LLaMA models, while achieving convergence loss on pretraining and comparable accuracy on downstream tasks with LLaMA architectures.
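To make the second technique concrete, the following is a minimal single-process NumPy simulation of one plausible precision-preserving 8-bit All-Reduce: all workers share a single quantization scale so their int8 codes lie on a common grid, and the reduction accumulates in int32 so no low-order bits are lost before the final dequantization. The shared-scale and int32-accumulation design is an assumption for illustration; the abstract does not specify the paper's actual primitive.

```python
import numpy as np

def int8_all_reduce(local_grads: list[np.ndarray]) -> np.ndarray:
    """Simulate an 8-bit All-Reduce over a list of worker gradients.

    Sketch only: a real implementation would exchange int8 buffers
    over NCCL/MPI; here the "communication" is a Python loop.
    """
    # One scale shared by all workers keeps the int8 codes on a
    # common grid, so they can be summed directly.
    shared_scale = max(np.abs(g).max() for g in local_grads) / 127.0 + 1e-12
    codes = [np.clip(np.round(g / shared_scale), -127, 127).astype(np.int8)
             for g in local_grads]
    # Accumulate in int32: summing up to ~2**23 int8 codes cannot
    # overflow, and no rounding occurs until dequantization.
    acc = np.zeros(codes[0].shape, dtype=np.int32)
    for c in codes:
        acc += c.astype(np.int32)
    return acc.astype(np.float32) * shared_scale / len(local_grads)

# Example: four workers, one gradient tensor each.
grads = [np.random.randn(1024).astype(np.float32) for _ in range(4)]
avg = int8_all_reduce(grads)
```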
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AGoQ for memory-efficient distributed LLM training via two techniques: (1) a layer-aware activation quantization algorithm that assigns bit-widths to activations based on layer type and pipeline stage to reach near-4-bit storage, and (2) 8-bit gradient quantization paired with a precision-preserving 8-bit All-Reduce. End-to-end experiments on LLaMA models (8B–32B) across two clusters (up to 64 GPUs) report up to 52% memory reduction and 1.34× training speedup versus Megatron-LM (with/without ZeRO), COAT, and DeepSpeed, while claiming convergence parity on pretraining and comparable downstream accuracy.
Significance. If the layer-aware heuristic generalizes, the work would offer a practical advance for scaling LLM training under tight memory budgets by safely using sub-8-bit activations. The manuscript earns credit for extensive multi-size, multi-cluster experiments that directly measure wall-clock speed and memory on realistic distributed setups; these provide concrete, reproducible evidence of the claimed gains when the allocation rules are applied to the tested LLaMA pretraining regime.
Major comments (3)
- [§3.1] Layer-aware activation quantization: the bit-width allocation rules are presented as fixed once chosen for the 8B–32B LLaMA models, yet the text supplies neither a derivation from quantization-error bounds nor an ablation comparing them to uniform 4-bit or alternative heuristics; this is load-bearing because the abstract itself states that naive 4-bit activations cause slow convergence or accuracy loss.
- [§4] Experimental results: Tables 1–3 and the associated figures report single-run memory, throughput, loss, and accuracy numbers without error bars, multiple random seeds, or statistical significance tests, making it impossible to assess whether the “comparable accuracy” and “convergence loss” claims are robust to training stochasticity.
- [§4.2] Ablation of allocation heuristic: no experiment isolates the contribution of the per-layer-type and per-pipeline-stage rules versus the gradient quantization alone or versus simpler bit-width schedules, so the central claim that these rules enable “near 4-bit activation storage” without accuracy degradation rests on an untested empirical choice.
Minor comments (2)
- [§3] The notation for activation bit-width variables (e.g., b_l for layer l) is introduced only in figures; an explicit definition in the main text of §3 would improve readability.
- [Figure 2] The pipeline-stage diagram would benefit from explicit per-stage bit-width labels to allow exact reproduction of the reported allocation.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the scale of our experiments. We address each major comment below with proposed revisions to improve the manuscript's clarity and rigor.
Point-by-point responses
- Referee: [§3.1] Layer-aware activation quantization: the bit-width allocation rules are presented as fixed once chosen for the 8B–32B LLaMA models, yet the text supplies neither a derivation from quantization-error bounds nor an ablation comparing them to uniform 4-bit or alternative heuristics; this is load-bearing because the abstract itself states that naive 4-bit activations cause slow convergence or accuracy loss.
  Authors: We agree that the allocation rules lack a formal derivation from quantization-error bounds and that an ablation is needed to justify the empirical choices. The rules were selected after observing higher quantization sensitivity in specific layers (e.g., certain attention projections) and pipeline stages during preliminary tuning. In revision we will add both a motivation subsection grounded in per-layer sensitivity measurements (a toy probe of this kind is sketched after these responses) and a new ablation comparing the layer-aware schedule against uniform 4-bit, uniform 5-bit, and layer-type-only heuristics, demonstrating their impact on convergence. revision: yes
- Referee: [§4] Experimental results: Tables 1–3 and the associated figures report single-run memory, throughput, loss, and accuracy numbers without error bars, multiple random seeds, or statistical significance tests, making it impossible to assess whether the “comparable accuracy” and “convergence loss” claims are robust to training stochasticity.
  Authors: We acknowledge that single-run reporting limits statistical assessment, especially given the high cost of 32B-scale runs. In the revision we will report results from three random seeds for all 8B experiments (including error bars on loss and accuracy) and perform basic significance checks. For 16B–32B models we will explicitly note that resource constraints precluded multiple seeds but that the observed trends remain consistent across model sizes and two distinct clusters, providing indirect support for robustness. revision: partial
- Referee: [§4.2] Ablation of allocation heuristic: no experiment isolates the contribution of the per-layer-type and per-pipeline-stage rules versus the gradient quantization alone or versus simpler bit-width schedules, so the central claim that these rules enable “near 4-bit activation storage” without accuracy degradation rests on an untested empirical choice.
  Authors: We will add a dedicated ablation subsection that isolates the activation rules. The new experiments will compare: (i) full AGoQ versus gradient quantization with full-precision activations, (ii) layer-aware allocation versus uniform 4-bit, and (iii) the full heuristic versus a simpler layer-type-only schedule. These will quantify the incremental memory and accuracy benefits of the per-layer and per-stage components. revision: yes
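The first response above proposes grounding the allocation rules in per-layer sensitivity measurements. A toy version of such a probe, sketched below in PyTorch, fake-quantizes a single module's output and records the resulting loss increase. The hook mechanism, the symmetric fake-quantizer, and the loss_fn signature are assumptions about how such a measurement might be run, not the authors' procedure.

```python
import torch

def fake_quant(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric per-tensor fake quantization (quantize, then dequantize)."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.detach().abs().amax() / qmax + 1e-12
    return (x / scale).round().clamp(-qmax, qmax) * scale

@torch.no_grad()
def layer_sensitivity(model, layer, batch, loss_fn, bits=4):
    """Loss increase when only `layer`'s activations are fake-quantized.

    Hypothetical probe: loss_fn is assumed to map model outputs to a
    scalar loss; `layer` is any submodule of `model`.
    """
    base_loss = loss_fn(model(batch)).item()
    # A forward hook that returns a tensor replaces the module's output.
    handle = layer.register_forward_hook(
        lambda mod, args, out: fake_quant(out, bits))
    quant_loss = loss_fn(model(batch)).item()
    handle.remove()
    return quant_loss - base_loss
```

Ranking layers by this score would give exactly the kind of sensitivity ordering the proposed motivation subsection could report.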
Circularity Check
No significant circularity; empirical algorithmic method with experimental validation
Full rationale
The paper introduces AGoQ as a set of algorithmic techniques (layer-aware activation bit allocation by type and pipeline stage, plus 8-bit gradient storage with precision-preserving All-Reduce) and validates them via end-to-end experiments on 8B-32B LLaMA models across GPU clusters. No mathematical derivation chain, closed-form prediction, or self-referential definition is present in the abstract or described method. Bit-width rules are stated as chosen for the tested configurations rather than derived from first principles or fitted in a way that renames inputs as outputs. Any self-citations (if present in the full text) are not load-bearing for the core claims, which rest on reported empirical results rather than tautological reduction. This is the expected non-finding for an engineering systems paper.