Learned Subspace Compression for Communication-Efficient Pipeline Parallelism

Edouard Oyallon; Eugene Belilovsky; Paul Janson

arxiv: 2606.05484 · v1 · pith:3EZGRNNPnew · submitted 2026-06-03 · 💻 cs.LG

Pith reviewed 2026-06-28 06:39 UTC · model grok-4.3

keywords: pipeline parallelism, activation compression, Stiefel manifold, orthogonal projections, large language models, communication efficiency, manifold optimization, residual vector quantization

The pith

MAPL learns per-stage orthogonal projections on the Stiefel manifold to compress activations in pipeline parallelism with negligible performance loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Manifold Aware Projection Learning as a way to reduce the dominant communication cost between pipeline stages when training large language models on low-bandwidth networks. Instead of using fixed orthogonal projections that degrade performance and require extra training tweaks, MAPL lets each stage learn and adapt its own low-rank compression subspace while enforcing orthogonality through manifold-constrained optimization. It adds lightweight per-stage anchor embeddings to reconstruct full activations and optionally layers on residual vector quantization with synchronized codebooks. A reader would care because the approach reportedly delivers much better compression-accuracy tradeoffs than prior subspace methods on LLaMA models ranging from 150 million to 1 billion parameters.

Core claim

MAPL treats inter-stage compression as a learnable orthogonal projection under explicit Stiefel manifold constraints. Rather than prescribing a fixed global subspace, MAPL lets each pipeline stage discover and continuously adapt its own task-optimal compression subspace via manifold-constrained steepest descent. To recover token-specific signals at stage boundaries, it introduces per-stage factorized anchor embeddings that allow for full-rank activation reconstruction with negligible communication overhead. Residual vector quantization can be incorporated after projection with a streaming codebook synchronization protocol that amortizes dictionary communication.
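As a rough illustration of the boundary computation described above, the sketch below subtracts a factorized anchor, projects onto r orthonormal directions, and reconstructs on the receiving side. It is a NumPy toy under stated assumptions, not the authors' implementation: the function names `compress` and `reconstruct`, the arrays `e_small` and `p`, the shapes, and the idea that the receiver can recompute the anchor from token ids are all hypothetical choices made here for illustration.

```python
import numpy as np

def random_stiefel(d, r, rng):
    """Return a d x r matrix with orthonormal columns (a point on the Stiefel manifold)."""
    q, _ = np.linalg.qr(rng.standard_normal((d, r)))
    return q

def compress(x, token_ids, e_small, p, a):
    """Sender side (assumed): subtract the factorized anchor, then project d -> r."""
    anchor = e_small[token_ids] @ p      # (b, n, d), token-dependent offset
    return (x - anchor) @ a              # (b, n, r), the payload that is sent

def reconstruct(z, token_ids, e_small, p, a):
    """Receiver side (assumed): lift r -> d and add the anchor back."""
    anchor = e_small[token_ids] @ p
    return z @ a.T + anchor

rng = np.random.default_rng(0)
b, n, d, r, k, vocab = 2, 5, 64, 16, 8, 100   # illustrative sizes only
a = random_stiefel(d, r, rng)                 # stands in for a learned projector
e_small = rng.standard_normal((vocab, k))     # small per-token table (hypothetical)
p = rng.standard_normal((k, d))               # lifts the small table to width d
token_ids = rng.integers(0, vocab, size=(b, n))

# Toy activations whose anchor-centered part lies exactly in span(a),
# so the round trip is lossless in this constructed case.
x = e_small[token_ids] @ p + rng.standard_normal((b, n, r)) @ a.T

z = compress(x, token_ids, e_small, p, a)
x_hat = reconstruct(z, token_ids, e_small, p, a)
ratio = d / r                                 # values sent per token shrink by this factor
```

Real activations are only approximately low-rank, so the round trip would lose the component outside span(a). For scale: if activations are stored in 16 bits (an assumption), d = 1024 gives 2048 bytes per token, which matches the uncompressed baseline quoted in Figure 1, and r = 128 would send 256 bytes per token.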

What carries the argument

Manifold Aware Projection Learning (MAPL) performing manifold-constrained steepest descent on per-stage Stiefel manifolds to learn adaptive orthogonal compression subspaces, paired with factorized anchor embeddings for reconstruction.
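The text here does not specify the update rule, so the following is only a generic sketch of one way to take gradient steps while staying on the Stiefel manifold: project the Euclidean gradient onto the tangent space, step, and retract with a polar decomposition. The objective (a toy reconstruction error on synthetic data), the step size, and the names `tangent_project`, `polar_retract`, and `stiefel_step` are assumptions for illustration and should not be read as MAPL's actual optimizer.

```python
import numpy as np

def tangent_project(a, g):
    """Project a Euclidean gradient g onto the tangent space of the Stiefel manifold at a."""
    sym = (a.T @ g + g.T @ a) / 2.0
    return g - a @ sym

def polar_retract(m):
    """Map a full-column-rank matrix to the closest matrix with orthonormal columns."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def stiefel_step(a, g, lr):
    """One Riemannian gradient step followed by a polar retraction."""
    return polar_retract(a - lr * tangent_project(a, g))

rng = np.random.default_rng(0)
d, r, n_tokens = 32, 4, 256                     # illustrative sizes only
# Synthetic "activations": a dominant r-dimensional subspace plus small noise.
basis = np.linalg.qr(rng.standard_normal((d, r)))[0]
x = 3.0 * rng.standard_normal((n_tokens, r)) @ basis.T
x = x + 0.1 * rng.standard_normal((n_tokens, d))

def recon_error(a):
    """Squared error of projecting x onto span(a) and lifting back."""
    return float(np.linalg.norm(x - x @ a @ a.T) ** 2)

cov = x.T @ x
cov = cov / np.linalg.norm(cov, 2)              # spectral norm 1 keeps the step well scaled

a = np.linalg.qr(rng.standard_normal((d, r)))[0]
initial_error = recon_error(a)
for _ in range(200):
    grad = -2.0 * cov @ a                       # Euclidean gradient of -trace(a^T cov a)
    a = stiefel_step(a, grad, lr=0.1)
final_error = recon_error(a)
orth_defect = float(np.linalg.norm(a.T @ a - np.eye(r)))
```

On the manifold, minimizing the reconstruction error is equivalent to maximizing trace(aᵀ cov a), which is why the simpler gradient is used. In end-to-end training the gradient would instead come from backpropagating the language-modeling loss through the boundary.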

If this is right

High activation compression becomes possible in pipeline-parallel LLM training while keeping performance degradation negligible.
Each stage can continuously adapt its compression subspace to the task rather than relying on a single fixed projection.
The performance-compression tradeoff improves substantially over Subspace Networks across LLaMA models from 150M to 1B parameters.
The method integrates directly into existing pipeline parallelism code without requiring global subspace prescriptions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The per-stage adaptation might reduce the hardware requirement for high-bandwidth interconnects when scaling to clusters with heterogeneous links.
Anchor embeddings could be extended to other boundary compression settings such as model parallelism or federated averaging.
Testing whether the same manifold optimization remains stable on models larger than 1B parameters would clarify the method's scaling limits.

Load-bearing premise

That manifold-constrained steepest descent on per-stage Stiefel manifolds can be performed stably during end-to-end training without non-standard adaptations and that the resulting subspaces remain task-optimal throughout optimization.

What would settle it

Running end-to-end training of a LLaMA-scale model under MAPL and observing either large accuracy degradation relative to uncompressed baselines or optimization divergence that fixed-projection methods avoid.

Figures

Figures reproduced from arXiv: 2606.05484 by Edouard Oyallon, Eugene Belilovsky, Paul Janson.

**Figure 1.** Pareto frontier for compressed pipeline-parallel training. Validation cross-entropy versus communication cost (bytes per token, with the corresponding compression ratio relative to the 2048-byte uncompressed baseline) for a 150M-parameter model trained with P = 4 pipeline stages on DCLM-10B using the Muon optimizer [21] unless otherwise stated. Lower-left is better. We compare our learned projection (“Ours…

**Figure 2.** Overview of MAPL compression at a pipeline stage boundary, repeated across all P − 1 inter-stage boundaries. At each boundary, the token-dependent offset is subtracted from the boundary activation X^b_p ∈ R^{b×n×d} before transmission (red arrow): the first stage subtracts the original token embeddings, while subsequent stages subtract their per-stage factorized anchor embeddings E^small_p[tids] P_p, where P_p i…

**Figure 3.** Boundary activations exhibit intrinsic low-rank structure across all pipeline stages, with rank-250 truncation retaining ≥99% of activation energy. (a) Singular value spectra of the centered boundary activations X^b_p − E[tids] (reshaped to (B · T) × d) for all P − 1 = 7 inter-stage boundaries of a 150M LLaMA model (d = 1024, P = 8) trained with Muon [21] on DCLM [28]. The x-axis indexes singular values in d…

**Figure 4.** Empirical validation of learned orthogonal projectors for activation compression in pipeline-parallel training. (a) Pairwise mean principal angles (degrees) between learned Stiefel manifold projectors across pipeline stages i and j, computed via arccos(σ_k(A_i^⊤ A_j)). Large off-diagonal angles (up to 72°) confirm that projectors across non-adjacent stages converge to geometrically distinct subspaces, whil…

**Figure 5.** Dynamics of the stable rank, defined as srank(A) = ‖A‖_F^2 / ‖A‖_2^2, for layers adjacent to the compression boundary over the course of training. (a) Under a learnable Stiefel compression (r = 128), the attention output projection rapidly collapses to a severely low-rank structure, converging toward near rank-1 behavior. In contrast, a fixed orthogonal compression initially triggers a large transient incre…
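The quantities named in the captions of Figures 3 to 5 (energy retained by a rank-k truncation, principal angles between two projectors, and stable rank) can each be computed from singular values. The sketch below shows one straightforward NumPy way to do so; the function names and the small test matrices are assumptions for illustration, and nothing here reproduces the paper's measurements.

```python
import numpy as np

def energy_retained(x, k):
    """Fraction of squared Frobenius norm kept by the best rank-k approximation of x."""
    s = np.linalg.svd(x, compute_uv=False)
    return float(np.sum(s[:k] ** 2) / np.sum(s ** 2))

def principal_angles_deg(a_i, a_j):
    """Principal angles (degrees) between the column spaces of two orthonormal bases."""
    s = np.linalg.svd(a_i.T @ a_j, compute_uv=False)
    return np.degrees(np.arccos(np.clip(s, 0.0, 1.0)))   # clip guards against rounding

def stable_rank(m):
    """Squared Frobenius norm divided by squared spectral norm."""
    return float(np.linalg.norm(m, "fro") ** 2 / np.linalg.norm(m, 2) ** 2)

rng = np.random.default_rng(0)

# An exactly rank-2 matrix keeps essentially all of its energy at k = 2.
low_rank = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 10))
energy_k2 = energy_retained(low_rank, 2)

# Identical subspaces give angles of 0 degrees, orthogonal ones give 90 degrees.
e1 = np.eye(6)[:, :2]
e2 = np.eye(6)[:, 2:4]
angles_same = principal_angles_deg(e1, e1)
angles_orth = principal_angles_deg(e1, e2)

# The identity has stable rank equal to its size; an outer product has stable rank 1.
srank_identity = stable_rank(np.eye(4))
srank_rank_one = stable_rank(np.outer(rng.standard_normal(5), rng.standard_normal(7)))
```

A stable rank close to 1, as in the last line, is the kind of collapse the Figure 5 caption describes for the attention output projection.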

Original abstract

Pipeline parallelism enables training of large language models that exceed single-device memory, yet inter-stage activation communication becomes the dominant bottleneck when trained on low-bandwidth networks. Recent work in this area has proposed using fixed orthogonal projections to compress activations. However, this still results in a significant performance degradation and requires a number of non-standard adaptations to constrain the optimization. A natural alternative is to learn a low rank projection for each pipeline stage, however maintaining the necessary orthogonality of these projectors during training remains a challenge. We present Manifold Aware Projection Learning (MAPL), a method that treats inter-stage compression as a learnable orthogonal projection under explicit Stiefel manifold (orthogonal matrices) constraints. Rather than prescribing a fixed global subspace, MAPL lets each pipeline stage discover and continuously adapt its own task-optimal compression subspace via manifold-constrained steepest descent. To recover token-specific signals at stage boundaries, we introduce per-stage factorized anchor embeddings that allow for full-rank activation reconstruction with negligible communication overhead. We further show that we can incorporate residual vector quantization after projection with a streaming codebook synchronization protocol that amortizes dictionary communication. Across LLaMA models from 150M to 1B parameters we show that MAPL can be easily applied to the existing pipeline and can achieve high compression with neglibile performance degradation with a drastically improved tradeoffs in performance vs. compression compared to Subspace Networks.
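To make the residual vector quantization step mentioned in the abstract more concrete, here is a minimal greedy RVQ encoder and decoder in NumPy. It is a sketch under stated assumptions: the codebooks are random rather than learned, the zero codeword in each codebook is a convenience added here so that a stage can never increase the residual, and the streaming codebook synchronization protocol from the paper is not modeled at all.

```python
import numpy as np

def rvq_encode(z, codebooks):
    """Greedy RVQ: each stage quantizes the residual left by the previous stages.

    z: (n, r) projected activations. codebooks: list of (K, r) arrays.
    Returns an (n, num_stages) integer array of codeword indices.
    """
    residual = z.copy()
    indices = []
    for cb in codebooks:
        # Squared distance from every residual row to every codeword.
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        indices.append(idx)
        residual = residual - cb[idx]
    return np.stack(indices, axis=1)

def rvq_decode(indices, codebooks):
    """Sum the selected codewords across stages."""
    return sum(cb[indices[:, s]] for s, cb in enumerate(codebooks))

rng = np.random.default_rng(0)
n, r, K, stages = 64, 8, 16, 3                 # illustrative sizes only
codebooks = []
for s in range(stages):
    cb = rng.standard_normal((K, r)) * (0.5 ** s)   # finer scale at later stages
    cb[0] = 0.0                                     # zero codeword (demo convenience)
    codebooks.append(cb)

z = rng.standard_normal((n, r))
codes = rvq_encode(z, codebooks)

# Reconstruction error when decoding with only the first 1, 2, 3 stages.
errors = [
    float(np.linalg.norm(z - rvq_decode(codes[:, : s + 1], codebooks[: s + 1])))
    for s in range(stages)
]
bits_per_token = stages * int(np.log2(K))      # indices sent instead of r floats
```

Only the integer indices cross the network, so both sides must hold the same codebooks; keeping them in agreement is the role the abstract assigns to the streaming synchronization protocol.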

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MAPL adds per-stage learned Stiefel projectors and factorized anchors to pipeline compression, but the abstract gives no numbers or optimization details so the performance claims stay unverified.

The core idea is to replace fixed orthogonal projections between pipeline stages with learned ones that each stage adapts on its own Stiefel manifold, plus factorized anchor embeddings that recover token signals at low extra cost and a streaming codebook for residual quantization. This combination is new relative to the fixed-projection papers and the Subspace Networks baseline.

It targets a real bottleneck: activation communication in pipeline-parallel LLM training on low-bandwidth networks. Letting each stage discover its own subspace is a reasonable step beyond a single global projection, and the anchors address the token-specific information loss that fixed methods struggle with.

The main weakness is the missing evidence. The abstract states that MAPL achieves high compression with negligible degradation and better tradeoffs than baselines on LLaMA models from 150M to 1B parameters, yet it supplies no tables, error bars, ablation numbers, or training curves. The stress-test point about stable end-to-end manifold optimization is not resolved in the text; there is no description of the retraction, step-size schedule, or orthogonality enforcement, so it is unclear whether the method avoids the non-standard adaptations it criticizes in prior work.

If the full paper shows the optimization runs cleanly and the subspaces stay useful, the central claim holds. Without those details the empirical part is thin. The method description itself is coherent and the problem is scoped correctly.

This is for people working on distributed training efficiency. A reader focused on communication compression would find the technical approach useful even before the results are fully checked. It deserves peer review because the problem is concrete and the extension is a direct response to existing limitations, though the current version would need the experimental support filled in.

Referee Report

2 major / 2 minor

Summary. The paper claims that Manifold Aware Projection Learning (MAPL) enables communication-efficient pipeline parallelism by learning per-stage orthogonal projections on Stiefel manifolds using manifold-constrained steepest descent, along with factorized anchor embeddings and residual vector quantization. It reports that this achieves high compression with negligible performance degradation on LLaMA models from 150M to 1B parameters, with better tradeoffs than Subspace Networks.

Significance. If the central optimization procedure is stable and the empirical claims are supported by data, the work could advance efficient training of large models by reducing inter-stage communication in pipeline parallelism through adaptive, learned subspaces rather than fixed projections.

major comments (2)

[Method] Method section: The description of MAPL relies on performing manifold-constrained steepest descent on per-stage Stiefel manifolds during end-to-end training, but no specifics are given on the retraction, step-size rule, or orthogonality enforcement mechanism. This is critical because the abstract contrasts it with fixed projections that require non-standard adaptations, yet without these details the stability claim cannot be assessed.
[Experiments] Experiments section: The abstract states empirical gains across LLaMA models but the provided text supplies no quantitative tables, error bars, ablation details on the manifold optimization stability, or training curves, which are necessary to support the claim of negligible degradation and improved tradeoffs.

minor comments (2)

[Abstract] Typo: 'neglibile' should be 'negligible'.
[Abstract] The phrase 'drastically improved tradeoffs in performance vs. compression' is grammatically awkward; consider rephrasing for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and will revise the manuscript to incorporate the requested details and evidence.

Referee: [Method] Method section: The description of MAPL relies on performing manifold-constrained steepest descent on per-stage Stiefel manifolds during end-to-end training, but no specifics are given on the retraction, step-size rule, or orthogonality enforcement mechanism. This is critical because the abstract contrasts it with fixed projections that require non-standard adaptations, yet without these details the stability claim cannot be assessed.

Authors: We agree that the Method section requires additional implementation specifics for reproducibility and stability assessment. In the revised manuscript we will expand this section to detail the retraction operator (polar decomposition), the step-size rule (manifold-adapted Armijo line search), and orthogonality enforcement (via the Stiefel manifold parameterization within the optimizer). These additions will also clarify how the approach avoids the non-standard adaptations needed for fixed projections. revision: yes
Referee: [Experiments] Experiments section: The abstract states empirical gains across LLaMA models but the provided text supplies no quantitative tables, error bars, ablation details on the manifold optimization stability, or training curves, which are necessary to support the claim of negligible degradation and improved tradeoffs.

Authors: We acknowledge the absence of these elements in the current text. In the revision we will incorporate quantitative tables reporting compression ratios versus performance metrics (with standard error bars from multiple runs), ablation studies on manifold optimization stability, and training curves demonstrating convergence behavior and negligible degradation. These will be added to the Experiments section to directly support the claimed tradeoffs versus Subspace Networks. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method is an optimization procedure with external empirical claims

The paper presents MAPL as a manifold-constrained steepest descent procedure on Stiefel manifolds for per-stage projectors, combined with anchor embeddings and residual quantization. No equations, derivations, or self-citations are supplied that reduce the reported compression gains or performance claims to quantities fitted inside the same experiment by construction. The central results are empirical tradeoffs on LLaMA models (150M-1B), which are independent of any internal redefinition or self-referential prediction. This is the most common honest non-finding for a methods paper describing an optimization technique.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that Stiefel-manifold optimization can be performed stably inside standard pipeline training loops and that the learned subspaces remain useful without additional regularization; no explicit free parameters, axioms, or invented physical entities are named in the abstract.

Reference graph

Works this paper leans on

68 extracted references · 13 canonical work pages · 5 internal anchors

[1] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. Qsgd: Communication-efficient sgd via gradient quantization and encoding. Advances in neural information processing systems, 30, 2017.
[2] Noah Amsel, David Persson, Christopher Musco, and Robert M Gower. The polar express: Optimal matrix sign methods and their application to the muon algorithm. International Conference on Learning Representations (ICLR), 2026.
[3] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signsgd: Compressed optimisation for non-convex problems. In International conference on machine learning, pages 560–569. PMLR, 2018.
[4] Song Bian, Dacheng Li, Hongyi Wang, Eric P Xing, and Shivaram Venkataraman. Does compressing activations help model parallel training? Proceedings of Machine Learning and Systems, 6:239–252, 2024.
[5] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020.
[6] Jianfei Chen, Lianmin Zheng, Zhewei Yao, Dequan Wang, Ion Stoica, Michael Mahoney, and Joseph Gonzalez. Actnn: Reducing training memory footprint via 2-bit activation compressed training. In International Conference on Machine Learning, pages 1803–1813. PMLR, 2021.
[7] Xi Chen, Kaituo Feng, Changsheng Li, Xunhao Lai, Xiangyu Yue, Ye Yuan, and Guoren Wang. Fira: Can we achieve full-rank training of LLMs under low-rank constraint? In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026.
[8] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
[9] Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 8-bit optimizers via block-wise quantization. In International Conference on Learning Representations, 2022.
[10] Michael Diskin, Alexey Bukhtiyarov, Max Ryabinin, Lucile Saulnier, Anton Sinitsin, Dmitry Popov, Dmitry V Pyrkin, Maxim Kashirin, Alexander Borzunov, Albert Villanova del Moral, et al. Distributed deep learning in open collaborations. Advances in Neural Information Processing Systems, 34:7879–7897, 2021.
[11] Arthur Douillard, Yani Donchev, J Keith Rush, Satyen Kale, Zachary Charles, Gabriel Teston, Zachary Garrett, Jiajun Shen, Ross McIlroy, David Lacey, Alexandre Rame, Arthur Szlam, Marc’Aurelio Ranzato, and Paul R Barham. Streaming diloco with overlapping communication. In Second Conference on Language Modeling, 2025.
[12] Arthur Douillard, Qixuang Feng, Andrei A. Rusu, Rachita Chhaparia, Yani Donchev, Adhiguna Kuncoro, Marc’Aurelio Ranzato, Arthur Szlam, and Jiajun Shen. Diloco: Distributed low-communication training of language models. CoRR, abs/2311.08105, 2023.
[13] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019.
[14] Tomer Galanti, Zachary S Siegel, Aparna Gupte, and Tomaso A Poggio. Sgd with weight decay secretly minimizes the ranks of your neural networks. In The Second Conference on Parsimony and Learning (Proceedings Track), 2025.
[15] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The language model evaluation harness, 07 2024 ...
[16] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
[17] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019.
[18] Minyoung Huh, Brian Cheung, Jeremy Bernstein, Phillip Isola, and Pulkit Agrawal. Training neural networks from scratch with parallel low-rank adapters. arXiv preprint arXiv:2402.16828, 2024.
[19] Sami Jaghouar, Jack Min Ong, and Johannes Hagemann. Opendiloco: An open-source framework for globally distributed low-communication training, July 2024.
[20] Paul Janson, Edouard Oyallon, and Eugene Belilovsky. Stabilizing native low-rank LLM pretraining. In Forty-third International Conference on Machine Learning, 2026.
[21] Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024.
[22] Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi. Error feedback fixes signsgd and other gradient compression schemes. In International conference on machine learning, pages 3252–3261. PMLR, 2019.
[23] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
[24] Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, and Sebastian Stich. A unified theory of decentralized sgd with changing topology and local updates. In International conference on machine learning, pages 5381–5393. PMLR, 2020.
[25] Anastasia Koloskova, Sebastian Stich, and Martin Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication. In International conference on machine learning, pages 3478–3487. PMLR, 2019.
[26] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. In International Conference on Learning Representations, 2020.
[27] Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. Advances in neural information processing systems, 2, 1989.
[28] Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Yitzhak Gadre, Hritik Bansal, Etash Kumar Guha, Sedrick Scott Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee F. Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, D... 2024
[29] Vladislav Lialin, Sherin Muckatira, Namrata Shivagunde, and Anna Rumshisky. ReloRA: High-rank training through low-rank updates. In The Twelfth International Conference on Learning Representations, 2024.
[30] Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. Advances in neural information processing systems, 30, 2017.
[31] Joel Lidin, Amir Sarfi, Erfan Miahi, Quentin Anthony, Shivam Chauhan, Evangelos Pappas, Benjamin Thérien, Eugene Belilovsky, and Samuel Dare. Covenant-72b: Pre-training a 72b llm with trustless peers over-the-internet. arXiv preprint arXiv:2603.08163, 2026.
[32] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and Bill Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. In International Conference on Learning Representations, 2018.
[33] Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, Mengnan Dong, Zheng Zhang, Yongsheng Kang, Hao Zhang, Xinran Xu, Yutao Zhang, Yuxin Wu, Xinyu Zhou, and Zhilin Yang. Muon is scalable for llm training, February 2025 ...
[34] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
[35] Charles H Martin and Michael W Mahoney. Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning. Journal of Machine Learning Research, 22(165):1–73, 2021.
[36] Zhanfeng Mo, Long-Kai Huang, and Sinno Jialin Pan. Parameter and memory efficient pretraining via low-rank riemannian optimization. In The Thirteenth International Conference on Learning Representations, 2025.
[37] Adel Nabli, Louis Fournier, Pierre Erbacher, Louis Serrano, Eugene Belilovsky, and Edouard Oyallon. Acco: Accumulate while you communicate for communication-overlapped sharded llm training. arXiv preprint arXiv:2406.02613, 2024.
[38] Adel Nabli, Louis Fournier, Pierre Erbacher, Louis Serrano, Eugene Belilovsky, and Edouard Oyallon. Acco: Accumulate while you communicate for communication-overlapped sharded llm training. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
[39] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
[40] Bowen Peng, Jeffrey Quesnelle, and Diederik P Kingma. Decoupled momentum optimization. arXiv preprint arXiv:2411.19870, 2024.
[41]

Zero: Memory optimiza- tions toward training trillion parameter models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimiza- tions toward training trillion parameter models. InSC20: international conference for high performance computing, networking, storage and analysis, pages 1–16. IEEE, 2020

2020
[42]

Subspace networks: Scaling decentralized training with communication-efficient model par- allelism

Sameera Ramasinghe, Thalaiyasingam Ajanthan, Gil Avraham, Yan Zuo, and Alexander Long. Subspace networks: Scaling decentralized training with communication-efficient model par- allelism. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025
[43]

PETRA: Parallel end-to-end training with reversible architectures

Stephane Rivaud, Louis Fournier, Thomas Pumir, Eugene Belilovsky, Michael Eickenberg, and Edouard Oyallon. PETRA: Parallel end-to-end training with reversible architectures. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[44]

Swarm parallelism: Training large models can be surprisingly communication-efficient

Max Ryabinin, Tim Dettmers, Michael Diskin, and Alexander Borzunov. Swarm parallelism: Training large models can be surprisingly communication-efficient. InInternational Conference on Machine Learning, pages 29416–29440. PMLR, 2023

2023
[45]

Towards crowdsourced training of large neural networks using decentralized mixture-of-experts.Advances in Neural Information Processing Systems, 33:3659–3672, 2020

Max Ryabinin and Anton Gusev. Towards crowdsourced training of large neural networks using decentralized mixture-of-experts.Advances in Neural Information Processing Systems, 33:3659–3672, 2020

2020
[46]

Communication efficient llm pre-training with sparseloco.arXiv preprint arXiv:2508.15706, 2025

Amir Sarfi, Benjamin Thérien, Joel Lidin, and Eugene Belilovsky. Communication efficient llm pre-training with sparseloco.arXiv preprint arXiv:2508.15706, 2025

work page arXiv 2025
[47]

Local sgd converges fast and communicates little

Sebastian U Stich. Local sgd converges fast and communicates little. InInternational Conference on Learning Representations, 2019

2019
[48]

Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

2024
[49]

1-bit adam: Communication efficient large-scale training with adam’s convergence speed

Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, and Yuxiong He. 1-bit adam: Communication efficient large-scale training with adam’s convergence speed. InInternational Conference on Machine Learning, pages 10118–10129. PMLR, 2021

2021
[50]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

2023
[51]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

2023
[52]

Powersgd: Practical low-rank gradient compression for distributed optimization

Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. Powersgd: Practical low-rank gradient compression for distributed optimization. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Proc...

2019
[53]

Powersgd: Practical low-rank gradient compression for distributed optimization

Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. Powersgd: Practical low-rank gradient compression for distributed optimization. Advances in Neural Information Processing Systems, 32, 2019

2019
[54]

Pufferfish: Communication-efficient models at no extra cost

Hongyi Wang, Saurabh Agarwal, and Dimitris Papailiopoulos. Pufferfish: Communication-efficient models at no extra cost. Proceedings of Machine Learning and Systems, 3:365–386, 2021

2021
[55]

Efficient distributed learning with sparsity

Jialei Wang, Mladen Kolar, Nathan Srebro, and Tong Zhang. Efficient distributed learning with sparsity. In International conference on machine learning, pages 3636–3645. PMLR, 2017

2017
[56]

Cocktailsgd: Fine-tuning foundation models over 500mbps networks

Jue Wang, Yucheng Lu, Binhang Yuan, Beidi Chen, Percy Liang, Christopher De Sa, Christopher Ré, and Ce Zhang. Cocktailsgd: Fine-tuning foundation models over 500mbps networks. In International Conference on Machine Learning, pages 36058–36076. PMLR, 2023

2023
[57]

Fine-tuning language models over slow networks using activation quantization with guarantees

Jue Wang, Binhang Yuan, Luka Rimanic, Yongjun He, Tri Dao, Beidi Chen, Christopher Ré, and Ce Zhang. Fine-tuning language models over slow networks using activation quantization with guarantees. Advances in Neural Information Processing Systems, 35:19215–19230, 2022

2022
[58]

Gradient sparsification for communication-efficient distributed optimization

Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficient distributed optimization. Advances in Neural Information Processing Systems, 31, 2018

2018
[59]

Building on efficient foundations: Effective training of llms with structured feedforward layers

Xiuying Wei, Skander Moalla, Razvan Pascanu, and Caglar Gulcehre. Building on efficient foundations: Effective training of llms with structured feedforward layers. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 4689–4717. Curran Associates, I...

2024
[60]

HuggingFace's Transformers: State-of-the-art Natural Language Processing

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019

2019
[61]

Error compensated quantized sgd and its applications to large-scale distributed optimization

Jiaxiang Wu, Weidong Huang, Junzhou Huang, and Tong Zhang. Error compensated quantized sgd and its applications to large-scale distributed optimization. In International conference on machine learning, pages 5325–5333. PMLR, 2018

2018
[62]

A spectral condition for feature learning

Greg Yang, James B Simon, and Jeremy Bernstein. A spectral condition for feature learning. arXiv preprint arXiv:2310.17813, 2023

2023
[63]

Manifold constrained steepest descent

Kaiwei Yang and Lexiao Lai. Manifold constrained steepest descent. arXiv preprint arXiv:2601.21487, 2026

2026
[64]

On compressing deep models by low rank and sparse decomposition

Xiyu Yu, Tongliang Liu, Xinchao Wang, and Dacheng Tao. On compressing deep models by low rank and sparse decomposition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7370–7379, 2017

2017
[65]

Visualizing and understanding convolutional networks

Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014

2014
[66]

Hellaswag: Can a machine really finish your sentence?

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 4791–4800, 2019

2019
[67]

Codebook transfer with part-of-speech for vector-quantized image modeling

Baoquan Zhang, Huaibin Wang, Chuyao Luo, Xutao Li, Guotao Liang, Yunming Ye, Xiaochen Qi, and Yao He. Codebook transfer with part-of-speech for vector-quantized image modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7757–7766, 2024

2024
[68]

Galore: Memory-efficient llm training by gradient low-rank projection

Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. Galore: Memory-efficient llm training by gradient low-rank projection, 2024

2024

A Background: Subspace networks

SSN [42] exploits a structural property of trained Transformers: the output projection weights $W^{\ell}_{2} \in \mathbb{R}^{d_{\mathrm{ff}} \times d}$ and attention projection weights $W^{\ell}_{1} \in \mathbb{R}$ ...
