DASH: Faster Shampoo via Batched Block Preconditioning and Efficient Inverse-Root Solvers

Dan Alistarh; Erik Schultheis; Ionut-Vlad Modoranu; Mher Safaryan; Philip Zmushko

arxiv: 2602.02016 · v2 · pith:EC76XWZXnew · submitted 2026-02-02 · 💻 cs.LG


Ionut-Vlad Modoranu, Philip Zmushko, Erik Schultheis, Mher Safaryan, Dan Alistarh

classification 💻 cs.LG

keywords shampoo · faster · dash · distributed · first · implementation · iteration


Shampoo is one of the leading approximate second-order optimizers: a variant of it has won the MLCommons AlgoPerf competition, and it has been shown to produce models with fewer activation outliers that are easier to compress. Yet applying Shampoo currently comes at the cost of a significant computational slowdown, due to its expensive internal operations. In this paper, we take a significant step to address this shortcoming by proposing DASH (for Distributed Accelerated SHampoo), a faster implementation of Distributed Shampoo based on two main new techniques. First, we show that preconditioner blocks can be stacked into 3D tensors to significantly improve GPU utilization. Second, we introduce the Newton-DB iteration and Chebyshev polynomial approximations as novel and faster approaches for computing the inverse matrix roots required by Shampoo. Along with these algorithmic contributions, we provide a first in-depth analysis of how matrix scaling critically affects Shampoo convergence. On the practical side, our GPU-aware implementation achieves up to $5.6\times$ faster optimizer steps compared to the well-optimized Distributed Shampoo, while Newton-DB attains the lowest validation perplexity per iteration among all tested methods. Our code is available at https://github.com/IST-DASLab/DASH.
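
To make the two ideas in the abstract concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes that equally sized preconditioner blocks are stacked into one (B, n, n) array so that every matrix product is a single batched call, and it uses an inverse-free coupled Newton iteration in the Denman-Beavers style to approximate the inverse square root of each block. Whether this matches the paper's Newton-DB iteration exactly is an assumption, and the function name `batched_inv_sqrt`, the Frobenius-norm scaling, and the iteration count are illustrative choices. Shampoo variants may also need other roots (for example the inverse fourth root), which this sketch does not cover.

```python
import numpy as np

def batched_inv_sqrt(blocks, num_iters=30):
    """Approximate A^{-1/2} for each SPD matrix in a (B, n, n) stack.

    Hypothetical helper for illustration only; it is not taken from DASH.
    """
    n = blocks.shape[-1]
    eye = np.eye(n)
    # Scale every block by its own Frobenius norm so that its eigenvalues
    # lie in (0, 1], which the inverse-free iteration needs to converge.
    norms = np.linalg.norm(blocks, axis=(1, 2), keepdims=True)  # (B, 1, 1)
    Y = blocks / norms                               # tends to (A/c)^{1/2}
    Z = np.broadcast_to(eye, blocks.shape).copy()    # tends to (A/c)^{-1/2}
    for _ in range(num_iters):
        # One batched matmul updates the shared factor for all blocks.
        T = 0.5 * (3.0 * eye - Z @ Y)
        Y = Y @ T
        Z = T @ Z
    # Undo the scaling: A^{-1/2} = c^{-1/2} (A/c)^{-1/2}.
    return Z / np.sqrt(norms)

# Usage: four random, well-conditioned 8x8 SPD blocks stacked into one tensor.
rng = np.random.default_rng(0)
G = rng.standard_normal((4, 8, 16))
A = G @ G.transpose(0, 2, 1) / 16 + 0.1 * np.eye(8)
Z = batched_inv_sqrt(A)
residual = np.abs(Z @ A @ Z - np.eye(8)).max()
```

Because every block goes through the same sequence of matrix products, the loop body issues a few large batched calls instead of many small per-block ones, which is the kind of workload that keeps a GPU busy. The per-block scaling step is where the choice of norm matters: a looser bound on the largest eigenvalue still converges but needs more iterations.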


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Hierarchical Muon: Tiled Newton-Schulz Updates for Efficient Muon Optimization
math.NA 2026-06 unverdicted novelty 7.0

HiMuon partitions momentum-gradient matrices into T x T tiles, runs independent Newton-Schulz iterations on each tile, and reassembles the results, reducing leading cost to O(H W T K) while defining a local rather tha...