
arxiv: 2506.16659 · v3 · pith:5GLYRQWInew · submitted 2025-06-20 · 💻 cs.LG · cs.AI · math.OC

Memory-Efficient LLM Pretraining via Minimalist Optimizer Design

classification 💻 cs.LG · cs.AI · math.OC
keywords memory · memory-efficient · pretraining · apollo · gradient · momentum · performance · scale

Training large language models (LLMs) relies on adaptive optimizers such as Adam, which introduce extra operations and require significantly more memory than SGD to maintain first- and second-order moments. While recent works such as GaLore, Fira and APOLLO have proposed state-compressed, memory-efficient variants, a fundamental question remains: what are the minimum modifications to plain SGD needed to match state-of-the-art pretraining performance? We systematically investigate this question using a bottom-up approach, and identify two simple yet highly memory- and compute-efficient techniques: (1) column-wise gradient normalization (normalizing the gradient along the output dimension), which boosts SGD performance without momentum; and (2) applying first-order momentum only to the output layer, where gradient variance is highest. Combining these two techniques leads to SCALE (Stochastic Column-normAlized Last-layer momEntum), a simple optimizer for memory-efficient pretraining. Across multiple models (60M-1B), SCALE matches or exceeds the performance of Adam while using only 35-45% of the total memory. It also consistently outperforms memory-efficient optimizers such as GaLore, Fira and APOLLO, making it a strong candidate for large-scale pretraining under memory constraints. For LLaMA 7B, SCALE outperforms the state-of-the-art memory-efficient methods APOLLO and Muon in both perplexity and memory consumption.
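
The two techniques named in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the abstract does not specify the weight layout, the normalization details, or how momentum and normalization are combined. The sketch assumes each gradient is a 2-D list of rows whose columns are each divided by their own L2 norm, and that the last layer normalizes an exponential moving average of its gradient. The names `column_normalize`, `scale_step`, `last_momentum`, `last_layer`, `lr`, `beta` and `eps` are all hypothetical.

```python
import math


def column_normalize(grad, eps=1e-8):
    """Divide each column of a 2-D gradient (list of rows) by its L2 norm.

    Assumption: which axis counts as a "column" depends on the weight
    layout, which the abstract does not state.
    """
    n_rows, n_cols = len(grad), len(grad[0])
    norms = [
        math.sqrt(sum(grad[i][j] ** 2 for i in range(n_rows)))
        for j in range(n_cols)
    ]
    return [
        [grad[i][j] / (norms[j] + eps) for j in range(n_cols)]
        for i in range(n_rows)
    ]


def scale_step(params, grads, last_momentum, last_layer, lr=0.01, beta=0.9):
    """One illustrative update step.

    Only the layer named `last_layer` keeps a momentum buffer; every other
    layer is stateless. Normalizing the momentum buffer (rather than the
    raw gradient) for the last layer is an assumption of this sketch.
    """
    new_params = {}
    for name, weights in params.items():
        grad = grads[name]
        if name == last_layer:
            # Exponential moving average of the last-layer gradient.
            last_momentum = [
                [beta * m + (1.0 - beta) * g for m, g in zip(m_row, g_row)]
                for m_row, g_row in zip(last_momentum, grad)
            ]
            direction = column_normalize(last_momentum)
        else:
            # No optimizer state: normalize the raw gradient.
            direction = column_normalize(grad)
        new_params[name] = [
            [w - lr * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weights, direction)
        ]
    return new_params, last_momentum
```

In this sketch the only persistent optimizer state is `last_momentum`, a single buffer the size of the output layer, whereas Adam keeps two buffers for every parameter. That difference is the intuition behind the memory savings the abstract reports, though the exact figures depend on details not covered here.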



Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

    math.OC 2026-05 conditional novelty 7.0

    Proposes equivariant optimizers matched to the symmetry groups of embeddings, SwiGLU projections and MoE routers, with experiments showing consistent gains over AdamW on language model pre-training.

  2. Budget-aware Auto Optimizer Configurator

    cs.AI 2026-05 unverdicted novelty 6.0

    BAOC samples gradient streams to compute per-block risk metrics for cheap optimizer configs then solves a constrained optimization to minimize total risk under memory and time budgets while preserving training quality.

  3. Demystifying Manifold Constraints in LLM Pre-training

    cs.LG 2026-05 unverdicted novelty 6.0

    Manifold constraints via the new MACRO optimizer independently bound activation scales and enforce rotational equilibrium in LLM pre-training, subsuming RMS normalization and decoupled weight decay while delivering co...

  4. MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

    cs.LG 2026-03 unverdicted novelty 6.0

    MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.

  5. Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training

    cs.LG 2025-09 unverdicted novelty 5.0

    Proposes low-rank orthogonalization and derives low-rank Muon and MSGD variants that outperform standard Muon on GPT-2 and LLaMA pretraining while providing iteration complexity bounds.