Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
28 Pith papers cite this work.
abstract
Training large deep neural networks on massive datasets is computationally very challenging. There has been a recent surge in interest in using large batch stochastic optimization methods to tackle this issue. The most prominent algorithm in this line of research is LARS, which by employing layerwise adaptive learning rates trains ResNet on ImageNet in a few minutes. However, LARS performs poorly for attention models like BERT, indicating that its performance gains are not consistent across tasks. In this paper, we first study a principled layerwise adaptation strategy to accelerate training of deep neural networks using large mini-batches. Using this strategy, we develop a new layerwise adaptive large batch optimization technique called LAMB; we then provide convergence analysis of LAMB as well as LARS, showing convergence to a stationary point in general nonconvex settings. Our empirical results demonstrate the superior performance of LAMB across various tasks such as BERT and ResNet-50 training with very little hyperparameter tuning. In particular, for BERT training, our optimizer enables use of very large batch sizes of 32768 without any degradation of performance. By increasing the batch size to the memory limit of a TPUv3 Pod, BERT training time can be reduced from 3 days to just 76 minutes (Table 1). The LAMB implementation is available at https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/lamb.py
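To make the "layerwise adaptation" idea in the abstract concrete, here is a minimal sketch of a LAMB-style update for one layer, written in plain Python. It is an illustration, not the linked TensorFlow Addons implementation: the function name `lamb_step`, the default hyperparameters, and the choice of the identity as the norm scaling function are all assumptions made for this example.

```python
import math


def lamb_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    """One LAMB-style update for a single layer.

    w, g, m, v are flat lists of floats (weights, gradient, first and
    second moment estimates); t is the 1-indexed step count.
    Hypothetical helper for illustration only.
    """
    # Adam-style exponential moving averages of the gradient and its square.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]

    # Bias correction for the early steps.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]

    # Adam direction plus weight decay, computed per element.
    update = [mh / (math.sqrt(vh) + eps) + weight_decay * wi
              for mh, vh, wi in zip(m_hat, v_hat, w)]

    # Layerwise trust ratio: weight norm over update norm.
    # Assumption: the norm scaling function is the identity; fall back
    # to 1.0 when either norm is zero.
    w_norm = math.sqrt(sum(x * x for x in w))
    u_norm = math.sqrt(sum(x * x for x in update))
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0

    new_w = [wi - lr * trust * ui for wi, ui in zip(w, update)]
    return new_w, m, v


# Usage: one step on a two-parameter "layer".
weights = [3.0, 4.0]
new_weights, m_state, v_state = lamb_step(
    weights, g=[1.0, 1.0], m=[0.0, 0.0], v=[0.0, 0.0], t=1,
    lr=0.01, weight_decay=0.0)
```

With the trust ratio applied, the length of the step taken by a layer is `lr` times the norm of that layer's weights, largely independent of the gradient's scale. This per-layer normalization is what distinguishes the sketch from a plain Adam step.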
citing papers explorer
- Where Pretraining writes and Alignment reads: the asymmetry of Transformer weight space
  Pretraining and alignment induce asymmetric geometric traces in transformer weights because alignment updates concentrate in read pathways due to activation covariance while write pathways inherit less structure from alignment losses.
- Learning PDEs for Portfolio Optimization with Quantum Physics-Informed Neural Networks
  Quantum PINNs using tensor-rank polynomials solve the Merton portfolio optimization PDE more accurately and with far fewer parameters than classical neural networks.
- Training Deep Learning Models with Norm-Constrained LMOs
  Scion is a new stochastic LMO-based optimizer family that unifies existing methods, supports unconstrained problems, and delivers hyperparameter transferability plus speedups on nanoGPT training.
- OPT: Open Pre-trained Transformer Language Models
  OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
  ZeRO removes memory redundancies in parallel training to scale deep learning models to over a trillion parameters with high throughput on current hardware.
- ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
  ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
  Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.
- TextTeacher: What Can Language Teach About Images?
  TextTeacher uses frozen text embeddings from captions as semantic anchors to guide vision model training, improving ImageNet accuracy by up to 2.7 p.p. and transfer performance by 1.0 p.p. on average.
- STELLAR: Scaling 3D Perception Large Models for Autonomous Driving
  STELLAR trains up to 500M-parameter multi-modal models on 50M driving scenes and reports empirical scaling trends plus new state-of-the-art results on the Waymo Open Dataset.
- ShardTensor: Domain Parallelism for Scientific Machine Learning
  ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.
- OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling
  OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon’s orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training loss than Muon+Moonlight and AdamW.
- Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
  Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.
- Foundation Models for Discovery and Exploration in Chemical Space
  MIST models up to 10x larger than prior work, fine-tuned on over 400 structure-property tasks, match or exceed SOTA on benchmarks and demonstrate zero-shot olfactory perception mapping consistent with hyperbolic geometry.
- Adaptive Memory Momentum via a Model-Based Framework for Deep Learning Optimization
  Presents a model-based proximal framework for adaptive momentum in first-order optimizers by using a two-plane approximation of the objective to dynamically set the memory coefficient online.
- PLD: A Choice-Theoretic List-Wise Knowledge Distillation
  PLD recasts knowledge distillation as a weighted list-wise ranking loss under the Plackett-Luce model that optimizes a teacher-optimal class ranking and subsumes weighted cross-entropy.
- Spectral-Adaptive Modulation Networks for Visual Perception
  SPANetV2 is a vision backbone built around a new spectral-adaptive modulation mixer that outperforms prior models on ImageNet-1K classification, COCO detection, and ADE20K segmentation.
- Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading
  Deep Optimizer States splits LLMs into subgroups and uses a performance model to schedule optimizer updates on CPU or GPU, achieving 2.5x faster iterations than prior offloading methods when integrated with DeepSpeed.
- Demystifying CLIP Data
  MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.
- OrderDP: A Theoretically Guaranteed Lossless Dynamic Data Pruning Framework
  OrderDP is a plug-and-play data pruning method that selects a random subset then top-q samples to guarantee unbiased surrogate-loss training with convergence analysis and over 40% training cost reduction on CIFAR and ImageNet.
- AdaMeZO: Adam-style Zeroth-Order Optimizer for LLM Fine-tuning Without Maintaining the Moments
  AdaMeZO adapts Adam moment estimates to zeroth-order LLM fine-tuning without extra memory storage, outperforming MeZO with up to 70% fewer forward passes.
- On the Convergence Analysis of Muon
  Convergence analysis shows Muon outperforms gradient descent by exploiting low-rank structure in neural network Hessians.
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
  With better hyperparameters, more data, and longer training, an unchanged BERT-Large architecture matches or exceeds XLNet and other successors on GLUE, SQuAD, and RACE.
- Scene Reconstruction as Mapping Priors for 3D Detection
  Automatically constructed mapping priors from sensor aggregation are integrated via the MPA3D framework to achieve state-of-the-art 3D detection results on the Waymo Open Dataset.
- MUON+: Towards More Effective Muon via One Additional Normalization Step for LLM Pre-training
  Muon+ adds one normalization step after polar orthogonalization in the Muon optimizer, yielding lower training and validation perplexity and faster pre-training across 60M-7B models.
- One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs
- Scalable Reinforcement Learning via Adaptive Batch Scaling
- On the Stability of Growth in Structural Plasticity
- A Physics-Inspired Optimizer: Velocity Regularized Adam