It is evident that the usual learning rate choice of SGD in the large batch setting will leave a significant …

The large weight-SG norm ratio implies that a small choice of learning rate would limit the optimization algorithm's solution space of w_t to a small neighborhood around the initialization state w_0.
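
The neighborhood argument in the excerpt above can be made concrete with a triangle-inequality bound. The following is a minimal sketch, assuming plain SGD with a constant learning rate and no momentum or weight decay, so that w_t = w_0 - lr * (g_0 + ... + g_{t-1}). The function name and all numbers are hypothetical and are not taken from the paper.

```python
def relative_displacement_bound(lr, grad_norms, w0_norm):
    """Upper bound on ||w_t - w_0|| / ||w_0|| for plain SGD.

    Plain SGD gives w_t = w_0 - lr * sum(g_k), so by the triangle
    inequality ||w_t - w_0|| <= lr * sum(||g_k||).
    """
    if w0_norm <= 0:
        raise ValueError("w0_norm must be positive")
    return lr * sum(grad_norms) / w0_norm


# Hypothetical numbers, not from the paper: weight norm 100,
# stochastic-gradient norm 1 at every step, learning rate 0.01, 1000 steps.
bound = relative_displacement_bound(lr=0.01, grad_norms=[1.0] * 1000, w0_norm=100.0)
print(f"relative displacement is at most {bound:.2f}")
```

Under these assumed values the weights cannot move far from w_0 in relative terms even after many steps, which is the sense in which a small learning rate combined with a large weight-to-gradient norm ratio confines the iterates.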

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Revisiting the Adam-SGD Gap in LLM Pre-Training: The Role of Large Effective Learning Rates

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

The Adam-SGD gap in large-batch LLM pre-training arises mainly from SGD's restricted effective learning rates caused by small gradients and output-layer spikes; clipping lets SGD recover nearly all of Adam's performance.

citing papers explorer

Showing 1 of 1 citing paper.

Revisiting the Adam-SGD Gap in LLM Pre-Training: The Role of Large Effective Learning Rates

cs.LG · 2026-05-18 · unverdicted · none · ref 16