In International Conference on Learning Representations

Fast catch-up, late switching: Optimal batch size scheduling via functional scaling laws · 2023 · arXiv 2310.00692

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

How Should LLMs Consume High-Quality Data? Optimal Data Scheduling via Quality-Aware Functional Scaling Laws

cs.LG · 2026-05-25 · unverdicted · novelty 6.0

Extends functional scaling laws with data quality to derive optimal joint scheduling, proposing Drop-Stable-Rampup that improves accuracy by +1.70 over WSD and +2.98 over cosine decay on a 15B MoE model.

Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models

cs.LG · 2026-05-26 · unverdicted · novelty 5.0

Scale vectors in Pre-Norm LLMs aid optimization via preconditioning on linear layers rather than expressivity, and three lightweight modifications to them reduce terminal loss across model scales.

citing papers explorer

How Should LLMs Consume High-Quality Data? Optimal Data Scheduling via Quality-Aware Functional Scaling Laws

cs.LG · 2026-05-25 · unverdicted · none · ref 2

Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models

cs.LG · 2026-05-26 · unverdicted · none · ref 41
