Title resolution pending

Power Scheduler: A Batch Size, Token Number Agnostic Learning Rate Scheduler · arXiv 2408.13359

3 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

LoopMoE: Unifying Iterative Computation with Mixture-of-Experts for Language Modeling

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

LoopMoE is a looped MoE language model that outperforms matched vanilla MoE on 8 of 9 downstream benchmarks at 3B scale and continues to outperform at 9B scale under strictly controlled budgets.

Predictable Scaling Laws of Optimal Hyperparameters for LLM Continued Pre-training

cs.CL · 2026-06-04 · unverdicted · novelty 6.0

Optimal hyperparameters for LLM continued pre-training follow predictable scaling laws derived from proxy models, enabling a two-stage framework that predicts settings from compute budget and checkpoint state to reduce search overhead by 90%.

How to Allocate Your Tokens? Scaling Laws with Training Steps and Batch Size

cs.LG · 2026-07-01 · unverdicted · novelty 5.0

Proposes a three-term scaling law for model size, training steps and batch size that recovers optimal batch size scaling and can be fitted using fewer runs by incorporating suboptimal batch sizes.

citing papers explorer

LoopMoE: Unifying Iterative Computation with Mixture-of-Experts for Language Modeling · cs.LG · 2026-06-03 · unverdicted · none · ref 43
Predictable Scaling Laws of Optimal Hyperparameters for LLM Continued Pre-training · cs.CL · 2026-06-04 · unverdicted · none · ref 24
How to Allocate Your Tokens? Scaling Laws with Training Steps and Batch Size · cs.LG · 2026-07-01 · unverdicted · none · ref 22
