Why Gradients Rapidly Increase Near the End of Training. arXiv preprint arXiv:2506.02285.
4 papers cite this work.
4 representative citing papers (2026):
- Broximal Alignment for Global Non-Convex Optimization
  Broximal Alignment is a novel condition under which the Ball Proximal Point Method converges to global minima in non-convex settings, generalizing quasiconvexity, star convexity, and related frameworks (the ball-constrained update is sketched after this list).
- Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
  Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA, while achieving comparable new-task performance.
- Demystifying Manifold Constraints in LLM Pre-training
  Manifold constraints via the new MACRO optimizer independently bound activation scales and enforce rotational equilibrium in LLM pre-training, subsuming RMS normalization and decoupled weight decay while delivering competitive performance with convergence guarantees (a generic norm-constraint projection is sketched after this list).
- Rethinking Language Model Scaling under Transferable Hypersphere Optimization
  HyperP transfers optimal learning rates across model width, depth, tokens, and MoE granularity under Frobenius-sphere constraints, delivering stable scaling and 1.58x efficiency gains (see the learning-rate transfer sketch after this list).
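On the first entry: the Ball Proximal Point Method replaces the usual quadratic proximal penalty with a hard ball constraint, minimizing the objective exactly over a ball around the current iterate. Below is a minimal sketch of that update with a generic radius schedule r_k; the Broximal Alignment condition itself is not reproduced here.

```latex
% Ball Proximal Point ("broximal") step: exact minimization of f over a
% ball of radius r_k centered at the current iterate x_k.
x_{k+1} \in \operatorname*{arg\,min}_{x \in \mathbb{B}(x_k,\, r_k)} f(x),
\qquad
\mathbb{B}(x_k, r_k) := \{\, x \in \mathbb{R}^d : \|x - x_k\| \le r_k \,\}.
```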
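On the MACRO entry: the paper's actual manifold construction is not reproduced here. As a generic illustration of the kind of constraint described, where weights live on a fixed-norm manifold so that decoupled weight decay becomes redundant, the PyTorch sketch below projects each weight matrix back onto a fixed Frobenius-norm sphere after an ordinary optimizer step. The function name, target norm, and projection rule are assumptions for illustration, not MACRO itself.

```python
import torch

@torch.no_grad()
def project_to_frobenius_sphere(model: torch.nn.Module, target_norm: float = 1.0) -> None:
    """Hypothetical post-step projection (not MACRO itself): rescale each
    matrix-shaped parameter so its Frobenius norm equals target_norm,
    keeping the weights on a fixed sphere throughout training."""
    for p in model.parameters():
        if p.ndim == 2:  # only constrain weight matrices, not biases/gains
            p.mul_(target_norm / (p.norm() + 1e-12))

# Usage: one ordinary step, then the projection restores the constraint.
model = torch.nn.Linear(64, 64, bias=False)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.0)  # decay subsumed by the constraint
loss = model(torch.randn(8, 64)).pow(2).mean()
loss.backward()
opt.step()
project_to_frobenius_sphere(model)
print(f"post-projection Frobenius norm: {model.weight.norm():.4f}")  # ~1.0
```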
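On the HyperP entry: its transfer rule across width, depth, tokens, and MoE granularity is likewise not reproduced here. As a purely illustrative stand-in showing the general shape of such a rule, the sketch below uses the well-known muP-style 1/width scaling of hidden-layer learning rates.

```python
def transfer_lr(base_lr: float, base_width: int, width: int) -> float:
    """Illustrative muP-style rule (not HyperP): tune base_lr at a small
    proxy width, then scale the hidden-layer learning rate by 1/width."""
    return base_lr * base_width / width

# Example: a rate tuned at width 256, transferred to a width-4096 model.
print(transfer_lr(1e-3, 256, 4096))  # 6.25e-05
```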