{"total":16,"items":[{"citing_arxiv_id":"2606.29158","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"On the Nonlinearity of Learning Rate Scaling for LLM Training","primary_cat":"cs.LG","submitted_at":"2026-06-28T02:42:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Optimal learning rate for models from 22M to 707M parameters shows nonlinear upward curvature with scale that disappears under effective learning rate and data-scale extrapolation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20299","ref_index":37,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Statistical Properties of Training & Generalization","primary_cat":"stat.ML","submitted_at":"2026-06-18T14:35:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":1.0,"formal_verification":"none","one_line_summary":"Review of neural scaling laws and their relation to constraints and inductive biases when applying machine learning to physics problems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.19025","ref_index":96,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FoMoE: Breaking the Full-Replica Barrier with a Federation of MoEs","primary_cat":"cs.LG","submitted_at":"2026-06-17T12:50:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FoMoE partitions expert layers across workers in MoE LLMs, skips non-resident experts, and reports up to 1.42x lower communication than baselines plus 1.4x throughput gains while maintaining stable 
routing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.18524","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"On the Residual Scaling of Looped Transformers: Stability and Transferability","primary_cat":"cs.LG","submitted_at":"2026-06-16T22:39:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Looped Transformers require residual scaling ε = 1/N due to correlated updates from weight sharing, unlike standard 1/sqrt(L), enabling learning rate transfer independent of loop count N.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06418","ref_index":103,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Double Preconditioning (DoPr): Optimization for Test-Time Performance, not Validation Loss","primary_cat":"cs.LG","submitted_at":"2026-06-04T17:22:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Double preconditioning (DoPr) improves downstream task performance in test-time feedback settings without consistent gains in validation loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26459","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MuCon: Clipped Muon Updates for LLM Training","primary_cat":"cs.LG","submitted_at":"2026-05-26T02:16:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MuCon defines a clipped-Muon update via singular-value clipping and derives two exact identities for approximating the clip without dense SVD, while noting numerical instability near the 
threshold.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22297","ref_index":2,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs","primary_cat":"cs.LG","submitted_at":"2026-05-21T10:46:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLR uses heavy-tailed self-regularization theory to set per-layer learning rates in Transformers, yielding faster convergence and higher zero-shot accuracy than uniform rates across model scales.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21486","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate","primary_cat":"cs.LG","submitted_at":"2026-05-20T17:59:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A framework quantifies hyperparameter transfer via scaling-law fit quality, extrapolation robustness, and loss penalty, with ablations showing that μP's advantage over standard parameterization stems from maximizing the embedding layer learning rate to avoid bottlenecks and instabilities in AdamW.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15290","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GQA-{\\mu}P: The maximal parameterization update for grouped query attention","primary_cat":"cs.LG","submitted_at":"2026-05-14T18:03:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Derives μP scalings for GQA via promoted spectral-norm 
definition of feature learning and a modified norm preserving scaling laws for non-full-rank matrices, with experiments showing learning-rate transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14200","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization","primary_cat":"cs.LG","submitted_at":"2026-05-13T23:32:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, and sparsity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13405","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When is Warmstarting Effective for Scaling Language Models?","primary_cat":"cs.LG","submitted_at":"2026-05-13T12:00:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A 2x growth factor in model warmstarting yields reliable training speedups for language models under 20 tokens/parameter budgets, with an empirical upper bound on effective growth factors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09165","ref_index":34,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sparse Layers are Critical to Scaling Looped Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-09T20:58:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Looped-MoE models scale 
better than dense looped or standard transformers because routing changes across loops, and they enable stronger compute-quality trade-offs via early exits at loop boundaries.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07870","ref_index":22,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Spectral Dynamics in Deep Networks: Feature Learning, Outlier Escape, and Learning Rate Transfer","primary_cat":"cond-mat.dis-nn","submitted_at":"2026-05-08T15:28:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A two-level DMFT tracks bulk and outlier spectral dynamics in wide networks, predicting width-consistent outlier growth and hyperparameter transfer under muP scaling for deep linear nets while noting bulk restructuring for large-output tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. 2005. 11 [21] Greg Yang, Edward J Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor programs v: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466, 2022. [22] Nolan Dey, Bin Claire Zhang, Lorenzo Noci, Mufan Li, Blake Bordelon, Shane Bergsma, Cengiz Pehlevan, Boris Hanin, and Joel Hestness. Don't be lazy: Completep enables compute-efficient deep transformers. arXiv preprint arXiv:2505.01618, 2025. [23] Chen Xing, Devansh Arpit, Christos Tsirigotis, and Yoshua Bengio. A walk with sgd, 2018. 
[24] Stanislaw Jastrzebski, Maciej Szymczak, Stanislav Fort, Devansh Arpit, Jacek Tabor,"},{"citing_arxiv_id":"2604.21691","ref_index":96,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"There Will Be a Scientific Theory of Deep Learning","primary_cat":"stat.ML","submitted_at":"2026-04-23T13:58:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.00541","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Spectral Condition for $\\mu$P under Width-Depth Scaling","primary_cat":"cs.LG","submitted_at":"2026-02-28T08:38:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A unified spectral condition for μP under width-depth scaling reveals a transition at k=1 vs k≥2 transformations per residual block and enables stable feature learning for practical architectures like Transformers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.04774","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Theory of Optimal Learning Rate Schedules and Scaling Laws for a Random Feature Model","primary_cat":"cond-mat.dis-nn","submitted_at":"2026-02-04T17:11:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"In a random feature model, optimal SGD learning-rate schedules are polynomial decay in the easy phase and warmup-stable-decay in the hard phase, 
outperforming constant or simple power-law schedules and transferring differently across training horizons.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}