Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Aapo Kyrola, Lukasz Wesolowski, Pieter Noordhuis, Piotr Dollár, Priya Goyal, Ross Girshick · 2017 · cs.CV · arXiv 1706.02677

Canonical reference. 82% of citing Pith papers cite this work as background.

108 Pith papers citing it

Background 82% of classified citations

abstract

Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker workload must be large, which implies nontrivial growth in the SGD minibatch size. In this paper, we empirically show that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization. Specifically, we show no loss of accuracy when training with large minibatch sizes up to 8192 images. To achieve this result, we adopt a hyper-parameter-free linear scaling rule for adjusting learning rates as a function of minibatch size and develop a new warmup scheme that overcomes optimization challenges early in training. With these simple techniques, our Caffe2-based system trains ResNet-50 with a minibatch size of 8192 on 256 GPUs in one hour, while matching small minibatch accuracy. Using commodity hardware, our implementation achieves ~90% scaling efficiency when moving from 8 to 256 GPUs. Our findings enable training visual recognition models on internet-scale data with high efficiency.
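The two techniques named in the abstract, the linear scaling rule and learning-rate warmup, can be sketched roughly as follows. This is a minimal illustration, not the authors' Caffe2 implementation: the function names, the reference setting (learning rate 0.1 at minibatch size 256), the linear ramp shape, and the 1000-step warmup length are all assumptions chosen for the example. Only the minibatch size of 8192 and the 256 GPUs come from the abstract.

```python
def linear_scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
    """Scale the learning rate in proportion to the minibatch size."""
    return base_lr * batch / base_batch


def warmup_lr(step: int, warmup_steps: int, start_lr: float, target_lr: float) -> float:
    """Ramp linearly from start_lr to target_lr over warmup_steps, then hold."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return target_lr
    return start_lr + (target_lr - start_lr) * step / warmup_steps


# Figures from the abstract: 8192 images spread over 256 GPUs.
total_batch, num_gpus = 8192, 256
per_gpu_batch = total_batch // num_gpus  # 32 images per GPU

# Hypothetical reference setting (assumed, not stated in the abstract).
base_lr, base_batch = 0.1, 256
target_lr = linear_scaled_lr(base_lr, base_batch, total_batch)

# Hypothetical warmup of 1000 steps, starting from the reference rate.
schedule = [warmup_lr(s, 1000, base_lr, target_lr) for s in (0, 500, 1000)]
print(per_gpu_batch, target_lr, schedule)
```

Starting the ramp at the small-minibatch rate and ending at the scaled rate keeps early updates conservative while the network is changing rapidly, which is the motivation the abstract gives for warmup.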

citation-role summary

background 15 · method 2

citation-polarity summary

background 14 · use method 2 · support 1

authors

Aapo Kyrola, Lukasz Wesolowski, Pieter Noordhuis, Piotr Dollár, Priya Goyal, Ross Girshick

representative citing papers

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

cs.CL · 2023-04-03 · accept · novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

Masked Autoencoders Are Scalable Vision Learners

cs.CV · 2021-11-11 · accept · novelty 8.0

Masked autoencoders with asymmetric encoder-decoder and 75% masking ratio enable scalable self-supervised pre-training of vision transformers, achieving 87.8% ImageNet-1K accuracy with ViT-Huge using only unlabeled data.

Emerging Properties in Self-Supervised Vision Transformers

cs.CV · 2021-04-29 · conditional · novelty 8.0

Self-supervised ViTs show emergent semantic segmentation and 78.3% k-NN accuracy on ImageNet; DINO reaches 80.1% linear evaluation with ViT-Base.

TallyTrain: Communication-Efficient Federated Distillation

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

TallyTrain is a hard-label distillation protocol for federated learning that uses argmax transmission and optional sparse merges to match soft-label performance at up to 1000x lower communication cost.

Knowledge Cascade: Reverse Knowledge Distillation on Nonparametric Multivariate Functional Estimation

stat.ME · 2026-06-24 · unverdicted · novelty 7.0

KCas transfers student-selected smoothing parameters to full-sample teacher models via asymptotic scaling laws in smoothing splines and kernel methods, cutting computation while retaining performance guarantees.

Mirror Descent Beyond Euclidean Stability: An Exponential Separation in Initialization Sensitivity

cs.LG · 2026-06-09 · conditional · novelty 7.0

Non-quadratic Mirror Descent exhibits exponential initialization sensitivity in convex settings, shown via 3D constructions and KL-regularized simplex examples, with Bregman anchoring proposed for stabilization.

A Hybrid Generative Reduced-Order Model for the Minimal Flow Unit

physics.flu-dyn · 2026-06-08 · unverdicted · novelty 7.0

A β-VAE-GAN plus sensor-conditioned Transformer with Easy Attention forecasts near-wall turbulence in the Minimal Flow Unit, recovering 87% turbulent kinetic energy in 4D latent space and maintaining accuracy over 17288 t+ from 128 t+ initialization while reconstructing 82% TKE end-to-end.

Demystifying Pipeline Parallelism: First Theory for PipeDream

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

Introduces Randomized PipeDream abstraction yielding first nonconvex convergence bound for PipeDream and proves delay scales as S squared for S stages.

From One-Pass SGD to Data Reuse: Mini-Batch Scaling Laws in Sketched Linear Regression

cs.LG · 2026-05-23 · unverdicted · novelty 7.0

Derives mini-batch scaling laws for sketched linear regression, with shared approximation terms and protocol-specific variance/fluctuation scalings under power-law spectrum and source condition.

LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

LOSCAR-SGD combines local updates, sparse model averaging, and communication-computation overlap with a delay-corrected merge rule, providing convergence rates for smooth non-convex objectives under worker heterogeneity.

Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

Ringmaster LMO extends delay-thresholding from ASGD to LMO-based momentum updates, providing convergence guarantees under (L0, L1)-smoothness and time-complexity bounds that recover optimal rates in the Euclidean case.

Scalable Distributed Stochastic Optimization via Bidirectional Compression: Beyond Pessimistic Limits

math.OC · 2026-05-08 · unverdicted · novelty 7.0

Inkheart SGD and M4 use bidirectional compression to achieve time complexities in distributed SGD that improve with worker count n and surpass prior lower bounds under a necessary structural assumption.

TENNOR: Trustworthy Execution for Neural Networks through Obliviousness and Retrievals

cs.CR · 2026-05-08 · unverdicted · novelty 7.0

TENNOR enables efficient private training of wide neural networks in TEEs by recasting sparsification as doubly oblivious LSH retrievals and introducing MP-WTA to cut hash table memory by 50x while preserving accuracy.

A Provably Robust Multi-Jet Framework applied to Active Flow Control of an Airfoil in Weakly Compressible Flow

physics.flu-dyn · 2026-04-29 · unverdicted · novelty 7.0

A new injective multi-jet framework for RL flow control provides jet-count-independent running cost upper bounds and enables superior coordinated jet strategies, achieving drag suppression beyond symmetric ideals on cylinders and aerodynamic efficiency gains from 53% to 73% on airfoils.

Multimodal Reasoning with LLM for Encrypted Traffic Interpretation: A Benchmark

cs.CR · 2026-04-09 · unverdicted · novelty 7.0

Creates the BGTD benchmark and mmTraffic architecture to enable explainable multimodal interpretation of encrypted network traffic using LLMs.

Fusion and Alignment Enhancement with Large Language Models for Tail-item Sequential Recommendation

cs.IR · 2026-04-04 · unverdicted · novelty 7.0

FAERec fuses collaborative ID embeddings with LLM semantic embeddings using adaptive gating and dual-level alignment to enhance tail-item sequential recommendations.

Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods

cs.DC · 2026-04-02 · unverdicted · novelty 7.0

Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.

FedBCD: Communication-Efficient Accelerated Block Coordinate Gradient Descent for Federated Learning

cs.LG · 2026-03-05 · unverdicted · novelty 7.0

FedBCGD reduces communication in federated learning by a factor of 1/N through block-wise parameter updates with accelerated convergence guarantees.

Efficient GPU-Accelerated Training of a Neuroevolution Potential with Analytical Gradients

cond-mat.dis-nn · 2025-07-01 · conditional · novelty 7.0

GNEP trains neuroevolution potentials with analytical gradients and Adam optimizer, cutting fitting time by orders of magnitude for Sb-Te systems while matching DFT accuracy on equation of state and radial distribution functions.

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

cs.LG · 2025-02-07 · unverdicted · novelty 7.0

A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

Mini-batch Estimation for Deep Cox Models: Statistical Foundations and Practical Guidance

stat.ML · 2024-08-05 · unverdicted · novelty 7.0

Mini-batch SGD optimizes a different objective than full partial likelihood in Cox models, but the resulting mb-MPLE is still consistent with optimal rates for neural nets and asymptotic normality for linear models.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

cs.CL · 2024-05-07 · unverdicted · novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

Segment Anything

cs.CV · 2023-04-05 · unverdicted · novelty 7.0

A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its dataset.

Scalable Diffusion Models with Transformers

cs.CV · 2022-12-19 · unverdicted · novelty 7.0

DiTs achieve SOTA FID of 2.27 on ImageNet 256x256 by scaling transformer-based latent diffusion models, with performance improving consistently as Gflops increase.

citing papers explorer

Showing 50 of 63 citing papers after filters.

TallyTrain: Communication-Efficient Federated Distillation cs.LG · 2026-06-30 · unverdicted · none · ref 43 · internal anchor
TallyTrain is a hard-label distillation protocol for federated learning that uses argmax transmission and optional sparse merges to match soft-label performance at up to 1000x lower communication cost.
Knowledge Cascade: Reverse Knowledge Distillation on Nonparametric Multivariate Functional Estimation stat.ME · 2026-06-24 · unverdicted · none · ref 5 · internal anchor
KCas transfers student-selected smoothing parameters to full-sample teacher models via asymptotic scaling laws in smoothing splines and kernel methods, cutting computation while retaining performance guarantees.
Mirror Descent Beyond Euclidean Stability: An Exponential Separation in Initialization Sensitivity cs.LG · 2026-06-09 · conditional · none · ref 45 · internal anchor
Non-quadratic Mirror Descent exhibits exponential initialization sensitivity in convex settings, shown via 3D constructions and KL-regularized simplex examples, with Bregman anchoring proposed for stabilization.
A Hybrid Generative Reduced-Order Model for the Minimal Flow Unit physics.flu-dyn · 2026-06-08 · unverdicted · none · ref 64 · internal anchor
A β-VAE-GAN plus sensor-conditioned Transformer with Easy Attention forecasts near-wall turbulence in the Minimal Flow Unit, recovering 87% turbulent kinetic energy in 4D latent space and maintaining accuracy over 17288 t+ from 128 t+ initialization while reconstructing 82% TKE end-to-end.
Demystifying Pipeline Parallelism: First Theory for PipeDream cs.LG · 2026-06-02 · unverdicted · none · ref 4 · internal anchor
Introduces Randomized PipeDream abstraction yielding first nonconvex convergence bound for PipeDream and proves delay scales as S squared for S stages.
From One-Pass SGD to Data Reuse: Mini-Batch Scaling Laws in Sketched Linear Regression cs.LG · 2026-05-23 · unverdicted · none · ref 3 · internal anchor
Derives mini-batch scaling laws for sketched linear regression, with shared approximation terms and protocol-specific variance/fluctuation scalings under power-law spectrum and source condition.
LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging cs.LG · 2026-05-20 · unverdicted · none · ref 52 · internal anchor
LOSCAR-SGD combines local updates, sparse model averaging, and communication-computation overlap with a delay-corrected merge rule, providing convergence rates for smooth non-convex objectives under worker heterogeneity.
Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method cs.LG · 2026-05-18 · unverdicted · none · ref 50 · internal anchor
Ringmaster LMO extends delay-thresholding from ASGD to LMO-based momentum updates, providing convergence guarantees under (L0, L1)-smoothness and time-complexity bounds that recover optimal rates in the Euclidean case.
Scalable Distributed Stochastic Optimization via Bidirectional Compression: Beyond Pessimistic Limits math.OC · 2026-05-08 · unverdicted · none · ref 110 · internal anchor
Inkheart SGD and M4 use bidirectional compression to achieve time complexities in distributed SGD that improve with worker count n and surpass prior lower bounds under a necessary structural assumption.
TENNOR: Trustworthy Execution for Neural Networks through Obliviousness and Retrievals cs.CR · 2026-05-08 · unverdicted · none · ref 40 · internal anchor
TENNOR enables efficient private training of wide neural networks in TEEs by recasting sparsification as doubly oblivious LSH retrievals and introducing MP-WTA to cut hash table memory by 50x while preserving accuracy.
A Provably Robust Multi-Jet Framework applied to Active Flow Control of an Airfoil in Weakly Compressible Flow physics.flu-dyn · 2026-04-29 · unverdicted · none · ref 61 · internal anchor
A new injective multi-jet framework for RL flow control provides jet-count-independent running cost upper bounds and enables superior coordinated jet strategies, achieving drag suppression beyond symmetric ideals on cylinders and aerodynamic efficiency gains from 53% to 73% on airfoils.
Multimodal Reasoning with LLM for Encrypted Traffic Interpretation: A Benchmark cs.CR · 2026-04-09 · unverdicted · none · ref 12 · internal anchor
Creates the BGTD benchmark and mmTraffic architecture to enable explainable multimodal interpretation of encrypted network traffic using LLMs.
Fusion and Alignment Enhancement with Large Language Models for Tail-item Sequential Recommendation cs.IR · 2026-04-04 · unverdicted · none · ref 16 · internal anchor
FAERec fuses collaborative ID embeddings with LLM semantic embeddings using adaptive gating and dual-level alignment to enhance tail-item sequential recommendations.
Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods cs.DC · 2026-04-02 · unverdicted · none · ref 44 · internal anchor
Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.
FedBCD: Communication-Efficient Accelerated Block Coordinate Gradient Descent for Federated Learning cs.LG · 2026-03-05 · unverdicted · none · ref 13 · internal anchor
FedBCGD reduces communication in federated learning by a factor of 1/N through block-wise parameter updates with accelerated convergence guarantees.
Convergence of Continual Learning in Homogeneous Deep Networks cs.LG · 2026-06-29 · unverdicted · none · ref 57 · internal anchor
Continual classification in homogeneous models is sequential projections onto margin sets, with local linear convergence under regularity properties for random and cyclic tasks, extended to regression.
ScaleAware-JEPA: Latent Representation for Discovery in Multiscale Physical Fields cs.LG · 2026-06-29 · unverdicted · none · ref 43 · internal anchor
ScaleAware-JEPA combines Constrained Diffusion Decomposition with a scale-tied JEPA objective to learn label-free latent coordinates that recover coherent morphology in multiscale fields such as MHD turbulence and interstellar gas.
Spectral phase transitions and trainability in neural network learning dynamics cond-mat.dis-nn · 2026-06-26 · unverdicted · none · ref 60 · internal anchor
SGD on neural network weights induces a BBP phase transition that detaches signal eigenvalues from the random bulk, yielding an analytically solvable phase diagram for trainability in a linear teacher-student model.
PixelU: A U-Shaped Transformer for Efficient End-to-End Pixel Diffusion cs.CV · 2026-06-26 · unverdicted · none · ref 10 · internal anchor
PixelU is a minimalist U-shaped Diffusion Transformer for pixel-space diffusion that decouples frequencies with zero-cost skip connections and constant-channel downsampling, outperforming baselines like JiT-G at 1/3 the compute cost with FID 1.63 on ImageNet 256x256.
Improving Neural Network Training by Decoupling the Magnitude and Direction of Weight Vectors cs.LG · 2026-06-24 · unverdicted · none · ref 154 · internal anchor
MD Decoupling factorizes weights into fixed-norm directions and learnable per-row/column magnitudes updated at independent rates, improving Adam and Muon training stability and scale transfer without weight decay or warmup.
A Benchmark for Heterogeneous Stereo Deblurring with Physically- and Epipolar-constrained Cross Attention cs.CV · 2026-06-24 · unverdicted · none · ref 5 · internal anchor
Presents HSD dataset and PECA module for heterogeneous stereo deblurring, improving CNN/Transformer/NAFNet baselines via constrained cross attention.
TL++: Accuracy and Privacy Preserving Traversal Learning for Distributed Intelligent Systems cs.LG · 2026-06-24 · unverdicted · none · ref 18 · internal anchor
TL++ recovers centralized mini-batch gradients via virtual batches in split learning and adds secret sharing for cut-layer tensors, achieving 91.41% accuracy on CIFAR-10 with 13x lower communication than full-model sync.
Field-level weak lensing cosmology with $<100$ simulations using multifidelity simulation-based inference astro-ph.CO · 2026-06-22 · unverdicted · none · ref 199 · internal anchor
Multifidelity simulation-based inference enables accurate field-level weak lensing cosmology with 60-100 high-fidelity N-body simulations via pre-training on log-normal mocks.
Local MixVR: Breaking the Communication-Sample Dependence in Distributed Learning cs.LG · 2026-05-31 · unverdicted · none · ref 4 · internal anchor
Local MixVR achieves communication complexity scaling only with number of workers M, independent of total samples N, and outperforms Minibatch Accelerated SGD when M is smaller than order N to the 1/4.
Accurate Large-sample Uncertainty Quantification using Stochastic Gradient Markov Chain Monte Carlo cs.LG · 2026-05-29 · unverdicted · none · ref 18 · internal anchor
New discrete-time approximations to SG(L)D enable accurate non-asymptotic predictions of covariance and integrated autocorrelation time for practical tuning in large-batch or misspecified regimes.
Towards Efficient LLMs Annealing with Principled Sample Selection cs.CL · 2026-05-29 · unverdicted · none · ref 7 · internal anchor
DiReCT reformulates LLM annealing sample selection as a constrained optimization problem that enforces per-sample gradient directions aligned with the loss landscape's curvature.
LionMuon: Alternating Spectral and Sign Descent for Efficient Training cs.LG · 2026-05-19 · unverdicted · none · ref 4 · 2 links · internal anchor
LionMuon alternates Lion and Muon steps with shared dual-EMA buffer to Pareto-dominate existing optimizers in loss and compute on models up to 720M parameters.
Neural Collapse by Design: Learning Class Prototypes on the Hypersphere cs.LG · 2026-05-19 · unverdicted · none · ref 76 · 2 links · internal anchor
Supervised classification reaches neural collapse by design via normalized prototype losses on the hypersphere, outperforming CE and SCL on ImageNet-1K and other benchmarks with faster convergence and better transfer.
Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity cs.LG · 2026-05-13 · unverdicted · none · ref 169 · internal anchor
Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and bounded heterogeneity.
Hypernetworks for Dynamic Feature Selection cs.LG · 2026-05-12 · unverdicted · none · ref 19 · internal anchor
Hyper-DFS uses hypernetworks and Set Transformers to generate on-demand parameters for feature subsets in dynamic selection, outperforming prior methods on tabular data and showing stronger zero-shot generalization.
OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling cs.LG · 2026-05-08 · unverdicted · none · ref 5 · internal anchor
OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon’s orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training loss than Muon+Moonlight and AdamW.
Parameter-Efficient Adaptation of Pre-Trained Vision Foundation Models for Active and Passive Seismic Data Denoising physics.geo-ph · 2026-04-30 · conditional · none · ref 49 · internal anchor
Adapting vision foundation models with LoRA and kurtosis-guided unsupervised test-time adaptation matches or exceeds domain-specific models for seismic denoising across multiple sites and unseen data.
Dynamic Cluster Data Sampling for Efficient and Long-Tail-Aware Vision-Language Pre-training cs.CV · 2026-04-30 · unverdicted · none · ref 48 · internal anchor
DynamiCS dynamically scales semantic clusters per training epoch to reduce VLM pre-training compute while improving accuracy on long-tail concepts compared to static or flattening baselines.
COPUS: Co-adaptive Parallelism and Batch Size Selection in Large Language Model Training cs.DC · 2026-04-29 · unverdicted · none · ref 9 · internal anchor
COPUS co-adapts batch size and parallelism during LLM training via goodput to deliver 3.9-8% average faster convergence than fixing one while tuning the other.
CommFuse: Hiding Tail Latency via Communication Decomposition and Fusion for Distributed LLM Training cs.LG · 2026-04-27 · unverdicted · none · ref 8 · internal anchor
CommFuse eliminates tail latency in communication-computation overlap for distributed LLM training by decomposing collective operations into P2P communications and fusing them with fine-grained computation scheduling.
DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing cs.CL · 2026-04-21 · unverdicted · none · ref 41 · internal anchor
DASH-KV accelerates long-context LLM inference to linear complexity via asymmetric KV cache hashing and mixed-precision retention, matching full attention performance on LongBench.
A time-series classification framework for individual-level absenteeism prediction under severe class imbalance cs.AI · 2026-06-30 · unverdicted · none · ref 10 · internal anchor
A TSC framework separates historical attendance sequences from future labels and uses LSTM-FCN with BFL or G-Mean loss to achieve approximately 80% balanced accuracy for proactive absenteeism prediction on simulated data.
LLM Evolution as an Industry-Scale Ecosystem: A Lifecycle Perspective on Continual Learning cs.LG · 2026-06-12 · unverdicted · none · ref 32 · internal anchor
The paper reformulates industrial continual learning for LLMs as a closed-loop ecosystem problem, identifies three core challenges, and organizes solutions around five lifecycle design principles.
What Do Students Learn? A Feature-Level Analysis of Dark Knowledge cs.LG · 2026-06-02 · unverdicted · none · ref 24 · internal anchor
Confusion Distillation is a self-distillation method that treats dataset-level confusion patterns as dynamic soft targets, achieving competitive results on ResNet models for CIFAR-100 without a teacher.
A Training-Efficient Transformer-Based Anti-Spoofing Network for Logical Access in ASVspoof 5 cs.SD · 2026-06-02 · unverdicted · none · ref 17 · 2 links · internal anchor
TFPARN applies a Transformer encoder with attention pooling and combined focal-pairwise losses to ASVspoof 5 Track 1, reporting minDCF 0.2430, EER 12.52%, lowest inference memory, and faster training than re-implemented baselines.
SaluNet: Enabling Total Plasticity in Normalization-Free Deep Networks cs.CV · 2026-06-01 · unverdicted · none · ref 16 · internal anchor
SaluNet replaces normalization layers with the SALU activation and reports competitive accuracies on CIFAR-10/100 and ImageNet-1K without normalization.
MoEIoU: Rethinking Bounding-Box Regression as a Mixture of Experts cs.CV · 2026-05-30 · unverdicted · none · ref 6 · internal anchor
MoEIoU is a mixture-of-experts IoU loss using log-sum-exp aggregation and curriculum weighting that reports consistent gains over prior IoU losses on PASCAL VOC, HRIPCB, and MS COCO with YOLO models.
Orion: Enabling Self-adaptive Memory Management for On-device Online Continual Learning eess.SY · 2026-05-26 · unverdicted · none · ref 31 · internal anchor
Orion is a self-adaptive memory management framework for on-device online continual learning that co-optimizes latency, plasticity, and stability via URGE-based reallocation and prefetching.
Anytime Training with Schedule-Free Spectral Optimization cs.LG · 2026-05-21 · unverdicted · none · ref 68 · internal anchor
SF-NorMuon is a new schedule-free spectral optimizer that closes the gap with tuned AdamW on 125M-772M parameter models across 1-8x Chinchilla horizons while providing stationarity guarantees.
Fine-Tuning Without Forgetting via Loss-Adaptive Learning Rates cs.LG · 2026-05-19 · unverdicted · none · ref 13 · internal anchor
FINCH is a loss-adaptive learning-rate schedule that reduces forgetting by 93% on average during LLM fine-tuning while matching standard task performance across several benchmarks.
Information theoretic underpinning of self-supervised learning by clustering cs.LG · 2026-05-12 · unverdicted · none · ref 156 · internal anchor
SSL clustering is derived as KL-divergence optimization where a teacher-distribution constraint normalizes via inverse cluster priors and simplifies to batch centering by Jensen's inequality.
Multi-Narrow Transformation as a Single-Model Ensemble: Boundary Conditions, Mechanisms, and Failure Modes cs.LG · 2026-05-12 · unverdicted · none · ref 7 · internal anchor
Multi-narrow single-model ensembles outperform wide baselines in low-data image classification by learning diverse features but underperform in data-rich settings where training favors few paths.
Probing Routing-Conditional Calibration in Attention-Residual Transformers cs.CV · 2026-05-11 · unverdicted · none · ref 2 · internal anchor
Routing summaries and auxiliary features do not provide stable evidence of conditional miscalibration in AR transformers once confidence-matched baselines, capacity controls, and permutation nulls are applied.
Rennala MVR: Improved Time Complexity for Parallel Stochastic Optimization via Momentum-Based Variance Reduction math.OC · 2026-05-09 · unverdicted · none · ref 47 · internal anchor
Rennala MVR improves time complexity over Rennala SGD for smooth nonconvex stochastic optimization in heterogeneous parallel systems under a mean-squared smoothness assumption.
Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP cs.DC · 2026-05-08 · unverdicted · none · ref 44 · internal anchor
FCP shards sequences at block level with flexible P2P communication and bin-packing to achieve near-linear scaling up to 256 GPUs and 1.13x-2.21x higher attention MFU in foundation model pre-training.
