pith · machine review for the scientific record

arxiv: 1608.03983 · v5 · submitted 2016-08-13 · 💻 cs.LG · cs.NE · math.OC

Recognition: 2 theorem links · Lean Theorem

SGDR: Stochastic Gradient Descent with Warm Restarts

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:23 UTC · model grok-4.3

classification: 💻 cs.LG · cs.NE · math.OC
keywords: stochastic gradient descent · warm restarts · deep neural networks · anytime performance · CIFAR-10 · CIFAR-100 · learning rate schedule · optimization

The pith

A warm restart technique for stochastic gradient descent improves its anytime performance in deep neural network training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SGDR, a variant of stochastic gradient descent that periodically resets the learning rate to a higher value after following a cosine decay within each cycle. This warm restart approach is intended to help the optimizer navigate complex loss landscapes more effectively than standard fixed or monotonically decreasing schedules. The authors test the method on image classification tasks and report new state-of-the-art error rates. They also evaluate it on EEG signal data and a reduced ImageNet set to show broader applicability. The central promise is that these restarts deliver strong results at any point during training, not only at the end.
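
To make the mechanism concrete, here is a minimal sketch of the schedule in Python. The cosine form follows the paper's schedule definition; the hyperparameter values (eta_max = 0.05, T_0 = 10, T_mult = 2) are illustrative placeholders, not the paper's tuned settings:

    import math

    def sgdr_lr(epoch, eta_max=0.05, eta_min=0.0, t0=10, t_mult=2):
        """Learning rate at a (possibly fractional) epoch under SGDR.

        Within each cycle of length T_i, the rate follows a cosine decay
        from eta_max down to eta_min; when the cycle ends, the rate is
        reset (a "warm restart") and the next cycle is t_mult times longer.
        """
        t_i, t_cur = t0, epoch
        while t_cur >= t_i:  # locate the cycle that contains this epoch
            t_cur -= t_i
            t_i *= t_mult
        return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))

    # With t0=10 and t_mult=2, warm restarts fire at epochs 10, 30, 70, ...
    # and the rate jumps back to eta_max at each restart:
    print([round(sgdr_lr(e), 4) for e in (0, 5, 9.99, 10, 20, 29.99, 30)])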

Core claim

We propose a simple warm restart technique for stochastic gradient descent to improve its anytime performance when training deep neural networks. We empirically study its performance on the CIFAR-10 and CIFAR-100 datasets, where we demonstrate new state-of-the-art results at 3.14% and 16.21%, respectively. We also demonstrate its advantages on a dataset of EEG recordings and on a downsampled version of the ImageNet dataset.

What carries the argument

Periodic warm restarts of the learning rate schedule in SGD, where the rate decays via a cosine function within each restart cycle before being reset.
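
In symbols, following the paper's notation (T_cur counts epochs since the last restart, T_i is the current cycle length, and the superscript i allows cycle-dependent rate bounds):

    \eta_t = \eta_{\min}^{i} + \frac{1}{2}\left(\eta_{\max}^{i} - \eta_{\min}^{i}\right)\left(1 + \cos\left(\frac{T_{\mathrm{cur}}}{T_i}\,\pi\right)\right), \qquad T_{i+1} = T_{\mathrm{mult}} \cdot T_i .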

Load-bearing premise

That the periodic warm restarts will reliably improve convergence and anytime performance across architectures and datasets without introducing new failure modes or requiring dataset-specific retuning of the restart schedule.

What would settle it

Running the same network architecture on CIFAR-10 with a carefully tuned standard SGD schedule that achieves lower final error than the reported SGDR result would falsify the performance improvement claim.

Original abstract

Restart techniques are common in gradient-free optimization to deal with multimodal functions. Partial warm restarts are also gaining popularity in gradient-based optimization to improve the rate of convergence in accelerated gradient schemes to deal with ill-conditioned functions. In this paper, we propose a simple warm restart technique for stochastic gradient descent to improve its anytime performance when training deep neural networks. We empirically study its performance on the CIFAR-10 and CIFAR-100 datasets, where we demonstrate new state-of-the-art results at 3.14% and 16.21%, respectively. We also demonstrate its advantages on a dataset of EEG recordings and on a downsampled version of the ImageNet dataset. Our source code is available at https://github.com/loshchil/SGDR

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SGDR, a simple modification to stochastic gradient descent that applies periodic warm restarts combined with cosine annealing of the learning rate schedule. The central claim is that this heuristic improves the anytime performance of SGD when training deep neural networks. The authors support the claim with experiments on CIFAR-10 and CIFAR-100, reporting new state-of-the-art error rates of 3.14% and 16.21% respectively, plus additional results on EEG recordings and downsampled ImageNet; public code is released.

Significance. If the empirical results hold under independent verification, the work supplies a lightweight, practical enhancement to SGD training that requires only two additional schedule parameters (T0 and Tmult) and yields measurable gains in convergence speed and final accuracy on standard vision benchmarks. The public implementation and explicit schedule definitions strengthen reproducibility.

major comments (2)
  1. §4 (CIFAR experiments): the reported 3.14% and 16.21% error rates are presented as single-point SOTA figures without accompanying standard deviations or the number of independent runs; this weakens the cross-method comparison because small differences could arise from random seed variation rather than from the restart schedule.
  2. §3.2 (SGDR definition): the restart periods T_i are described as a geometric progression controlled by Tmult, yet the manuscript does not provide an ablation showing sensitivity to the choice of Tmult versus a fixed-period baseline; this leaves open whether the reported gains are robust to modest changes in the restart schedule.
minor comments (2)
  1. Figure 1 caption: the learning-rate plot would benefit from an explicit annotation of the restart points T_i to make the cosine-annealing pattern immediately visible without cross-referencing the text.
  2. Related-work section: the discussion of prior warm-restart methods in non-stochastic settings could cite the specific accelerated-gradient papers that motivated the partial-warm-restart idea.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation for minor revision. We address the two major comments point by point below.

Point-by-point responses
  1. Referee: §4 (CIFAR experiments): the reported 3.14% and 16.21% error rates are presented as single-point SOTA figures without accompanying standard deviations or the number of independent runs; this weakens the cross-method comparison because small differences could arise from random seed variation rather than from the restart schedule.

    Authors: We acknowledge that single-run results limit statistical assessment of variability. The original experiments followed the common practice of reporting single-run error rates on CIFAR benchmarks. In the revised manuscript we will explicitly state that the 3.14% and 16.21% figures are from single training runs and note that the released code enables independent verification. We also observe that the same schedule yields consistent gains on EEG and downsampled ImageNet, supporting robustness beyond random seed effects. revision: partial

  2. Referee: §3.2 (SGDR definition): the restart periods T_i are described as a geometric progression controlled by Tmult, yet the manuscript does not provide an ablation showing sensitivity to the choice of Tmult versus a fixed-period baseline; this leaves open whether the reported gains are robust to modest changes in the restart schedule.

    Authors: We agree that an explicit comparison would strengthen the presentation. In the revised version we will add a short ablation that contrasts the geometric schedule (Tmult > 1) against a fixed-period baseline (Tmult = 1) on CIFAR-10, confirming that the geometric progression contributes to the observed improvements. revision: yes
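
As a concrete reading of the promised ablation, the following minimal sketch shows how the two schedules place their restarts; the T_0 value and epoch budget are illustrative placeholders, not the experimental settings:

    def restart_epochs(t0, t_mult, budget):
        """Epochs at which warm restarts fire within a fixed training budget."""
        epochs, t_i, t = [], t0, 0
        while t + t_i <= budget:
            t += t_i
            epochs.append(t)
            t_i *= t_mult
        return epochs

    # Fixed-period baseline (T_mult = 1) vs geometric schedule (T_mult = 2),
    # both under the same 200-epoch budget:
    print(restart_epochs(t0=10, t_mult=1, budget=200))  # [10, 20, ..., 200]
    print(restart_epochs(t0=10, t_mult=2, budget=200))  # [10, 30, 70, 150]

Under the geometric schedule each successive cycle anneals over more epochs before restarting, while the fixed-period baseline restarts at a constant interval; the ablation asks whether that lengthening is what drives the gains.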

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper proposes SGDR as an empirical algorithmic heuristic for SGD with periodic cosine-annealed warm restarts, without any claimed first-principles derivation or mathematical prediction chain. All load-bearing elements are direct experimental validations on CIFAR-10/100 (new SOTA error rates) plus EEG and ImageNet subsets, with explicit schedule parameters and public code. No step reduces by construction to fitted inputs, self-definitions, or self-citation chains; the contribution is self-contained as a practical schedule definition plus empirical results.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The claim rests on the empirical effectiveness of a new learning-rate schedule; no new physical entities or unstated mathematical axioms are introduced beyond standard optimization assumptions.

free parameters (2)
  • restart periods T_i
    Chosen as hyperparameters that increase over time; their specific values affect when restarts occur.
  • initial learning rate eta_max
    Standard SGD hyperparameter that is reset at each restart.
axioms (1)
  • domain assumption: Cosine annealing combined with periodic restarts improves convergence on multimodal loss surfaces typical of deep networks.
    Invoked to motivate the schedule; supported by prior work on accelerated methods but treated as given for this paper.

pith-pipeline@v0.9.0 · 5424 in / 1210 out tokens · 40319 ms · 2026-05-10T19:23:38.766313+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ENSEMBITS: an alphabet of protein conformational ensembles

    cs.LG 2026-05 unverdicted novelty 8.0

    Ensembits is the first tokenizer of protein conformational ensembles that outperforms static tokenizers on RMSF prediction and matches them on function and mutation tasks while using less pretraining data.

  2. ENSEMBITS: an alphabet of protein conformational ensembles

    cs.LG 2026-05 unverdicted novelty 8.0

    Ensembits creates a discrete vocabulary for protein conformational ensembles that outperforms static tokenizers on dynamics prediction tasks and enables ensemble token prediction from single structures via distillation.

  3. A Unified Perspective on Adversarial Membership Manipulation in Vision Models

    cs.CV 2026-04 conditional novelty 8.0

    Adversarial perturbations reliably fabricate membership signals in vision-model MIAs, separated by a gradient-norm collapse trajectory that enables robust detection and inference.

  4. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    cs.CV 2024-09 accept novelty 8.0

    Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

  5. The finite expression method for turbulent dynamics with high-order moment recovery

    cs.LG 2026-05 unverdicted novelty 7.0

    A two-stage symbolic regression plus generative model framework recovers governing interaction terms and forcing in stochastic triad models while accurately predicting statistical moments up to order five.

  6. End-to-End Keyword Spotting on FPGA Using Graph Neural Networks with a Neuromorphic Auditory Sensor

    cs.LG 2026-05 conditional novelty 7.0

    An FPGA implementation of a neuromorphic auditory sensor plus graph neural network achieves 87.43% accuracy on Google Speech Commands v2 with sub-35 µs latency and 1.12 W power.

  7. GPROF-IR: An Improved Single-Channel Infrared Precipitation Retrieval for Merged Satellite Precipitation Products

    physics.ao-ph 2026-05 unverdicted novelty 7.0

    GPROF-IR is a CNN-based retrieval that uses temporal context in geostationary IR observations to produce precipitation estimates with lower error than prior IR methods and climatological consistency with PMW retrieval...

  8. The Interplay of Data Structure and Imbalance in the Learning Dynamics of Diffusion Models

    stat.ML 2026-05 unverdicted novelty 7.0

    Higher-variance classes are learned first in diffusion models; strong class imbalance reverses the order and imposes distinct delayed learning times on minority classes.

  9. Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

    cs.AI 2026-05 conditional novelty 7.0

    Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

  10. Beyond Penalization: Diffusion-based Out-of-Distribution Detection and Selective Regularization in Offline Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    DOSER detects OOD actions via diffusion-model denoising error and applies selective regularization based on predicted transitions, proving gamma-contraction with performance bounds and outperforming priors on offline ...

  11. Selective Contrastive Learning For Gloss Free Sign Language Translation

    cs.CL 2026-04 unverdicted novelty 7.0

    A pair selection strategy based on negative similarity dynamics strengthens contrastive supervision in gloss-free sign language translation by reducing noisy negatives.

  12. Learned Nonlocal Feature Matching and Filtering for RAW Image Denoising

    eess.IV 2026-04 unverdicted novelty 7.0

    A learnable nonlocal block that mimics classical neighbor matching and collaborative filtering on multiscale features produces competitive RAW denoising with far fewer parameters than current deep models and generaliz...

  13. From Zero to Detail: A Progressive Spectral Decoupling Paradigm for UHD Image Restoration with New Benchmark

    cs.CV 2026-04 unverdicted novelty 7.0

    A new framework called ERR decomposes UHD image restoration into three frequency stages with specialized sub-networks and introduces the LSUHDIR benchmark dataset of over 82,000 images.

  14. MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events

    cs.CL 2026-04 unverdicted novelty 7.0

    MADE creates a contamination-resistant living benchmark for multi-label classification of medical device adverse events, with evaluations revealing model-specific trade-offs in accuracy and uncertainty quantification.

  15. High-Speed Full-Color HDR Imaging via Unwrapping Modulo-Encoded Spike Streams

    cs.CV 2026-04 unverdicted novelty 7.0

    An exposure-decoupled modulo formulation and iteration-free diffusion-prior unwrapping enable 1000 FPS full-color HDR imaging on spike cameras while cutting bandwidth from 20 Gbps to 6 Gbps.

  16. Sparse Contrastive Learning for Content-Based Cold Item Recommendation

    cs.IR 2026-04 unverdicted novelty 7.0

    SEMCo uses sparse entmax contrastive learning for purely content-based cold-start item recommendation, outperforming standard methods in ranking accuracy.

  17. MIXAR: Scaling Autoregressive Pixel-based Language Models to Multiple Languages and Scripts

    cs.CL 2026-04 unverdicted novelty 7.0

    MIXAR is the first autoregressive pixel-based language model for eight languages and scripts, with empirical gains on multilingual tasks, robustness to unseen languages, and further improvements when scaled to 0.5B pa...

  18. ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching

    cs.RO 2026-04 unverdicted novelty 7.0

    ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on lo...

  19. CV-HoloSR: Hologram to hologram super-resolution through volume-upsampling three-dimensional scenes

    cs.GR 2026-04 conditional novelty 7.0

    CV-HoloSR uses a complex-valued residual dense network, depth-aware perceptual loss, and complex LoRA fine-tuning to perform hologram super-resolution for volumetric upsampling, achieving 32% better LPIPS while mainta...

  20. VDPP: Video Depth Post-Processing for Speed and Scalability

    cs.CV 2026-04 unverdicted novelty 7.0

    VDPP is an RGB-free video depth post-processor that achieves over 43 FPS on Jetson Orin Nano by refining geometry at low resolution rather than reconstructing full scenes.

  21. BiDexGrasp: Coordinated Bimanual Dexterous Grasps across Object Geometries and Sizes

    cs.RO 2026-04 unverdicted novelty 7.0

    BiDexGrasp supplies a 9.7-million-grasp bimanual dexterous dataset built via two-stage synthesis and a coordinated geometry-size-adaptive model that generates grasps for unseen objects.

  22. Joint Fullband-Subband Modeling for High-Resolution SingFake Detection

    cs.SD 2026-04 unverdicted novelty 7.0

    A joint fullband-subband model using high-resolution 44.1 kHz audio outperforms standard 16 kHz detectors for singing voice deepfake detection by exploiting spectrum-specific synthesis artifacts.

  23. UniDAC: Universal Metric Depth Estimation for Any Camera

    cs.CV 2026-03 unverdicted novelty 7.0

    UniDAC achieves universal metric depth estimation across camera types by decoupling relative depth prediction from spatially varying scale estimation using a depth-guided module and distortion-aware positional embedding.

  24. LPNSR: Optimal Noise-Guided Diffusion Image Super-Resolution Via Learnable Noise Prediction

    cs.CV 2026-03 conditional novelty 7.0

    LPNSR derives optimal intermediate noise for diffusion SR via MLE and implements it with an LR-guided noise predictor, reaching SOTA perceptual quality in 4 steps without text priors.

  25. Polarized Target Nuclear Magnetic Resonance Measurements with Deep Neural Networks

    physics.ins-det 2026-03 unverdicted novelty 7.0

    Deep neural networks reduce fitting uncertainties in CW-NMR polarization measurements for dynamically polarized targets.

  26. Moshi: a speech-text foundation model for real-time dialogue

    eess.AS 2024-09 accept novelty 7.0

    Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.

  27. LRM: Large Reconstruction Model for Single Image to 3D

    cs.CV 2023-11 conditional novelty 7.0

    LRM is a large transformer that predicts a NeRF directly from a single image after training on a million-object multi-view dataset.

  28. A Simple Framework for Contrastive Learning of Visual Representations

    cs.LG 2020-02 accept novelty 7.0

    SimCLR learns visual representations by contrasting augmented views of the same image and reaches 76.5% ImageNet top-1 accuracy with a linear classifier, matching a supervised ResNet-50.

  29. From Clever Hans to Scientific Discovery: Interpreting EEG Foundational Transformers with LRP

    cs.AI 2026-05 unverdicted novelty 6.0

    LRP on EEG transformers reveals Clever Hans artifacts in motor imagery tasks and a recurring central electrode cluster as a candidate sensorimotor signature of arousal.

  30. Direct-to-Event Spiking Neural Network Transfer

    cs.NE 2026-05 unverdicted novelty 6.0

    This work provides the first systematic study of transferring direct-coded spiking neural networks to event-based representations while aiming to preserve accuracy and reduce energy use.

  31. Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

    cs.LG 2026-05 unverdicted novelty 6.0

    Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.

  32. OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

    cs.LG 2026-05 unverdicted novelty 6.0

    OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.

  33. CARD: Coarse-to-fine Autoregressive Modeling with Radix-based Decomposition for Transferable Free Energy Estimation

    cs.LG 2026-05 unverdicted novelty 6.0

    CARD uses radix decomposition to enable autoregressive modeling of molecular coordinates as a zero-free-energy reference distribution, delivering classical accuracy for absolute free energy on unseen systems at ~40x speedup.

  34. Anon: Extrapolating Adaptivity Beyond SGD and Adam

    cs.AI 2026-05 unverdicted novelty 6.0

    Anon optimizer uses tunable adaptivity and incremental delay update to achieve convergence guarantees and outperform existing methods on image classification, diffusion, and language modeling tasks.

  35. UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

    cs.CV 2026-05 unverdicted novelty 6.0

    UniVidX unifies diverse video generation tasks into one conditional diffusion model using stochastic condition masking, decoupled gated LoRAs, and cross-modal self-attention.

  36. Parameter-Efficient Adaptation of Pre-Trained Vision Foundation Models for Active and Passive Seismic Data Denoising

    physics.geo-ph 2026-04 conditional novelty 6.0

    Adapting vision foundation models with LoRA and kurtosis-guided unsupervised test-time adaptation matches or exceeds domain-specific models for seismic denoising across multiple sites and unseen data.

  37. Euclid Quick Data Release (Q1). AstroVink: A vision transformer approach to find strong gravitational lens systems

    astro-ph.IM 2026-04 conditional novelty 6.0

    A vision transformer classifier trained on simulated and real Euclid data recovers all known strong lenses in test sets and finds 8 Grade A plus 26 Grade B new candidates in the Q1 data.

  38. Materialistic RIR: Material Conditioned Realistic RIR Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    A two-module neural model disentangles spatial layout from material properties to generate controllable and more realistic room impulse responses, reporting gains of up to 16% on acoustic metrics and 70% on material m...

  39. Self-supervised pretraining for an iterative image size agnostic vision transformer

    cs.CV 2026-04 unverdicted novelty 6.0

    A sequential-to-global SSL method based on DINO pretrains iterative foveal-inspired vision transformers to achieve competitive ImageNet-1K performance with constant compute regardless of input resolution.

  40. IonMorphNet: Generalizable Learning of Ion Image Morphologies for Peak Picking in Mass Spectrometry Imaging

    cs.CV 2026-04 unverdicted novelty 6.0

    IonMorphNet is a ConvNeXt-based classifier trained on six spatial pattern classes from 53 MSI datasets that performs generalizable peak picking and improves mSCF1 by 7% over prior methods while also aiding tumor class...

  41. EAST: Early Action Prediction Sampling Strategy with Token Masking

    cs.CV 2026-04 unverdicted novelty 6.0

    EAST uses randomized time-step sampling and token masking to train a single encoder-only model that generalizes across all observation ratios in early action prediction and reports new state-of-the-art accuracy on NTU...

  42. HSG: Hyperbolic Scene Graph

    cs.CV 2026-04 unverdicted novelty 6.0

    Hyperbolic Scene Graph (HSG) learns embeddings in hyperbolic space for better hierarchical structure in scene graphs, achieving graph IoU of 33.51 versus 25.37 for the best Euclidean baseline.

  43. AI-assisted modeling and Bayesian inference of unpolarized quark transverse momentum distributions from Drell-Yan data

    hep-ph 2026-04 unverdicted novelty 6.0

    An AI-assisted Bayesian framework extracts TMD PDFs from global Drell-Yan data using surrogate models for scalable MCMC sampling.

  44. OTProf: estimating high-resolution profiles of optical turbulence ($C_n^2$) from reanalysis using deep learning

    physics.ao-ph 2026-04 conditional novelty 6.0

    Deep learning model OTProf generates high-resolution C_n² profiles from ERA5 reanalysis data and outperforms the Hufnagel-Valley model for vertical structure and integrated parameters like Fried parameter r_0 in the N...

  45. Frequency-Enhanced Diffusion Models: Curriculum-Guided Semantic Alignment for Zero-Shot Skeleton Action Recognition

    cs.CV 2026-04 unverdicted novelty 6.0

    FDSM recovers fine-grained motion details in zero-shot skeleton action recognition by integrating semantic-guided spectral residual, timestep-adaptive spectral loss, and curriculum-based semantic abstraction, reaching...

  46. CloudMamba: An Uncertainty-Guided Dual-Scale Mamba Network for Cloud Detection in Remote Sensing Imagery

    cs.CV 2026-04 unverdicted novelty 6.0

    CloudMamba combines uncertainty-guided refinement with a dual-scale Mamba network to outperform prior methods on cloud segmentation accuracy while maintaining linear computational cost.

  47. Learning Shared Sentiment Prototypes for Adaptive Multimodal Sentiment Analysis

    cs.MM 2026-04 unverdicted novelty 6.0

    PRISM learns shared sentiment prototypes to enable structured cross-modal comparison and dynamic modality reweighting in multimodal sentiment analysis, outperforming baselines on three benchmark datasets.

  48. Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    Relative density ratio optimization stabilizes direct density ratio estimation for language model alignment while preserving statistical consistency without assuming a Bradley-Terry preference model.

  49. Scalable Variational Bayesian Fine-Tuning of LLMs via Orthogonalized Low-Rank Adapters

    cs.LG 2026-04 unverdicted novelty 6.0

    PoLAR-VBLL combines orthogonalized low-rank adapters with variational Bayesian last-layer inference to enable scalable, well-calibrated uncertainty quantification in fine-tuned LLMs.

  50. DeepSeek-OCR: Contexts Optical Compression

    cs.CV 2025-10 unverdicted novelty 6.0

    DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision tokens.

  51. MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

    cs.CL 2024-04 conditional novelty 6.0

    MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.

  52. Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

    cs.LG 2024-02 conditional novelty 6.0

    REINFORCE-style variants outperform PPO, DPO, and RAFT in RLHF for LLMs by removing unnecessary PPO components and adapting the simpler method to LLM alignment characteristics.

  53. Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

    cs.RO 2023-12 conditional novelty 6.0

    A GPT-style model pre-trained on large video datasets achieves 94.9% success on CALVIN multi-task manipulation and 85.4% zero-shot generalization, outperforming prior baselines.

  54. Vision Transformers Need Registers

    cs.CV 2023-09 unverdicted novelty 6.0

    Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.

  55. IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies

    cs.LG 2023-04 conditional novelty 6.0

    IDQL generalizes IQL into an actor-critic framework and uses diffusion policies for robust policy extraction, outperforming prior offline RL methods.

  56. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    cs.CL 2022-11 unverdicted novelty 6.0

    BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.

  57. ERPPO: Entropy Regularization-based Proximal Policy Optimization

    cs.LG 2026-05 unverdicted novelty 5.0

    ERPPO adds a DSA-based ambiguity estimator to MAPPO and switches between L1 and L2 entropy regularization to improve exploration and stability in non-stationary multi-dimensional observations.

  58. Probing Routing-Conditional Calibration in Attention-Residual Transformers

    cs.CV 2026-05 unverdicted novelty 5.0

    Routing summaries and auxiliary features do not provide stable evidence of conditional miscalibration in AR transformers once confidence-matched baselines, capacity controls, and permutation nulls are applied.

  59. MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation

    cs.CV 2026-04 unverdicted novelty 5.0

    MambaLiteUNet integrates Mamba into U-Net with adaptive fusion, local-global mixing, and cross-gated attention modules to reach 87.12% IoU and 93.09% Dice on skin lesion datasets while cutting parameters by 93.6%.

  60. Training-inference input alignment outweighs framework choice in longitudinal retinal image prediction

    cs.CV 2026-04 unverdicted novelty 5.0

    Training-inference input alignment outweighs framework choice for longitudinal retinal image prediction, with deterministic regression matching complex models when acquisition variability dominates disease progression.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · cited by 76 Pith papers · 4 internal anchors

  1. [1]

    The loss surfaces of multilayer networks, 2015

    Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. arXiv preprint arXiv:1412.0233,

  2. [2]

    Rmsprop and equilibrated adaptive learning rates for non-convex optimization

    Yann N Dauphin, Harm de Vries, Junyoung Chung, and Yoshua Bengio. Rmsprop and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390,

  3. [3]

    Deep pyramidal residual networks

    Dongyoon Han, Jiwhan Kim, and Junmo Kim. Deep pyramidal residual networks. arXiv preprint arXiv:1610.02915,

  4. [4]

    Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

    Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149,

  5. [5]

    Benchmarking a BI-population CMA-ES on the BBOB-2009 function testbed

    Nikolaus Hansen. Benchmarking a BI-population CMA-ES on the BBOB-2009 function testbed. In Proceedings of the 11th Annual Conference Companion on Genetic and Evolutionary Computation Conference: Late Breaking Papers, pp. 2389–2396. ACM,

  6. [6]

    Evaluating the CMA evolution strategy on multimodal test functions

    Nikolaus Hansen and Stefan Kern. Evaluating the CMA evolution strategy on multimodal test functions. In International Conference on Parallel Problem Solving from Nature, pp. 282–291. Springer,

  7. [7]

    Deep Residual Learning for Image Recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385,

  8. [8]

    Identity mappings in deep residual networks

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. arXiv preprint arXiv:1603.05027,

  9. [9]

    Densely connected convolutional networks

    Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John E. Hopcroft, and Kilian Q. Weinberger. Snapshot ensembles: Train 1, get M for free. ICLR 2017 submission, 2016a. Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016b. Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. arXiv preprint arXiv:1603.09382, 2016c.

  10. [10]

    SGDR: Stochastic Gradient Descent with Warm Restarts

    Ilya Loshchilov and Frank Hutter. SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv preprint arXiv:1608.03983,

  11. [11]

    Adaptive restart for accelerated gradient schemes

    Brendan O’Donoghue and Emmanuel Candes. Adaptive restart for accelerated gradient schemes. arXiv preprint arXiv:1204.3982,

  12. [12]

    Benchmarking the BFGS algorithm on the BBOB-2009 function testbed

    Raymond Ros. Benchmarking the BFGS algorithm on the BBOB-2009 function testbed. In Proceedings of the 11th Annual Conference Companion on Genetic and Evolutionary Computation Conference: Late Breaking Papers, pp. 2409–2414. ACM,

  13. [13]

    Deep learning with convolutional neural networks for brain mapping and decoding of movement-related information from the human eeg

    Robin Tibor Schirrmeister, Jost Tobias Springenberg, Lukas Dominique Josef Fiederer, Martin Glasstetter, Katharina Eggensperger, Michael Tangermann, Frank Hutter, Wolfram Burgard, and Tonio Ball. Deep learning with convolutional neural networks for brain mapping and decoding of movement-related information from the human EEG.

  14. [15]

    Cyclical Learning Rates for Training Neural Networks

    Leslie N Smith. Cyclical learning rates for training neural networks. arXiv preprint arXiv:1506.01186v3,

  15. [16]

    Stochastic subgradient methods with linear convergence for polyhedral convex optimization

    Tianbao Yang and Qihang Lin. Stochastic subgradient methods with linear convergence for polyhedral convex optimization. arXiv preprint arXiv:1510.01444,

  16. [17]

    Wide Residual Networks

    Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146,

  17. [18]

    Adadelta: An adaptive learning rate method

    Matthew D Zeiler. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701,

  18. [19]

    Supplementary material of the reviewed paper (§8.1, 50k vs 100k examples per epoch): the data augmentation code is inherited from the Lasagne Recipe code for ResNets, where flipped images are added to the training set. Figure 6: the median results of 5 runs for the best learning rate settings considered for WRN-28-1 (CIFAR-10 test error, Default vs SGDR).