Deep Residual Learning for Image Recognition
Mixed citation behavior. Most common role is background (50%).
abstract
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
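The idea of "learning residual functions with reference to the layer inputs" can be illustrated with a toy sketch. This is not the paper's architecture: the names `residual_block` and `plain_block`, the two small dense layers, and the absence of normalization are all assumptions made for illustration. It only shows the shortcut form y = F(x) + x and one consequence of it: when the learned weights are zero, the residual block passes its input through unchanged, whereas a plain block outputs zeros.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def relu(v: Vector) -> Vector:
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in v]


def matvec(w: Matrix, v: Vector) -> Vector:
    """Multiply matrix w (list of rows) by vector v."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def plain_block(x: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """Unreferenced mapping: the layers must produce the whole output."""
    return matvec(w2, relu(matvec(w1, x)))


def residual_block(x: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """Residual mapping: the layers learn F(x), and the block returns F(x) + x.

    Hypothetical toy version: input and output sizes must match so the
    identity shortcut can be added element-wise.
    """
    fx = plain_block(x, w1, w2)               # F(x)
    return [f + xi for f, xi in zip(fx, x)]   # y = F(x) + x


# With all-zero weights, F(x) = 0, so the residual block is the identity.
zeros: Matrix = [[0.0, 0.0], [0.0, 0.0]]
x: Vector = [1.5, -2.0]
identity_out = residual_block(x, zeros, zeros)
plain_out = plain_block(x, zeros, zeros)
```

Here `identity_out` equals the input and `plain_out` is all zeros, which is one informal way to see why stacking many residual blocks need not degrade the signal the way stacking plain blocks can.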
citing papers explorer
-
WaveNet: A Generative Model for Raw Audio
WaveNet generates realistic raw audio using an autoregressive neural network with dilated convolutions, achieving state-of-the-art naturalness in speech synthesis for English and Mandarin.
-
When Stronger Triggers Backfire: A High-Dimensional Theory of Backdoor Attacks
In the proportional high-dimensional regime, stronger backdoor training triggers improve clean accuracy and make attack success non-monotonic for regularized GLMs on Gaussian mixtures, with closed-form proofs for squared loss and fixed-point extensions to convex losses.
-
Density estimation using Real NVP
Real NVP uses affine coupling layers to create invertible transformations that support exact density estimation, sampling, and latent inference without approximations.
-
Self-Organized Conformal Prediction: Reducing Regional Coverage Gaps with Unsupervised Group Discovery
SOCP uses self-organizing maps for unsupervised group discovery to enable local calibration in conformal prediction, reducing regional coverage gaps on benchmarks with small set-size increases while preserving validity guarantees.
-
Learning from Acquisition: Metadata-driven Multimodal Pre-training for Cardiac MRI
MetaCLIP-CMR applies CLIP-style contrastive learning to cardiac MRI by treating acquisition metadata as text labels, delivering 86.8% modality and 86.5% view accuracy plus top Dice scores on ACDC/M&Ms segmentation with far less pre-training data than recent large-scale CMR models.
-
Layerwise Progressive Freezing: A Training Scaffold for Depth-Scalable Binary Networks
StoMPP progressively binarizes BNN layers layerwise from input to output via stochastic masks, delivering depth-scalable accuracy gains in a fully STE-free regime by controlling activation-induced gradient blockades.
-
Polarisation and Faraday rotation measure imaging at metre wavelengths with sub-arcsecond resolution: a foundational calibration strategy
A calibration strategy using full-Jones corrections with an in-field unpolarised calibrator and visibility-based multi-epoch alignment enables sub-arcsecond polarimetric imaging with LOFAR at metre wavelengths.
-
Adaptive Volumetric Mechanical Property Fields Invariant to Resolution
AdaVoMP predicts accurate dense spatially-varying Young's modulus, Poisson's ratio and density for 3D objects using an adaptive sparse voxel structure generated by a sparse transformer encoder-decoder at 16^3 higher resolution than prior fixed-voxel methods.
-
Sparse POD Mode Selection and Manifold Dimensionality Reduction with Neural Networks
SparseModesNet uses linear POD encoding plus LassoNet-enforced sparse nonlinear neural decoding to select informative modes and cut reconstruction error on advection-dominated and turbulent flows.
-
Transformer Field Theory: A Response-Theoretic Approach to Mechanistic Interpretability
Transformer Field Theory frames the residual stream as a field, models patching as source insertion, and uses first-order sensitivities plus Green functions to predict and describe responses, with empirical tests on GPT-2 autoregressive models.
-
Beyond Routing: Characterising Expert Tuning and Representation in Vision Mixture-of-Experts
Expert specialization in vision MoE models is dominated by a stable animate-inanimate distinction visible from gating to readout, with broader tuning to continuous visual and semantic dimensions rather than narrow categorical preferences.
-
When Bits Break Recourse: Counterfactual-Faithful Quantization
CFQ trains quantizer parameters and mixed-precision allocation to preserve counterfactual recourse validity, cost, and direction on Adult, German Credit, and COMPAS while matching accuracy of standard quantizers.
-
Generative reconstruction of 2D and 3D polycrystalline microstructures using symmetrized hyperspherical harmonics
A new differentiable reconstruction method uses symmetrized hyperspherical harmonics on quaternions plus two- and three-point descriptors to generate 3D microstructures from 2D data, demonstrated on aluminum alloy with L-BFGS-B optimization.
-
Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo
PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.
-
Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters
Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).
-
Optimal Representations for Generalized Contrastive Learning with Imbalanced Datasets
In generalized contrastive learning with imbalanced classes, optimal representations collapse to class means whose angular geometry is determined by class proportions via convex optimization, and extreme imbalance causes all minority classes to collapse to one vector.
-
Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models
Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.
-
Replica Theory of Spherical Boltzmann Machine Ensembles
Replica calculations fully solve spherical Boltzmann machine ensembles and identify regimes where ensemble learning outperforms standard training, particularly for nearly finite-dimensional data.
-
Grokking of Diffusion Models: Case Study on Modular Addition
Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.
-
Privatar: Scalable Privacy-preserving Multi-user VR via Secure Offloading
Privatar uses horizontal frequency partitioning and distribution-aware minimal perturbation to enable private offloading of VR avatar reconstruction, supporting 2.37x more users with modest overhead.
-
Momentum Further Constrains Sharpness at the Edge of Stochastic Stability
Momentum SGD exhibits two distinct EoSS regimes for batch sharpness, stabilizing at 2(1-β)/η for small batches and 2(1+β)/η for large batches, aligning with linear stability thresholds.
-
Seg2Change: Adapting Open-Vocabulary Semantic Segmentation Model for Remote Sensing Change Detection
Seg2Change adapts open-vocabulary segmentation models to open-vocabulary change detection via a category-agnostic change head and new dataset CA-CDD, delivering +9.52 IoU on WHU-CD and +5.50 mIoU on SECOND.
-
Deep learning-based phase-field modelling of brittle fracture in anisotropic media
A variational physics-informed neural network solves higher-order anisotropic phase-field fracture models by minimizing total energy with B-spline enriched trial functions.
-
Illumination-Aware Contactless Fingerprint Spoof Detection via Paired Flash-Non-Flash Imaging
Paired flash-non-flash imaging improves contactless fingerprint spoof detection by highlighting material and structure differences between genuine and fake prints.
-
Polarized Target Nuclear Magnetic Resonance Measurements with Deep Neural Networks
Deep neural networks reduce fitting uncertainties in CW-NMR polarization measurements for dynamically polarized targets.
-
Contour Refinement using Discrete Diffusion in Low Data Regime
A CNN-based discrete diffusion method refines sparse contours from segmentation masks using simplified denoising steps and minimal post-processing, outperforming baselines on small medical and environmental datasets while running 3.5 times faster.
-
B-FIRE: Binning-Free Diffusion Implicit Neural Representation for Hyper-Accelerated Motion-Resolved MRI
B-FIRE uses a diffusion-optimized CNN-INR to reconstruct instantaneous 3D abdominal anatomy from binning-free, hyper-accelerated non-Cartesian k-space data in motion-resolved MRI.
-
NASTaR: NovaSAR Automated Ship Target Recognition Dataset
NASTaR is a new dataset of 3415 AIS-labeled ship patches from NovaSAR S-band SAR imagery with 23 classes, inshore/offshore splits, and wake annotations, validated via benchmark deep learning models.
-
Federated Learning for Surgical Vision in Appendicitis Classification: Results of the FedSurg EndoVis 2024 Challenge
The FedSurg challenge benchmarks federated learning on appendectomy videos and finds only 26% F1 on unseen centers even with centralized data, plus extra penalties from decentralization, with spatiotemporal models performing best.
-
Atomistic Machine Learning with Irreducible Cartesian Natural Tensors
CarNet develops irreducible Cartesian natural tensors and an equivariant model that matches leading spherical-tensor performance for ML interatomic potentials and high-rank tensor predictions like elastic constants.
-
DPQuant: Efficient and Differentially-Private Model Training via Dynamic Quantization Scheduling
DPQuant uses epoch-wise probabilistic layer rotation and DP loss sensitivity to quantize only a changing subset of layers, reducing accuracy degradation from quantization noise in DP-SGD and delivering up to 2.21x throughput gains with under 2% accuracy drop.
-
Prospects for Deep-Learning-Based Mass Reconstruction of Ultra-High-Energy Cosmic Rays using Simulated Air-Shower Profiles
A CNN predicts ln A from longitudinal shower profiles with bias under 0.4, resolution 1-1.5, and proton-iron merit factor 2.19, outperforming simpler ML models on shape parameters and remaining robust to hadronic model changes.
-
SCOOTER: A Human Evaluation Framework for Unrestricted Adversarial Examples
SCOOTER supplies best-practice guidelines, open tools, and a 3K-image benchmark with 34K+ human ratings showing that six tested unrestricted attacks produce images humans can detect as fake.
-
V-RoAst: Visual Road Assessment. Can VLM be a Road Safety Assessor Using the iRAP Standard?
V-RoAst applies zero-shot VLMs (Gemini-1.5-flash, GPT-4o-mini) to iRAP road safety attribute classification on a new ThaiRAP image dataset and compares them to CNN baselines, finding better generalization to unseen classes but weaker spatial reasoning.
-
Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations
HSTU-based generative recommenders with 1.5 trillion parameters scale as a power law with compute up to GPT-3 scale, outperform baselines by up to 65.8% NDCG, run 5-15x faster than FlashAttention2 on long sequences, and improve online A/B metrics by 12.4%.
-
Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots
UMI enables zero-shot deployment of robot manipulation policies trained solely on portable human demonstrations captured with custom handheld grippers, supporting dynamic bimanual tasks across novel environments and objects.
-
Stateful Detection of Black-Box Adversarial Attacks
The paper argues for stateful defenses over stateless ones to detect adversarial example generation via query history and introduces query blinding as a counter-attack.
-
Transfer Learning from Audio-Visual Grounding to Speech Recognition
Features from audio-visual semantic grounding models improve speech recognition when used as input, with earlier layers retaining more phonetic detail and deeper layers showing greater domain invariance.
-
Predicting Retrosynthetic Reaction using Self-Corrected Transformer Neural Networks
SCROP Transformer model with neural syntax corrector reaches 59% accuracy on retrosynthesis benchmarks, outperforming prior deep learning methods by over 21 points and template-based methods by over 6 points, with 1.7 times higher accuracy on unseen compounds.
-
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
MobileNets introduce depthwise separable convolutions plus width and resolution multipliers to produce efficient CNNs that trade off latency and accuracy for mobile and embedded vision applications.
-
MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
MS MARCO is a new large-scale machine reading comprehension dataset built from real Bing search queries, human-generated answers, and web passages, supporting three tasks including answer synthesis and passage ranking.
-
Wide Residual Networks
Wide residual networks achieve higher accuracy and faster training than very deep thin residual networks by increasing width and decreasing depth, setting new state-of-the-art results on CIFAR, SVHN, and ImageNet.
-
Training Deep Nets with Sublinear Memory Cost
An algorithm trains n-layer networks with O(sqrt(n)) memory via selective recomputation of activations, at the cost of one extra forward pass.
-
Expected Gain-based Escalation in Vertical Federated Learning
An analytical expected-gain score from calibrated posteriors and classwise reliability estimates decides escalation in VFL, improving communication-accuracy trade-off over baselines.
-
Benchmark AUC Is Not Deployable Reliability: A Cross-Dataset Audit of Off-the-Shelf Features for Surveillance Video Anomaly Detection
Cross-dataset testing of nearest-neighbor and Mahalanobis anomaly detectors on CLIP, DINOv2, ResNet-50 and EfficientNet embeddings shows same-dataset AUC averaging 0.704 dropping to 0.499 on other datasets, with false-alarm rates around 31,931 per hour at usable operating points.
-
Neural posterior estimation of Galactic Binary signals for the LISA mission
Conditional normalizing flows perform likelihood-free parameter estimation for single and overlapping LISA galactic binaries, generating thousands of posterior samples per second after training on simulations.
-
WattLayer: Get Layers Right to Estimate Inference Energy of Neural Networks
WattLayer is a layer-wise energy estimation model achieving 19.6% median error on over 100k layers from 295 architectures across 3 tasks and 3 platforms, with generalization to new tasks via shared layers.
-
Beyond Aesthetics: Quantifying Information Loss in Turbid Scenes
Introduces the TUB dataset of 1320 real turbid underwater images and PCD metric showing strong correlation with instance segmentation performance where standard metrics fail.
-
Unmasking LAION-5B: Age, Gender, Race, and Emotion Biases in Large-Scale Image Datasets
Empirical audit of LAION-2B-en and LAION-2B-multi finds overrepresentation of young adults, White people, and males plus stereotypical emotion associations across two attribute classifiers.
-
Do Value Vectors in Deep Layers Need Context from the Residual Stream?
Deeper transformer layers benefit from context-free token-specific value vectors in a Bank of Values lookup table, improving performance over standard attention with less compute.