Very Deep Convolutional Networks for Large-Scale Image Recognition
Abstract
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
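The abstract's central design choice, deep stacks of very small 3x3 filters, rests on a receptive-field argument made in the paper: two stacked 3x3 convolutions cover the same 5x5 window as one larger filter, and three cover a 7x7 window, while using fewer parameters and interposing more non-linearities. A minimal sketch of that arithmetic (helper names are my own, not from the paper):

```python
def receptive_field(num_layers, kernel=3, stride=1):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    rf, jump = 1, 1
    for _ in range(num_layers):
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

def conv_params(kernel, channels):
    """Weights in one conv layer mapping `channels` -> `channels` feature maps
    (biases ignored for simplicity)."""
    return kernel * kernel * channels * channels

# Two 3x3 layers see as much as one 5x5; three see as much as one 7x7.
assert receptive_field(2) == 5
assert receptive_field(3) == 7

# For C = 256 channels: three 3x3 layers use 27*C^2 = 1,769,472 weights,
# versus 49*C^2 = 3,211,264 for a single 7x7 layer (~45% fewer).
C = 256
print(3 * conv_params(3, C), "vs", conv_params(7, C))
```

This is the trade the 16-19-layer configurations exploit: depth and extra rectification at lower parameter cost than wide filters.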
Forward citations
Cited by 60 Pith papers
-
Density estimation using Real NVP
Real NVP uses affine coupling layers to create invertible transformations that support exact density estimation, sampling, and latent inference without approximations.
-
U-Net: Convolutional Networks for Biomedical Image Segmentation
A u-shaped fully-convolutional encoder-decoder with skip connections trained with elastic-deformation augmentation produces accurate biomedical image segmentations from very small training sets.
-
Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations
CoDAAR creates a unified discrete representation space for multimodal sequences by aligning modality-specific codebooks through index-level semantic consensus, enabling both specificity and cross-modal generalization.
-
TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles
TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.
-
Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models
Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.
-
Empirical Evidence for Simply Connected Decision Regions in Image Classifiers
Empirical tests with quad-mesh filling indicate that decision regions in modern image classifiers are simply connected.
-
Retain-Neutral Surrogates for Min-Max Unlearning
ROSU derives a closed-form retain-neutral perturbation for min-max unlearning that bounds retain damage via curvature and improves performance when gradients are aligned.
-
DMGD: Train-Free Dataset Distillation with Semantic-Distribution Matching in Diffusion Models
DMGD achieves better performance than fine-tuned SOTA methods in dataset distillation on ImageNet subsets by using semantic matching through conditional likelihood optimization and OT-based distribution matching in a ...
-
Heterogeneous Model Fusion for Privacy-Aware Multi-Camera Surveillance via Synthetic Domain Adaptation
HeroCrystal uses single-image diffusion synthesis, probabilistic federated Faster R-CNN with contrastive debiasing, and inconsistent-category integration to reach 33.4% mAP in privacy-preserving multi-camera object detection.
-
Dual-branch Robust Unlearnable Examples
DUNE creates robust unlearnable examples through dual-branch spatial-color perturbation optimization and ensemble strategies, achieving lower average test accuracies of 14.95% to 50.82% than 12 prior methods against 7...
-
Hierarchical Spatio-Channel Clustering for Efficient Model Compression in Medical Image Analysis
A spatio-channel clustering framework for CNN compression reduces FLOPs by 81% and raises brain tumor MRI classification accuracy from 87.76% to 89.80% compared with global SVD and Tucker baselines.
-
KAConvNet: Kolmogorov-Arnold Convolutional Networks for Vision Recognition
KAConvNet introduces a Kolmogorov-Arnold Convolutional Layer to build networks competitive with ViTs and CNNs while offering stronger theoretical interpretability.
-
Different Strokes for Different Folks: Writer Identification for Historical Arabic Manuscripts
CNN models with attention reach 99.05% top-1 accuracy on line-level splits and 78.61% on page-disjoint splits for writer identification after expanding the labeled portion of the Muharaf historical Arabic manuscript dataset.
-
Causal Disentanglement for Full-Reference Image Quality Assessment
Causal disentanglement decouples content and degradation representations via intervention on latents and a content-masking module to predict quality scores from degradation features, achieving strong benchmark performance.
-
MESA: A Training-Free Multi-Exemplar Deep Framework for Restoring Ancient Inscription Textures
MESA restores ancient inscription textures via multi-exemplar style transfer from VGG19 features with per-layer exemplar selection and OCR-derived weights, without any model training.
-
Channel-Level Semantic Perturbations: Unlearnable Examples for Diverse Training Paradigms
Unlearnable examples fail under pretraining-finetuning due to semantic filtering by frozen layers, but Shallow Semantic Camouflage restores effectiveness by confining perturbations to semantically valid subspaces.
-
Physically-Induced Atmospheric Adversarial Perturbations: Enhancing Transferability and Robustness in Remote Sensing Image Classification
FogFool creates fog-based adversarial perturbations using Perlin noise optimization to achieve high black-box transferability (83.74% TASR) and robustness to defenses in remote sensing classification.
-
VidTAG: Temporally Aligned Video to GPS Geolocalization with Denoising Sequence Prediction at a Global Scale
VidTAG achieves fine-grained global video-to-GPS geolocalization via temporal frame alignment and denoising sequence refinement, reporting 20% gains at 1 km over GeoCLIP and 25% on CityGuessr68k.
-
Ghosts of eruptions past: Searching for historical Galactic supernovae using variable thermal dust echoes and machine learning
An all-sky NEOWISE-based search using difference imaging and a CNN classifier trained on Cas A echoes detects no other historical Galactic supernova dust echoes at WISE sensitivity and delivers a catalog of 20477 Cas ...
-
Beyond Corner Patches: Semantics-Aware Backdoor Attack in Federated Learning
SABLE shows that semantics-aware natural triggers enable effective backdoor attacks in federated learning against multiple aggregation rules while preserving benign accuracy.
-
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
Vim is a bidirectional Mamba vision backbone that outperforms DeiT in accuracy on standard tasks while being substantially faster and more memory-efficient for high-resolution images.
-
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
-
A Simple Framework for Contrastive Learning of Visual Representations
SimCLR learns visual representations by contrasting augmented views of the same image and reaches 76.5% ImageNet top-1 accuracy with a linear classifier, matching a supervised ResNet-50.
-
Stereo Magnification: Learning View Synthesis using Multiplane Images
A deep network predicts multiplane images from narrow-baseline stereo pairs to synthesize novel views that extrapolate beyond the input baseline.
-
Mixed Precision Training
Mixed precision training uses FP16 for most computations, FP32 master weights for accumulation, and loss scaling to enable accurate training of large DNNs with halved memory usage.
-
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
MobileNets introduce depthwise separable convolutions plus width and resolution multipliers to produce efficient CNNs that trade off latency and accuracy for mobile and embedded vision applications.
-
Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
A pruning-quantization-Huffman pipeline compresses deep neural networks 35-49x without accuracy loss.
-
LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop
LSUN dataset of one million images per category across 30 classes is constructed via iterative human-in-the-loop deep learning labeling.
-
Hystar: Hypernetwork-driven Style-adaptive Retrieval via Dynamic SVD Modulation
Hystar adapts CLIP-like models to unseen query styles by generating per-input singular-value perturbations with a hypernetwork for attention layers and a new StyleNCE contrastive loss.
-
SURGE: Surrogate Gradient Adaptation in Binary Neural Networks
SURGE proposes a dual-path gradient compensator and adaptive scaler to learn better surrogate gradients for binary neural network training, outperforming prior methods on classification, detection, and language tasks.
-
Lightweight Unpaired Smartphone ISP Transfer with Semantic Pseudo-Pairing
Semantic pseudo-pairing via DINOv2 embeddings and fused Gromov-Wasserstein optimal transport enables training a 7K-parameter CNN for unpaired smartphone ISP, achieving 22.569 PSNR on the NTIRE 2026 challenge test set.
-
UniV2D: Bridging Visual Restoration and Semantic Perception for Underwater Salient Object Detection
UniV2D is a dual-branch network that lets high-level saliency masks guide low-level image restoration and lets restored features improve saliency detection, outperforming prior separate-stage methods on underwater benchmarks.
-
FREPix: Frequency-Heterogeneous Flow Matching for Pixel-Space Image Generation
FREPix achieves competitive FID scores on ImageNet by decomposing image generation into separate low- and high-frequency paths within a flow matching framework.
-
Region Seeding via Pre-Activation Regularization: A Geometric View of Piecewise Affine Neural Networks
A geometric theory yields a region-seeding regularizer that increases realized affine regions and improves early accuracy in piecewise affine neural networks.
-
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.
-
Anatomy of a failure: When, how, and why deep vision fails in scientific domains
Deep learning on information-rich scientific images collapses to one-dimensional predictions due to a mismatch between data priors and the model's simplicity bias, even after robustification techniques.
-
MASRA: MLLM-Assisted Semantic-Relational Consistent Alignment for Video Temporal Grounding
MASRA improves video temporal grounding accuracy by using MLLM-generated textual priors for event semantic alignment and local relational consistency during training only.
-
Gradient-Discrepancy Acquisition for Pool-Based Active Learning
A new gradient-discrepancy acquisition function derived from a generalization bound enables more effective pool-based active learning by selecting informative samples.
-
Differentiable Kernel Ridge Regression for Deep Learning Pipelines
Sparse Kernels turn kernel ridge regression into end-to-end differentiable PyTorch layers that support training-free transfer, nonlinear probing, and hybrid models while matching or augmenting neural readouts in some ...
-
Model Merging: Foundations and Algorithms
New cycle-consistent optimization, task vector theory, singular vector decompositions, adaptive routing, and efficient evolutionary search provide foundations for merging neural network weights across tasks.
-
Checkerboard: A Simple, Effective, Efficient and Learning-free Clean Label Backdoor Attack with Low Poisoning Budget
Checkerboard derives a closed-form checkerboard trigger for clean-label backdoor attacks that achieves over 94% ASR with poisoning rates as low as 0.46% on ImageNet-100 and 99.99% ASR with 20 samples on CIFAR-10.
-
Possibilistic Predictive Uncertainty for Deep Learning
DAPPr introduces a possibilistic framework that projects parameter posteriors to predictions via supremum and approximates them with Dirichlet possibility functions to yield efficient, closed-form epistemic uncertainty.
-
End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer
An end-to-end autoregressive model with a jointly trained 1D semantic tokenizer achieves state-of-the-art FID 1.48 on ImageNet 256x256 generation without guidance.
-
Fair Dataset Distillation via Cross-Group Barycenter Alignment
Dataset distillation introduces fairness gaps from subgroup pattern mismatches rather than just imbalance; distilling to a group-agnostic barycenter of predictive information reduces these gaps.
-
SECOS: Semantic Capture for Rigorous Classification in Open-World Semi-Supervised Learning
SECOS enables direct semantic label prediction in open-world semi-supervised learning by aligning representations with external knowledge for novel classes, outperforming prior methods by up to 5.4% even without post-...
-
VTBench: A Multimodal Framework for Time-Series Classification with Chart-Based Representations
Fusing chart visualizations with raw time series improves or maintains classification accuracy on UCR datasets when the visuals add non-redundant information.
-
Towards interpretable AI with quantum annealing feature selection
Quantum annealing solves a combinatorial optimization problem to select key CNN feature maps, yielding more class-disentangled explanations than GradCAM or GradCAM++.
-
ZID-Net: Zero-Inference Diffusion Prior Decoupling Network for Single Image Dehazing
ZID-Net decouples diffusion-based priors into a training-only head to create an efficient feed-forward network for single-image dehazing, reporting 40.75 dB PSNR on RESIDE and 19 ms inference.
-
BurstGP: Enhancing Raw Burst Image Super Resolution with Generative Priors
BurstGP enhances raw burst image super-resolution by integrating pretrained video diffusion priors through a multiframe-aware model, degradation-aware conditioning, and color-space conversion, outperforming prior methods.
-
H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers
H-Sets detects higher-order feature interactions in image classifiers via Hessian-guided pair merging and attributes them with IDG-Vis to generate more interpretable saliency maps than existing marginal or coarse methods.
-
DualSplat: Robust 3D Gaussian Splatting via Pseudo-Mask Bootstrapping from Reconstruction Failures
DualSplat bootstraps object-level pseudo-masks from initial 3DGS reconstruction failures using residuals and SAM2 to enable robust second-pass optimization in transient-heavy scenes.
-
LatRef-Diff: Latent and Reference-Guided Diffusion for Facial Attribute Editing and Style Manipulation
LatRef-Diff replaces semantic directions in diffusion models with latent and reference-guided style codes, uses a hierarchical style modulation module, and applies forward-backward consistency training to achieve state-of-the-art results.
-
Rethinking Intrinsic Dimension Estimation in Neural Representations
Common ID estimators fail to track the true intrinsic dimension of neural representations and are instead driven by other factors.
-
Embedding Arithmetic: A Lightweight, Tuning-Free Framework for Post-hoc Bias Mitigation in Text-to-Image Models
Embedding Arithmetic performs vector operations in the embedding space of T2I models to mitigate bias at inference time, outperforming baselines on diversity while preserving coherence via a new Concept Coherence Score.
-
EmbodiedHead: Real-Time Listening and Speaking Avatar for Conversational Agents
EmbodiedHead introduces a Rectified-Flow Diffusion Transformer with differentiable renderer and single-stream listening-speaking conditioning to achieve real-time high-fidelity conversational avatars.
-
Symphony: Taming Step Misalignments in the Network for Ring-based Collective Operations
Symphony detects step misalignments in ring collectives via lightweight in-network tracking and mitigates them by throttling outpacing flows with congestion signals, yielding up to 54% better communication times in As...
-
IA-CLAHE: Image-Adaptive Clip Limit Estimation for CLAHE
IA-CLAHE trains a lightweight network on a differentiable CLAHE extension to predict per-tile clip limits that drive local histograms toward a uniform distribution, delivering zero-shot gains in recognition accuracy a...
-
Impact of Nonlinear Power Amplifier on Massive MIMO: Machine Learning Prediction Under Realistic Radio Channel
ML model predicts nonlinear distortion in massive MIMO using 3D ray tracing channels, enabling power allocation with 12% median user throughput gain over fixed PA operation.
-
Cross-Modal Generation: From Commodity WiFi to High-Fidelity mmWave and RFID Sensing
RF-CMG synthesizes high-quality mmWave and RFID signals from WiFi using a diffusion model with Modality-Guided Embedding for high-frequency details and Low-Frequency Modality Consistency to preserve physical structure.
-
SIMMER: Cross-Modal Food Image--Recipe Retrieval via MLLM-Based Embedding
SIMMER uses a single multimodal LLM (VLM2Vec) with custom prompts and partial-recipe augmentation to embed food images and recipes, achieving new state-of-the-art retrieval accuracy on Recipe1M.