ImageNet Large Scale Visual Recognition Challenge
50 Pith papers cite this work, alongside 30,004 external citations. Polarity classification is still indexing.
claims ledger
- dataset Table 4: Common datasets used in CDOD benchmarks, summarizing modality, scale, annotation volume, typical role, and dominant shift type. Acronyms: S = Source, T = Target. Symbol: ∼ indicates approximate counts.
  Dataset | Year | Modality | #Images | #Cls | #Anno | Role | Domain Shift
  PASCAL VOC [95] | 2007-2012 | RGB | ∼16.5K | ∼20 | ∼40K | S/T | mild scene shift
  MS COCO [96] | 2014 | RGB | ∼330K | ∼80 | ∼2.5M | S | scene diversity
  ImageNet DET [97] | 2013 | RGB | ∼450K | ∼200 | ∼500K | S | fine-grained category
  Cityscapes [98] | 2016 | RGB | ∼3.0K | ∼8 | ∼65K | T | urban sce
- dataset Finally, if $g_1$ and $g_2$ both do not depend on the second argument, (3) is a linear parabolic SPDE with additive noise: $\mathrm{d}U_t = \alpha_1(t)\,\Delta U_t\,\mathrm{d}t + \alpha_2(t)\,\mathrm{d}W_t$ for all $t \in I$. (20) I Numerical simulation. For the numerical simulation of the forward and backward processes, (3) and (1), we modeled the image space $\Lambda$ as $\Lambda = (0, d_1) \times (0, d_2)$ and decomposed the boundary $\partial\Lambda$ according to $\partial_L\Lambda := \{0\} \times [0, d_2)$; (21) $\partial_T\Lambda := [0, d_1) \times \{d_2\}$; (22) $\partial_R\Lambda := \{d_1\} \times (0, d_2]$; (23) $\partial_B\Lambda := (0, d_1] \times \{0\}$ (24) into its left, top, right and bottom
- background ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115, 3 (2015), 211-252. doi:10.1007/s11263-015-0816-y [41] Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. 2015. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition. 567-576. [42] Alex Tamkin, Mike Wu, and Noah D. Goodman. 2020. Viewmaker Networks: Learning Views for Unsupervised Representation Learning. ArXivab
- background 1 Introduction In recent years, the emergence and evolution of auto-regressive models [18, 44, 66] and diffusion models [32, 61, 16, 50, 58, 55, 56] have led to AI-generated content (AIGC) becoming increasingly realistic and widely applied across industries, bringing convenience to fields such as entertainment [51, 2, 63], advertising [39, 17], and medicine [60, 83]. This progress is particularly evident in AI-synthesized images, which have seen gradual improvements in resolution and semantic
citing papers explorer
- GPUBreach: Privilege Escalation Attacks on GPUs using Rowhammer
Unprivileged CUDA kernels can use Rowhammer to tamper with GPU page tables for targeted privilege escalation, leaking cryptographic keys and escalating to CPU root access by bypassing IOMMU.
- Understanding deep learning requires rethinking generalization
State-of-the-art convolutional networks easily memorize random labels and unstructured noise images, indicating that generalization in deep learning cannot be explained by traditional capacity or regularization arguments.
- Structure Before Collapse: Transient semantic geometry in next-token prediction
Semantic geometry emerges transiently early in next-token prediction training before collapsing to Neural Collapse symmetry in synthetic settings with latent semantic factors.
- Rethinking Token Reduction for Diffusion Models via Output-Similarity-Awareness
DiTo shifts token reduction in DiTs to output token similarity, reusing prior-step matches across timesteps with PMR scheduling and frequency-aware penalties to raise PSNR at given speedups.
- ImageAttributionBench: How Far Are We from Generalizable Attribution?
ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.
- Single-Thread JPEG Decoder Benchmarks Mis-Evaluate ML Data Loaders
Single-thread JPEG benchmarks misrank decoders for ML DataLoader use, with rankings changing across CPUs and worker counts; torchvision and simplejpeg perform best in measured DataLoader tiers.
- Taming the Entropy Cliff: Variable Codebook Size Quantization for Autoregressive Visual Generation
Variable codebook sizes that increase along the sequence in visual tokenizers reduce generation FID scores significantly for autoregressive models on ImageNet.
- Representational Alignment Across Model Layers and Brain Regions with Multi-Level Optimal Transport
Multi-Level Optimal Transport (MOT) jointly infers soft layer couplings and neuron transport plans to produce global alignment scores and structured hierarchical correspondences between networks of varying depths.
- ClusterMark: Towards Robust Watermarking for Autoregressive Image Generators with Visual Token Clustering
ClusterMark applies visual token clustering to create robust in-generation watermarks for autoregressive image models, improving detectability under perturbations compared to direct token biasing while preserving quality.
- SCOOTER: A Human Evaluation Framework for Unrestricted Adversarial Examples
SCOOTER supplies best-practice guidelines, open tools, and a 3K-image benchmark with 34K+ human ratings showing that six tested unrestricted attacks produce images humans can detect as fake.
- LAION-5B: An open large-scale dataset for training next generation image-text models
LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.
- Pose Estimation for Non-Cooperative Rendezvous Using Neural Networks
SPN is a CNN that detects a spacecraft bounding box, classifies then regresses attitude, and optimizes position via Gauss-Newton, achieving degree-level attitude and cm-level position errors on real images after training only on synthetic data.
- Mixed Precision Training
Mixed precision training uses FP16 for most computations, FP32 master weights for accumulation, and loss scaling to enable accurate training of large DNNs with halved memory usage.
- Full spectrum Unlearnable Examples via Spectral Equalization
FUSE creates full-spectrum unlearnable perturbations using random spectral masking during training and cross-band guidance to enforce consistency between frequency components.
- Radial Basis Function Networks as Projection Heads in Self-Supervised Learning
RBFN projection heads serve as competitive replacements for MLP heads in SSL and enable SNS, a label-free metric from RBF parameters that correlates strongly with logistic regression evaluation.
- Jaguar: Fast Private CNN Inference with Power-of-Two Homomorphic Arithmetic
Jaguar replaces prime-modulus HE with power-of-two arithmetic to enable coefficient-domain convolution and local-shift truncation, reporting 2-3.7x lower latency than Cheetah and Rhombus on ResNet-18/50 and MobileNetV2.
- CSFlow: Aligning Flow Matching with Human Contrast Sensitivity
CSFlow derives inference-time timestep weights for flow matching by matching per-step frequency content to human CSF, yielding 4.7% FID reduction and smaller gains on IS and GenEval.
- Scaling Parallel Sequence Models to Foundation-Scale Vision Encoders
C-GSPN scales 2D spatial propagation to foundation vision encoders via a fast CUDA kernel, compressed blocks, and two-stage distillation, matching ViT performance with 15% fewer parameters and 4x block speedup at 2K resolution.
- The Trust Paradox: How CS Researchers Engage LLM Leaderboards
CS researchers show pragmatic skepticism toward LLM leaderboards, using them despite distrust while preferring peer networks and arena leaderboards, and identifying cost transparency as a key missing feature.
- Uncovering the Latent Potential of Deep Intermediate Representations
Introduces LOES, a constructive spectral method to select task-discriminative subspaces from intermediate layer embeddings, and GeoReg for enforcing simplicial class geometry during fine-tuning, with reported gains increasing with model depth across modalities.
- Multi-Scale Generative Modeling with Heat Dissipation Flow Matching
HDFM adds a continuous heat-dissipation (blur) process to flow matching, aligns an interpolated path to fix ill-posed inverse heat dissipation, and uses x-prediction to ease high-dimensional regression, yielding better performance than most baselines on image datasets.
- Mechanistically Interpretable Neural Encoding Reveals Fine-Grained Functional Selectivity in Human Visual Cortex
MINE uses mechanistic interpretability on language-aligned image representations to generate per-voxel feature descriptions, validated via image generation and counterfactual edits that causally shift brain activation.
- From DES to KiDS: Domain adaptation for cross-survey detection of low-surface-brightness galaxies
Domain adaptation with an ensemble of CNN and transformer models trained on DES detects 20,180 LSBGs and 434 UDGs in KiDS DR5, with structural parameters and environmental trends consistent with known samples.
- Score-Based Generative Modeling through Anisotropic Stochastic Partial Differential Equations
Anisotropic SPDEs preserve geometric data structure over longer timescales in score-based generative modeling, yielding better image quality than standard SDE baselines and flow matching in unconditional and conditional tasks.
- Response Time Enhances Alignment with Heterogeneous Preferences
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
- ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters
ViTok-v2 is a 5B-parameter native-resolution image autoencoder using NaFlex and DINOv3 loss that matches or exceeds prior tokenizers at 256p and outperforms them at 512p and above while advancing the Pareto frontier in joint scaling with generators.
- Detecting Adversarial Data via Provable Adversarial Noise Amplification
A provable adversarial noise amplification theorem under sufficient conditions enables a custom-trained detector that identifies adversarial examples at inference time using enhanced layer-wise noise signals.
- Efficient Adversarial Training via Criticality-Aware Fine-Tuning
CAAT selects critical parameters for adversarial robustness in ViTs and applies PEFT to tune only those, yielding a 4.3% robustness drop versus full AT while using ~6% of parameters.
- On the Robustness of Watermarking for Autoregressive Image Generation
Watermarking schemes for autoregressive image generation fail against removal and forgery attacks, enabling false detections and undermining synthetic content filtering.
- EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models
EmergentBridge enhances zero-shot cross-modal performance on unpaired modalities by learning noisy bridge anchors from existing alignments and enforcing proxy alignment only in the orthogonal subspace to avoid gradient interference.
- Complex Facial Expression Recognition Using Deep Knowledge Distillation of Basic Features
Continual learning via knowledge distillation achieves SOTA 74.28% accuracy on new compound facial expression classes and 100% in one-shot learning.
- Learning Effective Loss Functions Efficiently
An anytime algorithm for learning loss functions that is asymptotically optimal in the worst case and experimentally faster than prior methods for hyperparameter tuning.
- OSS: Open Suturing Skills Vision-Based Assessment Challenge 2024-2025
The OSS Challenge provides benchmarks showing spatiotemporal video models excel at open suturing skill classification and OSATS scoring but struggle with keypoint tracking under occlusion.
- Accelerating Vision Foundation Models with Drop-in Depthwise Convolution
Replacing selected attention heads in pretrained ViTs with depthwise convolutions, identified by simple strategies and recovered via fine-tuning, delivers 17-20% inference speedup on image tasks with minimal accuracy loss.
- A generalised pre-training strategy for deep learning networks in semantic segmentation of remotely sensed images
A novel pre-training strategy for ImageNet-initialized models achieves state-of-the-art semantic segmentation performance on four remote sensing datasets (iSAID, MFNet, PST900, Potsdam) by reducing domain-specific feature learning during pre-training.
- Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer
The OG-ReG Transformer achieves state-of-the-art results on Kinetics-400, Something-Something v2, and Diving-48 by combining global glance and local gaze processing paths.
- CHiQPM: Calibrated Hierarchical Interpretable Image Classification
CHiQPM is a hierarchical interpretable image classifier that maintains 99% of non-interpretable model accuracy while supplying contrastive global explanations, human-like hierarchical paths, and calibrated interpretable set predictions via conformal prediction.
- Where Do Tokens Go? Understanding Pruning Behaviors in STEP at High Resolutions
STEP uses dynamic superpatch merging via dCTS and early token exits to cut token count by 2.5x and computational complexity by up to 4x on ViT-Large for high-res segmentation, with at most 2% accuracy drop and 40% tokens halted early.
- TOAST: Transformer Optimization using Adaptive and Simple Transformations
TOAST approximates full transformer blocks in pretrained models via lightweight closed-form mappings to cut parameters and FLOPs without retraining or finetuning.
- Adversarially Trained Deep Neural Semantic Hashing Scheme for Subjective Search in Fashion Inventory
Adversarial deep semantic hashing for fashion retrieval achieves 90.65% mAP, outperforming prior deep Cauchy hashing at 53.26%.
- A Utility-Preserving GAN for Face Obscuration
UP-GAN uses a GAN to obscure faces while preserving utility attributes like age, gender, pose, and expression better than blurring or pixelation.
- Gradient Noise Convolution (GNC): Smoothing Loss Function for Distributed Large-Batch SGD
GNC convolves stochastic gradient noise to smooth sharp minima in large-batch SGD, outperforming isotropic noise for better generalization in distributed deep learning.
- Formal Concept Lattices are Good Semantic Scaffolds for Concept-Based Learning
Formal concept lattices guide staged, hierarchical concept learning in deep networks to produce more interpretable and semantically structured representations.
- CoarseSoundNet: Building a reliable model for ecological soundscape analysis
The paper introduces CoarseSoundNet, a deep learning model for classifying biophony, geophony, and anthropophony in passive acoustic monitoring recordings, reporting performance gains from additional similar data, a silence class, and decision thresholds, plus a case study on acoustic index trends.
- Debunking Grad-ECLIP: A Comprehensive Study on Its Incorrectness and Fundamental Principles for Model Interpretation
Grad-ECLIP is an equivalent but flawed variant of attention-based interpretation, with two principles proposed to ensure model explanations reflect the original model.
- Opportunistic Bone-Loss Screening from Routine Knee Radiographs Using a Multi-Task Deep Learning Framework with Sensitivity-Constrained Threshold Optimization
STR-Net achieves AUROC of 0.933 for binary bone-loss screening and 0.801 correlation for T-score estimation from knee X-rays on a held-out test set.
- Robustness Analysis of USmorph: II. Optimizing Feature Extraction, Dimensionality Reduction, and Clustering for Unsupervised Galaxy Morphology Classification
Optimizes ImageNet-pretrained AlexNet, UMAP, and a bagging multi-cluster voting scheme with K-means, Birch and Agg for unsupervised galaxy morphology classification, reporting improved stability and consistency with galaxy evolution expectations.
- Generalization Under Scrutiny: Cross-Domain Detection Progresses, Pitfalls, and Persistent Challenges
A survey that organizes methods for cross-domain object detection into a taxonomy, analyzes domain shift across detection stages, and outlines persistent challenges.
- RGB-D image-based Object Detection: from Traditional Methods to Deep Learning Techniques
A survey of RGB-D object detection from traditional hand-crafted features with machine learning to deep learning techniques.
- Symmetrization of Loss Functions for Robust Training of Neural Networks in the Presence of Noisy Labels