FigSIM is the first annotated dataset for fine-grained suicide severity and figurative language in suicide memes, accompanied by benchmarks on 16 unimodal and multimodal models.
Deep residual learning for image recognition
Mixed citation behavior. Most common role is method (46%).
claims ledger
- method These channels are not independent signals but jointly represent a single complex-valued measurement, where the relationship between them encodes the local phase. Unlike magnitude-only approaches, where a single intensity channel is compressed, this coupling must be explicitly preserved. The architecture, loss function, and evaluation metrics described below are designed accordingly. The architecture is implemented as a ResNet-based [20] conditional variational autoencoder (CVAE) [21]. The encod
- method Together, these considerations make a scalable, high-speed, and robust reconstruction capable of operating at Monte Carlo scale essential for Hyper-Kamiokande. Machine-learning based reconstruction offers a promising path toward meeting these computational and topological challenges. Convolutional neural networks [16], and in particular residual networks (ResNets) [17], are well suited to process the high-dimensional charge and time images recorded by the PMT array. At Super-Kamiokande, machi
- method Instead of binary classification, our model classifies into four states (LL, L, H, HH), and instead of training CNN feature extractors from scratch, we use pre-trained ResNet50 using transfer learning. The model architecture is shown in Figure 3. 3.6.1 Feature extraction. The first step is to extract features from each of the seven images. Here we apply transfer learning using ResNet50 [22], pre-trained on a large dataset. We extract information from the penultimate layer of ResNet50, compressing ea
- dataset historical video and recomputes attention upon query arrival. (2) ReKV [12] retrieves query-relevant KVCache at the token level. (3) LiveVLM [13] further combines token-level retrieval with KVCache compression to reduce memory usage. (4) StreamMem [14] also compresses KVCache, but under a TABLE II DATASET CONFIGURATIONS. Dataset Max Length Description MLVU [19] 703s multi-task long video LongVideoBench [20] 468s long-term multi-modal video VideoMME [21] 1,018s full-spectrum multi-modal video RVS
- background Training on such data could reinforce areas where AI systems are vulnerable [37, 796], enhancing their robustness in real-world applications. Adversarial examples can be constructed in various ways. One straightforward approach is to add small perturbations to inputs, which preserves their original labels while introducing adversarial characteristics [100, 260, 300, 504]. Another effective strategy is red teaming, which usually involves human teams systematically testing to find vulnerabilities
- method histopathological images [2], [4], [5], [6]. CNNs have been widely adopted for cancer detection due to their ability to capture local texture patterns and hierarchical spatial features. Residual learning has been introduced to alleviate the vanishing gradient problem, leading to significant improvements in deep feature representation, as exemplified by ResNet architectures [7]. Similarly, DenseNet and kernel architectures enhance feature reuse and gradient flow, while EfficientNet achieves state-
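Several of the excerpts above refer to residual learning. As a rough illustration only, the sketch below shows the skip connection y = ReLU(F(x) + x) in plain Python. It is a simplification under stated assumptions: F is two small linear maps with a ReLU between them (standing in for the convolution and normalization layers of an actual ResNet block), and the names `relu`, `linear`, and `residual_block` are hypothetical helpers, not taken from any cited paper.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def linear(weights, v):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F(x) = w2 @ relu(w1 @ x).

    Assumption: linear maps stand in for the convolutional layers
    of a real residual block; shapes must keep len(F(x)) == len(x).
    """
    hidden = relu(linear(w1, x))
    fx = linear(w2, hidden)
    # Skip connection: the input is added back to the learned residual.
    return relu([f + xi for f, xi in zip(fx, x)])


x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
# With all-zero weights F(x) = 0, so the block reduces to relu(x).
print(residual_block(x, zeros, zeros))
```

When the residual branch contributes nothing, the input still passes through the skip path, which is the intuition behind the excerpts' remarks that residual connections ease the training of deep networks.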
representative citing papers
Quantitative Bayesian inference using a deep-learning emulator detects 0.018-0.020 M_sun of helium in the Type Ic supernova 2014L.
Spatial multiplexing in optical neural networks is repurposed as a trainable representational coordinate, demonstrated in multi-layer architectures for image classification, regression, and hybrid vision-language captioning with over one million optical phase parameters.
DELOS applies contrastive learning to phase-folded light curves to detect shallow intermediate-to-long period transits, reporting 15.5% and 11.25% gains in combined precision-recall over BLS and TLS in low-SNR tests plus 3-80x speedups.
Argus enables backdoor detection in decentralized ML by collaborative neighbor-based validation of triggers, backed by convergence theory and reducing attack success by up to 90% on tested datasets.
RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.
PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.
LLQR+SAM pairs a slow learned geometry preconditioner with fast SAM perturbations to amplify escape from locally sharp 'potholes' while stabilizing flat basins, producing consistent gains over SAM and LLQR alone.
VCR learns valid contextual representations for incomplete wearable signals via orthogonal disentanglement and missing-aware mixture-of-experts, improving robustness across full and missing-modality settings.
The paper develops a martingale-consistent SSL framework enforcing expected coherence between coarse and refined predictions via new objectives and a Monte Carlo estimator, improving robustness under partial observations.
GPROF-IR is a CNN-based retrieval that uses temporal context in geostationary IR observations to produce precipitation estimates with lower error than prior IR methods and climatological consistency with PMW retrievals for integration into IMERG V08.
GEODE uses per-sample cosine-similarity scaling in a norm loss to preserve feature geometry for universal scorer-compatible OOD detection, matching or exceeding OE performance on CIFAR benchmarks.
Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.
FRTSearch reframes fast radio transient detection as instance segmentation on dynamic spectra and uses the segmented shapes to infer dispersion measure and time of arrival, achieving 98% recall with over 99.9% fewer false positives than traditional methods.
CapBench is a new multi-PDK dataset of post-layout 3D windows with high-fidelity capacitance labels and multiple ML-ready representations, plus baseline results showing CNN accuracy versus GNN speed trade-offs.
citing papers explorer
-
HASTE: A Framework for Training-Free, Dynamic, and Steerable Compression of Pre-Trained Convolutional Neural Networks
HASTE enables training-free dynamic compression of pre-trained CNNs by patch-wise LSH-based merging of redundant channels, reporting 46.2% FLOPs reduction on ResNet34 CIFAR-10 with 1.25% accuracy drop.
-
Event-based Gaze Control System for Accurate Real-time Spin Estimation in Professional Ball Games
An event-camera system with active gaze control and contrast-maximization spin estimation achieves real-time performance in table tennis with 8.8% magnitude error, 6.4° axis error, 3 ms latency, and 750 Hz throughput.
-
MATCH: Flow Matching for Multi-View Anomaly Detection
MATCH is the first flow matching method for multi-view anomaly detection, reporting SOTA results on Real-IAD and the first comprehensive evaluation on MANTA-Tiny while enabling real-time use by omitting the divergence term.
-
Mind the Gap: Disentangling Performance Bottlenecks in Video Instance Segmentation
An ILP-based oracle applied to seven VIS methods on YouTube-VIS and OVIS shows tracking instability as the dominant bottleneck, producing gaps exceeding 20 AP under occlusion while classification impact is secondary.
-
SDM: A Powerful Tool for Evaluating Model Robustness
SDM is a new staged gradient attack that reconstructs the adversarial objective around probability differences and reports stronger performance than prior methods like APGD.
-
MorphoHELM: A Comprehensive Benchmark for Evaluating Representations for Microscopy-Based Morphology Assays
MorphoHELM is a new benchmark for Cell Painting morphology representations that tests methods across increasing batch effect levels and finds classic computer vision strategies remain the strongest general-purpose performers.
-
Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception
Urban-ImageNet is a 2-million-image multi-modal dataset with HUSIC 10-class taxonomy enabling benchmarks for urban scene classification, cross-modal retrieval, and instance segmentation.
-
Rethinking the Need for Source Models: Source-Free Domain Adaptation from Scratch Guided by a Vision-Language Model
The paper introduces the VODA setting for domain adaptation from scratch using vision-language models and presents TS-DRD, which achieves competitive performance on standard benchmarks without source models.
-
Trust-SSL: Additive-Residual Selective Invariance for Robust Aerial Self-Supervised Learning
Trust-SSL introduces additive-residual trust weights in SSL to selectively handle corruptions in aerial imagery, yielding higher linear-probe accuracy and larger gains under severe degradations than SimCLR or VICReg.
-
Lightweight True In-Pixel Encryption with FeFET Enabled Pixel Design for Secure Imaging
SecurePix uses FeFET multidomain polarization states for in-pixel symmetric-key encryption, dropping ResNet-18 accuracy to 9.58% on MNIST and 6.98% on CIFAR-10 while supporting key-based decryption via lookup table.
-
Fusion2Print: Deep Flash-Non-Flash Fusion for Contactless Fingerprint Matching
Fusion2Print fuses flash-non-flash contactless fingerprints via attention-based networks and U-Net enhancement to reach AUC 0.999 and EER 1.12% with cross-domain compatibility.
-
Revisiting Image Manipulation Localization under Realistic Manipulation Scenarios
RITA models image manipulation localization as ordered sequence prediction with a new benchmark HSIM and HSS metric to handle multi-step editing processes.
-
MAGIC: Few-Shot Mask-Guided Anomaly Inpainting with Prompt Perturbation, Spatially Adaptive Guidance, and Context Awareness
MAGIC is a few-shot mask-guided anomaly inpainting framework using Gaussian prompt perturbation, spatially adaptive guidance, and context-aware mask alignment to produce high-fidelity, diverse anomalies that outperform prior methods on downstream detection tasks.
-
A welding penetration prediction model for laser welding process based on self-supervised learning using physics-informed neural networks
SimPhysNet achieves 96.06% accuracy classifying laser welding penetration states using self-supervised contrastive learning with a physics-informed neural network and prototypical networks on only 200 labeled images.
-
Radial Basis Function Networks as Projection Heads in Self-Supervised Learning
RBFN projection heads serve as competitive replacements for MLP heads in SSL and enable SNS, a label-free metric from RBF parameters that correlates strongly with logistic regression evaluation.
-
Gaussian Process Prior Variational Autoencoder for Endoscopic Videos
GPVAE replaces the standard VAE latent prior with a temporal Gaussian process prior, combined with endoscopy-specific encoders and specular masking, to achieve up to 26.1% lower image reconstruction RMSE on the C3VDv2 colonoscopy dataset.
-
GEOPHYS: The Geometry of Physical Plausibility
GEOPHYS defines five geometric properties of per-frame embeddings from image encoders that detect physical implausibility in videos with SOTA accuracy and serve as an efficient verifier.
-
Multimodal Action Diffusion for Robust End-to-End Autonomous Driving
Action Diffusion Transformer generates multimodal driving actions via diffusion and nearest-neighbor selection, claiming SOTA on Bench2Drive with 10x lower latency.
-
Deep Psychovisual Image Representations
Proposes a psychovisual-inspired deep learning method that encodes images in learned frequency sub-bands for interpretable semantic structures and reduced depth dependence.
-
Ultra-High-Definition Image Quality Assessment via Graph Representation Learning
UHD-GCN-BIQA models structural dependencies among sampled patches via a hybrid kNN graph and residual graph convolutions to achieve competitive PLCC and SRCC with the lowest RMSE on the UHD-IQA benchmark for blind ultra-high-definition image quality assessment.
-
Mechanistically Interpretable Neural Encoding Reveals Fine-Grained Functional Selectivity in Human Visual Cortex
MINE uses mechanistic interpretability on language-aligned image representations to generate per-voxel feature descriptions, validated via image generation and counterfactual edits that causally shift brain activation.
-
MooD: Perception-Enhanced Efficient Affective Image Editing via Continuous Valence-Arousal Modeling
MooD introduces continuous valence-arousal modeling with VA-aware retrieval and perception-enhanced guidance for efficient, controllable affective image editing, plus a new AffectSet dataset.
-
CAHAL: Clinically Applicable resolution enHAncement for Low-resolution MRI scans
CAHAL introduces a physics-informed mixture-of-experts super-resolution network for clinical MRI that conditions on resolution and anisotropy and uses edge-penalised, Fourier, and segmentation-guided losses to reduce hallucinations compared with prior generative methods.
-
Physics-Informed Tracking (PIT)
PIT uses a neural autoencoder with a differentiable physics module and a new Physics-Informed Landmark Loss to track single particles in video, achieving sub-pixel accuracy in supervised and unsupervised modes.
-
Variational Feature Compression for Model-Specific Representations
A variational latent bottleneck with KL regularization and a dynamic binary mask based on saliency produces model-specific features that keep high accuracy for one classifier but drop others below 2% on CIFAR-100 with over 45x suppression.
-
Toward Unified Fine-Grained Vehicle Classification and Automatic License Plate Recognition
UFPR-VeSV is a new real-world dataset for fine-grained vehicle classification and automatic license plate recognition collected from Brazilian police cameras, with benchmarks demonstrating its difficulty and the value of joint task use.
-
Physical Knot Classification Beyond Accuracy: A Benchmark and Diagnostic Study
New knot classification benchmark and topology-aware supervision methods yield small specificity gains but confirm that appearance bias remains the dominant failure mode.
-
Holi-DETR: Holistic Fashion Item Detection Leveraging Contextual Information
Holi-DETR improves fashion item detection by integrating co-occurrence probabilities, inter-item spatial arrangements, and body keypoint relationships into the DETR architecture.
-
A Generalist Model for Diverse Text-Guided Medical Image Synthesis
MediSyn is a generalist latent diffusion model that synthesizes text-guided medical images across multiple specialties and modalities from public data and improves downstream classifiers in low-data settings.
-
FedLAS: Feature-Modulated Bidirectional Label Smoothing for Neural Network Calibration
FedLAS adds feature-norm based confidence detection and bidirectional gating to label smoothing losses to reduce calibration error on vision benchmarks while preserving accuracy.
-
Improving Adversarial Robustness via Activation Amplification and Attenuation
A3 is a learnable activation scaling module that trains on amplified adversarial signals via contrastive losses to improve robustness when the same parameters are used in attenuation mode.
-
A Controlled Study of CLIP-Based Body-Scene Fusion for Emotion Recognition in Context
Controlled study finds CLIP-based body-scene fusion model for emotion recognition on EMOTIC is not improved by context debiasing or rare-class training, with best mAP of 34.52%.
-
Test-Time Adaptation in Optical Coherence Tomography Using Trajectory-Aligned Time-Independent Flow
Flow-matching TTA with histogram matching to synthetic reference trajectories and time-independent flow achieves SOTA segmentation of AMD biomarkers in OCT.
-
Multi-FRuGaL: Multimodal Flexible Redundancy-aware Decomposed Gated Learning for Cancer Diagnosis and Prognosis
Multi-FRuGaL is a decomposition-aware gated fusion framework for multimodal cancer data that maintains performance under missing modalities and reports AUC gains on two head-and-neck cancer cohorts.
-
From Local Training to Large-Scale Mapping: A Comparative Assessment of Machine Learning and Deep Learning for Transferable Satellite-Derived Bathymetry
Deep CNNs with spatial continuity preservation and a new weighted loss function outperform Random Forest in cross-regional transfer for satellite-derived bathymetry, achieving low RMSE on independent tests and a public benchmark.
-
OSS: Open Suturing Skills Vision-Based Assessment Challenge 2024-2025
The OSS Challenge provides benchmarks showing spatiotemporal video models excel at open suturing skill classification and OSATS scoring but struggle with keypoint tracking under occlusion.
-
When Does Sparse MoE Help in Vision? The Role of Backbone Compute Leverage in Sparse Routing
Sparse MoE vision models show positive accuracy gaps only when routing a substantial compute fraction ρ and using k≥2 experts at large scale; batch-axis dispatch is identified as a key failure mode.
-
Rethinking the Good Enough Embedding for Easy Few-Shot Learning
Frozen DINOv2-L features with k-NN classification and PCA/ICA refinement achieve state-of-the-art few-shot performance on four benchmarks without any backpropagation or fine-tuning.
-
Venus-DeFakerOne: Unified Fake Image Detection & Localization
DeFakerOne is a unified foundation model for joint image-level fake image detection and pixel-level localization that reports SOTA results on 39 detection and 9 localization benchmarks.
-
LAMES: A Large-Scale and Artisanal Mining Environmental Segmentation Dataset
LAMES is a new annotated remote-sensing dataset covering 150 large-scale mining sites and 870 km² of artisanal mining for environmental segmentation and monitoring tasks.
-
A Unified Framework for the Detection and Classification of Fatty Pancreas in Ultrasound Images
A TransUNet-based segmentation followed by texture comparison classifies fatty pancreas in ultrasound with 89.7% accuracy on a small clinical dataset.
-
Architecture-agnostic Lipschitz-constant Bayesian header and its application to resolve semantically proximal classification errors with vision transformers
LipB-ViT adds bi-Lipschitz Bayesian layers to vision transformers and uses uncertainty-aware fusion to identify corrupted labels with over 93% recall at 15% noise, beating kNN baselines.
-
The autoPET3 Challenge: Automated Lesion Segmentation in Whole-Body PET/CT – Multitracer Multicenter Generalization
The autoPET3 challenge finds that leading AI models reach a mean Dice score of 0.66 for multitracer PET/CT lesion segmentation, with compositional generalization to unseen tracer-center pairs remaining an open problem driven by volume overestimation and case heterogeneity.
-
H-SemiS: Hierarchical Fusion of Semi and Self-Supervised Learning for Knee Osteoarthritis Severity Grading
H-SemiS decomposes multi-class KOA severity grading into binary sub-tasks in a semi-supervised setup with self-supervision and quantum-inspired mixing, outperforming baselines on two multi-class and two binary datasets.
-
Seeing Candidates at Scale: Multimodal LLMs for Visual Political Communication on Instagram
GPT-4o achieves macro F1 scores of 0.89 for politician face recognition and 0.86 for person counting in election Instagram stories, outperforming FaceNet512, RetinaFace, and Google Cloud Vision.
-
Training-inference input alignment outweighs framework choice in longitudinal retinal image prediction
Training-inference input alignment outweighs framework choice for longitudinal retinal image prediction, with deterministic regression matching complex models when acquisition variability dominates disease progression.
-
Learning to Look before Learning to Like: Incorporating Human Visual Cognition into Aesthetic Quality Assessment
AestheticNet improves aesthetic quality assessment by fusing a gaze-aligned visual encoder pre-trained on eye-tracking data with semantic encoders via cross-attention, yielding consistent gains over semantic-only baselines.
-
Weak-to-Strong Knowledge Distillation Accelerates Visual Learning
Weak-to-strong knowledge distillation applied early and then turned off accelerates convergence to target performance in visual learning tasks by factors of 1.7-4.8x.
-
Automatic Road Subsurface Distress Recognition from Ground Penetrating Radar Images using Deep Learning-based Cross-verification
A cross-verification strategy using three YOLO models trained on distinct views of a 2134-sample 3D GPR dataset detects road subsurface distress with over 98.6 percent recall on field data.
-
New pointwise convolution in Deep Neural Networks through Extremely Fast and Non Parametric Transforms
Replacing pointwise convolutions with DWHT yields a model with 79.1% fewer parameters, 48.4% fewer FLOPs, and 1.49% higher accuracy than MobileNet-V1 on CIFAR-100.