Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever · 2021 · cs.CV · arXiv 2103.00020

Mixed citation behavior. Most common role is background (69%).

265 Pith papers citing it

Background 69% of classified citations

abstract

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
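
The pre-training task in the abstract (predicting which caption goes with which image) and the zero-shot transfer step can be illustrated with a small sketch. The abstract does not spell out the loss, so the symmetric cross-entropy over a cosine-similarity matrix below is an assumption, as are the function names and the temperature value; the toy vectors stand in for encoder outputs and none of this is the released CLIP code.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(image_embs, text_embs, temperature=0.07):
    """Cosine similarity of every image with every text, divided by a temperature.
    The temperature is an illustrative assumption, not taken from the abstract."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[k]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy (assumed form)."""
    logits = similarity_matrix(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class description most similar to the image."""
    scores = similarity_matrix([image_emb], class_text_embs, temperature=1.0)[0]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy batch: two images and two captions with hand-made embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
captions_matched = [[1.0, 0.0], [0.0, 1.0]]
captions_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = contrastive_loss(images, captions_matched)
loss_swapped = contrastive_loss(images, captions_swapped)
predicted_class = zero_shot_classify([0.9, 0.1], captions_matched)
```

In this toy batch the loss is near zero when each image is paired with its own caption and large when the captions are swapped, which is the signal the pre-training task relies on. Zero-shot classification then reuses the same similarity: class names are embedded as text and the closest one is chosen, with no task-specific training.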

citation-role summary

background 36 · method 8 · baseline 4 · other 1

citation-polarity summary

background 34 · use method 8 · baseline 4 · unclear 2 · support 1

representative citing papers

Unlocking the Visual Record of Materials Science: A Large-Scale Multimodal Dataset from Scientific Literature

cs.CV · 2026-06-29 · accept · novelty 8.0

MatMMExtract pipeline creates MatSciFig dataset of 391k annotated materials science figure panels and MaterialScope detection dataset with high accuracy.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

Editing Models with Task Arithmetic

cs.LG · 2022-12-08 · accept · novelty 8.0

Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.

Prompt-to-Prompt Image Editing with Cross Attention Control

cs.CV · 2022-08-02 · unverdicted · novelty 8.0

Cross-attention control in text-conditioned models enables localized and global image edits by editing only the input text prompt.

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

cs.CV · 2022-08-02 · unverdicted · novelty 8.0

Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.

DART: Difficulty-Adaptive Routing for Zero-Shot Video Temporal Grounding

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

DART routes zero-shot video temporal grounding queries by difficulty using DPP entropy, achieving up to 3.5 mIoU gains with 7x fewer frames on Charades-STA and ActivityNet Captions.

SonoCLIP: Mask-Guided Region-Aware Vision-Language Pretraining for Fetal Ultrasound Analysis

cs.CV · 2026-06-28 · unverdicted · novelty 7.0

SonoCLIP presents a mask-guided region-aware vision-language foundation model pretrained on 1.44M fetal ultrasound images, demonstrating superior zero-shot performance.

Agent-Computer Observation Interfaces Enable Dynamic Computer Use

cs.AI · 2026-06-28 · conditional · novelty 7.0

AOI adds keyframe capture, volume-gated audio transcription, and visual narration to computer-use agents, producing +17 to +48 pp gains over screenshot baselines on DynaCU-Bench with no retraining.

Learning from Acquisition: Metadata-driven Multimodal Pre-training for Cardiac MRI

cs.CV · 2026-06-27 · unverdicted · novelty 7.0

MetaCLIP-CMR applies CLIP-style contrastive learning to cardiac MRI by treating acquisition metadata as text labels, delivering 86.8% modality and 86.5% view accuracy plus top Dice scores on ACDC/M&Ms segmentation with far less pre-training data than recent large-scale CMR models.

Unleashing Infinite Motion: Scaling Expressive Quadrupedal Motion via Generative Video Priors

cs.RO · 2026-06-26 · conditional · novelty 7.0

Uni-Mo generates 7,488 language-annotated quadruped motions via LLM prompts and video diffusion, lifts them to 3D trajectories, and trains policies achieving 96.7% real-robot success on 392 sampled motions.

Evaluation Pitfalls and Challenges in Multimedia Event Extraction

cs.CL · 2026-06-25 · unverdicted · novelty 7.0

A systematic analysis of evaluation practices in multimedia event extraction reveals that minor methodological choices cause large performance swings and overestimation of cross-modal grounding ability.

Text Dictates, Music Decorates: Energy-based Attention for Editable Dance Motion Generation

cs.AI · 2026-06-22 · unverdicted · novelty 7.0

STREAM decouples text and music conditioning in a diffusion transformer via AdaLN for structure and BEAM for beats, plus new Motorica++ dataset and editability metrics, claiming SOTA music alignment with preserved semantics.

Beyond 'One Language, One Script': Quantifying Orthographic Bias in Multilingual VLMs with PuMVR

cs.CL · 2026-06-18 · unverdicted · novelty 7.0

PuMVR benchmark shows VLMs exhibit script-dependent bias on Punjabi tasks with accuracy gaps up to 16% and script consistency rates as low as 24.8%, even when visual input is provided.

Contextualizing Biological Language Models across Modalities via Logit-Space Contrastive Alignment

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

LOGICA adds context to pretrained biological LMs via logit-space contrastive alignment with gated adapters, improving AUC on held-out drug-resistance mutation ranking from ~0.55 to ~0.65 while preserving token likelihoods.

Forged Calamity: Benchmark for Cross-Domain Synthetic Disaster Detection in the Age of Diffusion

cs.CV · 2026-06-17 · unverdicted · novelty 7.0

Introduces Forged Calamity benchmark and shows that fine-tuned and zero-shot synthetic image detectors lose substantial accuracy on unseen generators and disaster types.

Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation

cs.CV · 2026-06-15 · conditional · novelty 7.0

A new benchmark for Punjabi reveals VLMs have large script-dependent performance gaps on identical tasks, with consistency as low as 24.8 percent.

VISA: VLM-Guided Instance Semantic Auditing for 3D Occupancy World Models

cs.CV · 2026-06-11 · unverdicted · novelty 7.0

VISA improves closed-set 3D occupancy mIoU on nuScenes by using VLM instance audits as reliability-weighted semantic supervisors during training of existing world models.

Net-Ev$^2$: A Generative Simulator for Network Event Evolution

cs.LG · 2026-06-10 · unverdicted · novelty 7.0

Net-Ev² proposes a two-stage generative simulator with structure-guided masked pre-training and topology-aware diffusion using graph U-Net down/upsampling to model network event evolution from text inputs, plus a new 6.5M multimodal benchmark and JL-MMD metric.

When to Align, When to Predict: A Phase Diagram for Multimodal Learning

cs.LG · 2026-06-09 · accept · novelty 7.0

A spiked signal-plus-noise model yields separation ratios that partition multimodal problems into four regimes where alignment, prediction, both, or neither succeed.

ArtiFact: A Large-Scale Multi-Modal Cultural Heritage Dataset

cs.DB · 2026-06-08 · unverdicted · novelty 7.0

ArtiFact is a new multi-modal dataset of 651k museum records used to benchmark cross-modal error detection with seven error categories and semantic query processing challenges.

$A^2$: Smaller Self-Supervised ViTs Localize Better than Larger Ones

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

Smaller self-supervised ViTs localize objects better via attention than larger ViTs, enabling A² to decouple localization from feature extraction for competitive performance on distribution-shifted benchmarks.

The Regularizing Power of Language-Training Deepfake Detectors

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

A dual-encoder deepfake detector pairs a frozen specialist with a LoRA-tuned MLLM, trained first via binary alignment then via RL to reward explain-then-classify behavior, yielding improved cross-dataset performance and interpretability.

PInVerify: An Offline Embodied Benchmark for Active Instance Verification

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

PInVerify is a new offline embodied benchmark for active instance verification that supplies multi-view captures and 6-sector navigation topology, with MLLM baselines reaching 85.6% after fine-tuning but showing no reliable benefit from tested next-best-view strategies.

Dex2HOI: Dexterous Bimanual Two-Object Interaction Generation

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

Dex2HOI is a dual-stream diffusion model with bidirectional cross-attention and motion fusion that generates long bimanual single- and two-object HOI sequences from text at real-time speeds.

citing papers explorer

Showing 50 of 265 citing papers.

InCoM: Intent-Driven Perception and Structured Coordination for Mobile Manipulation
cs.RO · 2026-02-26 · unverdicted · none · ref 23 · internal anchor
InCoM achieves 23-28% higher success rates in mobile manipulation tasks by inferring motion intent for adaptive perception and decoupling base-arm action generation.

VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction
cs.CV · 2026-02-09 · unverdicted · none · ref 34 · internal anchor
VisPhyWorld evaluates MLLMs' physical reasoning via executable code generation for video reconstruction, with VisPhyBench showing strong semantics but weak parameter inference and dynamics simulation.

CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding
cs.CL · 2026-01-29 · unverdicted · none · ref 23 · internal anchor
CausalEmbed uses auto-regressive generation with iterative margin loss to produce multi-vector embeddings that reduce visual token counts 30-155x while retaining competitive performance on VDR benchmarks.

ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch
cs.CV · 2026-01-20 · conditional · none · ref 33 · internal anchor
ChartVerse uses Rollout Posterior Entropy and truth-anchored inverse QA synthesis to produce 640K high-quality chart reasoning samples, training an 8B model that surpasses its 30B teacher.

Flexible Multitask Learning with Factorized Diffusion Policy
cs.RO · 2025-12-26 · unverdicted · none · ref 31 · internal anchor
A factorized modular diffusion policy improves fitting of multimodal robot actions and enables flexible task adaptation without catastrophic forgetting.

MM-Telco: Benchmarks and Multimodal Large Language Models for Telecom Applications
cs.AI · 2025-11-17 · unverdicted · none · ref 28 · internal anchor
MM-Telco creates multimodal benchmarks for telecom and demonstrates that fine-tuned LLMs and VLMs achieve significant performance gains on domain-specific tasks.

Foundation Models for Discovery and Exploration in Chemical Space
physics.chem-ph · 2025-10-20 · unverdicted · none · ref 13 · internal anchor
MIST models up to 10x larger than prior work, fine-tuned on over 400 structure-property tasks, match or exceed SOTA on benchmarks and demonstrate zero-shot olfactory perception mapping consistent with hyperbolic geometry.

SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning
cs.CV · 2025-10-18 · unverdicted · none · ref 45 · internal anchor
SSL4RL reformulates self-supervised learning objectives into dense, verifiable reward signals for RL-based fine-tuning of vision-language models, yielding performance gains on reasoning benchmarks.

Artificial Phantasia: Emergent Mental Imagery in Large Language Models
cs.AI · 2025-09-27 · unverdicted · none · ref 68 · internal anchor
LLMs achieve higher accuracy than humans on compositional imagery tasks previously argued to require pictorial representations, supporting emergent propositional mental imagery in AI.

VC-Inspector: Advancing Reference-free Evaluation of Video Captions with Factual Analysis
cs.CV · 2025-09-20 · unverdicted · none · ref 29 · internal anchor
VC-Inspector introduces a lightweight open-source LMM and a controllable factual-error generation framework that achieves state-of-the-art correlation with human judgments on reference-free video caption evaluation.

Scalable Option Learning in High-Throughput Environments
cs.LG · 2025-08-30 · unverdicted · none · ref 52 · internal anchor
SOL is a new hierarchical RL algorithm that reaches 35x higher throughput and outperforms flat agents when trained on 30 billion frames in NetHack while showing positive scaling.

Navigating the Challenges of AI-Generated Image Detection in the Wild: What Truly Matters?
cs.CV · 2025-07-14 · conditional · none · ref 35 · internal anchor
The ITW-SM dataset and targeted optimization of detector design choices yield a 26.87% average AUC improvement for state-of-the-art AI-generated image detectors under real-world social media conditions.

HeartcareGPT: A Unified Multimodal ECG Suite for Dual Signal-Image Modeling and Understanding
cs.LG · 2025-06-06 · unverdicted · none · ref 18 · internal anchor
HeartcareGPT proposes Dual Stream Projection Alignment (DSPA) on a structure-aware tokenizer for unified ECG signal-image modeling, supported by Heartcare-400K dataset and Heartcare-Bench.

v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning
cs.CL · 2025-05-24 · unverdicted · none · ref 5 · internal anchor
v1 adds a point-and-copy mechanism for dynamic visual token referencing in multimodal reasoning, trained on a new 300K dataset with grounding annotations, and outperforms baselines on multimodal math tasks.

GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data
cs.RO · 2025-05-06 · unverdicted · none · ref 3 · internal anchor
GraspVLA shows that pretraining a grasping model on a billion synthetic action frames enables zero-shot open-vocabulary performance and sim-to-real transfer.

Beyond Cross-Modal Alignment: Measuring and Leveraging Modality Gap in Vision-Language Models
cs.CV · 2025-02-16 · unverdicted · none · ref 27 · internal anchor
Introduces Modality Dominance Score (MDS) to measure modality-specific features in VLMs and applies training-free editing to improve bias mitigation, adversarial generation, and modality control.

RoboMD: Uncovering Robot Vulnerabilities through Semantic Potential Fields
cs.RO · 2024-12-03 · unverdicted · none · ref 47 · internal anchor
A deep RL vulnerability-prediction policy trained in semantic embedding space finds up to 23% more unique robot manipulation failures than vision-language baselines and enables more efficient fine-tuning.

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
cs.CL · 2024-11-07 · conditional · none · ref 30 · internal anchor
MoT decouples non-embedding parameters by modality in transformers to match dense multi-modal performance with roughly one-third to one-half the FLOPs.

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
cs.AI · 2024-08-20 · unverdicted · none · ref 17 · internal anchor
A single transformer combines language modeling loss and diffusion loss on mixed-modality data, scaling to 7B parameters and 2T tokens while matching specialized language and diffusion models.

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
cs.RO · 2024-03-19 · accept · none · ref 41 · internal anchor
DROID is a new 76k-trajectory in-the-wild robot manipulation dataset spanning 564 scenes and 84 tasks that improves policy performance and generalization when used for training.

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
cs.CV · 2023-11-25 · conditional · none · ref 66 · internal anchor
Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results while enabling motion LoRA and multi-view 3D applications.

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
cs.CV · 2023-07-04 · conditional · none · ref 34 · internal anchor
SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-the-art generators.

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
cs.CV · 2023-06-01 · unverdicted · none · ref 36 · internal anchor
LLaVA-Med is created via curriculum fine-tuning on PubMed figure-caption pairs and GPT-4 self-instructed data, achieving competitive or better results than prior supervised models on three biomedical VQA benchmarks.

Training Diffusion Models with Reinforcement Learning
cs.LG · 2023-05-22 · unverdicted · none · ref 21 · internal anchor
DDPO uses policy gradients on the denoising process to optimize diffusion models for arbitrary rewards like human feedback or compressibility.

OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models
cs.CV · 2023-05-13 · accept · none · ref 8 · internal anchor
OCRBench provides the largest evaluation suite yet for OCR capabilities in large multimodal models, revealing gaps in multilingual, handwritten, and mathematical text handling.

Shap-E: Generating Conditional 3D Implicit Functions
cs.CV · 2023-05-03 · accept · none · ref 48 · internal anchor
Shap-E encodes 3D assets into implicit function parameters then uses a conditional diffusion model to generate new ones from text, enabling fast multi-representation 3D asset creation.

Scaling Laws and Interpretability of Learning from Repeated Data
cs.LG · 2022-05-21 · accept · none · ref 1 · internal anchor
Repeating 0.1% of training data 100 times degrades an 800M parameter model's performance to that of a 400M model by damaging copying mechanisms and induction heads associated with generalization.

Text and Code Embeddings by Contrastive Pre-Training
cs.CL · 2022-01-24 · unverdicted · none · ref 15 · internal anchor
Contrastive pre-training on unsupervised data at scale creates text and code embeddings that set new state-of-the-art results on classification and semantic search benchmarks.

Florence: A New Foundation Model for Computer Vision
cs.CV · 2021-11-22 · unverdicted · none · ref 17 · internal anchor
Florence is a new vision foundation model that learns universal visual-language representations from web-scale data and reports state-of-the-art results on 44 benchmarks including 83.74% zero-shot ImageNet top-1 accuracy.

Multimodal Fusion for Fine-Grained Classification of Breast Fibroadenoma and Phyllodes Tumors
cs.CV · 2026-07-02 · unverdicted · none · ref 45 · internal anchor
A new multimodal fusion model using image, text, and clinical encoders with Transformer fusion reaches 77.64% accuracy on a pathology-confirmed 910-patient breast ultrasound dataset for distinguishing fibroadenoma from phyllodes tumors.

Boosting Ultrasound Image Classification via Attribute-Guided Dual-Branch Framework
cs.CV · 2026-07-02 · conditional · none · ref 19 · internal anchor
An attribute-guided dual-branch framework fuses a standard classifier with an interpretable attribute-prior branch to boost ultrasound classification accuracy and explainability.

Restore3D: Breathing Life into Broken Objects with Shape and Texture Restoration
cs.CV · 2026-07-01 · unverdicted · none · ref 85 · internal anchor
Restore3D restores shape and texture of broken 3D objects via multi-view image refinement with a Mask Self-Perceiver and coarse-to-fine mesh reconstruction, outperforming baselines on synthetic and real benchmarks.

Robust Onion: Peeling Open Vocab Object Detectors Under Noise
cs.CV · 2026-06-25 · unverdicted · none · ref 51 · internal anchor
Empirical study finds OV-OD robustness driven by vision backbone and image domain via layer-wise feature collapse analysis, validated with a low-parameter robustness improvement on real data.

SAC$^2$-Net: Semantic Anchoring and Complementary-Consensus Fusion for Multimodal Micro-Expression Recognition
cs.CV · 2026-06-24 · unverdicted · none · ref 27 · internal anchor
SAC²-Net uses semantic anchoring soft alignment and complementary-consensus fusion to report SOTA or competitive results on five MER benchmarks.

Factor-Aware Mixture-of-Experts with Pretrained Encoder for Combinatorial Generalization
cs.RO · 2026-06-19 · unverdicted · none · ref 4 · internal anchor
FAME combines a factor-aware MoE with frozen pretrained encoders via staged adapter training and joint fine-tuning, reporting 34% gains on Meta-World and 35% in real-world pick-and-place under environmental changes.

Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models
cs.CL · 2026-06-09 · unverdicted · none · ref 89 · internal anchor
The work establishes an evaluation framework for personality induction and switching in MLLMs, reporting improved captioning but impaired VQA performance plus balancing and residual effects during multi-trait and dynamic conditions.

ZODS-RS -- Zero-training Oriented Detection & Segmentation for Remote Sensing
cs.CV · 2026-06-09 · unverdicted · none · ref 21 · internal anchor
ZODS-RS introduces a zero-training closed-form pipeline using DINOv3 dense features and SAM-style proposals for horizontal-box detection and instance segmentation in remote-sensing imagery.

Traits Run Deeper: Trait-Specific Asymmetric Fusion for Personality Assessment
cs.CV · 2026-06-09 · unverdicted · none · ref 34 · internal anchor
Traits Run Deeper proposes MFR, TSMF asymmetric fusion, and DCPR modules to improve multimodal personality assessment, claiming 25% MSE reduction and first place on AVI Challenge 2026.

Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation
cs.AI · 2026-06-08 · unverdicted · none · ref 14 · internal anchor
DPVR-LF routes saturated vision tokens into a one-layer side branch after layer 4, runs text-only processing through layers 5-17, and performs late fusion at the final layer to reduce visual computation while preserving multimodal performance.

Experiment-free disruption prediction for new devices enabled by synthetic diagnostic data augmentation
physics.plasm-ph · 2026-06-07 · unverdicted · none · ref 39 · internal anchor
Augmenting EAST tokamak experimental data with synthetic J-TEXT diagnostic signals from NIMROD MHD simulations and applying Fourier Domain Adaptation improves zero-shot disruption prediction early warning rate from 50% to 57% on 1596 J-TEXT discharges.

CheXanatomy: Anatomy-Aware Vision-Language Modeling for Chest Radiographs
cs.CV · 2026-06-07 · unverdicted · none · ref 17 · internal anchor
CheXanatomy trains VLMs to generate 2D anatomical masks via next-token prediction on synthetic CXRs from CT, matching U-Net performance with better domain-shift robustness and sample efficiency.

Native3D: End-to-End 3D Scene Generation via Unified Mesh-Texture Modeling and Semantic Alignment
cs.CV · 2026-06-05 · unverdicted · none · ref 30 · internal anchor
Native3D introduces a direct 3D scene generation method using unified mesh-texture representation and 3D REPA Loss for semantic alignment, claimed to outperform prior 2D-dependent approaches.

PaCX-MAE: Physiology-Augmented Chest X-Ray Masked Autoencoder
cs.CV · 2026-06-01 · unverdicted · none · ref 17 · internal anchor
PaCX-MAE augments masked autoencoding of chest X-rays with dual contrastive-predictive alignment to ECG and laboratory embeddings, reporting gains on physiology-dependent tasks while remaining unimodal at test time.

EigeNet: Geometry-Informed Multi-Modal Learning for Few-shot Novel View RIR Prediction
cs.SD · 2026-05-27 · unverdicted · none · ref 37 · internal anchor
EigeNet applies a cross-view alternate-attention transformer with geometry modulation for few-shot novel-view RIR prediction, reporting SOTA results on simulated and real data.

Rare Events, Real Signals: Functional Ensembles as Units of Computation in Deep Spiking Networks
cs.NE · 2026-05-21 · unverdicted · none · ref 62 · internal anchor
In spiking ResNets, 1FC ensembles defined by pairwise correlations show ReLU-like cofiring-to-response mapping whose gain scales with ensemble size, with reliable class encoding restricted to infrequent high-cofiring events.

Video as Natural Augmentation: Towards Unified AI-Generated Image and Video Detection
cs.CV · 2026-05-21 · unverdicted · none · ref 65 · internal anchor
VINA trains a single detector on images plus video frames using a cross-modal supervised contrastive objective, yielding bidirectional gains and SOTA results on 14 image, video, and in-the-wild benchmarks.

SceneGraphGrounder: Zero-Shot 3D Visual Grounding via Structured Scene Graph Matching
cs.CV · 2026-05-20 · unverdicted · none · ref 16 · internal anchor
SceneGraphGrounder builds a persistent 3D scene graph from VLM-inferred relations in 2D views and solves grounding via constrained graph alignment, achieving competitive zero-shot results on ScanRefer with only RGB-D input.

Decomposing Subject-Driven Image Generation via Intermediate Structural Prediction
cs.CV · 2026-05-20 · unverdicted · none · ref 20 · internal anchor
A two-stage method predicts an intermediate Canny map for structure then renders the image conditioned on appearance and structure, paired with a 100k text-aware dataset, to improve detail preservation in subject-driven generation.

Tunable MAGMAX: Preference-Aware Model Merging for Continual Learning
cs.LG · 2026-05-20 · unverdicted · none · ref 16 · internal anchor
Tunable MAGMAX adds a tunable preference vector to model merging for continual learning, enabling automatic adaptation to target environments using small amounts of data while maintaining or improving task-wise performance.

Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models
cs.CV · 2026-05-19 · unverdicted · none · ref 12 · 2 links · internal anchor
EyeVLM benchmark finds that current VLMs underperform specialized visual models on gaze following and social gaze prediction, with fine-tuning narrowing but not closing the gap.
