Learning transferable visual models from natural language supervision
34 Pith papers cite this work.
citing papers explorer
-
UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs
VLMs show a resolution illusion on UHR Earth observation imagery where higher resolution does not improve micro-target perception; UHR-Micro benchmark and MAP-Agent address this via evidence-centered active inspection.
-
Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning
SAEParate disentangles sparse representations in diffusion models via contrastive clustering and nonlinear encoding to enable more precise concept unlearning with reduced side effects.
-
PoseBridge: Bridging the Skeletonization Gap for Zero-Shot Skeleton-Based Action Recognition
PoseBridge recovers semantic information lost during skeletonization by extracting pose-anchored cues from human pose estimation and transferring them via skeleton-conditioned bridging and semantic prototype adaptation, yielding 13.3-17.4 point gains on the Kinetics PURLS benchmark.
-
PhyGround: Benchmarking Physical Reasoning in Generative World Models
PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.
-
GridProbe: Posterior-Probing for Adaptive Test-Time Compute in Long-Video VLMs
GridProbe uses posterior probing on a K×K frame grid to adaptively select question-relevant frames, delivering up to 3.36x TFLOPs reduction with accuracy within 1.6 pp of the full-frame baseline on Video-MME-v2.
-
LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models
LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.
-
IPAD-CLIP: Teaching CLIP to Detect Image Local Perceptual Artifacts
IPAD-CLIP adapts CLIP via artifact-aware text embeddings to detect multi-class local perceptual artifacts, backed by a new dataset of 3520 images with pixel-level masks.
-
SphereVAD: Training-Free Video Anomaly Detection via Geodesic Inference on the Unit Hypersphere
SphereVAD performs training-free video anomaly detection by recasting anomaly discrimination as von Mises-Fisher likelihood-ratio geodesic inference on the unit hypersphere using intermediate MLLM features, with Fréchet mean centering, holistic scene attention, and spherical geodesic pulling.
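As a rough illustration of the scoring idea in this summary: for von Mises-Fisher components sharing a fixed concentration, the log-likelihood ratio between an anomalous and a normal mean direction reduces to a difference of scaled cosine similarities. This is a generic vMF sketch with hypothetical names, not the paper's actual pipeline:

```python
import numpy as np

def vmf_llr_scores(feats, mu_normal, mu_anom, kappa=10.0):
    """Toy von Mises-Fisher log-likelihood-ratio anomaly score.

    feats: (N, d) frame features. Scores are higher for frames whose
    direction is closer to the anomalous mean than the normal mean.
    With a shared concentration kappa, the vMF normalizing constants
    cancel, leaving kappa * (mu_anom . x - mu_normal . x).
    """
    x = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    mu_n = mu_normal / np.linalg.norm(mu_normal)
    mu_a = mu_anom / np.linalg.norm(mu_anom)
    return kappa * (x @ mu_a - x @ mu_n)
```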
-
Flatness and Gradient Alignment Are Both Necessary: Spectral-Aware Gradient-Aligned Exploration for Multi-Distribution Learning
Excess risk decomposes into independent alignment (trace of inverse average Hessian times gradient covariance) and curvature terms, so both flatness and gradient alignment are required; SAGE achieves this and sets new SOTA on DomainBed.
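The alignment term named in this summary can be written schematically as follows (the symbols and the form of the curvature term are my assumptions, not taken from the paper):

```latex
% \bar{H}: average Hessian across distributions; \Sigma_g: gradient covariance;
% \mathcal{C}(\bar{H}): an unspecified curvature (flatness) penalty.
\mathcal{E}_{\text{excess}}
  \;\approx\;
  \underbrace{\operatorname{tr}\!\left(\bar{H}^{-1}\,\Sigma_g\right)}_{\text{gradient alignment}}
  \;+\;
  \underbrace{\mathcal{C}(\bar{H})}_{\text{curvature (flatness)}}
```

Since the two terms are independent, minimizing either alone leaves the other unaddressed, which is why the summary claims both flatness and gradient alignment are required.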
-
Toward Privileged Foundation Models: LUPI for Accelerated and Improved Learning
PIQL integrates train-time-only privileged information into tabular foundation models via new constructions and a reconstruction architecture to achieve faster convergence and better generalization.
-
What Cohort INRs Encode and Where to Freeze Them
Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.
-
Masks Can Talk: Extracting Structured Text Information from Single-Modal Images for Remote Sensing Change Detection
S2M extracts structured text quadruples from change masks to provide noise-free multimodal supervision, achieving 17.80% Sek and 66.14% F_scd on the new Gaza-Change-v2 dataset and outperforming LLM-based multimodal methods.
-
TRAJGANR: Trajectory-Centric Urban Multimodal Learning via Geospatially Aligned Neural Representations
TrajGANR learns continuous neural representations of trajectories to enable fine-grained alignment with street-view images and locations in a joint multimodal self-supervised objective, outperforming prior geospatial MSSL methods on urban mobility and road tasks.
-
SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning
SyncDPO improves temporal synchronization in video-audio joint generation using DPO with efficient on-the-fly negative sample construction and curriculum learning.
-
LDDR: Linear-DPP-Based Dynamic-Resolution Frame Sampling for Video MLLMs
LDDR proposes a linear DPP-based dynamic-resolution frame sampler that achieves 3x speedup and up to 2.5-point gains on video MLLM benchmarks by selecting non-redundant frames and allocating tokens accordingly.
-
SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images
SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.
-
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines.
-
CLEF: EEG Foundation Model for Learning Clinical Semantics
CLEF, a long-context EEG foundation model using 3D multitaper spectrograms and contrastive alignment with reports and EHR, beats prior models on 229 of 234 clinical tasks and raises mean AUROC from 0.65 to 0.74.
-
Reinforcing Multimodal Reasoning Against Visual Degradation
ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.
-
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and VQA benchmarks.
-
MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.
-
Decoupling Endpoint and Semantic Transition Learning for Zero-Shot Composed Image Retrieval
DeCIR decouples endpoint alignment from semantic transition alignment in projection-based ZS-CIR via paired edit tuples, separate low-rank adapters, and LRDM merging, yielding consistent gains on CIRR, CIRCO, FashionIQ, and GeneCIS without added inference cost.
-
DVD: Discrete Voxel Diffusion for 3D Generation and Editing
DVD treats voxel occupancy as a discrete variable in a diffusion framework to generate, assess, and edit sparse 3D voxels without continuous thresholding.
-
Anisotropic Modality Align
Modality representations share dominant semantic geometry but have an anisotropic residual gap; AnisoAlign corrects source representations boundedly using target geometry for unpaired alignment.
-
LithoBench: Benchmarking Large Multimodal Models for Remote-Sensing Lithology Interpretation
LithoBench is a new multi-level benchmark showing that existing large multimodal models have substantial limitations in geological semantic understanding for remote sensing lithology interpretation.
-
Hierarchical Dual-Subspace Decoupling for Continual Learning in Vision-Language Models
HDSD decouples parameter subspaces in vision-language models via a Feature Modulation Module, a General Fusion Module with adaptive thresholds, and a Hierarchical Learning Module with SVD scaling to minimize cross-task interference, achieving state-of-the-art class-incremental learning performance.
-
ModelLens: Finding the Best for Your Task from Myriads of Models
ModelLens learns a performance-aware latent space from 1.62M leaderboard records to rank unseen models on unseen datasets without forward passes on the target.
-
Rapidly deploying on-device eye tracking by distilling visual foundation models
DistillGaze reduces median gaze error by 58.62% on a 2000+ participant dataset by distilling foundation models into a 256K-parameter on-device model using synthetic labeled data and unlabeled real data.
-
A Composite Activation Function for Learning Stable Binary Representations
HTAF is a sigmoid-tanh composite that approximates the Heaviside function to allow stable gradient training of binary activation networks, yielding ICBMs with stable discretization and competitive performance on image tasks.
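A minimal sketch of the idea of a sigmoid-tanh composite approximating the Heaviside step (the exact composite form and slope parameters here are assumptions for illustration, not HTAF's published definition):

```python
import numpy as np

def htaf_sketch(x, alpha=5.0, beta=5.0):
    """Illustrative sigmoid-tanh composite (form is an assumption, not
    the paper's exact definition). Averages a steep sigmoid and a
    rescaled tanh; both approach the Heaviside step as the slopes
    alpha and beta grow, while keeping nonzero gradients near zero,
    which is what enables stable gradient training of binary units."""
    sig = 1.0 / (1.0 + np.exp(-alpha * x))
    tnh = 0.5 * (1.0 + np.tanh(beta * x))
    return 0.5 * (sig + tnh)
```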
-
TINS: Test-time ID-prototype-separated Negative Semantics Learning for OOD Detection
TINS improves OOD detection by learning negative semantics at test time with ID-prototype separation, cutting average FPR95 from 14.04% to 6.72% on the Four-OOD benchmark with ImageNet-1K.
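For reference, the FPR95 metric cited above is the false positive rate on OOD samples at the threshold that retains 95% of in-distribution samples. A standard computation (treating higher scores as more in-distribution) looks like:

```python
import numpy as np

def fpr_at_95_tpr(id_scores, ood_scores):
    """FPR95: fraction of OOD samples scored above the threshold that
    keeps 95% of ID samples (ID is the positive class; a higher score
    means more in-distribution)."""
    thresh = np.percentile(id_scores, 5)  # 95% of ID scores lie above
    return float(np.mean(ood_scores >= thresh))
```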
-
Empty SPACE: Cross-Attention Sparsity for Concept Erasure in Diffusion Models
SPACE induces sparsity in cross-attention parameters via closed-form iterative updates to erase target concepts more effectively than dense baselines in large diffusion models.
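As a stand-in illustration of closed-form sparsity-inducing updates (this is the generic proximal soft-threshold operator, not SPACE's actual rule), one iteration of a gradient step followed by the L1 proximal map zeroes out small parameters exactly:

```python
import numpy as np

def soft_threshold_step(W, grad, lr=0.1, lam=0.05):
    """Generic proximal update that drives parameters toward sparsity
    in closed form: a gradient step, then the soft-threshold operator
    (the proximal map of the L1 penalty), which sets entries smaller
    than lr * lam exactly to zero."""
    W = W - lr * grad
    return np.sign(W) * np.maximum(np.abs(W) - lr * lam, 0.0)
```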
-
ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation
ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.
-
Pan-FM: A Pan-Organ Foundation Model with Saliency-Guided Masking for Missing Robustness
Pan-FM learns balanced representations across seven organs by adaptively masking dominant organs during pre-training, yielding stronger disease prediction and missing-organ robustness than single-organ or naive multimodal baselines on UK Biobank.
-
Shaping Schema via Language Representation as the Next Frontier for LLM Intelligence Expanding
Advanced language representations shape LLMs' schemas to improve knowledge activation and problem-solving.