hub

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal · 2021 · cs.CV · arXiv 2103.00020

93 Pith papers cite this work. Polarity classification is still indexing.

93 Pith papers citing it

open full Pith review browse 93 citing papers arXiv PDF

abstract

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1

citation-polarity summary

unclear 1

claims ledger

abstract State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (i

co-cited works

representative citing papers

Editing Models with Task Arithmetic

cs.LG · 2022-12-08 · accept · novelty 8.0

Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.

Prompt-to-Prompt Image Editing with Cross Attention Control

cs.CV · 2022-08-02 · unverdicted · novelty 8.0

Cross-attention control in text-conditioned models enables localized and global image edits by editing only the input text prompt.

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

cs.CV · 2022-08-02 · unverdicted · novelty 8.0

Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.

SMA: Submodular Modality Aligner For Data Efficient Multimodal Learning

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

SMA uses a submodular mutual information objective on data sets to deliver competitive zero-shot classification and retrieval performance on CLIP benchmarks with only tens of thousands of samples, orders of magnitude fewer than standard approaches.

Runtime Monitoring of Perception-Based Autonomous Systems via Embedding Temporal Logic

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Embedding Temporal Logic enables runtime monitoring of temporally extended perceptual behaviors by defining predicates via distances between observed and reference embeddings in learned spaces, with conformal calibration for reliable evaluation.

Allegory of the Cave: Measurement-Grounded Vision-Language Learning

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

PRISM-VL improves VLM performance by grounding on RAW-derived Meas.-XYZ inputs and exposure-bracketed supervision, gaining +0.1074 BLEU and +4.46% LLM-Judge accuracy over an RGB baseline on a held-out benchmark.

FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries

cs.MM · 2026-05-11 · unverdicted · novelty 7.0

FLARE is a new benchmark with 399 long videos, 87k multimodal clips, and 275k user-style queries for testing audiovisual retrieval under caption and query regimes.

UniShield: Unified Face Attack Detection via KG-Informed Multimodal Reasoning

cs.CV · 2026-05-09 · unverdicted · novelty 7.0

UniShield introduces a knowledge-graph-informed multimodal framework that improves unified detection of physical and digital face attacks through instruction tuning and consistency-optimized reasoning.

Classification Fields: Arbitrarily Fine Recursive Hierarchical Clustering From Few Examples

stat.ML · 2026-05-08 · unverdicted · novelty 7.0

Classification fields are infinite recursive hierarchical cluster structures generated by a local refinement rule, and a ReLU network predictor learned from finite prefixes can approximate the generator and extend it to deeper levels with exponential convergence in the completed cell metric.

Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.

OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

cs.RO · 2026-05-07 · unverdicted · novelty 7.0

OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

TCD-Arena: Assessing Robustness of Time Series Causal Discovery Methods Against Assumption Violations

cs.LG · 2026-05-04 · unverdicted · novelty 7.0

TCD-Arena is a new customizable testing framework that runs millions of experiments to map how 33 different assumption violations affect time series causal discovery methods and shows ensembles can boost overall robustness.

Rethinking the Need for Source Models: Source-Free Domain Adaptation from Scratch Guided by a Vision-Language Model

cs.CV · 2026-05-04 · unverdicted · novelty 7.0

The paper introduces the VODA setting for domain adaptation from scratch using vision-language models and presents TS-DRD, which achieves competitive performance on standard benchmarks without source models.

Exploring Entropy-based Active Learning for Fair Brain Segmentation

cs.CV · 2026-05-03 · unverdicted · novelty 7.0

A weighted entropy active learning method for fair brain segmentation reduces group performance disparities by 75-86% versus standard entropy on synthetic biased MRI data.

Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization

cs.CV · 2026-04-26 · unverdicted · novelty 7.0

Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.

$Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models

cs.CV · 2026-04-26 · unverdicted · novelty 7.0

Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a directional derivative penalty.

MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models

cs.IR · 2026-04-25 · unverdicted · novelty 7.0

MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.

Latent Space Probing for Adult Content Detection in Video Generative Models

cs.CV · 2026-04-25 · unverdicted · novelty 7.0

Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.

Video Analysis and Generation via a Semantic Progress Function

cs.CV · 2026-04-24 · unverdicted · novelty 7.0

A Semantic Progress Function is defined as a 1D curve of cumulative semantic shifts from frame embeddings, supporting a linearization procedure that retimes video sequences for constant-rate semantic evolution.

StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition

cs.GR · 2026-04-23 · unverdicted · novelty 7.0

StyleID supplies human-perception-aligned benchmarks and fine-tuned encoders that improve facial identity recognition robustness across stylization types and strengths.

A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding

cs.AI · 2026-04-21 · unverdicted · novelty 7.0

A-MAR decomposes art queries into reasoning plans to condition retrieval, leading to improved explanation quality and multi-step reasoning on art benchmarks compared to baselines.

Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models

cs.CV · 2026-04-15 · unverdicted · novelty 7.0

Audio-Contrastive Preference Optimization (ACPO) mitigates audio hallucination in AVLMs via output-contrastive and input-contrastive objectives that enforce faithful audio grounding.

Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation

cs.CV · 2026-04-15 · conditional · novelty 7.0

Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.

Modality-Agnostic Prompt Learning for Multi-Modal Camouflaged Object Detection

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

A framework uses modality-agnostic prompts to adapt SAM for multi-modal camouflaged object detection, with a mask refine module for better boundaries.

citing papers explorer

Showing 43 of 93 citing papers.

MoDAl: Self-Supervised Neural Modality Discovery via Decorrelation for Speech Neuroprosthesis q-bio.NC · 2026-04-22 · unverdicted · none · ref 22 · internal anchor
MoDAl discovers complementary neurolinguistic modalities via contrastive-decorrelation objectives, cutting brain-to-text word error rate from 26.3% to 21.6% by incorporating area 44 signals.
REVEAL: Multimodal Vision-Language Alignment of Retinal Morphometry and Clinical Risks for Incident AD and Dementia Prediction cs.CV · 2026-04-20 · unverdicted · none · ref 20 · internal anchor
REVEAL uses vision-language alignment of retinal morphometry and clinical risk narratives plus group contrastive learning to predict AD and dementia about 8 years early.
PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios cs.CV · 2026-04-15 · unverdicted · none · ref 27 · internal anchor
PostureObjectStitch generates assembly-aware anomaly images by decoupling multi-view features into high-frequency, texture and RGB components, modulating them temporally in a diffusion model, and applying conditional loss plus geometric priors to preserve correct component relationships.
MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis cs.CV · 2026-04-15 · unverdicted · none · ref 27 · internal anchor
MaMe is a differentiable matrix-only token merging method that doubles ViT-B throughput with a 2% accuracy drop on pre-trained models and enables faster, higher-quality image synthesis when paired with MaRe.
Grounded World Model for Semantically Generalizable Planning cs.RO · 2026-04-13 · conditional · none · ref 47 · internal anchor
A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.
EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models cs.AI · 2026-04-13 · unverdicted · none · ref 40 · 2 links · internal anchor
EmergentBridge enhances zero-shot cross-modal performance on unpaired modalities by learning noisy bridge anchors from existing alignments and enforcing proxy alignment only in the orthogonal subspace to avoid gradient interference.
Improving Layout Representation Learning Across Inconsistently Annotated Datasets via Agentic Harmonization cs.CV · 2026-04-13 · unverdicted · none · ref 43 · internal anchor
VLM-based harmonization of inconsistent annotations across two document layout corpora raises detection F-score from 0.860 to 0.883 and table TEDS from 0.750 to 0.814 while tightening embedding clusters.
WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations cs.RO · 2026-04-12 · unverdicted · none · ref 140 · internal anchor
WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match teleoperation success rates on five tabletop tasks with 5-8x less collection effort.
ADAPTive Input Training for Many-to-One Pre-Training on Time-Series Classification cs.LG · 2026-04-09 · unverdicted · none · ref 22 · internal anchor
ADAPT is a new pre-training paradigm that aligns physical properties of time-series data to allow simultaneous training on 162 diverse classification datasets, achieving new state-of-the-art performance.
Latent Anomaly Knowledge Excavation: Unveiling Sparse Sensitive Neurons in Vision-Language Models cs.CV · 2026-04-09 · unverdicted · none · ref 1 · internal anchor
LAKE identifies sparse anomaly-sensitive neurons in pre-trained VLMs using minimal normal samples to build compact normality representations and achieve SOTA anomaly detection with neuron-level interpretability.
DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection cs.CV · 2026-04-03 · unverdicted · none · ref 15 · 2 links · internal anchor
DeCo-DETR builds hierarchical semantic prototypes offline and uses decoupled training streams to deliver competitive zero-shot open-vocabulary detection with improved inference speed.
An Explainable Vision-Language Model Framework with Adaptive PID-Tversky Loss for Lumbar Spinal Stenosis Diagnosis cs.CV · 2026-04-02 · unverdicted · none · ref 33 · internal anchor
A VLM framework with spatial patch cross-attention and adaptive PID-Tversky loss reports 90.69% classification accuracy, 0.9512 Dice score, and 92.80 CIDEr for LSS diagnosis plus automated report generation.
LLM-Generated Fault Scenarios for Evaluating Perception-Driven Lane Following in Autonomous Edge Systems cs.LG · 2026-04-01 · conditional · none · ref 9 · internal anchor
A decoupled offline-online framework uses LLMs and latent diffusion models to generate fault scenarios for testing edge-based lane-following models, revealing large robustness drops under conditions like fog.
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model cs.AI · 2024-08-20 · unverdicted · none · ref 17 · internal anchor
A single transformer combines language modeling loss and diffusion loss on mixed-modality data, scaling to 7B parameters and 2T tokens while matching specialized language and diffusion models.
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset cs.RO · 2024-03-19 · accept · none · ref 41 · internal anchor
DROID is a new 76k-trajectory in-the-wild robot manipulation dataset spanning 564 scenes and 84 tasks that improves policy performance and generalization when used for training.
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets cs.CV · 2023-11-25 · conditional · none · ref 66 · internal anchor
Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results while enabling motion LoRA and multi-view 3D applications.
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis cs.CV · 2023-07-04 · conditional · none · ref 34 · internal anchor
SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-the-art generators.
Training Diffusion Models with Reinforcement Learning cs.LG · 2023-05-22 · unverdicted · none · ref 21 · internal anchor
DDPO uses policy gradients on the denoising process to optimize diffusion models for arbitrary rewards like human feedback or compressibility.
Towards a Large Language-Vision Question Answering Model for MSTAR Automatic Target Recognition cs.CV · 2026-05-11 · unverdicted · none · ref 13 · internal anchor
A fine-tuned large language-vision model achieves 98% accuracy on visual question answering for military vehicle identification in SAR imagery from an extended MSTAR benchmark.
Anchoring the Eigengap: Cross-Modal Spectral Stabilization for Sample-Efficient Representation Learning cs.LG · 2026-05-09 · unverdicted · none · ref 18 · internal anchor
Finite-sample noise collapses the eigengap in representation covariances limiting recoverable modes K(N); multimodal learning stabilizes it via low-rank constraints, yielding better class separation quantified by truncated Mahalanobis energy approximated with a zeta function.
GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs cs.CV · 2026-05-07 · unverdicted · none · ref 10 · internal anchor
GeoStack composes multiple domain experts into VLMs with preserved base knowledge and O(1) inference time via geometric stacking and a weight-folding property.
FUS3DMaps: Scalable and Accurate Open-Vocabulary Semantic Mapping by 3D Fusion of Voxel- and Instance-Level Layers cs.RO · 2026-05-05 · unverdicted · none · ref 4 · internal anchor
FUS3DMaps fuses voxel- and instance-level open-vocabulary layers inside a shared 3D voxel map to improve both layers and enable scalable accurate semantic mapping.
Bolek: A Multimodal Language Model for Molecular Reasoning cs.LG · 2026-05-04 · unverdicted · none · ref 21 · internal anchor
Bolek injects Morgan fingerprint embeddings into an instruction-tuned text model, then fine-tunes on molecular alignment and synthetic chain-of-thought tasks to improve performance and grounding on 15 TDC binary classification endpoints while generalizing to unseen tasks.
InVitroVision: a Multi-Modal AI Model for Automated Description of Embryo Development using Natural Language cs.AI · 2026-04-22 · unverdicted · none · ref 10 · internal anchor
InVitroVision, a fine-tuned PaliGemma-2 model, generates natural language descriptions of embryo development and outperforms ChatGPT 5.2 and base models on a public time-lapse dataset.
Style-Based Neural Architectures for Real-Time Weather Classification cs.CV · 2026-04-20 · unverdicted · none · ref 20 · internal anchor
Three style-based neural architectures are proposed for real-time weather classification from images, with two truncated ResNet variants claimed to outperform prior methods and generalize across public datasets.
UniMesh: Unifying 3D Mesh Understanding and Generation cs.CV · 2026-04-19 · unverdicted · none · ref 38 · internal anchor
UniMesh unifies 3D mesh generation and understanding in one model via a Mesh Head interface, Chain of Mesh iterative editing, and an Actor-Evaluator self-reflection loop.
From edges to meaning: Semantic line sketches as a cognitive scaffold for ancient pictograph invention cs.AI · 2026-04-14 · unverdicted · none · ref 1 · internal anchor
A biologically inspired AI model of the visual hierarchy generates contour sketches that structurally resemble ancient pictographs from Egyptian, Chinese, and other writing systems.
Does Your VFM Speak Plant? The Botanical Grammar of Vision Foundation Models for Object Detection cs.CV · 2026-04-10 · unverdicted · none · ref 19 · internal anchor
Optimized prompts for vision foundation models improve cowpea detection accuracy by over 0.35 mAP on synthetic data and transfer effectively to real fields without manual annotations.
Low-Data Supervised Adaptation Outperforms Prompting for Cloud Segmentation Under Domain Shift cs.CV · 2026-04-10 · unverdicted · none · ref 6 · internal anchor
Supervised fine-tuning with 0.1% labeled data outperforms all 60 tested prompt variants for CLIPSeg cloud segmentation on satellite imagery under domain shift.
From Perception to Autonomous Computational Modeling: A Multi-Agent Approach cs.CE · 2026-04-08 · unverdicted · none · ref 67 · internal anchor
A multi-agent LLM framework autonomously completes the full computational mechanics pipeline from a photograph to a code-compliant engineering report on a steel L-bracket example.
Woosh: A Sound Effects Foundation Model cs.SD · 2026-04-02 · accept · none · ref 28 · internal anchor
Woosh is a new publicly released foundation model optimized for high-quality sound effect generation from text or video, showing competitive or better results than open alternatives like Stable Audio Open.
LLM-Enhanced Topical Trend Detection at Snapchat cs.IR · 2026-04-29 · unverdicted · none · ref 17 · internal anchor
Snapchat's deployed system detects emerging topical trends in short videos via multimodal extraction, time-series burst detection, and LLM consolidation, achieving high precision per six months of human evaluation and improving content freshness in production.
ProtoCLIP: Prototype-Aligned Latent Refinement for Robust Zero-Shot Chest X-Ray Classification cs.LG · 2026-04-20 · unverdicted · none · ref 9 · internal anchor
ProtoCLIP improves zero-shot chest X-ray classification in CLIP models by 2-10 AUC points via curated data and prototype-aligned distillation, reaching 0.94 AUC for pneumothorax on VinDr-CXR.
Transparent and Controllable Recommendation Filtering via Multimodal Multi-Agent Collaboration cs.IR · 2026-04-19 · unverdicted · none · ref 23 · internal anchor
A multi-agent multimodal system with fact-grounded adjudication and a dynamic two-tier preference graph cuts false positives in content filtering by 74.3% and nearly doubles F1-score versus text-only baselines while supporting user-driven Delta adjustments.
Prompt Sensitivity in Vision-Language Grounding: How Small Changes in Wording Affect Object Detection cs.CV · 2026-04-18 · unverdicted · none · ref 1 · internal anchor
Vision-language grounding shows high prompt sensitivity, with different wordings for the same object leading to distinct instance selections and text embeddings explaining only 34% of the disagreement.
SatBLIP: Context Understanding and Feature Identification from Satellite Imagery with Vision-Language Learning cs.CV · 2026-04-15 · unverdicted · none · ref 12 · internal anchor
SatBLIP fine-tunes a satellite-adapted BLIP model on GPT-4o-generated captions to predict county-level SVI from satellite tiles and uses SHAP to highlight key features like roof condition and vegetation.
Few-Shot Contrastive Adaptation for Audio Abuse Detection in Low-Resource Indic Languages cs.SD · 2026-04-10 · unverdicted · none · ref 2 · internal anchor
CLAP audio representations with few-shot contrastive adaptation achieve competitive abusive speech detection across ten Indic languages, though gains are language-dependent and not always larger with more examples.
DAT: Dual-Aware Adaptive Transmission for Efficient Multimodal LLM Inference in Edge-Cloud Systems cs.MM · 2026-04-07 · unverdicted · none · ref 32 · internal anchor
DAT combines a small-large model cascade with fine-tuning and bandwidth-aware multi-stream transmission to deliver high-accuracy event recognition and low-latency alerts for video streams in edge-cloud systems.
Integration of Object Detection and Small VLMs for Construction Safety Hazard Identification cs.CV · 2026-04-06 · unverdicted · none · ref 37 · internal anchor
Detection-guided prompting raises small VLM hazard F1 from 34.5% to 50.6% and BERTScore from 0.61 to 0.82 on construction images with only 2.5 ms added latency.
Matched-Learning-Rate Analysis of Attention Drift and Transfer Retention in Fine-Tuned CLIP cs.LG · 2026-04-01 · unverdicted · none · ref 13 · internal anchor
Matched learning-rate experiments show LoRA retains substantially higher zero-shot transfer (45% vs 11% on EuroSAT, 58% vs 9% on Pets) than Full FT in CLIP adaptation.
Improved Baselines with Visual Instruction Tuning cs.CV · 2023-10-05 · conditional · none · ref 44 · internal anchor
Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.
MIRAGE: Retrieval and Generation of Multimodal Images and Texts for Medical Education cs.CV · 2026-05-06 · unverdicted · none · ref 6 · internal anchor
MIRAGE combines a medical CLIP model, a diffusion generator, and an LLM into an accessible interface for retrieving and creating educational medical images and texts.
Evaluation of Winning Solutions of 2025 Low Power Computer Vision Challenge cs.CV · 2026-04-21 · unverdicted · none · ref 16 · internal anchor
The 2025 LPCVC winners demonstrate practical techniques for low-power image classification under varied conditions, open-vocabulary segmentation from text prompts, and monocular depth estimation.

Learning Transferable Visual Models From Natural Language Supervision

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer