Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.
hub
Learning Transferable Visual Models From Natural Language Supervision
93 Pith papers cite this work. Polarity classification is still indexing.
abstract
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (i
co-cited works
roles
background 1polarities
unclear 1representative citing papers
Cross-attention control in text-conditioned models enables localized and global image edits by editing only the input text prompt.
Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.
SMA uses a submodular mutual information objective on data sets to deliver competitive zero-shot classification and retrieval performance on CLIP benchmarks with only tens of thousands of samples, orders of magnitude fewer than standard approaches.
Embedding Temporal Logic enables runtime monitoring of temporally extended perceptual behaviors by defining predicates via distances between observed and reference embeddings in learned spaces, with conformal calibration for reliable evaluation.
PRISM-VL improves VLM performance by grounding on RAW-derived Meas.-XYZ inputs and exposure-bracketed supervision, gaining +0.1074 BLEU and +4.46% LLM-Judge accuracy over an RGB baseline on a held-out benchmark.
FLARE is a new benchmark with 399 long videos, 87k multimodal clips, and 275k user-style queries for testing audiovisual retrieval under caption and query regimes.
UniShield introduces a knowledge-graph-informed multimodal framework that improves unified detection of physical and digital face attacks through instruction tuning and consistency-optimized reasoning.
Classification fields are infinite recursive hierarchical cluster structures generated by a local refinement rule, and a ReLU network predictor learned from finite prefixes can approximate the generator and extend it to deeper levels with exponential convergence in the completed cell metric.
Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
TCD-Arena is a new customizable testing framework that runs millions of experiments to map how 33 different assumption violations affect time series causal discovery methods and shows ensembles can boost overall robustness.
The paper introduces the VODA setting for domain adaptation from scratch using vision-language models and presents TS-DRD, which achieves competitive performance on standard benchmarks without source models.
A weighted entropy active learning method for fair brain segmentation reduces group performance disparities by 75-86% versus standard entropy on synthetic biased MRI data.
Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.
Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a directional derivative penalty.
MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.
Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.
A Semantic Progress Function is defined as a 1D curve of cumulative semantic shifts from frame embeddings, supporting a linearization procedure that retimes video sequences for constant-rate semantic evolution.
StyleID supplies human-perception-aligned benchmarks and fine-tuned encoders that improve facial identity recognition robustness across stylization types and strengths.
A-MAR decomposes art queries into reasoning plans to condition retrieval, leading to improved explanation quality and multi-step reasoning on art benchmarks compared to baselines.
Audio-Contrastive Preference Optimization (ACPO) mitigates audio hallucination in AVLMs via output-contrastive and input-contrastive objectives that enforce faithful audio grounding.
Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
A framework uses modality-agnostic prompts to adapt SAM for multi-modal camouflaged object detection, with a mask refine module for better boundaries.
citing papers explorer
-
Editing Models with Task Arithmetic
Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.
-
SMA: Submodular Modality Aligner For Data Efficient Multimodal Learning
SMA uses a submodular mutual information objective on data sets to deliver competitive zero-shot classification and retrieval performance on CLIP benchmarks with only tens of thousands of samples, orders of magnitude fewer than standard approaches.
-
Runtime Monitoring of Perception-Based Autonomous Systems via Embedding Temporal Logic
Embedding Temporal Logic enables runtime monitoring of temporally extended perceptual behaviors by defining predicates via distances between observed and reference embeddings in learned spaces, with conformal calibration for reliable evaluation.
-
Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models
Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.
-
TCD-Arena: Assessing Robustness of Time Series Causal Discovery Methods Against Assumption Violations
TCD-Arena is a new customizable testing framework that runs millions of experiments to map how 33 different assumption violations affect time series causal discovery methods and shows ensembles can boost overall robustness.
-
Bottleneck Tokens for Unified Multimodal Retrieval
Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.
-
Self-Directed Task Identification
SDTI lets models identify the correct target variable in datasets in a zero-shot setting using standard neural networks, beating baselines by 14% F1 on synthetic benchmarks.
-
Diffusion Models Beat GANs on Image Synthesis
Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.
-
ADAPTive Input Training for Many-to-One Pre-Training on Time-Series Classification
ADAPT is a new pre-training paradigm that aligns physical properties of time-series data to allow simultaneous training on 162 diverse classification datasets, achieving new state-of-the-art performance.
-
LLM-Generated Fault Scenarios for Evaluating Perception-Driven Lane Following in Autonomous Edge Systems
A decoupled offline-online framework uses LLMs and latent diffusion models to generate fault scenarios for testing edge-based lane-following models, revealing large robustness drops under conditions like fog.
-
Training Diffusion Models with Reinforcement Learning
DDPO uses policy gradients on the denoising process to optimize diffusion models for arbitrary rewards like human feedback or compressibility.
-
Anchoring the Eigengap: Cross-Modal Spectral Stabilization for Sample-Efficient Representation Learning
Finite-sample noise collapses the eigengap in representation covariances limiting recoverable modes K(N); multimodal learning stabilizes it via low-rank constraints, yielding better class separation quantified by truncated Mahalanobis energy approximated with a zeta function.
-
Bolek: A Multimodal Language Model for Molecular Reasoning
Bolek injects Morgan fingerprint embeddings into an instruction-tuned text model, then fine-tunes on molecular alignment and synthetic chain-of-thought tasks to improve performance and grounding on 15 TDC binary classification endpoints while generalizing to unseen tasks.
-
ProtoCLIP: Prototype-Aligned Latent Refinement for Robust Zero-Shot Chest X-Ray Classification
ProtoCLIP improves zero-shot chest X-ray classification in CLIP models by 2-10 AUC points via curated data and prototype-aligned distillation, reaching 0.94 AUC for pneumothorax on VinDr-CXR.
-
Matched-Learning-Rate Analysis of Attention Drift and Transfer Retention in Fine-Tuned CLIP
Matched learning-rate experiments show LoRA retains substantially higher zero-shot transfer (45% vs 11% on EuroSAT, 58% vs 9% on Pets) than Full FT in CLIP adaptation.