
Learning Transferable Visual Models From Natural Language Supervision

Aditya Ramesh, Alec Radford, Chris Hallacy, Gabriel Goh, Jong Wook Kim, Sandhini Agarwal · 2021 · cs.CV · arXiv 2103.00020

Mixed citation behavior. Most common role is background (69%).

268 Pith papers citing it

Background 69% of classified citations


abstract

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
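The pre-training task named in the abstract (predicting which caption goes with which image) and the zero-shot transfer step can be illustrated with a small sketch. This is not the authors' released code: it is a minimal pure-Python illustration that assumes the image and text encoders have already produced embedding vectors, and the function names, the toy vectors, and the temperature value of 0.07 are all illustrative assumptions.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    Row k of image_embs is assumed to be paired with row k of text_embs.
    The temperature is a hypothetical fixed value chosen for illustration.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j]: scaled similarity between image i and caption j
    logits = [[dot(img, txt) / temperature for txt in texts] for img in images]
    # For each image, the matching caption is the correct "class".
    loss_images = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # For each caption, the matching image is the correct "class".
    loss_texts = sum(
        cross_entropy([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (loss_images + loss_texts) / 2

def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class prompt most similar to the image."""
    image = l2_normalize(image_emb)
    scores = [dot(image, l2_normalize(t)) for t in class_text_embs]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy data: two (image, caption) pairs with hand-made 2-D embeddings.
paired = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = symmetric_contrastive_loss(paired, paired)
loss_mismatched = symmetric_contrastive_loss(paired, swapped)
predicted_class = zero_shot_classify([0.9, 0.1], paired)
```

In this toy setup the loss is small when each image embedding lines up with its own caption and large when the captions are swapped. Zero-shot classification reuses the same similarity scores, with one text embedding per candidate class description.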


citation-role summary

background 36 · method 8 · baseline 4 · other 1

citation-polarity summary

background 34 · use method 8 · baseline 4 · unclear 2 · support 1




representative citing papers

Unlocking the Visual Record of Materials Science: A Large-Scale Multimodal Dataset from Scientific Literature

cs.CV · 2026-06-29 · accept · novelty 8.0

MatMMExtract pipeline creates MatSciFig dataset of 391k annotated materials science figure panels and MaterialScope detection dataset with high accuracy.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

Editing Models with Task Arithmetic

cs.LG · 2022-12-08 · accept · novelty 8.0

Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.

Prompt-to-Prompt Image Editing with Cross Attention Control

cs.CV · 2022-08-02 · unverdicted · novelty 8.0

Cross-attention control in text-conditioned models enables localized and global image edits by editing only the input text prompt.

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

cs.CV · 2022-08-02 · unverdicted · novelty 8.0

Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.

DART: Difficulty-Adaptive Routing for Zero-Shot Video Temporal Grounding

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

DART routes zero-shot video temporal grounding queries by difficulty using DPP entropy, achieving up to 3.5 mIoU gains with 7x fewer frames on Charades-STA and ActivityNet Captions.

SonoCLIP: Mask-Guided Region-Aware Vision-Language Pretraining for Fetal Ultrasound Analysis

cs.CV · 2026-06-28 · unverdicted · novelty 7.0

SonoCLIP presents a mask-guided region-aware vision-language foundation model pretrained on 1.44M fetal ultrasound images, demonstrating superior zero-shot performance.

Agent-Computer Observation Interfaces Enable Dynamic Computer Use

cs.AI · 2026-06-28 · conditional · novelty 7.0

AOI adds keyframe capture, volume-gated audio transcription, and visual narration to computer-use agents, producing +17 to +48 pp gains over screenshot baselines on DynaCU-Bench with no retraining.

Learning from Acquisition: Metadata-driven Multimodal Pre-training for Cardiac MRI

cs.CV · 2026-06-27 · unverdicted · novelty 7.0

MetaCLIP-CMR applies CLIP-style contrastive learning to cardiac MRI by treating acquisition metadata as text labels, delivering 86.8% modality and 86.5% view accuracy plus top Dice scores on ACDC/M&Ms segmentation with far less pre-training data than recent large-scale CMR models.

Unleashing Infinite Motion: Scaling Expressive Quadrupedal Motion via Generative Video Priors

cs.RO · 2026-06-26 · conditional · novelty 7.0

Uni-Mo generates 7,488 language-annotated quadruped motions via LLM prompts and video diffusion, lifts them to 3D trajectories, and trains policies achieving 96.7% real-robot success on 392 sampled motions.

Evaluation Pitfalls and Challenges in Multimedia Event Extraction

cs.CL · 2026-06-25 · unverdicted · novelty 7.0

A systematic analysis of evaluation practices in multimedia event extraction reveals that minor methodological choices cause large performance swings and overestimation of cross-modal grounding ability.

Text Dictates, Music Decorates: Energy-based Attention for Editable Dance Motion Generation

cs.AI · 2026-06-22 · unverdicted · novelty 7.0

STREAM decouples text and music conditioning in a diffusion transformer via AdaLN for structure and BEAM for beats, plus new Motorica++ dataset and editability metrics, claiming SOTA music alignment with preserved semantics.

Beyond 'One Language, One Script': Quantifying Orthographic Bias in Multilingual VLMs with PuMVR

cs.CL · 2026-06-18 · unverdicted · novelty 7.0

PuMVR benchmark shows VLMs exhibit script-dependent bias on Punjabi tasks with accuracy gaps up to 16% and script consistency rates as low as 24.8%, even when visual input is provided.

Contextualizing Biological Language Models across Modalities via Logit-Space Contrastive Alignment

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

LOGICA adds context to pretrained biological LMs via logit-space contrastive alignment with gated adapters, improving AUC on held-out drug-resistance mutation ranking from ~0.55 to ~0.65 while preserving token likelihoods.

Forged Calamity: Benchmark for Cross-Domain Synthetic Disaster Detection in the Age of Diffusion

cs.CV · 2026-06-17 · unverdicted · novelty 7.0

Introduces Forged Calamity benchmark and shows that fine-tuned and zero-shot synthetic image detectors lose substantial accuracy on unseen generators and disaster types.

Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation

cs.CV · 2026-06-15 · conditional · novelty 7.0

A new benchmark for Punjabi reveals VLMs have large script-dependent performance gaps on identical tasks, with consistency as low as 24.8 percent.

VISA: VLM-Guided Instance Semantic Auditing for 3D Occupancy World Models

cs.CV · 2026-06-11 · unverdicted · novelty 7.0

VISA improves closed-set 3D occupancy mIoU on nuScenes by using VLM instance audits as reliability-weighted semantic supervisors during training of existing world models.

Net-Ev$^2$: A Generative Simulator for Network Event Evolution

cs.LG · 2026-06-10 · unverdicted · novelty 7.0

Net-Ev² proposes a two-stage generative simulator with structure-guided masked pre-training and topology-aware diffusion using graph U-Net down/upsampling to model network event evolution from text inputs, plus a new 6.5M multimodal benchmark and JL-MMD metric.

When to Align, When to Predict: A Phase Diagram for Multimodal Learning

cs.LG · 2026-06-09 · accept · novelty 7.0

A spiked signal-plus-noise model yields separation ratios that partition multimodal problems into four regimes where alignment, prediction, both, or neither succeed.

ArtiFact: A Large-Scale Multi-Modal Cultural Heritage Dataset

cs.DB · 2026-06-08 · unverdicted · novelty 7.0

ArtiFact is a new multi-modal dataset of 651k museum records used to benchmark cross-modal error detection with seven error categories and semantic query processing challenges.

$A^2$: Smaller Self-Supervised ViTs Localize Better than Larger Ones

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

Smaller self-supervised ViTs localize objects better via attention than larger ViTs, enabling A² to decouple localization from feature extraction for competitive performance on distribution-shifted benchmarks.

The Regularizing Power of Language-Training Deepfake Detectors

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

A dual-encoder deepfake detector pairs a frozen specialist with a LoRA-tuned MLLM, trained first via binary alignment then via RL to reward explain-then-classify behavior, yielding improved cross-dataset performance and interpretability.

PInVerify: An Offline Embodied Benchmark for Active Instance Verification

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

PInVerify is a new offline embodied benchmark for active instance verification that supplies multi-view captures and 6-sector navigation topology, with MLLM baselines reaching 85.6% after fine-tuning but showing no reliable benefit from tested next-best-view strategies.

Dex2HOI: Dexterous Bimanual Two-Object Interaction Generation

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

Dex2HOI is a dual-stream diffusion model with bidirectional cross-attention and motion fusion that generates long bimanual single- and two-object HOI sequences from text at real-time speeds.

citing papers explorer

Showing 18 of 268 citing papers.

Are vision-language models ready to zero-shot replace supervised classification models in agriculture?

cs.CV · 2025-12-17 · unverdicted · none · ref 2 · internal anchor

Zero-shot VLMs reach at most 62% accuracy on agricultural classification tasks while supervised models like YOLO11 perform markedly higher, indicating they are not ready to replace task-specific systems.

Physics-Based Benchmarking Metrics for Multimodal Synthetic Images

cs.CV · 2025-11-19 · unverdicted · none · ref 1 · internal anchor

PCMDE is a three-stage metric that extracts multimodal features, fuses components with confidence weights, and applies LLM-based physics-guided reasoning to assess synthetic image quality beyond standard scores like BLEU or CLIPScore.

CLIP the Landscape: Automated Tagging of Crowdsourced Landscape Images

cs.CV · 2025-06-13 · unverdicted · none · ref 13 · internal anchor

A lightweight multi-modal CLIP pipeline predicts exact-match geographical tags on a Kaggle subset of the Geograph crowdsourced image archive by fusing image, location, and title embeddings.

Structured Labeling Enables Faster Vision-Language Models for End-to-End Autonomous Driving

cs.CV · 2025-06-05 · unverdicted · none · ref 18 · internal anchor

Introduces structured NuScenes-S dataset and 0.9B FastDrive VLM claiming 20% higher decision accuracy and over 10x inference speedup versus larger unstructured VLMs.

Improved Baselines with Visual Instruction Tuning

cs.CV · 2023-10-05 · conditional · none · ref 44 · internal anchor

Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.

Exploring Multi-Modal Large Language Models and Two-Stage Fine-Tuning for Fashion Image Retrieval

cs.CV · 2026-06-18 · unverdicted · none · ref 15 · internal anchor

Framework uses LLaVA for triplet generation and two-stage fine-tuning to enhance composed fashion image retrieval.

The Market in the Model: Latent Diffusion as Neural Economy

cs.CY · 2026-06-17 · unverdicted · none · ref 22 · internal anchor

Latent diffusion models function as a neural economy by abstracting social exchange into commensurable vectors that transfer the social sphere into parcels for sale.

Surveying GenAI-based Automation in Printed Circuit Board Design and Test

cs.AR · 2026-06-10 · unverdicted · none · ref 83 · internal anchor

Survey of GenAI in PCB design lifecycle presenting taxonomy, technical challenges, and research directions.

MIRAGE: Retrieval and Generation of Multimodal Images and Texts for Medical Education

cs.CV · 2026-05-06 · unverdicted · none · ref 6 · internal anchor

MIRAGE combines a medical CLIP model, a diffusion generator, and an LLM into an accessible interface for retrieving and creating educational medical images and texts.

Developing an AI Course for Synthetic Chemistry Students

cs.AI · 2025-11-23 · unverdicted · none · ref 1 · internal anchor

AI4CHEM is a beginner-friendly introductory course that teaches data-driven chemistry methods to synthetic chemistry students using web-based platforms, chemistry-specific examples, and active learning without requiring prior coding skills.

Multimodal Contextualized Support for Enhancing Video Retrieval System

cs.CV · 2024-12-10 · unverdicted · none · ref 10 · internal anchor

Proposes a multimodal pipeline for video retrieval that incorporates information from multiple frames to enable higher-level abstraction beyond single-image object detection.

Efficiently Linking Real Scenes with Synthetic Data Generation for AI-based Cognitive Robotics and Computer Vision Applications

cs.RO · 2026-06-18 · unverdicted · none · ref 14 · internal anchor

The paper reviews limits in AI vision for robotics and describes work-in-progress on bridging sim-to-real domain gaps by linking real and synthetic training data.

Multimodal Sexism Identification and Characterization using Large Language Models and Gradient Boosting

cs.CV · 2026-06-04 · unverdicted · none · ref 35 · internal anchor

A late-fusion gradient-boosting pipeline with LLM semantic features is submitted to the EXIST 2026 lab for sexism identification in memes and videos, showing mixed generalization from development to test data.

Advances in Neural 3D Mesh Texturing: A Survey

cs.CV · 2026-05-28 · unverdicted · none · ref 155 · internal anchor

A literature survey that organizes neural 3D mesh texturing methods into a taxonomy spanning early GAN-based approaches to modern diffusion pipelines, while reviewing architectures, datasets, evaluation, and open challenges.

Evaluation of Winning Solutions of 2025 Low Power Computer Vision Challenge

cs.CV · 2026-04-21 · unverdicted · none · ref 16 · internal anchor

The 2025 LPCVC winners demonstrate practical techniques for low-power image classification under varied conditions, open-vocabulary segmentation from text prompts, and monocular depth estimation.

MoDAl: Self-Supervised Neural Modality Discovery via Decorrelation for Speech Neuroprosthesis

q-bio.NC · 2026-04-22 · unreviewed · ref 22 · internal anchor

Beyond Relevance: On the Relationship Between Retrieval and RAG Information Coverage

cs.IR · 2026-03-09 · unreviewed · ref 49 · internal anchor

Synthetic Homes: A Multimodal Generative AI Pipeline for Residential Building Data Generation under Data Scarcity

cs.AI · 2025-09-11 · unreviewed · ref 35 · internal anchor
