super hub Mixed citations

Learning Transferable Visual Models From Natural Language Supervision

Aditya Ramesh, Alec Radford, Chris Hallacy, Gabriel Goh, Jong Wook Kim, Sandhini Agarwal · 2021 · cs.CV · arXiv 2103.00020

Mixed citation behavior. Most common role is background (69%).

254 Pith papers citing it

Background 69% of classified citations

open full Pith review browse 254 citing papers more from Aditya Ramesh arXiv PDF

abstract

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 36 method 8 baseline 4 other 1

citation-polarity summary

background 34 use method 8 baseline 4 unclear 2 support 1

claims ledger

abstract State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (i

authors

Aditya Ramesh Alec Radford Chris Hallacy Gabriel Goh Jong Wook Kim Sandhini Agarwal

co-cited works

representative citing papers

Unlocking the Visual Record of Materials Science: A Large-Scale Multimodal Dataset from Scientific Literature

cs.CV · 2026-06-29 · accept · novelty 8.0

MatMMExtract pipeline creates MatSciFig dataset of 391k annotated materials science figure panels and MaterialScope detection dataset with high accuracy.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

Editing Models with Task Arithmetic

cs.LG · 2022-12-08 · accept · novelty 8.0

Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.

Prompt-to-Prompt Image Editing with Cross Attention Control

cs.CV · 2022-08-02 · unverdicted · novelty 8.0

Cross-attention control in text-conditioned models enables localized and global image edits by editing only the input text prompt.

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

cs.CV · 2022-08-02 · unverdicted · novelty 8.0

Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.

DART: Difficulty-Adaptive Routing for Zero-Shot Video Temporal Grounding

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

DART routes zero-shot video temporal grounding queries by difficulty using DPP entropy, achieving up to 3.5 mIoU gains with 7x fewer frames on Charades-STA and ActivityNet Captions.

SonoCLIP: Mask-Guided Region-Aware Vision-Language Pretraining for Fetal Ultrasound Analysis

cs.CV · 2026-06-28 · unverdicted · novelty 7.0

SonoCLIP presents a mask-guided region-aware vision-language foundation model pretrained on 1.44M fetal ultrasound images, demonstrating superior zero-shot performance.

Agent-Computer Observation Interfaces Enable Dynamic Computer Use

cs.AI · 2026-06-28 · conditional · novelty 7.0

AOI adds keyframe capture, volume-gated audio transcription, and visual narration to computer-use agents, producing +17 to +48 pp gains over screenshot baselines on DynaCU-Bench with no retraining.

Learning from Acquisition: Metadata-driven Multimodal Pre-training for Cardiac MRI

cs.CV · 2026-06-27 · unverdicted · novelty 7.0

MetaCLIP-CMR applies CLIP-style contrastive learning to cardiac MRI by treating acquisition metadata as text labels, delivering 86.8% modality and 86.5% view accuracy plus top Dice scores on ACDC/M&Ms segmentation with far less pre-training data than recent large-scale CMR models.

Unleashing Infinite Motion: Scaling Expressive Quadrupedal Motion via Generative Video Priors

cs.RO · 2026-06-26 · conditional · novelty 7.0

Uni-Mo generates 7,488 language-annotated quadruped motions via LLM prompts and video diffusion, lifts them to 3D trajectories, and trains policies achieving 96.7% real-robot success on 392 sampled motions.

Evaluation Pitfalls and Challenges in Multimedia Event Extraction

cs.CL · 2026-06-25 · unverdicted · novelty 7.0

A systematic analysis of evaluation practices in multimedia event extraction reveals that minor methodological choices cause large performance swings and overestimation of cross-modal grounding ability.

Text Dictates, Music Decorates: Energy-based Attention for Editable Dance Motion Generation

cs.AI · 2026-06-22 · unverdicted · novelty 7.0

STREAM decouples text and music conditioning in a diffusion transformer via AdaLN for structure and BEAM for beats, plus new Motorica++ dataset and editability metrics, claiming SOTA music alignment with preserved semantics.

Contextualizing Biological Language Models across Modalities via Logit-Space Contrastive Alignment

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

LOGICA adds context to pretrained biological LMs via logit-space contrastive alignment with gated adapters, improving AUC on held-out drug-resistance mutation ranking from ~0.55 to ~0.65 while preserving token likelihoods.

Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation

cs.CV · 2026-06-15 · conditional · novelty 7.0

A new benchmark for Punjabi reveals VLMs have large script-dependent performance gaps on identical tasks, with consistency as low as 24.8 percent.

VISA: VLM-Guided Instance Semantic Auditing for 3D Occupancy World Models

cs.CV · 2026-06-11 · unverdicted · novelty 7.0

VISA improves closed-set 3D occupancy mIoU on nuScenes by using VLM instance audits as reliability-weighted semantic supervisors during training of existing world models.

Net-Ev$^2$: A Generative Simulator for Network Event Evolution

cs.LG · 2026-06-10 · unverdicted · novelty 7.0

Net-Ev² proposes a two-stage generative simulator with structure-guided masked pre-training and topology-aware diffusion using graph U-Net down/upsampling to model network event evolution from text inputs, plus a new 6.5M multimodal benchmark and JL-MMD metric.

When to Align, When to Predict: A Phase Diagram for Multimodal Learning

cs.LG · 2026-06-09 · accept · novelty 7.0

A spiked signal-plus-noise model yields separation ratios that partition multimodal problems into four regimes where alignment, prediction, both, or neither succeed.

ArtiFact: A Large-Scale Multi-Modal Cultural Heritage Dataset

cs.DB · 2026-06-08 · unverdicted · novelty 7.0

ArtiFact is a new multi-modal dataset of 651k museum records used to benchmark cross-modal error detection with seven error categories and semantic query processing challenges.

$A^2$: Smaller Self-Supervised ViTs Localize Better than Larger Ones

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

Smaller self-supervised ViTs localize objects better via attention than larger ViTs, enabling A² to decouple localization from feature extraction for competitive performance on distribution-shifted benchmarks.

The Regularizing Power of Language-Training Deepfake Detectors

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

A dual-encoder deepfake detector pairs a frozen specialist with a LoRA-tuned MLLM, trained first via binary alignment then via RL to reward explain-then-classify behavior, yielding improved cross-dataset performance and interpretability.

PInVerify: An Offline Embodied Benchmark for Active Instance Verification

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

PInVerify is a new offline embodied benchmark for active instance verification that supplies multi-view captures and 6-sector navigation topology, with MLLM baselines reaching 85.6% after fine-tuning but showing no reliable benefit from tested next-best-view strategies.

Dex2HOI: Dexterous Bimanual Two-Object Interaction Generation

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

Dex2HOI is a dual-stream diffusion model with bidirectional cross-attention and motion fusion that generates long bimanual single- and two-object HOI sequences from text at real-time speeds.

Colosseum V2: Benchmarking Generalization for Vision Language Action Models

cs.RO · 2026-05-26 · unverdicted · novelty 7.0

Introduces Colosseum V2 benchmark for evaluating VLA model generalization in robotic manipulation with 28 tasks, revealing limitations in current methods and sim-real correlations.

Once-For-All: A Train-Once and Select-Anytime Framework for Multimodal Instruction Tuning

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

A selector trained once on LLaVA-665K in CLIP space selects 15% of instructions to reach 98.3% of full-data performance and generalizes to an unseen dataset and different VLMs.

citing papers explorer

Showing 33 of 33 citing papers after filters.

Editing Models with Task Arithmetic cs.LG · 2022-12-08 · accept · none · ref 84 · internal anchor
Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.
Contextualizing Biological Language Models across Modalities via Logit-Space Contrastive Alignment cs.LG · 2026-06-17 · unverdicted · none · ref 40 · internal anchor
LOGICA adds context to pretrained biological LMs via logit-space contrastive alignment with gated adapters, improving AUC on held-out drug-resistance mutation ranking from ~0.55 to ~0.65 while preserving token likelihoods.
Net-Ev$^2$: A Generative Simulator for Network Event Evolution cs.LG · 2026-06-10 · unverdicted · none · ref 35 · internal anchor
Net-Ev² proposes a two-stage generative simulator with structure-guided masked pre-training and topology-aware diffusion using graph U-Net down/upsampling to model network event evolution from text inputs, plus a new 6.5M multimodal benchmark and JL-MMD metric.
When to Align, When to Predict: A Phase Diagram for Multimodal Learning cs.LG · 2026-06-09 · accept · none · ref 32 · internal anchor
A spiked signal-plus-noise model yields separation ratios that partition multimodal problems into four regimes where alignment, prediction, both, or neither succeed.
PMF-CL: Pareto-Minimal-Forgetting Continual Learner for Conflicting Tasks cs.LG · 2026-05-18 · unverdicted · none · ref 14 · 2 links · internal anchor
PMF-CL derives Pareto-minimal-forgetting algorithms for linear/basis-function regression and quadratic-bounded losses like logistic regression, achieving static O(d²) memory for d-parameter models.
Seeking the Unfamiliar but Memorable: Conceptual Creativity as Meta-Learning cs.LG · 2026-05-15 · unverdicted · none · ref 35 · internal anchor
Creativity is defined as meta-learning where a frozen diffusion creator optimizes candidates for rapid improvement by an adapting appraiser such as an autoencoder or CLIP adapter.
Characterizing Learning in Deep Neural Networks using Tractable Algorithmic Complexity Analysis cs.LG · 2026-05-15 · unverdicted · none · ref 60 · internal anchor
QuBD extends algorithmic complexity estimation to quantized DNN weights, revealing that complexity decreases during learning, increases with overfitting, follows grokking patterns, and correlates with generalization.
SMA: Submodular Modality Aligner For Data Efficient Multimodal Learning cs.LG · 2026-05-13 · unverdicted · none · ref 44 · internal anchor
SMA uses a submodular mutual information objective on data sets to deliver competitive zero-shot classification and retrieval performance on CLIP benchmarks with only tens of thousands of samples, orders of magnitude fewer than standard approaches.
Runtime Monitoring of Perception-Based Autonomous Systems via Embedding Temporal Logic cs.LG · 2026-05-12 · unverdicted · none · ref 86 · 2 links · internal anchor
Embedding Temporal Logic (ETL) performs runtime monitoring directly in learned embedding spaces using distance-based predicates composed with temporal operators, supported by conformal calibration for reliable predicate evaluation.
Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models cs.LG · 2026-05-07 · unverdicted · none · ref 48 · internal anchor
Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.
TCD-Arena: Assessing Robustness of Time Series Causal Discovery Methods Against Assumption Violations cs.LG · 2026-05-04 · unverdicted · none · ref 50 · internal anchor
TCD-Arena is a new customizable testing framework that runs millions of experiments to map how 33 different assumption violations affect time series causal discovery methods and shows ensembles can boost overall robustness.
Bottleneck Tokens for Unified Multimodal Retrieval cs.LG · 2026-04-13 · unverdicted · none · ref 19 · internal anchor
Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.
Self-Directed Task Identification cs.LG · 2026-04-02 · unverdicted · none · ref 10 · internal anchor
SDTI lets models identify the correct target variable in datasets in a zero-shot setting using standard neural networks, beating baselines by 14% F1 on synthetic benchmarks.
Diffusion Models Beat GANs on Image Synthesis cs.LG · 2021-05-11 · accept · none · ref 49 · internal anchor
Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.
GeoGNN: Time Series Geo-Localization using Two-Tower Graph Neural Networks cs.LG · 2026-06-06 · unverdicted · none · ref 47 · internal anchor
GeoGNN is a two-tower GNN that learns geographic cell embeddings from adjacency graphs and matches them to temporal representations via dot-product similarity plus classification, improving geolocalization accuracy by ~27% on electricity datasets.
From Symbolic to Geometric: Enabling Spatial Reasoning in Large Language Models cs.LG · 2026-06-03 · unverdicted · none · ref 23 · internal anchor
SLM adds a dedicated spatial modality and training dataset to LLMs, enabling geometric spatial reasoning and outperforming prompt-based symbolic methods on the new SpatialEval benchmark.
MIRAGE: Adaptive Multimodal Gating for Whole-Brain fMRI Encoding cs.LG · 2026-05-28 · unverdicted · none · ref 38 · internal anchor
MIRAGE uses adaptive multimodal gating on native multimodal backbones plus a transformer encoder to achieve state-of-the-art whole-brain fMRI prediction for naturalistic audiovisual stimuli, outperforming post-hoc unimodal aggregation.
Cross-Modal Contrastive Learning of ECG and Angiography Representations for Severe Stenosis Classification cs.LG · 2026-05-23 · unverdicted · none · ref 15 · internal anchor
StenCE uses cross-modal contrastive learning on paired ECG-angiography data to learn ECG features that classify severe coronary stenosis, reporting the first high performance on this task.
Reward Shaping and Action Masking for Compositional Tasks using Behavior Trees and LLMs cs.LG · 2026-05-07 · unverdicted · none · ref 1 · internal anchor
MRBTs are LLM-generated, SMT-verified behavior trees that supply modular reward functions and action masks, improving RL training efficiency and success rates on five compositional tasks over baselines.
ADAPTive Input Training for Many-to-One Pre-Training on Time-Series Classification cs.LG · 2026-04-09 · unverdicted · none · ref 22 · internal anchor
ADAPT is a new pre-training paradigm that aligns physical properties of time-series data to allow simultaneous training on 162 diverse classification datasets, achieving new state-of-the-art performance.
LLM-Generated Fault Scenarios for Evaluating Perception-Driven Lane Following in Autonomous Edge Systems cs.LG · 2026-04-01 · conditional · none · ref 9 · internal anchor
A decoupled offline-online framework uses LLMs and latent diffusion models to generate fault scenarios for testing edge-based lane-following models, revealing large robustness drops under conditions like fog.
TiledAttention: a CUDA Tile SDPA Kernel for PyTorch cs.LG · 2026-03-02 · unverdicted · none · ref 15 · internal anchor
TiledAttention is a cuTile-based SDPA kernel that balances performance with Python-level customizability for attention research in PyTorch.
Scalable Option Learning in High-Throughput Environments cs.LG · 2025-08-30 · unverdicted · none · ref 52 · internal anchor
SOL is a new hierarchical RL algorithm that reaches 35x higher throughput and outperforms flat agents when trained on 30 billion frames in NetHack while showing positive scaling.
HeartcareGPT: A Unified Multimodal ECG Suite for Dual Signal-Image Modeling and Understanding cs.LG · 2025-06-06 · unverdicted · none · ref 18 · internal anchor
HeartcareGPT proposes Dual Stream Projection Alignment (DSPA) on a structure-aware tokenizer for unified ECG signal-image modeling, supported by Heartcare-400K dataset and Heartcare-Bench.
Training Diffusion Models with Reinforcement Learning cs.LG · 2023-05-22 · unverdicted · none · ref 21 · internal anchor
DDPO uses policy gradients on the denoising process to optimize diffusion models for arbitrary rewards like human feedback or compressibility.
Scaling Laws and Interpretability of Learning from Repeated Data cs.LG · 2022-05-21 · accept · none · ref 1 · internal anchor
Repeating 0.1% of training data 100 times degrades an 800M parameter model's performance to that of a 400M model by damaging copying mechanisms and induction heads associated with generalization.
Tunable MAGMAX: Preference-Aware Model Merging for Continual Learning cs.LG · 2026-05-20 · unverdicted · none · ref 16 · internal anchor
Tunable MAGMAX adds a tunable preference vector to model merging for continual learning, enabling automatic adaptation to target environments using small amounts of data while maintaining or improving task-wise performance.
Anchoring the Eigengap: Cross-Modal Spectral Stabilization for Sample-Efficient Representation Learning cs.LG · 2026-05-09 · unverdicted · none · ref 18 · internal anchor
Finite-sample noise collapses the eigengap in representation covariances limiting recoverable modes K(N); multimodal learning stabilizes it via low-rank constraints, yielding better class separation quantified by truncated Mahalanobis energy approximated with a zeta function.
Bolek: A Multimodal Language Model for Molecular Reasoning cs.LG · 2026-05-04 · unverdicted · none · ref 21 · internal anchor
Bolek injects Morgan fingerprint embeddings into an instruction-tuned text model, then fine-tunes on molecular alignment and synthetic chain-of-thought tasks to improve performance and grounding on 15 TDC binary classification endpoints while generalizing to unseen tasks.
Self-Predictive Representations for Combinatorial Generalization in Behavioral Cloning cs.LG · 2025-06-11 · unverdicted · none · ref 43 · internal anchor
BYOL-γ uses self-predictive representations to approximate successor representations, improving zero-shot combinatorial generalization in goal-conditioned behavioral cloning.
At the Edge of Understanding: Sparse Autoencoders Trace The Limits of Transformer Generalization cs.LG · 2026-06-24 · unverdicted · none · ref 26 · internal anchor
Sparse autoencoders show OOD prompts increase fallacious concept activation in transformers, offering a mechanistic measure of shift and a path to robust fine-tuning.
ProtoCLIP: Prototype-Aligned Latent Refinement for Robust Zero-Shot Chest X-Ray Classification cs.LG · 2026-04-20 · unverdicted · none · ref 9 · internal anchor
ProtoCLIP improves zero-shot chest X-ray classification in CLIP models by 2-10 AUC points via curated data and prototype-aligned distillation, reaching 0.94 AUC for pneumothorax on VinDr-CXR.
Matched-Learning-Rate Analysis of Attention Drift and Transfer Retention in Fine-Tuned CLIP cs.LG · 2026-04-01 · unverdicted · none · ref 13 · internal anchor
Matched learning-rate experiments show LoRA retains substantially higher zero-shot transfer (45% vs 11% on EuroSAT, 58% vs 9% on Pets) than Full FT in CLIP adaptation.

Learning Transferable Visual Models From Natural Language Supervision

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer