Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. · 2021

3 Pith papers cite this work. Polarity classification is still being indexed.

representative citing papers

Boxes2Pixels: Learning Defect Segmentation from Noisy SAM Masks

cs.CV · 2026-04-13 · accept · novelty 7.0

Boxes2Pixels distills noisy SAM pseudo-masks into a compact DINOv2-based student with auxiliary localization and one-sided self-correction, delivering +6.97 anomaly mIoU and +9.71 binary IoU gains over baselines on wind turbine data with 80% fewer parameters.

DRG-Font: Dynamic Reference-Guided Few-shot Font Generation via Contrastive Style-Content Disentanglement

cs.CV · 2026-04-15 · unverdicted · novelty 6.0

DRG-Font generates stylistically consistent glyphs from few references by decomposing style and content via contrastive disentanglement, dynamic reference selection, and multi-scale fusion blocks.

Do Vision Language Models Need to Process Image Tokens?

cs.CV · 2026-04-10 · unverdicted · novelty 5.0

Visual representations in VLMs converge quickly to stable low-complexity forms while text continues evolving, with task-dependent needs for sustained image token access.

citing papers explorer

Showing 3 of 3 citing papers.

Boxes2Pixels: Learning Defect Segmentation from Noisy SAM Masks

cs.CV · 2026-04-13 · accept · none · ref 23

DRG-Font: Dynamic Reference-Guided Few-shot Font Generation via Contrastive Style-Content Disentanglement

cs.CV · 2026-04-15 · unverdicted · none · ref 29

Do Vision Language Models Need to Process Image Tokens?

cs.CV · 2026-04-10 · unverdicted · none · ref 4
