Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. · 2021

13 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2 · method 1

citation-polarity summary

background 2 · use method 1

representative citing papers

Representation Fréchet Loss for Visual Generation

cs.CV · 2026-04-30 · unverdicted · novelty 8.0

Fréchet Distance optimized as FD-loss in representation space by decoupling population size from batch size improves generator quality, enables one-step generation from multi-step models, and motivates a multi-representation metric FDr^k.

TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

cs.LG · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

TMPO uses Softmax Trajectory Balance to match policy probabilities over multiple trajectories to a Boltzmann reward distribution, improving diversity by 9.1% in diffusion alignment tasks.

Delta-Adapter: Scalable Exemplar-Based Image Editing with Single-Pair Supervision

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

Delta-Adapter extracts a semantic delta from a single image pair via a pre-trained vision encoder and injects it through a Perceiver adapter to enable scalable single-pair supervised editing.

WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models

cs.CV · 2026-05-05 · unverdicted · novelty 7.0 · 2 refs

WorldJen is a new benchmark for generative video models that uses VLM-judged multi-dimensional Likert questionnaires validated against human preferences to achieve perfect tier agreement.

Cambrian-P: Pose-Grounded Video Understanding

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

Cambrian-P adds per-frame camera pose tokens and a regression head to video MLLMs, delivering 4.5-6.5% gains on spatial benchmarks, generalization to other video QA tasks, and SOTA streaming pose estimation on ScanNet.

4D-GSW: Kinematic-Aware Spatio-Temporal Consistent Watermarking for 4D Gaussian Splatting

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

4D-GSW introduces a kinematic-aware spatio-temporal watermarking framework for 4D Gaussian Splatting that uses a Spatio-Temporal Curvature metric and HMM-MRF model to maintain consistency under attacks.

PersonaVLM: Long-Term Personalized Multimodal LLMs

cs.CL · 2026-03-20 · unverdicted · novelty 6.0

PersonaVLM adds memory extraction, multi-turn retrieval-based reasoning, and personality inference to multimodal LLMs, yielding 22.4% gains on a new long-term personalization benchmark and outperforming GPT-4o.

Evaluating Concept Filtering Defenses against Child Sexual Abuse Material Generation by Text-to-Image Models

cs.CR · 2025-12-05 · unverdicted · novelty 6.0

Concept filtering of child images from training data offers only limited protection against CSAM generation in text-to-image models, as prompting strategies and fine-tuning can bypass filters even when most child images are removed.

Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views

cs.CV · 2025-11-17 · unverdicted · novelty 6.0

Uni-Hand forecasts 2D/3D hand waypoints, head motion, and contact states in egocentric views using vision-language fusion and dual-branch diffusion, with new benchmarks for downstream robotics and action tasks.

Cambrian-S: Towards Spatial Supersensing in Video

cs.CV · 2025-11-06 · unverdicted · novelty 6.0

Cambrian-S introduces VSI-SUPER benchmarks for long-horizon spatial recall and counting, shows data scaling yields 30% gains on existing tests, and demonstrates a self-supervised next-latent predictor using surprise outperforms baselines on the new spatial supersensing tasks.

Swift Sampling: Selecting Temporal Surprises via Taylor Series

cs.CV · 2026-05-21 · unverdicted · novelty 5.0

Swift Sampling is a training-free frame selection method that uses Taylor expansions on video latent trajectories to pick temporally surprising frames, outperforming uniform sampling on long-video QA tasks.

WorldAct: Activating Monolithic 3D Worlds into Interactive-Ready Object-Centric Scenes

cs.CV · 2026-05-15 · unverdicted · novelty 5.0

WorldAct activates monolithic 3D worlds into interactive scenes via multimodal agent-guided decomposition, geometrically aligned mesh reconstruction, and 3D inpainting.

Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment

cs.CV · 2025-08-21 · unverdicted · novelty 5.0

ADAPT reframes test-time adaptation as probabilistic Gaussian inference with CLIP-guided regularization, delivering SOTA results without gradients, source data, or full target access in both online and transductive settings.

citing papers explorer

Showing 13 of 13 citing papers.

Representation Fréchet Loss for Visual Generation cs.CV · 2026-04-30 · unverdicted · none · ref 40
Fréchet Distance optimized as FD-loss in representation space by decoupling population size from batch size improves generator quality, enables one-step generation from multi-step models, and motivates a multi-representation metric FDr^k.
TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment cs.LG · 2026-05-09 · unverdicted · none · ref 46 · 2 links
TMPO uses Softmax Trajectory Balance to match policy probabilities over multiple trajectories to a Boltzmann reward distribution, improving diversity by 9.1% in diffusion alignment tasks.
Delta-Adapter: Scalable Exemplar-Based Image Editing with Single-Pair Supervision cs.CV · 2026-05-08 · unverdicted · none · ref 42
Delta-Adapter extracts a semantic delta from a single image pair via a pre-trained vision encoder and injects it through a Perceiver adapter to enable scalable single-pair supervised editing.
WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models cs.CV · 2026-05-05 · unverdicted · none · ref 23 · 2 links
WorldJen is a new benchmark for generative video models that uses VLM-judged multi-dimensional Likert questionnaires validated against human preferences to achieve perfect tier agreement.
Cambrian-P: Pose-Grounded Video Understanding cs.CV · 2026-05-21 · unverdicted · none · ref 72
Cambrian-P adds per-frame camera pose tokens and a regression head to video MLLMs, delivering 4.5-6.5% gains on spatial benchmarks, generalization to other video QA tasks, and SOTA streaming pose estimation on ScanNet.
4D-GSW: Kinematic-Aware Spatio-Temporal Consistent Watermarking for 4D Gaussian Splatting cs.CV · 2026-05-21 · unverdicted · none · ref 26
4D-GSW introduces a kinematic-aware spatio-temporal watermarking framework for 4D Gaussian Splatting that uses a Spatio-Temporal Curvature metric and HMM-MRF model to maintain consistency under attacks.
PersonaVLM: Long-Term Personalized Multimodal LLMs cs.CL · 2026-03-20 · unverdicted · none · ref 33
PersonaVLM adds memory extraction, multi-turn retrieval-based reasoning, and personality inference to multimodal LLMs, yielding 22.4% gains on a new long-term personalization benchmark and outperforming GPT-4o.
Evaluating Concept Filtering Defenses against Child Sexual Abuse Material Generation by Text-to-Image Models cs.CR · 2025-12-05 · unverdicted · none · ref 30
Concept filtering of child images from training data offers only limited protection against CSAM generation in text-to-image models, as prompting strategies and fine-tuning can bypass filters even when most child images are removed.
Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views cs.CV · 2025-11-17 · unverdicted · none · ref 64
Uni-Hand forecasts 2D/3D hand waypoints, head motion, and contact states in egocentric views using vision-language fusion and dual-branch diffusion, with new benchmarks for downstream robotics and action tasks.
Cambrian-S: Towards Spatial Supersensing in Video cs.CV · 2025-11-06 · unverdicted · none · ref 105
Cambrian-S introduces VSI-SUPER benchmarks for long-horizon spatial recall and counting, shows data scaling yields 30% gains on existing tests, and demonstrates a self-supervised next-latent predictor using surprise outperforms baselines on the new spatial supersensing tasks.
Swift Sampling: Selecting Temporal Surprises via Taylor Series cs.CV · 2026-05-21 · unverdicted · none · ref 73
Swift Sampling is a training-free frame selection method that uses Taylor expansions on video latent trajectories to pick temporally surprising frames, outperforming uniform sampling on long-video QA tasks.
WorldAct: Activating Monolithic 3D Worlds into Interactive-Ready Object-Centric Scenes cs.CV · 2026-05-15 · unverdicted · none · ref 46
WorldAct activates monolithic 3D worlds into interactive scenes via multimodal agent-guided decomposition, geometrically aligned mesh reconstruction, and 3D inpainting.
Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment cs.CV · 2025-08-21 · unverdicted · none · ref 39
ADAPT reframes test-time adaptation as probabilistic Gaussian inference with CLIP-guided regularization, delivering SOTA results without gradients, source data, or full target access in both online and transductive settings.
