hub

International conference on machine learning , pages=

Learning transferable visual models from natural language supervision , author= · 2021

18 Pith papers cite this work. Polarity classification is still indexing.

18 Pith papers citing it

browse 18 citing papers

hub tools

JSON dossier citing papers JSON

representative citing papers

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.

DirectTryOn: One-Step Virtual Try-On via Straightened Conditional Transport

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

DirectTryOn achieves state-of-the-art one-step virtual try-on performance by applying pure conditional transport, garment preservation loss, and self-consistency loss to straighten trajectories in pretrained generative models.

Beyond Bag-of-Patches: Learning Global Layout via Textual Supervision for Late-Interaction Visual Document Retrieval

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

A text-supervised global layout embedding augments local patch representations in late-interaction VDR, yielding +2.4 nDCG@5 and +2.3 MAP@5 gains over ColPali/ColQwen baselines on ViDoRe-v2.

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

cs.CV · 2023-10-06 · unverdicted · novelty 7.0

Latent Consistency Models enable high-fidelity text-to-image generation in 2-4 steps by directly predicting solutions to the probability flow ODE in latent space, distilled from pre-trained LDMs.

Letting the neural code speak: Automated characterization of monkey visual neurons through human language

q-bio.NC · 2026-05-12 · unverdicted · novelty 6.0

Natural-language descriptions generated and verified through generative models and digital twins capture the selectivity of most neurons in macaque V1 and V4.

Plan in Sandbox, Navigate in Open Worlds: Learning Physics-Grounded Abstracted Experience for Embodied Navigation

cs.RO · 2026-05-11 · unverdicted · novelty 6.0

SAGE trains agents in physics-grounded semantic abstractions via RL with asymmetric clipping, achieving 53.21% LLM-Match Success on A-EQA (+9.7% over baseline) and encouraging physical robot transfer.

Post-hoc Selective Classification for Reliable Synthetic Image Detection

cs.CV · 2026-05-09 · unverdicted · novelty 6.0

ReSIDe generalizes logit-based confidence scores to intermediate layers of synthetic image detectors and uses preference optimization to aggregate them, cutting area under the risk-coverage curve by up to 69.55% under covariate shifts.

Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation

cs.RO · 2026-05-08 · unverdicted · novelty 6.0

Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.

Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression

cs.CL · 2026-05-02 · unverdicted · novelty 6.0

A plug-and-play RL method adds batch-level distributional supervision via CCC rewards to reduce regression-to-the-mean in MLLMs on imbalanced regression benchmarks.

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

cs.CV · 2024-12-19 · unverdicted · novelty 6.0

Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.

CameraCtrl: Enabling Camera Control for Text-to-Video Generation

cs.CV · 2024-04-02 · unverdicted · novelty 6.0

CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.

Revisiting Feature Prediction for Learning Visual Representations from Video

cs.CV · 2024-02-15 · conditional · novelty 6.0

V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.

R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow

cs.CV · 2026-05-13 · unverdicted · novelty 5.0

R-DMesh uses a VAE with a learned rectification jump offset and Triflow Attention inside a rectified-flow diffusion transformer to produce video-aligned 4D meshes despite initial pose misalignment.

APEX: Assumption-free Projection-based Embedding eXamination Metric for Image Quality Assessment

cs.CV · 2026-05-08 · unverdicted · novelty 5.0

APEX is an assumption-free image quality metric using Sliced Wasserstein Distance on CLIP and DINOv2 embeddings that claims superior robustness to degradations and cross-dataset stability.

Neurosymbolic Framework for Concept-Driven Logical Reasoning in Skeleton-Based Human Action Recognition

cs.CV · 2026-05-08 · unverdicted · novelty 5.0

Neurosymbolic framework grounds skeleton motion in learnable pose and dynamics concepts then reasons over them with differentiable logic to recognize actions interpretably on NTU and NW-UCLA benchmarks.

CERSA: Cumulative Energy-Retaining Subspace Adaptation for Memory-Efficient Fine-Tuning

cs.LG · 2026-05-05 · unverdicted · novelty 5.0

CERSA derives low-rank fine-tuning subspaces from SVD principal components that retain 90-95% spectral energy, delivering higher performance than LoRA and other PEFT baselines at substantially lower memory cost across vision, generation, and language tasks.

Structured Diffusion Bridges: Inductive Bias for Denoising Diffusion Bridges

cs.LG · 2026-05-03 · unverdicted · novelty 5.0

A structured diffusion bridge method achieves near fully-paired modality translation quality using alignment constraints even in unpaired or semi-paired regimes.

Neuroscience-Inspired Analyses of Visual Interestingness in Multimodal Transformers

cs.CV · 2026-05-05 · unverdicted · novelty 4.0

Human visual interestingness is linearly decodable from final-layer embeddings in Qwen3-VL-8B and becomes progressively more structured across vision and language layers without explicit supervision.

citing papers explorer

Showing 18 of 18 citing papers.

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation cs.CR · 2026-05-11 · unverdicted · none · ref 41
M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.
DirectTryOn: One-Step Virtual Try-On via Straightened Conditional Transport cs.CV · 2026-05-13 · unverdicted · none · ref 41
DirectTryOn achieves state-of-the-art one-step virtual try-on performance by applying pure conditional transport, garment preservation loss, and self-consistency loss to straighten trajectories in pretrained generative models.
Beyond Bag-of-Patches: Learning Global Layout via Textual Supervision for Late-Interaction Visual Document Retrieval cs.CV · 2026-05-08 · unverdicted · none · ref 15
A text-supervised global layout embedding augments local patch representations in late-interaction VDR, yielding +2.4 nDCG@5 and +2.3 MAP@5 gains over ColPali/ColQwen baselines on ViDoRe-v2.
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference cs.CV · 2023-10-06 · unverdicted · none · ref 39
Latent Consistency Models enable high-fidelity text-to-image generation in 2-4 steps by directly predicting solutions to the probability flow ODE in latent space, distilled from pre-trained LDMs.
Letting the neural code speak: Automated characterization of monkey visual neurons through human language q-bio.NC · 2026-05-12 · unverdicted · none · ref 56
Natural-language descriptions generated and verified through generative models and digital twins capture the selectivity of most neurons in macaque V1 and V4.
Plan in Sandbox, Navigate in Open Worlds: Learning Physics-Grounded Abstracted Experience for Embodied Navigation cs.RO · 2026-05-11 · unverdicted · none · ref 21
SAGE trains agents in physics-grounded semantic abstractions via RL with asymmetric clipping, achieving 53.21% LLM-Match Success on A-EQA (+9.7% over baseline) and encouraging physical robot transfer.
Post-hoc Selective Classification for Reliable Synthetic Image Detection cs.CV · 2026-05-09 · unverdicted · none · ref 48
ReSIDe generalizes logit-based confidence scores to intermediate layers of synthetic image detectors and uses preference optimization to aggregate them, cutting area under the risk-coverage curve by up to 69.55% under covariate shifts.
Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation cs.RO · 2026-05-08 · unverdicted · none · ref 82
Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression cs.CL · 2026-05-02 · unverdicted · none · ref 66
A plug-and-play RL method adds batch-level distributional supervision via CCC rewards to reduce regression-to-the-mean in MLLMs on imbalanced regression benchmarks.
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations cs.CV · 2024-12-19 · unverdicted · none · ref 47
Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
CameraCtrl: Enabling Camera Control for Text-to-Video Generation cs.CV · 2024-04-02 · unverdicted · none · ref 77
CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
Revisiting Feature Prediction for Learning Visual Representations from Video cs.CV · 2024-02-15 · conditional · none · ref 171
V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow cs.CV · 2026-05-13 · unverdicted · none · ref 89
R-DMesh uses a VAE with a learned rectification jump offset and Triflow Attention inside a rectified-flow diffusion transformer to produce video-aligned 4D meshes despite initial pose misalignment.
APEX: Assumption-free Projection-based Embedding eXamination Metric for Image Quality Assessment cs.CV · 2026-05-08 · unverdicted · none · ref 10
APEX is an assumption-free image quality metric using Sliced Wasserstein Distance on CLIP and DINOv2 embeddings that claims superior robustness to degradations and cross-dataset stability.
Neurosymbolic Framework for Concept-Driven Logical Reasoning in Skeleton-Based Human Action Recognition cs.CV · 2026-05-08 · unverdicted · none · ref 52
Neurosymbolic framework grounds skeleton motion in learnable pose and dynamics concepts then reasons over them with differentiable logic to recognize actions interpretably on NTU and NW-UCLA benchmarks.
CERSA: Cumulative Energy-Retaining Subspace Adaptation for Memory-Efficient Fine-Tuning cs.LG · 2026-05-05 · unverdicted · none · ref 59
CERSA derives low-rank fine-tuning subspaces from SVD principal components that retain 90-95% spectral energy, delivering higher performance than LoRA and other PEFT baselines at substantially lower memory cost across vision, generation, and language tasks.
Structured Diffusion Bridges: Inductive Bias for Denoising Diffusion Bridges cs.LG · 2026-05-03 · unverdicted · none · ref 61
A structured diffusion bridge method achieves near fully-paired modality translation quality using alignment constraints even in unpaired or semi-paired regimes.
Neuroscience-Inspired Analyses of Visual Interestingness in Multimodal Transformers cs.CV · 2026-05-05 · unverdicted · none · ref 19
Human visual interestingness is linearly decodable from final-layer embeddings in Qwen3-VL-8B and becomes progressively more structured across vision and language layers without explicit supervision.

International conference on machine learning , pages=

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer