Learning to Prompt for Vision-Language Models. Int. J. Comput. Vis.
9 Pith papers cite this work, alongside 2,607 external citations. Polarity classification is still being indexed.
verdicts
UNVERDICTED · 9 representative citing papers
citing papers explorer
- An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
  Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts. A minimal illustrative sketch of the idea follows this list.
- Rethinking the Need for Source Models: Source-Free Domain Adaptation from Scratch Guided by a Vision-Language Model
  The paper introduces the VODA setting for domain adaptation from scratch using vision-language models and presents TS-DRD, which achieves competitive performance on standard benchmarks without source models.
- TB-AVA: Text as a Semantic Bridge for Audio-Visual Parameter Efficient Finetuning
  TB-AVA uses text-mediated gated semantic modulation to enable efficient audio-visual alignment, achieving state-of-the-art results on the AVE, AVS, and AVVP benchmarks.
- FACTOR: Counterfactual Training-Free Test-Time Adaptation for Open-Vocabulary Object Detection
  FACTOR uses counterfactual image perturbations to quantify and suppress attribute-dependent predictions in open-vocabulary object detection, improving robustness on corrupted datasets without any training.
- GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs
  GeoStack composes multiple domain experts into VLMs with preserved base knowledge and O(1) inference time via geometric stacking and a weight-folding property. A generic weight-folding sketch follows this list.
- Seeing Further and Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction
  The paper proposes a joint spatio-temporal enlargement model for micro-video popularity prediction, using frame scoring to handle long sequences and a topology-aware memory bank to capture unbounded historical associations.
- Reasoning-Guided Grounding: Elevating Video Anomaly Detection through Multimodal Large Language Models
  VANGUARD is a staged-training VLM framework that reports 94% ROC-AUC and 84% F1 on UCF-Crime while adding chain-of-thought reasoning and spatial grounding to video anomaly detection.
- Debunking Grad-ECLIP: A Comprehensive Study on Its Incorrectness and Fundamental Principles for Model Interpretation
  The study argues that Grad-ECLIP is an equivalent but flawed variant of attention-based interpretation and proposes two principles to ensure that model explanations reflect the original model.
- ProtoCLIP: Prototype-Aligned Latent Refinement for Robust Zero-Shot Chest X-Ray Classification
  ProtoCLIP improves zero-shot chest X-ray classification in CLIP models by 2-10 AUC points via curated data and prototype-aligned distillation, reaching 0.94 AUC for pneumothorax on VinDr-CXR. A generic prototype-alignment sketch follows this list.
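For the Textual Inversion entry above, here is a minimal, self-contained sketch of the core mechanism: a single new token embedding is optimized against a denoising objective while every other weight stays frozen. The tiny `token_embedding`, `text_encoder`, and `denoiser` modules are toy stand-ins so the loop runs end to end; they are not the paper's model or code.

```python
# Sketch of Textual Inversion's core loop: optimize ONE embedding vector for a
# placeholder token (e.g. "<my-concept>") against a denoising loss, with the
# rest of the text-to-image model frozen. The modules below are toy stand-ins.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, dim = 1000, 64

# Frozen pieces of the (toy) text-to-image model.
token_embedding = nn.Embedding(vocab_size, dim)
text_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=1,
)
denoiser = nn.Sequential(nn.Linear(dim * 2, 128), nn.ReLU(), nn.Linear(128, dim))
for module in (token_embedding, text_encoder, denoiser):
    for p in module.parameters():
        p.requires_grad_(False)

# The ONLY trainable parameter: the embedding of the new pseudo-token,
# initialized from an existing, roughly related token's embedding.
concept_embedding = nn.Parameter(token_embedding.weight[42].clone())
optimizer = torch.optim.AdamW([concept_embedding], lr=5e-3)

prompt_ids = torch.tensor([[11, 57, 0, 99]])  # toy ids for "a photo of <my-concept>"
placeholder_pos = 3                           # position of the placeholder token

for step in range(100):
    # Splice the learnable embedding into the otherwise frozen prompt embeddings.
    tok = token_embedding(prompt_ids)
    tok = torch.cat(
        [tok[:, :placeholder_pos], concept_embedding[None, None], tok[:, placeholder_pos + 1:]],
        dim=1,
    )
    cond = text_encoder(tok).mean(dim=1)  # pooled text conditioning

    # Toy denoising objective: predict the noise added to a "latent" that
    # stands in for an encoded image of the personal concept.
    latent = torch.randn(1, dim) * 0.1 + 1.0
    noise = torch.randn_like(latent)
    pred = denoiser(torch.cat([latent + noise, cond], dim=-1))
    loss = nn.functional.mse_loss(pred, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```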
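The GeoStack entry mentions composing experts with O(1) inference time through a weight-folding property. The paper's actual geometric construction is not reproduced here; the sketch below only illustrates the generic idea such claims typically rest on, namely that experts expressed as weight deltas over a shared base can be folded into a single merged layer offline, so inference cost does not grow with the number of experts. All names and numbers are hypothetical.

```python
# Illustrative sketch (not GeoStack itself): fold several expert "deltas" over a
# shared frozen base into one merged weight tensor, so a single forward pass
# serves all experts -- inference stays O(1) in the number of experts.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_out = 16, 8

base = nn.Linear(d_in, d_out)

# Each expert is represented only by its deviation from the base weights
# (e.g. obtained by fine-tuning a copy and subtracting the base afterwards).
expert_deltas = [torch.randn(d_out, d_in) * 0.01 for _ in range(3)]
mixing = torch.tensor([0.5, 0.3, 0.2])  # how strongly to keep each expert

# "Fold" the experts into the base once, offline.
with torch.no_grad():
    folded = nn.Linear(d_in, d_out)
    folded.load_state_dict(base.state_dict())
    for w, delta in zip(mixing, expert_deltas):
        folded.weight += w * delta

# At inference time only the folded layer is evaluated, regardless of how many
# experts were composed above.
x = torch.randn(4, d_in)
y = folded(x)
print(y.shape)  # torch.Size([4, 8])
```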
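The ProtoCLIP entry refers to prototype-aligned refinement for zero-shot classification. As a rough, generic illustration of what prototype alignment can mean (an assumption, not the paper's method), the sketch below builds one prototype per class from a handful of curated image embeddings and blends it with the class's text embedding before the usual cosine-similarity zero-shot scoring; all tensors are random stand-ins for CLIP outputs.

```python
# Generic prototype-alignment sketch (an assumption about the idea, not
# ProtoCLIP's implementation): blend each class's text embedding with a
# prototype computed from curated image embeddings of that class, then do the
# usual cosine-similarity zero-shot classification.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_classes, dim, shots = 5, 32, 8

# Stand-ins for CLIP outputs: one text embedding per class and a few curated
# image embeddings per class.
text_emb = F.normalize(torch.randn(num_classes, dim), dim=-1)
curated_img_emb = F.normalize(torch.randn(num_classes, shots, dim), dim=-1)

# Class prototypes: mean of the curated image embeddings, re-normalized.
prototypes = F.normalize(curated_img_emb.mean(dim=1), dim=-1)

# "Alignment": move each text embedding toward its image prototype.
alpha = 0.5
refined = F.normalize(alpha * text_emb + (1 - alpha) * prototypes, dim=-1)

# Zero-shot prediction for a query image embedding.
query = F.normalize(torch.randn(1, dim), dim=-1)
logits = query @ refined.t()
pred = logits.argmax(dim=-1)
print(pred.item())
```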