Learning to Prompt for Vision-Language Models. Int. J. Comput. Vis.
9 Pith papers cite this work, alongside 2,607 external citations. Polarity classification is still being indexed.
verdicts
UNVERDICTED · 9 representative citing papers
citing papers explorer
- An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
  Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts. A minimal illustrative sketch of the idea follows this list.
- Rethinking the Need for Source Models: Source-Free Domain Adaptation from Scratch Guided by a Vision-Language Model
  The paper introduces the VODA setting for domain adaptation from scratch using vision-language models and presents TS-DRD, which achieves competitive performance on standard benchmarks without source models.
- TB-AVA: Text as a Semantic Bridge for Audio-Visual Parameter Efficient Finetuning
  TB-AVA uses text-mediated gated semantic modulation to enable efficient audio-visual alignment, achieving state-of-the-art results on the AVE, AVS, and AVVP benchmarks.
- FACTOR: Counterfactual Training-Free Test-Time Adaptation for Open-Vocabulary Object Detection
  FACTOR uses counterfactual image perturbations to quantify and suppress attribute-dependent predictions in open-vocabulary object detection, improving robustness on corrupted datasets without any training.
- GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs
  GeoStack composes multiple domain experts into VLMs with preserved base knowledge and O(1) inference time via geometric stacking and a weight-folding property. A generic weight-folding sketch follows this list.
- Seeing Further and Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction
  The paper proposes a joint spatio-temporal enlargement model for micro-video popularity prediction, using frame scoring to handle long sequences and a topology-aware memory bank to capture unbounded historical associations.
- Reasoning-Guided Grounding: Elevating Video Anomaly Detection through Multimodal Large Language Models
  VANGUARD is a staged-training VLM framework that reports 94% ROC-AUC and 84% F1 on UCF-Crime while adding chain-of-thought reasoning and spatial grounding to video anomaly detection.
- Debunking Grad-ECLIP: A Comprehensive Study on Its Incorrectness and Fundamental Principles for Model Interpretation
  The study argues that Grad-ECLIP is an equivalent but flawed variant of attention-based interpretation and proposes two principles to ensure that model explanations reflect the original model.
- ProtoCLIP: Prototype-Aligned Latent Refinement for Robust Zero-Shot Chest X-Ray Classification
  ProtoCLIP improves zero-shot chest X-ray classification in CLIP models by 2-10 AUC points via curated data and prototype-aligned distillation, reaching 0.94 AUC for pneumothorax on VinDr-CXR. A generic prototype-alignment sketch follows this list.
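For the Textual Inversion entry above, here is a minimal, self-contained sketch of the core mechanism: a single new token embedding is optimized against a denoising objective while every other weight stays frozen. The tiny `token_embedding`, `text_encoder`, and `denoiser` modules are toy stand-ins so the loop runs end to end; they are not the paper's model or code.

```python
# Sketch of Textual Inversion's core loop: optimize ONE embedding vector for a
# placeholder token (e.g. "<my-concept>") against a denoising loss, with the
# rest of the text-to-image model frozen. The modules below are toy stand-ins.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, dim = 1000, 64

# Frozen pieces of the (toy) text-to-image model.
token_embedding = nn.Embedding(vocab_size, dim)
text_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=1,
)
denoiser = nn.Sequential(nn.Linear(dim * 2, 128), nn.ReLU(), nn.Linear(128, dim))
for module in (token_embedding, text_encoder, denoiser):
    for p in module.parameters():
        p.requires_grad_(False)

# The ONLY trainable parameter: the embedding of the new pseudo-token,
# initialized from an existing, roughly related token's embedding.
concept_embedding = nn.Parameter(token_embedding.weight[42].clone())
optimizer = torch.optim.AdamW([concept_embedding], lr=5e-3)

prompt_ids = torch.tensor([[11, 57, 0, 99]])  # toy ids for "a photo of <my-concept>"
placeholder_pos = 3                           # position of the placeholder token

for step in range(100):
    # Splice the learnable embedding into the otherwise frozen prompt embeddings.
    tok = token_embedding(prompt_ids)
    tok = torch.cat(
        [tok[:, :placeholder_pos], concept_embedding[None, None], tok[:, placeholder_pos + 1:]],
        dim=1,
    )
    cond = text_encoder(tok).mean(dim=1)  # pooled text conditioning

    # Toy denoising objective: predict the noise added to a "latent" that
    # stands in for an encoded image of the personal concept.
    latent = torch.randn(1, dim) * 0.1 + 1.0
    noise = torch.randn_like(latent)
    pred = denoiser(torch.cat([latent + noise, cond], dim=-1))
    loss = nn.functional.mse_loss(pred, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```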
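The GeoStack entry mentions composing experts with O(1) inference time through a weight-folding property. The paper's actual geometric construction is not reproduced here; the sketch below only illustrates the generic idea such claims typically rest on, namely that experts expressed as weight deltas over a shared base can be folded into a single merged layer offline, so inference cost does not grow with the number of experts. All names and numbers are hypothetical.

```python
# Illustrative sketch (not GeoStack itself): fold several expert "deltas" over a
# shared frozen base into one merged weight tensor, so a single forward pass
# serves all experts -- inference stays O(1) in the number of experts.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_out = 16, 8

base = nn.Linear(d_in, d_out)

# Each expert is represented only by its deviation from the base weights
# (e.g. obtained by fine-tuning a copy and subtracting the base afterwards).
expert_deltas = [torch.randn(d_out, d_in) * 0.01 for _ in range(3)]
mixing = torch.tensor([0.5, 0.3, 0.2])  # how strongly to keep each expert

# "Fold" the experts into the base once, offline.
with torch.no_grad():
    folded = nn.Linear(d_in, d_out)
    folded.load_state_dict(base.state_dict())
    for w, delta in zip(mixing, expert_deltas):
        folded.weight += w * delta

# At inference time only the folded layer is evaluated, regardless of how many
# experts were composed above.
x = torch.randn(4, d_in)
y = folded(x)
print(y.shape)  # torch.Size([4, 8])
```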
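The ProtoCLIP entry refers to prototype-aligned refinement for zero-shot classification. As a rough, generic illustration of what prototype alignment can mean (an assumption, not the paper's method), the sketch below builds one prototype per class from a handful of curated image embeddings and blends it with the class's text embedding before the usual cosine-similarity zero-shot scoring; all tensors are random stand-ins for CLIP outputs.

```python
# Generic prototype-alignment sketch (an assumption about the idea, not
# ProtoCLIP's implementation): blend each class's text embedding with a
# prototype computed from curated image embeddings of that class, then do the
# usual cosine-similarity zero-shot classification.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_classes, dim, shots = 5, 32, 8

# Stand-ins for CLIP outputs: one text embedding per class and a few curated
# image embeddings per class.
text_emb = F.normalize(torch.randn(num_classes, dim), dim=-1)
curated_img_emb = F.normalize(torch.randn(num_classes, shots, dim), dim=-1)

# Class prototypes: mean of the curated image embeddings, re-normalized.
prototypes = F.normalize(curated_img_emb.mean(dim=1), dim=-1)

# "Alignment": move each text embedding toward its image prototype.
alpha = 0.5
refined = F.normalize(alpha * text_emb + (1 - alpha) * prototypes, dim=-1)

# Zero-shot prediction for a query image embedding.
query = F.normalize(torch.randn(1, dim), dim=-1)
logits = query @ refined.t()
pred = logits.argmax(dim=-1)
print(pred.item())
```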