hub
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
68 Pith papers cite this work. Polarity classification is still indexing.
abstract
Recent years have witnessed the strong power of large text-to-image diffusion models, whose impressive generative capability produces high-fidelity images. However, it is very tricky to generate desired images using only a text prompt, as this often involves complex prompt engineering. An alternative to the text prompt is the image prompt; as the saying goes, "an image is worth a thousand words". Although existing methods that directly fine-tune pretrained models are effective, they require large computing resources and are not compatible with other base models, text prompts, and structural controls. In this paper, we present IP-Adapter, an effective and lightweight adapter that adds image prompt capability to pretrained text-to-image diffusion models. The key design of our IP-Adapter is a decoupled cross-attention mechanism that separates the cross-attention layers for text features and image features. Despite the simplicity of our method, an IP-Adapter with only 22M parameters achieves performance comparable to, or even better than, a fully fine-tuned image prompt model. Because we freeze the pretrained diffusion model, the proposed IP-Adapter generalizes not only to other custom models fine-tuned from the same base model, but also to controllable generation with existing controllable tools. With the benefit of the decoupled cross-attention strategy, the image prompt also works well with the text prompt to achieve multimodal image generation. The project page is available at \url{https://ip-adapter.github.io}.
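The decoupled cross-attention described in the abstract can be sketched in a few lines of NumPy: the latent queries attend to text tokens and image tokens through separate key/value projections, and the two branch outputs are summed. This is a minimal illustrative sketch, not the paper's implementation; the weight names, shapes, and the `scale` knob on the image branch are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d)) V
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def decoupled_cross_attention(z, text_feats, image_feats, W, scale=1.0):
    """One decoupled cross-attention layer (illustrative sketch).

    z: latent queries (n, d); text_feats / image_feats: condition tokens.
    W: dict of projection matrices; conceptually, 'k_img'/'v_img' are the
    only new trainable weights, the rest belong to the frozen base model.
    """
    q = z @ W["q"]  # shared query projection for both branches
    text_out = attention(q, text_feats @ W["k_txt"], text_feats @ W["v_txt"])
    img_out = attention(q, image_feats @ W["k_img"], image_feats @ W["v_img"])
    # Branch outputs are summed; scale=0 recovers the text-only base model.
    return text_out + scale * img_out

rng = np.random.default_rng(0)
d = 8
W = {k: rng.standard_normal((d, d)) / np.sqrt(d)
     for k in ["q", "k_txt", "v_txt", "k_img", "v_img"]}
z = rng.standard_normal((4, d))      # 4 latent query tokens
text = rng.standard_normal((7, d))   # 7 text tokens
image = rng.standard_normal((5, d))  # 5 image-prompt tokens
out = decoupled_cross_attention(z, text, image, W, scale=0.6)
print(out.shape)  # (4, 8)
```

Because only the image-side key/value projections would be trained in this scheme, the text path is untouched, which is why the adapter composes with the frozen base model, text prompts, and other structural controls.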
hub tools
citation-role summary
citation-polarity summary
claims ledger
co-cited works
roles
background (1)
polarities
background (1)
representative citing papers
DirectTryOn achieves state-of-the-art one-step virtual try-on performance by applying pure conditional transport, garment preservation loss, and self-consistency loss to straighten trajectories in pretrained generative models.
INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
MoCam unifies static and dynamic novel view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion denoising process.
Flow matching admits controllable generation by shifting the conditional endpoint mean computed from a reference set, enabling training-free guidance on frozen pretrained models.
Deepfake detection must shift from classifying media realism to detecting communicative deception by applying Speech Act Theory, Grice's Cooperative Principle, and Cialdini's influence principles.
Delta-Adapter extracts a semantic delta from a single image pair via a pre-trained vision encoder and injects it through a Perceiver adapter to enable scalable single-pair supervised editing.
A training-free adaptive subspace projection method mitigates semantic collapsing in generative personalization by isolating and adjusting drift in a low-dimensional subspace using the stable pre-trained embedding as anchor.
Presents the first large-scale benchmark for multi-frame geometric distortion removal in videos under severe refractive warping, using real and synthetic data across four distortion levels and evaluating classical and learning-based methods including a proposed diffusion-based V-cache.
CA-IDD is the first diffusion model for face swapping that integrates multi-modal cross-attention guidance from identity embeddings, gaze, and facial parsing to achieve better identity consistency and an FID of 11.73 over GAN baselines.
MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.
StyleID supplies human-perception-aligned benchmarks and fine-tuned encoders that improve facial identity recognition robustness across stylization types and strengths.
AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both an XAI probe and creative tool.
A dual-path consistency framework for text-driven 3D scene editing that models cross-view dependencies via structural correspondence and semantic continuity, trained on a newly constructed paired multi-view dataset.
ASTRA disentangles subject identity from pose structure in diffusion transformers via retrieval-augmented pose guidance, asymmetric EURoPE embeddings, and a DSM adapter to improve multi-subject generation.
A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.
UDAPose improves low-light human pose estimation by synthesizing realistic images via DHF and LCIM modules and dynamically balancing image cues with pose priors using DCA, yielding AP gains of 10.1 and 7.4 over prior methods.
NeuroFlow is the first unified flow model for bidirectional visual encoding and decoding from neural activity using NeuroVAE and cross-modal flow matching.
A 300K quadruplet dataset and UniDG foundation model enable reference- or text-driven defect generation across categories, outperforming few-shot baselines on anomaly detection tasks.
Graph-PiT adds graph priors and a hierarchical GNN to part-based image synthesis to enforce relational constraints and improve structural coherence over vanilla PiT.
ELLA introduces a timestep-aware semantic connector to link LLMs with diffusion models for improved dense prompt following, validated on a new 1K-prompt benchmark.
CRAFT adapts diffusion models to medical images via clinical reward alignment from LLMs and VLMs, improving alignment scores and cutting low-quality generations by 20.4% on average across modalities.
V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute binding and structural control.
A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.
citing papers explorer
- ImgEdit: A Unified Image Editing Dataset and Benchmark
  ImgEdit supplies 1.2 million curated edit pairs and a three-part benchmark that let a VLM-based model outperform prior open-source editors on adherence, quality, and detail preservation.
- CameraCtrl: Enabling Camera Control for Text-to-Video Generation
  CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
- VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
  Open-source text-to-video and image-to-video diffusion models generate high-quality 1024x576 videos, with the I2V variant claimed as the first to strictly preserve reference image content.
- RealDiffusion: Physics-informed Attention for Multi-character Storybook Generation
  RealDiffusion uses heat diffusion as a dissipative prior and a region-aware stochastic process inside a training-free physics-informed attention mechanism to improve multi-character coherence while preserving narrative dynamism in sequential image generation.
- When Few Steps Are Enough: Training-Free Acceleration of Identity-Preserved Generation
  A frozen identity adapter trained on FLUX dev transfers to the distilled schnell model, enabling 5.9x faster generation with better identity preservation in few steps.
- IConFace: Identity-Structure Asymmetric Conditioning for Unified Reference-Aware Face Restoration
  IConFace performs unified reference-aware and no-reference blind face restoration by asymmetrically conditioning identity from references and structure from the degraded image.
- Diffusion Templates: A Unified Plugin Framework for Controllable Diffusion
  Diffusion Templates is a unified plugin framework that allows injecting various controllable capabilities into diffusion models through a standardized interface.
- Discriminative-Generative Synergy for Occlusion Robust 3D Human Mesh Recovery
  A synergistic framework merges discriminative ViT features with generative diffusion priors via alignment and cross-attention fusion modules to improve 3D human mesh estimation under occlusion.
- DiffMagicFace: Identity Consistent Facial Editing of Real Videos
  DiffMagicFace uses concurrent fine-tuned text and image diffusion models plus a rendered multi-view dataset to achieve identity-consistent text-conditioned editing of real facial videos.
- FineEdit: Fine-Grained Image Edit with Bounding Box Guidance
  FineEdit adds multi-level bounding box injection to diffusion image editing, releases a 1.2M-pair dataset with box annotations, and shows better instruction following and background consistency than prior open models on new and existing benchmarks.
- Brain-Grasp: Graph-based Saliency Priors for Improved fMRI-based Visual Brain Decoding
  Graph-informed saliency masks derived from fMRI signals are used to condition a single diffusion model, improving object structure and semantic fidelity in visual brain decoding.
- FREE-Switch: Frequency-based Dynamic LoRA Switch for Style Transfer
  FREE-Switch dynamically switches LoRA adapters using frequency importance per diffusion step and adds semantic alignment to reduce content drift when merging specialized image generators.
- ID-Sim: An Identity-Focused Similarity Metric
  ID-Sim is a new similarity metric that aims to capture human selective sensitivity to identities by training on curated real and generative synthetic data and validating against human annotations on recognition, retrieval, and generative tasks.
- Movie Gen: A Cast of Media Foundation Models
  A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
- AtteConDA: Attention-Based Conflict Suppression in Multi-Condition Diffusion Models and Synthetic Data Augmentation
  AtteConDA adds attention-based conflict suppression to multi-condition diffusion models so that generated driving-scene images retain richer structural cues from the original annotations.
- AHS: Adaptive Head Synthesis via Synthetic Data Augmentations
  Adaptive Head Synthesis (AHS) employs head-reenacted synthetic data augmentation to enable robust head swapping on full upper-body images without paired training data.
- LegoDiffusion: Micro-Serving Text-to-Image Diffusion Workflows
  LegoDiffusion decomposes diffusion workflows into micro-served model nodes to achieve up to 3x higher throughput and 8x better burst tolerance than monolithic serving systems.
- Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
  A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.