super hub Canonical reference

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Hu Ye, Jun Zhang, Sibo Liu, Wei Yang, Xiao Han · 2023 · cs.CV · arXiv 2308.06721

Canonical reference. 90% of citing Pith papers cite this work as background.

170 Pith papers citing it

Background 90% of classified citations

open full Pith review browse 170 citing papers more from Hu Ye arXiv PDF

abstract

Recent years have witnessed the strong power of large text-to-image diffusion models for the impressive generative capability to create high-fidelity images. However, it is very tricky to generate desired images using only text prompt as it often involves complex prompt engineering. An alternative to text prompt is image prompt, as the saying goes: "an image is worth a thousand words". Although existing methods of direct fine-tuning from pretrained models are effective, they require large computing resources and are not compatible with other base models, text prompt, and structural controls. In this paper, we present IP-Adapter, an effective and lightweight adapter to achieve image prompt capability for the pretrained text-to-image diffusion models. The key design of our IP-Adapter is decoupled cross-attention mechanism that separates cross-attention layers for text features and image features. Despite the simplicity of our method, an IP-Adapter with only 22M parameters can achieve comparable or even better performance to a fully fine-tuned image prompt model. As we freeze the pretrained diffusion model, the proposed IP-Adapter can be generalized not only to other custom models fine-tuned from the same base model, but also to controllable generation using existing controllable tools. With the benefit of the decoupled cross-attention strategy, the image prompt can also work well with the text prompt to achieve multimodal image generation. The project page is available at \url{https://ip-adapter.github.io}.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 26 dataset 3 method 1

citation-polarity summary

background 27 use dataset 2 use method 1

claims ledger

abstract Recent years have witnessed the strong power of large text-to-image diffusion models for the impressive generative capability to create high-fidelity images. However, it is very tricky to generate desired images using only text prompt as it often involves complex prompt engineering. An alternative to text prompt is image prompt, as the saying goes: "an image is worth a thousand words". Although existing methods of direct fine-tuning from pretrained models are effective, they require large computing resources and are not compatible with other base models, text prompt, and structural controls. I

authors

Hu Ye Jun Zhang Sibo Liu Wei Yang Xiao Han

co-cited works

representative citing papers

Support-Conditioned Flow Matching Is Kernel Smoothing

cs.LG · 2026-05-13 · accept · novelty 8.0

Support-conditioned flow matching under the Gaussian OT path is exactly Nadaraya-Watson kernel smoothing with time-decreasing bandwidth, implemented by a single Gaussian attention head.

CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation

cs.CV · 2026-06-11 · unverdicted · novelty 7.0

CineOrchestra unifies control of subjects, events, cameras, and shot transitions in cinematic video generation through entity-centric conditioning primitives and parameter-free coordinated rotary embeddings.

A Comprehensive Ecosystem for Open-Domain Customized Video Generation

cs.CV · 2026-06-10 · unverdicted · novelty 7.0

Introduces PexelsCustom-1M dataset, CustoMDiT parameter-efficient model, and OpenCustom benchmark for open-domain customized video generation.

Can Image Models Imagine Time? ImageTime: A Novel Benchmark for Probing Visual World Modeling Through Spatiotemporal Consistency

cs.CV · 2026-06-09 · unverdicted · novelty 7.0

ImageTime is a benchmark that probes image generation models' visual world modeling by requiring coherent four-state sequences in single images, scored via VLM judge.

Diff-CA: Separating Common and Salient Factors with Diffusion Models

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

A diffusion-based contrastive analysis method that decomposes conditioning into common and salient factors with weak supervision and proves identifiability of the additive model.

ImageAuditor: Membership Inference Attack against Image-based Retrieval-Augmented Generation

cs.CR · 2026-06-02 · unverdicted · novelty 7.0

ImageAuditor is the first MIA for IRAG that achieves over 80% AUROC with four queries by using reward-guided policy optimization for cross-modal retrieval and task-specific prompting for signal extraction.

Splatshot: 3D Face Avatar Generation from a Single Unconstrained Photo

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

SplatShot is a training-free method that inserts per-step 3DGS refitting and photometric feedback into diffusion denoising to enforce multi-view consistency for single-photo 3D face avatars.

Chameleon: Style-Content Disentangled Framework for Cross-Domain Object Compositing

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

Chameleon proposes the first large-scale cross-domain compositing dataset and a disentangled encoder plus gated diffusion transformer that outperforms prior in-domain and cross-domain methods on plausibility and fidelity.

DEMON: Diffusion Engine for Musical Orchestrated Noise

cs.SD · 2026-05-27 · unverdicted · novelty 7.0

DEMON is a streaming diffusion engine that exposes denoising parameters as playable controls at up to 12.3 decoder completions per second via per-slot scheduling, shared state, source blending, and accelerated decoding.

Loki: Representation over Architecture for Diffusion-Based Portrait Animation

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

Loki replaces RGB conditioning stacks with identity-orthogonal parametric face encodings rasterized for diffusion, achieving efficient cross-ID portrait animation without cross-ID training data.

EM-Vid: Training-Free Entity-Centric Memory for Efficient and Consistent Multi-Shot Video Generation

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

EM-Vid introduces an entity-centric latent patch memory bank with sparse token conditioning and budgeted updates for training-free consistent multi-shot video generation.

PIU: Proximity-guided Identity Unlearning in ID-Conditioned Diffusion Models

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

PIU suppresses target identity generation in Arc2Face by replacing it with a proximity-selected anchor identity through localized fine-tuning of cross-attention layers while preserving output quality for other identities.

Tiny-Engram: Trigger-Indexed Concept Tables for Generative Vision

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Tiny-Engram uses small n-gram-indexed memory tables to bind trigger phrases to target visual identities in diffusion models while preserving compositional control from the surrounding prompt.

Functionalization via Structure Completion and Motion Rectification

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.

DirectTryOn: One-Step Virtual Try-On via Straightened Conditional Transport

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

DirectTryOn achieves state-of-the-art one-step virtual try-on performance by applying pure conditional transport, garment preservation loss, and self-consistency loss to straighten trajectories in pretrained generative models.

Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.

Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm

cs.CV · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Proposes V2V-Zero, a training-free framework replacing text conditioning with VLM final-layer hidden states from visual pages, achieving 0.85 on GenEval and 32.7/100 on new Simple-V2V Bench across models including video extension.

MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics

cs.CV · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

MoCam unifies static and dynamic novel view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion denoising process.

Detecting Deception, Not Deepfakes: Why Media Forensics Needs Social Theories

cs.CY · 2026-05-09 · unverdicted · novelty 7.0

Deepfake detection must shift from classifying media realism to detecting communicative deception by applying Speech Act Theory, Grice's Cooperative Principle, and Cialdini's influence principles.

Delta-Adapter: Scalable Exemplar-Based Image Editing with Single-Pair Supervision

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

Delta-Adapter extracts a semantic delta from a single image pair via a pre-trained vision encoder and injects it through a Perceiver adapter to enable scalable single-pair supervised editing.

Adaptive Subspace Projection for Generative Personalization

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

A training-free adaptive subspace projection method mitigates semantic collapsing in generative personalization by isolating and adjusting drift in a low-dimensional subspace using the stable pre-trained embedding as anchor.

A unified Benchmark for Multi-Frame Image Restoration under Severe Refractive Warping

cs.CV · 2026-05-06 · unverdicted · novelty 7.0

Presents the first large-scale benchmark for multi-frame geometric distortion removal in videos under severe refractive warping, using real and synthetic data across four distortion levels and evaluating classical and learning-based methods including a proposed diffusion-based V-cache.

CA-IDD: Cross-Attention Guided Identity-Conditional Diffusion for Identity-Consistent Face Swapping

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

CA-IDD is the first diffusion model for face swapping that integrates multi-modal cross-attention guidance from identity embeddings, gaze, and facial parsing to achieve better identity consistency and an FID of 11.73 over GAN baselines.

MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation

cs.CV · 2026-04-26 · unverdicted · novelty 7.0 · 2 refs

MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.

citing papers explorer

Showing 50 of 170 citing papers.

Secure Intellicise Wireless Network: Agentic AI for Coverless Semantic Steganography Communication cs.CR · 2026-01-23 · unverdicted · none · ref 53 · internal anchor
Agentic AI enables coverless semantic steganography without private keys or cover images, delivering higher capacity and security than prior schemes in semantic communication.
EmoCtrl: Controllable Emotional Image Content Generation cs.CV · 2025-12-27 · unverdicted · none · ref 45 · internal anchor
EmoCtrl generates images faithful to content prompts while expressing target emotions via textual/visual enhancement modules and emotion-driven preference optimization.
Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation cs.CV · 2025-11-21 · unverdicted · none · ref 50 · internal anchor
Fine-tuning text-to-video models on sparse low-quality synthetic data for physical camera controls outperforms fine-tuning on photorealistic data.
How Noise Benefits AI-generated Image Detection cs.CV · 2025-11-20 · unverdicted · none · ref 76 · internal anchor
PiN-CLIP jointly trains a noise generator and detector under a variational positive-incentive principle to inject feature-space noise that suppresses shortcut directions and improves out-of-distribution accuracy by 5.4 points on images from 42 generative models.
Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback cs.CV · 2025-10-19 · unverdicted · none · ref 24 · internal anchor
UniWorld-V2 applies policy optimization via DiffusionNFT and MLLM logit feedback with group filtering to reach state-of-the-art scores of 4.49 on ImgEdit and 7.83 on GEdit-Bench while remaining model-agnostic.
FlashEdit: Decoupling Speed, Structure, and Semantics for Precise Image Editing cs.CV · 2025-09-26 · conditional · none · ref 49 · internal anchor
FlashEdit delivers real-time localized text-guided image editing under 0.2 seconds via cycle-consistent one-step inversion, background shield, and sparsified spatial cross-attention, achieving over 150x speedup on PIE-Bench.
TaleDiffusion: Multi-Character Story Generation with Dialogue Rendering cs.CV · 2025-09-04 · unverdicted · none · ref 79 · internal anchor
TaleDiffusion introduces an iterative framework using LLM-generated per-frame descriptions, bounded attention-based per-box masks, identity-consistent self-attention, region-aware cross-attention, and CLIPSeg-based dialogue rendering to produce consistent multi-character story visualizations.
Ouroboros: Single-step Diffusion Models for Cycle-consistent Forward and Inverse Rendering cs.CV · 2025-08-20 · unverdicted · none · ref 74 · internal anchor
Ouroboros uses two single-step diffusion models with cycle consistency for forward and inverse rendering, extending intrinsic decomposition to indoor/outdoor scenes with faster inference than multi-step methods.
FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space cs.GR · 2025-06-17 · unverdicted · none · ref 59 · internal anchor
FLUX.1 Kontext unifies image generation and editing via flow matching and sequence concatenation, delivering improved multi-turn consistency and speed on the new KontextBench benchmark.
ImgEdit: A Unified Image Editing Dataset and Benchmark cs.CV · 2025-05-26 · conditional · none · ref 78 · internal anchor
ImgEdit supplies 1.2 million curated edit pairs and a three-part benchmark that let a VLM-based model outperform prior open-source editors on adherence, quality, and detail preservation.
Learning Zero-Shot Subject-Driven Video Generation Using 1% Compute cs.CV · 2025-04-23 · unverdicted · none · ref 55 · internal anchor
A zero-shot subject-driven video generation framework that decomposes the task into identity injection from 200K subject-image pairs and motion preservation from 4K arbitrary videos, trained in 288 A100 GPU hours on CogVideoX-5B to match prior performance at 1% compute.
FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation cs.CV · 2025-04-22 · unverdicted · none · ref 6 · internal anchor
FreeGraftor performs subject-driven text-to-image generation without training by cross-image feature grafting via semantic matching, position-constrained attention fusion, and a noise initialization strategy that preserves reference geometry.
Color Conditional Generation with Sliced Wasserstein Guidance cs.CV · 2025-03-24 · unverdicted · none · ref 60 · internal anchor
A training-free method modifies diffusion model sampling with differentiable Sliced 1-Wasserstein distance for color-conditional image generation.
NullFace: Training-Free Localized Face Anonymization cs.CV · 2025-03-11 · unverdicted · none · ref 77 · internal anchor
NullFace performs training-free localized face anonymization by inverting images to noise and denoising with modified identity embeddings from a pre-trained diffusion model.
Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model cs.CV · 2025-03-10 · unverdicted · none · ref 38 · internal anchor
Seedream 2.0 is a native Chinese-English bilingual diffusion model that integrates a self-developed LLM text encoder, Glyph-Aligned ByT5, and Scaled ROPE to reach claimed state-of-the-art results in prompt following, aesthetics, text rendering, and human preference alignment via RLHF.
OmniPrism: Learning Disentangled Visual Concept for Image Generation cs.CV · 2024-12-16 · unverdicted · none · ref 46 · internal anchor
OmniPrism proposes a disentanglement method using a new paired dataset (PCD-200K), COD contrastive training, and block embeddings to inject separated concepts into diffusion models for multi-aspect image generation.
CameraCtrl: Enabling Camera Control for Text-to-Video Generation cs.CV · 2024-04-02 · unverdicted · none · ref 162 · internal anchor
CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
InstantID: Zero-shot Identity-Preserving Generation in Seconds cs.CV · 2024-01-15 · unverdicted · none · ref 24 · internal anchor
InstantID enables zero-shot identity-preserving image generation from one facial image via a novel IdentityNet that combines strong semantic and weak spatial conditioning with text prompts in diffusion models.
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation cs.CV · 2023-10-30 · unverdicted · none · ref 55 · internal anchor
Open-source text-to-video and image-to-video diffusion models generate high-quality 1024x576 videos, with the I2V variant claimed as the first to strictly preserve reference image content.
Personalization as Inverse Planning: Learning Latent Design Intents for Agentic Slide Generation via Structural Denoising cs.AI · 2026-07-01 · unverdicted · none · ref 46 · internal anchor
SPIRE approximates page-level slide personalization by training agents to denoise corrupted slide structures via collaborative RL, claiming a proof of consistency as a surrogate for inverse planning.
No Prompt, No Leaks: A Robust Generative Steganography Framework via Prompt-Free Diffusion cs.CV · 2026-06-30 · unverdicted · none · ref 20 · internal anchor
Proposes a prompt-free latent diffusion steganography method using style semantics, CACM for reversible mapping, and predictor-corrector sampling to improve stego image quality and secret recovery.
MemoryVAM: Integrating Memory into Video Action Model for Robot Manipulation cs.RO · 2026-06-13 · unverdicted · none · ref 27 · internal anchor
MemoryVAM integrates a Perceiver-based Recap Compressor and Cue Gate into video action models, raising success rates on long-horizon manipulation from 5% to 42.5% on LIBERO-Mem and 75-80% on real-robot counting, spatial recall, and tracking tasks.
FontFusion: Enhancing Generative Text in Diffusion Models with Typographic Conditioning cs.CV · 2026-06-04 · unverdicted · none · ref 31 · internal anchor
FontFusion adds hierarchical token conditioning, position-aware embeddings, and multi-level dropping to DiT diffusion models, yielding 76% relative gains on decorative fonts and 68-76% consistency improvements via a dual DeepFont+DINOv2 encoder.
Mapping Whisper Representations to Human ECoG Responses with Interpretable Time-Resolved Neural Encoding q-bio.NC · 2026-06-01 · unverdicted · none · ref 35 · internal anchor
The paper introduces a time-resolved neural encoder combining Whisper embeddings with recurrent temporal modeling and soft attention to predict ECoG responses, finding strongest alignment in intermediate layers and anatomically coherent phoneme organization in electrodes.
PhyDrawGen: Physically Grounded Diagram Generation from Natural Language cs.AI · 2026-05-28 · unverdicted · none · ref 37 · internal anchor
PhyDrawGen is a neuro-symbolic pipeline that extracts typed scene graphs via LLM, converts them to physically constrained PSLGs via deterministic solver, and refines via fine-tuned Qwen-VL, claiming superior performance over GPT-5-image and Gemini models on 1,449 physics problems.
Gaze2Act: Gaze-Conditioned Vision-Language-Action Policies for Interactive Robot Manipulation cs.RO · 2026-05-28 · unverdicted · none · ref 12 · internal anchor
Gaze2Act conditions VLA policies on mapped human gaze for precise object and interaction specification, reporting SOTA intent accuracy and success across 16 real-robot tasks on a Unitree G1 humanoid.
Versatile Framework with Semantic and Structural guidance for Image Reconstruction from Brain Activity cs.CV · 2026-05-28 · unverdicted · none · ref 55 · internal anchor
MindDiffuser is a two-stage framework that improves semantic and structural fidelity in reconstructing visual stimuli from fMRI, EEG, and MEG signals using CLIP embeddings and Stable Diffusion with backpropagation refinement.
{\Phi}-Noise: Training-Free Temporal Video Conditioning via Phase-Based Noise Manipulation cs.CV · 2026-05-23 · unverdicted · none · ref 59 · internal anchor
Training-free motion conditioning for latent video diffusion by direct injection of low-frequency phase from a reference video into the diffusion noise.
AssetGen: Deployable 3D Asset Generation at Interactive Speed cs.GR · 2026-05-22 · unverdicted · none · ref 27 · internal anchor
AssetGen is a system that produces deployable 3D assets including meshes, baked normals, and textures from a single reference image in under 30 seconds via a coarse-to-refine VecSet pipeline and co-designed optimizations.
STAMBRIDGE: Spectral-Temporal Amplitude-aware Mid-Feature Bridge for EEG Visual Decoding eess.IV · 2026-05-22 · unverdicted · none · ref 35 · 2 links · internal anchor
STAMBRIDGE uses STAM for robust EEG feature extraction and MFSB for cross-modal alignment, reporting 34.50% top-1 and 65.95% top-5 accuracy in 200-way zero-shot retrieval on THINGS-EEG plus diffusion-based image reconstructions.
PGC: Peak-Guided Calibration for Generalizable AI-Generated Image Detection cs.CV · 2026-05-20 · unverdicted · none · ref 9 · internal anchor
PGC introduces peak-focusing aggregation of local discriminative clues to calibrate global representations for AI-generated image detection, reporting accuracy gains on a new 15-model commercial benchmark and standard datasets.
Decomposing Subject-Driven Image Generation via Intermediate Structural Prediction cs.CV · 2026-05-20 · unverdicted · none · ref 27 · internal anchor
A two-stage method predicts an intermediate Canny map for structure then renders the image conditioned on appearance and structure, paired with a 100k text-aware dataset, to improve detail preservation in subject-driven generation.
Brain-to-Image Retrieval and Reconstruction via Multimodal EEG Alignment cs.CV · 2026-05-18 · unverdicted · none · ref 6 · internal anchor
A multimodal alignment pipeline decodes EEG signals recorded during natural image viewing into image retrieval (86.3% Top-1) and reconstruction (CLIP 0.903) tasks.
DreamEdit3D: Personalization of Multi-View Diffusion Models for 3D Editing cs.CV · 2026-05-16 · unverdicted · none · ref 48 · internal anchor
DreamEdit3D learns separate token embeddings for segmented object components via two-phase multi-view optimization to enable text-guided 3D editing with consistent image generation and mesh reconstruction.
DealMaTe: Multi-Dimensional Material Transfer via Diffusion Transformer cs.GR · 2026-05-15 · unverdicted · none · ref 63 · internal anchor
DealMaTe proposes a simplified diffusion framework for material transfer that injects multi-dimensional 3D conditions via Multi-Dim 3D Shader LoRA and Shader Causal Mutual Attention with KV caching.
DriveCtrl: Conditioned Sim-to-Real Driving Video Generation cs.CV · 2026-05-14 · unverdicted · none · ref 25 · internal anchor
DriveCtrl is a depth-conditioned controllable framework that generates realistic driving videos from simulation while preserving annotations and scene dynamics.
RealDiffusion: Physics-informed Attention for Multi-character Storybook Generation cs.CV · 2026-05-12 · unverdicted · none · ref 38 · internal anchor
RealDiffusion uses heat diffusion as a dissipative prior and a region-aware stochastic process inside a training-free physics-informed attention mechanism to improve multi-character coherence while preserving narrative dynamism in sequential image generation.
When Few Steps Are Enough: Training-Free Acceleration of Identity-Preserved Generation cs.CV · 2026-05-10 · unverdicted · none · ref 10 · internal anchor
Frozen identity adapter from FLUX dev works on distilled schnell model, enabling 5.9x faster generation with better identity preservation in few steps.
Semantic-Structural Alignment for Generative Pictorial Charts cs.GR · 2026-05-05 · unverdicted · none · ref 21 · internal anchor
Dual-conditioned Multi-Modal Diffusion Transformer with structural and semantic alignment mechanisms generates pictorial charts from text prompts and abstract chart images.
IConFace: Identity-Structure Asymmetric Conditioning for Unified Reference-Aware Face Restoration cs.CV · 2026-05-04 · unverdicted · none · ref 101 · internal anchor
IConFace performs unified reference-aware and no-reference blind face restoration by asymmetrically conditioning identity from references and structure from the degraded image.
Diffusion Templates: A Unified Plugin Framework for Controllable Diffusion cs.LG · 2026-04-27 · unverdicted · none · ref 47 · internal anchor
Diffusion Templates is a unified plugin framework that allows injecting various controllable capabilities into diffusion models through a standardized interface.
Discriminative-Generative Synergy for Occlusion Robust 3D Human Mesh Recovery cs.CV · 2026-04-20 · unverdicted · none · ref 18 · internal anchor
A synergistic framework merges discriminative ViT features with generative diffusion priors via alignment and cross-attention fusion modules to improve 3D human mesh estimation under occlusion.
DiffMagicFace: Identity Consistent Facial Editing of Real Videos cs.CV · 2026-04-15 · unverdicted · none · ref 34 · internal anchor
DiffMagicFace uses concurrent fine-tuned text and image diffusion models plus a rendered multi-view dataset to achieve identity-consistent text-conditioned editing of real facial videos.
FineEdit: Fine-Grained Image Edit with Bounding Box Guidance cs.CV · 2026-04-13 · unverdicted · none · ref 60 · internal anchor
FineEdit adds multi-level bounding box injection to diffusion image editing, releases a 1.2M-pair dataset with box annotations, and shows better instruction following and background consistency than prior open models on new and existing benchmarks.
Brain-Grasp: Graph-based Saliency Priors for Improved fMRI-based Visual Brain Decoding eess.IV · 2026-04-12 · unverdicted · none · ref 21 · internal anchor
Graph-informed saliency masks derived from fMRI signals are used to condition a single diffusion model, improving object structure and semantic fidelity in visual brain decoding.
FREE-Switch: Frequency-based Dynamic LoRA Switch for Style Transfer cs.CV · 2026-04-11 · unverdicted · none · ref 36 · internal anchor
FREE-Switch dynamically switches LoRA adapters using frequency importance per diffusion step and adds semantic alignment to reduce content drift when merging specialized image generators.
ID-Sim: An Identity-Focused Similarity Metric cs.CV · 2026-04-06 · unverdicted · none · ref 97 · internal anchor
ID-Sim is a new similarity metric that aims to capture human selective sensitivity to identities by training on curated real and generative synthetic data and validating against human annotations on recognition, retrieval, and generative tasks.
PureCC: Pure Learning for Text-to-Image Concept Customization cs.CV · 2026-03-08 · unverdicted · none · ref 50 · internal anchor
PureCC introduces a decoupled learning objective, dual-branch training pipeline with frozen extractor, and adaptive guidance scale λ* for high-fidelity concept customization while preserving original model behavior in text-to-image generation.
PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards cs.CV · 2025-12-01 · conditional · none · ref 46 · internal anchor
A data-generation pipeline plus pairwise subject-consistency rewards in RL improve consistency and prompt adherence for multi-subject personalized image generation.
Animalbooth: multimodal feature enhancement for animal subject personalization cs.CV · 2025-09-20 · unverdicted · none · ref 19 · internal anchor
AnimalBooth introduces an Animal Net, adaptive attention module, and frequency-controlled DCT feature integration to improve identity preservation and perceptual quality in personalized animal image generation, supported by a new high-resolution dataset AnimalBench.

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer