Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.
SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations
Canonical reference. 82% of citing Pith papers cite this work as background.
abstract
Guided image synthesis enables everyday users to create and edit photo-realistic images with minimum effort. The key challenge is balancing faithfulness to the user input (e.g., hand-drawn colored strokes) and realism of the synthesized image. Existing GAN-based methods attempt to achieve such balance using either conditional GANs or GAN inversions, which are challenging and often require additional training data or loss functions for individual applications. To address these issues, we introduce a new image synthesis and editing method, Stochastic Differential Editing (SDEdit), based on a diffusion model generative prior, which synthesizes realistic images by iteratively denoising through a stochastic differential equation (SDE). Given an input image with user guide of any type, SDEdit first adds noise to the input, then subsequently denoises the resulting image through the SDE prior to increase its realism. SDEdit does not require task-specific training or inversions and can naturally achieve the balance between realism and faithfulness. SDEdit significantly outperforms state-of-the-art GAN-based methods by up to 98.09% on realism and 91.72% on overall satisfaction scores, according to a human perception study, on multiple tasks, including stroke-based image synthesis and editing as well as image compositing.
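The two-step procedure described in the abstract (add noise to the guide, then denoise through the reverse SDE) can be sketched on a toy problem. This is a minimal illustration, not the authors' implementation. It assumes a variance-exploding noise schedule sigma(t) = sigma_max * t, and it replaces the learned score network with the exact score of a Gaussian data distribution so that it runs with NumPy alone. The names `sigma`, `toy_score`, `sdedit_toy`, and all parameter values are hypothetical.

```python
import numpy as np


def sigma(t, sigma_max=10.0):
    """Assumed noise schedule: standard deviation of the noise added by time t."""
    return sigma_max * t


def toy_score(x, t, mu, s):
    """Exact score of N(mu, s^2 + sigma(t)^2).

    Stand-in for a trained score network: if the data are N(mu, s^2),
    the noised marginal at time t is Gaussian with variance s^2 + sigma(t)^2.
    """
    return -(x - mu) / (s ** 2 + sigma(t) ** 2)


def sdedit_toy(x_guide, t0, mu, s, n_steps=200, seed=0):
    """Perturb the guide to time t0, then integrate the reverse SDE back to t = 0."""
    rng = np.random.default_rng(seed)

    # Step 1: add noise to the user-provided guide.
    x = x_guide + sigma(t0) * rng.standard_normal(x_guide.shape)

    # Step 2: Euler-Maruyama-style reverse steps from t0 down to 0.
    ts = np.linspace(t0, 0.0, n_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        # Variance removed between t_cur and t_next.
        eps2 = sigma(t_cur) ** 2 - sigma(t_next) ** 2
        z = rng.standard_normal(x.shape)
        x = x + eps2 * toy_score(x, t_cur, mu, s) + np.sqrt(eps2) * z
    return x


if __name__ == "__main__":
    guide = np.full(2000, 5.0)  # a guide far from the toy data mean of 0
    for t0 in (0.05, 1.0):
        out = sdedit_toy(guide, t0=t0, mu=0.0, s=1.0)
        print(f"t0={t0}: mean of outputs = {out.mean():.2f}")
```

In this toy setting, a small `t0` leaves the outputs close to the guide (faithfulness), while a large `t0` pulls them toward the assumed data distribution (realism), which mirrors the trade-off the abstract describes.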
representative citing papers
Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.
PhyEditBench is a new benchmark for physics-aware image editing with real and synthetic instances plus a training-free PhyWorld baseline that uses test-time scaling to outperform SOTA models.
Consistent-Inversion introduces reverse consistency guidance that corrects early target denoising steps by checking reversibility toward the source inversion trajectory under the original prompt.
A transformer model predicts in vivo hip and knee contact forces from uncalibrated monocular video at accuracy matching subject-specific musculoskeletal simulations under leave-one-subject-out validation.
Chameleon proposes the first large-scale cross-domain compositing dataset and a disentangled encoder plus gated diffusion transformer that outperforms prior in-domain and cross-domain methods on plausibility and fidelity.
Presents Decoupled Time Guidance (DTG) for training-free generative video super-resolution by temporally decoupling conditional and unconditional diffusion signals.
Orthogonal Negative Guidance subtracts only the orthogonal component of negative-prompt attention features from positive ones in FLUX models to suppress concepts while preserving semantics and quality.
DeltaCam models relative changes in camera intrinsics via Δ-parameterized neural adaptors in video diffusion models trained on synthetic data to enable controllable generation and real-world transfer.
StableHand introduces a quality-aware flow matching framework conditioned on predicted four-channel per-frame hand observation quality to estimate dual-hand world-space motion from egocentric video, achieving SOTA results with 20-25% W-MPJPE reduction on HOT3D and ARCTIC benchmarks.
Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.
HL-OutPaint enables high-resolution outpainting of long video sequences via a coarse-to-fine pipeline that first builds Global Coarse Guidance through global-local frame swapping then synthesizes details.
A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.
AID amortizes guidance for diffusion inpainting by training a reusable module via an auxiliary Gaussian formulation and continuous-time actor-critic algorithm, improving quality-speed trade-off with under 1% overhead.
Proposes V2V-Zero, a training-free framework replacing text conditioning with VLM final-layer hidden states from visual pages, achieving 0.85 on GenEval and 32.7/100 on new Simple-V2V Bench across models including video extension.
RevealLayer decomposes natural images into multiple RGBA layers using diffusion models with region-aware attention, occlusion-guided adaptation, and a composite loss, outperforming prior methods on a new benchmark dataset.
OT-Bridge Editor uses geometrically constrained entropic optimal transport to synthesize CAG images with precise stenosis, improving downstream detection by 27.8% on ARCADE and 23.0% on a multi-center dataset.
Wasserstein Lagrangian Mechanics formalizes second-order dynamics in Wasserstein space and provides an algorithm to learn them from observed marginals without specifying the Lagrangian, outperforming gradient flows on various dynamics.
ResetEdit embeds a recoverable discrepancy signal during image generation in diffusion models to reconstruct an approximate original latent for high-fidelity text-guided editing.
GeoEdit constructs local tangent frames from small perturbations to initial noise, enabling Jacobian-free on-manifold edits in diffusion models via alternating tangent steps and diffusion projections.
StyleID supplies human-perception-aligned benchmarks and fine-tuned encoders that improve facial identity recognition robustness across stylization types and strengths.
LatentFT uses latent-space Fourier transforms and frequency masking in diffusion autoencoders to enable timescale-specific manipulation of musical structure in generative models.
UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
Pre-trained diffusion models inherently support image restoration that can be unlocked by optimizing prompt embeddings at the text encoder output using a diffusion bridge formulation, achieving competitive results on models like WAN and FLUX without fine-tuning.
citing papers explorer
- MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model
MPDiT uses a hierarchical multi-patch design in transformers to lower computation in diffusion models by handling coarse global features first then fine local details, plus faster-converging embeddings.
- Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback
UniWorld-V2 applies policy optimization via DiffusionNFT and MLLM logit feedback with group filtering to reach state-of-the-art scores of 4.49 on ImgEdit and 7.83 on GEdit-Bench while remaining model-agnostic.
- Training-Free Reward-Guided Image Editing via Trajectory Optimal Control
A trajectory optimal control framework for reward-guided image editing in diffusion models that balances reward maximization with source fidelity better than prior inversion-based baselines.
- EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning
EditVerse unifies image and video editing and generation in one transformer model via unified token sequences and in-context learning, trained jointly on curated video editing data plus image/video corpora and evaluated on a new instruction-based benchmark.
- LeakyCLIP: Extracting Training Data from CLIP
LeakyCLIP reconstructs images from CLIP embeddings with over 258% SSIM gain versus baselines and enables membership inference from reconstruction metrics on LAION-2B data.
- NullFace: Training-Free Localized Face Anonymization
NullFace performs training-free localized face anonymization by inverting images to noise and denoising with modified identity embeddings from a pre-trained diffusion model.
- Semantics Disentanglement and Composition for Universal Image Coding with Efficiently LLM Reasoning and Generative Diffusion
UniCodec uses LLM-driven semantic disentanglement at the encoder and diffusion-based compositional generation at the decoder to enable one codec for both human perception and machine vision tasks without task-specific retraining.
- Autoregressive Video Generation without Vector Quantization
NOVA reformulates video generation as non-quantized autoregressive frame-by-frame temporal prediction combined with set-by-set spatial prediction, outperforming prior AR video models and some diffusion models in efficiency and quality.
- VideoPoet: A Large Language Model for Zero-Shot Video Generation
VideoPoet is a large language model that performs zero-shot video generation with audio from diverse multimodal conditioning signals.
- IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.
- SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-the-art generators.
- MagicVideo: Efficient Video Generation With Latent Diffusion Models
MagicVideo generates 256x256 text-conditioned video clips via latent diffusion with a custom 3D U-Net, achieving roughly 64 times lower compute than prior video diffusion models.
- SkelEM: Training-Signal Decoupling of Skeleton and Diffusion for Self-supervised Axial Super-Resolution in Volume Microscopy
SkelEM decouples skeleton topology from diffusion refinement via disjoint objectives and cycle-consistent alignment on sparse slices to enable fast, high-fidelity self-supervised axial super-resolution with a new BRAVE-ASR benchmark.
- FDM-MFVT: Few-step Sampling Diffusion Model for Mask-Free Virtual Try-On
FDM-MFVT is a few-step mask-free virtual try-on diffusion model using OANO and IDT modules plus a new 30,000-pair MFVT dataset, claiming better efficiency and quality than baselines.
- VOID: Defeating Unauthorized Mimicry in Latent Diffusion Models
VOID defeats mimicry in LDMs via stochasticity manipulation in the diffusion pipeline, raising average FID from 113 to 365 across evaluations.
- One-Step Distillation of Discrete Diffusion Image Generators via Fixed-Point Iteration
Fixed-Point Distillation constructs one-step correction targets for discrete diffusion generators via partial corruption and single teacher refinement, lifted into continuous features with a multi-bandwidth drift loss and straight-through estimation.
- Stable and Near-Reversible Diffusion ODE Solvers for Image Editing
Near-reversible Runge-Kutta ODE solvers combined with vector-field smoothing deliver more stable and higher-fidelity text-guided edits in diffusion models than exactly reversible schemes.
- Semantic-Structural Alignment for Generative Pictorial Charts
Dual-conditioned Multi-Modal Diffusion Transformer with structural and semantic alignment mechanisms generates pictorial charts from text prompts and abstract chart images.
- Structured Diffusion Bridges: Inductive Bias for Denoising Diffusion Bridges
A structured diffusion bridge method achieves near fully-paired modality translation quality using alignment constraints even in unpaired or semi-paired regimes.
- Uncertainty-Aware Distribution-to-Distribution Flow Matching for Scientific Imaging
SFM improves generalization under distribution shift for scientific imaging tasks while AVUQ supplies sample-efficient epistemic and aleatoric uncertainty estimates plus anomaly scores.
- Diffusion Models are Secretly Zero-Shot 3DGS Harmonizers
D3DR optimizes inserted 3DGS objects with a DDS-inspired diffusion objective plus a new personalization step to match scene lighting, reporting 2 dB PSNR gain over prior methods.
- RectifiedHR: Enable Efficient High-Resolution Synthesis via Energy Rectification
RectifiedHR is a training-free method that uses noise refresh and latent energy analysis to enable efficient high-resolution synthesis in diffusion models.
- HistoryPalette: Supporting Exploration and Reuse of Past Alternatives in Image Generation and Editing
HistoryPalette provides a palette interface for exploring and reusing prior design alternatives in generative image creation and editing, evaluated via user studies with creative professionals and client collaborators.
- I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models
I2VGen-XL applies cascaded diffusion models with a base stage for semantic preservation via hierarchical encoders and a refinement stage for detail and resolution, trained on 35 million text-video and 6 billion text-image pairs.
- IEA: Amateur-Friendly Conversational Image Editing Agent via Three Stages of Multitask Alignment
IEA is a tool-calling VLM for conversational image editing trained in three multitask stages that reports lower pixel distance, higher ROUGE-L, and top user-study rankings versus baselines.
- Missing Pattern Recognized Diffusion Imputation Model for Missing Not At Random
PRDIM is a diffusion model using a pattern recognizer to impute MNAR missing data by maximizing joint likelihood of observed values and missing mask via EM.
- Semantic Granularity Navigation in Image Editing
NaviEdit is a training-free inference-time controller that decouples edit progress from model scale traversal in diffusion-based image editing via self-consistency, reporting average gains across editors and backbones.
- Towards Robust Sequential Decomposition for Complex Image Editing
Develops a synthetic data pipeline for training sequential decomposition in generative image editing, showing robust gains with complexity and sim-to-real transfer via co-training.
- On the Controllability-Fidelity Frontier in Diffusion Editing
A study deriving mathematical formulations and bounds for diffusion editing objectives while empirically comparing methods on fidelity and control metrics and discussing ethical issues.
- A Tutorial on Diffusion Theory: From Differential Equations to Diffusion Models
A tutorial that unifies diffusion probabilistic models, score-based generative modeling, and SDE methods by deriving forward and reverse dynamics from a shared Gaussian noising process.
- UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
- CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator
- The Principles of Diffusion Models