Prompt-to-Prompt Image Editing with Cross Attention Control
Pith reviewed 2026-05-11 06:55 UTC · model grok-4.3
The pith
Cross-attention layers let users edit images by changing only the text prompt.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Cross-attention layers control the relation between the spatial layout of the image and each word in the prompt. Editing the textual prompt and correspondingly adjusting the cross-attention maps during inference allows the synthesis to reflect the new prompt while preserving the original image outside the edited regions.
What carries the argument
Cross-attention layers, which map words from the prompt to specific spatial positions in the generated image and can be directly edited at inference time.
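To make the mechanism concrete, here is a minimal sketch of a single-head cross-attention layer of the kind the claim concerns. The shapes, weight names, and single-head layout are illustrative simplifications, not the paper's implementation; the point is that column j of the resulting map is a spatial heat map for prompt token j.

```python
# Minimal single-head cross-attention sketch (illustrative, not the paper's code).
import torch
import torch.nn.functional as F

def cross_attention(image_feats, text_embeds, W_q, W_k, W_v):
    """image_feats: (P, d_img) flattened spatial features, P = H*W positions.
    text_embeds: (T, d_txt), one embedding per prompt token.
    Returns updated features and the (P, T) attention maps; column j is
    the spatial map for token j, which is what the method edits."""
    Q = image_feats @ W_q                       # (P, d)
    K = text_embeds @ W_k                       # (T, d)
    V = text_embeds @ W_v                       # (T, d)
    maps = F.softmax(Q @ K.T / Q.shape[-1] ** 0.5, dim=-1)  # (P, T)
    return maps @ V, maps
```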
If this is right
- Localized edits arise simply by replacing a word in the prompt and updating its attention map.
- Global edits arise by adding new specifications to the prompt and extending the corresponding attention.
- The degree to which any single word influences the image can be tuned by scaling its attention map.
- No spatial masks or model retraining are needed for the edits to succeed; a minimal code sketch of the three operations follows this list.
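A hedged sketch of those three operations, written as pure functions over the (P, T) maps from the sketch above. The token alignment `align` used for prompt refinement is my own simplification; the paper's exact injection rule is not specified in this summary.

```python
def word_swap(source_maps, target_maps):
    """Localized edit ("cat" -> "dog"): reuse the source prompt's maps so the
    layout is preserved while the swapped token changes the content."""
    return source_maps

def refine(source_maps, target_maps, align):
    """Global edit by adding words. align maps target token index -> source
    token index for tokens the two prompts share (an assumed alignment);
    tokens new to the target prompt keep their freshly computed maps."""
    edited = target_maps.clone()
    for t_idx, s_idx in align.items():
        edited[:, t_idx] = source_maps[:, s_idx]
    return edited

def reweight(maps, token_idx, scale):
    """Tune how strongly one word is reflected by scaling its map."""
    edited = maps.clone()
    edited[:, token_idx] *= scale
    return edited
```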
Where Pith is reading between the lines
- The same attention-control idea could simplify interfaces for iterative image refinement where users tweak prompts repeatedly.
- If cross-attention proves similarly dominant in other conditional generators, the technique might transfer to tasks such as text-guided video or 3D editing.
- Prompt-only editing reduces dependence on precise user drawing skills, which could broaden access to generative tools.
Load-bearing premise
That the cross-attention mechanism dominates word-to-region mapping and that targeted edits to these maps during inference will not create artifacts or require retraining the model.
What would settle it
Run the attention-editing procedure on a prompt change and check whether the output image incorporates the intended edit, keeps unedited regions unchanged, and avoids visible artifacts; systematic failure on any of these counts would falsify the central claim.
Original abstract
Recent large-scale text-driven synthesis models have attracted much attention thanks to their remarkable capabilities of generating highly diverse images that follow given text prompts. Such text-based synthesis methods are particularly appealing to humans who are used to verbally describe their intent. Therefore, it is only natural to extend the text-driven image synthesis to text-driven image editing. Editing is challenging for these generative models, since an innate property of an editing technique is to preserve most of the original image, while in the text-based models, even a small modification of the text prompt often leads to a completely different outcome. State-of-the-art methods mitigate this by requiring the users to provide a spatial mask to localize the edit, hence, ignoring the original structure and content within the masked region. In this paper, we pursue an intuitive prompt-to-prompt editing framework, where the edits are controlled by text only. To this end, we analyze a text-conditioned model in depth and observe that the cross-attention layers are the key to controlling the relation between the spatial layout of the image to each word in the prompt. With this observation, we present several applications which monitor the image synthesis by editing the textual prompt only. This includes localized editing by replacing a word, global editing by adding a specification, and even delicately controlling the extent to which a word is reflected in the image. We present our results over diverse images and prompts, demonstrating high-quality synthesis and fidelity to the edited prompts.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that cross-attention layers in text-conditioned diffusion models (e.g., Stable Diffusion) are the primary mechanism controlling the spatial mapping from prompt words to image regions. By analyzing these layers, the authors develop a prompt-to-prompt editing framework that monitors and edits cross-attention maps during the denoising trajectory to achieve text-only edits: localized changes via word replacement, global changes via prompt augmentation, and fine control over a word's visual extent. The method requires no spatial masks, no retraining, and preserves most of the original image structure while following the edited prompt. Results are demonstrated qualitatively on diverse images and prompts.
Significance. If the central observation and editing procedure hold under broader testing, this work offers a practical advance for intuitive text-driven editing that avoids the limitations of mask-based methods, which discard original content inside the mask. The approach is grounded in direct analysis of existing model components rather than new training or auxiliary networks, and it yields falsifiable qualitative predictions about attention map edits. Strengths include the identification of cross-attention as a controllable interface and the demonstration of multiple applications (word swap, addition, extent control) without introducing free parameters. The absence of quantitative metrics or ablations, however, leaves the robustness and generality of the claims difficult to assess.
Major comments (3)
- [§3.2] §3.2 (Cross-Attention Control): The procedure for replacing source-prompt attention maps into the target-prompt synthesis assumes direct transferability, but provides no analysis of compatibility with the evolving noisy latent at each timestep. Because cross-attention is recomputed from the current noisy input, maps derived from an independent source trajectory may misalign with the target prompt's conditioning or noise level, risking artifacts or loss of structure; the manuscript does not report ablation on replacement schedules or layer selection to test this assumption.
- [§4] §4 (Applications and Experiments): The central claim that edits are controlled by text only and achieve high fidelity rests on qualitative examples, yet no quantitative metrics (e.g., CLIP similarity to target prompt, LPIPS for structure preservation, or user studies) or baseline comparisons (mask-based editing, prompt interpolation) are provided. This makes it impossible to verify whether observed success generalizes or depends on per-example tuning of which timesteps and layers receive map edits.
- [§3.1] §3.1 (Observation): The statement that cross-attention layers are 'the key' to word-to-region mapping is presented as an empirical observation, but the manuscript does not quantify the contribution of cross-attention relative to other components (self-attention, MLP layers) via controlled interventions such as freezing or ablating those layers while editing; a sketch of such an intervention follows this list.
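A hedged sketch of the kind of intervention this comment asks for: flatten one attention family's maps to uniform during synthesis and compare outputs. The `controller` and `generate` hooks are hypothetical stand-ins for an instrumented sampler, not anything the manuscript provides.

```python
import torch

def uniform_maps(maps):
    """Replace attention with a uniform distribution over tokens, erasing
    whatever word-to-region mapping this layer has learned."""
    return torch.full_like(maps, 1.0 / maps.shape[-1])

def ablation_probe(family, controller, generate):
    """Ablate one attention family ('cross' or 'self') for a whole synthesis
    run. If cross-attention dominates word-to-region control, ablating it
    should destroy prompt-localized structure far more than ablating
    self-attention. `controller` and `generate` are hypothetical hooks."""
    controller.overrides = {family: uniform_maps}   # applied inside each matching layer
    image = generate()
    controller.overrides = {}
    return image
```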
Minor comments (2)
- [Figures 2,4] Figure 2 and Figure 4: attention map visualizations would benefit from explicit side-by-side comparison of source vs. edited maps at the same timestep, with prompt text overlaid for clarity.
- [§3] The method description in §3 does not specify the exact interpolation or injection formula used when combining source and target attention maps (e.g., whether maps are averaged, replaced only in certain heads, or thresholded); candidate formulas are sketched below.
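For concreteness, here are the three candidate injection rules the comment names, written as hedged guesses over per-head maps of shape (heads, P, T); none of them is claimed to be the manuscript's formula.

```python
import torch

def inject_average(src, tgt, alpha=0.5):
    """Soft interpolation between source and target maps."""
    return alpha * src + (1.0 - alpha) * tgt

def inject_selected_heads(src, tgt, head_mask):
    """Replace maps only in selected heads; head_mask is a (heads,) bool
    tensor broadcast over the spatial and token axes."""
    return torch.where(head_mask[:, None, None], src, tgt)

def inject_thresholded(src, tgt, tau=0.3):
    """Hard replacement only where the source map is confident."""
    return torch.where(src > tau, src, tgt)
```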
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and outline the changes we will make to strengthen the manuscript.
Point-by-point responses
-
Referee: [§3.2] §3.2 (Cross-Attention Control): The procedure for replacing source-prompt attention maps into the target-prompt synthesis assumes direct transferability, but provides no analysis of compatibility with the evolving noisy latent at each timestep. Because cross-attention is recomputed from the current noisy input, maps derived from an independent source trajectory may misalign with the target prompt's conditioning or noise level, risking artifacts or loss of structure; the manuscript does not report ablation on replacement schedules or layer selection to test this assumption.
Authors: We appreciate the referee highlighting this aspect of the replacement procedure. The source attention maps are injected into the target denoising trajectory while using the target prompt's text embeddings at each step, which provides a degree of adaptation to the target conditioning. Nevertheless, the manuscript indeed lacks explicit analysis of timestep compatibility or ablations on schedules and layers. In the revised version we will add a dedicated discussion of the replacement mechanism together with ablation experiments varying the timesteps and layers at which maps are replaced, to demonstrate robustness and identify any failure modes. revision: yes
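A hedged sketch of the schedule described here: source maps are injected only for an early fraction of the denoising steps, after which the target prompt's own attention takes over. The controller pattern, and the idea that every cross-attention layer consults it, are assumptions for illustration, not the authors' released code.

```python
# Hedged sketch of timestep-scheduled attention injection; the sampler is
# assumed to call the controller from each cross-attention layer and to
# increment `step` once per denoising iteration.
class AttentionController:
    def __init__(self, num_steps, cross_replace_frac=0.8):
        # Inject source maps during the early, high-noise steps where the
        # coarse layout is decided; later steps refine appearance freely.
        self.replace_until = int(num_steps * cross_replace_frac)
        self.step = 0                # incremented by the sampler each iteration
        self.source_maps = {}        # layer name -> maps saved from the source run

    def __call__(self, layer_name, target_maps):
        if self.step < self.replace_until and layer_name in self.source_maps:
            return self.source_maps[layer_name]   # fix layout from the source trajectory
        return target_maps                        # let the target prompt's attention evolve
```

Both trajectories would share the same initial noise, so any divergence is attributable to the prompt change plus the injection schedule; sweeping `cross_replace_frac` and the set of hooked layers is exactly the ablation the revision promises.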
-
Referee: [§4] §4 (Applications and Experiments): The central claim that edits are controlled by text only and achieve high fidelity rests on qualitative examples, yet no quantitative metrics (e.g., CLIP similarity to target prompt, LPIPS for structure preservation, or user studies) or baseline comparisons (mask-based editing, prompt interpolation) are provided. This makes it impossible to verify whether observed success generalizes or depends on per-example tuning of which timesteps and layers receive map edits.
Authors: We agree that the current evaluation is limited to qualitative demonstrations and that quantitative metrics and baselines would allow readers to better assess generality. Our focus was on showing that text-only control is feasible across diverse cases without masks or retraining. In the revision we will incorporate CLIP similarity to the target prompt, LPIPS to the source image for structure preservation, and direct comparisons against mask-based editing and prompt-interpolation baselines. We will also document the specific timestep and layer choices used throughout the experiments. revision: yes
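A hedged sketch of the two promised metrics using the openai/CLIP and lpips packages; the preprocessing and model choices here are assumptions, not the evaluation protocol the authors will adopt.

```python
import torch
import clip    # https://github.com/openai/CLIP
import lpips   # https://github.com/richzhang/PerceptualSimilarity

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
lpips_fn = lpips.LPIPS(net="alex").to(device)

def clip_score(pil_image, target_prompt):
    """Cosine similarity between the edited image and the target prompt;
    higher means better fidelity to the edit."""
    img = preprocess(pil_image).unsqueeze(0).to(device)
    txt = clip.tokenize([target_prompt]).to(device)
    with torch.no_grad():
        f_i = clip_model.encode_image(img)
        f_t = clip_model.encode_text(txt)
    f_i = f_i / f_i.norm(dim=-1, keepdim=True)
    f_t = f_t / f_t.norm(dim=-1, keepdim=True)
    return (f_i @ f_t.T).item()

def structure_distance(source, edited):
    """LPIPS between source and edited images, both (1, 3, H, W) tensors
    scaled to [-1, 1]; lower means better structure preservation."""
    with torch.no_grad():
        return lpips_fn(source.to(device), edited.to(device)).item()
```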
-
Referee: [§3.1] §3.1 (Observation): The statement that cross-attention layers are 'the key' to word-to-region mapping is presented as an empirical observation, but the manuscript does not quantify the contribution of cross-attention relative to other components (self-attention, MLP layers) via controlled interventions such as freezing or ablating those layers while editing.
Authors: The claim rests on the direct correspondence we observe between cross-attention maps and the spatial regions governed by individual prompt words, which enables the editing operations we demonstrate. We did not perform layer-freezing or ablation interventions, as these would require non-trivial architectural modifications outside the scope of the presented analysis. In the revised manuscript we will expand §3.1 with additional visualizations comparing attention behavior across layer types and a clearer justification for focusing on cross-attention as the controllable interface for word-to-region mapping. revision: partial
Circularity Check
No circularity: claims rest on empirical observation of existing model components
Full rationale
The paper's derivation begins with an analysis of cross-attention layers in standard text-conditioned generative models and observes their role in mapping words to spatial regions. This observation motivates the prompt-only editing applications without any reduction to self-defined quantities, fitted parameters renamed as predictions, or load-bearing self-citations. The abstract and the described chain present the key property as an independent finding about the underlying architecture rather than a tautology or imported ansatz, grounding the overall argument in observable model behavior.
Forward citations
Cited by 48 Pith papers
-
Masked Generative Transformer Is What You Need for Image Editing
EditMGT applies masked generative transformers with attention consolidation and region-hold sampling to deliver state-of-the-art localized image editing at 6x the speed of diffusion methods.
-
RevealLayer: Disentangling Hidden and Visible Layers via Occlusion-Aware Image Decomposition
RevealLayer decomposes natural images into multiple RGBA layers using diffusion models with region-aware attention, occlusion-guided adaptation, and a composite loss, outperforming prior methods on a new benchmark dataset.
-
What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers
A method using attention head vectors detects and suppresses risky content generation in Diffusion Transformers at inference time.
-
Attention Sinks in Diffusion Transformers: A Causal Analysis
Suppressing attention sinks in diffusion transformers does not degrade text-image alignment or most preference metrics, revealing a dissociation between generation trajectory changes and semantic output quality.
-
Delta-Adapter: Scalable Exemplar-Based Image Editing with Single-Pair Supervision
Delta-Adapter extracts a semantic delta from a single image pair via a pre-trained vision encoder and injects it through a Perceiver adapter to enable scalable single-pair supervised editing.
-
DirectEdit: Step-Level Accurate Inversion for Flow-Based Image Editing
DirectEdit achieves step-level accurate inversion for flow-based image editing by directly aligning forward paths, using attention feature injection and mask-guided noise blending to balance fidelity and editability w...
-
SpecEdit: Training-Free Acceleration for Diffusion based Image Editing via Semantic Locking
SpecEdit accelerates diffusion-based image editing up to 10x by using a low-resolution draft to identify edit-relevant tokens via semantic discrepancies for selective high-resolution denoising.
-
ResetEdit: Precise Text-guided Editing of Generated Image via Resettable Starting Latent
ResetEdit embeds a recoverable discrepancy signal during image generation in diffusion models to reconstruct an approximate original latent for high-fidelity text-guided editing.
-
GeoEdit: Local Frames for Fast, Training-Free On-Manifold Editing in Diffusion Models
GeoEdit constructs local tangent frames from small perturbations to initial noise, enabling Jacobian-free on-manifold edits in diffusion models via alternating tangent steps and diffusion projections.
-
Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization
Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.
-
StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition
StyleID supplies human-perception-aligned benchmarks and fine-tuned encoders that improve facial identity recognition robustness across stylization types and strengths.
-
AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe
AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both a...
-
TransSplat: Unbalanced Semantic Transport for Language-Driven 3DGS Editing
TransSplat uses unbalanced semantic transport to match edited 2D evidence with 3D Gaussians and recover a shared 3D edit field, yielding better local accuracy and structural consistency than prior view-consistency methods.
-
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.
-
UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs
UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
-
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
-
Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation
Prompt Relay is an inference-time plug-and-play method that penalizes cross-attention to enforce temporal prompt alignment and reduce semantic entanglement in multi-event video generation.
-
Your Pre-trained Diffusion Model Secretly Knows Restoration
Pre-trained diffusion models inherently support image restoration that can be unlocked by optimizing prompt embeddings at the text encoder output using a diffusion bridge formulation, achieving competitive results on ...
-
PromptEvolver: Prompt Inversion through Evolutionary Optimization in Natural-Language Space
PromptEvolver recovers high-fidelity natural language prompts for given images by evolving them via genetic algorithm guided by a vision-language model, outperforming prior methods on benchmarks.
-
CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator
CAMEO uses coordinated agents for planning, prompting, generation, and quality feedback to achieve higher structural reliability in conditional image editing than single-step models.
-
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
ELLA introduces a timestep-aware semantic connector to link LLMs with diffusion models for improved dense prompt following, validated on a new 1K-prompt benchmark.
-
Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...
-
OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation
OmniHumanoid factorizes transferable motion learning from embodiment-specific adaptation to enable scalable cross-embodiment video generation without paired data for new humanoids.
-
LimeCross: Context-Conditioned Layered Image Editing with Structural Consistency
LimeCross enables text-guided editing of individual layers in composite images by conditioning on cross-layer context via bi-stream attention while preserving layer integrity and introducing the LayerEditBench benchmark.
-
Attention Sinks in Diffusion Transformers: A Causal Analysis
Suppressing attention sinks in diffusion transformers does not degrade CLIP-T alignment at moderate levels but induces sink-specific perceptual shifts six times larger than equal-budget random masking.
-
Geometrically Constrained Stenosis Editing in Coronary Angiography via Entropic Optimal Transport
OT-Bridge Editor reframes localized image editing as a constrained entropic optimal transport problem to generate synthetic coronary angiograms that boost downstream stenosis detection by 27.8% on ARCADE and 23.0% on ...
-
Conservative Flows: A New Paradigm of Generative Models
Conservative flows generate by running probability-preserving stochastic dynamics initialized at data points rather than noise, using corrected Langevin or predictor-corrector mechanisms on top of any pretrained flow ...
-
MooD: Perception-Enhanced Efficient Affective Image Editing via Continuous Valence-Arousal Modeling
MooD introduces continuous valence-arousal modeling with VA-aware retrieval and perception-enhanced guidance for efficient, controllable affective image editing, plus a new AffectSet dataset.
-
MooD: Perception-Enhanced Efficient Affective Image Editing via Continuous Valence-Arousal Modeling
MooD is the first framework to use continuous Valence-Arousal values for fine-grained affective image editing via a VA-aware retrieval strategy, visual transfer, semantic guidance, and the new AffectSet dataset.
-
PhysEdit: Physically-Consistent Region-Aware Image Editing via Adaptive Spatio-Temporal Reasoning
PhysEdit introduces adaptive reasoning depth and spatial masking to make image editing faster and more instruction-aligned without retraining the base model.
-
Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization
Semi-DPO applies semi-supervised learning to noisy preference data in diffusion DPO by training first on consensus pairs then iteratively pseudo-labeling conflicts, yielding state-of-the-art alignment with complex hum...
-
Rethinking Where to Edit: Task-Aware Localization for Instruction-Based Image Editing
Task-aware localization via attention cues and feature centroids from source/target streams in IIE models improves non-edit consistency while preserving instruction following.
-
Geometric Decoupling: Diagnosing the Structural Instability of Latent
Latent diffusion models exhibit geometric decoupling where curvature in out-of-distribution generation is misallocated to unstable semantic boundaries instead of image details, identifying geometric hotspots as the st...
-
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.
-
Towards Design Compositing
GIST is a training-free identity-preserving image compositor that improves visual harmony when integrating disparate elements into design pipelines.
-
Region-Constrained Group Relative Policy Optimization for Flow-Based Image Editing
RC-GRPO-Editing constrains GRPO exploration to editing regions via localized noise and attention rewards, improving instruction adherence and non-target preservation in flow-based image editing.
-
InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation
InsEdit adapts a video diffusion backbone for text-instruction video editing via Mutual Context Attention, achieving SOTA open-source results with O(100K) data while also supporting image editing.
-
ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks
ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.
-
Generative Phomosaic with Structure-Aligned and Personalized Diffusion
The paper presents the first generative photomosaic framework that synthesizes tiles via structure-aligned diffusion models and few-shot personalization instead of color-based matching from large tile collections.
-
SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing
SpatialEdit provides a benchmark, large synthetic dataset, and baseline model for precise object and camera spatial manipulations in images, with the model beating priors on spatial editing.
-
HorizonWeaver: Generalizable Multi-Level Semantic Editing for Driving Scenes
HorizonWeaver enables photorealistic, instruction-driven multi-level editing of complex driving scenes with improved generalization via a new paired dataset, language-guided masks, and joint training losses.
-
Attention, May I Have Your Decision? Localizing Generative Choices in Diffusion Models
Implicit generative choices in diffusion models for ambiguous prompts are localized principally in self-attention layers, enabling a targeted ICM steering method that outperforms prior debiasing approaches.
-
ImgEdit: A Unified Image Editing Dataset and Benchmark
ImgEdit supplies 1.2 million curated edit pairs and a three-part benchmark that let a VLM-based model outperform prior open-source editors on adherence, quality, and detail preservation.
-
Towards Robust Sequential Decomposition for Complex Image Editing
Sequential decomposition trained on synthetic editing tasks improves robustness for complex image instructions and transfers to real images via co-training.
-
HEART: Hyperspherical Embedding Alignment via Kent-Representation Traversal in Diffusion Models
HEART performs Kent-aware geodesic transformations on hyperspherical text embeddings to enable precise, training-free control in text-to-image diffusion models.
-
Training-Free Object-Background Compositional T2I via Dynamic Spatial Guidance and Multi-Path Pruning
A training-free method with time-dependent attention gating and trajectory pruning enhances object-background balance in diffusion-based image synthesis.
-
Selective Aggregation of Attention Maps Improves Diffusion-Based Visual Interpretation
Selective aggregation of cross-attention maps from the most relevant heads in diffusion-based T2I models yields higher mean IoU for visual interpretation than standard aggregation methods like DAAM.
-
LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation
This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...
Reference graph
Works this paper leans on
[1] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2stylegan: How to embed images into the stylegan latent space? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4432–4441, 2019.
[2] Rameen Abdal, Peihao Zhu, John Femiani, Niloy J Mitra, and Peter Wonka. Clip2stylegan: Unsupervised extraction of stylegan edit directions. arXiv preprint arXiv:2112.05219, 2021.
[3] Yuval Alaluf, Omer Tov, Ron Mokady, Rinon Gal, and Amit Bermano. Hyperstyle: Stylegan inversion with hypernetworks for real image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18511–18521, 2022.
[4] Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended latent diffusion. arXiv preprint arXiv:2206.02779, 2022.
[5] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18208–18218, 2022.
[6] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2live: Text-driven layered image and video editing. arXiv preprint arXiv:2204.02491, 2022.
[7] David Bau, Alex Andonian, Audrey Cui, YeonHwan Park, Ali Jahanian, Aude Oliva, and Antonio Torralba. Paint by word, 2021.
[8] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[9] Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, and Edward Raff. Vqgan-clip: Open domain image generation and editing with natural language guidance. arXiv preprint arXiv:2204.08583, 2022.
[10] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
[11] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34:19822–19835, 2021.
[12] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883, 2021.
[13] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. arXiv preprint arXiv:2203.13131, 2022.
[14] Rinon Gal, Or Patashnik, Haggai Maron, Gal Chechik, and Daniel Cohen-Or. Stylegan-nada: Clip-guided domain adaptation of image generators. arXiv preprint arXiv:2108.00946, 2021.
[15] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
[16] Tobias Hinz, Stefan Heinrich, and Stefan Wermter. Semantic object accuracy for generative text-to-image synthesis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[17] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[18] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
[19] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. Advances in Neural Information Processing Systems, 34:852–863, 2021.
[20] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
[21] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110–8119, 2020.
[22] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2426–2435, 2022.
[23] Gihyun Kwon and Jong Chul Ye. Clipstyler: Image style transfer with a single text condition. arXiv preprint arXiv:2112.00374, 2021.
[24] Guillaume Lample, Neil Zeghidour, Nicolas Usunier, Antoine Bordes, Ludovic Denoyer, and Marc'Aurelio Ranzato. Fader networks: Manipulating images by sliding attributes. Advances in Neural Information Processing Systems, 30, 2017.
[25] Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, and Philip Torr. Controllable text-to-image generation. Advances in Neural Information Processing Systems, 32, 2019.
[26] Wenbo Li, Pengchuan Zhang, Lei Zhang, Qiuyuan Huang, Xiaodong He, Siwei Lyu, and Jianfeng Gao. Object-driven text-to-image synthesis via adversarial training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12174–12182, 2019.
[27] Ron Mokady, Omer Tov, Michal Yarom, Oran Lang, Inbar Mosseri, Tali Dekel, Daniel Cohen-Or, and Michal Irani. Self-distilled stylegan: Towards generation from internet photos. In Special Interest Group on Computer Graphics and Interactive Techniques Conference Proceedings, pages 1–9, 2022.
[28] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
[29] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. arXiv preprint arXiv:2103.17249, 2021.
[30] Tingting Qiao, Jing Zhang, Duanqing Xu, and Dacheng Tao. Learn, imagine and create: Text-to-image generation from prior knowledge. Advances in Neural Information Processing Systems, 32, 2019.
[31] Tingting Qiao, Jing Zhang, Duanqing Xu, and Dacheng Tao. Mirrorgan: Learning text-to-image generation by redescription. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1505–1514, 2019.
[32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
[33] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
[34] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
[35] Daniel Roich, Ron Mokady, Amit H. Bermano, and Daniel Cohen-Or. Pivotal tuning for latent-based editing of real images. ACM Transactions on Graphics (TOG), 2022.
[36] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
[37] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[38] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205...
[39] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
[40] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2020.
[41] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
[42] Ming Tao, Hao Tang, Songsong Wu, Nicu Sebe, Xiao-Yuan Jing, Fei Wu, and Bingkun Bao. Df-gan: Deep fusion generative adversarial networks for text-to-image synthesis. arXiv preprint arXiv:2008.05865, 2020.
[43] Omer Tov, Yuval Alaluf, Yotam Nitzan, Or Patashnik, and Daniel Cohen-Or. Designing an encoder for stylegan image manipulation. arXiv preprint arXiv:2102.02766, 2021.
[44] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017.
[45] Tengfei Wang, Yong Zhang, Yanbo Fan, Jue Wang, and Qifeng Chen. High-fidelity gan inversion for image attribute editing. arXiv preprint arXiv:2109.06590, 2021.
[46] Weihao Xia, Yujiu Yang, Jing-Hao Xue, and Baoyuan Wu. Tedigan: Text-guided diverse face image generation and manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2256–2265, 2021.
[47] Weihao Xia, Yulun Zhang, Yujiu Yang, Jing-Hao Xue, Bolei Zhou, and Ming-Hsuan Yang. Gan inversion: A survey, 2021.
[48] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022.
[49] Zizhao Zhang, Yuanpu Xie, and Lin Yang. Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6199–6208, 2018.
[50] Jiapeng Zhu, Yujun Shen, Deli Zhao, and Bolei Zhou. In-domain gan inversion for real image editing. arXiv preprint arXiv:2004.00049, 2020.
[51] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.