Prompt-to-Prompt Image Editing with Cross Attention Control
Pith reviewed 2026-05-11 06:55 UTC · model grok-4.3
The pith
Cross-attention layers let users edit images by changing only the text prompt.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Cross-attention layers control the relation between the spatial layout of the image and each word in the prompt. Editing the textual prompt and correspondingly adjusting the cross-attention maps during inference allows the synthesis to reflect the new prompt while preserving the original image outside the edited regions.
What carries the argument
Cross-attention layers, which map words from the prompt to specific spatial positions in the generated image and can be directly edited at inference time.
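To make the mechanism concrete, here is a minimal sketch of a single-head cross-attention layer of the kind the claim concerns. The shapes, weight names, and single-head layout are illustrative simplifications, not the paper's implementation; the point is that column j of the resulting map is a spatial heat map for prompt token j.

```python
# Minimal single-head cross-attention sketch (illustrative, not the paper's code).
import torch
import torch.nn.functional as F

def cross_attention(image_feats, text_embeds, W_q, W_k, W_v):
    """image_feats: (P, d_img) flattened spatial features, P = H*W positions.
    text_embeds: (T, d_txt), one embedding per prompt token.
    Returns updated features and the (P, T) attention maps; column j is
    the spatial map for token j, which is what the method edits."""
    Q = image_feats @ W_q                       # (P, d)
    K = text_embeds @ W_k                       # (T, d)
    V = text_embeds @ W_v                       # (T, d)
    maps = F.softmax(Q @ K.T / Q.shape[-1] ** 0.5, dim=-1)  # (P, T)
    return maps @ V, maps
```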
If this is right
- Localized edits arise simply by replacing a word in the prompt and updating its attention map.
- Global edits arise by adding new specifications to the prompt and extending the corresponding attention.
- The degree to which any single word influences the image can be tuned by scaling its attention map.
- No spatial masks or model retraining are needed for the edits to succeed; a minimal code sketch of the three operations follows this list.
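A hedged sketch of those three operations, written as pure functions over the (P, T) maps from the sketch above. The token alignment `align` used for prompt refinement is my own simplification; the paper's exact injection rule is not specified in this summary.

```python
def word_swap(source_maps, target_maps):
    """Localized edit ("cat" -> "dog"): reuse the source prompt's maps so the
    layout is preserved while the swapped token changes the content."""
    return source_maps

def refine(source_maps, target_maps, align):
    """Global edit by adding words. align maps target token index -> source
    token index for tokens the two prompts share (an assumed alignment);
    tokens new to the target prompt keep their freshly computed maps."""
    edited = target_maps.clone()
    for t_idx, s_idx in align.items():
        edited[:, t_idx] = source_maps[:, s_idx]
    return edited

def reweight(maps, token_idx, scale):
    """Tune how strongly one word is reflected by scaling its map."""
    edited = maps.clone()
    edited[:, token_idx] *= scale
    return edited
```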
Where Pith is reading between the lines
- The same attention-control idea could simplify interfaces for iterative image refinement where users tweak prompts repeatedly.
- If cross-attention proves similarly dominant in other conditional generators, the technique might transfer to tasks such as text-guided video or 3D editing.
- Prompt-only editing reduces dependence on precise user drawing skills, which could broaden access to generative tools.
Load-bearing premise
That the cross-attention mechanism dominates word-to-region mapping and that targeted edits to these maps during inference will not create artifacts or require retraining the model.
What would settle it
Run the attention-editing procedure on a prompt change and check whether the output image incorporates the intended edit, keeps unedited regions unchanged, and avoids visible artifacts; systematic failure on any of these counts would falsify the central claim.
Original abstract
Recent large-scale text-driven synthesis models have attracted much attention thanks to their remarkable capabilities of generating highly diverse images that follow given text prompts. Such text-based synthesis methods are particularly appealing to humans who are used to verbally describe their intent. Therefore, it is only natural to extend the text-driven image synthesis to text-driven image editing. Editing is challenging for these generative models, since an innate property of an editing technique is to preserve most of the original image, while in the text-based models, even a small modification of the text prompt often leads to a completely different outcome. State-of-the-art methods mitigate this by requiring the users to provide a spatial mask to localize the edit, hence, ignoring the original structure and content within the masked region. In this paper, we pursue an intuitive prompt-to-prompt editing framework, where the edits are controlled by text only. To this end, we analyze a text-conditioned model in depth and observe that the cross-attention layers are the key to controlling the relation between the spatial layout of the image to each word in the prompt. With this observation, we present several applications which monitor the image synthesis by editing the textual prompt only. This includes localized editing by replacing a word, global editing by adding a specification, and even delicately controlling the extent to which a word is reflected in the image. We present our results over diverse images and prompts, demonstrating high-quality synthesis and fidelity to the edited prompts.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that cross-attention layers in text-conditioned diffusion models (e.g., Stable Diffusion) are the primary mechanism controlling the spatial mapping from prompt words to image regions. By analyzing these layers, the authors develop a prompt-to-prompt editing framework that monitors and edits cross-attention maps during the denoising trajectory to achieve text-only edits: localized changes via word replacement, global changes via prompt augmentation, and fine control over a word's visual extent. The method requires no spatial masks, no retraining, and preserves most of the original image structure while following the edited prompt. Results are demonstrated qualitatively on diverse images and prompts.
Significance. If the central observation and editing procedure hold under broader testing, this work offers a practical advance for intuitive text-driven editing that avoids the limitations of mask-based methods, which discard original content inside the mask. The approach is grounded in direct analysis of existing model components rather than new training or auxiliary networks, and it yields falsifiable qualitative predictions about attention map edits. Strengths include the identification of cross-attention as a controllable interface and the demonstration of multiple applications (word swap, addition, extent control) without introducing free parameters. The absence of quantitative metrics or ablations, however, leaves the robustness and generality of the claims difficult to assess.
Major comments (3)
- [§3.2] §3.2 (Cross-Attention Control): The procedure for replacing source-prompt attention maps into the target-prompt synthesis assumes direct transferability, but provides no analysis of compatibility with the evolving noisy latent at each timestep. Because cross-attention is recomputed from the current noisy input, maps derived from an independent source trajectory may misalign with the target prompt's conditioning or noise level, risking artifacts or loss of structure; the manuscript does not report ablation on replacement schedules or layer selection to test this assumption.
- [§4] §4 (Applications and Experiments): The central claim that edits are controlled by text only and achieve high fidelity rests on qualitative examples, yet no quantitative metrics (e.g., CLIP similarity to target prompt, LPIPS for structure preservation, or user studies) or baseline comparisons (mask-based editing, prompt interpolation) are provided. This makes it impossible to verify whether observed success generalizes or depends on per-example tuning of which timesteps and layers receive map edits.
- [§3.1] §3.1 (Observation): The statement that cross-attention layers are 'the key' to word-to-region mapping is presented as an empirical observation, but the manuscript does not quantify the contribution of cross-attention relative to other components (self-attention, MLP layers) via controlled interventions such as freezing or ablating those layers while editing; a sketch of such an intervention follows this list.
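A hedged sketch of the kind of intervention this comment asks for: flatten one attention family's maps to uniform during synthesis and compare outputs. The `controller` and `generate` hooks are hypothetical stand-ins for an instrumented sampler, not anything the manuscript provides.

```python
import torch

def uniform_maps(maps):
    """Replace attention with a uniform distribution over tokens, erasing
    whatever word-to-region mapping this layer has learned."""
    return torch.full_like(maps, 1.0 / maps.shape[-1])

def ablation_probe(family, controller, generate):
    """Ablate one attention family ('cross' or 'self') for a whole synthesis
    run. If cross-attention dominates word-to-region control, ablating it
    should destroy prompt-localized structure far more than ablating
    self-attention. `controller` and `generate` are hypothetical hooks."""
    controller.overrides = {family: uniform_maps}   # applied inside each matching layer
    image = generate()
    controller.overrides = {}
    return image
```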
Minor comments (2)
- [Figures 2,4] Figure 2 and Figure 4: attention map visualizations would benefit from explicit side-by-side comparison of source vs. edited maps at the same timestep, with prompt text overlaid for clarity.
- [§3] The method description in §3 does not specify the exact interpolation or injection formula used when combining source and target attention maps (e.g., whether maps are averaged, replaced only in certain heads, or thresholded); candidate formulas are sketched below.
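For concreteness, here are the three candidate injection rules the comment names, written as hedged guesses over per-head maps of shape (heads, P, T); none of them is claimed to be the manuscript's formula.

```python
import torch

def inject_average(src, tgt, alpha=0.5):
    """Soft interpolation between source and target maps."""
    return alpha * src + (1.0 - alpha) * tgt

def inject_selected_heads(src, tgt, head_mask):
    """Replace maps only in selected heads; head_mask is a (heads,) bool
    tensor broadcast over the spatial and token axes."""
    return torch.where(head_mask[:, None, None], src, tgt)

def inject_thresholded(src, tgt, tau=0.3):
    """Hard replacement only where the source map is confident."""
    return torch.where(src > tau, src, tgt)
```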
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and outline the changes we will make to strengthen the manuscript.
Point-by-point responses
-
Referee: [§3.2] §3.2 (Cross-Attention Control): The procedure for replacing source-prompt attention maps into the target-prompt synthesis assumes direct transferability, but provides no analysis of compatibility with the evolving noisy latent at each timestep. Because cross-attention is recomputed from the current noisy input, maps derived from an independent source trajectory may misalign with the target prompt's conditioning or noise level, risking artifacts or loss of structure; the manuscript does not report ablation on replacement schedules or layer selection to test this assumption.
Authors: We appreciate the referee highlighting this aspect of the replacement procedure. The source attention maps are injected into the target denoising trajectory while using the target prompt's text embeddings at each step, which provides a degree of adaptation to the target conditioning. Nevertheless, the manuscript indeed lacks explicit analysis of timestep compatibility or ablations on schedules and layers. In the revised version we will add a dedicated discussion of the replacement mechanism together with ablation experiments varying the timesteps and layers at which maps are replaced, to demonstrate robustness and identify any failure modes. revision: yes
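A hedged sketch of the schedule described here: source maps are injected only for an early fraction of the denoising steps, after which the target prompt's own attention takes over. The controller pattern, and the idea that every cross-attention layer consults it, are assumptions for illustration, not the authors' released code.

```python
# Hedged sketch of timestep-scheduled attention injection; the sampler is
# assumed to call the controller from each cross-attention layer and to
# increment `step` once per denoising iteration.
class AttentionController:
    def __init__(self, num_steps, cross_replace_frac=0.8):
        # Inject source maps during the early, high-noise steps where the
        # coarse layout is decided; later steps refine appearance freely.
        self.replace_until = int(num_steps * cross_replace_frac)
        self.step = 0                # incremented by the sampler each iteration
        self.source_maps = {}        # layer name -> maps saved from the source run

    def __call__(self, layer_name, target_maps):
        if self.step < self.replace_until and layer_name in self.source_maps:
            return self.source_maps[layer_name]   # fix layout from the source trajectory
        return target_maps                        # let the target prompt's attention evolve
```

Both trajectories would share the same initial noise, so any divergence is attributable to the prompt change plus the injection schedule; sweeping `cross_replace_frac` and the set of hooked layers is exactly the ablation the revision promises.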
-
Referee: [§4] §4 (Applications and Experiments): The central claim that edits are controlled by text only and achieve high fidelity rests on qualitative examples, yet no quantitative metrics (e.g., CLIP similarity to target prompt, LPIPS for structure preservation, or user studies) or baseline comparisons (mask-based editing, prompt interpolation) are provided. This makes it impossible to verify whether observed success generalizes or depends on per-example tuning of which timesteps and layers receive map edits.
Authors: We agree that the current evaluation is limited to qualitative demonstrations and that quantitative metrics and baselines would allow readers to better assess generality. Our focus was on showing that text-only control is feasible across diverse cases without masks or retraining. In the revision we will incorporate CLIP similarity to the target prompt, LPIPS to the source image for structure preservation, and direct comparisons against mask-based editing and prompt-interpolation baselines. We will also document the specific timestep and layer choices used throughout the experiments. revision: yes
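A hedged sketch of the two promised metrics using the openai/CLIP and lpips packages; the preprocessing and model choices here are assumptions, not the evaluation protocol the authors will adopt.

```python
import torch
import clip    # https://github.com/openai/CLIP
import lpips   # https://github.com/richzhang/PerceptualSimilarity

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
lpips_fn = lpips.LPIPS(net="alex").to(device)

def clip_score(pil_image, target_prompt):
    """Cosine similarity between the edited image and the target prompt;
    higher means better fidelity to the edit."""
    img = preprocess(pil_image).unsqueeze(0).to(device)
    txt = clip.tokenize([target_prompt]).to(device)
    with torch.no_grad():
        f_i = clip_model.encode_image(img)
        f_t = clip_model.encode_text(txt)
    f_i = f_i / f_i.norm(dim=-1, keepdim=True)
    f_t = f_t / f_t.norm(dim=-1, keepdim=True)
    return (f_i @ f_t.T).item()

def structure_distance(source, edited):
    """LPIPS between source and edited images, both (1, 3, H, W) tensors
    scaled to [-1, 1]; lower means better structure preservation."""
    with torch.no_grad():
        return lpips_fn(source.to(device), edited.to(device)).item()
```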
-
Referee: [§3.1] §3.1 (Observation): The statement that cross-attention layers are 'the key' to word-to-region mapping is presented as an empirical observation, but the manuscript does not quantify the contribution of cross-attention relative to other components (self-attention, MLP layers) via controlled interventions such as freezing or ablating those layers while editing.
Authors: The claim rests on the direct correspondence we observe between cross-attention maps and the spatial regions governed by individual prompt words, which enables the editing operations we demonstrate. We did not perform layer-freezing or ablation interventions, as these would require non-trivial architectural modifications outside the scope of the presented analysis. In the revised manuscript we will expand §3.1 with additional visualizations comparing attention behavior across layer types and a clearer justification for focusing on cross-attention as the controllable interface for word-to-region mapping. revision: partial
Circularity Check
No circularity: claims rest on empirical observation of existing model components
Full rationale
The paper's derivation begins with an analysis of cross-attention layers in standard text-conditioned generative models and observes their role in mapping words to spatial regions. This observation motivates the prompt-only editing applications without any reduction to self-defined quantities, fitted parameters renamed as predictions, or load-bearing self-citations. The abstract and the described chain present the key property as an independent finding about the underlying architecture rather than a tautology or imported ansatz, grounding the overall argument in observable model behavior.
Forward citations
Cited by 48 Pith papers
-
Masked Generative Transformer Is What You Need for Image Editing
EditMGT applies masked generative transformers with attention consolidation and region-hold sampling to deliver state-of-the-art localized image editing at 6x the speed of diffusion methods.
-
RevealLayer: Disentangling Hidden and Visible Layers via Occlusion-Aware Image Decomposition
RevealLayer decomposes natural images into multiple RGBA layers using diffusion models with region-aware attention, occlusion-guided adaptation, and a composite loss, outperforming prior methods on a new benchmark dataset.
-
What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers
A method using attention head vectors detects and suppresses risky content generation in Diffusion Transformers at inference time.
-
Attention Sinks in Diffusion Transformers: A Causal Analysis
Suppressing attention sinks in diffusion transformers does not degrade text-image alignment or most preference metrics, revealing a dissociation between generation trajectory changes and semantic output quality.
-
Delta-Adapter: Scalable Exemplar-Based Image Editing with Single-Pair Supervision
Delta-Adapter extracts a semantic delta from a single image pair via a pre-trained vision encoder and injects it through a Perceiver adapter to enable scalable single-pair supervised editing.
-
DirectEdit: Step-Level Accurate Inversion for Flow-Based Image Editing
DirectEdit achieves step-level accurate inversion for flow-based image editing by directly aligning forward paths, using attention feature injection and mask-guided noise blending to balance fidelity and editability w...
-
SpecEdit: Training-Free Acceleration for Diffusion based Image Editing via Semantic Locking
SpecEdit accelerates diffusion-based image editing up to 10x by using a low-resolution draft to identify edit-relevant tokens via semantic discrepancies for selective high-resolution denoising.
-
ResetEdit: Precise Text-guided Editing of Generated Image via Resettable Starting Latent
ResetEdit embeds a recoverable discrepancy signal during image generation in diffusion models to reconstruct an approximate original latent for high-fidelity text-guided editing.
-
GeoEdit: Local Frames for Fast, Training-Free On-Manifold Editing in Diffusion Models
GeoEdit constructs local tangent frames from small perturbations to initial noise, enabling Jacobian-free on-manifold edits in diffusion models via alternating tangent steps and diffusion projections.
-
Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization
Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.
-
StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition
StyleID supplies human-perception-aligned benchmarks and fine-tuned encoders that improve facial identity recognition robustness across stylization types and strengths.
-
AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe
AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both a...
-
TransSplat: Unbalanced Semantic Transport for Language-Driven 3DGS Editing
TransSplat uses unbalanced semantic transport to match edited 2D evidence with 3D Gaussians and recover a shared 3D edit field, yielding better local accuracy and structural consistency than prior view-consistency methods.
-
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.
-
UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs
UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
-
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
-
Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation
Prompt Relay is an inference-time plug-and-play method that penalizes cross-attention to enforce temporal prompt alignment and reduce semantic entanglement in multi-event video generation.
-
Your Pre-trained Diffusion Model Secretly Knows Restoration
Pre-trained diffusion models inherently support image restoration that can be unlocked by optimizing prompt embeddings at the text encoder output using a diffusion bridge formulation, achieving competitive results on ...
-
PromptEvolver: Prompt Inversion through Evolutionary Optimization in Natural-Language Space
PromptEvolver recovers high-fidelity natural language prompts for given images by evolving them via genetic algorithm guided by a vision-language model, outperforming prior methods on benchmarks.
-
CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator
CAMEO uses coordinated agents for planning, prompting, generation, and quality feedback to achieve higher structural reliability in conditional image editing than single-step models.
-
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
ELLA introduces a timestep-aware semantic connector to link LLMs with diffusion models for improved dense prompt following, validated on a new 1K-prompt benchmark.
-
Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...
-
OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation
OmniHumanoid factorizes transferable motion learning from embodiment-specific adaptation to enable scalable cross-embodiment video generation without paired data for new humanoids.
-
LimeCross: Context-Conditioned Layered Image Editing with Structural Consistency
LimeCross enables text-guided editing of individual layers in composite images by conditioning on cross-layer context via bi-stream attention while preserving layer integrity and introducing the LayerEditBench benchmark.
-
Attention Sinks in Diffusion Transformers: A Causal Analysis
Suppressing attention sinks in diffusion transformers does not degrade CLIP-T alignment at moderate levels but induces sink-specific perceptual shifts six times larger than equal-budget random masking.
-
Geometrically Constrained Stenosis Editing in Coronary Angiography via Entropic Optimal Transport
OT-Bridge Editor reframes localized image editing as a constrained entropic optimal transport problem to generate synthetic coronary angiograms that boost downstream stenosis detection by 27.8% on ARCADE and 23.0% on ...
-
Conservative Flows: A New Paradigm of Generative Models
Conservative flows generate by running probability-preserving stochastic dynamics initialized at data points rather than noise, using corrected Langevin or predictor-corrector mechanisms on top of any pretrained flow ...
-
MooD: Perception-Enhanced Efficient Affective Image Editing via Continuous Valence-Arousal Modeling
MooD introduces continuous valence-arousal modeling with VA-aware retrieval and perception-enhanced guidance for efficient, controllable affective image editing, plus a new AffectSet dataset.
-
MooD: Perception-Enhanced Efficient Affective Image Editing via Continuous Valence-Arousal Modeling
MooD is the first framework to use continuous Valence-Arousal values for fine-grained affective image editing via a VA-aware retrieval strategy, visual transfer, semantic guidance, and the new AffectSet dataset.
-
PhysEdit: Physically-Consistent Region-Aware Image Editing via Adaptive Spatio-Temporal Reasoning
PhysEdit introduces adaptive reasoning depth and spatial masking to make image editing faster and more instruction-aligned without retraining the base model.
-
Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization
Semi-DPO applies semi-supervised learning to noisy preference data in diffusion DPO by training first on consensus pairs then iteratively pseudo-labeling conflicts, yielding state-of-the-art alignment with complex hum...
-
Rethinking Where to Edit: Task-Aware Localization for Instruction-Based Image Editing
Task-aware localization via attention cues and feature centroids from source/target streams in IIE models improves non-edit consistency while preserving instruction following.
-
Geometric Decoupling: Diagnosing the Structural Instability of Latent
Latent diffusion models exhibit geometric decoupling where curvature in out-of-distribution generation is misallocated to unstable semantic boundaries instead of image details, identifying geometric hotspots as the st...
-
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.
-
Towards Design Compositing
GIST is a training-free identity-preserving image compositor that improves visual harmony when integrating disparate elements into design pipelines.
-
Region-Constrained Group Relative Policy Optimization for Flow-Based Image Editing
RC-GRPO-Editing constrains GRPO exploration to editing regions via localized noise and attention rewards, improving instruction adherence and non-target preservation in flow-based image editing.
-
InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation
InsEdit adapts a video diffusion backbone for text-instruction video editing via Mutual Context Attention, achieving SOTA open-source results with O(100K) data while also supporting image editing.
-
ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks
ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.
-
Generative Phomosaic with Structure-Aligned and Personalized Diffusion
The paper presents the first generative photomosaic framework that synthesizes tiles via structure-aligned diffusion models and few-shot personalization instead of color-based matching from large tile collections.
-
SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing
SpatialEdit provides a benchmark, large synthetic dataset, and baseline model for precise object and camera spatial manipulations in images, with the model beating priors on spatial editing.
-
HorizonWeaver: Generalizable Multi-Level Semantic Editing for Driving Scenes
HorizonWeaver enables photorealistic, instruction-driven multi-level editing of complex driving scenes with improved generalization via a new paired dataset, language-guided masks, and joint training losses.
-
Attention, May I Have Your Decision? Localizing Generative Choices in Diffusion Models
Implicit generative choices in diffusion models for ambiguous prompts are localized principally in self-attention layers, enabling a targeted ICM steering method that outperforms prior debiasing approaches.
-
ImgEdit: A Unified Image Editing Dataset and Benchmark
ImgEdit supplies 1.2 million curated edit pairs and a three-part benchmark that let a VLM-based model outperform prior open-source editors on adherence, quality, and detail preservation.
-
Towards Robust Sequential Decomposition for Complex Image Editing
Sequential decomposition trained on synthetic editing tasks improves robustness for complex image instructions and transfers to real images via co-training.
-
HEART: Hyperspherical Embedding Alignment via Kent-Representation Traversal in Diffusion Models
HEART performs Kent-aware geodesic transformations on hyperspherical text embeddings to enable precise, training-free control in text-to-image diffusion models.
-
Training-Free Object-Background Compositional T2I via Dynamic Spatial Guidance and Multi-Path Pruning
A training-free method with time-dependent attention gating and trajectory pruning enhances object-background balance in diffusion-based image synthesis.
-
Selective Aggregation of Attention Maps Improves Diffusion-Based Visual Interpretation
Selective aggregation of cross-attention maps from the most relevant heads in diffusion-based T2I models yields higher mean IoU for visual interpretation than standard aggregation methods like DAAM.
-
LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation
This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...
Reference graph
Works this paper leans on
[1] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2stylegan: How to embed images into the stylegan latent space? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4432–4441, 2019.
[2] Rameen Abdal, Peihao Zhu, John Femiani, Niloy J Mitra, and Peter Wonka. Clip2stylegan: Unsupervised extraction of stylegan edit directions. arXiv preprint arXiv:2112.05219, 2021.
[3] Yuval Alaluf, Omer Tov, Ron Mokady, Rinon Gal, and Amit Bermano. Hyperstyle: Stylegan inversion with hypernetworks for real image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18511–18521, 2022.
[4] Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended latent diffusion. arXiv preprint arXiv:2206.02779, 2022.
[5] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18208–18218, 2022.
[6] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2live: Text-driven layered image and video editing. arXiv preprint arXiv:2204.02491, 2022.
[7] David Bau, Alex Andonian, Audrey Cui, YeonHwan Park, Ali Jahanian, Aude Oliva, and Antonio Torralba. Paint by word, 2021.
[8] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[9] Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, and Edward Raff. Vqgan-clip: Open domain image generation and editing with natural language guidance. arXiv preprint arXiv:2204.08583, 2022.
[10] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
[11] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34:19822–19835, 2021.
[12] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883, 2021.
[13] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. arXiv preprint arXiv:2203.13131, 2022.
[14] Rinon Gal, Or Patashnik, Haggai Maron, Gal Chechik, and Daniel Cohen-Or. Stylegan-nada: Clip-guided domain adaptation of image generators. arXiv preprint arXiv:2108.00946, 2021.
[15] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
[16] Tobias Hinz, Stefan Heinrich, and Stefan Wermter. Semantic object accuracy for generative text-to-image synthesis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[17] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[18] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
[19] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. Advances in Neural Information Processing Systems, 34:852–863, 2021.
[20] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
[21] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110–8119, 2020.
[22] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2426–2435, 2022.
[23] Gihyun Kwon and Jong Chul Ye. Clipstyler: Image style transfer with a single text condition. arXiv preprint arXiv:2112.00374, 2021.
[24] Guillaume Lample, Neil Zeghidour, Nicolas Usunier, Antoine Bordes, Ludovic Denoyer, and Marc'Aurelio Ranzato. Fader networks: Manipulating images by sliding attributes. Advances in Neural Information Processing Systems, 30, 2017.
[25] Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, and Philip Torr. Controllable text-to-image generation. Advances in Neural Information Processing Systems, 32, 2019.
[26] Wenbo Li, Pengchuan Zhang, Lei Zhang, Qiuyuan Huang, Xiaodong He, Siwei Lyu, and Jianfeng Gao. Object-driven text-to-image synthesis via adversarial training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12174–12182, 2019.
[27] Ron Mokady, Omer Tov, Michal Yarom, Oran Lang, Inbar Mosseri, Tali Dekel, Daniel Cohen-Or, and Michal Irani. Self-distilled stylegan: Towards generation from internet photos. In Special Interest Group on Computer Graphics and Interactive Techniques Conference Proceedings, pages 1–9, 2022.
[28] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
[29] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. arXiv preprint arXiv:2103.17249, 2021.
[30] Tingting Qiao, Jing Zhang, Duanqing Xu, and Dacheng Tao. Learn, imagine and create: Text-to-image generation from prior knowledge. Advances in Neural Information Processing Systems, 32, 2019.
[31] Tingting Qiao, Jing Zhang, Duanqing Xu, and Dacheng Tao. Mirrorgan: Learning text-to-image generation by redescription. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1505–1514, 2019.
[32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
[33] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
[34] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
[35] Daniel Roich, Ron Mokady, Amit H. Bermano, and Daniel Cohen-Or. Pivotal tuning for latent-based editing of real images. ACM Transactions on Graphics (TOG), 2022.
[36] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
[37] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[38] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205...
[39] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
[40] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2020.
[41] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
[42] Ming Tao, Hao Tang, Songsong Wu, Nicu Sebe, Xiao-Yuan Jing, Fei Wu, and Bingkun Bao. Df-gan: Deep fusion generative adversarial networks for text-to-image synthesis. arXiv preprint arXiv:2008.05865, 2020.
[43] Omer Tov, Yuval Alaluf, Yotam Nitzan, Or Patashnik, and Daniel Cohen-Or. Designing an encoder for stylegan image manipulation. arXiv preprint arXiv:2102.02766, 2021.
[44] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017.
[45] Tengfei Wang, Yong Zhang, Yanbo Fan, Jue Wang, and Qifeng Chen. High-fidelity gan inversion for image attribute editing. arXiv preprint arXiv:2109.06590, 2021.
[46] Weihao Xia, Yujiu Yang, Jing-Hao Xue, and Baoyuan Wu. Tedigan: Text-guided diverse face image generation and manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2256–2265, 2021.
[47] Weihao Xia, Yulun Zhang, Yujiu Yang, Jing-Hao Xue, Bolei Zhou, and Ming-Hsuan Yang. Gan inversion: A survey, 2021.
[48] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022.
[49] Zizhao Zhang, Yuanpu Xie, and Lin Yang. Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6199–6208, 2018.
[50] Jiapeng Zhu, Yujun Shen, Deli Zhao, and Bolei Zhou. In-domain gan inversion for real image editing. arXiv preprint arXiv:2004.00049, 2020.
[51] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.