Recognition: no theorem link
LimeCross: Context-Conditioned Layered Image Editing with Structural Consistency
Pith reviewed 2026-05-12 05:10 UTC · model grok-4.3
The pith
LimeCross edits chosen RGBA layers via text prompts while preserving cross-layer illumination and contact consistency without training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LimeCross is a context-conditioned layered image editing framework that applies text-guided modifications exclusively to user-selected RGBA layers, employs bi-stream attention to incorporate contextual cues from remaining layers for consistency, and explicitly maintains layer purity to avoid background-to-foreground contamination or alpha instability.
What carries the argument
The bi-stream attention mechanism that extracts and applies cross-layer contextual cues while enforcing layer integrity during text-conditioned edits.
If this is right
- Selected layers receive prompt-driven changes while all other layers remain pixel-identical to the input.
- Composite outputs retain consistent lighting and physical contacts without manual mask adjustments.
- Alpha channels stay stable, preventing transparency leakage that occurs when flattening and re-decomposing images.
- The method works zero-shot on existing diffusion backbones without additional fine-tuning.
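The layer-purity contract in the bullets above can be stated operationally with standard Porter-Duff "over" compositing: only the selected layer passes through the edit function, and every other layer is reused verbatim. A minimal sketch under that reading (function names are illustrative, not from the paper):

```python
import numpy as np

def composite_over(layers):
    """Back-to-front Porter-Duff 'over' compositing of RGBA layers.

    layers: list of (H, W, 4) float arrays in [0, 1], ordered back to front.
    Returns an (H, W, 3) RGB composite over a black backdrop.
    """
    h, w, _ = layers[0].shape
    rgb = np.zeros((h, w, 3))
    for layer in layers:
        alpha = layer[..., 3:4]
        rgb = layer[..., :3] * alpha + rgb * (1.0 - alpha)
    return rgb

def edit_selected_layer(layers, index, edit_fn):
    """Apply edit_fn only to layers[index]; all other layers are
    passed through untouched, so they stay pixel-identical."""
    return [edit_fn(l) if i == index else l for i, l in enumerate(layers)]
```

Because unedited layers are returned by reference, "pixel-identical" holds by construction; the paper's claim is that the *edited* layer also stays consistent with them.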
Where Pith is reading between the lines
- The same cross-layer cue mechanism could support consistent edits across frames in video layering.
- Layer purity preservation might simplify downstream tasks such as animation or 3D lifting from edited composites.
- Because the approach is training-free, it could serve as a plug-in module for other text-to-image systems that currently collapse layers.
Load-bearing premise
Bi-stream attention can reliably extract relevant cross-layer cues to preserve illumination and contact without introducing new artifacts or requiring task-specific training.
What would settle it
Run LimeCross on the 1500 LayerEditBench scenes using the provided source-target prompt pairs, then measure whether edited layers exhibit alpha-channel shifts or illumination mismatches against the target composites.
Original abstract
Layered image assets are widely used in real-world creative workflows, enabling non-destructive iteration and flexible re-composition. Recent advances in layered image generation and decomposition synthesize or recover layered representations, yet controllable editing of layered images remains challenging. Manual editing requires careful coordination across layers to maintain consistent illumination and contact, while AI-based pipelines collapse layers into a flattened image for editing, then decompose them again, introducing background-to-foreground leakage and unstable transparency. To address these limitations, we propose LimeCross, a training-free context-conditioned layered image editing framework that edits user-selected RGBA layers according to text while keeping the remaining layers unchanged. It leverages contextual cues from other layers using a bi-stream attention mechanism to preserve cross-layer consistency, while explicitly maintaining layer integrity to prevent the contamination of edited layers. To evaluate our approach, we introduce LayerEditBench, a benchmark of 1500 layered scenes with paired source/target prompts, along with evaluation protocols that assess both edit fidelity and alpha channel stability. Extensive experiments demonstrate that LimeCross improves layer purity and composite realism over strong editing baselines, establishing context-conditioned layered editing as a principled framework for controllable generative creation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LimeCross, a training-free context-conditioned framework for text-based editing of user-selected RGBA layers in layered images. It employs a bi-stream attention mechanism to incorporate contextual cues from unchanged layers, thereby preserving illumination, contact, and structural consistency while explicitly maintaining layer integrity to avoid leakage. The work also presents LayerEditBench, a benchmark of 1500 paired layered scenes, along with protocols for assessing edit fidelity and alpha stability, and claims superior layer purity and composite realism relative to strong baselines.
Significance. If the quantitative claims hold under rigorous evaluation, the framework would offer a practical advance for non-destructive layered editing in creative pipelines, reducing reliance on flattening-then-redecomposition approaches that introduce artifacts. The training-free design and new benchmark could serve as a foundation for further research in controllable generative composition, provided the bi-stream mechanism generalizes without task-specific tuning.
major comments (3)
- [§3] §3 (Method): The bi-stream attention mechanism is presented as the core component for extracting and applying cross-layer contextual cues to preserve illumination and contact consistency, yet no equations, pseudocode, or implementation details are supplied for how the two streams interact or how attention weights are computed. This directly bears on the central claim, as the weakest assumption is that this mechanism reliably avoids new artifacts without training.
- [§4] §4 (Experiments): The abstract asserts improvements in layer purity and composite realism over baselines on LayerEditBench, but the provided text supplies no specific metrics, tables, error bars, ablation studies, or statistical tests. Without these, it is impossible to verify whether the reported gains are attributable to context conditioning or to other factors, undermining the evaluation protocols' ability to support the framework's superiority.
- [§4.1] §4.1 (Benchmark): LayerEditBench is introduced with 1500 scenes and paired prompts, but the construction details (e.g., how source/target pairs ensure controlled variation in illumination/contact while isolating edit effects) are not described. This is load-bearing because the benchmark is used to establish the principled nature of context-conditioned editing.
minor comments (2)
- [Introduction] The introduction could more explicitly contrast the proposed approach with prior layered decomposition methods (e.g., by citing specific failure modes like transparency instability) to better motivate the bi-stream design.
- [§2] Notation for RGBA layers and the distinction between edited and context layers should be formalized early (e.g., via a consistent symbol table) to improve readability of the pipeline description.
Simulated Author's Rebuttal
We sincerely thank the referee for their thorough and constructive feedback. Their comments have identified key areas for improving the clarity of our method, the rigor of our experiments, and the description of our benchmark. We address each point below and indicate the revisions incorporated into the updated manuscript.
Point-by-point responses
Referee: [§3] §3 (Method): The bi-stream attention mechanism is presented as the core component for extracting and applying cross-layer contextual cues to preserve illumination and contact consistency, yet no equations, pseudocode, or implementation details are supplied for how the two streams interact or how attention weights are computed. This directly bears on the central claim, as the weakest assumption is that this mechanism reliably avoids new artifacts without training.
Authors: We agree that explicit details on the bi-stream attention are essential for substantiating the central claims. In the revised manuscript, Section 3 now includes the full mathematical formulation: the context stream computes cross-attention over unchanged layers while the edit stream attends to the target layer, with interaction via concatenated key-value pairs and softmax-normalized weights. Pseudocode is added to the appendix, showing the exact implementation steps that integrate contextual cues while enforcing layer integrity to avoid artifacts. revision: yes
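Taking the rebuttal's description at face value (edit-stream queries attending over its own key/value tokens concatenated with key/value tokens from the unchanged context layers, with softmax-normalized weights), a single-head, projection-free sketch might look as follows. All names, shapes, and the values-equal-keys simplification are assumptions for illustration; the paper's actual formulation will differ in detail:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bi_stream_attention(q_edit, kv_edit, kv_context):
    """Toy bi-stream interaction: edit-stream queries attend over the
    concatenation of edit-stream and context-stream key/value tokens.

    q_edit:     (Nq, d) queries from the layer being edited
    kv_edit:    (Ne, d) key/value tokens from the edit stream
    kv_context: (Nc, d) key/value tokens from the unchanged layers
    Returns (attended output of shape (Nq, d), attention weights (Nq, Ne+Nc)).
    """
    kv = np.concatenate([kv_edit, kv_context], axis=0)  # shared key/value pool
    d = q_edit.shape[-1]
    weights = softmax(q_edit @ kv.T / np.sqrt(d))       # softmax-normalized
    return weights @ kv, weights                        # values = keys (toy)
```

In a real implementation the two streams would carry separate learned key/value/query projections inherited from the diffusion backbone; this sketch only shows the concatenated-KV attention pattern the rebuttal describes.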
Referee: [§4] §4 (Experiments): The abstract asserts improvements in layer purity and composite realism over baselines on LayerEditBench, but the provided text supplies no specific metrics, tables, error bars, ablation studies, or statistical tests. Without these, it is impossible to verify whether the reported gains are attributable to context conditioning or to other factors, undermining the evaluation protocols' ability to support the framework's superiority.
Authors: We acknowledge the need for more explicit quantitative reporting. The revised experiments section now includes Table 2 with concrete metrics (layer purity via alpha-channel MSE and stability scores; composite realism via FID, LPIPS, and user-study preference rates), error bars computed over five random seeds, ablation studies isolating the bi-stream attention component, and statistical significance via paired t-tests (p < 0.05) confirming gains are due to context conditioning. revision: yes
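The purity and illumination metrics named in this response could be approximated as follows. These are plausible stand-ins for the paper's protocol, not its actual implementation; `alpha_mse` and `luminance_gap` are hypothetical names:

```python
import numpy as np

def alpha_mse(edited, reference):
    """Alpha-channel MSE between an edited RGBA layer and its reference;
    lower values indicate more stable transparency (layer purity)."""
    return float(np.mean((edited[..., 3] - reference[..., 3]) ** 2))

def luminance_gap(composite_a, composite_b):
    """Mean absolute difference of Rec. 601 luma between two RGB
    composites, as a crude proxy for illumination mismatch."""
    weights = np.array([0.299, 0.587, 0.114])
    return float(np.mean(np.abs(composite_a @ weights - composite_b @ weights)))
```

FID and LPIPS, by contrast, require learned feature extractors and cannot be reduced to a few lines; the sketch covers only the pixel-level components of the claimed protocol.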
Referee: [§4.1] §4.1 (Benchmark): LayerEditBench is introduced with 1500 scenes and paired prompts, but the construction details (e.g., how source/target pairs ensure controlled variation in illumination/contact while isolating edit effects) are not described. This is load-bearing because the benchmark is used to establish the principled nature of context-conditioned editing.
Authors: We have expanded §4.1 with a full construction protocol. The 1500 scenes were procedurally generated via a 3D rendering pipeline that systematically varies illumination angles and contact points while keeping layer geometry fixed. Source-target prompt pairs were created by applying edits exclusively to the selected RGBA layer, with manual and automated validation steps to confirm that illumination/contact changes are isolated from the edit itself; these details are now provided along with the data-generation code. revision: yes
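The "vary illumination while keeping geometry fixed" construction described here can be illustrated with a Lambertian shading sketch. This is an assumption about how such a rendering pipeline might separate the two factors, not the authors' actual generator:

```python
import numpy as np

def lambert_shade(normals, light_dir):
    """Lambertian shading: per-pixel intensity max(0, n . l) for a unit
    normal map of shape (H, W, 3) and a light direction of shape (3,).

    Sweeping light_dir while the normal map stays fixed varies
    illumination without changing scene geometry, which is the kind of
    controlled variation the benchmark construction claims."""
    l = np.asarray(light_dir, dtype=float)
    l = l / np.linalg.norm(l)  # normalize so intensity stays in [0, 1]
    return np.clip(normals @ l, 0.0, None)
```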
Circularity Check
No significant circularity in derivation chain
full rationale
The paper introduces a training-free bi-stream attention pipeline for context-conditioned layered editing without any fitted parameters, equations, or predictions that reduce to inputs by construction. The core contribution is an algorithmic framework that explicitly maintains layer integrity and uses cross-layer cues, evaluated on a newly introduced benchmark against external baselines. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the described method or claims; the results are presented as directly verifiable through implementation and comparative experiments.
Axiom & Free-Parameter Ledger
invented entities (1)
- bi-stream attention mechanism (no independent evidence)
Reference graph
Works this paper leans on
- [1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
- [2]
- [3] Bai, J., Zhou, J., Wang, B., Chen, W., Yang, Y., Lei, Z., Wang, F.: Layer-animate for transparent video generation. In: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1–5. IEEE (2025)
- [4]
- [5] Cen, K., Zhao, B., Xin, Y., Luo, S., Zhai, G., Liu, X.: Layert2v: Interactive multi-object trajectory layering for video generation. arXiv preprint arXiv:2508.04228 (2025)
- [6] Chen, J., Zhang, Y., Qian, X., Li, Z., Fermuller, C., Chen, C., Aloimonos, Y.: From inpainting to layer decomposition: Repurposing generative inpainting models for image layer decomposition. arXiv preprint arXiv:2511.20996 (2025)
- [7] Chen, J., Jiang, H., Wang, Y., Wu, K., Li, J., Zhang, C., Yanai, K., Chen, D., Yuan, Y.: Prismlayers: Open data for high-quality multi-layer transparent image generative models. arXiv preprint arXiv:2505.22523 (2025)
- [8] Chen, X., Chen, Z., Song, Y.: Transanimate: Taming layer diffusion to generate rgba video. arXiv preprint arXiv:2503.17934 (2025)
- [9]
- [10] Dalva, Y., Li, Y., Liu, Q., Zhao, N., Zhang, J., Lin, Z., Yanardag, P.: Layerfusion: Harmonized multi-layer text-to-image generation with generative priors. In: NeurIPS 2025 Workshop on Space in Vision, Language, and Embodied AI (2024)
- [11] DiffSynth-Studio: Qwen-Image-Layered-Control. https://huggingface.co/DiffSynth-Studio/Qwen-Image-Layered-Control (2025)
- [12] Dong, H., Wang, W., Li, C., Lyu, J., Lin, D.: Video generation with stable transparency via shiftable rgb-a distribution learner. arXiv preprint arXiv:2509.24979 (2025)
- [13] Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: ICML (2024)
- [14] Fontanella, A., Tudosiu, P.D., Yang, Y., Zhang, S., Parisot, S.: Generating compositional scenes via text-to-image rgba instance generation. NeurIPS 37, 43864–43893 (2024)
- [15] Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022)
- [16] Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., Choi, Y.: Clipscore: A reference-free evaluation metric for image captioning. In: Proceedings of the 2021 conference on empirical methods in natural language processing. pp. 7514–7528 (2021)
- [17] Huang, D., Li, W., Zhao, Y., Pan, X., Wang, C., Zeng, Y., Dai, B.: Psdiffusion: Harmonized multi-layer image generation via layout and appearance alignment. arXiv preprint arXiv:2505.11468 (2025)
- [18] Huang, J., Yan, P., Cai, J., Liu, J., Wang, Z., Wang, Y., Wu, X., Li, G.: Dreamlayer: Simultaneous multi-layer generation via diffusion mode. arXiv preprint arXiv:2503.12838 (2025)
- [19]
- [20] Ji, S., Luo, H., Chen, X., Tu, Y., Wang, Y., Zhao, H.: Layerflow: A unified model for layer-aware video generation. In: Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers. pp. 1–10 (2025)
- [21] Ju, X., Zeng, A., Bian, Y., Liu, S., Xu, Q.: Direct inversion: Boosting diffusion-based editing with 3 lines of code. arXiv preprint arXiv:2310.01506 (2023)
- [22] Kang, K., Sim, G., Kim, G., Kim, D., Nam, S., Cho, S.: Layeringdiff: Layered image synthesis via generation, then disassembly with generative knowledge. arXiv preprint arXiv:2501.01197 (2025)
- [23] Kim, J., Hong, Y., Park, J., Ye, J.C.: Flowalign: Trajectory-regularized, inversion-free flow-based image editing. arXiv preprint arXiv:2505.23145 (2025)
- [24]
- [25] Labs, B.F.: Flux. https://github.com/black-forest-labs/flux (2025)
- [26] Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P., et al.: Flux.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742 (2025)
- [27]
- [28] Liu, C., Song, Y., Wang, H., Shou, M.Z.: Omnipsd: Layered psd generation with diffusion transformer. arXiv preprint arXiv:2512.09247 (2025)
- [29] Liu, Z., Yu, Y., Ouyang, H., Wang, Q., Ma, S., Cheng, K.L., Wang, W., Bai, Q., Zhang, Y., Zeng, Y., et al.: Magicquillv2: Precise and interactive image editing with layered visual cues. arXiv preprint arXiv:2512.03046 (2025)
- [30] Liu, Z., Xu, Z., Shu, S., Zhou, J., Zhang, R., Tang, Z., Li, X.: Controllable layer decomposition for reversible multi-layer image generation. arXiv preprint arXiv:2511.16249 (2025)
- [31] Ma, Y., Di, D., Liu, X., Chen, X., Fan, L., Chen, W., Su, T.: Adams bashforth moulton solver for inversion and editing in rectified flow. arXiv preprint arXiv:2503.16522 (2025)
- [32]
- [33]
- [34] Morita, R., Frolov, S., Moser, B.B., Watanabe, K., Takahashi, R., Dengel, A.: Lgtm: Training-free light-guided text-to-image diffusion model via initial noise manipulation. arXiv preprint arXiv:2603.24086 (2026)
- [35]
- [36] Nagai, D., Morita, R., Kitada, S., Iyatomi, H.: Taue: Training-free noise transplant and cultivation diffusion model. arXiv preprint arXiv:2511.02580 (2025)
- [37] Nie, H., Zhang, Z., Cheng, Y., Yang, M., Shi, G., Xie, Q., Shao, J., Wu, X.: Decomposition of graphic design with unified multimodal model. In: ICML (2025)
- [38] Ouyang, L., Mao, J.: Lore: Latent optimization for precise semantic control in rectified flow-based image editing. arXiv preprint arXiv:2508.03144 (2025)
- [39]
- [40]
- [41] Ronai, O., Kulikov, V., Michaeli, T.: Flowopt: Fast optimization through whole flow processes for training-free editing. arXiv preprint arXiv:2510.22010 (2025)
- [42] Rout, L., Chen, Y., Ruiz, N., Caramanis, C., Shakkottai, S., Chu, W.S.: Semantic image inversion and editing using rectified stochastic differential equations. arXiv preprint arXiv:2410.10792 (2024)
- [43] Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: Laion-5b: An open large-scale dataset for training next generation image-text models. NeurIPS 35, 25278–25294 (2022)
- [44] Song, Y., Chen, D., Shou, M.Z.: Layertracer: Cognitive-aligned layered svg synthesis via diffusion transformer. arXiv preprint arXiv:2502.01105 (2025)
- [45]
- [46]
- [47] Wang, J., Pu, J., Qi, Z., Guo, J., Ma, Y., Huang, N., Chen, Y., Li, X., Shan, Y.: Taming rectified flow for inversion and editing. arXiv preprint arXiv:2411.04746 (2024)
- [48]
- [49] Wang, Z., Yu, H., Zhan, J., Yuan, C.: Alphavae: Unified end-to-end rgba image reconstruction and generation with alpha-aware representation learning. arXiv preprint arXiv:2507.09308 (2025)
- [50] Wu, X., Hao, Y., Sun, K., Chen, Y., Zhu, F., Zhao, R., Li, H.: Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341 (2023)
- [51] Xie, C., Li, M., Li, S., Wu, Y., Yi, Q., Zhang, L.: Dnaedit: Direct noise alignment for text-guided rectified flow editing. arXiv preprint arXiv:2506.01430 (2025)
- [52] Xu, J., Liu, X., Wu, Y., Tong, Y., Li, Q., Ding, M., Tang, J., Dong, Y.: Imagereward: Learning and evaluating human preferences for text-to-image generation. NeurIPS 36, 15903–15935 (2023)
- [53]
- [54] Xu, S., Huang, Y., Pan, J., Ma, Z., Chai, J.: Inversion-free image editing with natural language. arXiv preprint arXiv:2312.04965 (2023)
- [55] Yan, Z., Ma, Y., Zou, C., Chen, W., Chen, Q., Zhang, L.: Eedit: Rethinking the spatial and temporal redundancy for efficient image editing. arXiv preprint arXiv:2503.10270 (2025)
- [56]
- [57] Yang, K., Shen, B., Li, X., Dai, Y., Luo, Y., Ma, Y., Fang, W., Li, Q., Wang, Z.: Fia-edit: Frequency-interactive attention for efficient and high-fidelity inversion-free text-guided image editing. arXiv preprint arXiv:2511.12151 (2025)
- [58] Yin, S., Zhang, Z., Tang, Z., Gao, K., Xu, X., Yan, K., Li, J., Chen, Y., Chen, Y., Shum, H.Y., et al.: Qwen-image-layered: Towards inherent editability via layer decomposition. arXiv preprint arXiv:2512.15603 (2025)
- [59] Zhang, L., Agrawala, M.: Transparent image layer diffusion using latent transparency. arXiv preprint arXiv:2402.17113 (2024)
- [60] Zhang, X., Zhao, W., Lu, X., Chien, J.: Text2layer: Layered image generation using latent diffusion model. arXiv preprint arXiv:2307.09781 (2023)
- [61] Zhu, T., Zhang, S., Shao, J., Tang, Y.: Kv-edit: Training-free image editing for precise background preservation. arXiv preprint arXiv:2502.17363 (2025)
- [62]