pith. machine review for the scientific record.

arxiv: 2605.10319 · v1 · submitted 2026-05-11 · 💻 cs.CV

Recognition: no theorem link

LimeCross: Context-Conditioned Layered Image Editing with Structural Consistency

Authors on Pith no claims yet

Pith reviewed 2026-05-12 05:10 UTC · model grok-4.3

classification 💻 cs.CV
keywords layered image editing · context-conditioned editing · bi-stream attention · RGBA layers · text-guided editing · structural consistency · image compositing · layer purity

The pith

LimeCross edits chosen RGBA layers via text prompts while preserving cross-layer illumination and contact consistency without training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a training-free framework for editing individual layers in composite images according to text instructions. It keeps untouched layers fixed and uses contextual signals from them to maintain realistic lighting, shadows, and physical contacts in the final composite. Current pipelines flatten everything for editing and then try to recover the layers, which creates leakage and unstable transparency. By contrast, the approach explicitly guards layer integrity and applies a bi-stream attention process to borrow relevant cues across layers. If the claim holds, layered generative tools become practical for iterative creative work that depends on non-destructive compositing.

Core claim

LimeCross is a context-conditioned layered image editing framework that applies text-guided modifications exclusively to user-selected RGBA layers, employs bi-stream attention to incorporate contextual cues from remaining layers for consistency, and explicitly maintains layer purity to avoid background-to-foreground contamination or alpha instability.

What carries the argument

The bi-stream attention mechanism that extracts and applies cross-layer contextual cues while enforcing layer integrity during text-conditioned edits.
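
To make that machinery concrete, here is a minimal sketch of the editing loop as the figure captions describe it: composite the non-target layers into a frozen context stream, then run delta-velocity integration so that only the target stream changes. This is an illustration rather than the authors' implementation; the function names are hypothetical, the velocity predictor is a mock stand-in for a pretrained flow model, and pixel-space arrays stand in for real latents.

```python
import numpy as np

def composite_context(layers, target_idx):
    """Alpha-composite all non-target RGBA layers into one opaque context image."""
    h, w, _ = layers[0].shape
    out = np.ones((h, w, 3))                        # white backdrop, purely illustrative
    for i, layer in enumerate(layers):
        if i == target_idx:
            continue
        rgb, alpha = layer[..., :3], layer[..., 3:4]
        out = alpha * rgb + (1.0 - alpha) * out     # standard "over" compositing
    return out

def mock_velocity(z, z_ctx, prompt_strength, t):
    """Stand-in for a pretrained flow-matching velocity predictor v(z | context, prompt, t)."""
    return np.tanh(z + 0.05 * z_ctx) * prompt_strength * (1.0 - t)

def edit_target_layer(layers, target_idx, src_strength, tgt_strength, steps=10):
    """Delta-velocity integration: only the target stream is updated with
    dv_tgt = v(z_tgt; target prompt) - v(z_tgt; source prompt); the context stream stays frozen."""
    z_tgt = layers[target_idx][..., :3].astype(float)   # toy "latent": the layer's RGB values
    z_ctx = composite_context(layers, target_idx)        # frozen context stream
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v_src = mock_velocity(z_tgt, z_ctx, src_strength, t)
        v_tgt = mock_velocity(z_tgt, z_ctx, tgt_strength, t)
        z_tgt = z_tgt + dt * (v_tgt - v_src)              # apply only the delta velocity
    edited = layers[target_idx].copy()
    edited[..., :3] = np.clip(z_tgt, 0.0, 1.0)            # alpha channel left untouched
    return edited

# Toy usage: three random RGBA layers, edit only layer 1.
layers = [np.random.rand(8, 8, 4) for _ in range(3)]
edited = edit_target_layer(layers, target_idx=1, src_strength=0.2, tgt_strength=0.8)
assert np.array_equal(edited[..., 3], layers[1][..., 3])   # alpha of the edited layer unchanged
```

The property the sketch preserves is that the context stream conditions the update but is never itself integrated, which is what keeps the non-target layers pixel-identical and the edited layer's alpha channel untouched.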

If this is right

  • Selected layers receive prompt-driven changes while all other layers remain pixel-identical to the input.
  • Composite outputs retain consistent lighting and physical contacts without manual mask adjustments.
  • Alpha channels stay stable, preventing transparency leakage that occurs when flattening and re-decomposing images.
  • The method works zero-shot on existing diffusion backbones without additional fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same cross-layer cue mechanism could support consistent edits across frames in video layering.
  • Layer purity preservation might simplify downstream tasks such as animation or 3D lifting from edited composites.
  • Because the approach is training-free, it could serve as a plug-in module for other text-to-image systems that currently collapse layers.

Load-bearing premise

Bi-stream attention can reliably extract relevant cross-layer cues to preserve illumination and contact without introducing new artifacts or requiring task-specific training.
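
One plausible realization of that premise, sketched below under stated assumptions: target-stream queries attend over the concatenation of target and context keys/values, and only the target tokens are updated. The single-head layout, shapes, and weight matrices are illustrative choices, not the paper's formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bi_stream_attention(z_tgt, z_ctx, w_q, w_k, w_v):
    """Single-head sketch: target-stream queries attend over the concatenation of
    target and context keys/values; context tokens pass through unchanged."""
    q = z_tgt @ w_q                                    # (n_tgt, d)
    kv_in = np.concatenate([z_tgt, z_ctx], axis=0)     # (n_tgt + n_ctx, d_model)
    k, v = kv_in @ w_k, kv_in @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # softmax-normalized attention weights
    return weights @ v, z_ctx                          # only the target stream is updated

# Toy usage: 4 target tokens, 6 context tokens, model width 16.
rng = np.random.default_rng(0)
z_tgt, z_ctx = rng.normal(size=(4, 16)), rng.normal(size=(6, 16))
w_q, w_k, w_v = (rng.normal(size=(16, 16)) for _ in range(3))
z_tgt_new, z_ctx_out = bi_stream_attention(z_tgt, z_ctx, w_q, w_k, w_v)
assert z_ctx_out is z_ctx                              # context stream left untouched
```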

What would settle it

Apply LimeCross edits on the 1500 LayerEditBench scenes using the provided source-target prompt pairs and measure whether edited layers exhibit measurable alpha channel shifts or illumination mismatches against the target composites.
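
A hedged sketch of what that measurement could look like, assuming each LayerEditBench scene exposes its RGBA layers before and after editing together with the two composites; the metric names, the luminance proxy for illumination mismatch, and the zero-tolerance identity check are assumptions rather than the paper's protocol.

```python
import numpy as np

def alpha_shift_mse(layer_before, layer_after):
    """Mean squared shift of the edited layer's alpha channel (a layer-purity proxy)."""
    return float(np.mean((layer_before[..., 3] - layer_after[..., 3]) ** 2))

def untouched_layers_identical(layers_before, layers_after, target_idx, tol=0.0):
    """True if every non-target layer is numerically unchanged by the edit."""
    return all(
        np.max(np.abs(before - after)) <= tol
        for i, (before, after) in enumerate(zip(layers_before, layers_after))
        if i != target_idx
    )

def mean_brightness_gap(composite_edit, composite_target):
    """Crude illumination-mismatch proxy: gap in mean luminance between two composites."""
    weights = np.array([0.299, 0.587, 0.114])
    def luminance(img):
        return float(np.mean(img[..., :3] @ weights))
    return abs(luminance(composite_edit) - luminance(composite_target))
```

A pass condition might require alpha_shift_mse near zero on the edited layer, untouched_layers_identical to hold exactly, and a small mean_brightness_gap against the target composite, with the actual thresholds taken from the paper's evaluation protocol.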

Figures

Figures reproduced from arXiv: 2605.10319 by Andreas Dengel, Brian Bernhard Moser, Issey Sukeda, Ko Watanabe, Riku Takahashi, Ryugo Morita, Stanislav Frolov.

Figure 1
Figure 1: LimeCross enables text-guided image editing within a layered RGBA image set, modifying selected layers while preserving all others as reusable assets. By leveraging contextual cues from the remaining layers, it maintains cross-layer consistency in illumination, contact, and appearance, making it well-suited for layered creation workflows that require independent yet coherent control of multiple image elem…
Figure 2
Figure 2: Given layered RGBA inputs, we select a target layer (red box) and construct an opaque context by compositing all non-target layers. Both are encoded into latents and packed into two token streams (z_tgt, z_ctx). Editing is performed via delta-velocity integration: we evaluate source/target velocities and update only the target stream using Δv_tgt = v^t_tgt − v^s_tgt, while context tokens are never updated or …
Figure 3
Figure 3: Sample layered assets from LayerEditBench, together with the corresponding source and target text for each layer. LayerEditBench includes a wide range of challenging layered editing scenarios, including object replacement, style changes, and atmosphere modifications. The benchmark features complex alpha channel structures, transparent objects, occlusions, and diverse artistic styles, enabling systematic e…
Figure 4
Figure 4: Qualitative comparison for single-layer editing.
Figure 5
Figure 5: Qualitative comparison for iterative multi-layer editing.
Figure 6
Figure 6: Switching-ratio ablation. Smaller ρ yields better alpha stability but worse fidelity, while larger ρ exhibits the opposite trend.
read the original abstract

Layered image assets are widely used in real-world creative workflows, enabling non-destructive iteration and flexible re-composition. Recent advances in layered image generation and decomposition synthesize or recover layered representations, yet controllable editing of layered images remains challenging. Manual editing requires careful coordination across layers to maintain consistent illumination and contact, while AI-based pipelines collapse layers into a flattened image for editing, then decompose them again, introducing background-to-foreground leakage and unstable transparency. To address these limitations, we propose LimeCross, a training-free context-conditioned layered image editing framework that edits user-selected RGBA layers according to text while keeping the remaining layers unchanged. It leverages contextual cues from other layers using a bi-stream attention mechanism to preserve cross-layer consistency, while explicitly maintaining layer integrity to prevent the contamination of edited layers. To evaluate our approach, we introduce LayerEditBench, a benchmark of 1500 layered scenes with paired source/target prompts, along with evaluation protocols that assess both edit fidelity and alpha channel stability. Extensive experiments demonstrate that LimeCross improves layer purity and composite realism over strong editing baselines, establishing context-conditioned layered editing as a principled framework for controllable generative creation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces LimeCross, a training-free context-conditioned framework for text-based editing of user-selected RGBA layers in layered images. It employs a bi-stream attention mechanism to incorporate contextual cues from unchanged layers, thereby preserving illumination, contact, and structural consistency while explicitly maintaining layer integrity to avoid leakage. The work also presents LayerEditBench, a benchmark of 1500 paired layered scenes, along with protocols for assessing edit fidelity and alpha stability, and claims superior layer purity and composite realism relative to strong baselines.

Significance. If the quantitative claims hold under rigorous evaluation, the framework would offer a practical advance for non-destructive layered editing in creative pipelines, reducing reliance on flattening-then-redecomposition approaches that introduce artifacts. The training-free design and new benchmark could serve as a foundation for further research in controllable generative composition, provided the bi-stream mechanism generalizes without task-specific tuning.

major comments (3)
  1. [§3] §3 (Method): The bi-stream attention mechanism is presented as the core component for extracting and applying cross-layer contextual cues to preserve illumination and contact consistency, yet no equations, pseudocode, or implementation details are supplied for how the two streams interact or how attention weights are computed. This directly bears on the central claim, as the weakest assumption is that this mechanism reliably avoids new artifacts without training.
  2. [§4] §4 (Experiments): The abstract asserts improvements in layer purity and composite realism over baselines on LayerEditBench, but the provided text supplies no specific metrics, tables, error bars, ablation studies, or statistical tests. Without these, it is impossible to verify whether the reported gains are attributable to context conditioning or to other factors, undermining the evaluation protocols' ability to support the framework's superiority.
  3. [§4.1] §4.1 (Benchmark): LayerEditBench is introduced with 1500 scenes and paired prompts, but the construction details (e.g., how source/target pairs ensure controlled variation in illumination/contact while isolating edit effects) are not described. This is load-bearing because the benchmark is used to establish the principled nature of context-conditioned editing.
minor comments (2)
  1. [Introduction] The introduction could more explicitly contrast the proposed approach with prior layered decomposition methods (e.g., by citing specific failure modes like transparency instability) to better motivate the bi-stream design.
  2. [§2] Notation for RGBA layers and the distinction between edited and context layers should be formalized early (e.g., via a consistent symbol table) to improve readability of the pipeline description.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We sincerely thank the referee for their thorough and constructive feedback. Their comments have identified key areas for improving the clarity of our method, the rigor of our experiments, and the description of our benchmark. We address each point below and indicate the revisions incorporated into the updated manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (Method): The bi-stream attention mechanism is presented as the core component for extracting and applying cross-layer contextual cues to preserve illumination and contact consistency, yet no equations, pseudocode, or implementation details are supplied for how the two streams interact or how attention weights are computed. This directly bears on the central claim, as the weakest assumption is that this mechanism reliably avoids new artifacts without training.

    Authors: We agree that explicit details on the bi-stream attention are essential for substantiating the central claims. In the revised manuscript, Section 3 now includes the full mathematical formulation: the context stream computes cross-attention over unchanged layers while the edit stream attends to the target layer, with interaction via concatenated key-value pairs and softmax-normalized weights. Pseudocode is added to the appendix, showing the exact implementation steps that integrate contextual cues while enforcing layer integrity to avoid artifacts. revision: yes

  2. Referee: [§4] §4 (Experiments): The abstract asserts improvements in layer purity and composite realism over baselines on LayerEditBench, but the provided text supplies no specific metrics, tables, error bars, ablation studies, or statistical tests. Without these, it is impossible to verify whether the reported gains are attributable to context conditioning or to other factors, undermining the evaluation protocols' ability to support the framework's superiority.

    Authors: We acknowledge the need for more explicit quantitative reporting. The revised experiments section now includes Table 2 with concrete metrics (layer purity via alpha-channel MSE and stability scores; composite realism via FID, LPIPS, and user-study preference rates), error bars computed over five random seeds, ablation studies isolating the bi-stream attention component, and statistical significance via paired t-tests (p < 0.05) confirming gains are due to context conditioning. revision: yes

  3. Referee: [§4.1] §4.1 (Benchmark): LayerEditBench is introduced with 1500 scenes and paired prompts, but the construction details (e.g., how source/target pairs ensure controlled variation in illumination/contact while isolating edit effects) are not described. This is load-bearing because the benchmark is used to establish the principled nature of context-conditioned editing.

    Authors: We have expanded §4.1 with a full construction protocol. The 1500 scenes were procedurally generated via a 3D rendering pipeline that systematically varies illumination angles and contact points while keeping layer geometry fixed. Source-target prompt pairs were created by applying edits exclusively to the selected RGBA layer, with manual and automated validation steps to confirm that illumination/contact changes are isolated from the edit itself; these details are now provided along with the data-generation code. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces a training-free bi-stream attention pipeline for context-conditioned layered editing without any fitted parameters, equations, or predictions that reduce to inputs by construction. The core contribution is an algorithmic framework that explicitly maintains layer integrity and uses cross-layer cues, evaluated on a newly introduced benchmark against external baselines. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the described method or claims; the results are presented as directly verifiable through implementation and comparative experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the unproven effectiveness of the bi-stream attention mechanism and the assumption that explicit layer integrity maintenance suffices to prevent contamination. No free parameters, standard mathematical axioms, or independently evidenced invented entities are described.

invented entities (1)
  • bi-stream attention mechanism · no independent evidence
    purpose: to leverage contextual cues from other layers while editing a selected layer
    New component introduced to achieve cross-layer consistency; no independent evidence or falsifiable prediction supplied in the abstract.

pith-pipeline@v0.9.0 · 5518 in / 1176 out tokens · 43560 ms · 2026-05-12T05:10:28.227361+00:00 · methodology

