pith. machine review for the scientific record.

arxiv: 2605.14461 · v1 · submitted 2026-05-14 · 💻 cs.CV

Recognition: no theorem link

ClickRemoval: An Interactive Open-Source Tool for Object Removal in Diffusion Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords object removal · diffusion models · interactive editing · self-attention modulation · Stable Diffusion · image inpainting · user clicks · denoising

The pith

Clicks alone let users remove objects from images in pretrained diffusion models

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ClickRemoval as an interactive tool that removes objects from images using only user clicks on a pretrained Stable Diffusion model. It localizes the target through those clicks and restores the background by modulating self-attention maps during the denoising process. This setup requires no additional training, no hand-drawn masks, and no text prompts. A sympathetic reader would care because it lowers the barrier for precise editing in complex scenes where manual masking is error-prone. Experiments report competitive scores on quantitative metrics and positive results in user studies.

Core claim

ClickRemoval localizes target objects and restores the background through self-attention modulation during denoising without additional training, hand-drawn masks, or text descriptions, achieving competitive results across quantitative metrics and user studies.

What carries the argument

Self-attention modulation during denoising in a pretrained Stable Diffusion model, driven by user click points for object localization
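The paper does not publish pseudocode for this step (the referee report below flags exactly that gap), so the following is only a minimal sketch of what click-driven self-attention modulation could look like. The function names (click_to_mask, modulate_self_attention), the Gaussian rasterization of clicks, and the additive logit bias are illustrative assumptions, not the authors' released implementation; the modules named in Figure 1 (M2N2, SGAR, SGAS) presumably realize a richer version of this mapping.

```python
# Illustrative sketch only; see the hedges in the paragraph above.
import torch

def click_to_mask(clicks, latent_hw, radius=2.0):
    """Rasterize click coordinates (already converted to latent-grid units)
    into a soft per-token weight in [0, 1] marking the region to remove."""
    h, w = latent_hw
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    mask = torch.zeros(h, w)
    for cy, cx in clicks:
        dist2 = (ys - cy).float() ** 2 + (xs - cx).float() ** 2
        mask = torch.maximum(mask, torch.exp(-dist2 / (2 * radius ** 2)))
    return mask.flatten()          # shape (h*w,), one weight per spatial token

def modulate_self_attention(attn_logits, target_mask, suppression=-10.0):
    """Add a large negative bias on keys inside the clicked region, so
    background queries stop copying features from the object and the
    diffusion prior fills the area in during denoising.
    attn_logits: (heads, queries, keys) pre-softmax self-attention scores."""
    bias = suppression * target_mask           # (keys,)
    return attn_logits + bias.view(1, 1, -1)   # broadcast over heads, queries

# Toy usage at a 32x32 latent resolution with random logits.
clicks = [(16, 10), (17, 12)]                  # hypothetical user clicks
mask = click_to_mask(clicks, (32, 32))
logits = torch.randn(8, 32 * 32, 32 * 32)
weights = torch.softmax(modulate_self_attention(logits, mask), dim=-1)
```

The sketch only shows the general mechanism: a click becomes a spatial token mask, and that mask biases self-attention at every denoising step the method chooses to modulate.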

If this is right

  • Non-expert users gain precise control over object removal without needing to draw masks or write prompts.
  • Background completion draws directly on the diffusion model's learned priors rather than external inpainting networks.
  • The method runs on existing Stable Diffusion checkpoints with no fine-tuning required.
  • Quantitative metrics and user studies both place the results on par with mask-based and text-based alternatives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same click-to-modulation pattern could be tested on video frames to remove moving objects without per-frame masks.
  • Combining the clicks with simple text hints might handle cases where localization remains ambiguous.

Load-bearing premise

User clicks alone supply enough signal for accurate object localization and natural background restoration inside a fixed pretrained diffusion model.

What would settle it

A test set of images with ambiguous or overlapping objects where a single click either removes the wrong region or leaves visible seams would falsify the claim.
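If one wanted to operationalize that test, a rough harness could look like the sketch below. Everything in it is hypothetical: the ground-truth object masks, the change-detection threshold, the 0.5 IoU cutoff, and the seam tolerance are illustrative choices, not values taken from the paper.

```python
# Rough, hypothetical falsification harness; thresholds are illustrative.
import numpy as np
from scipy.ndimage import binary_dilation

# Assumed inputs: original and edited images as (H, W, 3) uint8 arrays,
# target_mask as a boolean (H, W) ground-truth mask of the intended object.

def removed_wrong_region(original, edited, target_mask, iou_threshold=0.5):
    """Failure if the pixels the edit actually changed overlap the intended
    object too little, i.e. a single click removed the wrong region."""
    changed = np.abs(edited.astype(float) - original.astype(float)).max(axis=-1) > 8
    inter = np.logical_and(changed, target_mask).sum()
    union = np.logical_or(changed, target_mask).sum()
    return (inter / union if union else 0.0) < iou_threshold

def visible_seam(original, edited, target_mask, band=3, tol=12.0):
    """Crude seam check: a large mean intensity change in a thin band just
    outside the object mask suggests a halo or boundary left by the edit."""
    ring = binary_dilation(target_mask, iterations=band) & ~target_mask
    diff = np.abs(edited.astype(float) - original.astype(float)).mean(axis=-1)
    return diff[ring].mean() > tol
```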

Figures

Figures reproduced from arXiv: 2605.14461 by Ledun Zhang, Xinying Yao, Xufei Zhuang, and Yatu Ji.

Figure 1: Overview of ClickRemoval. M2N2 converts user clicks into semantic maps; SGAR and SGAS redirect self-attention …
Figure 2: Qualitative comparison with baseline methods. Green points indicate positive clicks for removal, and red points …
Figure 3: Ablation comparison of different model variants.
Original abstract

Existing object removal tools often rely on manual masks or text prompts, making precise removal difficult for non-expert users in complex scenes and often leading to incomplete removal or unnatural background completion. To address this issue, we present ClickRemoval, an open-source interactive object removal tool built on pretrained Stable Diffusion models and driven solely by user clicks. Without additional training, hand-drawn masks, or text descriptions, ClickRemoval localizes target objects and restores the background through self-attention modulation during denoising. Experiments show that ClickRemoval achieves competitive results across quantitative metrics and user studies. We release a complete software package at https://github.com/zld-make/ClickRemoval under the Apache-2.0 license.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents ClickRemoval, an interactive open-source tool for object removal in images that operates on pretrained Stable Diffusion models. It localizes target objects and restores backgrounds solely via user clicks that modulate self-attention during the denoising process, without additional training, hand-drawn masks, or text prompts. The authors report that the method achieves competitive results on quantitative metrics and user studies.

Significance. If the central claims hold, the work offers a practical advance in making diffusion-based image editing accessible to non-experts through minimal input. The open-source release under Apache-2.0 is a clear strength that supports reproducibility and community use.

major comments (2)
  1. [§3] §3 (Method), self-attention modulation description: the mapping from a single click to attention modulation is presented at a high level without pseudocode, exact query/key selection rules, or ablation on how modulation is applied across denoising timesteps; this is load-bearing because the skeptic concern about bleed into similar distractors or boundary ambiguity cannot be assessed without these details.
  2. [§4] §4 (Experiments), quantitative results: the abstract and results section assert competitive metrics but supply no numerical values, specific baselines (e.g., mask-based inpainting or text-prompt methods), dataset size, or error analysis; without these the claim cannot be verified against failure modes in complex scenes.
minor comments (2)
  1. [§4.2] The user-study protocol (participant count, task instructions, statistical tests) should be expanded for clarity and replicability.
  2. [Figures] Figure captions and the GitHub link could include example click placements and failure cases to illustrate the method's operating range.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper accordingly to strengthen the presentation of the method and results.

Point-by-point responses
  1. Referee: [§3] §3 (Method), self-attention modulation description: the mapping from a single click to attention modulation is presented at a high level without pseudocode, exact query/key selection rules, or ablation on how modulation is applied across denoising timesteps; this is load-bearing because the skeptic concern about bleed into similar distractors or boundary ambiguity cannot be assessed without these details.

    Authors: We agree that the self-attention modulation in §3 is described at a high level. In the revised manuscript we will insert pseudocode for the click-to-modulation mapping, state the precise query/key selection rules, and add an ablation on modulation strength across denoising timesteps. These additions will allow readers to evaluate potential issues such as bleed into distractors or boundary ambiguity. revision: yes

  2. Referee: [§4] §4 (Experiments), quantitative results: the abstract and results section assert competitive metrics but supply no numerical values, specific baselines (e.g., mask-based inpainting or text-prompt methods), dataset size, or error analysis; without these the claim cannot be verified against failure modes in complex scenes.

    Authors: We acknowledge that the current results section lacks explicit numerical values, named baselines, dataset size, and error analysis. The revised version will report concrete metrics, include comparisons to mask-based inpainting and text-prompt methods, state the evaluation dataset size, and provide error analysis of failure cases in complex scenes. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method applies pretrained models without derivations or self-referential predictions

full rationale

The paper describes an interactive tool for object removal using user clicks to modulate self-attention in a pretrained Stable Diffusion model during denoising. No equations, parameter fitting, or new derivations are presented that could reduce to inputs by construction. Claims rest on application of existing diffusion techniques and empirical results from user studies and metrics, with no self-citation chains or ansatzes that load-bear the central result. This is a standard low-circularity engineering/application paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that self-attention modulation guided by clicks can perform object localization and natural background restoration inside pretrained diffusion models; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption: Self-attention mechanisms in pretrained Stable Diffusion models can be modulated by click inputs to localize and remove objects while completing backgrounds naturally.
    This is the core mechanism invoked to achieve removal without masks or prompts.

pith-pipeline@v0.9.0 · 5416 in / 1248 out tokens · 46859 ms · 2026-05-15T02:44:52.808238+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages

  1. [1]

    Aditya Chandrasekar, Goirik Chakrabarty, Jai Bardhan, Ramya Hebbalaguppe, and Prathosh AP. 2024. Remove: A reference-free metric for object erasure. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7901–7910

  2. [2]

    Nima Fathi, Amar Kumar, and Tal Arbel. 2025. Aura: A multi-modal medical agent for understanding, reasoning and annotation. In International Workshop on Agentic AI for Medicine. 105–114

  3. [3]

    Tariq Berrada Ifriqi, Adriana Romero-Soriano, Michal Drozdzal, Jakob Verbeek, and Karteek Alahari. 2025. Entropy Rectifying Guidance for Diffusion and Flow Models. In NeurIPS 2025 (Thirty-ninth Conference on Neural Information Processing Systems)

  4. [4]

    Xuan Ju, Xian Liu, Xintao Wang, Yuxuan Bian, Ying Shan, and Qiang Xu. 2024. BrushNet: A plug-and-play image inpainting model with decomposed dual-branch diffusion. In European Conference on Computer Vision. 150–168

  5. [5]

    Markus Karmann and Onay Urfalioglu. 2025. Repurposing stable diffusion attention for training-free unsupervised interactive segmentation. In Proceedings of the Computer Vision and Pattern Recognition Conference. 24518–24528

  6. [6]

    Changsuk Oh, Dongseok Shim, Taekbeom Lee, and H Jin Kim. 2024. Object Remover Performance Evaluation Methods Using Classwise Object Removal Images. IEEE Sensors Letters 8, 6 (2024), 1–4

  7. [7]

    Yusu Qian, Eli Bocek-Rivele, Liangchen Song, Jialing Tong, Yinfei Yang, Jiasen Lu, Wenze Hu, and Zhe Gan. 2025. Pico-Banana-400K: A large-scale dataset for text-guided image editing. arXiv preprint arXiv:2510.19808 (2025)

  8. [8]

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10684–10695

  9. [9]

    Wenhao Sun, Xue-Mei Dong, Benlei Cui, and Jingqun Tang. 2025. Attentive Eraser: Unleashing diffusion model's object removal potential via self-attention redirection guidance. In Proceedings of the AAAI Conference on Artificial Intelligence. 20734–20742

  10. [10]

    Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. 2022. Resolution-robust large mask inpainting with Fourier convolutions. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2149–2159

  11. [11]

    Shaoan Xie, Zhifei Zhang, Zhe Lin, Tobias Hinz, and Kun Zhang. 2023. SmartBrush: Text and shape guided object inpainting with diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 22428–22437

  12. [12]

    Ziyang Xu, Kangsheng Duan, Xiaolei Shen, Zhifeng Ding, Wenyu Liu, Xiaohu Ruan, Xiaoxin Chen, and Xinggang Wang. 2025. PixelHacker: Image inpainting with structural and semantic consistency. arXiv preprint arXiv:2504.20438 (2025)

  13. [13]

    Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. 2018. Generative image inpainting with contextual attention. InProceedings of the IEEE conference on computer vision and pattern recognition. 5505–5514

  14. [14]

    Tao Yu, Runseng Feng, Ruoyu Feng, Jinming Liu, Xin Jin, Wenjun Zeng, and Zhibo Chen. 2023. Inpaint Anything: Segment Anything meets image inpainting. arXiv preprint arXiv:2304.06790 (2023)

  15. [15]

    Junhao Zhuang, Yanhong Zeng, Wenran Liu, Chun Yuan, and Kai Chen. 2024. A task is worth one word: Learning with task prompts for high-quality versatile image inpainting. In European Conference on Computer Vision. 195–211