pith. machine review for the scientific record

arxiv: 2605.13122 · v1 · submitted 2026-05-13 · 💻 cs.CV

Recognition: unknown

Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:39 UTC · model grok-4.3

classification 💻 cs.CV
keywords zero-shot referring image segmentation · image editing models · semantic grounding · denoising timestep · diffusion models · training-free · foreground-background separability

The pith

Instruction-based image editing models show strong foreground-background separability in their earliest internal features, enabling zero-shot referring image segmentation from a single denoising step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that these models implicitly perform language-conditioned visual semantic grounding when following edit instructions. This grounding appears as clear foreground-background separability in internal representations at the very first denoising timestep, before any visible changes occur. The authors repurpose this property into a training-free pipeline that extracts accurate pixel-level masks for referring expressions by combining attention-based spatial priors with feature-based semantic discrimination. The resulting masks outperform prior zero-shot methods on RefCOCO, RefCOCO+, and RefCOCOg while requiring only one denoising step and no full image synthesis.

Core claim

Strong foreground-background separability emerges in the internal representations of instruction-based editing models at the earliest denoising timestep, well before any visible image transformation occurs. This separability supports a training-free framework that decomposes localization into attention-based spatial priors and feature-based semantic discrimination, producing accurate segmentation masks from a single denoising step without full synthesis.
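To make the decomposition concrete, here is a minimal NumPy sketch of the two-cue readout this claim describes: a cross-attention map supplies the spatial prior, per-location features scored against the expression embedding supply the semantic cue, and a threshold binarizes the fused map. The function name, the fusion weight `alpha`, and the toy tensors are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np

def combine_priors(cross_attn, features, text_emb, alpha=0.5):
    """Fuse an attention-based spatial prior with feature-based
    semantic discrimination into one soft mask.

    cross_attn : (H, W) cross-attention map for the referred phrase
    features   : (H, W, D) internal features from the first denoising step
    text_emb   : (D,) embedding of the referring expression
    alpha      : fusion weight between the two cues (hypothetical knob)
    """
    # Spatial prior: rescale attention to [0, 1].
    span = cross_attn.max() - cross_attn.min()
    prior = (cross_attn - cross_attn.min()) / (span + 1e-8)

    # Semantic cue: cosine similarity between each spatial feature
    # and the expression embedding, rescaled to [0, 1].
    f = features / (np.linalg.norm(features, axis=-1, keepdims=True) + 1e-8)
    t = text_emb / (np.linalg.norm(text_emb) + 1e-8)
    sim = (f @ t + 1.0) / 2.0            # (H, W)

    # Convex combination of the two cues gives the soft mask.
    return alpha * prior + (1 - alpha) * sim

# Toy usage with random tensors standing in for model internals.
rng = np.random.default_rng(0)
soft = combine_priors(rng.random((32, 32)),
                      rng.standard_normal((32, 32, 64)),
                      rng.standard_normal(64))
mask = soft > 0.8   # Figure 5's ablation finds a sweet spot near tau = 0.8
```

The threshold 0.8 is taken from the paper's own ablation (Figure 5); everything else above is schematic.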

What carries the argument

Feature-space separability at the earliest denoising timestep, which supplies attention-based spatial priors for location and feature-based semantic discrimination for object identity.

If this is right

  • Accurate referring segmentation masks can be generated using only a single denoising step without full image synthesis.
  • The method achieves higher performance than existing zero-shot baselines on RefCOCO, RefCOCO+, and RefCOCOg.
  • Localization decomposes cleanly into attention-driven spatial focus and feature-driven semantic separation.
  • Pretrained editing models inherently encode language-conditioned visual grounding that can be read out directly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Semantic understanding in diffusion-based editing models forms very early in the sampling trajectory rather than only at the end.
  • The same early internal states could support other zero-shot localization tasks such as detection or referring expression comprehension.
  • Reducing computation to one timestep suggests similar early-readout strategies may apply to other generative models for efficiency gains.
  • The decomposition into spatial priors and semantic features may transfer to non-editing diffusion models without retraining.

Load-bearing premise

The feature-space separability observed at the earliest denoising timestep is sufficient to produce accurate pixel-level segmentation masks for arbitrary referring expressions without full synthesis or task-specific training.
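The premise is only testable if separability itself is measured. A simple probe, supplied here for illustration and not claimed to be the paper's criterion, is a Fisher-style ratio over first-timestep features computed against a ground-truth foreground mask: between-class distance over within-class spread, higher meaning more linearly separable.

```python
import numpy as np

def separability_score(features, fg_mask):
    """Fisher-style foreground-background separability probe.

    features : (N, D) per-token features from the earliest timestep
    fg_mask  : (N,) boolean, True for tokens inside the referred object
    """
    fg, bg = features[fg_mask], features[~fg_mask]
    mu_fg, mu_bg = fg.mean(axis=0), bg.mean(axis=0)
    between = np.sum((mu_fg - mu_bg) ** 2)                # class-mean separation
    within = fg.var(axis=0).sum() + bg.var(axis=0).sum()  # class spread
    return between / (within + 1e-8)
```

A high score at t=1 across many expressions would be the quantitative form of the premise; a score near what random features give would undercut it.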

What would settle it

If extracting masks from first-timestep features alone yields segmentation accuracy no better than chance, or no better than existing zero-shot baselines, on RefCOCO's complex referring expressions, the claim that early separability suffices would be falsified.
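In practice, that test reduces to scoring first-timestep masks with the metrics these benchmarks report. A sketch under the standard definitions (the helper name is ours): oIoU pools intersection and union counts over the whole split, while mIoU averages per-expression IoU.

```python
import numpy as np

def ris_metrics(preds, gts):
    """oIoU and mIoU over aligned lists of (H, W) boolean masks."""
    inter_total = union_total = 0
    per_sample = []
    for p, g in zip(preds, gts):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        inter_total += inter
        union_total += union
        # Convention: an empty prediction matching an empty GT counts as 1.
        per_sample.append(inter / union if union > 0 else 1.0)
    oiou = inter_total / max(union_total, 1)
    miou = float(np.mean(per_sample))
    return oiou, miou
```

Running this on masks decoded from t=1 features alone, against the published zero-shot baselines, is exactly the settling experiment described above.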

Figures

Figures reproduced from arXiv: 2605.13122 by Chang Xu, Jingxuan He, Mengyu Zheng, Xiyu Wang, Yunke Wang.

Figure 1: Overview of our framework. We repurpose pretrained instruction-based image editing models for referring image segmentation. In the attention-based region proposal module, we derive coarse spatial priors through cross-attention and affinity propagation. In the feature-based semantic localization module, the semantic separability of MMSA output features is exploited under the guidance of the proposed r…

Figure 2: Qualitative examples from the RefCOCO, RefCOCO+, and RefCOCOg datasets. "CAM" denotes Coarse Attention Map, "RAM" denotes Refined Attention Map, and "GT" denotes Ground Truth. The first row of text presents the original referring expressions, and the second row presents the corresponding transformed editing instructions.

Figure 4: Impact of timesteps and DiT blocks within the FSL module. Heatmaps show oIoU (first row) and mIoU (second row) on the RefCOCO, RefCOCO+, and RefCOCOg datasets. Each cell represents segmentation performance when extracting features from a specific block at a given timestep.

Figure 5: Impact of thresholding within the FSL module. Performance increases steadily until τ = 0.8, then degrades slightly at τ = 0.9.
original abstract

Instruction-based image editing (IIE) models have recently demonstrated strong capability in modifying specific image regions according to natural language instructions, which implicitly requires identifying where an edit should be applied. This indicates that such models inherently perform language-conditioned visual semantic grounding. In this work, we investigate whether this implicit grounding can be leveraged for zero-shot referring image segmentation (RIS), a task that requires pixel-level localization of objects described by natural language expressions. Through systematic analysis, we reveal that strong foreground-background separability emerges in the internal representations of these models at the earliest denoising timestep, well before any visible image transformation occurs. Building on this insight, we propose a training-free framework that repurposes pretrained image editing models for RIS by exploiting their intermediate representations. Our approach decomposes localization into two complementary components: attention-based spatial priors that estimate where to focus, and feature-based semantic discrimination that determines what to segment. By leveraging feature-space separability, the framework produces accurate segmentation masks using only a single denoising step, without requiring full image synthesis. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg demonstrate that our method achieves superior performance over existing zero-shot baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that instruction-based image editing (IIE) models implicitly perform language-conditioned visual semantic grounding, which can be leveraged for zero-shot referring image segmentation (RIS). It observes that strong foreground-background separability emerges in internal representations at the earliest denoising timestep, before visible changes. A training-free framework is proposed that decomposes localization into attention-based spatial priors and feature-based semantic discrimination, producing accurate masks from a single denoising step. Experiments on RefCOCO, RefCOCO+, and RefCOCOg show superior performance over existing zero-shot baselines.

Significance. If the result holds, the work provides evidence of emergent semantic grounding in diffusion-based editing models and offers an efficient, training-free route to zero-shot RIS. This could advance understanding of early-stage representations in generative models and reduce reliance on full synthesis or task-specific fine-tuning for localization tasks.

major comments (3)
  1. Framework description (likely §3): The central claim that single-timestep feature separability suffices for pixel-accurate masks requires explicit detail on the aggregation and thresholding procedure that converts attention priors plus feature discrimination into final binary masks; without this, it is unclear whether the method avoids implicit tuning or full synthesis.
  2. Experiments section: The abstract asserts superior benchmark performance, yet no quantitative metrics, tables, or ablation results are referenced here; the separability claim at the earliest timestep must be supported by concrete scores (e.g., foreground-background cosine similarity or IoU at t=1) to substantiate emergence before visible transformation.
  3. Ablation studies (likely §4.3): To address the sufficiency concern, the paper must include an ablation removing the attention-based spatial priors and reporting performance using only feature-based discrimination; if performance drops sharply, the separability observation alone does not carry the full load-bearing claim.
minor comments (2)
  1. Figure 1 or visualization section: Timestep labels and feature-map scales should be added for clarity when showing internal representations at the earliest denoising step.
  2. Related work: Ensure citations to recent zero-shot RIS baselines (e.g., CLIP-based or diffusion segmentation methods) are complete and up-to-date.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which have helped us strengthen the clarity and empirical support in our manuscript. We address each major comment below and have revised the paper accordingly.

point-by-point responses
  1. Referee: Framework description (likely §3): The central claim that single-timestep feature separability suffices for pixel-accurate masks requires explicit detail on the aggregation and thresholding procedure that converts attention priors plus feature discrimination into final binary masks; without this, it is unclear whether the method avoids implicit tuning or full synthesis.

    Authors: We agree that additional detail is warranted for reproducibility. In the revised manuscript, Section 3 now explicitly describes the procedure: attention-based spatial priors are obtained by averaging cross-attention maps across relevant layers and heads, followed by softmax normalization; these are combined with feature-based semantic discrimination via cosine similarity between the instruction embedding and the t=1 denoised features. The final binary mask is produced by applying a threshold (determined via Otsu's method on the similarity distribution) to the combined map; a minimal sketch of this thresholding step follows the responses below. The entire process remains training-free and does not rely on full synthesis or implicit tuning. revision: yes

  2. Referee: Experiments section: The abstract asserts superior benchmark performance, yet no quantitative metrics, tables, or ablation results are referenced here; the separability claim at the earliest timestep must be supported by concrete scores (e.g., foreground-background cosine similarity or IoU at t=1) to substantiate emergence before visible transformation.

    Authors: We acknowledge the need for explicit quantitative backing in support of the separability claim. The revised manuscript augments the experiments section with concrete metrics, including foreground-background cosine similarity and IoU scores computed at t=1 (prior to any visible image change), along with references to the main results tables. These additions directly substantiate the emergence of separability at the earliest denoising step while preserving the abstract's summary nature. revision: yes

  3. Referee: Ablation studies (likely §4.3): To address the sufficiency concern, the paper must include an ablation removing the attention-based spatial priors and reporting performance using only feature-based discrimination; if performance drops sharply, the separability observation alone does not carry the full load-bearing claim.

    Authors: We have performed the requested ablation and included the results in the revised Section 4.3. Removing the attention-based spatial priors and relying solely on feature-based discrimination yields a substantial performance drop across RefCOCO, RefCOCO+, and RefCOCOg. This confirms that both components are required to fully exploit the observed early separability and supports the design of the proposed framework. revision: yes
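The combine-and-threshold step described in response 1 above is standard enough to sketch. Below is a self-contained NumPy implementation of Otsu's method applied to a flattened score map; the score map is a random stand-in, and nothing here is claimed to match the authors' actual code.

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Otsu's method: choose the cut that maximizes between-class
    variance of a 1-D score distribution (here, the per-pixel
    attention-plus-similarity scores from the combined map)."""
    hist, edges = np.histogram(values, bins=bins)
    hist = hist.astype(float) / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(hist)                 # weight of the low class per cut
    w1 = 1.0 - w0                        # weight of the high class
    mu0 = np.cumsum(hist * centers)      # unnormalized low-class mean
    mu_total = mu0[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        m0 = mu0 / w0
        m1 = (mu_total - mu0) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
    return centers[np.nanargmax(var_between)]

# Binarize a combined score map (random stand-in for the real one).
scores = np.random.default_rng(1).random((32, 32))
mask = scores > otsu_threshold(scores.ravel())
```

Otsu removes the need to hand-pick a threshold per dataset; note that the paper's own ablation (Figure 5) instead sweeps a fixed τ and finds a sweet spot near 0.8, so treating the two as interchangeable is our assumption, not the paper's.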

Circularity Check

0 steps flagged

No circularity: empirical observation of separability drives training-free framework

full rationale

The paper's derivation begins with a systematic empirical analysis revealing foreground-background separability in internal representations at the earliest denoising timestep of pretrained image editing models. This observation is then used to construct a training-free decomposition into attention-based spatial priors and feature-based semantic discrimination for zero-shot referring image segmentation. No steps reduce by construction to fitted parameters, self-definitional loops, or load-bearing self-citations; the framework exploits the observed property without renaming known results or smuggling ansatzes via prior author work. The central claim remains independent of the target task quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that IIE models perform implicit language-conditioned grounding; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Instruction-based image editing models inherently perform language-conditioned visual semantic grounding
    This premise is stated explicitly in the first sentence of the abstract as the basis for the investigation.

pith-pipeline@v0.9.0 · 5515 in / 1148 out tokens · 51548 ms · 2026-05-14T19:39:49.067021+00:00 · methodology

