pith. machine review for the scientific record.

arxiv: 2605.04590 · v1 · submitted 2026-05-06 · 💻 cs.CV · cs.AI

Recognition: 3 theorem links · Lean Theorem

From Diffusion to Rectified Flow: Rethinking Text-Based Segmentation

Haijian Gu, Hongwei Kang, Quan Meng, Ruidong Pan, Tianrui Niu, Xin Yang, Xuesong Li, Zishen Qu

Pith reviewed 2026-05-08 18:34 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords text-based image segmentation · rectified flow · diffusion models · zero-shot segmentation · latent space mapping · RLFSeg

The pith

Rectified Flow replaces diffusion's noise-denoise process with direct mapping for text-based image segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that diffusion models' generative noise-denoise process hinders their use as feature extractors for text-based image segmentation. To address this, it introduces a framework that uses Rectified Flow to establish a direct mapping from images to segmentation masks in latent space. This eliminates the need for time-step optimization and iterative denoising, yielding better results than diffusion-based approaches, particularly in zero-shot cases. Additional techniques like label refinement and adaptive one-step sampling allow accurate segmentation in a single forward pass. The approach repurposes pretrained generative models for segmentation without any structural changes.

Core claim

We propose RLFSeg, a novel framework that leverages Rectified Flow to learn direct mapping from the image to the segmentation mask within the latent space. The model is thus freed from the noise-denoise process and the need to optimize the time step of diffusion models, resulting in substantially better performance than previous diffusion-based methods, especially on zero-shot scenarios. By introducing label refinement and an Adaptive One-Step Sampling strategy, the model achieves higher accuracy even on a single inference step.

What carries the argument

Rectified Flow's direct mapping from image to segmentation mask in the latent space, which replaces the iterative noise-denoise process of diffusion models.
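
For orientation, a minimal sketch of the generic rectified-flow formulation such a direct mapping would instantiate (the symbols z_img, z_mask, and the text condition c_text are assumptions here; the paper's exact parameterization and loss may differ): the two latents are joined by a straight path, a network is trained to match the constant velocity along that path, and sampling integrates the resulting ODE, which in the one-step regime collapses to a single Euler step from the image latent.

$$z_t = (1 - t)\, z_{\mathrm{img}} + t\, z_{\mathrm{mask}}, \qquad t \in [0, 1]$$

$$\mathcal{L}(\theta) = \mathbb{E}_{t,\,(z_{\mathrm{img}},\, z_{\mathrm{mask}})} \big\| v_\theta(z_t, t, c_{\mathrm{text}}) - (z_{\mathrm{mask}} - z_{\mathrm{img}}) \big\|_2^2$$

$$\hat{z}_{\mathrm{mask}} = z_{\mathrm{img}} + \int_0^1 v_\theta(z_t, t, c_{\mathrm{text}})\, \mathrm{d}t \;\approx\; z_{\mathrm{img}} + v_\theta(z_{\mathrm{img}}, 0, c_{\mathrm{text}})$$

A diffusion-based extractor, by contrast, integrates a reverse process that starts from noise and must be queried at a chosen timestep, which is exactly the dependency the paper argues against.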

If this is right

  • Substantially better performance than diffusion-based methods on text-based segmentation tasks.
  • Particularly strong gains in zero-shot scenarios without task-specific fine-tuning.
  • Higher accuracy achievable even with a single inference step via adaptive sampling.
  • Pretrained generative models can be redirected to discriminative segmentation with no structural changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The direct-mapping idea could apply to other discriminative tasks that currently borrow features from generative diffusion models.
  • Single-step inference opens the possibility of real-time text-prompted segmentation in interactive applications.
  • Avoiding noise addition may improve boundary precision on images with fine details or ambiguous text prompts.

Load-bearing premise

The generative noise-denoise process in diffusion models is inherently harmful to discriminative segmentation, while Rectified Flow's direct mapping preserves rich multimodal semantic features without it.
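
A minimal, purely illustrative sketch of this contrast in toy form (all names, the oracle dynamics, and the 16-dimensional latent are hypothetical; this is not the paper's RLFSeg code): a diffusion-style sampler walks in from noise over many steps, while a straight-path velocity lets a single Euler step carry the image latent to the mask latent.

```python
# Toy contrast between iterative denoising and a rectified-flow one-step mapping.
# Everything here is illustrative: oracle functions stand in for trained networks.
import numpy as np

rng = np.random.default_rng(0)
D = 16                                  # hypothetical latent dimensionality
z_image = rng.normal(size=D)            # stand-in for the encoded image latent
z_mask = np.tanh(z_image)               # stand-in for the target mask latent

def velocity(z_t, t):
    """Oracle velocity along the straight path z_t = (1 - t) * z_image + t * z_mask.
    A trained, text-conditioned network v_theta(z_t, t, text) would play this role."""
    return z_mask - z_image

def denoise_step(z_t):
    """Toy denoiser: each call moves partway toward the target, mimicking the
    many small corrections of a diffusion reverse process started from noise."""
    return 0.7 * z_t + 0.3 * z_mask

# Diffusion-style inference: start from pure noise and iterate.
z = rng.normal(size=D)
for _ in range(10):
    z = denoise_step(z)

# Rectified-flow-style inference: one Euler step from the image latent itself.
z_flow = z_image + 1.0 * velocity(z_image, t=0.0)

print("10-step denoising error:", float(np.linalg.norm(z - z_mask)))
print("1-step flow error:      ", float(np.linalg.norm(z_flow - z_mask)))
```

The structural point is the starting position and step count: the flow path begins at the image latent and has constant velocity, so one step reaches the endpoint in this toy, whereas the denoising loop begins at noise and only approaches the target incrementally.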

What would settle it

A side-by-side evaluation on zero-shot text-based segmentation benchmarks where an optimized diffusion baseline matches or beats RLFSeg performance would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.04590 by Haijian Gu, Hongwei Kang, Quan Meng, Ruidong Pan, Tianrui Niu, Xin Yang, Xuesong Li, Zishen Qu.

Figure 1: Prior methods rely on LDM as a feature extractor.
Figure 2: Overview of RLFSeg. (a) Training pipeline with Rectified Latent Flow and SAM-driven Label Refinement, where …
Figure 3: Visualization of results without RDS. The Rectified …
Figure 4: Comparison of sampling trajectories and the effect of AOS. Interpolation with (a) the ground-truth …
Figure 5: Qualitative comparison with different methods.
Figure 6: Qualitative comparison of mask boundaries. Lever…
Figure 7: presents a particularly extreme case of this failure mode; …
Original abstract

Text-based image segmentation aims to delineate object boundaries within an image from text prompts, offering higher flexibility and broader application scope compared to traditional fixed-category segmentation tasks. Recent studies have shown that diffusion models (e.g., Stable Diffusion) can provide rich multimodal semantic features, leading to studies of using diffusion models as feature extractors for segmentation tasks. Such methods, however, inherit the generative natures of diffusion models that are harmful to discriminative segmentation tasks. In response, we propose RLFSeg, a novel framework that leverages Rectified Flow to learn direct mapping from the image to the segmentation mask within the latent space. The model is thus freed from the noise-denoise process and the need to optimize the time step of diffusion models, resulting in substantially better performance than previous diffusion-based methods, especially on zero-shot scenarios. By introducing label refinement and an Adaptive One-Step Sampling strategy, the model achieves higher accuracy even on a single inference step. The framework redirects a pretrained generative model to the discriminative segmentation task with zero modification to model structure, thus reveals promising application potential and significant research value.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes RLFSeg, a framework that replaces diffusion models with Rectified Flow to learn a direct image-to-mask mapping in latent space for text-based segmentation. It claims this avoids the harmful noise-denoise process and timestep optimization of diffusion models, yielding substantially better performance than prior diffusion-based methods (especially zero-shot), while label refinement and Adaptive One-Step Sampling enable high accuracy even in one inference step. The approach redirects pretrained generative models to discriminative tasks with no structural changes.

Significance. If the performance gains are robustly demonstrated and attributable to the Rectified Flow direct mapping, the work would provide a promising route to adapt large pretrained generative models for segmentation without retraining or architectural overhaul, with particular value for zero-shot and efficient inference scenarios.

major comments (2)
  1. [Abstract] The central claim of 'substantially better performance than previous diffusion-based methods, especially on zero-shot scenarios' is asserted without any quantitative results, baselines, datasets, or metrics. The experiments section must supply these comparisons (including diffusion baselines) to support the claim.
  2. [Method and Experiments] The attribution of gains to Rectified Flow's direct mapping (freeing the model from noise-denoise and timestep issues) is load-bearing, yet label refinement and Adaptive One-Step Sampling are introduced as key components. Ablations applying the same refinements to a diffusion baseline are required to isolate whether the flow formulation itself drives the improvements or whether the auxiliary strategies suffice.
minor comments (2)
  1. [Method] Clarify the exact formulation of the Rectified Flow mapping (e.g., the velocity field or ODE) with an equation to distinguish it from diffusion's forward/reverse processes.
  2. [Method] The phrase 'zero modification to model structure' should be supported by explicitly stating which pretrained model (e.g., Stable Diffusion variant) is used and confirming no fine-tuning occurs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract] The central claim of 'substantially better performance than previous diffusion-based methods, especially on zero-shot scenarios' is asserted without any quantitative results, baselines, datasets, or metrics. The experiments section must supply these comparisons (including diffusion baselines) to support the claim.

    Authors: Abstracts conventionally present high-level claims without numerical details. The experiments section of the manuscript already contains quantitative comparisons to prior diffusion-based methods across multiple datasets and standard metrics (including zero-shot settings). To strengthen the link between the abstract claim and the supporting evidence, we will revise the abstract to include a brief reference to the key experimental outcomes and ensure the experiments section explicitly enumerates all diffusion baselines, datasets, and metrics used. revision: partial

  2. Referee: [Method and Experiments] The attribution of gains to Rectified Flow's direct mapping (freeing the model from noise-denoise and timestep issues) is load-bearing, yet label refinement and Adaptive One-Step Sampling are introduced as key components. Ablations applying the same refinements to a diffusion baseline are required to isolate whether the flow formulation itself drives the improvements or whether the auxiliary strategies suffice.

    Authors: We agree that rigorous isolation of the Rectified Flow contribution is necessary. We will add new ablation experiments in the revised manuscript that apply label refinement and Adaptive One-Step Sampling to a diffusion baseline under identical conditions. These results will be presented alongside the original RLFSeg results to demonstrate that the performance advantages stem primarily from the direct image-to-mask mapping enabled by Rectified Flow rather than the auxiliary components alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

Full rationale

The paper proposes RLFSeg as a new framework adapting Rectified Flow for direct latent-space image-to-mask mapping in text-based segmentation, explicitly contrasting it with diffusion models' noise-denoise process. It introduces independent components (label refinement, Adaptive One-Step Sampling) and reports empirical gains, especially zero-shot. No equations, self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the claims; the central result is a methodological shift built on external pretrained models and evaluated against benchmarks, rather than one that reduces to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the method assumes pretrained generative models provide usable features and that direct mapping suffices for segmentation.

pith-pipeline@v0.9.0 · 5506 in / 977 out tokens · 30258 ms · 2026-05-08T18:34:08.865920+00:00 · methodology

