pith. machine review for the scientific record.

arxiv: 2605.04590 · v1 · submitted 2026-05-06 · 💻 cs.CV · cs.AI

Recognition: 3 theorem links · Lean Theorem

From Diffusion to Rectified Flow: Rethinking Text-Based Segmentation

Haijian Gu, Hongwei Kang, Quan Meng, Ruidong Pan, Tianrui Niu, Xin Yang, Xuesong Li, Zishen Qu

Pith reviewed 2026-05-08 18:34 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords text-based image segmentation · rectified flow · diffusion models · zero-shot segmentation · latent space mapping · RLFSeg

The pith

Rectified Flow replaces diffusion's noise-denoise process with direct mapping for text-based image segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that diffusion models' generative noise-denoise process hinders their use as feature extractors for text-based image segmentation. To address this, it introduces a framework that uses Rectified Flow to establish a direct mapping from images to segmentation masks in latent space. This eliminates the need for time-step optimization and iterative denoising, yielding better results than diffusion-based approaches, particularly in zero-shot cases. Additional techniques like label refinement and adaptive one-step sampling allow accurate segmentation in a single forward pass. The approach repurposes pretrained generative models for segmentation without any structural changes.

Core claim

We propose RLFSeg, a novel framework that leverages Rectified Flow to learn direct mapping from the image to the segmentation mask within the latent space. The model is thus freed from the noise-denoise process and the need to optimize the time step of diffusion models, resulting in substantially better performance than previous diffusion-based methods, especially on zero-shot scenarios. By introducing label refinement and an Adaptive One-Step Sampling strategy, the model achieves higher accuracy even on a single inference step.

What carries the argument

Rectified Flow's direct mapping from image to segmentation mask in the latent space, which replaces the iterative noise-denoise process of diffusion models.
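
For orientation, a minimal sketch of the generic rectified-flow formulation such a direct mapping would instantiate (the symbols z_img, z_mask, and the text condition c_text are assumptions here; the paper's exact parameterization and loss may differ): the two latents are joined by a straight path, a network is trained to match the constant velocity along that path, and sampling integrates the resulting ODE, which in the one-step regime collapses to a single Euler step from the image latent.

$$z_t = (1 - t)\, z_{\mathrm{img}} + t\, z_{\mathrm{mask}}, \qquad t \in [0, 1]$$

$$\mathcal{L}(\theta) = \mathbb{E}_{t,\,(z_{\mathrm{img}},\, z_{\mathrm{mask}})} \big\| v_\theta(z_t, t, c_{\mathrm{text}}) - (z_{\mathrm{mask}} - z_{\mathrm{img}}) \big\|_2^2$$

$$\hat{z}_{\mathrm{mask}} = z_{\mathrm{img}} + \int_0^1 v_\theta(z_t, t, c_{\mathrm{text}})\, \mathrm{d}t \;\approx\; z_{\mathrm{img}} + v_\theta(z_{\mathrm{img}}, 0, c_{\mathrm{text}})$$

A diffusion-based extractor, by contrast, integrates a reverse process that starts from noise and must be queried at a chosen timestep, which is exactly the dependency the paper argues against.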

If this is right

  • Substantially better performance than diffusion-based methods on text-based segmentation tasks.
  • Particularly strong gains in zero-shot scenarios without task-specific fine-tuning.
  • Higher accuracy achievable even with a single inference step via adaptive sampling.
  • Pretrained generative models can be redirected to discriminative segmentation with no structural changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The direct-mapping idea could apply to other discriminative tasks that currently borrow features from generative diffusion models.
  • Single-step inference opens the possibility of real-time text-prompted segmentation in interactive applications.
  • Avoiding noise addition may improve boundary precision on images with fine details or ambiguous text prompts.

Load-bearing premise

The generative noise-denoise process in diffusion models is inherently harmful to discriminative segmentation, while Rectified Flow's direct mapping preserves rich multimodal semantic features without it.
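
A minimal, purely illustrative sketch of this contrast in toy form (all names, the oracle dynamics, and the 16-dimensional latent are hypothetical; this is not the paper's RLFSeg code): a diffusion-style sampler walks in from noise over many steps, while a straight-path velocity lets a single Euler step carry the image latent to the mask latent.

```python
# Toy contrast between iterative denoising and a rectified-flow one-step mapping.
# Everything here is illustrative: oracle functions stand in for trained networks.
import numpy as np

rng = np.random.default_rng(0)
D = 16                                  # hypothetical latent dimensionality
z_image = rng.normal(size=D)            # stand-in for the encoded image latent
z_mask = np.tanh(z_image)               # stand-in for the target mask latent

def velocity(z_t, t):
    """Oracle velocity along the straight path z_t = (1 - t) * z_image + t * z_mask.
    A trained, text-conditioned network v_theta(z_t, t, text) would play this role."""
    return z_mask - z_image

def denoise_step(z_t):
    """Toy denoiser: each call moves partway toward the target, mimicking the
    many small corrections of a diffusion reverse process started from noise."""
    return 0.7 * z_t + 0.3 * z_mask

# Diffusion-style inference: start from pure noise and iterate.
z = rng.normal(size=D)
for _ in range(10):
    z = denoise_step(z)

# Rectified-flow-style inference: one Euler step from the image latent itself.
z_flow = z_image + 1.0 * velocity(z_image, t=0.0)

print("10-step denoising error:", float(np.linalg.norm(z - z_mask)))
print("1-step flow error:      ", float(np.linalg.norm(z_flow - z_mask)))
```

The structural point is the starting position and step count: the flow path begins at the image latent and has constant velocity, so one step reaches the endpoint in this toy, whereas the denoising loop begins at noise and only approaches the target incrementally.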

What would settle it

A side-by-side evaluation on zero-shot text-based segmentation benchmarks where an optimized diffusion baseline matches or beats RLFSeg performance would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.04590 by Haijian Gu, Hongwei Kang, Quan Meng, Ruidong Pan, Tianrui Niu, Xin Yang, Xuesong Li, Zishen Qu.

Figure 1: Prior methods rely on LDM as a feature extractor.
Figure 2: Overview of RLFSeg. (a) Training pipeline with Rectified Latent Flow and SAM-driven Label Refinement, where …
Figure 3: Visualization of results without RDS. The Rectified …
Figure 4: Comparison of sampling trajectories and the effect of AOS. Interpolation with (a) the ground-truth …
Figure 5: Qualitative comparison with different methods.
Figure 6: Qualitative comparison of mask boundaries. Lever…
Figure 7: presents a particularly extreme case of this failure mode; …
Original abstract

Text-based image segmentation aims to delineate object boundaries within an image from text prompts, offering higher flexibility and broader application scope compared to traditional fixed-category segmentation tasks. Recent studies have shown that diffusion models (e.g., Stable Diffusion) can provide rich multimodal semantic features, leading to studies of using diffusion models as feature extractors for segmentation tasks. Such methods, however, inherit the generative natures of diffusion models that are harmful to discriminative segmentation tasks. In response, we propose RLFSeg, a novel framework that leverages Rectified Flow to learn direct mapping from the image to the segmentation mask within the latent space. The model is thus freed from the noise-denoise process and the need to optimize the time step of diffusion models, resulting in substantially better performance than previous diffusion-based methods, especially on zero-shot scenarios. By introducing label refinement and an Adaptive One-Step Sampling strategy, the model achieves higher accuracy even on a single inference step. The framework redirects a pretrained generative model to the discriminative segmentation task with zero modification to model structure, thus reveals promising application potential and significant research value.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes RLFSeg, a framework that replaces diffusion models with Rectified Flow to learn a direct image-to-mask mapping in latent space for text-based segmentation. It claims this avoids the harmful noise-denoise process and timestep optimization of diffusion models, yielding substantially better performance than prior diffusion-based methods (especially zero-shot), while label refinement and Adaptive One-Step Sampling enable high accuracy even in one inference step. The approach redirects pretrained generative models to discriminative tasks with no structural changes.

Significance. If the performance gains are robustly demonstrated and attributable to the Rectified Flow direct mapping, the work would provide a promising route to adapt large pretrained generative models for segmentation without retraining or architectural overhaul, with particular value for zero-shot and efficient inference scenarios.

major comments (2)
  1. [Abstract] The central claim of 'substantially better performance than previous diffusion-based methods, especially on zero-shot scenarios' is asserted without any quantitative results, baselines, datasets, or metrics. The experiments section must supply these comparisons (including diffusion baselines) to support the claim.
  2. [Method and Experiments] The attribution of gains to Rectified Flow's direct mapping (freeing the model from noise-denoise and timestep issues) is load-bearing, yet label refinement and Adaptive One-Step Sampling are introduced as key components. Ablations applying the same refinements to a diffusion baseline are required to isolate whether the flow formulation itself drives the improvements or whether the auxiliary strategies suffice.
minor comments (2)
  1. [Method] Clarify the exact formulation of the Rectified Flow mapping (e.g., the velocity field or ODE) with an equation to distinguish it from diffusion's forward/reverse processes.
  2. [Method] The phrase 'zero modification to model structure' should be supported by explicitly stating which pretrained model (e.g., Stable Diffusion variant) is used and confirming no fine-tuning occurs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract] The central claim of 'substantially better performance than previous diffusion-based methods, especially on zero-shot scenarios' is asserted without any quantitative results, baselines, datasets, or metrics. The experiments section must supply these comparisons (including diffusion baselines) to support the claim.

    Authors: Abstracts conventionally present high-level claims without numerical details. The experiments section of the manuscript already contains quantitative comparisons to prior diffusion-based methods across multiple datasets and standard metrics (including zero-shot settings). To strengthen the link between the abstract claim and the supporting evidence, we will revise the abstract to include a brief reference to the key experimental outcomes and ensure the experiments section explicitly enumerates all diffusion baselines, datasets, and metrics used. revision: partial

  2. Referee: [Method and Experiments] The attribution of gains to Rectified Flow's direct mapping (freeing the model from noise-denoise and timestep issues) is load-bearing, yet label refinement and Adaptive One-Step Sampling are introduced as key components. Ablations applying the same refinements to a diffusion baseline are required to isolate whether the flow formulation itself drives the improvements or whether the auxiliary strategies suffice.

    Authors: We agree that rigorous isolation of the Rectified Flow contribution is necessary. We will add new ablation experiments in the revised manuscript that apply label refinement and Adaptive One-Step Sampling to a diffusion baseline under identical conditions. These results will be presented alongside the original RLFSeg results to demonstrate that the performance advantages stem primarily from the direct image-to-mask mapping enabled by Rectified Flow rather than the auxiliary components alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

Full rationale

The paper proposes RLFSeg as a new framework adapting Rectified Flow for direct latent-space image-to-mask mapping in text-based segmentation, explicitly contrasting it with diffusion models' noise-denoise process. It introduces independent components (label refinement, Adaptive One-Step Sampling) and reports empirical gains, especially zero-shot. No equations, self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the claims; the central result is a methodological shift built on external pretrained models and evaluated against benchmarks, rather than one that reduces to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the method assumes pretrained generative models provide usable features and that direct mapping suffices for segmentation.

pith-pipeline@v0.9.0 · 5506 in / 977 out tokens · 30258 ms · 2026-05-08T18:34:08.865920+00:00 · methodology

