
arxiv: 2606.09303 · v1 · pith:YIQCXVVEnew · submitted 2026-06-08 · 💻 cs.CV

Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning

Pith reviewed 2026-06-27 16:59 UTC · model grok-4.3

classification 💻 cs.CV
keywords image segmentation · multimodal large language models · reasoning-based segmentation · candidate discovery · mask selection · attention maps · benchmark evaluation

The pith

Segmentation for complex queries works by first extracting candidate masks from attention maps, then having an MLLM score each candidate against the query and keeping the highest-scoring one.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a two-stage framework that treats image segmentation as candidate discovery followed by discriminative selection. Potential regions are first extracted as candidate masks from the attention maps of a segmentation MLLM. An MLLM then reasons jointly over the original query and each candidate mask to assign scores, after which the masks are reranked and the top one is chosen. The approach is motivated by the observation that existing MLLM-based methods are constrained by limited training data and by the gap between the MLLM and its mask generation module. A new benchmark is introduced to test perception, grounding, and multiple forms of reasoning, along with collected data for training the scoring step.

Core claim

Rea2Seg reformulates image segmentation for complex reasoning queries as a two-stage process: candidate masks are discovered from attention maps of a segmentation MLLM, after which a second MLLM reasons over the query and each candidate to produce scores; the highest-scoring mask is selected as output.
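
The selection stage described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the mask representation (a frozenset of pixel coordinates), the function `rerank_and_select`, and the placeholder scorer `toy_score` are all hypothetical names introduced here. In the paper the score comes from an MLLM reasoning over the query and each candidate; the toy scorer only stands in for that call.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

# Hypothetical mask representation: a set of (row, col) pixel coordinates.
Mask = FrozenSet[Tuple[int, int]]


def rerank_and_select(
    query: str,
    candidates: Sequence[Mask],
    score_fn: Callable[[str, Mask], float],
) -> Tuple[Mask, List[int]]:
    """Score every candidate, rank by descending score, return (best mask, ranking)."""
    if not candidates:
        raise ValueError("at least one candidate mask is required")
    scored = [(score_fn(query, mask), idx) for idx, mask in enumerate(candidates)]
    # Sort by descending score; ties keep the original candidate order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    ranking = [idx for _, idx in scored]
    return candidates[ranking[0]], ranking


def toy_score(query: str, mask: Mask) -> float:
    """Placeholder for the MLLM scorer (assumption): prefers small or large masks."""
    return -float(len(mask)) if "smallest" in query else float(len(mask))


candidates = [
    frozenset({(0, 0), (0, 1)}),
    frozenset({(1, 1)}),
    frozenset({(2, 2), (2, 3), (3, 3)}),
]
best, ranking = rerank_and_select("the smallest object", candidates, toy_score)
```

Keeping the scorer behind a plain callable mirrors the separation the paper argues for: the candidate generator and the reasoning model can be swapped independently.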

What carries the argument

The Rea2Seg two-stage pipeline that separates mask candidate discovery from comparative reasoning-based selection.

If this is right

  • The method separates perception from final mask choice, allowing the MLLM to apply comparative reasoning after initial region proposals.
  • Training data collected for scoring enables the MLLM to jointly process queries and candidate masks.
  • The new ReasonSeg-SGDR benchmark measures performance across discriminative recognition, spatial reasoning, geometric reasoning, and multi-step reasoning with fine-grained masks.
  • The authors note that many questions in existing benchmarks test commonsense reasoning and do not fully require joint visual observation and reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation into discovery and selection stages could be tested on queries that require iterative refinement beyond a single pass.
  • If attention maps prove reliable for candidate generation, similar discovery steps might apply to other MLLM-driven vision tasks that produce structured outputs.
  • The benchmark's multi-dimension design could serve as a template for evaluating whether gains come from perception, grounding, or the reasoning step itself.

Load-bearing premise

Attention maps from the segmentation MLLM contain the correct mask among the generated candidates even for complex reasoning queries.
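
To make the premise concrete, here is one way candidate regions could be pulled out of an attention map. The summary does not say how Rea2Seg actually does this, so everything below is an assumption for illustration: thresholding followed by 4-connected component grouping is a generic technique swapped in here, and `candidate_masks`, `threshold`, and `min_area` are hypothetical names.

```python
from collections import deque
from typing import FrozenSet, List, Sequence, Tuple

Mask = FrozenSet[Tuple[int, int]]


def candidate_masks(
    attn: Sequence[Sequence[float]],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
    min_area: int = 1,       # assumed minimum region size in pixels
) -> List[Mask]:
    """Group pixels with attention >= threshold into 4-connected regions."""
    height = len(attn)
    width = len(attn[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    regions: List[Mask] = []
    for r in range(height):
        for c in range(width):
            if seen[r][c] or attn[r][c] < threshold:
                continue
            # Breadth-first flood fill from an unvisited above-threshold pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and attn[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(pixels) >= min_area:
                regions.append(frozenset(pixels))
    return regions


attn = [
    [0.9, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.7],
    [0.0, 0.0, 0.0, 0.6],
]
regions = candidate_masks(attn)
```

The sketch also shows why the premise is load-bearing: a target whose attention never clears the threshold produces no candidate at all, and no amount of reasoning in the second stage can select a mask that was never proposed.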

What would settle it

If reranking via MLLM reasoning on the ReasonSeg-SGDR benchmark yields no accuracy gain over the raw attention-derived masks or prior direct-generation baselines, the two-stage claim would be falsified.
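
The comparison described above amounts to measuring the same metric for two sets of predictions on the same ground truth. The sketch below uses mean per-sample IoU as that metric, which is an assumption: the summary does not state which metric the paper reports. The helper names `iou`, `mean_iou`, and `reranking_gain` are hypothetical.

```python
from typing import AbstractSet, Sequence, Tuple

Pixel = Tuple[int, int]


def iou(pred: AbstractSet[Pixel], gt: AbstractSet[Pixel]) -> float:
    """Intersection over union of two pixel sets; 0.0 when both are empty."""
    union = len(pred | gt)
    return len(pred & gt) / union if union else 0.0


def mean_iou(preds: Sequence[AbstractSet[Pixel]], gts: Sequence[AbstractSet[Pixel]]) -> float:
    """Average per-sample IoU over a dataset."""
    if not gts or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and the same length")
    return sum(iou(p, g) for p, g in zip(preds, gts)) / len(gts)


def reranking_gain(reranked, baseline, gts) -> float:
    """Positive when the reranked selection beats the baseline masks."""
    return mean_iou(reranked, gts) - mean_iou(baseline, gts)


# Toy data: the baseline is the first raw candidate, the reranked list is the selected one.
gts = [{(0, 0), (0, 1)}, {(1, 1)}]
baseline = [{(0, 0)}, {(2, 2)}]
reranked = [{(0, 0), (0, 1)}, {(1, 1)}]
gain = reranking_gain(reranked, baseline, gts)
```

A gain at or below zero on the benchmark, measured this way against both the raw attention-derived masks and direct-generation baselines, is the outcome that would count against the two-stage claim.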

Figures

Figures reproduced from arXiv: 2606.09303 by Haoran Hao, Xiangyu Yue, Xinyan Gao.

Figure 1: Comparison of different MLLM-based segmentation methods.

Figure 2: The proposed ReasonSeg-SGDR benchmark. It focuses on the joint use of visual perception and high-level reasoning. It spans multiple dimensions, including discriminative, geometric, spatial, and multi-step reasoning, all of which require fine-grained mask generation. The target region is identified by carefully examining the image and reasoning over the visual evidence.

Figure 3: The overall framework of Rea2Seg. Rea2Seg consists of an attention-driven candidate mask generator and a mask evaluator for selection. Specifically, (1) candidate regions are first identified based on the attention maps of a segmentation MLLM; (2) an MLLM then performs reasoning and assigns scores to all candidate masks; and (3) the candidates are reranked, and the highest-scoring mask is selected as the f…
Original abstract

The rapid development of pretrained foundation models has enabled more general image segmentation. Multimodal large language models (MLLMs) have been widely explored for image segmentation with complex queries that require high-level reasoning. Despite promising progress, existing methods are often constrained by limited training data and the gap between MLLMs and mask generation modules. To better transfer MLLMs' perception and reasoning ability to complex reasoning-based segmentation tasks, we propose a two-stage framework Rea2Seg for mask generation and selection. Specifically, the framework first identifies potential regions as candidate masks based on the attention maps of a segmentation MLLM. It then employs an MLLM to reason over the question and candidate masks and assign scores to each mask. The final segmentation result is obtained by reranking the candidates and selecting the highest-scoring mask, reformulating image segmentation as candidate discovery followed by discriminative mask selection. We also notice that a large portion of questions in existing benchmarks focus on commonsense reasoning, and these questions usually do not fully require joint visual observation and reasoning. To address this issue, we introduce a new benchmark called ReasonSeg-SGDR that comprehensively evaluates a model's perception, grounding, and reasoning abilities across multiple dimensions, including discriminative recognition, spatial reasoning, geometric reasoning, and multi-step reasoning, with fine-grained mask generation. In addition, we collect training data to enhance MLLMs' ability to jointly understand multimodal queries and candidate masks, and to assign scores through reasoning. Experimental results on the proposed benchmark and ReasonSeg demonstrate the effectiveness of the unified mask generation and selection framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Rea2Seg, a two-stage framework for complex reasoning-based image segmentation with MLLMs. Stage 1 generates candidate masks from attention maps of a segmentation MLLM. Stage 2 uses an MLLM to reason over the query and candidates, assign scores, and select the highest-scoring mask via reranking. The authors introduce the ReasonSeg-SGDR benchmark targeting discriminative recognition, spatial reasoning, geometric reasoning, and multi-step reasoning with fine-grained masks, collect training data for the scoring stage, and claim that experiments on ReasonSeg-SGDR and ReasonSeg demonstrate the framework's effectiveness.

Significance. If the central claims hold, the decoupling of candidate discovery from discriminative selection could be a practical way to leverage MLLM reasoning without retraining mask generators, and ReasonSeg-SGDR would address limitations in existing benchmarks that over-rely on commonsense queries. The approach is a natural extension of current MLLM segmentation work, but its significance cannot be assessed without evidence that the attention-based candidate stage achieves sufficient recall on the new benchmark's harder reasoning dimensions.

major comments (2)
  1. [Abstract / candidate discovery stage] The load-bearing assumption of the first stage (Abstract and method description) that attention maps from the segmentation MLLM reliably surface a candidate set containing the ground-truth mask is not validated. No recall@K, coverage statistics, or failure-case analysis is reported for multi-step reasoning queries on ReasonSeg-SGDR, where attention may be diffuse or latch onto incorrect salient regions.
  2. [Abstract / experimental claims] The abstract states that 'experimental results on the proposed benchmark and ReasonSeg demonstrate the effectiveness,' yet no quantitative results, ablation studies, baselines, error bars, or dataset statistics are supplied. This prevents any evaluation of whether the two-stage reranking improves over direct MLLM segmentation or whether the collected training data yields measurable gains.
minor comments (1)
  1. [Abstract] The benchmark description would benefit from one concrete query example per dimension (discriminative, spatial, geometric, multi-step) to clarify the distinctions.
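
The recall@K statistic requested in the first major comment could be computed roughly as follows. This is a sketch under stated assumptions: a candidate counts as a hit when its IoU with the ground truth reaches 0.5 (an arbitrary threshold chosen here), and the sample layout with `dimension`, `candidates`, and `gt` keys is hypothetical rather than the benchmark's actual format.

```python
from collections import defaultdict
from typing import Dict, List


def iou(pred, gt) -> float:
    """Intersection over union of two pixel sets; 0.0 when both are empty."""
    union = len(pred | gt)
    return len(pred & gt) / union if union else 0.0


def recall_at_k(samples: List[dict], k: int, iou_threshold: float = 0.5) -> Dict[str, float]:
    """Per reasoning dimension: share of samples whose top-k candidates contain a hit."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for sample in samples:
        dimension = sample["dimension"]
        totals[dimension] += 1
        top_k = sample["candidates"][:k]  # candidates are assumed to be pre-ranked
        if any(iou(mask, sample["gt"]) >= iou_threshold for mask in top_k):
            hits[dimension] += 1
    return {dimension: hits[dimension] / totals[dimension] for dimension in totals}


samples = [
    {"dimension": "spatial", "candidates": [{(0, 0)}, {(1, 1), (1, 2)}], "gt": {(1, 1), (1, 2)}},
    {"dimension": "spatial", "candidates": [{(5, 5)}], "gt": {(0, 0)}},
    {"dimension": "multi-step", "candidates": [{(3, 3)}], "gt": {(3, 3)}},
]
recall_top1 = recall_at_k(samples, k=1)
recall_top2 = recall_at_k(samples, k=2)
```

Recall at the full candidate-set size is an upper bound on what the selection stage can achieve, so reporting it per dimension would show directly whether candidate discovery or scoring is the bottleneck.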

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below and will revise the manuscript to incorporate additional validation and experimental details.

  1. Referee: [Abstract / candidate discovery stage] The load-bearing assumption of the first stage (Abstract and method description) that attention maps from the segmentation MLLM reliably surface a candidate set containing the ground-truth mask is not validated. No recall@K, coverage statistics, or failure-case analysis is reported for multi-step reasoning queries on ReasonSeg-SGDR, where attention may be diffuse or latch onto incorrect salient regions.

    Authors: We agree that explicit validation of the candidate discovery stage is necessary to support the framework, especially on the more challenging multi-step reasoning queries in ReasonSeg-SGDR. The current manuscript prioritizes end-to-end segmentation performance but does not report recall@K, coverage, or failure-case analysis for this stage. We will add these metrics with breakdowns by reasoning dimension and include failure-case discussion in the revision. revision: yes

  2. Referee: [Abstract / experimental claims] The abstract states that 'experimental results on the proposed benchmark and ReasonSeg demonstrate the effectiveness,' yet no quantitative results, ablation studies, baselines, error bars, or dataset statistics are supplied. This prevents any evaluation of whether the two-stage reranking improves over direct MLLM segmentation or whether the collected training data yields measurable gains.

    Authors: We acknowledge that the current manuscript version does not supply the requested quantitative details, ablations, baselines, error bars, or dataset statistics. We will add these elements, including direct comparisons showing gains from reranking over single-stage MLLM segmentation and the contribution of the collected training data, to substantiate the abstract claims. revision: yes

Circularity Check

0 steps flagged

No circularity detected; framework is self-contained empirical proposal


The paper describes a two-stage Rea2Seg framework that uses attention maps from a segmentation MLLM for candidate mask discovery, followed by MLLM-based scoring and reranking. No equations, fitted parameters, self-citations, or uniqueness theorems appear in the abstract or described content. The central claim is an empirical reformulation of the method, supported by a new benchmark and collected training data, and no output reduces to its inputs by construction. This is the expected result for a self-contained empirical proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities.



Reference graph

Works this paper leans on

122 extracted references · 31 canonical work pages · 15 internal anchors

  1. [1]

    TPAMI (2017) 4

    Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: A deep convolutional encoder-decoder architecture for image segmentation. TPAMI (2017) 4

  2. [2]

    Qwen2.5-VL Technical Report

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025) 11

  3. [3]

    In: European Conference on Computer Vision

    Bao, X., Sun, S., Ma, S., Zheng, K., Guo, Y., Zhao, G., Zheng, Y., Wang, X.: Cores: Orchestrating the dance of reasoning and segmentation. In: European Conference on Computer Vision. pp. 187–204. Springer (2024) 2, 4, 12

  4. [4]

    Caesar, H., Uijlings, J., Ferrari, V.: Coco-stuff: Thing and stuff classes in context (2018),https://arxiv.org/abs/1612.0371611, 24

  5. [5]

    arXiv preprint arXiv:2410.08209 (2024) 9

    Cao, S., Gui, L.Y., Wang, Y.X.: Emergent visual grounding in large multimodal models without grounding supervision. arXiv preprint arXiv:2410.08209 (2024) 9

  6. [6]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Cao, S., Wei, Z., Kuen, J., Liu, K., Zhang, L., Gu, J., Jung, H., Gui, L.Y., Wang, Y.X.: Refer to any segmentation mask group with vision-language prompts. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 21853–21863 (2025) 7, 26, 27

  7. [7]

    SAM 3: Segment Anything with Concepts

    Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., Rädle, R., Afouras, T., Mavroudi, E., Xu, K., Wu, T.H., Zhou, Y., Momeni, L., Hazra, R., Ding, S., Vaze, S., Porcher, F., Li, F., Li, S., Kamath, A., Cheng, H.K., ...

  8. [8]

    TPAMI (2018) 4

    Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI (2018) 4

  9. [9]

    Chen, X., Elbayad, M., Nguyen, J., Verbeek, J.: VUGEN: Visual understanding priors for GENeration (2026),https://openreview.net/forum?id=tubF5vyrQ05

  10. [10]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Chen, X., Mottaghi, R., Liu, X., Fidler, S., Urtasun, R., Yuille, A.: Detect what you can: Detecting and representing objects using holistic models and body parts. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1971–1978 (2014) 11, 24

  11. [11]

    In: European Conference on Computer Vision

    Chen, Y.C., Li, W.H., Sun, C., Wang, Y.C.F., Chen, C.S.: Sam4mllm: Enhance multi-modal large language model for referring expression segmentation. In: European Conference on Computer Vision. pp. 323–340. Springer (2024) 5

  12. [12]

    In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024),https://openreview.net/forum?id=JKEIYQUSUc5

    Cheng, A.C., Yin, H., Fu, Y., Guo, Q., Yang, R., Kautz, J., Wang, X., Liu, S.: SpatialRGPT: Grounded spatial reasoning in vision-language models. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024),https://openreview.net/forum?id=JKEIYQUSUc5

  13. [13]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025) 7

  14. [14]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Deitke, M., Clark, C., Lee, S., Tripathi, R., Yang, Y., Park, J.S., Salehi, M., Muennighoff, N., Lo, K., Soldaini, L., et al.: Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 91–104 (2025) 25 Reason Twice 17

  15. [15]

    In: The Fourteenth In- ternational Conference on Learning Representations (2026),https://openreview

    Du, T., Li, H., Fan, Z., Zhang, J., Pan, P., Zhang, Y.: SAM-veteran: An MLLM- based human-like SAM agent for reasoning segmentation. In: The Fourteenth In- ternational Conference on Learning Representations (2026),https://openreview. net/forum?id=oN55r8iJJW5

  16. [16]

    In: The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2025),https://openreview.net/forum?id=DgH9YCsqWm5, 7, 26, 28

    Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., Wu, Y., Ji, R., Shan, C., He, R.: MME: A comprehensive evaluation benchmark for multimodal large language models. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2025),https://openreview.net/forum?id...

  17. [17]

    In: CVPR (2019) 4

    Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., Lu, H.: Dual attention network for scene segmentation. In: CVPR (2019) 4

  18. [18]

    In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025),https://openreview.net/forum? id=ILr4UNiZcQ5

    Han, J., Chen, H., Zhao, Y., Wang, H., Zhao, Q., Yang, Z., He, H., Yue, X., Jiang, L.: Vision as a dialect: Unifying visual understanding and generation via text-aligned representations. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025),https://openreview.net/forum? id=ILr4UNiZcQ5

  19. [19]

    In: Proceedings of the IEEE international conference on computer vision

    He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision. pp. 2961–2969 (2017) 4

  20. [20]

    In: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29,

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: Lora: Low-rank adaptation of large language models. In: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29,

  21. [21]

    OpenReview.net (2022),https://openreview.net/forum?id=nZeVKeeFYf912

  22. [22]

    arXiv preprint arXiv:2505.22596 (2025) 5, 13

    Huang, J., Xu, Z., Zhou, J., Liu, T., Xiao, Y., Ou, M., Ji, B., Li, X., Yuan, K.: Sam-r1: Leveraging sam for reward feedback in multimodal segmentation via reinforcement learning. arXiv preprint arXiv:2505.22596 (2025) 5, 13

  23. [23]

    In: Proceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition

    Hudson, D.A., Manning, C.D.: Gqa: A new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition. pp. 6700–6709 (2019) 11, 29

  24. [24]

    In: The Thirteenth International Conference on Learning Representations (2025),https: //openreview.net/forum?id=mzL19kKE3r3, 5

    Jang, D., Cho, Y., Lee, S., Kim, T., Kim, D.: MMR: A large-scale benchmark dataset for multi-target and multi-granularity reasoning segmentation. In: The Thirteenth International Conference on Learning Representations (2025),https: //openreview.net/forum?id=mzL19kKE3r3, 5

  25. [25]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Johnson, J., Hariharan, B., Van Der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., Girshick, R.: Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2901–2910 (2017) 27

  26. [26]

    See what you are told: Visual attention sink in large multimodal models.arXiv preprint arXiv:2503.03321, 2025

    Kang, S., Kim, J., Kim, J., Hwang, S.J.: See what you are told: Visual attention sink in large multimodal models. arXiv preprint arXiv:2503.03321 (2025) 9

  27. [27]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Kang, S., Kim, J., Kim, J., Hwang, S.J.: Your large vision-language model only needs a few attention heads for visual grounding. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 9339–9350 (2025) 3, 5, 9

  28. [28]

    In: EMNLP (2014) 2, 11, 24

    Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: Referitgame: Referring to objects in photographs of natural scenes. In: EMNLP (2014) 2, 11, 24

  29. [29]

    Advances in Neural Information Processing Systems36, 29914–29934 (2023) 7

    Ke, L., Ye, M., Danelljan, M., Tai, Y.W., Tang, C.K., Yu, F., et al.: Segment anything in high quality. Advances in Neural Information Processing Systems36, 29914–29934 (2023) 7

  30. [30]

    Segment Anything

    Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. arXiv:2304.02643 (2023) 2 18 X. Gao and H. Hao et al

  31. [31]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J.: Lisa: Reasoning segmentation via large language model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9579–9589 (2024) 2, 3, 4, 5, 7, 9, 11, 12, 13, 24

  32. [32]

    In: The Thirteenth International Conference on Learning Representations (2025),https: //openreview.net/forum?id=vkakKdznFS5

    Lan, M., Chen, C., Zhou, Y., Xu, J., Ke, Y., Wang, X., Feng, L., Zhang, W.: Text4seg: Reimagining image segmentation as text generation. In: The Thirteenth International Conference on Learning Representations (2025),https: //openreview.net/forum?id=vkakKdznFS5

  33. [33]

    Computer vision and image understanding 184, 45–56 (2019) 7, 26

    Le, T.N., Nguyen, T.V., Nie, Z., Tran, M.T., Sugimoto, A.: Anabranch network for camouflaged object segmentation. Computer vision and image understanding 184, 45–56 (2019) 7, 26

  34. [34]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2025) 2, 4

    Li, Z., Yang, B., Liu, Q., Zhang, S., Ma, Z., Yin, L., Deng, L., Sun, Y., Liu, Y., Bai, X.: Lira: Inferring segmentation in large multi-modal models with local interleaved region assistance. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2025) 2, 4

  35. [35]

    In: Winter Conference on Applications of Computer Vision (WACV) (2021) 7, 26

    Liew, J.H., Cohen, S., Price, B., Mai, L., Feng, J.: Deep interactive thin object selection. In: Winter Conference on Applications of Computer Vision (WACV) (2021) 7, 26

  36. [36]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Liu, C., Ding, H., Jiang, X.: Gres: Generalized referring expression segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 23592–23601 (2023) 4, 7, 26, 27

  37. [37]

    Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning (2023) 11, 24

  38. [38]

    In: CVPR (2026) 5

    Liu, J., Feng, M., Chen, L.: Better, stronger, faster: Tackling the trilemma in mllm- based segmentation with simultaneous textual mask prediction. In: CVPR (2026) 5

  39. [39]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Liu, R., Liu, C., Bai, Y., Yuille, A.L.: Clevr-ref+: Diagnosing visual reasoning with referring expressions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4185–4194 (2019) 7, 26, 27

  40. [40]

    arXiv preprint arXiv:2509.18094 (2025) 5

    Liu, Y., Ma, Z., Pu, J., Qi, Z., Wu, Y., Shan, Y., Chen, C.W.: Unipixel: Unified object referring and segmentation for pixel-level visual reasoning. arXiv preprint arXiv:2509.18094 (2025) 5

  41. [41]

    Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

    Liu, Y., Peng, B., Zhong, Z., Yue, Z., Lu, F., Yu, B., Jia, J.: Seg-zero: Reasoning-chain guided segmentation via cognitive reinforcement. arXiv preprint arXiv:2503.06520 (2025) 5, 12, 13

  42. [42]

    arXiv preprint arXiv:2505.12081 (2025) 5, 12

    Liu, Y., Qu, T., Zhong, Z., Peng, B., Liu, S., Yu, B., Jia, J.: Visionreasoner: Unified visual perception and reasoning via reinforcement learning. arXiv preprint arXiv:2505.12081 (2025) 5, 12

  43. [43]

    Decoupled Weight Decay Regularization

    Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) 24

  44. [44]

    arXiv preprint arXiv:2506.04277 (2025) 13

    Lu, Y., Cao, J., Wu, Y., Li, B., Tang, L., Ji, Y., Wu, C., Wu, J., Zhu, W.: Rsvp: Reasoning segmentation via visual prompting and multi-modal chain-of-thought. arXiv preprint arXiv:2506.04277 (2025) 13

  45. [45]

    arXiv preprint arXiv:2510.11173 (2025) 3, 5, 13

    Lu, Z., Li, L., Wang, J., Feng, Y., Chen, B., Chen, K., Wang, Y.: Coprs: Learning positional prior from chain-of-thought for reasoning segmentation. arXiv preprint arXiv:2510.11173 (2025) 3, 5, 13

  46. [46]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Ma, W., Chen, H., Zhang, G., Chou, Y.C., Chen, J., de Melo, C., Yuille, A.: 3dsrbench: A comprehensive 3d spatial reasoning benchmark. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6924–6934 (2025) 5 Reason Twice 19

  47. [47]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A.L., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 11–20 (2016) 11, 25

  48. [48]

    IEEE Transactions on Pattern Analysis and Machine Intelligence44(7), 3523–3542 (2022).https://doi.org/10

    Minaee, S., Boykov, Y., Porikli, F., Plaza, A., Kehtarnavaz, N., Terzopoulos, D.: Image segmentation using deep learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence44(7), 3523–3542 (2022).https://doi.org/10. 1109/TPAMI.2021.30599684

  49. [49]

    In: ECCV (2016) 2

    Nagaraja, V.K., Morariu, V.I., Davis, L.S.: Modeling context between objects for referring expression understanding. In: ECCV (2016) 2

  50. [50]

    In: ICCV (2015) 4

    Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. In: ICCV (2015) 4

  51. [51]

    DINOv2: Learning Robust Visual Features without Supervision

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023) 4

  52. [52]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Park, J.S., Ma, Z., Li, L., Zheng, C., Hsieh, C.Y., Lu, X., Chandu, K., Kong, Q., Kobori, N., Farhadi, A., et al.: Synthetic visual genome. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 9073–9086 (2025) 25

  53. [53]

    In: International Conference on Computer Vision (ICCV) (October 2023) 7, 11, 26, 27, 29

    Qi, L., Kuen, J., Shen, T., Gu, J., Guo, W., Jia, J., Lin, Z., Yang, M.H.: High- quality entity segmentation. In: International Conference on Computer Vision (ICCV) (October 2023) 7, 11, 26, 27, 29

  54. [54]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025) 2, 4, 7, 9

    Qian, R., Yin, X., Dou, D.: Reasoning to attend: Try to understand how< seg> token works. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025) 2, 4, 7, 9

  55. [55]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Qian, R., Yin, X., Dou, D.: Reasoning to attend: Try to understand how< seg> token works. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 24722–24731 (2025) 9, 12, 13

  56. [56]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Qin, J., Wu, J., Yan, P., Li, M., Yuxi, R., Xiao, X., Wang, Y., Wang, R., Wen, S., Pan, X., Wang, X.: Freeseg: Unified, universal and open-vocabulary image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19446–19455 (June 2023) 4

  57. [57]

    In: International conference on machine learning

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021) 4

  58. [58]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Rasheed, H., Maaz, M., Shaji, S., Shaker, A., Khan, S., Cholakkal, H., Anwer, R.M., Xing, E., Yang, M.H., Khan, F.S.: Glamm: Pixel grounding large multimodal model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 13009–13018 (June 2024) 5

  59. [59]

    SAM 2: Segment Anything in Images and Videos

    Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024) 4, 25

  60. [60]

    Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y.K., Wu, Y., Guo, D.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models (2024),https://arxiv.org/abs/2402.033005

  61. [61]

    TPAMI (2017) 4

    Shelhamer, E., Long, J., Darrell, T.: Fully convolutional networks for semantic segmentation. TPAMI (2017) 4

  62. [62]

    OpenAI GPT-5 System Card

    Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., Nathan, A., Luo, A., Helyar, A., Madry, A., Efremov, A., Spyra, A., Baker-Whitcomb, A., Beutel, A., Karpenko, A., 20 X. Gao and H. Hao et al. Makelov, A., Neitz, A., Wei, A., Barr, A., Kirchmeyer, A., Ivanov, A., Christakis, A., Gille...

  63. [63]

    Springer Nature (2022) 4

    Szeliski, R.: Computer vision: algorithms and applications. Springer Nature (2022) 4

  64. [64]

    In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025),https://openreview.net/forum?id=8omLr8BtjL2, 4

    Tang, H., Xie, C.W., Wang, H., Bao, X., Weng, T., Li, P., Zheng, Y., Wang, L.: UFO: A unified approach to fine-grained visual perception via open-ended language interface. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025),https://openreview.net/forum?id=8omLr8BtjL2, 4

  65. [65]

    Gemma 3 Technical Report

    Team, G., Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ramé, A., Rivière, M., Rouillard, L., Mesnard, T., Cideron, G., bastien Grill, J., Ramos, S., Yvinec, E., Casbon, M., Pot, E., Penchev, I., Liu, G., Visin, F., Kenealy, K., Beyer, L., Zhai, X., Tsitsulin, A., Busa-Fekete, R., Feng, A., Sachdeva, N., Cole...

  66. [66]

    Wan, Z., Dou, Z., Liu, C., Zhang, Y., Cui, D., Zhao, Q., Shen, H., Xiong, J., Xin, Y., Jiang, Y., Tao, C., He, Y., Zhang, M., Yan, S.: SRPO: Enhancing multimodal LLM reasoning via reflection-aware reinforcement learning. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025), https://openreview.net/forum?id=h3lyFa5e1W

  67. [67]

    Wang, H., Tang, H., Jiang, L., Shi, S., Naeem, M.F., Li, H., Schiele, B., Wang, L.: Git: Towards generalist vision transformer through universal language interface. In: European Conference on Computer Vision. pp. 55–73. Springer (2024)

  68. [68]

    Wang, H., Qiao, L., Jie, Z., Huang, Z., Feng, C., Zheng, Q., Ma, L., Lan, X., Liang, X.: X-sam: From segment anything to any segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 40, pp. 26187–26196 (2026)

  69. [69]

    Wang, K., Pan, J., Shi, W., Lu, Z., Ren, H., Zhou, A., Zhan, M., Li, H.: Measuring multimodal mathematical reasoning with MATH-vision dataset. In: The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2024), https://openreview.net/forum?id=QWTCcxMpPA

  70. [70]

    Wang, P., Li, Z.Z., Yin, F., Ran, D., Liu, C.L.: Mv-math: Evaluating multimodal math reasoning in multi-visual contexts. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 19541–19551 (2025)

  71. [71]

    Wang, X., Ru, L., Huang, Z., Ji, K., Zheng, D., Chen, J., Zhou, J.: ARGenseg: Image segmentation with autoregressive image generation model. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025), https://openreview.net/forum?id=rud3M6wlxH

  72. [72]

    Wang, X., Zhang, S., Li, S., Li, K., Kallidromitis, K., Kato, Y., Kozuka, K., Darrell, T.: SegLLM: Multi-round reasoning segmentation with large language models. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=Pm1NXHgzyf

  73. [73]

    Wang, Y., Wu, S., Zhang, Y., Yan, S., Liu, Z., Luo, J., Fei, H.: Multimodal chain-of-thought reasoning: A comprehensive survey. arXiv preprint arXiv:2503.12605 (2025)

  74. [74]

    Wei, C., Zhong, Y., Tan, H., Liu, Y., Zhao, Z., Hu, J., Yang, Y.: Hyperseg: Towards universal visual segmentation with large language model. arXiv preprint arXiv:2411.17606 (2024)

  75. [75]

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E.H., Le, Q.V., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. In: Oh, A.H., Agarwal, A., Belgrave, D., Cho, K. (eds.) Advances in Neural Information Processing Systems (2022), https://openreview.net/forum?id=_VjQlMeSB_J

  76. [76]

    Wu, T.H., Biamby, G., Chan, D., Dunlap, L., Gupta, R., Wang, X., Gonzalez, J.E., Darrell, T.: See say and segment: Teaching lmms to overcome false premises. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13459–13469 (2024)

  77. [77]

    Xia, Z., Han, D., Han, Y., Pan, X., Song, S., Huang, G.: Gsva: Generalized segmentation via multimodal large language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3858–3869 (June 2024)

  78. [78]

    Xiao, Y., Sun, E., Liu, T., Wang, W.: Logicvista: Multimodal llm logical reasoning benchmark in visual contexts (2024), https://arxiv.org/abs/2407.04973

  79. [79]

    Xu, G., Jin, P., Wu, Z., Li, H., Song, Y., Sun, L., Yuan, L.: Llava-cot: Let vision language models reason step-by-step. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2087–2098 (2025)

  80. [80]

    Xu, J., Fei, H., Zhang, Y., Pan, L., Huang, Q., Liu, Q., Nakov, P., Kan, M.Y., Wang, W.Y., Lee, M.L., Hsu, W.: MuSLR: Multimodal symbolic logical reasoning. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025), https://openreview.net/forum?id=uWEcZkrSkZ
