Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning

Haoran Hao; Xiangyu Yue; Xinyan Gao

arxiv: 2606.09303 · v1 · pith:YIQCXVVEnew · submitted 2026-06-08 · 💻 cs.CV

Pith reviewed 2026-06-27 16:59 UTC · model grok-4.3

classification 💻 cs.CV

keywords image segmentation · multimodal large language models · reasoning-based segmentation · candidate discovery · mask selection · attention maps · benchmark evaluation

The pith

For complex queries, segmentation is recast as two steps: first find candidate masks from attention maps, then have an MLLM score each candidate and pick the highest-scoring one.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a two-stage framework that treats image segmentation as candidate discovery followed by discriminative selection. Potential regions are first extracted as candidate masks from the attention maps of a segmentation MLLM. An MLLM then reasons jointly over the original query and each candidate mask to assign scores, after which the masks are reranked and the top one is chosen. The approach is motivated by the observation that direct mask generation from MLLMs struggles with limited training data and the separation between perception and mask output. A new benchmark is introduced to test perception, grounding, and multiple forms of reasoning, along with collected data for training the scoring step.

Core claim

Rea2Seg reformulates image segmentation for complex reasoning queries as a two-stage process: candidate masks are discovered from attention maps of a segmentation MLLM, after which a second MLLM reasons over the query and each candidate to produce scores; the highest-scoring mask is selected as output.
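
The selection step in the core claim can be sketched as a plain rerank-and-pick routine. This is a minimal illustration, not the authors' implementation: the nested-list mask representation, the names `rerank_candidates`, `select_mask`, and `toy_score`, and the tie-breaking rule are all assumptions. In particular, `toy_score` is a stand-in for the scoring MLLM and only counts foreground pixels.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical representation: a binary mask as nested lists of 0/1.
Mask = List[List[int]]
# Hypothetical stand-in for the scoring MLLM: (query, mask) -> score.
ScoreFn = Callable[[str, Mask], float]


def rerank_candidates(
    query: str, candidates: Sequence[Mask], score_fn: ScoreFn
) -> List[Tuple[float, int]]:
    """Score every candidate and return (score, index) pairs, best first."""
    scored = [(float(score_fn(query, mask)), idx) for idx, mask in enumerate(candidates)]
    # Sort by descending score; on ties the earlier candidate stays first (assumed rule).
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored


def select_mask(
    query: str, candidates: Sequence[Mask], score_fn: ScoreFn
) -> Tuple[Mask, List[Tuple[float, int]]]:
    """Return the highest-scoring candidate together with the full ranking."""
    if not candidates:
        raise ValueError("candidate discovery produced no masks")
    ranking = rerank_candidates(query, candidates, score_fn)
    best_index = ranking[0][1]
    return candidates[best_index], ranking


def toy_score(query: str, mask: Mask) -> float:
    """Toy scorer for illustration only: counts foreground pixels, ignores the query."""
    return float(sum(sum(row) for row in mask))
```

Keeping the scorer behind a function argument mirrors the separation the paper describes: the candidate list can come from any discovery stage, and the scorer can be swapped without touching the selection logic.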

What carries the argument

The Rea2Seg two-stage pipeline that separates mask candidate discovery from comparative reasoning-based selection.

If this is right

The method separates perception from final mask choice, allowing the MLLM to apply comparative reasoning after initial region proposals.
Training data collected for scoring enables the MLLM to jointly process queries and candidate masks.
The new ReasonSeg-SGDR benchmark measures performance across discriminative recognition, spatial reasoning, geometric reasoning, and multi-step reasoning with fine-grained masks.
Existing benchmarks are noted to emphasize commonsense reasoning that may not fully test joint visual and reasoning demands.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation into discovery and selection stages could be tested on queries that require iterative refinement beyond a single pass.
If attention maps prove reliable for candidate generation, similar discovery steps might apply to other MLLM-driven vision tasks that produce structured outputs.
The benchmark's multi-dimension design could serve as a template for evaluating whether gains come from perception, grounding, or the reasoning step itself.

Load-bearing premise

Attention maps from the segmentation MLLM contain the correct mask among the generated candidates even for complex reasoning queries.
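
One way to check this premise is to measure how often the candidate set contains a mask that overlaps the ground truth well enough. The sketch below computes a recall@K figure under assumptions that do not come from the paper: masks are nested lists of 0/1, a candidate counts as a hit when its IoU with the ground truth reaches a threshold (0.5 here), and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]  # hypothetical representation: nested lists of 0/1


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = sum(1 for ra, rb in zip(a, b) for x, y in zip(ra, rb) if x and y)
    union = sum(1 for ra, rb in zip(a, b) for x, y in zip(ra, rb) if x or y)
    return inter / union if union else 0.0


def candidate_recall_at_k(
    samples: Sequence[Tuple[Sequence[Mask], Mask]],
    k: int,
    iou_threshold: float = 0.5,  # assumed hit criterion, not taken from the paper
) -> float:
    """Fraction of samples whose first k candidates include a mask matching the ground truth."""
    if not samples:
        return 0.0
    hits = 0
    for candidates, ground_truth in samples:
        if any(iou(c, ground_truth) >= iou_threshold for c in candidates[:k]):
            hits += 1
    return hits / len(samples)
```

If this number is low for a reasoning dimension, no scoring model can recover the correct mask there, because the selection stage only chooses among what discovery provides.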

What would settle it

If reranking via MLLM reasoning on the ReasonSeg-SGDR benchmark yields no accuracy gain over the raw attention-derived masks or prior direct-generation baselines, the two-stage claim would be falsified.
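
The comparison described above amounts to a paired test over the same samples. As a rough sketch, assuming per-sample IoU values are already available for a baseline (for example the raw attention-derived mask) and for the reranked selection, the gain can be summarised as follows. The function name and the returned fields are hypothetical, and the paper may report different metrics.

```python
from statistics import mean
from typing import Dict, Sequence


def paired_gain(baseline_iou: Sequence[float], reranked_iou: Sequence[float]) -> Dict[str, float]:
    """Summarise the per-sample change in IoU when reranking replaces the baseline mask."""
    if len(baseline_iou) != len(reranked_iou) or not baseline_iou:
        raise ValueError("need two non-empty IoU lists of equal length")
    diffs = [r - b for b, r in zip(baseline_iou, reranked_iou)]
    return {
        "mean_gain": mean(diffs),  # average IoU change per sample
        "improved": sum(d > 0 for d in diffs) / len(diffs),  # share of samples that got better
        "worsened": sum(d < 0 for d in diffs) / len(diffs),  # share of samples that got worse
    }
```

A mean gain near zero, or a worsened share comparable to the improved share, would point toward the outcome the paragraph above treats as falsifying.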

Figures

Figures reproduced from arXiv: 2606.09303 by Haoran Hao, Xiangyu Yue, Xinyan Gao.

**Figure 2.** The proposed ReasonSeg-SGDR benchmark. It focuses on the joint use of visual perception and high-level reasoning. It spans multiple dimensions, including discriminative, geometric, spatial, and multi-step reasoning, all of which require fine-grained mask generation. The target region is identified by carefully examining the image and reasoning over the visual evidence.

**Figure 3.** The overall framework of Rea2Seg. Rea2Seg consists of an attention-driven candidate mask generator and a mask evaluator for selection. Specifically, (1) candidate regions are first identified based on the attention maps of a segmentation MLLM; (2) an MLLM then performs reasoning and assigns scores to all candidate masks; and (3) the candidates are reranked, and the highest-scoring mask is selected as the f…

Abstract

The rapid development of pretrained foundation models has enabled more general image segmentation. Multimodal large language models (MLLMs) have been widely explored for image segmentation with complex queries that require high-level reasoning. Despite promising progress, existing methods are often constrained by limited training data and the gap between MLLMs and mask generation modules. To better transfer MLLMs' perception and reasoning ability to complex reasoning-based segmentation tasks, we propose a two-stage framework Rea2Seg for mask generation and selection. Specifically, the framework first identifies potential regions as candidate masks based on the attention maps of a segmentation MLLM. It then employs an MLLM to reason over the question and candidate masks and assign scores to each mask. The final segmentation result is obtained by reranking the candidates and selecting the highest-scoring mask, reformulating image segmentation as candidate discovery followed by discriminative mask selection. We also notice that a large portion of questions in existing benchmarks focus on commonsense reasoning, and these questions usually do not fully require joint visual observation and reasoning. To address this issue, we introduce a new benchmark called ReasonSeg-SGDR that comprehensively evaluates a model's perception, grounding, and reasoning abilities across multiple dimensions, including discriminative recognition, spatial reasoning, geometric reasoning, and multi-step reasoning, with fine-grained mask generation. In addition, we collect training data to enhance MLLMs' ability to jointly understand multimodal queries and candidate masks, and to assign scores through reasoning. Experimental results on the proposed benchmark and ReasonSeg demonstrate the effectiveness of the unified mask generation and selection framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Rea2Seg splits segmentation into attention-based candidate generation followed by MLLM scoring, but there is no evidence that the candidate stage captures the correct mask on the new benchmark's harder queries.

The paper's main contribution is Rea2Seg, a two-stage setup where a segmentation MLLM first produces candidate masks from its attention maps, after which a second MLLM reasons over the query and the candidates to score and pick the best one. They also introduce ReasonSeg-SGDR, a benchmark that adds tests for discriminative, spatial, geometric, and multi-step reasoning with mask outputs.

The benchmark is the clearer advance. Existing datasets often stick to straightforward commonsense questions that do not force tight visual-reasoning loops, so building one that spans those four dimensions is a direct response to a real limitation.

The method itself is cleanly described as candidate discovery followed by selection. Collecting extra training data to improve the scoring MLLM's ability to compare masks is a sensible supporting step.

The soft spot is the first stage. The approach assumes the attention maps will reliably include the ground-truth region among the candidates, yet on multi-step or geometric queries the maps can be diffuse or fixate on salient distractors. No recall figures for the candidate sets and no recovery mechanisms are mentioned, so it is unclear whether the reranking step ever gets a chance to work. The abstract states that results on ReasonSeg and the new benchmark show effectiveness, but without numbers, ablations, or error analysis the size of the improvement is hard to judge.

This is for labs already working on MLLM segmentation who need benchmarks that push beyond simple queries. It deserves peer review so the candidate coverage and experimental details can be checked.

Referee Report

2 major / 1 minor

Summary. The paper proposes Rea2Seg, a two-stage framework for complex reasoning-based image segmentation with MLLMs. Stage 1 generates candidate masks from attention maps of a segmentation MLLM. Stage 2 uses an MLLM to reason over the query and candidates, assign scores, and select the highest-scoring mask via reranking. The authors introduce the ReasonSeg-SGDR benchmark targeting discriminative recognition, spatial reasoning, geometric reasoning, and multi-step reasoning with fine-grained masks, collect training data for the scoring stage, and claim that experiments on ReasonSeg-SGDR and ReasonSeg demonstrate the framework's effectiveness.

Significance. If the central claims hold, the decoupling of candidate discovery from discriminative selection could be a practical way to leverage MLLM reasoning without retraining mask generators, and ReasonSeg-SGDR would address limitations in existing benchmarks that over-rely on commonsense queries. The approach is a natural extension of current MLLM segmentation work, but its significance cannot be assessed without evidence that the attention-based candidate stage achieves sufficient recall on the new benchmark's harder reasoning dimensions.

major comments (2)

[Abstract / candidate discovery stage] The load-bearing assumption of the first stage (Abstract and method description) that attention maps from the segmentation MLLM reliably surface a candidate set containing the ground-truth mask is not validated. No recall@K, coverage statistics, or failure-case analysis is reported for multi-step reasoning queries on ReasonSeg-SGDR, where attention may be diffuse or latch onto incorrect salient regions.
[Abstract / experimental claims] The abstract states that 'experimental results on the proposed benchmark and ReasonSeg demonstrate the effectiveness,' yet no quantitative results, ablation studies, baselines, error bars, or dataset statistics are supplied. This prevents any evaluation of whether the two-stage reranking improves over direct MLLM segmentation or whether the collected training data yields measurable gains.

minor comments (1)

[Abstract] The benchmark description would benefit from one concrete query example per dimension (discriminative, spatial, geometric, multi-step) to clarify the distinctions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below and will revise the manuscript to incorporate additional validation and experimental details.

Referee: [Abstract / candidate discovery stage] The load-bearing assumption of the first stage (Abstract and method description) that attention maps from the segmentation MLLM reliably surface a candidate set containing the ground-truth mask is not validated. No recall@K, coverage statistics, or failure-case analysis is reported for multi-step reasoning queries on ReasonSeg-SGDR, where attention may be diffuse or latch onto incorrect salient regions.

Authors: We agree that explicit validation of the candidate discovery stage is necessary to support the framework, especially on the more challenging multi-step reasoning queries in ReasonSeg-SGDR. The current manuscript prioritizes end-to-end segmentation performance but does not report recall@K, coverage, or failure-case analysis for this stage. We will add these metrics with breakdowns by reasoning dimension and include failure-case discussion in the revision. revision: yes
Referee: [Abstract / experimental claims] The abstract states that 'experimental results on the proposed benchmark and ReasonSeg demonstrate the effectiveness,' yet no quantitative results, ablation studies, baselines, error bars, or dataset statistics are supplied. This prevents any evaluation of whether the two-stage reranking improves over direct MLLM segmentation or whether the collected training data yields measurable gains.

Authors: We acknowledge that the current manuscript version does not supply the requested quantitative details, ablations, baselines, error bars, or dataset statistics. We will add these elements, including direct comparisons showing gains from reranking over single-stage MLLM segmentation and the contribution of the collected training data, to substantiate the abstract claims. revision: yes

Circularity Check

0 steps flagged

No circularity detected; framework is self-contained empirical proposal

The paper describes a two-stage Rea2Seg framework that uses attention maps from a segmentation MLLM for candidate mask discovery followed by MLLM-based scoring and reranking. No equations, fitted parameters, self-citations, or uniqueness theorems appear in the abstract or described content. The central claim is an empirical method reformulation supported by a new benchmark and collected training data, with no reduction of outputs to inputs by construction. This matches the most common honest finding of a self-contained proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities.

Reference graph

Works this paper leans on

122 extracted references · 31 canonical work pages · 15 internal anchors

[1]

TPAMI (2017) 4

Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: A deep convolutional encoder-decoder architecture for image segmentation. TPAMI (2017) 4

2017
[2]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025) 11

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

In: European Conference on Computer Vision

Bao, X., Sun, S., Ma, S., Zheng, K., Guo, Y., Zhao, G., Zheng, Y., Wang, X.: Cores: Orchestrating the dance of reasoning and segmentation. In: European Conference on Computer Vision. pp. 187–204. Springer (2024) 2, 4, 12

2024
[4]

Caesar, H., Uijlings, J., Ferrari, V.: Coco-stuff: Thing and stuff classes in context (2018),https://arxiv.org/abs/1612.0371611, 24

work page internal anchor Pith review Pith/arXiv arXiv 2018
[5]

arXiv preprint arXiv:2410.08209 (2024) 9

Cao, S., Gui, L.Y., Wang, Y.X.: Emergent visual grounding in large multimodal models without grounding supervision. arXiv preprint arXiv:2410.08209 (2024) 9

work page arXiv 2024
[6]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Cao, S., Wei, Z., Kuen, J., Liu, K., Zhang, L., Gu, J., Jung, H., Gui, L.Y., Wang, Y.X.: Refer to any segmentation mask group with vision-language prompts. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 21853–21863 (2025) 7, 26, 27

2025
[7]

SAM 3: Segment Anything with Concepts

Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., Rädle, R., Afouras, T., Mavroudi, E., Xu, K., Wu, T.H., Zhou, Y., Momeni, L., Hazra, R., Ding, S., Vaze, S., Porcher, F., Li, F., Li, S., Kamath, A., Cheng, H.K., ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

TPAMI (2018) 4

Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI (2018) 4

2018
[9]

Chen, X., Elbayad, M., Nguyen, J., Verbeek, J.: VUGEN: Visual understanding priors for GENeration (2026),https://openreview.net/forum?id=tubF5vyrQ05

2026
[10]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Chen, X., Mottaghi, R., Liu, X., Fidler, S., Urtasun, R., Yuille, A.: Detect what you can: Detecting and representing objects using holistic models and body parts. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1971–1978 (2014) 11, 24

1971
[11]

In: European Conference on Computer Vision

Chen, Y.C., Li, W.H., Sun, C., Wang, Y.C.F., Chen, C.S.: Sam4mllm: Enhance multi-modal large language model for referring expression segmentation. In: European Conference on Computer Vision. pp. 323–340. Springer (2024) 5

2024
[12]

In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024),https://openreview.net/forum?id=JKEIYQUSUc5

Cheng, A.C., Yin, H., Fu, Y., Guo, Q., Yang, R., Kautz, J., Wang, X., Liu, S.: SpatialRGPT: Grounded spatial reasoning in vision-language models. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024),https://openreview.net/forum?id=JKEIYQUSUc5

2024
[13]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025) 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Deitke, M., Clark, C., Lee, S., Tripathi, R., Yang, Y., Park, J.S., Salehi, M., Muennighoff, N., Lo, K., Soldaini, L., et al.: Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 91–104 (2025) 25 Reason Twice 17

2025
[15]

In: The Fourteenth In- ternational Conference on Learning Representations (2026),https://openreview

Du, T., Li, H., Fan, Z., Zhang, J., Pan, P., Zhang, Y.: SAM-veteran: An MLLM- based human-like SAM agent for reasoning segmentation. In: The Fourteenth In- ternational Conference on Learning Representations (2026),https://openreview. net/forum?id=oN55r8iJJW5

2026
[16]

In: The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2025),https://openreview.net/forum?id=DgH9YCsqWm5, 7, 26, 28

Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., Wu, Y., Ji, R., Shan, C., He, R.: MME: A comprehensive evaluation benchmark for multimodal large language models. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2025),https://openreview.net/forum?id...

2025
[17]

In: CVPR (2019) 4

Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., Lu, H.: Dual attention network for scene segmentation. In: CVPR (2019) 4

2019
[18]

In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025),https://openreview.net/forum? id=ILr4UNiZcQ5

Han, J., Chen, H., Zhao, Y., Wang, H., Zhao, Q., Yang, Z., He, H., Yue, X., Jiang, L.: Vision as a dialect: Unifying visual understanding and generation via text-aligned representations. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025),https://openreview.net/forum? id=ILr4UNiZcQ5

2025
[19]

In: Proceedings of the IEEE international conference on computer vision

He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision. pp. 2961–2969 (2017) 4

2017
[20]

In: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29,

Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: Lora: Low-rank adaptation of large language models. In: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29,

2022
[21]

OpenReview.net (2022),https://openreview.net/forum?id=nZeVKeeFYf912

2022
[22]

arXiv preprint arXiv:2505.22596 (2025) 5, 13

Huang, J., Xu, Z., Zhou, J., Liu, T., Xiao, Y., Ou, M., Ji, B., Li, X., Yuan, K.: Sam-r1: Leveraging sam for reward feedback in multimodal segmentation via reinforcement learning. arXiv preprint arXiv:2505.22596 (2025) 5, 13

work page arXiv 2025
[23]

In: Proceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition

Hudson, D.A., Manning, C.D.: Gqa: A new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition. pp. 6700–6709 (2019) 11, 29

2019
[24]

In: The Thirteenth International Conference on Learning Representations (2025),https: //openreview.net/forum?id=mzL19kKE3r3, 5

Jang, D., Cho, Y., Lee, S., Kim, T., Kim, D.: MMR: A large-scale benchmark dataset for multi-target and multi-granularity reasoning segmentation. In: The Thirteenth International Conference on Learning Representations (2025),https: //openreview.net/forum?id=mzL19kKE3r3, 5

2025
[25]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Johnson, J., Hariharan, B., Van Der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., Girshick, R.: Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2901–2910 (2017) 27

2017
[26]

See what you are told: Visual attention sink in large multimodal models.arXiv preprint arXiv:2503.03321, 2025

Kang, S., Kim, J., Kim, J., Hwang, S.J.: See what you are told: Visual attention sink in large multimodal models. arXiv preprint arXiv:2503.03321 (2025) 9

work page arXiv 2025
[27]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Kang, S., Kim, J., Kim, J., Hwang, S.J.: Your large vision-language model only needs a few attention heads for visual grounding. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 9339–9350 (2025) 3, 5, 9

2025
[28]

In: EMNLP (2014) 2, 11, 24

Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: Referitgame: Referring to objects in photographs of natural scenes. In: EMNLP (2014) 2, 11, 24

2014
[29]

Advances in Neural Information Processing Systems36, 29914–29934 (2023) 7

Ke, L., Ye, M., Danelljan, M., Tai, Y.W., Tang, C.K., Yu, F., et al.: Segment anything in high quality. Advances in Neural Information Processing Systems36, 29914–29934 (2023) 7

2023
[30]

Segment Anything

Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. arXiv:2304.02643 (2023) 2 18 X. Gao and H. Hao et al

work page internal anchor Pith review Pith/arXiv arXiv 2023
[31]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J.: Lisa: Reasoning segmentation via large language model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9579–9589 (2024) 2, 3, 4, 5, 7, 9, 11, 12, 13, 24

2024
[32]

In: The Thirteenth International Conference on Learning Representations (2025),https: //openreview.net/forum?id=vkakKdznFS5

Lan, M., Chen, C., Zhou, Y., Xu, J., Ke, Y., Wang, X., Feng, L., Zhang, W.: Text4seg: Reimagining image segmentation as text generation. In: The Thirteenth International Conference on Learning Representations (2025),https: //openreview.net/forum?id=vkakKdznFS5

2025
[33]

Computer vision and image understanding 184, 45–56 (2019) 7, 26

Le, T.N., Nguyen, T.V., Nie, Z., Tran, M.T., Sugimoto, A.: Anabranch network for camouflaged object segmentation. Computer vision and image understanding 184, 45–56 (2019) 7, 26

2019
[34]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2025) 2, 4

Li, Z., Yang, B., Liu, Q., Zhang, S., Ma, Z., Yin, L., Deng, L., Sun, Y., Liu, Y., Bai, X.: Lira: Inferring segmentation in large multi-modal models with local interleaved region assistance. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2025) 2, 4

2025
[35]

In: Winter Conference on Applications of Computer Vision (WACV) (2021) 7, 26

Liew, J.H., Cohen, S., Price, B., Mai, L., Feng, J.: Deep interactive thin object selection. In: Winter Conference on Applications of Computer Vision (WACV) (2021) 7, 26

2021
[36]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Liu, C., Ding, H., Jiang, X.: Gres: Generalized referring expression segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 23592–23601 (2023) 4, 7, 26, 27

2023
[37]

Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning (2023) 11, 24

2023
[38]

In: CVPR (2026) 5

Liu, J., Feng, M., Chen, L.: Better, stronger, faster: Tackling the trilemma in mllm- based segmentation with simultaneous textual mask prediction. In: CVPR (2026) 5

2026
[39]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Liu, R., Liu, C., Bai, Y., Yuille, A.L.: Clevr-ref+: Diagnosing visual reasoning with referring expressions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4185–4194 (2019) 7, 26, 27

2019
[40]

arXiv preprint arXiv:2509.18094 (2025) 5

Liu, Y., Ma, Z., Pu, J., Qi, Z., Wu, Y., Shan, Y., Chen, C.W.: Unipixel: Unified object referring and segmentation for pixel-level visual reasoning. arXiv preprint arXiv:2509.18094 (2025) 5

work page arXiv 2025
[41]

Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

Liu, Y., Peng, B., Zhong, Z., Yue, Z., Lu, F., Yu, B., Jia, J.: Seg-zero: Reasoning-chain guided segmentation via cognitive reinforcement. arXiv preprint arXiv:2503.06520 (2025) 5, 12, 13

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

arXiv preprint arXiv:2505.12081 (2025) 5, 12

Liu, Y., Qu, T., Zhong, Z., Peng, B., Liu, S., Yu, B., Jia, J.: Visionreasoner: Unified visual perception and reasoning via reinforcement learning. arXiv preprint arXiv:2505.12081 (2025) 5, 12

work page arXiv 2025
[43]

Decoupled Weight Decay Regularization

Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) 24

work page internal anchor Pith review Pith/arXiv arXiv 2017
[44]

arXiv preprint arXiv:2506.04277 (2025) 13

Lu, Y., Cao, J., Wu, Y., Li, B., Tang, L., Ji, Y., Wu, C., Wu, J., Zhu, W.: Rsvp: Reasoning segmentation via visual prompting and multi-modal chain-of-thought. arXiv preprint arXiv:2506.04277 (2025) 13

work page arXiv 2025
[45]

arXiv preprint arXiv:2510.11173 (2025) 3, 5, 13

Lu, Z., Li, L., Wang, J., Feng, Y., Chen, B., Chen, K., Wang, Y.: Coprs: Learning positional prior from chain-of-thought for reasoning segmentation. arXiv preprint arXiv:2510.11173 (2025) 3, 5, 13

work page arXiv 2025
[46]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Ma, W., Chen, H., Zhang, G., Chou, Y.C., Chen, J., de Melo, C., Yuille, A.: 3dsrbench: A comprehensive 3d spatial reasoning benchmark. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6924–6934 (2025) 5 Reason Twice 19

2025
[47]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A.L., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 11–20 (2016) 11, 25

2016
[48]

IEEE Transactions on Pattern Analysis and Machine Intelligence44(7), 3523–3542 (2022).https://doi.org/10

Minaee, S., Boykov, Y., Porikli, F., Plaza, A., Kehtarnavaz, N., Terzopoulos, D.: Image segmentation using deep learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence44(7), 3523–3542 (2022).https://doi.org/10. 1109/TPAMI.2021.30599684

work page arXiv 2022
[49]

In: ECCV (2016) 2

Nagaraja, V.K., Morariu, V.I., Davis, L.S.: Modeling context between objects for referring expression understanding. In: ECCV (2016) 2

2016
[50]

In: ICCV (2015) 4

Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. In: ICCV (2015) 4

2015
[51]

DINOv2: Learning Robust Visual Features without Supervision

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023) 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[52]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Park, J.S., Ma, Z., Li, L., Zheng, C., Hsieh, C.Y., Lu, X., Chandu, K., Kong, Q., Kobori, N., Farhadi, A., et al.: Synthetic visual genome. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 9073–9086 (2025) 25

2025
[53]

In: International Conference on Computer Vision (ICCV) (October 2023) 7, 11, 26, 27, 29

Qi, L., Kuen, J., Shen, T., Gu, J., Guo, W., Jia, J., Lin, Z., Yang, M.H.: High- quality entity segmentation. In: International Conference on Computer Vision (ICCV) (October 2023) 7, 11, 26, 27, 29

2023
[54]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025) 2, 4, 7, 9

Qian, R., Yin, X., Dou, D.: Reasoning to attend: Try to understand how< seg> token works. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025) 2, 4, 7, 9

2025
[55]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Qian, R., Yin, X., Dou, D.: Reasoning to attend: Try to understand how< seg> token works. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 24722–24731 (2025) 9, 12, 13

2025
[56]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Qin, J., Wu, J., Yan, P., Li, M., Yuxi, R., Xiao, X., Wang, Y., Wang, R., Wen, S., Pan, X., Wang, X.: Freeseg: Unified, universal and open-vocabulary image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19446–19455 (June 2023) 4

2023
[57]

In: International conference on machine learning

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021) 4

2021
[58]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Rasheed, H., Maaz, M., Shaji, S., Shaker, A., Khan, S., Cholakkal, H., Anwer, R.M., Xing, E., Yang, M.H., Khan, F.S.: Glamm: Pixel grounding large multimodal model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 13009–13018 (June 2024) 5

2024
[59]

SAM 2: Segment Anything in Images and Videos

Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024) 4, 25

[60]

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y.K., Wu, Y., Guo, D.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models (2024), https://arxiv.org/abs/2402.03300

[61]

Shelhamer, E., Long, J., Darrell, T.: Fully convolutional networks for semantic segmentation. TPAMI (2017)

[62]

OpenAI GPT-5 System Card

Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., Nathan, A., Luo, A., Helyar, A., Madry, A., Efremov, A., Spyra, A., Baker-Whitcomb, A., Beutel, A., Karpenko, A., Makelov, A., Neitz, A., Wei, A., Barr, A., Kirchmeyer, A., Ivanov, A., Christakis, A., Gille...

[63]

Szeliski, R.: Computer vision: algorithms and applications. Springer Nature (2022)

[64]

Tang, H., Xie, C.W., Wang, H., Bao, X., Weng, T., Li, P., Zheng, Y., Wang, L.: UFO: A unified approach to fine-grained visual perception via open-ended language interface. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025), https://openreview.net/forum?id=8omLr8BtjL

[65]

Gemma 3 Technical Report

Team, G., Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ramé, A., Rivière, M., Rouillard, L., Mesnard, T., Cideron, G., bastien Grill, J., Ramos, S., Yvinec, E., Casbon, M., Pot, E., Penchev, I., Liu, G., Visin, F., Kenealy, K., Beyer, L., Zhai, X., Tsitsulin, A., Busa-Fekete, R., Feng, A., Sachdeva, N., Cole...

[66]

Wan, Z., Dou, Z., Liu, C., Zhang, Y., Cui, D., Zhao, Q., Shen, H., Xiong, J., Xin, Y., Jiang, Y., Tao, C., He, Y., Zhang, M., Yan, S.: SRPO: Enhancing multimodal LLM reasoning via reflection-aware reinforcement learning. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025), https://openreview.net/forum?id=h3lyFa5e1W

[67]

Wang, H., Tang, H., Jiang, L., Shi, S., Naeem, M.F., Li, H., Schiele, B., Wang, L.: Git: Towards generalist vision transformer through universal language interface. In: European Conference on Computer Vision. pp. 55–73. Springer (2024)

[68]

Wang, H., Qiao, L., Jie, Z., Huang, Z., Feng, C., Zheng, Q., Ma, L., Lan, X., Liang, X.: X-sam: From segment anything to any segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 40, pp. 26187–26196 (2026)

[69]

Wang, K., Pan, J., Shi, W., Lu, Z., Ren, H., Zhou, A., Zhan, M., Li, H.: Measuring multimodal mathematical reasoning with MATH-vision dataset. In: The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2024), https://openreview.net/forum?id=QWTCcxMpPA

[70]

Wang, P., Li, Z.Z., Yin, F., Ran, D., Liu, C.L.: Mv-math: Evaluating multimodal math reasoning in multi-visual contexts. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 19541–19551 (2025)

[71]

Wang, X., Ru, L., Huang, Z., Ji, K., Zheng, D., Chen, J., ZHOU, J.: ARGenseg: Image segmentation with autoregressive image generation model. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025), https://openreview.net/forum?id=rud3M6wlxH

[72]

Wang, X., Zhang, S., Li, S., Li, K., Kallidromitis, K., Kato, Y., Kozuka, K., Darrell, T.: SegLLM: Multi-round reasoning segmentation with large language models. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=Pm1NXHgzyf

[73]

Wang, Y., Wu, S., Zhang, Y., Yan, S., Liu, Z., Luo, J., Fei, H.: Multimodal chain-of-thought reasoning: A comprehensive survey. arXiv preprint arXiv:2503.12605 (2025)

[74]

Wei, C., Zhong, Y., Tan, H., Liu, Y., Zhao, Z., Hu, J., Yang, Y.: Hyperseg: Towards universal visual segmentation with large language model. arXiv preprint arXiv:2411.17606 (2024)

[75]

Wei, J., Wang, X., Schuurmans, D., Bosma, M., brian ichter, Xia, F., Chi, E.H., Le, Q.V., Zhou, D.: Chain of thought prompting elicits reasoning in large language models. In: Oh, A.H., Agarwal, A., Belgrave, D., Cho, K. (eds.) Advances in Neural Information Processing Systems (2022), https://openreview.net/forum?id=_VjQlMeSB_J

[76]

Wu, T.H., Biamby, G., Chan, D., Dunlap, L., Gupta, R., Wang, X., Gonzalez, J.E., Darrell, T.: See say and segment: Teaching lmms to overcome false premises. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13459–13469 (2024)

[77]

Xia, Z., Han, D., Han, Y., Pan, X., Song, S., Huang, G.: Gsva: Generalized segmentation via multimodal large language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3858–3869 (June 2024)

[78]

Xiao, Y., Sun, E., Liu, T., Wang, W.: Logicvista: Multimodal llm logical reasoning benchmark in visual contexts (2024), https://arxiv.org/abs/2407.04973

[79]

Xu, G., Jin, P., Wu, Z., Li, H., Song, Y., Sun, L., Yuan, L.: Llava-cot: Let vision language models reason step-by-step. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2087–2098 (2025)

[80]

Xu, J., Fei, H., Zhang, Y., Pan, L., Huang, Q., Liu, Q., Nakov, P., Kan, M.Y., Wang, W.Y., Lee, M.L., Hsu, W.: MuSLR: Multimodal symbolic logical reasoning. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025), https://openreview.net/forum?id=uWEcZkrSkZ

