From Structure to Synergy: A Survey of Vision-Language Perception Paradigm Evolution in Multimodal Large Language Models

Haoxiang Sun; Jiancheng Lv; Jian Zhao; Li Yuan; Tao Wang

arxiv: 2606.26196 · v1 · pith:VPOLTG3Inew · submitted 2026-06-24 · 💻 cs.CL · cs.AI · cs.CV · cs.LG · cs.MM

Pith reviewed 2026-06-26 01:49 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.CV cs.LG cs.MM

keywords multimodal large language models, vision-language perception, paradigm evolution, five-stage taxonomy, unified multimodal intelligence, survey

The pith

MLLM perception is formalized as a single unified vision-language capability that evolves through five distinct paradigms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the first systematic survey that treats vision and language as an inseparable modality in multimodal large language models. It formalizes perception in these models as an intrinsic unified capability similar to human perception and traces its development across five stages. A reader would care because this integrated view could replace fragmented approaches that study vision or language in isolation. The survey also maps representative methods at each stage and points to open challenges on the path to general multimodal intelligence.

Core claim

MLLM perception is an intrinsic, unified vision-language capability analogous to human innate perception; the paper introduces a five-stage taxonomy that traces its paradigm evolution from early separate structures to synergistic integration and identifies open challenges toward general unified multimodal intelligence.

What carries the argument

A five-stage taxonomy tracing the paradigm evolution of MLLM perception from structure to synergy.

If this is right

Design of future MLLMs will treat perception as one integrated module rather than separate vision and language components.
Evaluation benchmarks will shift toward measuring cross-modal synergy instead of isolated modality performance.
Research roadmaps will prioritize methods that advance the later stages of the taxonomy.
Progress toward general multimodal intelligence will be measured by how well models achieve unified perception.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The taxonomy could be tested by checking whether new models fit cleanly into one of the five stages or require an additional stage.
Insights from this survey might suggest concrete experiments that combine perception tasks across stages to measure synergy gains.
If the unified lens holds, it would imply that separate vision-only or language-only improvements are insufficient for the next generation of models.
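
The fit check in the first point above can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: the stage labels `stage_1` to `stage_5`, the function name `taxonomy_fit`, and the example assignments are all hypothetical placeholders, and the survey's actual stage names and assignment criteria are not assumed here.

```python
from collections import Counter

# Hypothetical placeholder labels; the survey's real stage names are not used here.
STAGES = ("stage_1", "stage_2", "stage_3", "stage_4", "stage_5")


def taxonomy_fit(assignments):
    """Summarize how cleanly models map onto the five stages.

    `assignments` maps a model name to the set of stage labels a reviewer
    judged it to belong to (an empty set means it fits no stage).
    Returns the share of models that are clean (exactly one stage),
    ambiguous (several stages), or unplaced (no stage).
    """
    counts = Counter()
    for model, stages in assignments.items():
        unknown = set(stages) - set(STAGES)
        if unknown:
            raise ValueError(f"{model}: unknown stage labels {sorted(unknown)}")
        if len(stages) == 1:
            counts["clean"] += 1
        elif len(stages) == 0:
            counts["unplaced"] += 1
        else:
            counts["ambiguous"] += 1
    total = len(assignments)
    if total == 0:
        return {}
    return {key: counts[key] / total for key in ("clean", "ambiguous", "unplaced")}


# Made-up assignments, for illustration only.
example = {
    "model_a": {"stage_2"},
    "model_b": {"stage_3", "stage_4"},
    "model_c": set(),
    "model_d": {"stage_5"},
}
print(taxonomy_fit(example))
```

A high unplaced share would point toward a missing stage, while a high ambiguous share would suggest the stage boundaries need sharper criteria.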

Load-bearing premise

Existing surveys are too fragmented because they examine vision and language separately, so a single unified vision-language lens is the right way to track how perception develops in these models.

What would settle it

Discovery of an earlier survey that already presents a unified five-stage taxonomy of vision-language perception in MLLMs would falsify the claim that this is the first such systematic treatment.

Figures

Figures reproduced from arXiv: 2606.26196 by Haoxiang Sun, Jiancheng Lv, Jian Zhao, Li Yuan, Tao Wang.

**Figure 2.** Organization of our five-stage evolutionary framework. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png]

**Figure 3.** Encoder-Centric Optimization Strategies for Multimodal Perception in MLLMs. (a) Region [PITH_FULL_IMAGE:figures/full_fig_p007_3.png]

**Figure 4.** Decoder-Centric Optimization Strategies for Multimodal Perception in MLLMs. (a) Auxiliary [PITH_FULL_IMAGE:figures/full_fig_p010_4.png]

**Figure 5.** Dynamic Perception via Adaptive Processing. (a) External Tool Scheduling: The LLM decides [PITH_FULL_IMAGE:figures/full_fig_p016_5.png]

**Figure 6.** Architecture-Free Strategies for Perception Enhancement. (a) Instruction-Based Strategies: the [PITH_FULL_IMAGE:figures/full_fig_p020_6.png]

**Figure 7.** Case study of OpenAI o3’s thinking with images, reaching the correct answer after 42 seconds of [PITH_FULL_IMAGE:figures/full_fig_p024_7.png]

Original abstract

Multimodal Large Language Models (MLLMs) have recently made remarkable progress in unifying vision-language understanding and reasoning, especially following the introduction of models such as OpenAI's O-series and DeepSeek's R-series, which have driven a paradigm shift toward perception-centric intelligence. However, there remains a lack of systematic surveys that examine perception from a truly unified vision-language perspective -- one that treats vision and language as an inseparable modality. Existing reviews are often fragmented, focusing separately on either vision or language, and thus rarely capture the cross-modal evolution of perception as an integrated capability. To bridge this gap, we present the first systematic survey of unified vision-language perception in MLLMs. Specifically, we (1) formalize MLLM perception as an intrinsic, unified vision-language capability analogous to human innate perception, (2) introduce a five-stage taxonomy tracing the paradigm evolution of MLLM perception and survey representative methods and milestones at each phase, and (3) identify open challenges and outline promising research directions toward truly general, unified multimodal intelligence. We hope our study will provide both a foundational understanding and an actionable roadmap to foster further innovation on the path toward artificial general intelligence (AGI).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a survey that organizes MLLM work into a five-stage taxonomy around unified vision-language perception, with no new methods or results.

This paper is a literature survey that formalizes MLLM perception as a single intrinsic vision-language capability and lays out a five-stage taxonomy for how that capability has evolved. The stages trace shifts from early separate-modality work through more integrated approaches, with examples of models and methods at each point plus a section on open challenges.

The useful part is the synthesis. Pulling milestones into one timeline under a unified lens can help someone entering the area see the progression without jumping between vision-only and language-only reviews. The challenge list at the end is straightforward and points to concrete gaps like better cross-modal grounding.

The soft spots are the usual ones for taxonomy papers. The claim that this is the first systematic unified survey rests on the assertion that prior work stayed fragmented, but that needs explicit comparison to the most recent MLLM surveys to hold up. The taxonomy itself is an organizational choice rather than something derived from new data or proofs, so its lasting value will depend on whether the community finds the five stages more clarifying than existing breakdowns. No internal contradictions show up in the structure, and the citations appear to draw from standard sources.

This is for researchers who need a quick map of the perception side of MLLMs rather than for people looking for new experiments. It deserves peer review because a well-organized survey can reduce duplication even if the taxonomy ends up being debated.

Referee Report

1 major / 2 minor

Summary. The manuscript is a survey on vision-language perception in Multimodal Large Language Models (MLLMs). It formalizes MLLM perception as an intrinsic, unified vision-language capability analogous to human innate perception, claims to be the first systematic survey from this unified perspective, introduces a five-stage taxonomy tracing paradigm evolution, surveys representative methods and milestones at each stage, and identifies open challenges with future research directions toward general multimodal intelligence and AGI.

Significance. If the five-stage taxonomy is well-motivated and the survey comprehensively covers the literature without major omissions, the work could provide a useful organizational lens for the field, highlighting the shift to perception-centric models following recent O-series and R-series models. The unified-modality framing may help consolidate fragmented prior reviews, though its value depends on explicit comparisons to existing surveys.

major comments (1)

[Abstract] The assertion that 'existing reviews are often fragmented, focusing separately on either vision or language' is central to the novelty claim but lacks supporting citations or a dedicated comparison subsection; without this, the justification for the five-stage taxonomy as filling a unique gap remains under-supported.

minor comments (2)

The five-stage taxonomy should include explicit classification criteria or decision rules for assigning methods to stages to enhance reproducibility.
Consider adding a summary table listing the five stages, key models/milestones, and representative papers for improved readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the constructive review and the minor revision recommendation. We address the single major comment below.

Referee: [Abstract] The assertion that 'existing reviews are often fragmented, focusing separately on either vision or language' is central to the novelty claim but lacks supporting citations or a dedicated comparison subsection; without this, the justification for the five-stage taxonomy as filling a unique gap remains under-supported.

Authors: We agree the abstract claim would be strengthened by explicit citations and a brief comparison. In revision we will (1) add 3-4 representative citations to prior surveys that treat vision or language in isolation and (2) insert a short paragraph in the introduction that contrasts those works with our unified perception framing, thereby better motivating the five-stage taxonomy.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; survey with no derivation chain

This is a literature review paper that formalizes a concept and proposes a five-stage taxonomy based on surveying existing external work. No equations, predictions, fitted parameters, or load-bearing self-citations appear in the provided text or abstract. The central claims are organizational and do not reduce to any input by construction, satisfying the criteria for a self-contained survey with score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper that introduces no free parameters, mathematical axioms, or invented entities; the taxonomy is an organizational construct rather than a derived result.

pith-pipeline@v0.9.1-grok · 5764 in / 971 out tokens · 43113 ms · 2026-06-26T01:49:38.561094+00:00 · methodology


[1] [1]

J. Wang, L. Yuan, Y. Zhang, H. Sun, Tarsier: Recipes for training and evaluating large video description models, arXiv preprint arXiv:2407.00634 (2024)

arXiv 2024

[2] [2]

W. Chai, E. Song, Y. Du, C. Meng, V. Madhavan, O. Bar-Tal, J.-N. Hwang, S. Xie, C. D. Manning, Auroracap: Efficient, performant video detailed captioning and a new benchmark, arXiv preprint arXiv:2410.03051 (2024)

arXiv 2024

[3] [3]

L. Yu, P. Poirson, S. Yang, A. C. Berg, T. L. Berg, Modeling context in referring expressions, in: European conference on computer vision, Springer, 2016, pp. 69–85

2016

[4] [4]

X. Lai, Z. Tian, Y. Chen, Y. Li, Y. Yuan, S. Liu, J. Jia, Lisa: Reasoning segmentation via large language model, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9579–9589

2024

[5] [5]

Y. Liu, H. Duan, Y. Zhang, B. Li, S. Zhang, W. Zhao, Y. Yuan, J. Wang, C. He, Z. Liu, et al., Mmbench: Is your multi-modal model an all-around player?, in: European conference on computer vision, Springer, 2024, pp. 216–233

2024

[6] [6]

Masry, D

A. Masry, D. X. Long, J. Q. Tan, S. Joty, E. Hoque, Chartqa: A benchmark for question answering about charts with visual and logical reasoning, arXiv preprint arXiv:2203.10244 (2022)

Pith/arXiv arXiv 2022

[7] [7]

T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft coco: Common objects in context, in: European conference on computer vision, Springer, 2014, pp. 740–755. 29

2014

[8] [8]

P. Wu, S. Xie, V?: Guided visual search as a core mechanism in multimodal llms, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, 2024, pp. 13084–13094

2024

[9] [9]

J. Chen, T. Liang, S. Siu, Z. Wang, K. Wang, Y. Wang, Y. Ni, W. Zhu, Z. Jiang, B. Lyu, et al., Mega-bench: Scaling multimodal evaluation to over 500 real-world tasks, arXiv preprint arXiv:2410.10563 (2024)

arXiv 2024

[10] [10]

M.A.L.Ralph, E.Jefferies, K.Patterson, T.T.Rogers, Theneuralandcomputational bases of semantic cognition, Nature reviews neuroscience 18 (2017) 42–55

2017

[11] [11]

Sapkota, M

R. Sapkota, M. Karkee, Object detection with multimodal large vision-language mod- els: An in-depth review, Available at SSRN 5233953 (2025)

2025

[12] [12]

Y. Shen, C. Li, F. Xiong, J.-O. Jeong, T. Wang, M. Latman, M. Unberath, Reasoning segmentationforimagesandvideos: Asurvey, arXivpreprintarXiv:2505.18816(2025)

arXiv 2025

[13] [13]

Y.Wang, S.Wu, Y.Zhang, S.Yan, Z.Liu, J.Luo, H.Fei, Multimodalchain-of-thought reasoning: A comprehensive survey, arXiv preprint arXiv:2503.12605 (2025)

Pith/arXiv arXiv 2025

[14] [14]

Q. Chen, L. Qin, J. Liu, D. Peng, J. Guan, P. Wang, M. Hu, Y. Zhou, T. Gao, W. Che, Towards reasoning era: A survey of long chain-of-thought for reasoning large language models, arXiv preprint arXiv:2503.09567 (2025)

[15]

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al., Training verifiers to solve math word problems, arXiv preprint arXiv:2110.14168 (2021)

[16]

H. R. Walke, K. Black, T. Z. Zhao, Q. Vuong, C. Zheng, P. Hansen-Estruch, A. W. He, V. Myers, M. J. Kim, M. Du, et al., Bridgedata v2: A dataset for robot learning at scale, in: Conference on Robot Learning, PMLR, 2023, pp. 1723–1736

[17]

Z. Guo, R. Xu, Y. Yao, J. Cui, Z. Ni, C. Ge, T.-S. Chua, Z. Liu, G. Huang, Llava-uhd: an lmm perceiving any aspect ratio and high-resolution images, in: European Conference on Computer Vision, Springer, 2024, pp. 390–406

[18]

D. Liu, R. Zhang, L. Qiu, S. Huang, W. Lin, S. Zhao, S. Geng, Z. Lin, P. Jin, K. Zhang, et al., Sphinx-x: Scaling data and parameters for a family of multi-modal large language models, arXiv preprint arXiv:2402.05935 (2024)

[19]

C. X. Liang, P. Tian, C. H. Yin, Y. Yua, W. An-Hou, L. Ming, T. Wang, Z. Bi, M. Liu, A comprehensive survey and guide to multimodal large language models in vision-language tasks, arXiv preprint arXiv:2411.06284 (2024)

[20]

D. Caffagni, F. Cocchi, L. Barsellotti, N. Moratelli, S. Sarto, L. Baraldi, M. Cornia, R. Cucchiara, The revolution of multimodal large language models: a survey, arXiv preprint arXiv:2402.12451 (2024)

[21]

D. Zhang, Y. Yu, J. Dong, C. Li, D. Su, C. Chu, D. Yu, Mm-llms: Recent advances in multimodal large language models, arXiv preprint arXiv:2401.13601 (2024)

[22]

T. Wang, Z. Jiang, Z. He, S. Tong, W. Yang, Y. Zheng, Z. Li, Z. He, H. Gong, Towards hierarchical multi-step reward models for enhanced reasoning in large language models, arXiv preprint arXiv:2503.13551 (2025)

[23]

C. Zhou, M. Wang, Y. Ma, C. Wu, W. Chen, Z. Qian, X. Liu, Y. Zhang, J. Wang, H. Xu, et al., From perception to cognition: A survey of vision-language interactive reasoning in multimodal large language models, arXiv preprint arXiv:2509.25373 (2025)

[24]

K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask r-cnn, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969

[25]

X. Zhao, X. Li, H. Duan, H. Huang, Y. Li, K. Chen, H. Yang, Mg-llava: Towards multi-granularity visual instruction tuning, arXiv preprint arXiv:2406.17770 (2024)

[26]

Y. Zhang, X. Huang, J. Ma, Z. Li, Z. Luo, Y. Xie, Y. Qin, T. Luo, Y. Li, S. Liu, et al., Recognize anything: A strong image tagging model, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1724–1732

[27]

M. Minderer, A. Gritsenko, N. Houlsby, Scaling open-vocabulary object detection, Advances in Neural Information Processing Systems 36 (2023) 72983–73007

[28]

Q. Jiang, G. Luo, Y. Yang, Y. Xiong, Y. Chen, Z. Zeng, T. Ren, L. Zhang, Chatrex: Taming multimodal llm for joint perception and understanding, arXiv preprint arXiv:2411.18363 (2024)

[29]

C. Ma, Y. Jiang, J. Wu, Z. Yuan, X. Qi, Groma: Localized visual tokenization for grounding multimodal large language models, in: European Conference on Computer Vision, Springer, 2024, pp. 417–435

[30]

X. Zhu, W. Su, L. Lu, B. Li, X. Wang, J. Dai, Deformable detr: Deformable transformers for end-to-end object detection, arXiv preprint arXiv:2010.04159 (2020)

[31]

W. Wang, Y. Ren, H. Luo, T. Li, C. Yan, Z. Chen, W. Wang, Q. Li, L. Lu, X. Zhu, et al., The all-seeing project v2: Towards general relation comprehension of the open world, in: European Conference on Computer Vision, Springer, 2024, pp. 471–490

[32]

J. Qiu, Y. Zhang, X. Tang, L. Xie, T. Ma, P. Yan, D. Doermann, Q. Ye, Y. Tian, Artemis: Towards referential understanding in complex videos, Advances in Neural Information Processing Systems 37 (2024) 114321–114347

[33]

Y. Yuan, W. Li, J. Liu, D. Tang, X. Luo, C. Qin, L. Zhang, J. Zhu, Osprey: Pixel understanding with visual instruction tuning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 28202–28211

[34]

H. Hua, Q. Liu, L. Zhang, J. Shi, S. Y. Kim, Z. Zhang, Y. Wang, J. Zhang, Z. Lin, J. Luo, Finecaption: Compositional image captioning focusing on wherever you want at any granularity, in: Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 24763–24773

[35]

W. Lin, X. Wei, R. An, P. Gao, B. Zou, Y. Luo, S. Huang, S. Zhang, H. Li, Draw-and-understand: Leveraging visual prompts to enable mllms to comprehend what you want, arXiv preprint arXiv:2403.20271 (2024)

[36]

H. You, H. Zhang, Z. Gan, X. Du, B. Zhang, Z. Wang, L. Cao, S.-F. Chang, Y. Yang, Ferret: Refer and ground anything anywhere at any granularity, arXiv preprint arXiv:2310.07704 (2023)

[37]

W. Tang, Y. Sun, Q. Gu, Z. Li, Visual position prompt for mllm based visual grounding, arXiv preprint arXiv:2503.15426 (2025)

[38]

Q. Guo, S. De Mello, H. Yin, W. Byeon, K. C. Cheung, Y. Yu, P. Luo, S. Liu, Regiongpt: Towards region understanding vision language model, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13796–13806

[39]

S. Zhang, P. Sun, S. Chen, M. Xiao, W. Shao, W. Zhang, Y. Liu, K. Chen, P. Luo, Gpt4roi: Instruction tuning large language model on region-of-interest, in: European Conference on Computer Vision, Springer, 2025, pp. 52–70

[40]

M. Cai, H. Liu, S. K. Mustikovela, G. P. Meyer, Y. Chai, D. Park, Y. J. Lee, Vip-llava: Making large multimodal models understand arbitrary visual prompts, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 12914–12923

[41]

G. Chen, L. Shen, R. Shao, X. Deng, L. Nie, Lion: Empowering multimodal large language model with dual-level visual knowledge, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26540–26550

[42]

D. Dwibedi, V. Jain, J. J. Tompson, A. Zisserman, Y. Aytar, Flexcap: Describe anything in images in controllable detail, Advances in Neural Information Processing Systems 37 (2024) 111172–111198

[43]

H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual instruction tuning, Advances in neural information processing systems 36 (2023) 34892–34916

[44]

Z. Peng, W. Wang, L. Dong, Y. Hao, S. Huang, S. Ma, F. Wei, Kosmos-2: Grounding multimodal large language models to the world, arXiv preprint arXiv:2306.14824 (2023)

[45]

J. Cha, W. Kang, J. Mun, B. Roh, Honeybee: Locality-enhanced projector for multimodal llm, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13817–13827

[46]

A.-L. Wang, B. Shan, W. Shi, K.-Y. Lin, X. Fei, G. Tang, L. Liao, J. Tang, C. Huang, W.-S. Zheng, Pargo: Bridging vision-language with partial and global views, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, 2025, pp. 7491–7499

[47]

Y. Zhang, Z. Ma, X. Gao, S. Shakiah, Q. Gao, J. Chai, Groundhog: Grounding large language models to holistic segmentation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 14227–14238

[48]

Z. Yao, X. Cheng, Z. Huang, L. Li, Countllm: Towards generalizable repetitive action counting via large language model, in: Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 19143–19153

[49]

H. Wang, Y. Ye, Y. Wang, Y. Nie, C. Huang, Elysium: Exploring object-level perception in videos via mllm, in: European Conference on Computer Vision, Springer, 2024, pp. 166–185

[50]

S. Ren, L. Yao, S. Li, X. Sun, L. Hou, Timechat: A time-sensitive multimodal large language model for long video understanding, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14313–14323

[51]

M. Shi, F. Liu, S. Wang, S. Liao, S. Radhakrishnan, Y. Zhao, D.-A. Huang, H. Yin, K. Sapra, Y. Yacoob, et al., Eagle: Exploring the design space for multimodal llms with mixture of encoders, arXiv preprint arXiv:2408.15998 (2024)

[52]

X. Fan, T. Ji, C. Jiang, S. Li, S. Jin, S. Song, J. Wang, B. Hong, L. Chen, G. Zheng, et al., Mousi: Poly-visual-expert vision-language models, arXiv preprint arXiv:2401.17221 (2024)

[53]

L. Shen, G. Chen, R. Shao, W. Guan, L. Nie, Mome: Mixture of multimodal experts for generalist multimodal large language models, Advances in neural information processing systems 37 (2024) 42048–42070

[54]

J. Ma, J. Wang, J. Luo, P. Yu, G. Zhou, Sherlock: Towards multi-scene video abnormal event extraction and localization via a global-local spatial-sensitive llm, in: Proceedings of the ACM on Web Conference 2025, 2025, pp. 4004–4013

[55]

Y. Liu, Z. Zhao, Z. Zhuang, L. Tian, X. Zhou, J. Zhou, Points: Improving your vision-language model with affordable strategies, arXiv preprint arXiv:2409.04828 (2024)

[56]

Q. Jiang, L. Wu, Z. Zeng, T. Ren, Y. Xiong, Y. Chen, Q. Liu, L. Zhang, Referring to any person, arXiv preprint arXiv:2503.08507 (2025)

[57]

B.-K. Lee, B. Park, C. Won Kim, Y. Man Ro, Moai: Mixture of all intelligence for large language and vision models, in: European Conference on Computer Vision, Springer, 2024, pp. 273–302

[58]

Y. Li, H. Wang, S. Yuan, M. Liu, D. Zhao, Y. Guo, C. Xu, G. Shi, W. Zuo, Myriad: Large multimodal model by applying vision experts for industrial anomaly detection, arXiv preprint arXiv:2310.19070 (2023)

[59]

H. Yin, Y. Ren, K. Yan, S. Ding, Y. Hao, Rod-mllm: Towards more reliable object detection in multimodal large language models, in: Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 14358–14368

[60]

X. He, L. Wei, L. Xie, Q. Tian, Incorporating visual experts to resolve the information loss in multimodal large language models, arXiv preprint arXiv:2401.03105 (2024)

[61]

A.-C. Cheng, H. Yin, Y. Fu, Q. Guo, R. Yang, J. Kautz, X. Wang, S. Liu, Spatialrgpt: Grounded spatial reasoning in vision language models, arXiv preprint arXiv:2406.01584 (2024)

[62]

J. Jain, J. Yang, H. Shi, Vcoder: Versatile vision encoders for multimodal large language models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 27992–28002

[63]

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., Segment anything, in: Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4015–4026

[64]

H. Rasheed, M. Maaz, S. Shaji, A. Shaker, S. Khan, H. Cholakkal, R. M. Anwer, E. Xing, M.-H. Yang, F. S. Khan, Glamm: Pixel grounding large multimodal model, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13009–13018

[65]

R. Pi, L. Yao, J. Gao, J. Zhang, T. Zhang, Perceptiongpt: Effectively fusing visual perception into llm, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 27124–27133

[66]

T. Zhang, X. Li, H. Fei, H. Yuan, S. Wu, S. Ji, C. C. Loy, S. Yan, Omg-llava: Bridging image-level, object-level, pixel-level reasoning and understanding, Advances in Neural Information Processing Systems 37 (2024) 71737–71767

[67]

X. Li, H. Yuan, W. Li, H. Ding, S. Wu, W. Zhang, Y. Li, K. Chen, C. C. Loy, Omg-seg: Is one model good enough for all segmentation?, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 27948–27959

[68]

J. He, Y. Wang, L. Wang, H. Lu, J.-Y. He, J.-P. Lan, B. Luo, X. Xie, Multi-modal instruction tuned llms with fine-grained visual perception, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 13980–13990

[69]

Z. Zhang, Y. Ma, E. Zhang, X. Bai, Psalm: Pixelwise segmentation with large multi-modal model, in: European Conference on Computer Vision, Springer, 2024, pp. 74–91

[70]

Y. Tian, T. Ma, L. Xie, J. Qiu, X. Tang, Y. Zhang, J. Jiao, Q. Tian, Q. Ye, Chatterbox: Multi-round multimodal referring and grounding, arXiv preprint arXiv:2401.13307 (2024)

[71]

H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. M. Ni, H.-Y. Shum, Dino: Detr with improved denoising anchor boxes for end-to-end object detection, arXiv preprint arXiv:2203.03605 (2022)

[72]

S. Yang, T. Qu, X. Lai, Z. Tian, B. Peng, S. Liu, J. Jia, Lisa++: An improved baseline for reasoning segmentation with large language model, arXiv preprint arXiv:2312.17240 (2023)

[73]

Y. Yang, P.-T. Jiang, J. Wang, H. Zhang, K. Zhao, J. Chen, B. Li, Empowering segmentation ability to multi-modal large language models, arXiv preprint arXiv:2403.14141 (2024)

[74]

D. Cai, X. Yang, Y. Liu, D. Wang, S. Feng, Y. Zhang, S. Poria, Pixel-level reasoning segmentation via multi-turn conversations, arXiv preprint arXiv:2502.09447 (2025)

[75]

X. Wang, S. Zhang, S. Li, K. Kallidromitis, K. Li, Y. Kato, K. Kozuka, T. Darrell, Segllm: Multi-round reasoning segmentation, arXiv preprint arXiv:2410.18923 (2024)

[76]

Z. Ren, Z. Huang, Y. Wei, Y. Zhao, D. Fu, J. Feng, X. Jin, Pixellm: Pixel reasoning with large multimodal model, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26374–26383

[77]

K. Han, Y. Hu, M. Qu, H. Shi, Y. Zhao, Y. Wei, Rose: Revolutionizing open-set dense segmentation with patch-wise perceptual large multimodal model, arXiv preprint arXiv:2412.00153 (2024)

[78]

Z. Xia, D. Han, Y. Han, X. Pan, S. Song, G. Huang, Gsva: Generalized segmentation via multimodal large language models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 3858–3869

[79]

C. Wei, H. Tan, Y. Zhong, Y. Yang, L. Ma, Lasagna: Language-based segmentation assistant for complex queries, arXiv preprint arXiv:2404.08506 (2024)

[80]

C. Wei, Y. Zhong, H. Tan, Y. Zeng, Y. Liu, Z. Zhao, Y. Yang, Instructseg: Unifying instructed visual segmentation with multi-modal large language models, arXiv preprint arXiv:2412.14006 (2024)
