
arxiv: 2606.26196 · v1 · pith:VPOLTG3Inew · submitted 2026-06-24 · 💻 cs.CL · cs.AI · cs.CV · cs.LG · cs.MM

From Structure to Synergy: A Survey of Vision-Language Perception Paradigm Evolution in Multimodal Large Language Models

Pith reviewed 2026-06-26 01:49 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CV · cs.LG · cs.MM
keywords multimodal large language models · vision-language perception · paradigm evolution · five-stage taxonomy · unified multimodal intelligence · survey

The pith

MLLM perception is formalized as a single unified vision-language capability that evolves through five distinct paradigms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the first systematic survey that treats vision and language as an inseparable modality in multimodal large language models. It formalizes perception in these models as an intrinsic unified capability similar to human perception and traces its development across five stages. A reader would care because this integrated view could replace fragmented approaches that study vision or language in isolation. The survey also maps representative methods at each stage and points to open challenges on the path to general multimodal intelligence.

Core claim

MLLM perception is an intrinsic, unified vision-language capability analogous to human innate perception; the paper introduces a five-stage taxonomy that traces its paradigm evolution from early separate structures to synergistic integration and identifies open challenges toward general unified multimodal intelligence.

What carries the argument

A five-stage taxonomy tracing the paradigm evolution of MLLM perception from structure to synergy.

If this is right

  • Design of future MLLMs will treat perception as one integrated module rather than separate vision and language components.
  • Evaluation benchmarks will shift toward measuring cross-modal synergy instead of isolated modality performance.
  • Research roadmaps will prioritize methods that advance the later stages of the taxonomy.
  • Progress toward general multimodal intelligence will be measured by how well models achieve unified perception.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The taxonomy could be tested by checking whether new models fit cleanly into one of the five stages or require an additional stage.
  • Insights from this survey might suggest concrete experiments that combine perception tasks across stages to measure synergy gains.
  • If the unified lens holds, it would imply that separate vision-only or language-only improvements are insufficient for the next generation of models.

Load-bearing premise

Existing surveys are too fragmented because they examine vision and language separately, so a single unified vision-language lens is the right way to track how perception develops in these models.

What would settle it

Discovery of an earlier survey that already presents a unified five-stage taxonomy of vision-language perception in MLLMs would falsify the claim that this is the first such systematic treatment.

Figures

Figures reproduced from arXiv: 2606.26196 by Haoxiang Sun, Jiancheng Lv, Jian Zhao, Li Yuan, Tao Wang.

Figure 1: An overview of the evolving paradigms for enhancing perception in multimodal large language …
Figure 2: Organization of our five-stage evolutionary framework.
Figure 3: Encoder-Centric Optimization Strategies for Multimodal Perception in MLLMs. (a) Region …
Figure 4: Decoder-Centric Optimization Strategies for Multimodal Perception in MLLMs. (a) Auxiliary …
Figure 5: Dynamic Perception via Adaptive Processing. (a) External Tool Scheduling: The LLM decides …
Figure 6: Architecture-Free Strategies for Perception Enhancement. (a) Instruction-Based Strategies: the …
Figure 7: Case study of OpenAI o3’s thinking with images, reaching the correct answer after 42 seconds of …
Original abstract

Multimodal Large Language Models (MLLMs) have recently made remarkable progress in unifying vision-language understanding and reasoning, especially following the introduction of models such as OpenAI's O-series and DeepSeek's R-series, which have driven a paradigm shift toward perception-centric intelligence. However, there remains a lack of systematic surveys that examine perception from a truly unified vision-language perspective -- one that treats vision and language as an inseparable modality. Existing reviews are often fragmented, focusing separately on either vision or language, and thus rarely capture the cross-modal evolution of perception as an integrated capability. To bridge this gap, we present the first systematic survey of unified vision-language perception in MLLMs. Specifically, we (1) formalize MLLM perception as an intrinsic, unified vision-language capability analogous to human innate perception, (2) introduce a five-stage taxonomy tracing the paradigm evolution of MLLM perception and survey representative methods and milestones at each phase, and (3) identify open challenges and outline promising research directions toward truly general, unified multimodal intelligence. We hope our study will provide both a foundational understanding and an actionable roadmap to foster further innovation on the path toward artificial general intelligence (AGI).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript is a survey on vision-language perception in Multimodal Large Language Models (MLLMs). It formalizes MLLM perception as an intrinsic, unified vision-language capability analogous to human innate perception, claims to be the first systematic survey from this unified perspective, introduces a five-stage taxonomy tracing paradigm evolution, surveys representative methods and milestones at each stage, and identifies open challenges with future research directions toward general multimodal intelligence and AGI.

Significance. If the five-stage taxonomy is well-motivated and the survey comprehensively covers the literature without major omissions, the work could provide a useful organizational lens for the field, highlighting the shift to perception-centric models following recent O-series and R-series models. The unified-modality framing may help consolidate fragmented prior reviews, though its value depends on explicit comparisons to existing surveys.

major comments (1)
  1. [Abstract] The assertion that 'existing reviews are often fragmented, focusing separately on either vision or language' is central to the novelty claim but lacks supporting citations or a dedicated comparison subsection; without this, the justification for the five-stage taxonomy as filling a unique gap remains under-supported.
minor comments (2)
  1. The five-stage taxonomy should include explicit classification criteria or decision rules for assigning methods to stages to enhance reproducibility.
  2. Consider adding a summary table listing the five stages, key models/milestones, and representative papers for improved readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the constructive review and the minor revision recommendation. We address the single major comment below.

Point-by-point responses
  1. Referee: [Abstract] The assertion that 'existing reviews are often fragmented, focusing separately on either vision or language' is central to the novelty claim but lacks supporting citations or a dedicated comparison subsection; without this, the justification for the five-stage taxonomy as filling a unique gap remains under-supported.

    Authors: We agree the abstract claim would be strengthened by explicit citations and a brief comparison. In revision we will (1) add 3-4 representative citations to prior surveys that treat vision or language in isolation and (2) insert a short paragraph in the introduction that contrasts those works with our unified perception framing, thereby better motivating the five-stage taxonomy.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; survey with no derivation chain

Full rationale

This is a literature review paper that formalizes a concept and proposes a five-stage taxonomy based on surveying existing external work. No equations, predictions, fitted parameters, or load-bearing self-citations appear in the provided text or abstract. The central claims are organizational and do not reduce to any input by construction, satisfying the criteria for a self-contained survey with score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper that introduces no free parameters, mathematical axioms, or invented entities; the taxonomy is an organizational construct rather than a derived result.

pith-pipeline@v0.9.1-grok · 5764 in / 971 out tokens · 43113 ms · 2026-06-26T01:49:38.561094+00:00 · methodology

