pith. machine review for the scientific record. sign in

arxiv: 2605.00891 · v1 · submitted 2026-04-27 · 💻 cs.CV · cs.AI

Recognition: unknown

X2SAM: Any Segmentation in Images and Videos

Chi Zhang, Guanglu Wan, Hao Wang, Limeng Qiao, Lin Ma, Xiangyuan Lan, Xiaodan Liang

Pith reviewed 2026-05-09 20:42 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords multimodal large language modelsvideo segmentationimage segmentationMask Memoryunified segmentationvisual promptsreferring segmentationvideo benchmark
0
0 comments X

The pith

X2SAM unifies any segmentation for images and videos in one multimodal model via a Mask Memory module.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents X2SAM as a multimodal large language model that extends pixel-level segmentation from images to videos using conversational instructions and visual prompts. It adds a Mask Memory module to store guided vision features, enabling consistent mask generation across video frames without separate models for each modality. Joint training on mixed image and video datasets is used to support tasks like open-vocabulary, referring, reasoning, and interactive segmentation while aiming to keep image benchmark performance and general chat abilities intact. The work also proposes the Video Visual Grounded segmentation benchmark to test object track segmentation from interactive prompts. If the approach holds, it would allow a single system to handle complex video analysis and image tasks through one interface.

Core claim

X2SAM extends any-segmentation capabilities from images to videos by coupling an LLM with a Mask Memory module that stores guided vision features for temporally consistent video mask generation. The same formulation supports generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation across image and video inputs. With a unified joint training strategy over heterogeneous image and video datasets, X2SAM delivers strong video segmentation performance, remains competitive on image segmentation benchmarks, and preserves general image and video chat ability. It introduces the Video Visual Grounded (V-VGD) segmentation benchmark.

What carries the argument

The Mask Memory module, which stores guided vision features to produce temporally consistent masks over video frames from text and visual prompts.

If this is right

  • The model supports both textual and visual prompts for segmentation in one interface across images and videos.
  • Video segmentation performance reaches strong levels without task-specific fine-tuning after joint training.
  • Image segmentation benchmarks remain competitive with no reported loss from the added video training.
  • General multimodal chat ability is preserved alongside the segmentation functions.
  • The V-VGD benchmark provides a new way to evaluate interactive video object track segmentation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The memory mechanism could be reused for related video tasks such as object tracking or action segmentation without new modules.
  • Similar memory additions might help other multimodal models handle sequential data while retaining earlier capabilities.
  • Real-world tools for interactive video editing or analysis could be built around one model instead of multiple specialized ones.
  • Longer video sequences would be a natural next test to see if the memory module maintains consistency over extended time.

Load-bearing premise

A single Mask Memory module plus joint training on mixed image and video data suffices for temporal consistency in videos and prevents loss of image-only skills.

What would settle it

Remove the Mask Memory module during training and test whether video masks lose temporal consistency on V-VGD or similar benchmarks, or compare image segmentation scores before and after adding video data to training.

read the original abstract

Multimodal Large Language Models (MLLMs) have demonstrated strong image-level visual understanding and reasoning, yet their pixel-level perception across both images and videos remains limited. Foundation segmentation models such as the SAM series produce high-quality masks, but they rely on low-level visual prompts and cannot natively interpret complex conversational instructions. Existing segmentation MLLMs narrow this gap, but are usually specialized for either images or videos and rarely support both textual and visual prompts in one interface. We introduce X2SAM, a unified segmentation MLLM that extends any-segmentation capabilities from images to videos. Given conversational instructions and visual prompts, X2SAM couples an LLM with a Mask Memory module that stores guided vision features for temporally consistent video mask generation. The same formulation supports generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation across image and video inputs. We further introduce the Video Visual Grounded (V-VGD) segmentation benchmark, which evaluates whether a model can segment object tracks in videos from interactive visual prompts. With a unified joint training strategy over heterogeneous image and video datasets, X2SAM delivers strong video segmentation performance, remains competitive on image segmentation benchmarks, and preserves general image and video chat ability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 2 minor

Summary. The paper introduces X2SAM, a unified segmentation MLLM that extends any-segmentation capabilities from images to videos by coupling an LLM with a Mask Memory module storing guided vision features for temporally consistent video mask generation. It supports generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation across image and video inputs. A new Video Visual Grounded (V-VGD) benchmark is proposed, and the model is trained jointly on heterogeneous image and video datasets, claiming strong video segmentation performance while remaining competitive on image benchmarks and preserving general chat ability.

Significance. If the performance claims and architectural sufficiency hold, X2SAM would advance the field by unifying image and video pixel-level perception in a single conversational MLLM, reducing reliance on separate specialized models and enabling more flexible multimodal interfaces. The V-VGD benchmark could provide a useful new evaluation axis for interactive video segmentation.

major comments (4)
  1. [Abstract] Abstract: The central claims that 'X2SAM delivers strong video segmentation performance' and 'remains competitive on image segmentation benchmarks' are asserted without any reported quantitative metrics (e.g., mIoU, J&F scores), ablation results, error bars, or dataset statistics, preventing verification of the headline performance assertions.
  2. [Method (Mask Memory module)] Method description of Mask Memory module: The module is presented as sufficient to achieve temporally consistent video masks via storage of guided vision features, yet no details are given on its update/query mechanics, capacity, or integration with the LLM backbone, leaving the mechanism for temporal consistency unexamined and the sufficiency claim untestable.
  3. [Experiments] Experiments/Results: No ablation is reported that removes the Mask Memory module or compares video temporal consistency with/without it, which is load-bearing for the claim that this single module plus joint training suffices for consistency without task-specific safeguards.
  4. [Experiments] Experiments/Results: No before/after quantitative comparison of image benchmark scores (or chat ability metrics) is provided after joint training on mixed image/video data, leaving the claim of no catastrophic forgetting of image-only capabilities unverified.
minor comments (2)
  1. [Abstract] The list of supported tasks ('generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation') would benefit from brief definitions or examples to clarify distinctions and avoid overlap.
  2. [Introduction/Benchmark] The introduction of the V-VGD benchmark lacks any description of its construction, scale, or evaluation protocol, which would aid reproducibility even in an early manuscript.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We appreciate the acknowledgment of X2SAM's potential to advance unified image-video segmentation in a single conversational MLLM and the value of the proposed V-VGD benchmark. We address each major comment point by point below, indicating revisions where the manuscript will be updated to strengthen clarity and verifiability.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claims that 'X2SAM delivers strong video segmentation performance' and 'remains competitive on image segmentation benchmarks' are asserted without any reported quantitative metrics (e.g., mIoU, J&F scores), ablation results, error bars, or dataset statistics, preventing verification of the headline performance assertions.

    Authors: We agree that the abstract would be strengthened by including explicit quantitative support for the performance claims. Although detailed results appear in the experiments section, we will revise the abstract to report key metrics such as mIoU and J&F scores on video benchmarks, competitive image benchmark scores, and relevant dataset statistics. This change will make the headline assertions directly verifiable. revision: yes

  2. Referee: [Method (Mask Memory module)] Method description of Mask Memory module: The module is presented as sufficient to achieve temporally consistent video masks via storage of guided vision features, yet no details are given on its update/query mechanics, capacity, or integration with the LLM backbone, leaving the mechanism for temporal consistency unexamined and the sufficiency claim untestable.

    Authors: We acknowledge that the current description lacks sufficient technical specificity. The Mask Memory module stores LLM-guided vision features from prior frames in a fixed-capacity buffer and retrieves relevant features via similarity-based querying to condition the current frame's mask prediction. We will expand the method section with precise specifications of the update rule, query mechanism, buffer capacity, feature guidance process, and integration points with the LLM backbone to render the temporal consistency approach fully examinable and testable. revision: yes

  3. Referee: [Experiments] Experiments/Results: No ablation is reported that removes the Mask Memory module or compares video temporal consistency with/without it, which is load-bearing for the claim that this single module plus joint training suffices for consistency without task-specific safeguards.

    Authors: We agree that an ablation isolating the Mask Memory module is necessary to substantiate its contribution. We will add this ablation to the experiments section, comparing video segmentation performance (including temporal consistency metrics such as frame-to-frame mask stability and J&F scores) with and without the module while keeping joint training fixed. This will directly test the sufficiency claim. revision: yes

  4. Referee: [Experiments] Experiments/Results: No before/after quantitative comparison of image benchmark scores (or chat ability metrics) is provided after joint training on mixed image/video data, leaving the claim of no catastrophic forgetting of image-only capabilities unverified.

    Authors: We recognize that a direct pre/post comparison would provide stronger evidence for the absence of catastrophic forgetting. We will include in the revised experiments section quantitative results on image segmentation benchmarks and chat ability metrics evaluated before and after joint training on the mixed datasets. This will allow verification of preserved image-only capabilities. revision: yes

Circularity Check

0 steps flagged

No circularity; architecture and training claims are self-contained empirical proposals

full rationale

The paper presents X2SAM as a new MLLM architecture coupling an LLM with a Mask Memory module, trained jointly on heterogeneous image/video datasets to support unified segmentation tasks. No equations, fitted parameters, or derivation steps are described that reduce by construction to self-definitions, renamed predictions, or self-citation chains. Central claims rest on standard supervised training and benchmark evaluation rather than tautological reductions; the Mask Memory and joint-training strategy are introduced as design choices without invoking uniqueness theorems or prior self-work as load-bearing justification. This is the normal case of a model paper whose validity is assessed externally via reported metrics, not internally by circular logic.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Abstract-only review; no equations, training details, or component specifications are available to enumerate free parameters, axioms, or invented entities beyond the high-level Mask Memory module.

invented entities (1)
  • Mask Memory module no independent evidence
    purpose: Stores guided vision features to enable temporally consistent mask generation across video frames
    Described in abstract as the key coupling mechanism between LLM and segmentation output

pith-pipeline@v0.9.0 · 5530 in / 1169 out tokens · 29764 ms · 2026-05-09T20:42:37.358975+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

68 extracted references · 14 canonical work pages · 8 internal anchors

  1. [1]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

  2. [2]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

  3. [3]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InICML, pages 8748–8763. PmLR, 2021

  4. [4]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, pages 4904–4916. PMLR, 2021

  5. [5]

    Show, attend and tell: Neural image caption generation with visual attention

    Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. InICML, pages 2048–2057. PMLR, 2015

  6. [6]

    Vqa: Visual question answering

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. InICCV, pages 2425–2433, 2015

  7. [7]

    Language-based image editing with recurrent attentive models

    Jianbo Chen, Yelong Shen, Jianfeng Gao, Jingjing Liu, and Xiaodong Liu. Language-based image editing with recurrent attentive models. InCVPR, pages 8721–8729, 2018

  8. [8]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InICCV, pages 4015–4026, 2023

  9. [9]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024

  10. [10]

    Lisa: Reasoning segmentation via large language model

    Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. InCVPR, pages 9579–9589, 2024

  11. [11]

    Visa: Reasoning video object segmentation via large language models.ECCV, 2024

    Cilin Yan, Haochen Wang, Shilin Yan, Xiaolong Jiang, Yao Hu, Guoliang Kang, Weidi Xie, and Efstratios Gavves. Visa: Reasoning video object segmentation via large language models.ECCV, 2024

  12. [12]

    One token to seg them all: Language instructed reasoning segmentation in videos.NeurIPS, 2024

    Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Lei Liu, Zheng Zhang, and Mike Zheng Shou. One token to seg them all: Language instructed reasoning segmentation in videos.NeurIPS, 2024

  13. [13]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InICML, pages 12888–12900. PMLR, 2022

  14. [15]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, pages 26296–26306, 2024

  15. [16]

    Tarvis: A unified approach for target-based video segmentation

    Ali Athar, Alexander Hermans, Jonathon Luiten, Deva Ramanan, and Bastian Leibe. Tarvis: A unified approach for target-based video segmentation. InCVPR, pages 18738–18748, 2023

  16. [17]

    Oneformer: One transformer to rule universal image segmentation

    Jitesh Jain, Jiachen Li, Mang Tik Chiu, Ali Hassani, Nikita Orlov, and Humphrey Shi. Oneformer: One transformer to rule universal image segmentation. InCVPR, pages 2989–2998, 2023

  17. [18]

    Omg-seg: Is one model good enough for all segmentation? InCVPR, pages 27948–27959, 2024

    Xiangtai Li, Haobo Yuan, Wei Li, Henghui Ding, Size Wu, Wenwei Zhang, Yining Li, Kai Chen, and Chen Change Loy. Omg-seg: Is one model good enough for all segmentation? InCVPR, pages 27948–27959, 2024

  18. [19]

    Omg-llava: Bridging image-level, object-level, pixel-level reasoning and understanding.NeurIPS, 37:71737–71767, 2024

    Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, and Shuicheng Yan. Omg-llava: Bridging image-level, object-level, pixel-level reasoning and understanding.NeurIPS, 37:71737–71767, 2024

  19. [20]

    Temporal memory attention for video semantic segmentation

    Hao Wang, Weining Wang, and Jing Liu. Temporal memory attention for video semantic segmentation. InICIP, pages 2254–2258. IEEE, 2021. 24

  20. [21]

    Video k-net: A simple, strong, and unified baseline for video segmentation

    Xiangtai Li, Wenwei Zhang, Jiangmiao Pang, Kai Chen, Guangliang Cheng, Yunhai Tong, and Chen Change Loy. Video k-net: A simple, strong, and unified baseline for video segmentation. InCVPR, 2022

  21. [22]

    X-sam: From segment anything to any segmentation

    Hao Wang, Limeng Qiao, Zequn Jie, Zhijian Huang, Chengjian Feng, Qingfang Zheng, Lin Ma, Xiangyuan Lan, and Xiaodan Liang. X-sam: From segment anything to any segmentation. InProceedings of the AAAI Conference on Artificial Intelligence, pages 26187–26196, 2026

  22. [23]

    Visual instruction tuning.NeurIPS, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.NeurIPS, 36:34892–34916, 2023

  23. [24]

    Llavanext: Improved reasoning, ocr, and world knowledge, 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llavanext: Improved reasoning, ocr, and world knowledge, 2024

  24. [25]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

  25. [26]

    How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101, 2024

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101, 2024

  26. [27]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023

  27. [28]

    Glamm: Pixel grounding large multimodal model

    Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. InCVPR, pages 13009–13018, 2024

  28. [29]

    Videoglamm: A large multimodal model for pixel-level visual grounding in videos.CVPR, 2024

    Shehan Munasinghe, Hanan Gani, Wenqi Zhu, Jiale Cao, Eric Xing, Fahad Khan, and Salman Khan. Videoglamm: A large multimodal model for pixel-level visual grounding in videos.CVPR, 2024

  29. [30]

    Psalm: Pixelwise segmentation with large multi-modal model

    Zheng Zhang, Yeyao Ma, Enming Zhang, and Xiang Bai. Psalm: Pixelwise segmentation with large multi-modal model. InECCV, pages 74–91. Springer, 2024

  30. [31]

    Hyperseg: Towards universal visual segmentation with large language model.arXiv preprint arXiv:2411.17606, 2024

    Cong Wei, Yujie Zhong, Haoxian Tan, Yong Liu, Zheng Zhao, Jie Hu, and Yujiu Yang. Hyperseg: Towards universal visual segmentation with large language model.arXiv preprint arXiv:2411.17606, 2024

  31. [32]

    arXiv preprint arXiv:2501.04001 (2025)

    Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, and Ming-Hsuan Yang. Sa2va: Marrying sam2 with llava for dense grounded understanding of images and videos. arXiv preprint arXiv:2501.04001, 2025

  32. [33]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  33. [34]

    Ferret: Refer and ground anything anywhere at any granularity,

    Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity.arXiv preprint arXiv:2310.07704, 2023

  34. [35]

    V-net: Fully convolutional neural networks for volumetric medical image segmentation

    Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In3DV, pages 565–571. Ieee, 2016

  35. [36]

    Improving language understanding by generative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. Technical report, OpenAI, San Francisco, CA, USA, 2018

  36. [37]

    Focal loss for dense object detection

    Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, pages 2980–2988, 2017

  37. [38]

    Large-scale video panoptic segmentation in the wild: A benchmark

    Jiaxu Miao, Xiaohan Wang, Yu Wu, Wei Li, Xu Zhang, Yunchao Wei, and Yi Yang. Large-scale video panoptic segmentation in the wild: A benchmark. InCVPR, pages 21033–21043, 2022. 25

  38. [39]

    Vspw: A large-scale dataset for video scene parsing in the wild

    Jiaxu Miao, Yunchao Wei, Yu Wu, Chen Liang, Guangrui Li, and Yi Yang. Vspw: A large-scale dataset for video scene parsing in the wild. InCVPR, pages 4133–4143, 2021

  39. [40]

    Video instance segmentation

    Linjie Yang, Yuchen Fan, and Ning Xu. Video instance segmentation. InICCV, pages 5188–5197, 2019

  40. [41]

    Urvos: Unified referring video object segmentation network with a large-scale benchmark

    Seonguk Seo, Joon-Young Lee, and Bohyung Han. Urvos: Unified referring video object segmentation network with a large-scale benchmark. InECCV, pages 208–223. Springer, 2020

  41. [42]

    Youtube-vos: A large-scale video object segmentation benchmark.arXiv preprint arXiv:1809.03327, 2018

    Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas Huang. Youtube- vos: A large-scale video object segmentation benchmark.arXiv preprint arXiv:1809.03327, 2018

  42. [43]

    A benchmark dataset and evaluation methodology for video object segmentation

    Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine- Hornung. A benchmark dataset and evaluation methodology for video object segmentation. InCVPR, pages 724–732, 2016

  43. [44]

    Video-chatgpt: Towards detailed video understanding via large vision and language models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024

  44. [45]

    Gres: Generalized referring expression segmentation

    Chang Liu, Henghui Ding, and Xudong Jiang. Gres: Generalized referring expression segmentation. InCVPR, pages 23592–23601, 2023

  45. [46]

    Semantic understanding of scenes through the ade20k dataset.IJCV, 127(3):302–321, 2019

    Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ade20k dataset.IJCV, 127(3):302–321, 2019

  46. [47]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InICLR, 2022. URLhttps://openreview.net/ forum?id=nZeVKeeFYf9

  47. [48]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

  48. [50]

    Open-vocabulary panoptic segmentation with text-to-image diffusion models

    Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. Open-vocabulary panoptic segmentation with text-to-image diffusion models. InCVPR, pages 2955–2966, 2023

  49. [51]

    Uniref++: Segment every reference object in spatial and temporal spaces.ICCV, 2023

    Jiannan Wu, Yi Jiang, Bin Yan, Huchuan Lu, Zehuan Yuan, and Ping Luo. Uniref++: Segment every reference object in spatial and temporal spaces.ICCV, 2023

  50. [52]

    Unipixel: Unified object referring and segmentation for pixel-level visual reasoning

    Ye Liu, Zongyang Ma, Junfu Pu, Zhongang Qi, Yang Wu, Shan Ying, and Chang Wen Chen. Unipixel: Unified object referring and segmentation for pixel-level visual reasoning. InNeurIPS, 2025

  51. [53]

    Segment everything everywhere all at once.NeurIPS, 36:19769–19782, 2023

    Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, and Yong Jae Lee. Segment everything everywhere all at once.NeurIPS, 36:19769–19782, 2023

  52. [54]

    Language as queries for referring video object segmentation

    Jiannan Wu, Yi Jiang, Peize Sun, Zehuan Yuan, and Ping Luo. Language as queries for referring video object segmentation. InCVPR, pages 4974–4984, 2022

  53. [55]

    Mme: A comprehensive evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2404.08506, 2024

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2404.08506, 2024

  54. [56]

    Mmbench: Is your multi-modal model an all-around player? InECCV, pages 216–233

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InECCV, pages 216–233. Springer, 2024

  55. [57]

    Seed-bench: Benchmarking multimodal large language models

    Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench: Benchmarking multimodal large language models. InCVPR, pages 13299–13308, 2024

  56. [58]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models.arXiv preprint arXiv:2305.10355, 2023

  57. [59]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. InComputer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 235–251. Springer, 2016. 26

  58. [60]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InECCV, pages 740–755. Springer, 2014

  59. [61]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InCVPR, pages 24108–24118, 2025

  60. [62]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. InCVPR, pages 22195–22206, 2024

  61. [63]

    Mlvu: Benchmarking multi-task long video understanding

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, et al. Mlvu: Benchmarking multi-task long video understanding. InCVPR, pages 13691–13701, 2025

  62. [64]

    Longvideobench: A benchmark for long-context interleaved video-language understanding.NeurIPS, 37:28828–28857, 2024

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding.NeurIPS, 37:28828–28857, 2024

  63. [65]

    Masked-attention mask transformer for universal image segmentation

    Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. InCVPR, pages 1290–1299, 2022

  64. [66]

    Cris: Clip-driven referring image segmentation

    Zhaoqing Wang, Yu Lu, Qiang Li, Xunqiang Tao, Yandong Guo, Mingming Gong, and Tongliang Liu. Cris: Clip-driven referring image segmentation. InCVPR, pages 11686–11695, 2022

  65. [67]

    PG-Video-LLaV A: Pixel Grounding Large Video-Language Models.ArXiv 2311.13435, 2023

    Shehan Munasinghe, Rusiru Thushara, Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, Mubarak Shah, and Fahad Khan. Pg-video-llava: Pixel grounding large video-language models.arXiv preprint arXiv:2311.13435, 2023

  66. [68]

    Videochat: Chat-centric video understanding.Science China Information Sciences, 68(10):200102, 2025

    KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding.Science China Information Sciences, 68(10):200102, 2025

  67. [69]

    Chat-univi: Unified visual representation empowers large language models with image and video understanding

    Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. InCVPR, pages 13700–13710, 2024

  68. [70]

    Vila: On pre-training for visual language models

    Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. InCVPR, pages 26689–26699, 2024. 27