arxiv: 2605.00891 · v1 · submitted 2026-04-27 · 💻 cs.CV · cs.AI

Recognition: unknown

X2SAM: Any Segmentation in Images and Videos

Chi Zhang, Guanglu Wan, Hao Wang, Limeng Qiao, Lin Ma, Xiangyuan Lan, Xiaodan Liang

Pith reviewed 2026-05-09 20:42 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords multimodal large language modelsvideo segmentationimage segmentationMask Memoryunified segmentationvisual promptsreferring segmentationvideo benchmark

0 comments

The pith

X2SAM unifies any segmentation for images and videos in one multimodal model via a Mask Memory module.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents X2SAM as a multimodal large language model that extends pixel-level segmentation from images to videos using conversational instructions and visual prompts. It adds a Mask Memory module to store guided vision features, enabling consistent mask generation across video frames without separate models for each modality. Joint training on mixed image and video datasets is used to support tasks like open-vocabulary, referring, reasoning, and interactive segmentation while aiming to keep image benchmark performance and general chat abilities intact. The work also proposes the Video Visual Grounded segmentation benchmark to test object track segmentation from interactive prompts. If the approach holds, it would allow a single system to handle complex video analysis and image tasks through one interface.

Core claim

X2SAM extends any-segmentation capabilities from images to videos by coupling an LLM with a Mask Memory module that stores guided vision features for temporally consistent video mask generation. The same formulation supports generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation across image and video inputs. With a unified joint training strategy over heterogeneous image and video datasets, X2SAM delivers strong video segmentation performance, remains competitive on image segmentation benchmarks, and preserves general image and video chat ability. It introduces the Video Visual Grounded (V-VGD) segmentation benchmark.

What carries the argument

The Mask Memory module, which stores guided vision features to produce temporally consistent masks over video frames from text and visual prompts.

If this is right

The model supports both textual and visual prompts for segmentation in one interface across images and videos.
Video segmentation performance reaches strong levels without task-specific fine-tuning after joint training.
Image segmentation benchmarks remain competitive with no reported loss from the added video training.
General multimodal chat ability is preserved alongside the segmentation functions.
The V-VGD benchmark provides a new way to evaluate interactive video object track segmentation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The memory mechanism could be reused for related video tasks such as object tracking or action segmentation without new modules.
Similar memory additions might help other multimodal models handle sequential data while retaining earlier capabilities.
Real-world tools for interactive video editing or analysis could be built around one model instead of multiple specialized ones.
Longer video sequences would be a natural next test to see if the memory module maintains consistency over extended time.

Load-bearing premise

A single Mask Memory module plus joint training on mixed image and video data suffices for temporal consistency in videos and prevents loss of image-only skills.

What would settle it

Remove the Mask Memory module during training and test whether video masks lose temporal consistency on V-VGD or similar benchmarks, or compare image segmentation scores before and after adding video data to training.

read the original abstract

Multimodal Large Language Models (MLLMs) have demonstrated strong image-level visual understanding and reasoning, yet their pixel-level perception across both images and videos remains limited. Foundation segmentation models such as the SAM series produce high-quality masks, but they rely on low-level visual prompts and cannot natively interpret complex conversational instructions. Existing segmentation MLLMs narrow this gap, but are usually specialized for either images or videos and rarely support both textual and visual prompts in one interface. We introduce X2SAM, a unified segmentation MLLM that extends any-segmentation capabilities from images to videos. Given conversational instructions and visual prompts, X2SAM couples an LLM with a Mask Memory module that stores guided vision features for temporally consistent video mask generation. The same formulation supports generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation across image and video inputs. We further introduce the Video Visual Grounded (V-VGD) segmentation benchmark, which evaluates whether a model can segment object tracks in videos from interactive visual prompts. With a unified joint training strategy over heterogeneous image and video datasets, X2SAM delivers strong video segmentation performance, remains competitive on image segmentation benchmarks, and preserves general image and video chat ability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

X2SAM adds a Mask Memory module and joint image-video training to create a unified segmentation MLLM, but the abstract supplies no quantitative results to support the performance assertions.

read the letter

Colleague, X2SAM tries to put image and video segmentation into one MLLM that takes conversational text plus visual prompts. The Mask Memory module stores guided features to keep masks consistent across frames, and they train jointly on mixed image and video data to hold onto general chat skills while adding video support. They also define a new V-VGD benchmark for video object tracking from interactive prompts. That combination of unified prompts, the memory component, and the benchmark is the concrete step beyond prior SAM variants and image-only segmentation MLLMs. The joint training choice is a straightforward way to handle heterogeneous data without separate fine-tuning stages. The architecture description shows clear thinking about how to extend pixel-level output to dynamic scenes while covering generic, referring, reasoning, and grounded tasks in one interface. The main gap is that the abstract states strong video results and competitive image numbers without any tables, ablations, error bars, or pre/post joint-training comparisons. No mechanics are given for how the memory updates or retrieves features, so it is impossible to judge whether temporal consistency actually holds or whether image performance drops. The stress-test point about the single module plus mixed training being sufficient lands directly on what is shown, since those assumptions carry the central claim and lack reported checks. This paper is for groups working on multimodal interfaces that need pixel outputs for both static and moving content, such as editing tools or basic robotic perception. Readers who want to see how segmentation MLLMs can be extended to video and who value new benchmarks will find the setup useful. I would send it for peer review because the unification goal and the benchmark are worth proper evaluation, even with the current lack of numbers.

Referee Report

4 major / 2 minor

Summary. The paper introduces X2SAM, a unified segmentation MLLM that extends any-segmentation capabilities from images to videos by coupling an LLM with a Mask Memory module storing guided vision features for temporally consistent video mask generation. It supports generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation across image and video inputs. A new Video Visual Grounded (V-VGD) benchmark is proposed, and the model is trained jointly on heterogeneous image and video datasets, claiming strong video segmentation performance while remaining competitive on image benchmarks and preserving general chat ability.

Significance. If the performance claims and architectural sufficiency hold, X2SAM would advance the field by unifying image and video pixel-level perception in a single conversational MLLM, reducing reliance on separate specialized models and enabling more flexible multimodal interfaces. The V-VGD benchmark could provide a useful new evaluation axis for interactive video segmentation.

major comments (4)

[Abstract] Abstract: The central claims that 'X2SAM delivers strong video segmentation performance' and 'remains competitive on image segmentation benchmarks' are asserted without any reported quantitative metrics (e.g., mIoU, J&F scores), ablation results, error bars, or dataset statistics, preventing verification of the headline performance assertions.
[Method (Mask Memory module)] Method description of Mask Memory module: The module is presented as sufficient to achieve temporally consistent video masks via storage of guided vision features, yet no details are given on its update/query mechanics, capacity, or integration with the LLM backbone, leaving the mechanism for temporal consistency unexamined and the sufficiency claim untestable.
[Experiments] Experiments/Results: No ablation is reported that removes the Mask Memory module or compares video temporal consistency with/without it, which is load-bearing for the claim that this single module plus joint training suffices for consistency without task-specific safeguards.
[Experiments] Experiments/Results: No before/after quantitative comparison of image benchmark scores (or chat ability metrics) is provided after joint training on mixed image/video data, leaving the claim of no catastrophic forgetting of image-only capabilities unverified.

minor comments (2)

[Abstract] The list of supported tasks ('generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation') would benefit from brief definitions or examples to clarify distinctions and avoid overlap.
[Introduction/Benchmark] The introduction of the V-VGD benchmark lacks any description of its construction, scale, or evaluation protocol, which would aid reproducibility even in an early manuscript.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We appreciate the acknowledgment of X2SAM's potential to advance unified image-video segmentation in a single conversational MLLM and the value of the proposed V-VGD benchmark. We address each major comment point by point below, indicating revisions where the manuscript will be updated to strengthen clarity and verifiability.

read point-by-point responses

Referee: [Abstract] Abstract: The central claims that 'X2SAM delivers strong video segmentation performance' and 'remains competitive on image segmentation benchmarks' are asserted without any reported quantitative metrics (e.g., mIoU, J&F scores), ablation results, error bars, or dataset statistics, preventing verification of the headline performance assertions.

Authors: We agree that the abstract would be strengthened by including explicit quantitative support for the performance claims. Although detailed results appear in the experiments section, we will revise the abstract to report key metrics such as mIoU and J&F scores on video benchmarks, competitive image benchmark scores, and relevant dataset statistics. This change will make the headline assertions directly verifiable. revision: yes
Referee: [Method (Mask Memory module)] Method description of Mask Memory module: The module is presented as sufficient to achieve temporally consistent video masks via storage of guided vision features, yet no details are given on its update/query mechanics, capacity, or integration with the LLM backbone, leaving the mechanism for temporal consistency unexamined and the sufficiency claim untestable.

Authors: We acknowledge that the current description lacks sufficient technical specificity. The Mask Memory module stores LLM-guided vision features from prior frames in a fixed-capacity buffer and retrieves relevant features via similarity-based querying to condition the current frame's mask prediction. We will expand the method section with precise specifications of the update rule, query mechanism, buffer capacity, feature guidance process, and integration points with the LLM backbone to render the temporal consistency approach fully examinable and testable. revision: yes
Referee: [Experiments] Experiments/Results: No ablation is reported that removes the Mask Memory module or compares video temporal consistency with/without it, which is load-bearing for the claim that this single module plus joint training suffices for consistency without task-specific safeguards.

Authors: We agree that an ablation isolating the Mask Memory module is necessary to substantiate its contribution. We will add this ablation to the experiments section, comparing video segmentation performance (including temporal consistency metrics such as frame-to-frame mask stability and J&F scores) with and without the module while keeping joint training fixed. This will directly test the sufficiency claim. revision: yes
Referee: [Experiments] Experiments/Results: No before/after quantitative comparison of image benchmark scores (or chat ability metrics) is provided after joint training on mixed image/video data, leaving the claim of no catastrophic forgetting of image-only capabilities unverified.

Authors: We recognize that a direct pre/post comparison would provide stronger evidence for the absence of catastrophic forgetting. We will include in the revised experiments section quantitative results on image segmentation benchmarks and chat ability metrics evaluated before and after joint training on the mixed datasets. This will allow verification of preserved image-only capabilities. revision: yes

Circularity Check

0 steps flagged

No circularity; architecture and training claims are self-contained empirical proposals

full rationale

The paper presents X2SAM as a new MLLM architecture coupling an LLM with a Mask Memory module, trained jointly on heterogeneous image/video datasets to support unified segmentation tasks. No equations, fitted parameters, or derivation steps are described that reduce by construction to self-definitions, renamed predictions, or self-citation chains. Central claims rest on standard supervised training and benchmark evaluation rather than tautological reductions; the Mask Memory and joint-training strategy are introduced as design choices without invoking uniqueness theorems or prior self-work as load-bearing justification. This is the normal case of a model paper whose validity is assessed externally via reported metrics, not internally by circular logic.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Abstract-only review; no equations, training details, or component specifications are available to enumerate free parameters, axioms, or invented entities beyond the high-level Mask Memory module.

invented entities (1)

Mask Memory module no independent evidence
purpose: Stores guided vision features to enable temporally consistent mask generation across video frames
Described in abstract as the key coupling mechanism between LLM and segmentation output

pith-pipeline@v0.9.0 · 5530 in / 1169 out tokens · 29764 ms · 2026-05-09T20:42:37.358975+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

68 extracted references · 14 canonical work pages · 8 internal anchors

[1]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InICML, pages 8748–8763. PmLR, 2021

2021
[4]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, pages 4904–4916. PMLR, 2021

2021
[5]

Show, attend and tell: Neural image caption generation with visual attention

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. InICML, pages 2048–2057. PMLR, 2015

2048
[6]

Vqa: Visual question answering

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. InICCV, pages 2425–2433, 2015

2015
[7]

Language-based image editing with recurrent attentive models

Jianbo Chen, Yelong Shen, Jianfeng Gao, Jingjing Liu, and Xiaodong Liu. Language-based image editing with recurrent attentive models. InCVPR, pages 8721–8729, 2018

2018
[8]

Segment anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InICCV, pages 4015–4026, 2023

2023
[9]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024

work page internal anchor Pith review arXiv 2024
[10]

Lisa: Reasoning segmentation via large language model

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. InCVPR, pages 9579–9589, 2024

2024
[11]

Visa: Reasoning video object segmentation via large language models.ECCV, 2024

Cilin Yan, Haochen Wang, Shilin Yan, Xiaolong Jiang, Yao Hu, Guoliang Kang, Weidi Xie, and Efstratios Gavves. Visa: Reasoning video object segmentation via large language models.ECCV, 2024

2024
[12]

One token to seg them all: Language instructed reasoning segmentation in videos.NeurIPS, 2024

Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Lei Liu, Zheng Zhang, and Mike Zheng Shou. One token to seg them all: Language instructed reasoning segmentation in videos.NeurIPS, 2024

2024
[13]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InICML, pages 12888–12900. PMLR, 2022

2022
[15]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, pages 26296–26306, 2024

2024
[16]

Tarvis: A unified approach for target-based video segmentation

Ali Athar, Alexander Hermans, Jonathon Luiten, Deva Ramanan, and Bastian Leibe. Tarvis: A unified approach for target-based video segmentation. InCVPR, pages 18738–18748, 2023

2023
[17]

Oneformer: One transformer to rule universal image segmentation

Jitesh Jain, Jiachen Li, Mang Tik Chiu, Ali Hassani, Nikita Orlov, and Humphrey Shi. Oneformer: One transformer to rule universal image segmentation. InCVPR, pages 2989–2998, 2023

2023
[18]

Omg-seg: Is one model good enough for all segmentation? InCVPR, pages 27948–27959, 2024

Xiangtai Li, Haobo Yuan, Wei Li, Henghui Ding, Size Wu, Wenwei Zhang, Yining Li, Kai Chen, and Chen Change Loy. Omg-seg: Is one model good enough for all segmentation? InCVPR, pages 27948–27959, 2024

2024
[19]

Omg-llava: Bridging image-level, object-level, pixel-level reasoning and understanding.NeurIPS, 37:71737–71767, 2024

Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, and Shuicheng Yan. Omg-llava: Bridging image-level, object-level, pixel-level reasoning and understanding.NeurIPS, 37:71737–71767, 2024

2024
[20]

Temporal memory attention for video semantic segmentation

Hao Wang, Weining Wang, and Jing Liu. Temporal memory attention for video semantic segmentation. InICIP, pages 2254–2258. IEEE, 2021. 24

2021
[21]

Video k-net: A simple, strong, and unified baseline for video segmentation

Xiangtai Li, Wenwei Zhang, Jiangmiao Pang, Kai Chen, Guangliang Cheng, Yunhai Tong, and Chen Change Loy. Video k-net: A simple, strong, and unified baseline for video segmentation. InCVPR, 2022

2022
[22]

X-sam: From segment anything to any segmentation

Hao Wang, Limeng Qiao, Zequn Jie, Zhijian Huang, Chengjian Feng, Qingfang Zheng, Lin Ma, Xiangyuan Lan, and Xiaodan Liang. X-sam: From segment anything to any segmentation. InProceedings of the AAAI Conference on Artificial Intelligence, pages 26187–26196, 2026

2026
[23]

Visual instruction tuning.NeurIPS, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.NeurIPS, 36:34892–34916, 2023

2023
[24]

Llavanext: Improved reasoning, ocr, and world knowledge, 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llavanext: Improved reasoning, ocr, and world knowledge, 2024

2024
[25]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

work page internal anchor Pith review arXiv 2024
[26]

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101, 2024

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101, 2024

2024
[27]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023

work page internal anchor Pith review arXiv 2023
[28]

Glamm: Pixel grounding large multimodal model

Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. InCVPR, pages 13009–13018, 2024

2024
[29]

Videoglamm: A large multimodal model for pixel-level visual grounding in videos.CVPR, 2024

Shehan Munasinghe, Hanan Gani, Wenqi Zhu, Jiale Cao, Eric Xing, Fahad Khan, and Salman Khan. Videoglamm: A large multimodal model for pixel-level visual grounding in videos.CVPR, 2024

2024
[30]

Psalm: Pixelwise segmentation with large multi-modal model

Zheng Zhang, Yeyao Ma, Enming Zhang, and Xiang Bai. Psalm: Pixelwise segmentation with large multi-modal model. InECCV, pages 74–91. Springer, 2024

2024
[31]

Hyperseg: Towards universal visual segmentation with large language model.arXiv preprint arXiv:2411.17606, 2024

Cong Wei, Yujie Zhong, Haoxian Tan, Yong Liu, Zheng Zhao, Jie Hu, and Yujiu Yang. Hyperseg: Towards universal visual segmentation with large language model.arXiv preprint arXiv:2411.17606, 2024

work page arXiv 2024
[32]

arXiv preprint arXiv:2501.04001 (2025)

Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, and Ming-Hsuan Yang. Sa2va: Marrying sam2 with llava for dense grounded understanding of images and videos. arXiv preprint arXiv:2501.04001, 2025

work page arXiv 2025
[33]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Ferret: Refer and ground anything anywhere at any granularity,

Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity.arXiv preprint arXiv:2310.07704, 2023

work page arXiv 2023
[35]

V-net: Fully convolutional neural networks for volumetric medical image segmentation

Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In3DV, pages 565–571. Ieee, 2016

2016
[36]

Improving language understanding by generative pre-training

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. Technical report, OpenAI, San Francisco, CA, USA, 2018

2018
[37]

Focal loss for dense object detection

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, pages 2980–2988, 2017

2017
[38]

Large-scale video panoptic segmentation in the wild: A benchmark

Jiaxu Miao, Xiaohan Wang, Yu Wu, Wei Li, Xu Zhang, Yunchao Wei, and Yi Yang. Large-scale video panoptic segmentation in the wild: A benchmark. InCVPR, pages 21033–21043, 2022. 25

2022
[39]

Vspw: A large-scale dataset for video scene parsing in the wild

Jiaxu Miao, Yunchao Wei, Yu Wu, Chen Liang, Guangrui Li, and Yi Yang. Vspw: A large-scale dataset for video scene parsing in the wild. InCVPR, pages 4133–4143, 2021

2021
[40]

Video instance segmentation

Linjie Yang, Yuchen Fan, and Ning Xu. Video instance segmentation. InICCV, pages 5188–5197, 2019

2019
[41]

Urvos: Unified referring video object segmentation network with a large-scale benchmark

Seonguk Seo, Joon-Young Lee, and Bohyung Han. Urvos: Unified referring video object segmentation network with a large-scale benchmark. InECCV, pages 208–223. Springer, 2020

2020
[42]

Youtube-vos: A large-scale video object segmentation benchmark.arXiv preprint arXiv:1809.03327, 2018

Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas Huang. Youtube- vos: A large-scale video object segmentation benchmark.arXiv preprint arXiv:1809.03327, 2018

work page arXiv 2018
[43]

A benchmark dataset and evaluation methodology for video object segmentation

Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine- Hornung. A benchmark dataset and evaluation methodology for video object segmentation. InCVPR, pages 724–732, 2016

2016
[44]

Video-chatgpt: Towards detailed video understanding via large vision and language models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024

2024
[45]

Gres: Generalized referring expression segmentation

Chang Liu, Henghui Ding, and Xudong Jiang. Gres: Generalized referring expression segmentation. InCVPR, pages 23592–23601, 2023

2023
[46]

Semantic understanding of scenes through the ade20k dataset.IJCV, 127(3):302–321, 2019

Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ade20k dataset.IJCV, 127(3):302–321, 2019

2019
[47]

LoRA: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InICLR, 2022. URLhttps://openreview.net/ forum?id=nZeVKeeFYf9

2022
[48]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[50]

Open-vocabulary panoptic segmentation with text-to-image diffusion models

Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. Open-vocabulary panoptic segmentation with text-to-image diffusion models. InCVPR, pages 2955–2966, 2023

2023
[51]

Uniref++: Segment every reference object in spatial and temporal spaces.ICCV, 2023

Jiannan Wu, Yi Jiang, Bin Yan, Huchuan Lu, Zehuan Yuan, and Ping Luo. Uniref++: Segment every reference object in spatial and temporal spaces.ICCV, 2023

2023
[52]

Unipixel: Unified object referring and segmentation for pixel-level visual reasoning

Ye Liu, Zongyang Ma, Junfu Pu, Zhongang Qi, Yang Wu, Shan Ying, and Chang Wen Chen. Unipixel: Unified object referring and segmentation for pixel-level visual reasoning. InNeurIPS, 2025

2025
[53]

Segment everything everywhere all at once.NeurIPS, 36:19769–19782, 2023

Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, and Yong Jae Lee. Segment everything everywhere all at once.NeurIPS, 36:19769–19782, 2023

2023
[54]

Language as queries for referring video object segmentation

Jiannan Wu, Yi Jiang, Peize Sun, Zehuan Yuan, and Ping Luo. Language as queries for referring video object segmentation. InCVPR, pages 4974–4984, 2022

2022
[55]

Mme: A comprehensive evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2404.08506, 2024

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2404.08506, 2024

work page arXiv 2024
[56]

Mmbench: Is your multi-modal model an all-around player? InECCV, pages 216–233

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InECCV, pages 216–233. Springer, 2024

2024
[57]

Seed-bench: Benchmarking multimodal large language models

Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench: Benchmarking multimodal large language models. InCVPR, pages 13299–13308, 2024

2024
[58]

Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models.arXiv preprint arXiv:2305.10355, 2023

work page internal anchor Pith review arXiv 2023
[59]

A diagram is worth a dozen images

Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. InComputer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 235–251. Springer, 2016. 26

2016
[60]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InECCV, pages 740–755. Springer, 2014

2014
[61]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InCVPR, pages 24108–24118, 2025

2025
[62]

Mvbench: A comprehensive multi-modal video understanding benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. InCVPR, pages 22195–22206, 2024

2024
[63]

Mlvu: Benchmarking multi-task long video understanding

Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, et al. Mlvu: Benchmarking multi-task long video understanding. InCVPR, pages 13691–13701, 2025

2025
[64]

Longvideobench: A benchmark for long-context interleaved video-language understanding.NeurIPS, 37:28828–28857, 2024

Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding.NeurIPS, 37:28828–28857, 2024

2024
[65]

Masked-attention mask transformer for universal image segmentation

Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. InCVPR, pages 1290–1299, 2022

2022
[66]

Cris: Clip-driven referring image segmentation

Zhaoqing Wang, Yu Lu, Qiang Li, Xunqiang Tao, Yandong Guo, Mingming Gong, and Tongliang Liu. Cris: Clip-driven referring image segmentation. InCVPR, pages 11686–11695, 2022

2022
[67]

PG-Video-LLaV A: Pixel Grounding Large Video-Language Models.ArXiv 2311.13435, 2023

Shehan Munasinghe, Rusiru Thushara, Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, Mubarak Shah, and Fahad Khan. Pg-video-llava: Pixel grounding large video-language models.arXiv preprint arXiv:2311.13435, 2023

work page arXiv 2023
[68]

Videochat: Chat-centric video understanding.Science China Information Sciences, 68(10):200102, 2025

KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding.Science China Information Sciences, 68(10):200102, 2025

2025
[69]

Chat-univi: Unified visual representation empowers large language models with image and video understanding

Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. InCVPR, pages 13700–13710, 2024

2024
[70]

Vila: On pre-training for visual language models

Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. InCVPR, pages 26689–26699, 2024. 27

2024