X2SAM: Any Segmentation in Images and Videos
Pith reviewed 2026-05-09 20:42 UTC · model grok-4.3
The pith
X2SAM unifies any-segmentation for images and videos in one multimodal model via a Mask Memory module.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
X2SAM extends any-segmentation capabilities from images to videos by coupling an LLM with a Mask Memory module that stores guided vision features for temporally consistent video mask generation. The same formulation supports generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation across image and video inputs. With a unified joint training strategy over heterogeneous image and video datasets, X2SAM delivers strong video segmentation performance, remains competitive on image segmentation benchmarks, and preserves general image and video chat ability. It introduces the Video Visual Grounded (V-VGD) segmentation benchmark.
What carries the argument
The Mask Memory module: it stores guided vision features so that masks generated from text and visual prompts remain temporally consistent across video frames.
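To make the load-bearing mechanism concrete, below is a minimal sketch of what such a module could look like, assuming a fixed-capacity FIFO buffer over per-frame features and an attention-style similarity read (mechanics the simulated rebuttal below also gestures at). The class name, capacity, and readout are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch only: the paper does not publish these mechanics.
import torch
import torch.nn.functional as F


class MaskMemory:
    """Fixed-capacity FIFO buffer of guided vision features with
    similarity-based reads, one write per processed video frame."""

    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self.frames: list[torch.Tensor] = []  # each entry: (tokens, dim)

    def write(self, guided_feats: torch.Tensor) -> None:
        # Store the current frame's guided features; evict the oldest
        # entry once capacity is exceeded (FIFO update rule, assumed).
        self.frames.append(guided_feats.detach())
        if len(self.frames) > self.capacity:
            self.frames.pop(0)

    def read(self, query_feats: torch.Tensor) -> torch.Tensor:
        # Similarity-based query: current-frame tokens attend over all
        # stored tokens, biasing the mask decoder toward regions that
        # matched the target in earlier frames.
        if not self.frames:
            return query_feats                  # first frame: no memory yet
        mem = torch.cat(self.frames, dim=0)     # (total_tokens, dim)
        scores = query_feats @ mem.T / mem.shape[-1] ** 0.5
        return query_feats + F.softmax(scores, dim=-1) @ mem  # residual readout


# Per-frame loop: condition each frame's prediction on the memory, then
# write the features back for later frames. Stand-in data for illustration.
video_features = [torch.randn(16, 256) for _ in range(4)]
memory = MaskMemory(capacity=8)
for frame_feats in video_features:
    conditioned = memory.read(frame_feats)      # would feed a mask decoder
    memory.write(conditioned)
```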
If this is right
- The model supports both textual and visual prompts for segmentation in one interface across images and videos.
- After joint training, video segmentation reaches strong performance without task-specific fine-tuning.
- Image segmentation benchmarks remain competitive with no reported loss from the added video training.
- General multimodal chat ability is preserved alongside the segmentation functions.
- The V-VGD benchmark provides a new way to evaluate interactive segmentation of object tracks in videos (a sketch of what such an evaluation could look like follows this list).
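Since the abstract leaves V-VGD's protocol unspecified, the following is only one plausible shape for such an evaluation, assuming a first-frame visual prompt and per-frame ground-truth masks; `model` is a hypothetical callable, and contour-quality (F) scoring is omitted.

```python
# Illustrative only: V-VGD's construction and metrics are not described.
import numpy as np


def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity (J) between two boolean masks."""
    union = np.logical_or(pred, gt).sum()
    inter = np.logical_and(pred, gt).sum()
    return inter / union if union > 0 else 1.0


def evaluate_track(model, frames, first_frame_prompt, gt_masks) -> float:
    """Mean per-frame J for one object track segmented from one visual prompt."""
    pred_masks = model(frames, prompt=first_frame_prompt)  # one mask per frame
    return float(np.mean([iou(p, g) for p, g in zip(pred_masks, gt_masks)]))
```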
Where Pith is reading between the lines
- The memory mechanism could be reused for related video tasks such as object tracking or action segmentation without new modules.
- Similar memory additions might help other multimodal models handle sequential data while retaining earlier capabilities.
- Real-world tools for interactive video editing or analysis could be built around one model instead of multiple specialized ones.
- Longer video sequences would be a natural next test to see if the memory module maintains consistency over extended time.
Load-bearing premise
A single Mask Memory module plus joint training on mixed image and video data suffices for temporal consistency in videos and prevents loss of image-only skills.
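The joint-training side of this premise is under-specified in the abstract; a common way to realize it is weighted interleaved sampling over the image and video sources, sketched below. Dataset names and mixing weights are placeholders, not the paper's recipe.

```python
# Assumed recipe: weighted interleaving of heterogeneous batch sources.
import random


def mixed_batches(sources: dict, weights: dict, num_steps: int):
    """Yield (source_name, batch) pairs, sampling the source each step."""
    names = list(sources)
    probs = [weights[n] for n in names]
    for _ in range(num_steps):
        name = random.choices(names, weights=probs, k=1)[0]
        yield name, next(sources[name])  # each value is a batch iterator
```

For instance, `sources` might map "referring_images" and "youtube_vos" to batch iterators, with weights tuned so video data does not crowd out the image tasks; whether X2SAM mixes this way or uses a curriculum is not stated.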
What would settle it
Remove the Mask Memory module during training and test whether video masks lose temporal consistency on V-VGD or similar benchmarks, or compare image segmentation scores before and after adding video data to training.
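A hedged sketch of the first test, assuming mask sequences are available from two otherwise-identical model variants; consecutive-frame IoU is used as a crude stability proxy (DAVIS-style J&F with flow warping would be stricter), and both model handles are hypothetical.

```python
# Sketch of the proposed ablation: temporal stability with vs. without memory.
import numpy as np


def temporal_stability(masks: list) -> float:
    """Mean IoU between consecutive boolean masks of one predicted track."""
    scores = []
    for prev, curr in zip(masks, masks[1:]):
        union = np.logical_or(prev, curr).sum()
        inter = np.logical_and(prev, curr).sum()
        scores.append(inter / union if union > 0 else 1.0)
    return float(np.mean(scores))


def ablate_memory(video, prompt, with_memory_model, no_memory_model):
    """Return (with, without) stability; a large gap supports the premise."""
    return (temporal_stability(with_memory_model(video, prompt)),
            temporal_stability(no_memory_model(video, prompt)))
```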
Original abstract
Multimodal Large Language Models (MLLMs) have demonstrated strong image-level visual understanding and reasoning, yet their pixel-level perception across both images and videos remains limited. Foundation segmentation models such as the SAM series produce high-quality masks, but they rely on low-level visual prompts and cannot natively interpret complex conversational instructions. Existing segmentation MLLMs narrow this gap, but are usually specialized for either images or videos and rarely support both textual and visual prompts in one interface. We introduce X2SAM, a unified segmentation MLLM that extends any-segmentation capabilities from images to videos. Given conversational instructions and visual prompts, X2SAM couples an LLM with a Mask Memory module that stores guided vision features for temporally consistent video mask generation. The same formulation supports generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation across image and video inputs. We further introduce the Video Visual Grounded (V-VGD) segmentation benchmark, which evaluates whether a model can segment object tracks in videos from interactive visual prompts. With a unified joint training strategy over heterogeneous image and video datasets, X2SAM delivers strong video segmentation performance, remains competitive on image segmentation benchmarks, and preserves general image and video chat ability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces X2SAM, a unified segmentation MLLM that extends any-segmentation capabilities from images to videos by coupling an LLM with a Mask Memory module storing guided vision features for temporally consistent video mask generation. It supports generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation across image and video inputs. A new Video Visual Grounded (V-VGD) benchmark is proposed, and the model is trained jointly on heterogeneous image and video datasets, claiming strong video segmentation performance while remaining competitive on image benchmarks and preserving general chat ability.
Significance. If the performance claims and architectural sufficiency hold, X2SAM would advance the field by unifying image and video pixel-level perception in a single conversational MLLM, reducing reliance on separate specialized models and enabling more flexible multimodal interfaces. The V-VGD benchmark could provide a useful new evaluation axis for interactive video segmentation.
major comments (4)
- [Abstract] The central claims that 'X2SAM delivers strong video segmentation performance' and 'remains competitive on image segmentation benchmarks' are asserted without quantitative metrics (e.g., mIoU, J&F scores), ablation results, error bars, or dataset statistics, preventing verification of the headline performance assertions.
- [Method] The Mask Memory module is presented as sufficient for temporally consistent video masks via storage of guided vision features, yet no details are given on its update/query mechanics, capacity, or integration with the LLM backbone, leaving the mechanism for temporal consistency unexamined and the sufficiency claim untestable.
- [Experiments] No ablation removes the Mask Memory module or compares video temporal consistency with and without it, which is load-bearing for the claim that this single module plus joint training suffices for consistency without task-specific safeguards.
- [Experiments] No before/after quantitative comparison of image benchmark scores (or chat-ability metrics) is provided after joint training on mixed image/video data, leaving the claim of no catastrophic forgetting of image-only capabilities unverified.
minor comments (2)
- [Abstract] The list of supported tasks ('generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation') would benefit from brief definitions or examples to clarify distinctions and avoid overlap.
- [Introduction/Benchmark] The introduction of the V-VGD benchmark lacks any description of its construction, scale, or evaluation protocol, which would aid reproducibility even in an early manuscript.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We appreciate the acknowledgment of X2SAM's potential to advance unified image-video segmentation in a single conversational MLLM and the value of the proposed V-VGD benchmark. We address each major comment point by point below, indicating revisions where the manuscript will be updated to strengthen clarity and verifiability.
Point-by-point responses
- Referee: [Abstract] The central claims that 'X2SAM delivers strong video segmentation performance' and 'remains competitive on image segmentation benchmarks' are asserted without quantitative metrics (e.g., mIoU, J&F scores), ablation results, error bars, or dataset statistics, preventing verification of the headline performance assertions.
  Authors: We agree that the abstract would be strengthened by explicit quantitative support for the performance claims. Although detailed results appear in the experiments section, we will revise the abstract to report key metrics such as mIoU and J&F scores on video benchmarks, competitive image benchmark scores, and relevant dataset statistics. This change will make the headline assertions directly verifiable. Revision: yes.
- Referee: [Method] The Mask Memory module is presented as sufficient for temporally consistent video masks via storage of guided vision features, yet no details are given on its update/query mechanics, capacity, or integration with the LLM backbone, leaving the mechanism for temporal consistency unexamined and the sufficiency claim untestable.
  Authors: We acknowledge that the current description lacks technical specificity. The Mask Memory module stores LLM-guided vision features from prior frames in a fixed-capacity buffer and retrieves relevant features via similarity-based querying to condition the current frame's mask prediction. We will expand the method section with precise specifications of the update rule, query mechanism, buffer capacity, feature guidance process, and integration points with the LLM backbone, rendering the temporal consistency approach fully examinable and testable. Revision: yes.
- Referee: [Experiments] No ablation removes the Mask Memory module or compares video temporal consistency with and without it, which is load-bearing for the claim that this single module plus joint training suffices for consistency without task-specific safeguards.
  Authors: We agree that an ablation isolating the Mask Memory module is necessary to substantiate its contribution. We will add this ablation to the experiments section, comparing video segmentation performance (including temporal consistency metrics such as frame-to-frame mask stability and J&F scores) with and without the module while keeping joint training fixed. This will directly test the sufficiency claim. Revision: yes.
- Referee: [Experiments] No before/after quantitative comparison of image benchmark scores (or chat-ability metrics) is provided after joint training on mixed image/video data, leaving the claim of no catastrophic forgetting of image-only capabilities unverified.
  Authors: We recognize that a direct pre/post comparison would provide stronger evidence against catastrophic forgetting. The revised experiments section will include quantitative results on image segmentation benchmarks and chat-ability metrics evaluated before and after joint training on the mixed datasets, allowing verification that image-only capabilities are preserved. Revision: yes.
Circularity Check
No circularity; architecture and training claims are self-contained empirical proposals
Full rationale
The paper presents X2SAM as a new MLLM architecture coupling an LLM with a Mask Memory module, trained jointly on heterogeneous image/video datasets to support unified segmentation tasks. No equations, fitted parameters, or derivation steps are described that reduce by construction to self-definitions, renamed predictions, or self-citation chains. The central claims rest on standard supervised training and benchmark evaluation rather than tautological reductions; the Mask Memory module and the joint-training strategy are introduced as design choices without invoking uniqueness theorems or the authors' prior work as load-bearing justification. This is the normal case of a model paper whose validity is assessed externally via reported metrics, not internally by circular logic.
Axiom & Free-Parameter Ledger
invented entities (1)
- Mask Memory module: no independent evidence
Reference graph
Works this paper leans on
- [1] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
- [2] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- [3] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021.
- [4] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, pages 4904–4916. PMLR, 2021.
- [5] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, pages 2048–2057. PMLR, 2015.
- [6] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In ICCV, pages 2425–2433, 2015.
- [7] Jianbo Chen, Yelong Shen, Jianfeng Gao, Jingjing Liu, and Xiaodong Liu. Language-based image editing with recurrent attentive models. In CVPR, pages 8721–8729, 2018.
- [8] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In ICCV, pages 4015–4026, 2023.
- [9] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.
- [10] Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. LISA: Reasoning segmentation via large language model. In CVPR, pages 9579–9589, 2024.
- [11] Cilin Yan, Haochen Wang, Shilin Yan, Xiaolong Jiang, Yao Hu, Guoliang Kang, Weidi Xie, and Efstratios Gavves. VISA: Reasoning video object segmentation via large language models. ECCV, 2024.
- [12] Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Lei Liu, Zheng Zhang, and Mike Zheng Shou. One token to seg them all: Language instructed reasoning segmentation in videos. NeurIPS, 2024.
- [13] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, pages 12888–12900. PMLR, 2022.
- [15] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, pages 26296–26306, 2024.
- [16] Ali Athar, Alexander Hermans, Jonathon Luiten, Deva Ramanan, and Bastian Leibe. TarViS: A unified approach for target-based video segmentation. In CVPR, pages 18738–18748, 2023.
- [17] Jitesh Jain, Jiachen Li, Mang Tik Chiu, Ali Hassani, Nikita Orlov, and Humphrey Shi. OneFormer: One transformer to rule universal image segmentation. In CVPR, pages 2989–2998, 2023.
- [18] Xiangtai Li, Haobo Yuan, Wei Li, Henghui Ding, Size Wu, Wenwei Zhang, Yining Li, Kai Chen, and Chen Change Loy. OMG-Seg: Is one model good enough for all segmentation? In CVPR, pages 27948–27959, 2024.
- [19] Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, and Shuicheng Yan. OMG-LLaVA: Bridging image-level, object-level, pixel-level reasoning and understanding. NeurIPS, 37:71737–71767, 2024.
- [20] Hao Wang, Weining Wang, and Jing Liu. Temporal memory attention for video semantic segmentation. In ICIP, pages 2254–2258. IEEE, 2021.
- [21] Xiangtai Li, Wenwei Zhang, Jiangmiao Pang, Kai Chen, Guangliang Cheng, Yunhai Tong, and Chen Change Loy. Video K-Net: A simple, strong, and unified baseline for video segmentation. In CVPR, 2022.
- [22] Hao Wang, Limeng Qiao, Zequn Jie, Zhijian Huang, Chengjian Feng, Qingfang Zheng, Lin Ma, Xiangyuan Lan, and Xiaodan Liang. X-SAM: From segment anything to any segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 26187–26196, 2026.
- [23] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 36:34892–34916, 2023.
- [24] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024.
- [25] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.
- [26] Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences, 67(12):220101, 2024.
- [27] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.
- [28] Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. GLaMM: Pixel grounding large multimodal model. In CVPR, pages 13009–13018, 2024.
- [29] Shehan Munasinghe, Hanan Gani, Wenqi Zhu, Jiale Cao, Eric Xing, Fahad Khan, and Salman Khan. VideoGLaMM: A large multimodal model for pixel-level visual grounding in videos. CVPR, 2024.
- [30] Zheng Zhang, Yeyao Ma, Enming Zhang, and Xiang Bai. PSALM: Pixelwise segmentation with large multi-modal model. In ECCV, pages 74–91. Springer, 2024.
- [31] Cong Wei, Yujie Zhong, Haoxian Tan, Yong Liu, Zheng Zhao, Jie Hu, and Yujiu Yang. HyperSeg: Towards universal visual segmentation with large language model. arXiv preprint arXiv:2411.17606, 2024.
- [32] Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, and Ming-Hsuan Yang. Sa2VA: Marrying SAM2 with LLaVA for dense grounded understanding of images and videos. arXiv preprint arXiv:2501.04001, 2025.
- [33] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ... arXiv, 2025.
- [34] Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity. arXiv preprint arXiv:2310.07704, 2023.
- [35] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In 3DV, pages 565–571. IEEE, 2016.
- [36] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. Technical report, OpenAI, San Francisco, CA, USA, 2018.
- [37] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, pages 2980–2988, 2017.
- [38] Jiaxu Miao, Xiaohan Wang, Yu Wu, Wei Li, Xu Zhang, Yunchao Wei, and Yi Yang. Large-scale video panoptic segmentation in the wild: A benchmark. In CVPR, pages 21033–21043, 2022.
- [39] Jiaxu Miao, Yunchao Wei, Yu Wu, Chen Liang, Guangrui Li, and Yi Yang. VSPW: A large-scale dataset for video scene parsing in the wild. In CVPR, pages 4133–4143, 2021.
- [40] Linjie Yang, Yuchen Fan, and Ning Xu. Video instance segmentation. In ICCV, pages 5188–5197, 2019.
- [41] Seonguk Seo, Joon-Young Lee, and Bohyung Han. URVOS: Unified referring video object segmentation network with a large-scale benchmark. In ECCV, pages 208–223. Springer, 2020.
- [42] Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas Huang. YouTube-VOS: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327, 2018.
- [43] Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, pages 724–732, 2016.
- [44] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-ChatGPT: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024.
- [45] Chang Liu, Henghui Ding, and Xudong Jiang. GRES: Generalized referring expression segmentation. In CVPR, pages 23592–23601, 2023.
- [46] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. IJCV, 127(3):302–321, 2019.
- [47] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9.
- [48] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- [50] Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In CVPR, pages 2955–2966, 2023.
- [51] Jiannan Wu, Yi Jiang, Bin Yan, Huchuan Lu, Zehuan Yuan, and Ping Luo. UniRef++: Segment every reference object in spatial and temporal spaces. ICCV, 2023.
- [52] Ye Liu, Zongyang Ma, Junfu Pu, Zhongang Qi, Yang Wu, Shan Ying, and Chang Wen Chen. UniPixel: Unified object referring and segmentation for pixel-level visual reasoning. In NeurIPS, 2025.
- [53] Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, and Yong Jae Lee. Segment everything everywhere all at once. NeurIPS, 36:19769–19782, 2023.
- [54] Jiannan Wu, Yi Jiang, Peize Sun, Zehuan Yuan, and Ping Luo. Language as queries for referring video object segmentation. In CVPR, pages 4974–4984, 2022.
- [55] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2404.08506, 2024.
- [56] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? In ECCV, pages 216–233. Springer, 2024.
- [57] Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. SEED-Bench: Benchmarking multimodal large language models. In CVPR, pages 13299–13308, 2024.
- [58] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023.
- [59] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, pages 235–251. Springer, 2016.
- [60] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755. Springer, 2014.
- [61] Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. In CVPR, pages 24108–24118, 2025.
- [62] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVBench: A comprehensive multi-modal video understanding benchmark. In CVPR, pages 22195–22206, 2024.
- [63] Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, et al. MLVU: Benchmarking multi-task long video understanding. In CVPR, pages 13691–13701, 2025.
- [64] Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. LongVideoBench: A benchmark for long-context interleaved video-language understanding. NeurIPS, 37:28828–28857, 2024.
- [65] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In CVPR, pages 1290–1299, 2022.
- [66] Zhaoqing Wang, Yu Lu, Qiang Li, Xunqiang Tao, Yandong Guo, Mingming Gong, and Tongliang Liu. CRIS: CLIP-driven referring image segmentation. In CVPR, pages 11686–11695, 2022.
- [67] Shehan Munasinghe, Rusiru Thushara, Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, Mubarak Shah, and Fahad Khan. PG-Video-LLaVA: Pixel grounding large video-language models. arXiv preprint arXiv:2311.13435, 2023.
- [68] KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. VideoChat: Chat-centric video understanding. Science China Information Sciences, 68(10):200102, 2025.
- [69] Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-UniVi: Unified visual representation empowers large language models with image and video understanding. In CVPR, pages 13700–13710, 2024.
- [70] Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. VILA: On pre-training for visual language models. In CVPR, pages 26689–26699, 2024.