Online Reasoning Video Object Segmentation
Pith reviewed 2026-05-10 15:24 UTC · model grok-4.3
The pith
Reasoning video object segmentation must run causally using only past and current frames while tracking shifting referents in language queries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that reasoning video object segmentation requires strictly causal operation: models must interpret queries and produce masks incrementally from past and current frames alone, without revisiting earlier outputs or accessing future frames, while correctly handling referent shifts that occur as the video progresses. The authors support this by constructing ORVOSB with frame-level causal annotations and referent-shift labels, and by releasing a baseline whose continually updated prompts and temporal token reservoir allow bounded long-horizon reasoning.
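The constraint is easiest to see as an inference loop. The sketch below is a minimal illustration of the protocol, not the paper's code: the `OnlineSegmenter` class, its `step` method, and `run_online` are hypothetical stand-ins for any causal model.

```python
# Minimal sketch of the strictly causal (online) protocol described above.
# All names (OnlineSegmenter, step, run_online) are hypothetical, not the paper's API.
from typing import Iterable, List
import numpy as np


class OnlineSegmenter:
    """Stand-in for a causal model: it may keep internal state, but it never
    sees future frames and never revises masks it has already emitted."""

    def __init__(self, query: str):
        self.query = query
        self.state = None  # whatever bounded memory the model maintains

    def step(self, frame: np.ndarray) -> np.ndarray:
        # Update internal state from the current frame only, then predict a mask.
        # (Stubbed out here; a real model would run its network.)
        return np.zeros(frame.shape[:2], dtype=bool)


def run_online(frames: Iterable[np.ndarray], query: str) -> List[np.ndarray]:
    model = OnlineSegmenter(query)
    masks = []
    for frame in frames:          # frames arrive one at a time, in order
        mask = model.step(frame)  # decision uses past and current frames only
        masks.append(mask)        # emitted masks are final: no retrospective edits
    return masks
```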
What carries the argument
A baseline architecture that maintains continually updated segmentation prompts together with a structured temporal token reservoir for efficient long-horizon causal reasoning.
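This review does not spell out the reservoir's internals, so the following is only an illustrative sketch of one way a bounded token reservoir and a continually updated prompt could fit together: a fixed number of slots with oldest-first eviction and an exponential-moving-average prompt. The class names, the pooling, and the eviction rule are assumptions, not the authors' design.

```python
# Illustrative sketch (not the paper's implementation) of a bounded temporal
# token reservoir: memory holds at most `capacity` entries no matter how many
# frames have been processed, so cost does not grow with video length.
from collections import deque
import numpy as np


class TokenReservoir:
    def __init__(self, capacity: int = 32):
        self.capacity = capacity
        self.slots = deque(maxlen=capacity)  # oldest frame summaries drop out first

    def add(self, frame_tokens: np.ndarray) -> None:
        # Pool the per-frame tokens (shape [N, d]) into one summary vector.
        self.slots.append(frame_tokens.mean(axis=0))

    def read(self) -> np.ndarray:
        # What the segmentation head would attend over: at most `capacity` vectors.
        return np.stack(self.slots)


def update_prompt(prev_prompt: np.ndarray, reservoir: TokenReservoir,
                  momentum: float = 0.9) -> np.ndarray:
    """Hypothetical 'continually updated segmentation prompt': an exponential
    moving average of the reservoir summary, letting the query interpretation
    drift as referents shift."""
    if not reservoir.slots:               # nothing observed yet
        return prev_prompt
    summary = reservoir.read().mean(axis=0)
    return momentum * prev_prompt + (1.0 - momentum) * summary
```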
If this is right
- Existing offline methods lose accuracy when future frames are withheld and when queries change reference during the video.
- Any successful model must maintain and update its interpretation of the query across time without retrospective correction.
- Benchmarks for video segmentation now need explicit causal constraints and referent-shift labels at the frame level.
- Long video sequences require memory mechanisms whose size remains bounded even as the number of frames grows (a rough cost comparison follows this list).
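To make the last bullet concrete, suppose each frame contributes N visual tokens of dimension d; the symbols below are generic, not the paper's notation. Caching every frame's tokens grows linearly with the number of frames T, while a reservoir capped at K slots does not:

```latex
% Illustrative memory comparison; T, N, d, K are generic symbols, not the paper's notation.
\underbrace{\mathcal{O}(T \cdot N \cdot d)}_{\text{cache all past frames}}
\quad\text{vs.}\quad
\underbrace{\mathcal{O}(K \cdot N \cdot d)}_{\text{reservoir of } K \text{ slots}},
\qquad K \ll T .
```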
Where Pith is reading between the lines
- Real-time systems such as live monitoring or autonomous navigation would gain immediate usability from causal models that never wait for the end of a clip.
- The temporal token reservoir idea could be combined with hierarchical memory to scale to hour-long videos while preserving causality.
- Similar causal constraints likely appear in other sequential reasoning tasks such as online video captioning or action anticipation.
Load-bearing premise
The ORVOSB benchmark's frame-level causal annotations and referent-shift labels sufficiently represent the distribution of real-world online queries and video content.
What would settle it
An existing offline method that, with only minor adaptation, achieves substantially higher accuracy on ORVOSB while still obeying strict causality would falsify the claim that current approaches cannot be moved to the online regime without major redesign.
Original abstract
Reasoning video object segmentation predicts pixel-level masks in videos from natural-language queries that may involve implicit and temporally grounded references. However, existing methods are developed and evaluated in an offline regime, where the entire video is available at inference time and future frames can be exploited for retrospective disambiguation, deviating from real-world deployments that require strictly causal, frame-by-frame decisions. We study Online Reasoning Video Object Segmentation (ORVOS), where models must incrementally interpret queries using only past and current frames without revisiting previous predictions, while handling referent shifts as events unfold. To support evaluation, we introduce ORVOSB, a benchmark with frame-level causal annotations and referent-shift labels, comprising 210 videos, 12,907 annotated frames, and 512 queries across five reasoning categories. We further propose a baseline with continually-updated segmentation prompts and a structured temporal token reservoir for long-horizon reasoning under bounded computation. Experiments show that existing methods struggle under strict causality and referent shifts, while our baseline establishes a strong foundation for future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Online Reasoning Video Object Segmentation (ORVOS), a task requiring strictly causal, frame-by-frame pixel-level mask prediction from natural-language queries that may contain implicit temporal references and referent shifts. It presents the ORVOSB benchmark (210 videos, 12,907 frames, 512 queries with frame-level causal annotations and referent-shift labels across five reasoning categories) and proposes a baseline using continually-updated segmentation prompts plus a structured temporal token reservoir for bounded long-horizon reasoning. Experiments claim that existing offline methods struggle under these constraints while the baseline establishes a strong foundation for future work.
Significance. If the empirical claims hold, the work is significant for exposing the gap between offline video object segmentation methods and real-world causal deployments, while providing a new benchmark and baseline to standardize evaluation of referent-shift handling. The emphasis on bounded computation in the baseline is a practical strength. Impact depends on rigorous validation that performance gaps are not artifacts of the benchmark distribution.
major comments (2)
- [Benchmark section] Benchmark section (ORVOSB construction): The central claim that existing methods 'struggle under strict causality and referent shifts' rests on ORVOSB being representative, yet no quantitative comparisons are provided for query linguistic complexity, referent-shift frequency, video duration statistics, or causal annotation consistency against larger real-world online video-query corpora. This is load-bearing, as over-representation of short clips or simple shifts could artifactually inflate observed gaps.
- [Experiments section] Experiments section: The abstract asserts that experiments demonstrate struggles for prior methods and a 'strong foundation' for the baseline, but the manuscript provides no error bars, statistical tests, or ablations on the temporal token reservoir's contribution to long-horizon performance. Without these, the strength of the empirical support for the task definition cannot be verified.
minor comments (2)
- [Abstract] Abstract: The five reasoning categories are mentioned but not enumerated; listing them would improve immediate clarity for readers.
- [Method] Notation: The term 'structured temporal token reservoir' is introduced without a precise definition or pseudocode in the early sections; a small diagram or equation would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, indicating the revisions we will incorporate.
Point-by-point responses
Referee: [Benchmark section] Benchmark section (ORVOSB construction): The central claim that existing methods 'struggle under strict causality and referent shifts' rests on ORVOSB being representative, yet no quantitative comparisons are provided for query linguistic complexity, referent-shift frequency, video duration statistics, or causal annotation consistency against larger real-world online video-query corpora. This is load-bearing, as over-representation of short clips or simple shifts could artifactually inflate observed gaps.
Authors: We agree that additional context on ORVOSB's characteristics is important to support the representativeness of the claims. As the first benchmark providing frame-level causal annotations and referent-shift labels for this task, direct equivalents do not exist. In the revision we will add quantitative statistics on query linguistic complexity, referent-shift frequency, and video duration distributions, along with comparisons to published figures from existing video-query benchmarks such as Refer-YouTube-VOS. We will also report inter-annotator agreement for the causal annotations. Direct comparison of causal annotation consistency is not possible with prior datasets, which we will explicitly note as a limitation. revision: partial
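As a concrete illustration of the statistics promised in this response, the snippet below computes referent-shift frequency, duration distributions, and a crude query-complexity proxy from hypothetical per-query annotation records; the record fields are invented for this sketch and are not ORVOSB's actual schema.

```python
# Hedged sketch: the kind of summary statistics a benchmark release might report.
# The record layout (query, num_frames, fps, shift_frames) is hypothetical.
from dataclasses import dataclass
from statistics import mean, median
from typing import List


@dataclass
class QueryRecord:
    query: str
    num_frames: int
    fps: float
    shift_frames: List[int]  # frame indices where the referent changes


def summarize(records: List[QueryRecord]) -> dict:
    durations = [r.num_frames / r.fps for r in records]
    shifts = [len(r.shift_frames) for r in records]
    query_lengths = [len(r.query.split()) for r in records]  # crude complexity proxy
    return {
        "num_queries": len(records),
        "mean_duration_s": mean(durations),
        "median_duration_s": median(durations),
        "mean_shifts_per_query": mean(shifts),
        "frac_queries_with_shift": sum(s > 0 for s in shifts) / len(records),
        "mean_query_length_words": mean(query_lengths),
    }
```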
Referee: [Experiments section] Experiments section: The abstract asserts that experiments demonstrate struggles for prior methods and a 'strong foundation' for the baseline, but the manuscript provides no error bars, statistical tests, or ablations on the temporal token reservoir's contribution to long-horizon performance. Without these, the strength of the empirical support for the task definition cannot be verified.
Authors: We appreciate the emphasis on empirical rigor. The revised manuscript will include error bars for all reported metrics, statistical significance tests comparing method performances, and dedicated ablations isolating the temporal token reservoir's role in long-horizon reasoning. These additions will provide clearer validation of the performance gaps and the baseline's contributions. revision: yes
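One standard way to deliver the promised error bars and significance tests is a paired bootstrap over per-video scores. The sketch below assumes two aligned arrays of hypothetical per-video J&F values for two methods; it is a generic recipe, not the authors' evaluation code.

```python
# Hedged sketch: paired bootstrap for the difference in mean per-video scores.
# scores_a / scores_b are hypothetical per-video J&F values for two methods,
# aligned so that index i refers to the same video in both arrays.
import numpy as np


def paired_bootstrap(scores_a: np.ndarray, scores_b: np.ndarray,
                     n_resamples: int = 10_000, seed: int = 0):
    rng = np.random.default_rng(seed)
    diffs = scores_a - scores_b
    n = len(diffs)
    boot_means = np.array([
        diffs[rng.integers(0, n, size=n)].mean() for _ in range(n_resamples)
    ])
    ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])  # 95% interval
    # Two-sided p-value: how often the resampled mean difference crosses zero.
    p_value = min(1.0, 2 * min((boot_means <= 0).mean(), (boot_means >= 0).mean()))
    return diffs.mean(), (ci_low, ci_high), p_value
```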
Circularity Check
No circularity: new task definition and benchmark with independent baseline
full rationale
The paper defines a new task (ORVOS) requiring strictly causal, frame-by-frame processing with referent shifts, introduces the ORVOSB benchmark (210 videos, 12,907 frames, 512 queries with causal annotations and shift labels), and proposes a baseline using continually-updated prompts and temporal token reservoir. No equations, fitted parameters, or derivations are present that reduce to self-definition or self-citation. Central claims rest on experimental comparisons showing existing offline methods struggle on the new benchmark, which is externally falsifiable via the released annotations and does not rely on prior author results for uniqueness or ansatz. This is a standard task/benchmark contribution with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Real-world video deployments require strictly causal, frame-by-frame decisions without access to future frames.
Reference graph
Works this paper leans on
- [1] Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. NeurIPS 35, 23716–23736 (2022)
- [2] Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al.: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)
- [3] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-VL technical report. arXiv preprint arXiv:2511.21631 (2025)
- [4] Bai, Z., He, T., Mei, H., Wang, P., Gao, Z., Chen, J., Zhang, Z., Shou, M.Z.: One token to seg them all: Language instructed reasoning segmentation in videos. NeurIPS 37, 6833–6859 (2024)
- [5] Botach, A., Zheltonozhskii, E., Baskin, C.: End-to-end referring video object segmentation with multimodal transformers. In: CVPR. pp. 4985–4995 (2022)
- [6] Bothra, C., Gao, J., Rao, S., Ribeiro, B.: Veritas: Answering causal queries from video streaming traces. In: SIGCOMM. pp. 738–753 (2023)
- [7] Caesar, H., Uijlings, J., Ferrari, V.: COCO-Stuff: Thing and stuff classes in context. In: CVPR. pp. 1209–1218 (2018)
- [8] Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., et al.: SAM 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719 (2025)
- [9] Chen, J., Lv, Z., Wu, S., Lin, K.Q., Song, C., Gao, D., Liu, J.W., Gao, Z., Mao, D., Shou, M.Z.: VideoLLM-online: Online video large language model for streaming video. In: CVPR. pp. 18407–18418 (2024)
- [10] Chen, X., Mottaghi, R., Liu, X., Fidler, S., Urtasun, R., Yuille, A.: Detect what you can: Detecting and representing objects using holistic models and body parts. In: CVPR. pp. 1971–1978 (2014)
- [11] Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: CVPR. pp. 24185–24198 (2024)
- [12] Dai, P., Chao, Y., Wu, X., Liu, K., Guo, S.: Context-aware offloading for edge-assisted on-device video analytics through online learning approach. IEEE Transactions on Mobile Computing 23(12), 12761–12777 (2024)
- [13] Ding, H., Liu, C., He, S., Jiang, X., Loy, C.C.: MeViS: A large-scale benchmark for video segmentation with motion expressions. In: CVPR. pp. 2694–2703 (2023)
- [14] Ding, H., Liu, C., He, S., Jiang, X., Torr, P.H., Bai, S.: MOSE: A new dataset for video object segmentation in complex scenes. In: CVPR. pp. 20224–20234 (2023)
- [15] Gong, S., Zhuge, Y., Zhang, L., Yang, Z., Zhang, P., Lu, H.: The devil is in temporal token: High quality video reasoning segmentation. In: CVPR. pp. 29183–29192 (2025)
- [16] Graves, A.: Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850 (2013)
- [17] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. In: ICLR (2022)
- [18] Huang, Z., Li, X., Li, J., Wang, J., Zeng, X., Liang, C., Wu, T., Chen, X., Li, L., Wang, L.: Online video understanding: OVBench and VideoChat-Online. In: CVPR. pp. 3328–3338 (2025)
- [19] Hui, T., Huang, S., Liu, S., Ding, Z., Li, G., Wang, W., Han, J., Wang, F.: Collaborative spatial-temporal modeling for language-queried video actor segmentation. In: CVPR. pp. 4187–4196 (2021)
- [20] Jin, P., Takanobu, R., Zhang, W., Cao, X., Yuan, L.: Chat-UniVi: Unified visual representation empowers large language models with image and video understanding. In: CVPR. pp. 13700–13710 (2024)
- [21] Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: ReferItGame: Referring to objects in photographs of natural scenes. In: EMNLP. pp. 787–798 (2014)
- [22] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: CVPR. pp. 4015–4026 (2023)
- [23] Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J.: LISA: Reasoning segmentation via large language model. In: CVPR. pp. 9579–9589 (2024)
- [24] Lin, J., Fang, Z., Chen, C., Wan, Z., Luo, F., Li, P., Liu, Y., Sun, M.: StreamingBench: Assessing the gap for MLLMs to achieve streaming video understanding. arXiv preprint arXiv:2411.03628 (2024)
- [25] Lin, L., Yu, X., Pang, Z., Wang, Y.X.: GLUS: Global-local reasoning unified into a single large language model for video segmentation. In: CVPR. pp. 8658–8667 (2025)
- [26] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. NeurIPS 36, 34892–34916 (2023)
- [27] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
- [28] Maaz, M., Rasheed, H., Khan, S., Khan, F.: Video-ChatGPT: Towards detailed video understanding via large vision and language models. In: ACL. pp. 12585–12602 (2024)
- [29] Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A.L., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: CVPR. pp. 11–20 (2016)
- [30] Milletari, F., Navab, N., Ahmadi, S.A.: V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In: 3DV. pp. 565–571. IEEE (2016)
- [31] Munasinghe, S., Gani, H., Zhu, W., Cao, J., Xing, E., Khan, F.S., Khan, S.: VideoGLaMM: A large multimodal model for pixel-level visual grounding in videos. In: CVPR. pp. 19036–19046 (2025)
- [32] Ning, Z., Liu, G., Jin, Q., Ding, W., Guo, M., Zhao, J.: LiveVLM: Efficient online video understanding via streaming-oriented KV cache and retrieval. arXiv preprint arXiv:2505.15269 (2025)
- [33] Niu, J., Li, Y., Miao, Z., Ge, C., Zhou, Y., He, Q., Dong, X., Duan, H., Ding, S., Qian, R., et al.: OVO-Bench: How far is your video-LLMs from real-world online video understanding? In: CVPR. pp. 18902–18913 (2025)
- [34] Obrenovic, B., Gu, X., Wang, G., Godinic, D., Jakhongirov, I.: Generative AI and human–robot interaction: implications and future agenda for business, society and ethics. AI & Society 40(2), 677–690 (2025)
- [35] Pont-Tuset, J., Perazzi, F., Caelles, S., Arbeláez, P., Sorkine-Hornung, A., Van Gool, L.: The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675 (2017)
- [36] Qi, J., Gao, Y., Hu, Y., Wang, X., Liu, X., Bai, X., Belongie, S., Yuille, A., Torr, P.H., Bai, S.: Occluded video instance segmentation: A benchmark. IJCV 130(8), 2022–2039 (2022)
- [37] Qian, R., Dong, X., Zhang, P., Zang, Y., Ding, S., Lin, D., Wang, J.: Streaming long video understanding with large language models. NeurIPS 37, 119336–119360 (2024)
- [38] Qian, R., Yin, X., Dou, D.: Reasoning to attend: Try to understand how <SEG> token works. In: CVPR. pp. 24722–24731 (2025)
- [39] Ramanathan, V., Kalia, A., Petrovic, V., Wen, Y., Zheng, B., Guo, B., Wang, R., Marquez, A., Kovvuri, R., Kadian, A., et al.: PACO: Parts and attributes of common objects. In: CVPR. pp. 7141–7151 (2023)
- [40] Rasley, J., Rajbhandari, S., Ruwase, O., He, Y.: DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In: SIGKDD. pp. 3505–3506 (2020)
- [41] Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024)
- [42] Ren, Z., Huang, Z., Wei, Y., Zhao, Y., Fu, D., Feng, J., Jin, X.: PixelLM: Pixel reasoning with large multimodal model. In: CVPR. pp. 26374–26383 (2024)
- [43] Seo, S., Lee, J.Y., Han, B.: URVOS: Unified referring video object segmentation network with a large-scale benchmark. In: ECCV. pp. 208–223. Springer (2020)
- [44] Wang, H., Feng, B., Lai, Z., Xu, M., Li, S., Ge, W., Dehghan, A., Cao, M., Huang, P.: StreamBridge: Turning your offline video large language model into a proactive streaming assistant. arXiv preprint arXiv:2505.05467 (2025)
- [45] Wang, W., Zhou, T., Yu, F., Dai, J., Konukoglu, E., Van Gool, L.: Exploring cross-image pixel contrast for semantic segmentation. In: CVPR. pp. 7303–7313 (2021)
- [46] Wu, J., Jiang, Y., Sun, P., Yuan, Z., Luo, P.: Language as queries for referring video object segmentation. In: CVPR. pp. 4974–4984 (2022)
- [47] Xiong, H., Yang, Z., Yu, J., Zhuge, Y., Zhang, L., Zhu, J., Lu, H.: Streaming video understanding and multi-round interaction with memory-enhanced knowledge. arXiv preprint arXiv:2501.13468 (2025)
- [48] Xu, R., Xiao, G., Chen, Y., He, L., Peng, K., Lu, Y., Han, S.: StreamingVLM: Real-time understanding for infinite video streams. arXiv preprint arXiv:2510.09608 (2025)
- [49] Yan, C., Wang, H., Yan, S., Jiang, X., Hu, Y., Kang, G., Xie, W., Gavves, E.: VISA: Reasoning video object segmentation via large language models. In: ECCV. pp. 98–115. Springer (2024)
- [50] Yan, S., Zhang, R., Guo, Z., Chen, W., Zhang, W., Li, H., Qiao, Y., Dong, H., He, Z., Gao, P.: Referred by multi-modality: A unified temporal transformer for video object segmentation. In: AAAI. vol. 38, pp. 6449–6457 (2024)
- [51] Yang, L., Fan, Y., Xu, N.: Video instance segmentation. In: ICCV. pp. 5188–5197 (2019)
- [52] Yang, R., Chen, H., Zhang, J., Zhao, M., Qian, C., Wang, K., Wang, Q., Koripella, T.V., Movahedi, M., Li, M., et al.: EmbodiedBench: Comprehensive benchmarking multi-modal large language models for vision-driven embodied agents. arXiv preprint arXiv:2502.09560 (2025)
- [53] Yang, S., Qu, T., Lai, X., Tian, Z., Peng, B., Liu, S., Jia, J.: LISA++: An improved baseline for reasoning segmentation with large language model. arXiv preprint arXiv:2312.17240 (2023)
- [54] Yang, Z., Hu, Y., Du, Z., Xue, D., Qian, S., Wu, J., Yang, F., Dong, W., Xu, C.: SVBench: A benchmark with temporal multi-turn dialogues for streaming video understanding. arXiv preprint arXiv:2502.10810 (2025)
- [55] Zheng, R., Qi, L., Chen, X., Wang, Y., Wang, K., Qiao, Y., Zhao, H.: ViLLa: Video reasoning segmentation with large language model. In: ICCV. pp. 23667–23677 (2025)
- [56] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR. pp. 633–641 (2017)
- [57] Zhu, J., Cheng, Z.Q., He, J.Y., Li, C., Luo, B., Lu, H., Geng, Y., Xie, X.: Tracking with human-intent reasoning. arXiv preprint arXiv:2312.17448 (2023)
- [58] Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159 (2020)