Mind the Gap: Disentangling Performance Bottlenecks in Video Instance Segmentation

Danial Hamdi; Fardin Ayar; Mahdi Javanmardi

arxiv: 2606.07394 · v1 · pith:AEIZA667new · submitted 2026-06-05 · 💻 cs.CV

Mind the Gap: Disentangling Performance Bottlenecks in Video Instance Segmentation

Danial Hamdi , Fardin Ayar , Mahdi Javanmardi This is my paper

Pith reviewed 2026-06-27 22:08 UTC · model grok-4.3

classification 💻 cs.CV

keywords video instance segmentationtracking instabilityperformance diagnosisinteger linear programmingonline methodsocclusiontemporal association

0 comments

The pith

Tracking instability creates gaps exceeding 20 AP for online video instance segmentation under occlusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a diagnostic framework that uses an integer linear program to isolate the separate effects of classification, segmentation, and tracking errors on overall performance in video instance segmentation. When applied to seven methods across standard benchmarks and occlusion-heavy subsets, the analysis identifies tracking instability as the dominant source of loss for online approaches, with the gaps widening in longer videos and denser scenes. Stronger backbones raise baseline scores but leave the tracking gaps largely unchanged, showing the limitation lies in temporal association rather than feature representation. This separation matters because it shows where algorithmic effort should focus to close the remaining performance shortfalls.

Core claim

Formulating identity and class assignment as an integer linear program produces a model-agnostic oracle that decomposes performance loss hierarchically by error source. The resulting measurements on online and offline VIS methods demonstrate that tracking instability is the primary bottleneck, producing gaps larger than 20 AP under heavy occlusion that increase sharply with video length and instance density, while classification contributes less once tracking has already failed.

What carries the argument

The integer linear program oracle that isolates each error source in identity and class assignment without reference to any particular model.

If this is right

Fixing temporal association would produce the largest AP gains for online VIS methods.
Semantic classification improvements yield diminishing returns on benchmarks where tracking already fails.
Replacing the backbone leaves the size of the tracking gaps essentially unchanged.
The magnitude of tracking gaps scales directly with sequence length and instance count.
Offline methods exhibit smaller tracking gaps than online ones on the same data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same ILP decomposition could be applied to other video tasks that combine detection and association to locate their dominant failure modes.
Models that already achieve high per-frame accuracy may still need explicit long-horizon association modules to realize those gains in full video metrics.
TrackLens visualizations could be used during training to surface specific failure queries for targeted data collection.

Load-bearing premise

The integer linear program produces an oracle that separates the error sources without introducing its own assignment biases or artifacts that change the measured gaps.

What would settle it

Re-running the oracle after manually correcting all tracking assignments on a held-out set of videos and observing whether the reported tracking gaps drop to near zero while other gaps remain.

Figures

Figures reproduced from arXiv: 2606.07394 by Danial Hamdi, Fardin Ayar, Mahdi Javanmardi.

**Figure 2.** Figure 2: Overview of TrackLens. Rows represent query tracks across time, columns show video frames. [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Tracking and classification gaps across backbone scales on YouTube-VIS 2021. For each method, [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗

**Figure 4.** Figure 4: AP tracking gap by instance density and video duration on the OVIS diagnostic split (online [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗

read the original abstract

In Video Instance Segmentation (VIS), classification, segmentation, and tracking objectives are jointly evaluated, but their individual contributions to performance loss remain opaque. We introduce a diagnostic framework that formulates identity and class assignment as an Integer Linear Program (ILP), yielding a model-agnostic oracle that hierarchically isolates each error source. Applied to seven VIS methods spanning online and offline paradigms across YouTube-VIS 2019/2021 and a diagnostic subset of OVIS, our analysis reveals a consistent picture. Tracking instability is a critical bottleneck for online methods, with gaps exceeding 20 AP under heavy occlusion, and grows sharply with video length and instance density. While semantic classification contributes meaningfully on standard benchmarks, its impact becomes negligible where tracking fails most. Although stronger backbones substantially lift default scores, they leave AP tracking gaps largely intact, confirming that temporal fragility is algorithmic rather than purely representational. To complement the oracle, we introduce TrackLens, a visual tool that translates gap magnitude into observable, query-level failure modes. Together, these tools provide a systematic foundation for targeting VIS's core challenge: robust long-term temporal association.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The ILP oracle and TrackLens are the actual new pieces here, and the tracking-bottleneck finding follows if the oracle holds up.

read the letter

The paper's main contribution is an ILP formulation that turns error attribution in VIS into a hierarchical decomposition of AP loss. They run it on seven methods across YouTube-VIS 2019/2021 and an OVIS subset, and the numbers point to tracking as the dominant gap for online approaches, especially under occlusion, longer videos, and higher instance counts. Stronger backbones raise the baseline but do not close those tracking gaps, which is a useful distinction.

What works is the consistent pattern across methods and the addition of TrackLens for turning the numbers into query-level visuals. That combination gives practitioners a clearer way to decide whether to invest in temporal association rather than just scaling the backbone.

The load-bearing part is the ILP oracle itself. The claim that it isolates sources without its own assignment biases rests on the exact objective, constraints, and occlusion handling. If those choices implicitly favor tracking attributions in the regimes the paper highlights, the reported gaps could be inflated. The abstract-only view leaves that uncheckable, and the stress-test concern lands directly on the formulation details.

No sign of circularity or self-referential fitting in the reported results. The work is for VIS researchers who already run the standard benchmarks and want a diagnostic layer on top. A reader focused on evaluation protocols or long-term tracking would find it worth their time.

It deserves peer review. The diagnostic idea is worth referee scrutiny on the ILP mechanics even if the current numbers need tighter validation.

Referee Report

1 major / 0 minor

Summary. The paper introduces an ILP-based diagnostic oracle to hierarchically decompose performance loss in video instance segmentation into classification, segmentation, and tracking components. Applied across seven methods (online and offline) on YouTube-VIS 2019/2021 and a diagnostic OVIS subset, the analysis concludes that tracking instability is the dominant bottleneck for online methods (gaps >20 AP under heavy occlusion, increasing with video length and instance density), that semantic classification impact is secondary where tracking fails, and that stronger backbones do not close the tracking gap, indicating an algorithmic rather than representational issue. TrackLens is presented as a complementary visualization tool.

Significance. If the ILP oracle is shown to isolate error sources without attribution bias, the framework would offer a useful model-agnostic diagnostic for VIS research, clarifying why temporal association remains the core challenge and providing concrete targets for improvement. The multi-method, multi-dataset consistency and the addition of TrackLens strengthen its potential utility as a tool for the community.

major comments (1)

[ILP formulation and oracle validation (likely §3)] The central claim that tracking gaps exceed 20 AP and grow with length/density/occlusion rests on the ILP oracle correctly attributing errors without its own biases (e.g., in identity-switch costs or occlusion handling). The manuscript must include the full ILP objective, all constraints, and validation experiments (such as sensitivity to cost parameters or comparison against alternative oracles) demonstrating that measured gaps are not inflated by the diagnostic formulation itself, particularly in the high-occlusion regimes highlighted in the results.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on the ILP oracle's formulation and validation. We agree that full transparency on the diagnostic is necessary to substantiate the tracking gap claims, particularly under occlusion. We address the single major comment below and will revise the manuscript accordingly.

read point-by-point responses

Referee: The central claim that tracking gaps exceed 20 AP and grow with length/density/occlusion rests on the ILP oracle correctly attributing errors without its own biases (e.g., in identity-switch costs or occlusion handling). The manuscript must include the full ILP objective, all constraints, and validation experiments (such as sensitivity to cost parameters or comparison against alternative oracles) demonstrating that measured gaps are not inflated by the diagnostic formulation itself, particularly in the high-occlusion regimes highlighted in the results.

Authors: We agree that the full ILP formulation and validation are required for the claims to be credible. In the revised version we will add the complete objective function (including all terms for classification, segmentation, and tracking costs) together with the full set of constraints to Section 3. We have performed sensitivity analyses on the identity-switch and occlusion-handling cost parameters; the resulting tracking gaps vary by less than 2 AP across the tested range and remain above 18 AP in the high-occlusion OVIS subset. We will also include a comparison against a greedy bipartite-matching oracle, which produces qualitatively identical gap rankings and magnitudes. These additions will be placed in a new subsection of §3 and an appendix, directly addressing potential attribution bias in the regimes highlighted in the results. revision: yes

Circularity Check

0 steps flagged

No significant circularity; ILP oracle is independent diagnostic

full rationale

The paper formulates an ILP as a model-agnostic oracle to hierarchically isolate classification, segmentation, and tracking errors, then applies it to measure gaps on YouTube-VIS and OVIS for seven existing methods. No equations or steps reduce the reported tracking gaps (>20 AP under occlusion, scaling with length/density) to quantities defined by the paper's own fitted parameters or self-citations. The oracle is presented as an external decomposition tool rather than a self-referential fit, and no load-bearing uniqueness theorems or ansatzes from prior author work are invoked. The derivation chain remains self-contained against the external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that an ILP can serve as an unbiased oracle for error decomposition; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption Integer Linear Programming can be solved to optimality to produce a model-agnostic assignment oracle for identity and class labels
The diagnostic framework is built directly on this premise.

pith-pipeline@v0.9.1-grok · 5733 in / 1149 out tokens · 21966 ms · 2026-06-27T22:08:55.032510+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

36 extracted references · 13 canonical work pages

[1]

L. Yang, Y . Fan, N. Xu, Video instance segmentation, in: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 5188–5197

2019
[2]

Cheng, A

B. Cheng, A. Choudhuri, I. Misra, A. Kirillov, R. Girdhar, A. G. Schwing, Mask2former for video instance segmentation, ArXiv abs/2112.10764 (2021). URLhttps://api.semanticscholar.org/CorpusID:245335013

arXiv 2021
[3]

Y . Wang, Z. Xu, X. Wang, C. Shen, B. Cheng, H. Shen, H. Xia, End-to-end video instance segmentation with transformers, in: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 8737–8746. doi:10.1109/CVPR46437.2021.00863

work page doi:10.1109/cvpr46437.2021.00863 2021
[4]

J. Wu, Y . Jiang, S. Bai, W. Zhang, X. Bai, Seqformer: Sequential transformer for video instance segmentation, in: S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, T. Hassner (Eds.), Computer Vision – ECCV 2022, Springer Nature Switzerland, Cham, 2022, pp. 553–569

2022
[5]

Hwang, M

S. Hwang, M. Heo, S. W. Oh, S. J. Kim, Video instance segmentation using inter- frame communication transformers, Advances in Neural Information Processing Systems 34 (2021) 13352–13363

2021
[6]

H. Lin, R. Wu, S. Liu, J. Lu, J. Jia, Video instance segmentation with a propose- reduce paradigm, in: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 1719–1728. doi:10.1109/ICCV48922.2021.00176

work page doi:10.1109/iccv48922.2021.00176 2021
[7]

M. Heo, S. Hwang, S. W. Oh, J.-Y . Lee, S. J. Kim, Vita: Video instance segmen- tation via object token association, Advances in Neural Information Processing Systems 35 (2022) 23109–23120

2022
[8]

M. Heo, S. Hwang, J. Hyun, H. Kim, S. W. Oh, J.-Y . Lee, S. J. Kim, A gener- alized framework for video instance segmentation, in: 2023 IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 14623– 14632. doi:10.1109/CVPR52729.2023.01405

work page doi:10.1109/cvpr52729.2023.01405 2023
[9]

Huang, Z

D.-A. Huang, Z. Yu, A. Anandkumar, Minvis: A minimal video instance seg- mentation framework without video-based training, Advances in Neural Infor- mation Processing Systems 35 (2022) 31265–31277. 17

2022
[10]

J. Cao, R. M. Anwer, H. Cholakkal, F. S. Khan, Y . Pang, L. Shao, Sipmask: Spatial information preservation for fast image and video instance segmentation, Proc. European Conference on Computer Vision (2020)

2020
[11]

H. Kim, J. Kang, M. Heo, S. Hwang, S. W. Oh, S. J. Kim, Visage: Video instance segmentation with appearance-guided enhancement (2024). arXiv:2312.04885

arXiv 2024
[12]

K. Ying, Q. Zhong, W. Mao, Z. Wang, H. Chen, L. Y . Wu, Y . Liu, C. Fan, Y . Zhuge, C. Shen, Ctvis: Consistent training for online video instance segmen- tation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 899–908

2023
[13]

S. Yang, Y . Fang, X. Wang, Y . Li, C. Fang, Y . Shan, B. Feng, W. Liu, Crossover learning for fast online video instance segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 8043–8052

2021
[14]

Cheng, I

B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, R. Girdhar, Masked-attention mask transformer for universal image segmentation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 1290–1299

2022
[15]

J. Wu, Q. Liu, Y . Jiang, S. Bai, A. Yuille, X. Bai, In defense of online models for video instance segmentation, in: European Conference on Computer Vision, Springer, 2022, pp. 588–605

2022
[16]

S. Lee, J. Seo, K. Han, M. Choi, S. Im, Cavis: Context-aware video instance segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 4507–4517

2025
[17]

Bokhovkin, S

T. Zhang, X. Tian, Y . Wu, S. Ji, X. Wang, Y . Zhang, P. Wan, Dvis: De- coupled video instance segmentation framework, in: 2023 IEEE/CVF In- ternational Conference on Computer Vision (ICCV), 2023, pp. 1282–1291. doi:10.1109/ICCV51070.2023.00124

work page doi:10.1109/iccv51070.2023.00124 2023
[18]

Zhang, X

T. Zhang, X. Tian, Y . Zhou, S. Ji, X. Wang, X. Tao, Y . Zhang, P. Wan, Z. Wang, Y . Wu, Dvis++: Improved decoupled framework for universal video segmen- tation, IEEE Transactions on Pattern Analysis and Machine Intelligence 47 (7) (2025) 5918–5929. doi:10.1109/TPAMI.2025.3552694

work page doi:10.1109/tpami.2025.3552694 2025
[19]

J. Qi, Y . Gao, Y . Hu, X. Wang, X. Liu, X. Bai, S. Belongie, A. Yuille, P. H. S. Torr, S. Bai, Occluded video instance segmentation: A benchmark, International Journal of Computer Vision 130 (8) (2022) 2022–2039. doi:10.1007/s11263- 022-01629-1. URLhttps://doi.org/10.1007/s11263-022-01629-1

work page doi:10.1007/s11263- 2022
[20]

L. Yang, Y . Fan, Y . Fu, N. Xu, The 3rd large-scale video object segmentation challenge - video instance segmentation track (Jun. 2021). 18

2021
[21]

K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask r-cnn, in: 2017 IEEE In- ternational Conference on Computer Vision (ICCV), 2017, pp. 2980–2988. doi:10.1109/ICCV .2017.322

work page doi:10.1109/iccv 2017
[22]

Wojke, A

N. Wojke, A. Bewley, D. Paulus, Simple online and realtime tracking with a deep association metric, in: 2017 IEEE International Conference on Image Processing (ICIP), 2017, pp. 3645–3649. doi:10.1109/ICIP.2017.8296962

work page doi:10.1109/icip.2017.8296962 2017
[23]

Carion, F

N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End- to-end object detection with transformers, in: A. Vedaldi, H. Bischof, T. Brox, J.-M. Frahm (Eds.), Computer Vision – ECCV 2020, Springer International Pub- lishing, Cham, 2020, pp. 213–229

2020
[24]

X. Zhu, W. Su, L. Lu, B. Li, X. Wang, J. Dai, Deformable {detr}: Deformable transformers for end-to-end object detection, international Conference on Learn- ing Representations (2021). URLhttps://openreview.net/forum?id=gZ9hCDWe6ke

2021
[25]

M. Li, S. Li, W. Xiang, L. Zhang, Mdqe: Mining discriminative query embed- dings to segment occluded instances on challenging videos, in: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 10524–10533. doi:10.1109/CVPR52729.2023.01014

work page doi:10.1109/cvpr52729.2023.01014 2023
[26]

H. K. Cheng, Y .-W. Tai, C.-K. Tang, Rethinking space-time networks with im- proved memory coverage for efficient video object segmentation, Advances in Neural Information Processing Systems 34 (2021) 11781–11794

2021
[27]

H. K. Cheng, A. G. Schwing, Xmem: Long-term video object segmentation with an atkinson-shiffrin memory model, in: S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, T. Hassner (Eds.), Computer Vision – ECCV 2022, Springer Nature Switzerland, Cham, 2022, pp. 640–658

2022
[28]

Y . Zhou, T. Zhang, S. Ji, S. Yan, X. Li, Improving video segmentation via dy- namic anchor queries, in: A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, G. Varol (Eds.), Computer Vision – ECCV 2024, Springer Nature Switzerland, Cham, 2025, pp. 446–463

2024
[29]

Athar, A

A. Athar, A. Hermans, J. Luiten, D. Ramanan, B. Leibe, Tarvis: A unified ap- proach for target-based video segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 18738–18748

2023
[30]

M. Li, S. Li, X. Zhang, L. Zhang, UniVS: Unified and Universal Video Segmen- tation with Prompts as Queries , in: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, Los Alamitos, CA, USA, 2024, pp. 3227–3238. doi:10.1109/CVPR52733.2024.00311. URLhttps://doi.ieeecomputersociety.org/10.1109/CVPR52733.20 24.00311 19

work page doi:10.1109/cvpr52733.2024.00311 2024
[31]

J. Wu, Y . Jiang, Q. Liu, Z. Yuan, X. Bai, S. Bai, General object founda- tion model for images and videos at scale, in: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 3783–3795. doi:10.1109/CVPR52733.2024.00363

work page doi:10.1109/cvpr52733.2024.00363 2024
[32]

HOTA: A Higher Order Metric for Evaluating Multi-object Tracking,

J. Luiten, A. O ˘sep, P. Dendorfer, P. Torr, A. Geiger, L. Leal-Taixé, B. Leibe, Hota: A higher order metric for evaluating multi-object tracking, Int. J. Comput. Vision 129 (2) (2021) 548–578. doi:10.1007/s11263-020-01375-2. URLhttps://doi.org/10.1007/s11263-020-01375-2

work page doi:10.1007/s11263-020-01375-2 2021
[33]

Bolya, S

D. Bolya, S. Foley, J. Hays, J. Hoffman, Tide: A general toolbox for identifying object detection errors, in: European Conference on Computer Vision, Springer, 2020, pp. 558–573

2020
[34]

W. Jia, L. Yang, Z. Jia, W. Zhao, Y . Zhou, Q. Song, Tive: A toolbox for identi- fying video instance segmentation errors, Neurocomputing 545 (2023) 126321

2023
[35]

K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recogni- tion, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778. doi:10.1109/CVPR.2016.90

work page doi:10.1109/cvpr.2016.90 2016
[36]

Perron, V

L. Perron, V . Furnon, Or-tools (2025). URLhttps://developers.google.com/optimization/ 20

2025

[1] [1]

L. Yang, Y . Fan, N. Xu, Video instance segmentation, in: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 5188–5197

2019

[2] [2]

Cheng, A

B. Cheng, A. Choudhuri, I. Misra, A. Kirillov, R. Girdhar, A. G. Schwing, Mask2former for video instance segmentation, ArXiv abs/2112.10764 (2021). URLhttps://api.semanticscholar.org/CorpusID:245335013

arXiv 2021

[3] [3]

Y . Wang, Z. Xu, X. Wang, C. Shen, B. Cheng, H. Shen, H. Xia, End-to-end video instance segmentation with transformers, in: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 8737–8746. doi:10.1109/CVPR46437.2021.00863

work page doi:10.1109/cvpr46437.2021.00863 2021

[4] [4]

J. Wu, Y . Jiang, S. Bai, W. Zhang, X. Bai, Seqformer: Sequential transformer for video instance segmentation, in: S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, T. Hassner (Eds.), Computer Vision – ECCV 2022, Springer Nature Switzerland, Cham, 2022, pp. 553–569

2022

[5] [5]

Hwang, M

S. Hwang, M. Heo, S. W. Oh, S. J. Kim, Video instance segmentation using inter- frame communication transformers, Advances in Neural Information Processing Systems 34 (2021) 13352–13363

2021

[6] [6]

H. Lin, R. Wu, S. Liu, J. Lu, J. Jia, Video instance segmentation with a propose- reduce paradigm, in: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 1719–1728. doi:10.1109/ICCV48922.2021.00176

work page doi:10.1109/iccv48922.2021.00176 2021

[7] [7]

M. Heo, S. Hwang, S. W. Oh, J.-Y . Lee, S. J. Kim, Vita: Video instance segmen- tation via object token association, Advances in Neural Information Processing Systems 35 (2022) 23109–23120

2022

[8] [8]

M. Heo, S. Hwang, J. Hyun, H. Kim, S. W. Oh, J.-Y . Lee, S. J. Kim, A gener- alized framework for video instance segmentation, in: 2023 IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 14623– 14632. doi:10.1109/CVPR52729.2023.01405

work page doi:10.1109/cvpr52729.2023.01405 2023

[9] [9]

Huang, Z

D.-A. Huang, Z. Yu, A. Anandkumar, Minvis: A minimal video instance seg- mentation framework without video-based training, Advances in Neural Infor- mation Processing Systems 35 (2022) 31265–31277. 17

2022

[10] [10]

J. Cao, R. M. Anwer, H. Cholakkal, F. S. Khan, Y . Pang, L. Shao, Sipmask: Spatial information preservation for fast image and video instance segmentation, Proc. European Conference on Computer Vision (2020)

2020

[11] [11]

H. Kim, J. Kang, M. Heo, S. Hwang, S. W. Oh, S. J. Kim, Visage: Video instance segmentation with appearance-guided enhancement (2024). arXiv:2312.04885

arXiv 2024

[12] [12]

K. Ying, Q. Zhong, W. Mao, Z. Wang, H. Chen, L. Y . Wu, Y . Liu, C. Fan, Y . Zhuge, C. Shen, Ctvis: Consistent training for online video instance segmen- tation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 899–908

2023

[13] [13]

S. Yang, Y . Fang, X. Wang, Y . Li, C. Fang, Y . Shan, B. Feng, W. Liu, Crossover learning for fast online video instance segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 8043–8052

2021

[14] [14]

Cheng, I

B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, R. Girdhar, Masked-attention mask transformer for universal image segmentation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 1290–1299

2022

[15] [15]

J. Wu, Q. Liu, Y . Jiang, S. Bai, A. Yuille, X. Bai, In defense of online models for video instance segmentation, in: European Conference on Computer Vision, Springer, 2022, pp. 588–605

2022

[16] [16]

S. Lee, J. Seo, K. Han, M. Choi, S. Im, Cavis: Context-aware video instance segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 4507–4517

2025

[17] [17]

Bokhovkin, S

T. Zhang, X. Tian, Y . Wu, S. Ji, X. Wang, Y . Zhang, P. Wan, Dvis: De- coupled video instance segmentation framework, in: 2023 IEEE/CVF In- ternational Conference on Computer Vision (ICCV), 2023, pp. 1282–1291. doi:10.1109/ICCV51070.2023.00124

work page doi:10.1109/iccv51070.2023.00124 2023

[18] [18]

Zhang, X

T. Zhang, X. Tian, Y . Zhou, S. Ji, X. Wang, X. Tao, Y . Zhang, P. Wan, Z. Wang, Y . Wu, Dvis++: Improved decoupled framework for universal video segmen- tation, IEEE Transactions on Pattern Analysis and Machine Intelligence 47 (7) (2025) 5918–5929. doi:10.1109/TPAMI.2025.3552694

work page doi:10.1109/tpami.2025.3552694 2025

[19] [19]

J. Qi, Y . Gao, Y . Hu, X. Wang, X. Liu, X. Bai, S. Belongie, A. Yuille, P. H. S. Torr, S. Bai, Occluded video instance segmentation: A benchmark, International Journal of Computer Vision 130 (8) (2022) 2022–2039. doi:10.1007/s11263- 022-01629-1. URLhttps://doi.org/10.1007/s11263-022-01629-1

work page doi:10.1007/s11263- 2022

[20] [20]

L. Yang, Y . Fan, Y . Fu, N. Xu, The 3rd large-scale video object segmentation challenge - video instance segmentation track (Jun. 2021). 18

2021

[21] [21]

K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask r-cnn, in: 2017 IEEE In- ternational Conference on Computer Vision (ICCV), 2017, pp. 2980–2988. doi:10.1109/ICCV .2017.322

work page doi:10.1109/iccv 2017

[22] [22]

Wojke, A

N. Wojke, A. Bewley, D. Paulus, Simple online and realtime tracking with a deep association metric, in: 2017 IEEE International Conference on Image Processing (ICIP), 2017, pp. 3645–3649. doi:10.1109/ICIP.2017.8296962

work page doi:10.1109/icip.2017.8296962 2017

[23] [23]

Carion, F

N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End- to-end object detection with transformers, in: A. Vedaldi, H. Bischof, T. Brox, J.-M. Frahm (Eds.), Computer Vision – ECCV 2020, Springer International Pub- lishing, Cham, 2020, pp. 213–229

2020

[24] [24]

X. Zhu, W. Su, L. Lu, B. Li, X. Wang, J. Dai, Deformable {detr}: Deformable transformers for end-to-end object detection, international Conference on Learn- ing Representations (2021). URLhttps://openreview.net/forum?id=gZ9hCDWe6ke

2021

[25] [25]

M. Li, S. Li, W. Xiang, L. Zhang, Mdqe: Mining discriminative query embed- dings to segment occluded instances on challenging videos, in: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 10524–10533. doi:10.1109/CVPR52729.2023.01014

work page doi:10.1109/cvpr52729.2023.01014 2023

[26] [26]

H. K. Cheng, Y .-W. Tai, C.-K. Tang, Rethinking space-time networks with im- proved memory coverage for efficient video object segmentation, Advances in Neural Information Processing Systems 34 (2021) 11781–11794

2021

[27] [27]

H. K. Cheng, A. G. Schwing, Xmem: Long-term video object segmentation with an atkinson-shiffrin memory model, in: S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, T. Hassner (Eds.), Computer Vision – ECCV 2022, Springer Nature Switzerland, Cham, 2022, pp. 640–658

2022

[28] [28]

Y . Zhou, T. Zhang, S. Ji, S. Yan, X. Li, Improving video segmentation via dy- namic anchor queries, in: A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, G. Varol (Eds.), Computer Vision – ECCV 2024, Springer Nature Switzerland, Cham, 2025, pp. 446–463

2024

[29] [29]

Athar, A

A. Athar, A. Hermans, J. Luiten, D. Ramanan, B. Leibe, Tarvis: A unified ap- proach for target-based video segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 18738–18748

2023

[30] [30]

M. Li, S. Li, X. Zhang, L. Zhang, UniVS: Unified and Universal Video Segmen- tation with Prompts as Queries , in: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, Los Alamitos, CA, USA, 2024, pp. 3227–3238. doi:10.1109/CVPR52733.2024.00311. URLhttps://doi.ieeecomputersociety.org/10.1109/CVPR52733.20 24.00311 19

work page doi:10.1109/cvpr52733.2024.00311 2024

[31] [31]

J. Wu, Y . Jiang, Q. Liu, Z. Yuan, X. Bai, S. Bai, General object founda- tion model for images and videos at scale, in: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 3783–3795. doi:10.1109/CVPR52733.2024.00363

work page doi:10.1109/cvpr52733.2024.00363 2024

[32] [32]

HOTA: A Higher Order Metric for Evaluating Multi-object Tracking,

J. Luiten, A. O ˘sep, P. Dendorfer, P. Torr, A. Geiger, L. Leal-Taixé, B. Leibe, Hota: A higher order metric for evaluating multi-object tracking, Int. J. Comput. Vision 129 (2) (2021) 548–578. doi:10.1007/s11263-020-01375-2. URLhttps://doi.org/10.1007/s11263-020-01375-2

work page doi:10.1007/s11263-020-01375-2 2021

[33] [33]

Bolya, S

D. Bolya, S. Foley, J. Hays, J. Hoffman, Tide: A general toolbox for identifying object detection errors, in: European Conference on Computer Vision, Springer, 2020, pp. 558–573

2020

[34] [34]

W. Jia, L. Yang, Z. Jia, W. Zhao, Y . Zhou, Q. Song, Tive: A toolbox for identi- fying video instance segmentation errors, Neurocomputing 545 (2023) 126321

2023

[35] [35]

K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recogni- tion, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778. doi:10.1109/CVPR.2016.90

work page doi:10.1109/cvpr.2016.90 2016

[36] [36]

Perron, V

L. Perron, V . Furnon, Or-tools (2025). URLhttps://developers.google.com/optimization/ 20

2025