Proposal, Tracking and Segmentation (PTS): A Cascaded Network for Video Object Segmentation

Chang Huang; Han Shen; Lichao Huang; Qiang Zhou; Wenyu Liu; Xinggang Wang; Yongchao Gong; Zilong Huang

arxiv: 1907.01203 · v2 · pith:HQU3TQVCnew · submitted 2019-07-02 · 💻 cs.CV

Qiang Zhou, Zilong Huang, Lichao Huang, Yongchao Gong, Han Shen, Chang Huang, Wenyu Liu, Xinggang Wang

Pith reviewed 2026-05-25 11:20 UTC · model grok-4.3

classification 💻 cs.CV

keywords video object segmentation · cascaded network · object proposal · object tracking · model adaptation · DAVIS dataset · YouTube-VOS · pixel-level segmentation

The pith

A cascaded proposal-tracking-segmentation network transfers generic object knowledge and uses dynamic reference adaptation to reach state-of-the-art video object segmentation results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a three-part network that first generates object proposals using general objectness knowledge, then tracks the specific target among those proposals, and finally segments the object with a network that adapts its model on the fly using dynamic references. This structure addresses the shortage of video-specific training data and large appearance changes by reusing detection knowledge and updating references during the video. In experiments on DAVIS'17 and YouTube-VOS the approach is reported to outperform prior methods, which suggests that separating proposal, selection, and adaptive segmentation can make pixel-level tracking more robust without additional labeled videos.

Core claim

The PTS framework consists of an object proposal network that transfers objectness information as generic knowledge into VOS, a tracking network that identifies the target object from the proposals, and a segmentation network that operates on the tracking results with a novel dynamic-reference based model adaptation scheme. This unified framework achieves state-of-the-art performance on the DAVIS'17 and YouTube-VOS datasets.
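
The three-stage data flow in this claim can be sketched as follows. This is a minimal illustration, not the authors' implementation: `run_cascade`, `box_iou`, the `(x1, y1, x2, y2)` box format, and the toy stand-ins for each stage are all hypothetical, and the IoU-based selector only stands in for the learned tracking network.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2); hypothetical box format
Frame = Dict[str, List[Box]]     # toy frame that already carries its proposals


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def run_cascade(frames, first_box, propose, track, segment):
    """Run proposal -> tracking -> segmentation on every frame after the first."""
    outputs = []
    target = first_box
    for frame in frames[1:]:
        proposals = propose(frame)              # stage 1: class-agnostic candidates
        if proposals:                           # keep the last target if nothing is proposed
            target = track(proposals, target)   # stage 2: choose the target candidate
        outputs.append(segment(frame, target))  # stage 3: segment around the target
    return outputs


# Toy stand-ins for the three networks (assumptions for illustration only).
def toy_propose(frame: Frame) -> List[Box]:
    return frame["boxes"]


def toy_track(proposals: List[Box], previous: Box) -> Box:
    return max(proposals, key=lambda p: box_iou(p, previous))


def toy_segment(frame: Frame, box: Box) -> Box:
    return box  # a real segmentation stage would return a pixel mask


video = [
    {"boxes": []},  # first frame: annotation is given, no proposals needed
    {"boxes": [(0, 0, 10, 10), (50, 50, 60, 60)]},
    {"boxes": [(2, 2, 12, 12), (52, 52, 62, 62)]},
]
tracked = run_cascade(video, (1, 1, 11, 11), toy_propose, toy_track, toy_segment)
```

Passing the stages in as callables keeps each one replaceable, which is what makes stage-level ablations straightforward to set up.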

What carries the argument

The dynamic-reference based model adaptation scheme, which updates the segmentation model during inference using tracking outputs to handle visual variations.
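
The summary does not spell out the update rule, so the following is only one plausible reading of "dynamic references", written as a sketch under stated assumptions. The `ReferencePool` class, its `max_recent` and `min_confidence` parameters, and the idea of gating updates on a tracker confidence score are all hypothetical and are not taken from the paper.

```python
from collections import deque


class ReferencePool:
    """Keeps the annotated first frame plus a few recent, trusted references.

    Hypothetical policy: the first-frame reference is never dropped, and a
    new reference is admitted only when the tracker confidence is high.
    """

    def __init__(self, first_reference, max_recent=3, min_confidence=0.8):
        self.anchor = first_reference           # ground-truth reference, always kept
        self.recent = deque(maxlen=max_recent)  # oldest entries fall out automatically
        self.min_confidence = min_confidence

    def update(self, reference, confidence):
        """Admit a new reference if it is trusted; return whether it was kept."""
        if confidence >= self.min_confidence:
            self.recent.append(reference)
            return True
        return False

    def references(self):
        """References the segmentation stage would adapt to, oldest first."""
        return [self.anchor, *self.recent]


pool = ReferencePool("frame0", max_recent=2, min_confidence=0.5)
pool.update("frame1", 0.9)  # kept
pool.update("frame2", 0.2)  # rejected: low confidence
pool.update("frame3", 0.7)  # kept
pool.update("frame4", 0.6)  # kept, pushes out frame1
```

Keeping the annotated first frame as a fixed anchor is a common guard against drift when later references come from the model's own predictions.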

If this is right

Generic object detection knowledge can be reused to compensate for limited VOS training samples.
Separating proposal generation from target selection reduces errors from appearance changes.
Dynamic reference updates allow the segmentation stage to maintain accuracy across video frames.
The cascaded design supports end-to-end training while keeping each stage focused on its subtask.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If proposal quality drops on unusual object categories, the entire cascade would lose accuracy even if later stages function well.
The same proposal-plus-tracking pattern might apply to related tasks such as instance segmentation in video or multi-object tracking without major redesign.
Speed improvements could come from sharing early features between the proposal and tracking stages rather than running them sequentially.

Load-bearing premise

The object proposal network successfully transfers generic objectness knowledge into the VOS task and the tracking network reliably selects the correct target from proposals.

What would settle it

A direct comparison showing that the method fails to match or exceed the top reported scores on the DAVIS'17 and YouTube-VOS validation sets would falsify the performance claim.
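
Comparisons on DAVIS-style benchmarks build on region similarity J, the intersection-over-union between predicted and ground-truth masks, usually reported alongside a boundary measure F. A minimal sketch of J on binary masks stored as nested lists follows; the function name `region_similarity` is hypothetical, and scoring two empty masks as a perfect match is an assumed convention.

```python
from typing import List

Mask = List[List[int]]  # binary mask, 1 = object pixel


def region_similarity(pred: Mask, gt: Mask) -> float:
    """Jaccard index (mask IoU) between a predicted and a ground-truth mask."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumed convention: two empty masks agree perfectly.
    return 1.0 if union == 0 else inter / union


pred = [[1, 1], [0, 0]]
gt = [[1, 0], [1, 0]]
score = region_similarity(pred, gt)  # one shared pixel out of three covered pixels
```

A per-sequence score would average this value over frames and objects before methods are compared.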

Figures

Figures reproduced from arXiv: 1907.01203 by Chang Huang, Han Shen, Lichao Huang, Qiang Zhou, Wenyu Liu, Xinggang Wang, Yongchao Gong, Zilong Huang.

**Figure 1.** Example proposals of OPN on unseen categories. We randomly pick a portion of all proposals near the objects of interest. The …

**Figure 2.** Overview of the proposed PTSNet, which consists of …

Original abstract

Video object segmentation (VOS) aims at pixel-level object tracking given only the annotations in the first frame. Due to the large visual variations of objects in video and the lack of training samples, it remains a difficult task despite the upsurging development of deep learning. Toward solving the VOS problem, we bring in several new insights by the proposed unified framework consisting of object proposal, tracking and segmentation components. The object proposal network transfers objectness information as generic knowledge into VOS; the tracking network identifies the target object from the proposals; and the segmentation network is performed based on the tracking results with a novel dynamic-reference based model adaptation scheme. Extensive experiments have been conducted on the DAVIS'17 dataset and the YouTube-VOS dataset; our method achieves the state-of-the-art performance on several video object segmentation benchmarks. We make the code publicly available at https://github.com/sydney0zq/PTSNet.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The PTS cascade chains proposal, tracking, and segmentation for VOS with a dynamic reference trick, but the tracking stage's contribution stays unproven without isolating ablations.

The paper's main move is a three-part cascade: an object proposal network to bring in generic objectness, a tracking network that picks the right proposal, and a segmentation network that adapts via dynamic reference frames. The dynamic-reference adaptation is the clearest concrete addition, giving the segmentation stage a way to update its reference without full retraining. Public code is also a plus for anyone who wants to test the pipeline directly. That structure is new enough compared to prior VOS work that treats proposal and tracking more implicitly.

The claim of SOTA on DAVIS'17 and YouTube-VOS is the headline result, but the abstract supplies none of the actual numbers, baselines, or error bars, so the strength of that claim depends entirely on what the full experiments show.

The soft spot is the missing component tests. The architecture treats the tracking handoff as load-bearing, yet there is no ablation that keeps the proposal and segmentation networks fixed while replacing the learned tracker with random selection or an oracle. If performance holds up under random selection, the cascade's necessity weakens and the gains cannot be credited to the proposed tracking step. The same goes for the assumption that the proposal network transfers useful generic knowledge; that also needs a controlled check rather than being taken as given.

This is the kind of modular engineering paper that VOS practitioners might want to read for the pipeline idea and the code, even if the gains turn out incremental. It has enough structure and a public implementation to deserve referee time so the experiments can be examined and the ablations requested if they are absent. I would send it out rather than desk reject.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes a cascaded PTS network for video object segmentation that consists of an object proposal network transferring generic objectness knowledge, a tracking network selecting the target object from proposals, and a segmentation network employing a novel dynamic-reference based model adaptation scheme. The work asserts that this unified framework solves the VOS problem under large visual variations and limited training samples, and reports achieving state-of-the-art performance on the DAVIS'17 and YouTube-VOS benchmarks while releasing code publicly.

Significance. If the quantitative results, baselines, and component ablations ultimately support the claims, the explicit decomposition into proposal, tracking, and adaptive segmentation stages could constitute a useful architectural contribution to video object segmentation by demonstrating the value of injecting generic objectness priors. The public code release would further aid reproducibility.

major comments (2)

[Abstract] Abstract: the central claim that the method 'achieves the state-of-the-art performance on several video object segmentation benchmarks' is asserted without any quantitative metrics, baseline comparisons, error bars, ablation tables, or experimental protocol details, rendering the empirical contribution unverifiable from the manuscript text.
[Experiments] Experiments (implied by the abstract's reference to 'extensive experiments'): no ablation is reported that isolates the tracking network's contribution, e.g., by substituting random proposal selection or an oracle tracker while keeping the proposal and segmentation networks fixed. Without such a controlled test the necessity of the three-stage cascade remains unproven, directly affecting the load-bearing assumption that the tracking stage reliably selects the correct target.
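
The ablation requested in the second comment can be sketched as a small harness that swaps only the selection step while the proposals stay fixed. Everything here is hypothetical: the selector functions, the `selection_hit_rate` name, and the 0.5 IoU threshold are illustrative choices, and the hit rate is only a proxy for the final segmentation score the referee has in mind. A learned tracker would be plugged in as a third selector that ignores the ground-truth argument.

```python
import random
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2); hypothetical box format
Sample = Tuple[List[Box], Box]   # (proposals for one frame, ground-truth box)
Selector = Callable[[List[Box], Box], Box]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def make_random_selector(seed: int = 0) -> Selector:
    """Lower bound: pick a proposal uniformly at random, ignoring the target."""
    rng = random.Random(seed)

    def select(proposals: List[Box], gt_box: Box) -> Box:
        return rng.choice(proposals)

    return select


def oracle_selector(proposals: List[Box], gt_box: Box) -> Box:
    """Upper bound: pick the proposal that best overlaps the ground truth."""
    return max(proposals, key=lambda p: box_iou(p, gt_box))


def selection_hit_rate(select: Selector, samples: List[Sample],
                       iou_threshold: float = 0.5) -> float:
    """Fraction of frames where the selected proposal overlaps the target."""
    hits = sum(
        1 for proposals, gt in samples
        if box_iou(select(proposals, gt), gt) >= iou_threshold
    )
    return hits / len(samples)


samples = [
    ([(0, 0, 10, 10), (50, 50, 60, 60)], (1, 1, 11, 11)),
    ([(20, 20, 30, 30), (0, 0, 5, 5)], (21, 21, 31, 31)),
]
oracle_rate = selection_hit_rate(oracle_selector, samples)
random_rate = selection_hit_rate(make_random_selector(seed=0), samples)
```

If a learned tracker scores close to the random selector, the tracking stage adds little; if it scores close to the oracle, most of the remaining error lies in the proposals or the segmentation stage.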

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address the two major comments point-by-point below and indicate the revisions we will make to the manuscript.

Referee: [Abstract] Abstract: the central claim that the method 'achieves the state-of-the-art performance on several video object segmentation benchmarks' is asserted without any quantitative metrics, baseline comparisons, error bars, ablation tables, or experimental protocol details, rendering the empirical contribution unverifiable from the manuscript text.

Authors: We agree that the abstract, as a concise summary, does not contain the supporting quantitative details. The Experiments section provides the full results, including J&F scores on DAVIS'17 and YouTube-VOS with baseline comparisons and protocol information. To improve verifiability directly from the abstract, we will revise it to include the primary performance numbers and a brief reference to the evaluation setting.
revision: yes

Referee: [Experiments] Experiments (implied by the abstract's reference to 'extensive experiments'): no ablation is reported that isolates the tracking network's contribution, e.g., by substituting random proposal selection or an oracle tracker while keeping the proposal and segmentation networks fixed. Without such a controlled test the necessity of the three-stage cascade remains unproven, directly affecting the load-bearing assumption that the tracking stage reliably selects the correct target.

Authors: The manuscript reports ablations on the proposal network and the dynamic-reference adaptation scheme, together with end-to-end comparisons. We acknowledge, however, that an explicit controlled ablation replacing the tracking stage with random selection or an oracle tracker is not present and would more directly substantiate the value of the cascade. We will add this experiment (and the corresponding analysis) to the revised Experiments section.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical cascade evaluated on external benchmarks

The paper presents a three-stage neural architecture (proposal network, tracking network, dynamic-reference segmentation) trained and tested on public VOS benchmarks (DAVIS'17, YouTube-VOS). No equations, fitted parameters, or first-principles derivations are shown that reduce to their own inputs by construction. Performance claims rest on external empirical results rather than self-referential definitions or self-citation chains. The reader's assessment of score 1.0 is consistent with the absence of any load-bearing circular step.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim rests on empirical performance of a deep network trained on standard VOS benchmarks; no additional free parameters, axioms, or invented entities beyond conventional neural-network training are described in the abstract.

free parameters (1)

network weights and hyperparameters
Standard deep-learning parameters fitted during training on benchmark data; not enumerated in the abstract.

pith-pipeline@v0.9.0 · 5712 in / 1175 out tokens · 28894 ms · 2026-05-25T11:20:40.027099+00:00 · methodology
