arxiv: 1907.01203 · v2 · pith:HQU3TQVCnew · submitted 2019-07-02 · 💻 cs.CV

Proposal, Tracking and Segmentation (PTS): A Cascaded Network for Video Object Segmentation

Pith reviewed 2026-05-25 11:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords video object segmentation · cascaded network · object proposal · object tracking · model adaptation · DAVIS dataset · YouTube-VOS · pixel-level segmentation

The pith

A cascaded proposal-tracking-segmentation network transfers generic object knowledge and uses dynamic reference adaptation to reach state-of-the-art video object segmentation results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a three-part network that first generates object proposals using general objectness knowledge, then tracks the specific target among those proposals, and finally segments the target with a network that adapts its model on the fly using dynamic references. This structure addresses the shortage of video-specific training data and large appearance changes by reusing detection knowledge and updating references during the video. Experiments on DAVIS'17 and YouTube-VOS show the approach outperforming prior methods, indicating that separating proposal, selection, and adaptive segmentation can make pixel-level tracking more robust without additional labeled videos.

Core claim

The PTS framework consists of an object proposal network that transfers objectness information as generic knowledge into VOS, a tracking network that identifies the target object among the proposals, and a segmentation network that operates on the tracking results using a novel dynamic-reference based model adaptation scheme. This unified framework achieves state-of-the-art performance on the DAVIS'17 and YouTube-VOS datasets.
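To make the three-stage control flow concrete, here is a minimal Python sketch of a proposal, tracking, and segmentation cascade. It is an illustration only, not the authors' implementation: `run_cascade`, `FrameResult`, and the three callables `propose`, `track`, and `segment` are hypothetical names, and the real networks are replaced by whatever functions the caller passes in.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), an assumed box format


@dataclass
class FrameResult:
    box: Box   # target box chosen by the tracking stage
    mask: Any  # whatever the segmentation stage returns


def run_cascade(
    frames: Sequence[Any],
    first_box: Box,
    propose: Callable[[Any], List[Box]],
    track: Callable[[Any, List[Box], Box], Box],
    segment: Callable[[Any, Box], Any],
) -> List[FrameResult]:
    """Run the three stages frame by frame (hypothetical interface)."""
    results = []
    target = first_box  # the first-frame annotation seeds the tracker
    for frame in frames:
        proposals = propose(frame)                # stage 1: generic candidate boxes
        target = track(frame, proposals, target)  # stage 2: pick the target among them
        mask = segment(frame, target)             # stage 3: pixel mask for that target
        results.append(FrameResult(target, mask))
    return results
```

Each stage only sees the output of the previous one, which is why a failure in proposal generation or target selection propagates to the final mask.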

What carries the argument

The dynamic-reference based model adaptation scheme, which updates the segmentation model during inference using tracking outputs to handle visual variations.
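One way to picture dynamic references is as a small pool that always keeps the annotated first frame and adds recent frames whose tracking output looks trustworthy. The sketch below is an assumption-laden illustration: the `ReferencePool` class, the confidence threshold, and the fixed-capacity eviction policy are all hypothetical and are not taken from the paper, which this page does not describe in that detail.

```python
from collections import deque


class ReferencePool:
    """Hypothetical pool of references used to adapt a segmentation model."""

    def __init__(self, first_reference, capacity=3, min_confidence=0.8):
        self.first = first_reference           # annotated first frame, never evicted
        self.dynamic = deque(maxlen=capacity)  # most recent confident references
        self.min_confidence = min_confidence   # assumed gating threshold

    def update(self, reference, confidence):
        """Add a reference only if the tracking stage was confident about it."""
        if confidence >= self.min_confidence:
            self.dynamic.append(reference)     # oldest entry drops out when full
            return True
        return False

    def references(self):
        """Return the fixed first-frame reference followed by the dynamic ones."""
        return [self.first, *self.dynamic]
```

Gating on confidence is one plausible guard against drift: a wrong tracking result that enters the pool would otherwise be used to adapt the model toward the wrong object.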

If this is right

  • Generic object detection knowledge can be reused to compensate for limited VOS training samples.
  • Separating proposal generation from target selection reduces errors from appearance changes.
  • Dynamic reference updates allow the segmentation stage to maintain accuracy across video frames.
  • The cascaded design supports end-to-end training while keeping each stage focused on its subtask.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If proposal quality drops on unusual object categories, the entire cascade would lose accuracy even if later stages function well.
  • The same proposal-plus-tracking pattern might apply to related tasks such as instance segmentation in video or multi-object tracking without major redesign.
  • Speed improvements could come from sharing early features between the proposal and tracking stages rather than running them sequentially.

Load-bearing premise

The object proposal network successfully transfers generic objectness knowledge into the VOS task and the tracking network reliably selects the correct target from proposals.

What would settle it

A direct comparison showing that the method fails to match or exceed the top reported scores on the DAVIS'17 and YouTube-VOS validation sets would falsify the performance claim.
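Scores on these benchmarks are built on region similarity J, the intersection-over-union between a predicted mask and the ground-truth mask, usually reported together with a contour measure F. The pure-Python sketch below computes J for one pair of binary masks given as nested lists; the function name `jaccard` and the convention of returning 1.0 when both masks are empty are assumptions made for this illustration.

```python
def jaccard(pred, gt):
    """Region similarity J (mask IoU) for two same-sized binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1  # pixel is foreground in both masks
            if p or g:
                union += 1  # pixel is foreground in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union
```

A benchmark score would average this value over all annotated objects and frames, so a fair comparison needs the same validation split and the same averaging protocol for every method.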

Figures

Figures reproduced from arXiv: 1907.01203 by Chang Huang, Han Shen, Lichao Huang, Qiang Zhou, Wenyu Liu, Xinggang Wang, Yongchao Gong, Zilong Huang.

Figure 1: Example proposals of OPN on unseen categories. We randomly pick a portion of all proposals near the objects of interest. The …
Figure 2: Overview of the proposed PTSNet, which consists of …

Original abstract

Video object segmentation (VOS) aims at pixel-level object tracking given only the annotations in the first frame. Due to the large visual variations of objects in video and the lack of training samples, it remains a difficult task despite the upsurging development of deep learning. Toward solving the VOS problem, we bring in several new insights by the proposed unified framework consisting of object proposal, tracking and segmentation components. The object proposal network transfers objectness information as generic knowledge into VOS; the tracking network identifies the target object from the proposals; and the segmentation network is performed based on the tracking results with a novel dynamic-reference based model adaptation scheme. Extensive experiments have been conducted on the DAVIS'17 dataset and the YouTube-VOS dataset; our method achieves the state-of-the-art performance on several video object segmentation benchmarks. We make the code publicly available at https://github.com/sydney0zq/PTSNet.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes a cascaded PTS network for video object segmentation that consists of an object proposal network transferring generic objectness knowledge, a tracking network selecting the target object from proposals, and a segmentation network employing a novel dynamic-reference based model adaptation scheme. The work asserts that this unified framework solves the VOS problem under large visual variations and limited training samples, and reports achieving state-of-the-art performance on the DAVIS'17 and YouTube-VOS benchmarks while releasing code publicly.

Significance. If the quantitative results, baselines, and component ablations ultimately support the claims, the explicit decomposition into proposal, tracking, and adaptive segmentation stages could constitute a useful architectural contribution to video object segmentation by demonstrating the value of injecting generic objectness priors. The public code release would further aid reproducibility.

major comments (2)
  1. [Abstract] Abstract: the central claim that the method 'achieves the state-of-the-art performance on several video object segmentation benchmarks' is asserted without any quantitative metrics, baseline comparisons, error bars, ablation tables, or experimental protocol details, rendering the empirical contribution unverifiable from the manuscript text.
  2. [Experiments] Experiments (implied by the abstract's reference to 'extensive experiments'): no ablation is reported that isolates the tracking network's contribution, e.g., by substituting random proposal selection or an oracle tracker while keeping the proposal and segmentation networks fixed. Without such a controlled test the necessity of the three-stage cascade remains unproven, directly affecting the load-bearing assumption that the tracking stage reliably selects the correct target.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address the two major comments point-by-point below and indicate the revisions we will make to the manuscript.

  1. Referee: [Abstract] Abstract: the central claim that the method 'achieves the state-of-the-art performance on several video object segmentation benchmarks' is asserted without any quantitative metrics, baseline comparisons, error bars, ablation tables, or experimental protocol details, rendering the empirical contribution unverifiable from the manuscript text.

    Authors: We agree that the abstract, as a concise summary, does not contain the supporting quantitative details. The Experiments section provides the full results, including J&F scores on DAVIS'17 and YouTube-VOS with baseline comparisons and protocol information. To improve verifiability directly from the abstract, we will revise it to include the primary performance numbers and a brief reference to the evaluation setting. revision: yes

  2. Referee: [Experiments] Experiments (implied by the abstract's reference to 'extensive experiments'): no ablation is reported that isolates the tracking network's contribution, e.g., by substituting random proposal selection or an oracle tracker while keeping the proposal and segmentation networks fixed. Without such a controlled test the necessity of the three-stage cascade remains unproven, directly affecting the load-bearing assumption that the tracking stage reliably selects the correct target.

    Authors: The manuscript reports ablations on the proposal network and the dynamic-reference adaptation scheme, together with end-to-end comparisons. We acknowledge, however, that an explicit controlled ablation replacing the tracking stage with random selection or an oracle tracker is not present and would more directly substantiate the value of the cascade. We will add this experiment (and the corresponding analysis) to the revised Experiments section. revision: yes
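The controlled ablation discussed here can be framed as swapping the target selector while everything else stays fixed. The sketch below shows the two reference selectors the referee names, a random pick and an oracle that chooses the proposal with the highest IoU against the ground-truth box, plus a simple hit-rate measure. All function names, the `(x1, y1, x2, y2)` box format, and the 0.5 IoU threshold are assumptions for illustration, not details from the paper.

```python
import random


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_random(proposals, rng):
    """Lower bound: ignore appearance and pick any proposal."""
    return rng.choice(proposals)


def select_oracle(proposals, gt_box):
    """Upper bound: pick the proposal that best overlaps the ground truth."""
    return max(proposals, key=lambda p: box_iou(p, gt_box))


def selection_accuracy(selected, gt_boxes, threshold=0.5):
    """Fraction of frames whose selected box overlaps ground truth enough."""
    hits = sum(box_iou(s, g) >= threshold for s, g in zip(selected, gt_boxes))
    return hits / len(gt_boxes)
```

Running the learned tracker between these two bounds, with the proposal and segmentation networks held fixed, would show how much of the gap between random and oracle selection the tracking stage actually closes.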

Circularity Check

0 steps flagged

No circularity: empirical cascade evaluated on external benchmarks


The paper presents a three-stage neural architecture (proposal network, tracking network, dynamic-reference segmentation) trained and tested on public VOS benchmarks (DAVIS'17, YouTube-VOS). No equations, fitted parameters, or first-principles derivations are shown that reduce to their own inputs by construction. Performance claims rest on external empirical results rather than self-referential definitions or self-citation chains. The reader's assessment of score 1.0 is consistent with the absence of any load-bearing circular step.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim rests on empirical performance of a deep network trained on standard VOS benchmarks; no additional free parameters, axioms, or invented entities beyond conventional neural-network training are described in the abstract.

free parameters (1)
  • network weights and hyperparameters
    Standard deep-learning parameters fitted during training on benchmark data; not enumerated in the abstract.


