A unified neural network for object detection, multiple object tracking and vehicle re-identification

Jiakui Wang; Yuhao Xu

arxiv: 1907.03465 · v1 · pith:4BIMAETZnew · submitted 2019-07-08 · 💻 cs.CV

Pith reviewed 2026-05-25 01:16 UTC · model grok-4.3

keywords object detection · multiple object tracking · vehicle re-identification · end-to-end network · Faster RCNN · triplet loss · track branch · unified model

The pith

Adding a track branch to Faster RCNN unifies detection, tracking and re-identification into one end-to-end network.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that object detection, multiple object tracking, and vehicle re-identification can be handled by a single network rather than separate models. It extends the Faster RCNN detector by inserting a track branch that reuses region-of-interest (RoI) feature vectors and trains the whole system end-to-end with a combination of losses that includes triplet loss. Training data for the triplet loss is built by concatenating neighboring video frames so that matching object instances appear together. This matters because training and running a detector and a separate re-identification model is described as time-consuming. The resulting model, with a ResNet-101 backbone, reaches 57.79 percent mean average precision on the AIC19 vehicle tracking dataset while also performing tracking.

Core claim

We unify the detector and RE-ID model into an end-to-end network by adding an additional track branch for tracking to the Faster RCNN architecture. With a unified network, we are able to train the whole model end-to-end with a multi-task loss. The track branch directly uses the RoI feature vector from the Faster RCNN baseline, which reduces the amount of computation. Since a single image lacks repeated instances of the same object, which are necessary when we use the triplet loss to optimize the track branch, we concatenate neighboring frames in a video to construct our training dataset. Experiments show that our model with a ResNet-101 backbone can achieve 57.79% mAP and track vehicles well.

What carries the argument

The track branch inserted into Faster RCNN that operates directly on RoI feature vectors and is optimized with triplet loss after neighboring frames are concatenated.
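
To make the triplet objective concrete, here is a minimal sketch of one common form of the loss, written over plain Python lists standing in for track-branch embeddings. The function name `triplet_loss`, the margin of 0.3, and the use of unsquared Euclidean distance are illustrative assumptions; the source does not state the exact formulation used in the paper.

```python
import math

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge-style triplet loss on one (anchor, positive, negative) triple.

    The margin value and the unsquared Euclidean distance are assumptions
    for illustration, not values taken from the paper.
    """
    d_pos = math.dist(anchor, positive)  # distance to the same identity
    d_neg = math.dist(anchor, negative)  # distance to a different identity
    return max(0.0, d_pos - d_neg + margin)

anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # distance 1 from the anchor
negative = [3.0, 4.0]   # distance 5 from the anchor

# Positive already closer than the negative by more than the margin: no loss.
easy = triplet_loss(anchor, positive, negative)
# Roles swapped: the "positive" is far away, so the loss is large.
hard = triplet_loss(anchor, negative, positive)
```

The loss is zero only once same-identity embeddings sit closer to the anchor than different-identity embeddings by at least the margin, which is why each training sample needs at least two instances of the same vehicle.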

If this is right

The detector, tracker and re-identification head can be optimized jointly rather than in separate stages.
Inference requires only one forward pass through the shared backbone instead of separate CNN evaluations for re-identification.
Multi-loss training becomes feasible for the combined detection and tracking objectives.
The approach achieves 57.79 percent mAP on the AIC19 vehicle tracking dataset with a ResNet-101 backbone.
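
As a rough illustration of what a single forward pass buys at inference time, the sketch below matches per-detection embeddings against stored track embeddings by distance. The greedy matching rule, the `max_distance` threshold, and the function name `associate` are assumptions made for this example; the source does not describe the association step, and Deep SORT itself uses a more elaborate matching procedure.

```python
import math

def associate(track_embeddings, det_embeddings, max_distance=0.5):
    """Greedily assign detections to existing tracks by embedding distance.

    track_embeddings: dict mapping track id -> embedding vector
    det_embeddings:   list of embedding vectors, one per detection
    Returns a list with the matched track id (or None) for each detection.
    The threshold and the greedy strategy are illustrative assumptions.
    """
    candidates = sorted(
        (math.dist(emb, det), tid, j)
        for tid, emb in track_embeddings.items()
        for j, det in enumerate(det_embeddings)
    )
    assigned = [None] * len(det_embeddings)
    used_tracks = set()
    for dist, tid, j in candidates:
        if dist > max_distance:
            break  # remaining candidates are even farther apart
        if tid in used_tracks or assigned[j] is not None:
            continue
        assigned[j] = tid
        used_tracks.add(tid)
    return assigned

tracks = {1: [0.0, 0.0], 2: [1.0, 1.0]}
detections = [[1.0, 0.9], [0.1, 0.0], [5.0, 5.0]]
matches = associate(tracks, detections)
```

In this toy input the third detection is far from every track, so it stays unmatched and would start a new track in a full tracker.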

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Joint optimization might improve feature quality for all three tasks compared with independently trained models.
The reuse of RoI features could be applied to other tracking-by-detection pipelines that currently run separate re-identification networks.
If frame concatenation works for vehicles it may generalize to other moving objects where temporal proximity supplies natural positive pairs.
Reducing the number of separate models could lower overall system latency in real-time video applications.

Load-bearing premise

Concatenating neighboring frames supplies independent same-object instances suitable for triplet-loss training without introducing strong temporal bias or label noise.
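
The construction this premise refers to can be sketched as follows, assuming two frames of equal height are placed side by side and the boxes of the second frame are shifted by the width of the first. The side-by-side layout, the box format `(x1, y1, x2, y2, track_id)`, and the helper names are assumptions for illustration; the source only says that neighboring frames are concatenated.

```python
def concat_frames(frame_a, boxes_a, frame_b, boxes_b):
    """Place frame_b to the right of frame_a and shift its boxes.

    Frames are lists of pixel rows; boxes are (x1, y1, x2, y2, track_id).
    Horizontal concatenation is an assumed layout, not confirmed by the paper.
    """
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same height")
    width_a = len(frame_a[0])
    image = [row_a + row_b for row_a, row_b in zip(frame_a, frame_b)]
    shifted = [(x1 + width_a, y1, x2 + width_a, y2, tid)
               for x1, y1, x2, y2, tid in boxes_b]
    return image, list(boxes_a) + shifted

def positive_pairs(boxes):
    """Index pairs of boxes that share a track id (triplet-loss positives)."""
    return [(i, j)
            for i in range(len(boxes))
            for j in range(i + 1, len(boxes))
            if boxes[i][4] == boxes[j][4]]

frame_t = [[0] * 4 for _ in range(2)]    # 2x4 dummy frame at time t
frame_t1 = [[1] * 4 for _ in range(2)]   # 2x4 dummy frame at time t+1
boxes_t = [(0, 0, 2, 2, 7)]
boxes_t1 = [(1, 0, 3, 2, 7), (0, 0, 1, 1, 9)]

image, boxes = concat_frames(frame_t, boxes_t, frame_t1, boxes_t1)
pairs = positive_pairs(boxes)
```

Vehicle 7 appears once in each half of the concatenated image, so the sample now contains a same-identity pair, while vehicle 9 can serve as a negative.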

What would settle it

Training the same architecture on randomly paired frames instead of concatenated neighboring frames and observing whether mAP or tracking consistency drops would test whether the frame-concatenation step is necessary.
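
The two pairing schemes in this test could be generated along the following lines. This is a sketch of the proposed comparison only; the function names, the `gap` parameter, and the fixed seed are hypothetical, and none of this is an experiment reported in the paper.

```python
import random

def neighboring_pairs(num_frames, gap=1):
    """Frame index pairs (t, t + gap), as in the concatenation scheme."""
    return [(t, t + gap) for t in range(num_frames - gap)]

def random_pairs(num_frames, count, seed=0):
    """Frame index pairs drawn uniformly at random, as a control condition."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(count):
        a, b = rng.sample(range(num_frames), 2)  # two distinct frames
        pairs.append((min(a, b), max(a, b)))
    return pairs

adjacent = neighboring_pairs(5)
control = random_pairs(100, count=4)
```

Randomly paired frames from a long video may share no vehicle at all, so any such control would also need to record how many positive pairs each scheme actually produces.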

Figures

Figures reproduced from arXiv: 1907.03465 by Jiakui Wang, Yuhao Xu.

**Figure 1.** Network architecture.

**Figure 2.** Concatenated image for training. (a) is used to train the single-camera scenario, (b) is used to train the multi-camera scenario.

**Figure 3.** Tracking result. (a) single-camera scenario result, (b) multi-camera scenario result.

**Figure 4.** Histogram of distance distribution between same objects and different objects. (a) is distance distribution in single camera

**Figure 5.** Tracking results.

Deep SORT\cite{wojke2017simple} is a tracking-by-detetion approach to multiple object tracking with a detector and a RE-ID model. Both separately training and inference with the two model is time-comsuming. In this paper, we unify the detector and RE-ID model into an end-to-end network, by adding an additional track branch for tracking in Faster RCNN architecture. With a unified network, we are able to train the whole model end-to-end with multi loss, which has shown much benefit in other recent works. The RE-ID model in Deep SORT needs to use deep CNNs to extract feature map from detected object images, However, track branch in our proposed network straight make use of the RoI feature vector in Faster RCNN baseline, which reduced the amount of calculation. Since the single image lacks the same object which is necessary when we use the triplet loss to optimizer the track branch, we concatenate the neighbouring frames in a video to construct our training dataset. We have trained and evaluated our model on AIC19 vehicle tracking dataset, experiment shows that our model with resnet101 backbone can achieve 57.79 \% mAP and track vehicle well.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This adds a track branch to Faster RCNN and reuses RoI features for re-ID, but supplies no baselines or ablations to show the unification actually improves anything.

The main point is that the authors bolt a re-ID branch onto Faster RCNN, train it end-to-end with a multi-task loss that includes triplet loss, and reuse the existing RoI features instead of running a separate CNN on each crop. They build positives for the triplet loss by concatenating neighboring video frames. That reuse step is the only clear engineering win, since it avoids redundant feature extraction at inference time. The rest of the paper is a direct follow-on to the Deep SORT pipeline they cite, with no new framework or derivation.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a unified end-to-end network extending Faster R-CNN with an added track branch that re-uses RoI features for vehicle re-identification and tracking. The detector, RE-ID, and tracking components are trained jointly via multi-task loss (including triplet loss) on a dataset formed by concatenating neighboring frames from AIC19 videos; the ResNet-101 variant is reported to reach 57.79% mAP while reducing computation relative to separate Deep SORT training.

Significance. If the unification demonstrably improves over separate models without artifacts from the data-construction shortcut, the approach would offer a concrete efficiency gain by eliminating a second CNN forward pass for RE-ID features. The reuse of RoI vectors is a clear architectural strength that could be valuable in resource-constrained tracking pipelines.

major comments (2)

[Abstract] Abstract (training dataset construction paragraph): the claim that concatenating neighboring frames supplies independent same-object instances for triplet-loss training is load-bearing for the end-to-end unification benefit. No ablation on inter-frame separation, no verification that positive-pair distances remain non-trivial, and no comparison against training on non-adjacent frames are supplied; temporal correlation or label noise in this construction could make the track branch learn motion continuity rather than identity discrimination.
[Abstract] Abstract (results paragraph): the headline 57.79% mAP is presented without any baseline (separate Faster R-CNN + Deep SORT, or Deep SORT alone), without the multi-task loss-weighting scheme, and without explicit statement that the number is measured on a held-out test split rather than training data. These omissions prevent verification that the reported performance stems from architectural unification rather than fitting or data leakage.
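
One way to check whether positive-pair distances remain non-trivial is to compare mean same-identity and different-identity embedding distances, in the spirit of the histogram in Figure 4. The sketch below assumes embeddings are plain vectors with one identity label each; the function name `distance_stats` and the toy data are illustrative and do not come from the paper.

```python
import math
from itertools import combinations

def distance_stats(embeddings, ids):
    """Mean pairwise distance for same-identity and different-identity pairs.

    Returns (mean_same, mean_diff); a mean is NaN when no such pair exists.
    """
    same, diff = [], []
    for i, j in combinations(range(len(ids)), 2):
        d = math.dist(embeddings[i], embeddings[j])
        (same if ids[i] == ids[j] else diff).append(d)

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(same), mean(diff)

# Toy embeddings: two vehicles (ids 7 and 9), two observations each.
embeddings = [[0.0, 0.0], [0.0, 1.0], [3.0, 4.0], [3.0, 5.0]]
ids = [7, 7, 9, 9]
mean_same, mean_diff = distance_stats(embeddings, ids)
```

If the same-identity mean collapses toward zero only for adjacent frames, that would suggest the branch is exploiting near-duplicate crops rather than learning identity.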

minor comments (2)

[Abstract] Typos and grammar: 'detetion' (detection), 'time-comsuming' (time-consuming), 'straight make use of' (directly uses), 'optimizer the track branch' (optimize).
[Abstract] The abstract states that the model 'track[s] vehicle well' but supplies no quantitative tracking metrics (MOTA, IDF1, etc.) or qualitative examples, leaving the tracking claim unsupported.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly.

Referee: [Abstract] Abstract (training dataset construction paragraph): the claim that concatenating neighboring frames supplies independent same-object instances for triplet-loss training is load-bearing for the end-to-end unification benefit. No ablation on inter-frame separation, no verification that positive-pair distances remain non-trivial, and no comparison against training on non-adjacent frames are supplied; temporal correlation or label noise in this construction could make the track branch learn motion continuity rather than identity discrimination.

Authors: We agree that the training data construction is central to enabling triplet loss in our unified model. Neighboring frames are concatenated specifically to supply multiple views of the same vehicle within one training example, since individual frames rarely repeat the same object. We did not provide the requested ablations in the original submission. In revision we will add an ablation on frame separation distance, report positive-pair distance statistics to confirm identity discrimination, and include a comparison against non-adjacent frame training. revision: yes
Referee: [Abstract] Abstract (results paragraph): the headline 57.79% mAP is presented without any baseline (separate Faster R-CNN + Deep SORT, or Deep SORT alone), without the multi-task loss-weighting scheme, and without explicit statement that the number is measured on a held-out test split rather than training data. These omissions prevent verification that the reported performance stems from architectural unification rather than fitting or data leakage.

Authors: We apologize for the lack of clarity. The 57.79% mAP figure is measured on the held-out AIC19 test split; we will state this explicitly. The multi-task loss weights will be reported in the methods section of the revision. While the manuscript already notes the computational benefit of RoI-feature reuse versus a separate RE-ID network, we will add a direct quantitative baseline comparison against separately trained Faster R-CNN + Deep SORT in the experiments section. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper's derivation consists of an architectural modification (adding a track branch to Faster R-CNN that reuses RoI features for triplet loss) plus a data-construction step (concatenating neighboring video frames to obtain same-object positives). Performance numbers are obtained by standard end-to-end training and evaluation on the AIC19 dataset; no equation or claim reduces by construction to its own inputs, no fitted parameter is relabeled as a prediction, and the sole external citation (Deep SORT) is not load-bearing for the unification claim. The central result therefore remains an independent architectural and empirical statement rather than a self-referential loop.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the sufficiency of RoI features for re-identification and on the validity of the concatenated-frame training construction; both are domain assumptions without external verification in the abstract.

free parameters (1)

multi-task loss weights
The abstract mentions training with multi loss but does not specify how detection, classification, and tracking losses are balanced.
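
To show where these weights would enter, here is a minimal sketch of a weighted multi-task loss. The loss term names (`rpn`, `cls`, `box`, `track`) and all numeric values are hypothetical placeholders; the source does not report which terms are combined or how they are weighted.

```python
def total_loss(losses, weights):
    """Weighted sum of per-task losses.

    losses and weights are dicts keyed by task name. Every loss must have
    a weight; the names and values used below are illustrative assumptions.
    """
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"no weight given for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in losses.items())

# Hypothetical per-batch loss values and weights.
losses = {"rpn": 0.5, "cls": 0.25, "box": 0.25, "track": 1.0}
weights = {"rpn": 1.0, "cls": 1.0, "box": 1.0, "track": 0.5}
combined = total_loss(losses, weights)
```

Because the track weight scales how strongly the triplet term shapes the shared RoI features, leaving it unreported makes the headline mAP hard to reproduce.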

axioms (1)

domain assumption: RoI feature vectors extracted by Faster RCNN are adequate for vehicle re-identification
Invoked when the authors replace a separate deep CNN with the track branch operating on RoI features.

pith-pipeline@v0.9.0 · 5746 in / 1263 out tokens · 28182 ms · 2026-05-25T01:16:06.055356+00:00 · methodology

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 3 internal anchors

[1] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. In 2017 IEEE International Conference on Image Processing (ICIP), pages 3645–3649. IEEE, 2017.

[2] Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, and Ben Upcroft. Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing (ICIP), pages 3464–3468. IEEE, 2016.

[3] Rahul Rama Varior, Mrinal Haloi, and Gang Wang. Gated siamese convolutional neural network architecture for human re-identification. In European Conference on Computer Vision, pages 791–808. Springer, 2016.

[4] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.

[5] Weihua Chen, Xiaotang Chen, Jianguo Zhang, and Kaiqi Huang. Beyond triplet loss: a deep quadruplet network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 403–412, 2017.

[6] Qiqi Xiao, Hao Luo, and Chi Zhang. Margin sample mining loss: A deep learning based method for person re-identification. arXiv preprint arXiv:1710.00478, 2017.

[7] Pedro Antonio Marín-Reyes, Luca Bergamini, Javier Lorenzo-Navarro, Andrea Palazzi, Simone Calderara, and Rita Cucchiara. Unsupervised vehicle re-identification using triplet networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 166–

[8] Ratnesh Kumar, Edwin Weill, Farzin Aghdasi, and Parthsarathy Sriram. Vehicle re-identification: an efficient baseline using triplet embedding. arXiv preprint arXiv:1901.01015, 2019.

[9] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.

[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1904–1916, 2015.

[11] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.

[12] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.

[13] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.

[14] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.

[15] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.

[16] Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C Berg. DSSD: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.

[17] Zhi Zhang, Tong He, Hang Zhang, Zhongyuan Zhang, Junyuan Xie, and Mu Li. Bag of freebies for training object detection neural networks. arXiv preprint arXiv:1902.04103, 2019.

[18] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6154–6162, 2018.

[19] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. arXiv preprint arXiv:1904.01355, 2019.
