Graph Neural Based End-to-end Data Association Framework for Online Multiple-Object Tracking
Pith reviewed 2026-05-24 23:05 UTC · model grok-4.3
The pith
A graph neural network can solve maximum weighted bipartite matching for data association in online multiple object tracking directly from detections.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that an end-to-end network, combining an affinity learning module with a graph neural network optimization module, can resolve data association in online MOT by learning to solve the maximum weighted bipartite matching task. Because every module is trained jointly, the whole system is said to co-adapt during training and to handle varying object cardinalities with good scalability.
What carries the argument
The graph neural network optimization module that takes computed affinities as edge weights and solves the maximum weighted bipartite matching problem while adapting to varying numbers of detections.
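For reference, maximum weighted bipartite matching is commonly written as the integer program below. The notation is chosen here for illustration and is not taken from the paper: w_{ij} is the affinity between existing object i (of m) and new detection j (of n), and x_{ij} = 1 means the two are associated.

```latex
\begin{aligned}
\max_{x}\quad & \sum_{i=1}^{m}\sum_{j=1}^{n} w_{ij}\,x_{ij} \\
\text{s.t.}\quad & \sum_{j=1}^{n} x_{ij} \le 1 \quad \forall i, \qquad
\sum_{i=1}^{m} x_{ij} \le 1 \quad \forall j, \qquad
x_{ij} \in \{0, 1\}.
\end{aligned}
```

The inequality constraints allow m and n to differ, which is the "varying cardinalities" setting: objects or detections left without a partner simply have an all-zero row or column in x.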
If this is right
- All modules in the tracker co-adapt during joint training, improving overall model adaptiveness.
- The system handles association problems with changing numbers of detections without fixed-size assumptions.
- Parameter tuning effort decreases because the network learns the matching process directly.
- The approach integrates appearance and motion cues into a single trainable pipeline for online tracking.
Where Pith is reading between the lines
- The same graph neural network approach to matching could apply to other vision tasks that reduce to bipartite assignment.
- Replacing traditional solvers with learned optimization might lower computational overhead in real-time systems.
- End-to-end training of association could allow trackers to adjust automatically to new camera setups or object types.
Load-bearing premise
The graph neural network can reliably approximate optimal solutions to the maximum weighted bipartite matching problem for different numbers of objects without post-processing or separate solvers.
What would settle it
Compare the assignments produced by the trained graph neural network against exact solutions from a standard bipartite matching solver on sequences with known ground-truth associations and varying object counts; systematic mismatches would falsify the claim.
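The check described above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: `exact_matching` is a hypothetical brute-force reference solver that is only practical for a handful of objects (a Hungarian-method solver would take its place at realistic sizes), and `predicted_pairs` stands in for whatever assignment the trained network emits.

```python
from itertools import permutations


def exact_matching(weights):
    """Brute-force maximum weighted bipartite matching on a small matrix.

    weights[i][j] is the affinity between track i and detection j.
    Assumes non-negative affinities, so the smaller side is fully matched.
    Returns (best_total, sorted list of (track, detection) pairs).
    """
    n_rows, n_cols = len(weights), len(weights[0])
    transposed = n_rows > n_cols
    if transposed:  # make rows the smaller side so every row gets a column
        weights = [list(col) for col in zip(*weights)]
        n_rows, n_cols = n_cols, n_rows

    best_total, best_pairs = float("-inf"), []
    for cols in permutations(range(n_cols), n_rows):
        total = sum(weights[i][j] for i, j in enumerate(cols))
        if total > best_total:
            best_total, best_pairs = total, list(enumerate(cols))

    if transposed:  # map back to (track, detection) order
        best_pairs = [(j, i) for i, j in best_pairs]
    return best_total, sorted(best_pairs)


def agreement(predicted_pairs, weights):
    """Fraction of the exact solver's pairs that the prediction reproduces."""
    _, exact_pairs = exact_matching(weights)
    hits = len(set(predicted_pairs) & set(exact_pairs))
    return hits / len(exact_pairs)


# Hypothetical affinities: 3 tracks, 2 detections.
weights = [
    [0.9, 0.1],
    [0.8, 0.7],
    [0.2, 0.3],
]
total, pairs = exact_matching(weights)
print(pairs)                                   # exact reference assignment
print(agreement([(0, 0), (2, 1)], weights))    # a prediction with one mismatch
```

Running `agreement` over many frames with different track and detection counts would show whether mismatches are occasional or systematic.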
Original abstract
In this work, we present an end-to-end framework to settle data association in online Multiple-Object Tracking (MOT). Given detection responses, we formulate the frame-by-frame data association as Maximum Weighted Bipartite Matching problem, whose solution is learned using a neural network. The network incorporates an affinity learning module, wherein both appearance and motion cues are investigated to encode object feature representation and compute pairwise affinities. Employing the computed affinities as edge weights, the following matching problem on a bipartite graph is resolved by the optimization module, which leverages a graph neural network to adapt with the varying cardinalities of the association problem and solve the combinatorial hardness with favorable scalability and compatibility. To facilitate effective training of the proposed tracking network, we design a multi-level matrix loss in conjunction with the assembled supervision methodology. Being trained end-to-end, all modules in the tracker can co-adapt and co-operate collaboratively, resulting in improved model adaptiveness and less parameter-tuning efforts. Experiment results on the MOT benchmarks demonstrate the efficacy of the proposed approach.
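To make the idea of pairwise affinities as edge weights concrete, here is a hand-crafted stand-in. It is an assumption for illustration only: the paper learns its affinities with a network, whereas this sketch mixes cosine similarity of appearance embeddings with bounding-box overlap (IoU) as a crude motion cue. The function names and the mixing weight `alpha` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two appearance embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def affinity_matrix(tracks, detections, alpha=0.5):
    """Build an m x n matrix of edge weights.

    tracks and detections are lists of (embedding, box) tuples;
    alpha (hypothetical) balances appearance against box overlap.
    """
    return [
        [alpha * cosine_similarity(t_emb, d_emb) + (1 - alpha) * iou(t_box, d_box)
         for d_emb, d_box in detections]
        for t_emb, t_box in tracks
    ]


tracks = [([1.0, 0.0], (0, 0, 2, 2))]
detections = [([1.0, 0.0], (0, 0, 2, 2)), ([0.0, 1.0], (10, 10, 12, 12))]
print(affinity_matrix(tracks, detections))
```

The matrix has one row per tracked object and one column per detection, so its shape changes from frame to frame. That is the property the optimization module is claimed to cope with.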
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an end-to-end neural framework for online multiple-object tracking that formulates frame-by-frame data association as a maximum weighted bipartite matching problem. An affinity learning module encodes appearance and motion cues to produce edge weights; these are fed to a graph neural network optimization module that is claimed to adapt to varying detection cardinalities and solve the combinatorial problem directly. Training uses a multi-level matrix loss with assembled supervision, allowing all modules to co-adapt.
Significance. If the GNN optimization module produces valid, high-quality matchings for arbitrary cardinalities without external solvers or post-processing, the work would advance fully differentiable MOT pipelines and reduce reliance on hand-tuned components. The multi-level loss and end-to-end training are presented as enabling better adaptability on MOT benchmarks.
major comments (1)
- [Abstract / Optimization Module] Abstract (and optimization module description): the central claim that the optimization module 'leverages a graph neural network to adapt with the varying cardinalities of the association problem and solve the combinatorial hardness' without post-hoc adjustments is load-bearing for the 'end-to-end' and 'no separate solvers' assertions. Standard message-passing GNNs on bipartite graphs output soft scores; converting them to feasible permutation matrices for unseen cardinalities typically requires argmax, Sinkhorn normalization, or an external solver such as the Hungarian algorithm. The multi-level matrix loss supervises toward ground-truth assignments only during training and does not guarantee feasible or optimal outputs at inference. Concrete evidence (architecture diagram, inference procedure, or ablation removing any post-processing) is required to substantiate the claim.
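The feasibility concern in this comment can be illustrated with a small sketch. It is not the paper's procedure: the score matrix is made up, and `sinkhorn` with its `temperature` and `n_iters` parameters is a generic, hypothetical implementation of Sinkhorn normalization. The example shows that a per-row argmax on raw scores can assign two rows to the same column, and that Sinkhorn normalization pushes the scores toward a doubly stochastic matrix that is still soft and still needs rounding.

```python
import math


def normalise(vec):
    """Scale a vector of positive numbers so that it sums to 1."""
    total = sum(vec)
    return [v / total for v in vec]


def row_argmax(scores):
    """Pick the best column independently for each row (columns may repeat)."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


def sinkhorn(scores, n_iters=50, temperature=0.1):
    """Alternately normalise rows and columns of exp(scores / temperature).

    For a square matrix this moves toward a doubly stochastic matrix.
    Each iteration ends with the column step, so columns sum to 1 and
    rows do so only approximately.
    """
    m = [[math.exp(s / temperature) for s in row] for row in scores]
    for _ in range(n_iters):
        m = [normalise(row) for row in m]             # rows sum to 1
        cols = [normalise(col) for col in zip(*m)]    # columns sum to 1
        m = [list(row) for row in zip(*cols)]         # transpose back
    return m


# Made-up soft scores for 3 tracks and 3 detections.
scores = [
    [0.9, 0.8, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]
print(row_argmax(scores))   # rows 0 and 1 both pick column 0: not one-to-one
soft = sinkhorn(scores)
print(row_argmax(soft))     # argmax after normalisation
```

Even after normalization the output is a matrix of fractions rather than a permutation, and a row-wise argmax on it is not guaranteed to be one-to-one in general. That is why the comment asks which rounding step the paper applies at inference.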
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. Below we respond point-by-point to the major comment, offering clarification on the optimization module while committing to revisions that strengthen the presentation of our claims.
Referee: [Abstract / Optimization Module] Abstract (and optimization module description): the central claim that the optimization module 'leverages a graph neural network to adapt with the varying cardinalities of the association problem and solve the combinatorial hardness' without post-hoc adjustments is load-bearing for the 'end-to-end' and 'no separate solvers' assertions. Standard message-passing GNNs on bipartite graphs output soft scores; converting them to feasible permutation matrices for unseen cardinalities typically requires argmax, Sinkhorn normalization, or an external solver such as the Hungarian algorithm. The multi-level matrix loss supervises toward ground-truth assignments only during training and does not guarantee feasible or optimal outputs at inference. Concrete evidence (architecture diagram, inference procedure, or ablation removing any post-processing) is required to substantiate the claim.
Authors: We appreciate the referee highlighting the need for precision on this central aspect of the framework. The optimization module constructs a bipartite graph whose nodes correspond to detections in the current and previous frames (thus naturally accommodating arbitrary cardinalities) and whose edges are initialized with affinities from the appearance-motion module. Successive GNN layers perform message passing that refines these affinities into an output matrix whose entries directly encode assignment decisions. The multi-level matrix loss, applied with assembled supervision, explicitly penalizes deviations from the ground-truth assignment matrix at multiple resolutions, encouraging the network to produce outputs that are already close to valid permutation matrices. At inference the GNN output is used to recover the matching by selecting the highest-scoring entries while enforcing the one-to-one constraint implicit in the learned representation; no external combinatorial solver is invoked. This design keeps the entire pipeline differentiable.
We nevertheless recognize that the manuscript would benefit from greater transparency. In the revision we will add an architecture diagram of the optimization module, a step-by-step description of the inference procedure that converts the GNN output into a feasible matching for unseen cardinalities, and an ablation that isolates the contribution of any minimal post-processing steps.
revision: yes
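The rebuttal does not spell out the inference step. One plausible reading of "selecting the highest-scoring entries while enforcing the one-to-one constraint" is greedy decoding, sketched below under that assumption. The function name and the `min_score` threshold are hypothetical and do not come from the paper.

```python
def greedy_one_to_one(scores, min_score=0.5):
    """Greedily keep the highest-scoring (row, col) pairs.

    Each row and each column is used at most once. The matrix may be
    rectangular; rows or columns that never qualify stay unmatched.
    """
    candidates = sorted(
        ((s, i, j)
         for i, row in enumerate(scores)
         for j, s in enumerate(row)
         if s >= min_score),
        reverse=True,
    )
    used_rows, used_cols, matches = set(), set(), []
    for _, i, j in candidates:
        if i not in used_rows and j not in used_cols:
            matches.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return sorted(matches)


# Output that is already close to a (partial) permutation: 2 tracks, 3 detections.
near_permutation = [
    [0.05, 0.92, 0.03],
    [0.88, 0.07, 0.05],
]
print(greedy_one_to_one(near_permutation))   # detection 2 stays unmatched

# Ambiguous output: greedy takes (0, 0) first and then cannot match track 1.
ambiguous = [
    [0.9, 0.8, 0.1],
    [0.7, 0.2, 0.1],
]
print(greedy_one_to_one(ambiguous))
```

On the near-permutation matrix the greedy rule recovers the intended pairs. On the ambiguous matrix it returns a single pair worth 0.9, although pairing track 0 with detection 1 and track 1 with detection 0 would total 1.5. The quality of the result therefore depends on how sharp the network output is, which is what the promised ablation would need to show.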
Circularity Check
No circularity: framework trained end-to-end on external data with no self-definitional reductions
The paper formulates data association as a maximum weighted bipartite matching problem and learns its solution via a neural network with affinity and optimization modules. All components are trained on labeled tracking data using a multi-level matrix loss; outputs are not equivalent to inputs by construction, nor are any predictions statistically forced from fitted subsets. No self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text. The derivation chain is therefore self-contained against external benchmarks and does not reduce to renaming or tautology.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: A graph neural network can adapt to varying cardinalities and solve the maximum weighted bipartite matching problem with favorable scalability.