pith. machine review for the scientific record.

arxiv: 2605.09245 · v1 · submitted 2026-05-10 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

CalibFree: Self-Supervised View Feature Separation for Calibration-Free Multi-Camera Multi-Object Tracking

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords: multi-camera multi-object tracking · self-supervised learning · feature separation · calibration-free · view-agnostic features · cross-view reconstruction · single-view distillation

The pith

Self-supervised separation of view-agnostic and view-specific features enables multi-camera tracking without calibration or labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CalibFree, a self-supervised learning method for multi-camera multi-object tracking that learns to isolate features consistent across camera views from those unique to each view. It does so through single-view distillation and cross-view reconstruction, without using any camera calibration parameters or manual annotations. The resulting view-agnostic features keep object identities consistent even in changing or complex scenes. On the MMP-MvMHAT dataset the approach raises overall accuracy by 3 percent and average F1 score by 7.5 percent relative to prior methods, with similar gains on the more varied MvMHAT dataset for long-term and cross-view tracking.

Core claim

By promoting separation between view-agnostic and view-specific representations via single-view distillation and cross-view reconstruction, CalibFree performs multi-camera multi-object tracking without calibration information or labels, yielding higher accuracy and F1 scores on standard benchmarks while adapting to dynamic camera configurations.

What carries the argument

The view-agnostic feature representation produced by single-view distillation together with cross-view reconstruction, which isolates identity-preserving information independent of camera perspective.
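
To make the mechanism concrete, here is a minimal sketch of how such a separation objective could be wired up in PyTorch. The encoder/decoder split, the use of a decorrelation-style penalty as the separation regularizer, and all names and weights are illustrative assumptions, not the paper's exact design.

    import torch
    import torch.nn.functional as F

    def calibfree_style_losses(encoder, decoder, teacher, crops_a, crops_b):
        """Hypothetical CalibFree-style objective for one pair of views.

        crops_a, crops_b: detection crops of the same objects seen from two
        cameras, shape (B, C, H, W). Each embedding is split into a
        view-agnostic half and a view-specific half.
        """
        za, zb = encoder(crops_a), encoder(crops_b)   # (B, D)
        za_ag, za_sp = za.chunk(2, dim=-1)            # agnostic / specific halves
        zb_ag, zb_sp = zb.chunk(2, dim=-1)

        # Single-view distillation: match a frozen teacher's embeddings.
        with torch.no_grad():
            ta, tb = teacher(crops_a), teacher(crops_b)
        l_distill = F.mse_loss(za, ta) + F.mse_loss(zb, tb)

        # Separation regularizer (assumed form): push the two halves of each
        # embedding toward decorrelation.
        l_sep = (za_ag * za_sp).mean().abs() + (zb_ag * zb_sp).mean().abs()

        # Cross-view reconstruction: rebuild each view from the OTHER view's
        # agnostic features plus this view's specific features; the decoder
        # maps embeddings back to crop-shaped tensors.
        l_recon = (F.mse_loss(decoder(torch.cat([za_ag, zb_sp], dim=-1)), crops_b)
                   + F.mse_loss(decoder(torch.cat([zb_ag, za_sp], dim=-1)), crops_a))

        return l_distill + l_sep + l_recon

If the agnostic halves truly carry identity, swapping them across views should still reconstruct the right objects; that is the property the combined losses are meant to force.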

If this is right

  • Tracking systems can operate in settings where installing calibrated cameras is impractical or expensive.
  • No manual labeling is required to train or adapt the tracker to new camera networks.
  • Performance remains stable under viewpoint changes and scene dynamics.
  • Cross-view association improves without explicit geometric alignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation idea could replace hand-crafted calibration in other multi-view tasks such as 3D pose estimation.
  • Temporary camera arrays for events or robotics could adopt this approach with little setup effort.
  • Adding temporal consistency signals might further strengthen identity preservation over long sequences.
  • The learned invariance might transfer to non-visual sensors if similar reconstruction objectives are defined.

Load-bearing premise

That single-view distillation combined with cross-view reconstruction produces features that keep object identities consistent across uncalibrated views without any external supervision.

What would settle it

Running the method on a new multi-camera dataset with no calibration data and no labels, then checking whether cross-view identity consistency collapses when the separation losses are removed.
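
A minimal sketch of that falsification protocol, assuming a train entry point that accepts per-loss weights and an evaluator that reports a cross-view identity-consistency score (every name here is hypothetical):

    # Hypothetical ablation harness: does cross-view identity consistency
    # collapse once the separation losses are removed?
    CONFIGS = {
        "full":         {"w_distill": 1.0, "w_sep": 1.0, "w_recon": 1.0},
        "no_sep":       {"w_distill": 1.0, "w_sep": 0.0, "w_recon": 1.0},
        "no_recon":     {"w_distill": 1.0, "w_sep": 1.0, "w_recon": 0.0},
        "distill_only": {"w_distill": 1.0, "w_sep": 0.0, "w_recon": 0.0},
    }

    for name, weights in CONFIGS.items():
        model = train(dataset="new_uncalibrated_multicam", labels=None, **weights)
        score = evaluate_cross_view_idf1(model)  # identity consistency across cameras
        print(f"{name}: cross-view IDF1 = {score:.3f}")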

Figures

Figures reproduced from arXiv: 2605.09245 by Deep Patel, Dinesh Manocha, Iain Melvin, Martin Renqiang Min, Ruiqi Xian, Sanjoy Kundu.

Figure 1. Multi-Camera Multi-Object Tracking (MCMOT) setup.

Figure 2. Overview of CalibFree. The method includes single-view distillation, feature separation, and cross-view reconstruction. In single-view distillation (red box), masked detections are encoded, and feature quality is supervised by a teacher model using a distillation loss. A separation regularizer encourages the learned features to specialize into view-agnostic and view-specific components. For cross-view reco…

Figure 3. Cross-view reconstruction results. The input images show the original crops, while the masked images indicate regions removed for reconstruction. CalibFree reconstructs the masked regions by leveraging complementary observations across views, recovering consistent appearance and identity-relevant cues even when large portions are obscured. Calibration independence. CalibFree never accesses camera intrinsic…
read the original abstract

Multi-camera multi-object tracking (MCMOT) faces significant challenges in maintaining consistent object identities across varying camera perspectives, particularly when precise calibration and extensive annotations are required. In this paper, we present CalibFree, a self-supervised representation learning framework that does not need any calibration or manual labeling for the MCMOT task. By promoting feature separation between view-agnostic and view-specific representations through single-view distillation and cross-view reconstruction, our method adapts to complex, dynamic scenarios with minimal overhead. Experiments on the MMP-MvMHAT dataset show a 3% improvement in overall accuracy and a 7.5% increase in the average F1 score over state-of-the-art approaches, confirming the effectiveness of our calibration-free design. Moreover, on the more diverse MvMHAT dataset, our approach demonstrates superior over-time tracking and strong cross-view performance, highlighting its adaptability to a wide range of camera configurations. Code will be publicly available upon acceptance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes CalibFree, a self-supervised framework for multi-camera multi-object tracking (MCMOT) that separates view-agnostic and view-specific features via single-view distillation and cross-view reconstruction, eliminating the need for camera calibration or manual labels. It reports a 3% gain in overall accuracy and 7.5% gain in average F1 score on the MMP-MvMHAT dataset, plus strong cross-view and over-time tracking results on the more diverse MvMHAT dataset.

Significance. If the central claim is substantiated, the work would be significant for practical MCMOT deployment in uncalibrated, dynamic settings by removing geometric priors and annotation burdens. The self-supervised design and public code commitment are positive attributes that could enable wider adoption if the identity-consistency mechanism is shown to hold.

major comments (3)
  1. [Method and Abstract] The central claim that single-view distillation plus cross-view reconstruction yields view-agnostic features with consistent object identities across cameras rests on an unverified assumption; the reconstruction objective (described at high level in the method) appears to operate at feature or pixel level without an explicit identity-aware or association term, which risks permitting identity permutations that still satisfy the loss (see skeptic note on data-association).
  2. [Experiments] Only aggregate performance numbers are reported; no ablation studies, error analysis, or derivation details are provided to isolate the contribution of the view-feature separation or to confirm that gains arise from multi-view consistency rather than improved single-view tracking.
  3. [Experiments] The 3% accuracy / 7.5% F1 improvements on MMP-MvMHAT are presented without statistical significance tests, variance across runs, or comparisons against strong calibration-free baselines, weakening attribution to the proposed design.
minor comments (1)
  1. [Abstract] The abstract's claim of 'strong cross-view performance' on MvMHAT would benefit from explicit per-camera or cross-view ID consistency metrics rather than qualitative description.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address each major comment below, providing clarifications from the manuscript and indicating planned revisions to strengthen the presentation.

read point-by-point responses
  1. Referee: [Method and Abstract] The central claim that single-view distillation plus cross-view reconstruction yields view-agnostic features with consistent object identities across cameras rests on an unverified assumption; the reconstruction objective (described at high level in the method) appears to operate at feature or pixel level without an explicit identity-aware or association term, which risks permitting identity permutations that still satisfy the loss (see skeptic note on data-association).

    Authors: We appreciate this observation on the implicit nature of identity consistency. In Section 3, the single-view distillation loss enforces intra-view feature consistency for the same object, while the cross-view reconstruction operates exclusively on the separated view-agnostic features; the separation itself is intended to isolate identity information from view-specific cues, reducing the likelihood of permutations that would violate reconstruction across views. The training data implicitly provides the association through simultaneous multi-view captures of the same scenes. To address the concern directly, we will expand the method section with a formal derivation of the combined objective and a discussion of the data-association assumption in the revision (one plausible form of the combined objective is sketched after this list). revision: partial

  2. Referee: [Experiments] Only aggregate performance numbers are reported; no ablation studies, error analysis, or derivation details are provided to isolate the contribution of the view-feature separation or to confirm that gains arise from multi-view consistency rather than improved single-view tracking.

    Authors: We agree that isolating the contributions is important. Although the original submission focused on overall results due to space limits, we have performed component ablations (distillation alone vs. full model) and error analysis on identity switches and cross-view consistency. These will be added to the experiments section in the revised manuscript, along with expanded loss derivations, to demonstrate that the gains stem from the multi-view feature separation rather than single-view improvements alone. revision: yes

  3. Referee: [Experiments] The 3% accuracy / 7.5% F1 improvements on MMP-MvMHAT are presented without statistical significance tests, variance across runs, or comparisons against strong calibration-free baselines, weakening attribution to the proposed design.

    Authors: We acknowledge the value of statistical rigor and clearer baseline attribution. In the revision we will report standard deviations over multiple runs, include paired significance tests, and explicitly categorize the baselines to highlight calibration-free methods. This will better substantiate that the reported gains are attributable to the proposed self-supervised separation (a minimal example of such a paired test follows this list). revision: yes
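
On response 1: the combined objective the authors promise to derive could plausibly take the following shape, in our notation rather than the paper's (the weights λ are assumptions):

    % One plausible form of the combined objective; lambda weights are
    % our notation, not values from the paper.
    \mathcal{L}_{\text{total}}
      = \mathcal{L}_{\text{distill}}
      + \lambda_{\text{sep}}   \, \mathcal{L}_{\text{sep}}
      + \lambda_{\text{recon}} \, \mathcal{L}_{\text{recon}}

Here the distillation term enforces intra-view feature quality against the teacher, the separation term penalizes leakage between view-agnostic and view-specific components, and the reconstruction term is computed only from the view-agnostic features across simultaneous captures.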
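
On response 3: as one concrete instance of the promised statistical analysis, a paired test over per-sequence F1 scores from matched runs (the numbers below are placeholders for illustration, not results from the paper):

    from scipy import stats

    # Per-sequence average F1 under both methods, matched by sequence
    # (placeholder values for illustration only).
    f1_baseline  = [0.612, 0.587, 0.641, 0.598, 0.625]
    f1_calibfree = [0.671, 0.654, 0.702, 0.660, 0.689]

    # Paired t-test: the same sequences are evaluated under both methods.
    t, p = stats.ttest_rel(f1_calibfree, f1_baseline)
    print(f"t = {t:.2f}, p = {p:.4f}")  # a small p supports attributing the gain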

Circularity Check

0 steps flagged

No circularity: empirical self-supervised framework with independent evaluation

full rationale

The paper introduces a self-supervised method that separates view-agnostic and view-specific features via single-view distillation and cross-view reconstruction losses, then reports tracking accuracy gains on external datasets (MMP-MvMHAT, MvMHAT). No derivation chain, equation, or fitted quantity is shown to reduce to its own inputs by construction. The central result is an empirical performance claim, not a mathematical prediction forced by self-definition or self-citation. Self-citations, if present, are not load-bearing for the uniqueness or correctness of the reported metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unproven premise that view-agnostic and view-specific features can be reliably disentangled without calibration parameters or labels.

axioms (1)
  • domain assumption View-agnostic and view-specific representations can be separated effectively through single-view distillation and cross-view reconstruction.
    This assumption is invoked to justify the calibration-free design.

pith-pipeline@v0.9.0 · 5483 in / 1158 out tokens · 35152 ms · 2026-05-12T04:44:39.237345+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 3 internal anchors

  1. https://iccv2021-mmp.github.io/
  2. Akbari, H., Yuan, L., Qian, R., Chuang, W.H., Chang, S.F., Cui, Y., Gong, B.: VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. In: Advances in Neural Information Processing Systems, vol. 34, pp. 24206–24221. Curran Associates, Inc. (2021)
  3. Bastani, F., He, S., Madden, S.: Self-supervised multi-object tracking with cross-input consistency. In: Advances in Neural Information Processing Systems, vol. 34, pp. 13695–13706. Curran Associates, Inc. (2021)
  4. Bergmann, P., Meinhardt, T., Leal-Taixé, L.: Tracking without bells and whistles. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 941–951 (2019)
  5. Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: the CLEAR MOT metrics. EURASIP Journal on Image and Video Processing 2008, 1–10 (2008)
  6. Cai, J., Xu, M., Li, W., Xiong, Y., Xia, W., Tu, Z., Soatto, S.: MeMOT: Multi-object tracking with memory. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8090–8100 (2022)
  7. Cai, Y., Medioni, G.: Exploring context information for inter-camera multiple target tracking. In: IEEE Winter Conference on Applications of Computer Vision, pp. 761–768. IEEE (2014)
  8. Cao, J., Pang, J., Weng, X., Khirodkar, R., Kitani, K.: Observation-centric SORT: Rethinking SORT for robust multi-object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9686–9696 (2023)
  9. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: European Conference on Computer Vision, pp. 213–229. Springer (2020)
  10. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.E.: A simple framework for contrastive learning of visual representations. arXiv:2002.05709 (2020)
  11. Chen, X., Huang, K., Tan, T.: Object tracking across non-overlapping views by learning inter-camera transfer models. Pattern Recognition 47(3), 1126–1137 (2014)
  12. Cheng, C.C., Qiu, M.X., Chiang, C.K., Lai, S.H.: ReST: A reconfigurable spatial-temporal graph model for multi-camera multi-object tracking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10051–10060 (2023)
  13. Chilgunde, A., Kumar, P., Ranganath, S., Huang, W.: Multi-camera target tracking in blind regions of cameras with non-overlapping fields of view. In: BMVC, pp. 1–
  14. Chu, P., Ling, H.: FAMNet: Joint learning of feature, affinity and multi-dimensional assignment for online multiple object tracking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6172–6181 (2019)
  15. Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context prediction. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430 (2015)
  16. Dong, J., Fang, Q., Jiang, W.B., Yang, Y., Huang, Q.X., Bao, H., Zhou, X.: Fast and robust multi-person 3D pose estimation and tracking from multiple views. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 6981–6992 (2021)
  17. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
  18. Duan, K., Bai, S., Xie, L., Qi, H., Huang, Q., Tian, Q.: CenterNet: Keypoint triplets for object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6569–6578 (2019)
  19. Feng, W., Wang, F., Han, R., Gan, Y., Qian, Z., Hou, J., Wang, S.: Unveiling the power of self-supervision for multi-view multi-human association and tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)
  20. Fleuret, F., Berclaz, J., Lengagne, R., Fua, P.: Multicamera people tracking with a probabilistic occupancy map. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(2), 267–282 (2008). https://doi.org/10.1109/TPAMI.2007.1174
  21. Gan, Y., Han, R., Yin, L., Feng, W., Wang, S.: Self-supervised multi-view multi-human association and tracking. In: Proceedings of the 29th ACM International Conference on Multimedia (2021)
  22. Ge, Z., Liu, S., Wang, F., Li, Z., Sun, J.: YOLOX: Exceeding YOLO series in 2021. arXiv preprint arXiv:2107.08430 (2021)
  23. Gilbert, A., Bowden, R.: Tracking objects across cameras by incrementally learning inter-camera colour calibration and patterns of activity. In: Computer Vision – ECCV 2006, Proceedings, Part II, pp. 125–136. Springer (2006)
  24. Girshick, R.: Fast R-CNN. arXiv preprint arXiv:1504.08083 (2015)
  25. Gu, J., Hu, C., Zhang, T., Chen, X., Wang, Y., Wang, Y., Zhao, H.: ViP3D: End-to-end visual trajectory prediction via 3D agent queries. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5496–5506 (2023)
  26. Han, R., Feng, W., Zhang, Y., Zhao, J., Wang, S.: Multiple human association and tracking from egocentric and complementary top views. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(9), 5225–5242 (2021)
  27. Han, R., Feng, W., Zhao, J., Niu, Z., Zhang, Y., Wan, L., Wang, S.: Complementary-view multiple human tracking. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 10917–10924 (2020)
  28. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.B.: Masked autoencoders are scalable vision learners. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15979–15988 (2021)
  29. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.B.: Momentum contrast for unsupervised visual representation learning. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9726–9735 (2019)
  30. He, Y., Wei, X., Hong, X., Shi, W., Gong, Y.: Multi-target multi-camera tracking by tracklet-to-target assignment. IEEE Transactions on Image Processing 29, 5191–5205 (2020)
  31. Huang, Z., Jin, X., Lu, C., Hou, Q., Cheng, M.M., Fu, D., Shen, X., Feng, J.: Contrastive masked autoencoders are stronger vision learners. IEEE Transactions on Pattern Analysis and Machine Intelligence 46, 2506–2517 (2022)
  32. Javed, O., Shafique, K., Shah, M.: Appearance modeling for tracking in multiple non-overlapping cameras. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 2, pp. 26–33. IEEE (2005)
  33. Kim, D., Cho, D., Kweon, I.S.: Self-supervised video representation learning with space-time cubic puzzles. arXiv:1811.09795 (2018)
  34. Kim, D., Cho, D., Yoo, D., Kweon, I.S.: Learning image representations by completing damaged jigsaw puzzles. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 793–802. IEEE (2018)
  35. Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2(1-2), 83–97 (1955)
  36. Larsson, G., Maire, M., Shakhnarovich, G.: Colorization as a proxy task for visual understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6874–6883 (2017)
  37. Leal-Taixé, L., Canton-Ferrer, C., Schindler, K.: Learning by tracking: Siamese CNN for robust target association. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 33–40 (2016)
  38. Lu, Z., Shuai, B., Chen, Y., Xu, Z., Modolo, D.: Self-supervised multi-object tracking with path consistency. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 19016–19026 (2024)
  39. Luiten, J., Osep, A., Dendorfer, P., Torr, P., Geiger, A., Leal-Taixé, L., Leibe, B.: HOTA: A higher order metric for evaluating multi-object tracking. International Journal of Computer Vision 129, 548–578 (2021)
  40. Maksai, A., Wang, X., Fleuret, F., Fua, P.: Non-Markovian globally consistent multi-object tracking. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2544–2554 (2017)
  41. Meinhardt, T., Kirillov, A., Leal-Taixe, L., Feichtenhofer, C.: TrackFormer: Multi-object tracking with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8844–8854 (2022)
  42. Murtagh, F., Contreras, P.: Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2(1), 86–97 (2012)
  43. Nguyen, D.M., Henschel, R., Rosenhahn, B., Sonntag, D., Swoboda, P.: LMGP: Lifted multicut meets geometry projections for multi-camera multi-object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8866–8875 (2022)
  44. Niculescu-Mizil, A., Patel, D., Melvin, I.: MCTR: Multi camera tracking transformer. arXiv preprint arXiv:2408.13243 (2024)
  45. Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. arXiv:1603.09246 (2016)
  46. van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv:1807.03748 (2018)
  47. Pang, Z., Li, J., Tokmakov, P., Chen, D., Zagoruyko, S., Wang, Y.X.: Standing between past and future: Spatio-temporal modeling for multi-camera 3D multi-object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17928–17938 (2023)
  48. Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: Feature learning by inpainting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544 (2016)
  49. Prosser, B.J., Gong, S., Xiang, T.: Multi-camera matching using bi-directional cumulative brightness transfer functions. In: BMVC, vol. 8, pp. 164–1. Leeds, UK (2008)
  50. Quach, K.G., Nguyen, P., Le, H., Truong, T.D., Duong, C.N., Tran, M.T., Luu, K.: DyGLIP: A dynamic graph model with link prediction for accurate multi-camera multiple object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13784–13793 (2021)
  51. Ristani, E., Solera, F., Zou, R., Cucchiara, R., Tomasi, C.: Performance measures and a data set for multi-target, multi-camera tracking. In: European Conference on Computer Vision, pp. 17–35. Springer (2016)
  52. Ristani, E., Tomasi, C.: Features for multi-target multi-camera tracking and re-identification. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6036–6046 (2018)
  53. Schulter, S., Vernaza, P., Choi, W., Chandraker, M.: Deep network flow for multi-object tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6951–6960 (2017)
  54. Sun, P., Cao, J., Jiang, Y., Zhang, R., Xie, E., Yuan, Z., Wang, C., Luo, P.: TransTrack: Multiple object tracking with transformer. arXiv preprint arXiv:2012.15460 (2020)
  55. Tesfaye, Y.T., Zemene, E., Prati, A., Pelillo, M., Shah, M.: Multi-target tracking in multiple non-overlapping cameras using constrained dominant sets. arXiv preprint arXiv:1706.06196 (2017)
  56. Tesfaye, Y.T., Zemene, E., Prati, A., Pelillo, M., Shah, M.: Multi-target tracking in multiple non-overlapping cameras using fast-constrained dominant sets. International Journal of Computer Vision 127, 1303–1320 (2019)
  57. Wang, C.Y., Bochkovskiy, A., Liao, H.Y.M.: YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
  58. Wang, J., Jiao, J., Liu, Y.H.: Self-supervised video representation learning by pace prediction. In: Computer Vision – ECCV 2020, pp. 504–521. Springer International Publishing, Cham (2020)
  59. Wang, L., Luc, P., Recasens, A., Alayrac, J.B., van den Oord, A.: Multimodal self-supervised learning of general audio representations. arXiv:2104.12807 (2021)
  60. Wang, Y.X., Zhang, Y.J.: Nonnegative matrix factorization: A comprehensive review. IEEE Transactions on Knowledge and Data Engineering 25(6), 1336–1353 (2012)
  61. Welch, G.: An introduction to the Kalman filter (1995)
  62. Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. In: 2017 IEEE International Conference on Image Processing (ICIP), pp. 3645–3649. IEEE (2017)
  63. Wu, J., Cao, J., Song, L., Wang, Y., Yang, M., Yuan, J.: Track to detect and segment: An online multi-object tracker. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12347–12356 (2021)
  64. Wu, Y., Kirillov, A., Massa, F., Lo, W.Y., Girshick, R.: Detectron2. https://github.com/facebookresearch/detectron2 (2019)
  65. Xu, J., Cao, Y., Zhang, Z., Hu, H.: Spatial-temporal relation networks for multi-object tracking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3988–3998 (2019)
  66. Xu, Y., Osep, A., Ban, Y., Horaud, R., Leal-Taixé, L., Alameda-Pineda, X.: How to train your deep multi-object tracker. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6787–6796 (2020)
  67. Xu, Y., Liu, X., Liu, Y., Zhu, S.C.: Multi-view people tracking via hierarchical trajectory composition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016)
  68. Yin, Y., Hua, Y., Song, T., Ma, R., Guan, H.: Self-supervised multi-object tracking with cycle-consistency. In: Conference on Multimedia Modeling (2023)
  69. You, Q., Jiang, H.: Real-time 3D deep multi-camera tracking. arXiv preprint arXiv:2003.11753 (2020)
  70. Zeng, F., Dong, B., Zhang, Y., Wang, T., Zhang, X., Wei, Y.: MOTR: End-to-end multiple-object tracking with transformer. In: European Conference on Computer Vision, pp. 659–675. Springer (2022)
  71. Zhang, T., Chen, X., Wang, Y., Wang, Y., Zhao, H.: MUTR3D: A multi-camera tracking framework via 3D-to-2D queries. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4537–4546 (2022)
  72. Zhang, Y., Sun, P., Jiang, Y., Yu, D., Weng, F., Yuan, Z., Luo, P., Liu, W., Wang, X.: ByteTrack: Multi-object tracking by associating every detection box (2022)
  73. Zhang, Y., Wang, T., Zhang, X.: MOTRv2: Bootstrapping end-to-end multi-object tracking by pretrained object detectors. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22056–22065 (2023)
  74. Zhao, Z., Wu, Z., Zhuang, Y., Li, B., Jia, J.: Tracking objects as pixel-wise distributions. In: European Conference on Computer Vision, pp. 76–94. Springer (2022)
  75. Zhong, Z., Zheng, L., Zheng, Z., Li, S., Yang, Y.: Camera style adaptation for person re-identification. arXiv:1711.10295 (2017)
  76. Zhou, X., Koltun, V., Krähenbühl, P.: Tracking objects as points. arXiv:2004.01177 (2020)