pith. machine review for the scientific record.

arxiv: 2605.09245 · v1 · submitted 2026-05-10 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

CalibFree: Self-Supervised View Feature Separation for Calibration-Free Multi-Camera Multi-Object Tracking

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords: multi-camera multi-object tracking · self-supervised learning · feature separation · calibration-free · view-agnostic features · cross-view reconstruction · single-view distillation

The pith

Self-supervised separation of view-agnostic and view-specific features enables multi-camera tracking without calibration or labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CalibFree, a self-supervised learning method for multi-camera multi-object tracking that learns to isolate features consistent across camera views from those unique to each view. It does so through single-view distillation and cross-view reconstruction, without using any camera calibration parameters or manual annotations. The resulting view-agnostic features keep object identities consistent even in changing or complex scenes. On the MMP-MvMHAT dataset the approach raises overall accuracy by 3 percent and average F1 score by 7.5 percent relative to prior methods, with similar gains on the more varied MvMHAT dataset for long-term and cross-view tracking.

Core claim

By promoting separation between view-agnostic and view-specific representations via single-view distillation and cross-view reconstruction, CalibFree performs multi-camera multi-object tracking without calibration information or labels, yielding higher accuracy and F1 scores on standard benchmarks while adapting to dynamic camera configurations.

What carries the argument

The view-agnostic feature representation produced by single-view distillation together with cross-view reconstruction, which isolates identity-preserving information independent of camera perspective.
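
To make the mechanism concrete, here is a minimal sketch of how such a separation objective could be wired up in PyTorch. The encoder/decoder split, the use of a decorrelation-style penalty as the separation regularizer, and all names and weights are illustrative assumptions, not the paper's exact design.

    import torch
    import torch.nn.functional as F

    def calibfree_style_losses(encoder, decoder, teacher, crops_a, crops_b):
        """Hypothetical CalibFree-style objective for one pair of views.

        crops_a, crops_b: detection crops of the same objects seen from two
        cameras, shape (B, C, H, W). Each embedding is split into a
        view-agnostic half and a view-specific half.
        """
        za, zb = encoder(crops_a), encoder(crops_b)   # (B, D)
        za_ag, za_sp = za.chunk(2, dim=-1)            # agnostic / specific halves
        zb_ag, zb_sp = zb.chunk(2, dim=-1)

        # Single-view distillation: match a frozen teacher's embeddings.
        with torch.no_grad():
            ta, tb = teacher(crops_a), teacher(crops_b)
        l_distill = F.mse_loss(za, ta) + F.mse_loss(zb, tb)

        # Separation regularizer (assumed form): push the two halves of each
        # embedding toward decorrelation.
        l_sep = (za_ag * za_sp).mean().abs() + (zb_ag * zb_sp).mean().abs()

        # Cross-view reconstruction: rebuild each view from the OTHER view's
        # agnostic features plus this view's specific features; the decoder
        # maps embeddings back to crop-shaped tensors.
        l_recon = (F.mse_loss(decoder(torch.cat([za_ag, zb_sp], dim=-1)), crops_b)
                   + F.mse_loss(decoder(torch.cat([zb_ag, za_sp], dim=-1)), crops_a))

        return l_distill + l_sep + l_recon

If the agnostic halves truly carry identity, swapping them across views should still reconstruct the right objects; that is the property the combined losses are meant to force.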

If this is right

  • Tracking systems can operate in settings where installing calibrated cameras is impractical or expensive.
  • No manual labeling is required to train or adapt the tracker to new camera networks.
  • Performance remains stable under viewpoint changes and scene dynamics.
  • Cross-view association improves without explicit geometric alignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation idea could replace hand-crafted calibration in other multi-view tasks such as 3D pose estimation.
  • Temporary camera arrays for events or robotics could adopt this approach with little setup effort.
  • Adding temporal consistency signals might further strengthen identity preservation over long sequences.
  • The learned invariance might transfer to non-visual sensors if similar reconstruction objectives are defined.

Load-bearing premise

That single-view distillation combined with cross-view reconstruction produces features that keep object identities consistent across uncalibrated views without any external supervision.

What would settle it

Running the method on a new multi-camera dataset with no calibration data and no labels, then checking whether cross-view identity consistency collapses when the separation losses are removed.
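
A minimal sketch of that falsification protocol, assuming a train entry point that accepts per-loss weights and an evaluator that reports a cross-view identity-consistency score (every name here is hypothetical):

    # Hypothetical ablation harness: does cross-view identity consistency
    # collapse once the separation losses are removed?
    CONFIGS = {
        "full":         {"w_distill": 1.0, "w_sep": 1.0, "w_recon": 1.0},
        "no_sep":       {"w_distill": 1.0, "w_sep": 0.0, "w_recon": 1.0},
        "no_recon":     {"w_distill": 1.0, "w_sep": 1.0, "w_recon": 0.0},
        "distill_only": {"w_distill": 1.0, "w_sep": 0.0, "w_recon": 0.0},
    }

    for name, weights in CONFIGS.items():
        model = train(dataset="new_uncalibrated_multicam", labels=None, **weights)
        score = evaluate_cross_view_idf1(model)  # identity consistency across cameras
        print(f"{name}: cross-view IDF1 = {score:.3f}")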

Figures

Figures reproduced from arXiv: 2605.09245 by Deep Patel, Dinesh Manocha, Iain Melvin, Martin Renqiang Min, Ruiqi Xian, Sanjoy Kundu.

Figure 1. Multi-Camera Multi-Object Tracking (MCMOT) setup.

Figure 2. Overview of CalibFree. The method includes single-view distillation, feature separation, and cross-view reconstruction. In single-view distillation (red box), masked detections are encoded, and feature quality is supervised by a teacher model using a distillation loss. A separation regularizer encourages the learned features to specialize into view-agnostic and view-specific components. For cross-view reco…

Figure 3. Cross-view reconstruction results. The input images show the original crops, while the masked images indicate regions removed for reconstruction. CalibFree reconstructs the masked regions by leveraging complementary observations across views, recovering consistent appearance and identity-relevant cues even when large portions are obscured. Calibration independence. CalibFree never accesses camera intrinsic…
read the original abstract

Multi-camera multi-object tracking (MCMOT) faces significant challenges in maintaining consistent object identities across varying camera perspectives, particularly when precise calibration and extensive annotations are required. In this paper, we present CalibFree, a self-supervised representation learning framework that does not need any calibration or manual labeling for the MCMOT task. By promoting feature separation between view-agnostic and view-specific representations through single-view distillation and cross-view reconstruction, our method adapts to complex, dynamic scenarios with minimal overhead. Experiments on the MMP-MvMHAT dataset show a 3% improvement in overall accuracy and a 7.5% increase in the average F1 score over state-of-the-art approaches, confirming the effectiveness of our calibration-free design. Moreover, on the more diverse MvMHAT dataset, our approach demonstrates superior over-time tracking and strong cross-view performance, highlighting its adaptability to a wide range of camera configurations. Code will be publicly available upon acceptance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes CalibFree, a self-supervised framework for multi-camera multi-object tracking (MCMOT) that separates view-agnostic and view-specific features via single-view distillation and cross-view reconstruction, eliminating the need for camera calibration or manual labels. It reports a 3% gain in overall accuracy and 7.5% gain in average F1 score on the MMP-MvMHAT dataset, plus strong cross-view and over-time tracking results on the more diverse MvMHAT dataset.

Significance. If the central claim is substantiated, the work would be significant for practical MCMOT deployment in uncalibrated, dynamic settings by removing geometric priors and annotation burdens. The self-supervised design and public code commitment are positive attributes that could enable wider adoption if the identity-consistency mechanism is shown to hold.

major comments (3)
  1. [Method and Abstract] The central claim that single-view distillation plus cross-view reconstruction yields view-agnostic features with consistent object identities across cameras rests on an unverified assumption; the reconstruction objective (described at high level in the method) appears to operate at feature or pixel level without an explicit identity-aware or association term, which risks permitting identity permutations that still satisfy the loss (see skeptic note on data-association).
  2. [Experiments] Only aggregate performance numbers are reported; no ablation studies, error analysis, or derivation details are provided to isolate the contribution of the view-feature separation or to confirm that gains arise from multi-view consistency rather than improved single-view tracking.
  3. [Experiments] The 3% accuracy / 7.5% F1 improvements on MMP-MvMHAT are presented without statistical significance tests, variance across runs, or comparisons against strong calibration-free baselines, weakening attribution to the proposed design.
minor comments (1)
  1. [Abstract] The abstract's claim of 'strong cross-view performance' on MvMHAT would benefit from explicit per-camera or cross-view ID consistency metrics rather than qualitative description.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address each major comment below, providing clarifications from the manuscript and indicating planned revisions to strengthen the presentation.

read point-by-point responses
  1. Referee: [Method and Abstract] The central claim that single-view distillation plus cross-view reconstruction yields view-agnostic features with consistent object identities across cameras rests on an unverified assumption; the reconstruction objective (described at high level in the method) appears to operate at feature or pixel level without an explicit identity-aware or association term, which risks permitting identity permutations that still satisfy the loss (see skeptic note on data-association).

    Authors: We appreciate this observation on the implicit nature of identity consistency. In Section 3, the single-view distillation loss enforces intra-view feature consistency for the same object, while the cross-view reconstruction operates exclusively on the separated view-agnostic features; the separation itself is intended to isolate identity information from view-specific cues, reducing the likelihood of permutations that would violate reconstruction across views. The training data implicitly provides the association through simultaneous multi-view captures of the same scenes. To address the concern directly, we will expand the method section with a formal derivation of the combined objective and a discussion of the data-association assumption in the revision (one plausible form of the combined objective is sketched after this list). revision: partial

  2. Referee: [Experiments] Only aggregate performance numbers are reported; no ablation studies, error analysis, or derivation details are provided to isolate the contribution of the view-feature separation or to confirm that gains arise from multi-view consistency rather than improved single-view tracking.

    Authors: We agree that isolating the contributions is important. Although the original submission focused on overall results due to space limits, we have performed component ablations (distillation alone vs. full model) and error analysis on identity switches and cross-view consistency. These will be added to the experiments section in the revised manuscript, along with expanded loss derivations, to demonstrate that the gains stem from the multi-view feature separation rather than single-view improvements alone. revision: yes

  3. Referee: [Experiments] The 3% accuracy / 7.5% F1 improvements on MMP-MvMHAT are presented without statistical significance tests, variance across runs, or comparisons against strong calibration-free baselines, weakening attribution to the proposed design.

    Authors: We acknowledge the value of statistical rigor and clearer baseline attribution. In the revision we will report standard deviations over multiple runs, include paired significance tests, and explicitly categorize the baselines to highlight calibration-free methods. This will better substantiate that the reported gains are attributable to the proposed self-supervised separation (a minimal example of such a paired test follows this list). revision: yes
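
On response 1: the combined objective the authors promise to derive could plausibly take the following shape, in our notation rather than the paper's (the weights λ are assumptions):

    % One plausible form of the combined objective; lambda weights are
    % our notation, not values from the paper.
    \mathcal{L}_{\text{total}}
      = \mathcal{L}_{\text{distill}}
      + \lambda_{\text{sep}}   \, \mathcal{L}_{\text{sep}}
      + \lambda_{\text{recon}} \, \mathcal{L}_{\text{recon}}

Here the distillation term enforces intra-view feature quality against the teacher, the separation term penalizes leakage between view-agnostic and view-specific components, and the reconstruction term is computed only from the view-agnostic features across simultaneous captures.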
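
On response 3: as one concrete instance of the promised statistical analysis, a paired test over per-sequence F1 scores from matched runs (the numbers below are placeholders for illustration, not results from the paper):

    from scipy import stats

    # Per-sequence average F1 under both methods, matched by sequence
    # (placeholder values for illustration only).
    f1_baseline  = [0.612, 0.587, 0.641, 0.598, 0.625]
    f1_calibfree = [0.671, 0.654, 0.702, 0.660, 0.689]

    # Paired t-test: the same sequences are evaluated under both methods.
    t, p = stats.ttest_rel(f1_calibfree, f1_baseline)
    print(f"t = {t:.2f}, p = {p:.4f}")  # a small p supports attributing the gain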

Circularity Check

0 steps flagged

No circularity: empirical self-supervised framework with independent evaluation

full rationale

The paper introduces a self-supervised method that separates view-agnostic and view-specific features via single-view distillation and cross-view reconstruction losses, then reports tracking accuracy gains on external datasets (MMP-MvMHAT, MvMHAT). No derivation chain, equation, or fitted quantity is shown to reduce to its own inputs by construction. The central result is an empirical performance claim, not a mathematical prediction forced by self-definition or self-citation. Self-citations, if present, are not load-bearing for the uniqueness or correctness of the reported metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unproven premise that view-agnostic and view-specific features can be reliably disentangled without calibration parameters or labels.

axioms (1)
  • domain assumption View-agnostic and view-specific representations can be separated effectively through single-view distillation and cross-view reconstruction.
    This assumption is invoked to justify the calibration-free design.

pith-pipeline@v0.9.0 · 5483 in / 1158 out tokens · 35152 ms · 2026-05-12T04:44:39.237345+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 3 internal anchors

  1. https://iccv2021-mmp.github.io/
  2. Akbari, H., Yuan, L., Qian, R., Chuang, W.H., Chang, S.F., Cui, Y., Gong, B.: VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. In: Advances in Neural Information Processing Systems, vol. 34, pp. 24206–24221. Curran Associates, Inc. (2021)
  3. Bastani, F., He, S., Madden, S.: Self-supervised multi-object tracking with cross-input consistency. In: Advances in Neural Information Processing Systems, vol. 34, pp. 13695–13706. Curran Associates, Inc. (2021)
  4. Bergmann, P., Meinhardt, T., Leal-Taixé, L.: Tracking without bells and whistles. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 941–951 (2019)
  5. Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: the CLEAR MOT metrics. EURASIP Journal on Image and Video Processing 2008, 1–10 (2008)
  6. Cai, J., Xu, M., Li, W., Xiong, Y., Xia, W., Tu, Z., Soatto, S.: MeMOT: Multi-object tracking with memory. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8090–8100 (2022)
  7. Cai, Y., Medioni, G.: Exploring context information for inter-camera multiple target tracking. In: IEEE Winter Conference on Applications of Computer Vision, pp. 761–768. IEEE (2014)
  8. Cao, J., Pang, J., Weng, X., Khirodkar, R., Kitani, K.: Observation-centric SORT: Rethinking SORT for robust multi-object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9686–9696 (2023)
  9. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: European Conference on Computer Vision, pp. 213–229. Springer (2020)
  10. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.E.: A simple framework for contrastive learning of visual representations. arXiv:2002.05709 (2020)
  11. Chen, X., Huang, K., Tan, T.: Object tracking across non-overlapping views by learning inter-camera transfer models. Pattern Recognition 47(3), 1126–1137 (2014)
  12. Cheng, C.C., Qiu, M.X., Chiang, C.K., Lai, S.H.: ReST: A reconfigurable spatial-temporal graph model for multi-camera multi-object tracking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10051–10060 (2023)
  13. Chilgunde, A., Kumar, P., Ranganath, S., Huang, W.: Multi-camera target tracking in blind regions of cameras with non-overlapping fields of view. In: BMVC, pp. 1–
  14. Chu, P., Ling, H.: FAMNet: Joint learning of feature, affinity and multi-dimensional assignment for online multiple object tracking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6172–6181 (2019)
  15. Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context prediction. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430 (2015)
  16. Dong, J., Fang, Q., Jiang, W.B., Yang, Y., Huang, Q.X., Bao, H., Zhou, X.: Fast and robust multi-person 3D pose estimation and tracking from multiple views. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 6981–6992 (2021)
  17. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
  18. Duan, K., Bai, S., Xie, L., Qi, H., Huang, Q., Tian, Q.: CenterNet: Keypoint triplets for object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6569–6578 (2019)
  19. Feng, W., Wang, F., Han, R., Gan, Y., Qian, Z., Hou, J., Wang, S.: Unveiling the power of self-supervision for multi-view multi-human association and tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)
  20. Fleuret, F., Berclaz, J., Lengagne, R., Fua, P.: Multicamera people tracking with a probabilistic occupancy map. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(2), 267–282 (2008). https://doi.org/10.1109/TPAMI.2007.1174
  21. Gan, Y., Han, R., Yin, L., Feng, W., Wang, S.: Self-supervised multi-view multi-human association and tracking. In: Proceedings of the 29th ACM International Conference on Multimedia (2021)
  22. Ge, Z., Liu, S., Wang, F., Li, Z., Sun, J.: YOLOX: Exceeding YOLO series in 2021. arXiv preprint arXiv:2107.08430 (2021)
  23. Gilbert, A., Bowden, R.: Tracking objects across cameras by incrementally learning inter-camera colour calibration and patterns of activity. In: Computer Vision – ECCV 2006, Proceedings, Part II, pp. 125–136. Springer (2006)
  24. Girshick, R.: Fast R-CNN. arXiv preprint arXiv:1504.08083 (2015)
  25. Gu, J., Hu, C., Zhang, T., Chen, X., Wang, Y., Wang, Y., Zhao, H.: ViP3D: End-to-end visual trajectory prediction via 3D agent queries. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5496–5506 (2023)
  26. Han, R., Feng, W., Zhang, Y., Zhao, J., Wang, S.: Multiple human association and tracking from egocentric and complementary top views. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(9), 5225–5242 (2021)
  27. Han, R., Feng, W., Zhao, J., Niu, Z., Zhang, Y., Wan, L., Wang, S.: Complementary-view multiple human tracking. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 10917–10924 (2020)
  28. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.B.: Masked autoencoders are scalable vision learners. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15979–15988 (2021)
  29. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.B.: Momentum contrast for unsupervised visual representation learning. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9726–9735 (2019)
  30. He, Y., Wei, X., Hong, X., Shi, W., Gong, Y.: Multi-target multi-camera tracking by tracklet-to-target assignment. IEEE Transactions on Image Processing 29, 5191–5205 (2020)
  31. Huang, Z., Jin, X., Lu, C., Hou, Q., Cheng, M.M., Fu, D., Shen, X., Feng, J.: Contrastive masked autoencoders are stronger vision learners. IEEE Transactions on Pattern Analysis and Machine Intelligence 46, 2506–2517 (2022)
  32. Javed, O., Shafique, K., Shah, M.: Appearance modeling for tracking in multiple non-overlapping cameras. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 2, pp. 26–33. IEEE (2005)
  33. Kim, D., Cho, D., Kweon, I.S.: Self-supervised video representation learning with space-time cubic puzzles. arXiv:1811.09795 (2018)
  34. Kim, D., Cho, D., Yoo, D., Kweon, I.S.: Learning image representations by completing damaged jigsaw puzzles. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 793–802. IEEE (2018)
  35. Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2(1-2), 83–97 (1955)
  36. Larsson, G., Maire, M., Shakhnarovich, G.: Colorization as a proxy task for visual understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6874–6883 (2017)
  37. Leal-Taixé, L., Canton-Ferrer, C., Schindler, K.: Learning by tracking: Siamese CNN for robust target association. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 33–40 (2016)
  38. Lu, Z., Shuai, B., Chen, Y., Xu, Z., Modolo, D.: Self-supervised multi-object tracking with path consistency. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 19016–19026 (2024)
  39. Luiten, J., Osep, A., Dendorfer, P., Torr, P., Geiger, A., Leal-Taixé, L., Leibe, B.: HOTA: A higher order metric for evaluating multi-object tracking. International Journal of Computer Vision 129, 548–578 (2021)
  40. Maksai, A., Wang, X., Fleuret, F., Fua, P.: Non-Markovian globally consistent multi-object tracking. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2544–2554 (2017)
  41. Meinhardt, T., Kirillov, A., Leal-Taixe, L., Feichtenhofer, C.: TrackFormer: Multi-object tracking with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8844–8854 (2022)
  42. Murtagh, F., Contreras, P.: Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2(1), 86–97 (2012)
  43. Nguyen, D.M., Henschel, R., Rosenhahn, B., Sonntag, D., Swoboda, P.: LMGP: Lifted multicut meets geometry projections for multi-camera multi-object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8866–8875 (2022)
  44. Niculescu-Mizil, A., Patel, D., Melvin, I.: MCTR: Multi camera tracking transformer. arXiv preprint arXiv:2408.13243 (2024)
  45. Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. arXiv:1603.09246 (2016)
  46. van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv:1807.03748 (2018)
  47. Pang, Z., Li, J., Tokmakov, P., Chen, D., Zagoruyko, S., Wang, Y.X.: Standing between past and future: Spatio-temporal modeling for multi-camera 3D multi-object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17928–17938 (2023)
  48. Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: Feature learning by inpainting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544 (2016)
  49. Prosser, B.J., Gong, S., Xiang, T.: Multi-camera matching using bi-directional cumulative brightness transfer functions. In: BMVC, vol. 8, pp. 164–1. Leeds, UK (2008)
  50. Quach, K.G., Nguyen, P., Le, H., Truong, T.D., Duong, C.N., Tran, M.T., Luu, K.: DyGLIP: A dynamic graph model with link prediction for accurate multi-camera multiple object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13784–13793 (2021)
  51. Ristani, E., Solera, F., Zou, R., Cucchiara, R., Tomasi, C.: Performance measures and a data set for multi-target, multi-camera tracking. In: European Conference on Computer Vision, pp. 17–35. Springer (2016)
  52. Ristani, E., Tomasi, C.: Features for multi-target multi-camera tracking and re-identification. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6036–6046 (2018)
  53. Schulter, S., Vernaza, P., Choi, W., Chandraker, M.: Deep network flow for multi-object tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6951–6960 (2017)
  54. Sun, P., Cao, J., Jiang, Y., Zhang, R., Xie, E., Yuan, Z., Wang, C., Luo, P.: TransTrack: Multiple object tracking with transformer. arXiv preprint arXiv:2012.15460 (2020)
  55. Tesfaye, Y.T., Zemene, E., Prati, A., Pelillo, M., Shah, M.: Multi-target tracking in multiple non-overlapping cameras using constrained dominant sets. arXiv preprint arXiv:1706.06196 (2017)
  56. Tesfaye, Y.T., Zemene, E., Prati, A., Pelillo, M., Shah, M.: Multi-target tracking in multiple non-overlapping cameras using fast-constrained dominant sets. International Journal of Computer Vision 127, 1303–1320 (2019)
  57. Wang, C.Y., Bochkovskiy, A., Liao, H.Y.M.: YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
  58. Wang, J., Jiao, J., Liu, Y.H.: Self-supervised video representation learning by pace prediction. In: Computer Vision – ECCV 2020, pp. 504–521. Springer International Publishing, Cham (2020)
  59. Wang, L., Luc, P., Recasens, A., Alayrac, J.B., van den Oord, A.: Multimodal self-supervised learning of general audio representations. arXiv:2104.12807 (2021)
  60. Wang, Y.X., Zhang, Y.J.: Nonnegative matrix factorization: A comprehensive review. IEEE Transactions on Knowledge and Data Engineering 25(6), 1336–1353 (2012)
  61. Welch, G.: An introduction to the Kalman filter (1995)
  62. Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. In: 2017 IEEE International Conference on Image Processing (ICIP), pp. 3645–3649. IEEE (2017)
  63. Wu, J., Cao, J., Song, L., Wang, Y., Yang, M., Yuan, J.: Track to detect and segment: An online multi-object tracker. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12347–12356 (2021)
  64. Wu, Y., Kirillov, A., Massa, F., Lo, W.Y., Girshick, R.: Detectron2. https://github.com/facebookresearch/detectron2 (2019)
  65. Xu, J., Cao, Y., Zhang, Z., Hu, H.: Spatial-temporal relation networks for multi-object tracking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3988–3998 (2019)
  66. Xu, Y., Osep, A., Ban, Y., Horaud, R., Leal-Taixé, L., Alameda-Pineda, X.: How to train your deep multi-object tracker. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6787–6796 (2020)
  67. Xu, Y., Liu, X., Liu, Y., Zhu, S.C.: Multi-view people tracking via hierarchical trajectory composition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016)
  68. Yin, Y., Hua, Y., Song, T., Ma, R., Guan, H.: Self-supervised multi-object tracking with cycle-consistency. In: Conference on Multimedia Modeling (2023)
  69. You, Q., Jiang, H.: Real-time 3D deep multi-camera tracking. arXiv preprint arXiv:2003.11753 (2020)
  70. Zeng, F., Dong, B., Zhang, Y., Wang, T., Zhang, X., Wei, Y.: MOTR: End-to-end multiple-object tracking with transformer. In: European Conference on Computer Vision, pp. 659–675. Springer (2022)
  71. Zhang, T., Chen, X., Wang, Y., Wang, Y., Zhao, H.: MUTR3D: A multi-camera tracking framework via 3D-to-2D queries. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4537–4546 (2022)
  72. Zhang, Y., Sun, P., Jiang, Y., Yu, D., Weng, F., Yuan, Z., Luo, P., Liu, W., Wang, X.: ByteTrack: Multi-object tracking by associating every detection box (2022)
  73. Zhang, Y., Wang, T., Zhang, X.: MOTRv2: Bootstrapping end-to-end multi-object tracking by pretrained object detectors. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22056–22065 (2023)
  74. Zhao, Z., Wu, Z., Zhuang, Y., Li, B., Jia, J.: Tracking objects as pixel-wise distributions. In: European Conference on Computer Vision, pp. 76–94. Springer (2022)
  75. Zhong, Z., Zheng, L., Zheng, Z., Li, S., Yang, Y.: Camera style adaptation for person re-identification. arXiv:1711.10295 (2017)
  76. Zhou, X., Koltun, V., Krähenbühl, P.: Tracking objects as points. arXiv:2004.01177 (2020)