HieDG: A Hierarchical Discrete Geometry-Guided Framework for Multi-Animal Tracking

Bohao Chen; Chengyang Zhang; Chenxun Deng; Hang Zhou; Hongying Yan; Hua Han; Xi Chen; Ye Yuan; Yifan Zhang; Zhongde Zhang

arxiv: 2607.00494 · v1 · pith:YQJ54HNMnew · submitted 2026-07-01 · 💻 cs.CV

Pith reviewed 2026-07-02 14:53 UTC · model grok-4.3

classification 💻 cs.CV

keywords multi-animal tracking · discrete geometry · residual codebook · query-based tracking · identity association · geometric embeddings · multi-object tracking

The pith

A two-stage residual codebook converts continuous position, scale, and velocity signals into stable discrete tokens that strengthen identity association inside query-based multi-animal trackers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Query-based trackers for multi-animal scenes lose identity links when small shifts in continuous geometric coordinates swing the cross-frame attention weights. HieDG replaces those raw signals with tokens from a two-stage residual codebook that encodes position, scale, and velocity. The tokens are aligned to visual embeddings and inserted into the tracking queries. Experiments on AnimalTrack, BFT, and BuckTales report higher HOTA, AssA, and IDF1, while results on DanceTrack and SportsMOT remain competitive. A reader would care because the same instability appears whenever trackers must rely on geometry amid uniform appearance or dense motion.

Core claim

The paper claims that a query-based tracker gains identity consistency under uniform appearance, high density, and irregular motion when geometric dynamics are reformulated as structured discrete representations. Position, scale, and velocity cues pass through a two-stage residual codebook, and the resulting stable tokens are aligned with visual embeddings and integrated into the tracking queries.

What carries the argument

The two-stage residual codebook that discretizes position, scale, and velocity cues into tokens aligned with visual embeddings and fed into tracking queries.
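A minimal sketch of how residual quantization over two stages could work, assuming nearest-neighbour lookup in Euclidean distance. The function name `residual_quantize` and the hand-written `coarse` and `fine` codebooks are hypothetical; the paper's codebooks are learned, and its projection and training details are not given here.

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Quantize x stage by stage: each codebook encodes the residual left by the previous one.

    Returns the chosen index per stage and the summed reconstruction.
    """
    residual = np.asarray(x, dtype=float)
    recon = np.zeros_like(residual)
    indices = []
    for cb in codebooks:
        # Pick the code vector closest to the current residual.
        k = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(k)
        recon = recon + cb[k]
        residual = residual - cb[k]
    return indices, recon

# Hypothetical codebooks for a 2-D cue: a coarse grid, then small offsets.
coarse = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
fine = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [0.0, -0.1]])

idx_a, rec_a = residual_quantize([0.93, 0.08], [coarse, fine])
idx_b, rec_b = residual_quantize([0.95, 0.07], [coarse, fine])  # slightly shifted input
```

In this toy setup both inputs map to the same index pair, so a small coordinate shift leaves the token unchanged. Inputs that straddle a cell boundary would not share a token.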

If this is right

Identity association improves in scenes where appearance cues are weak because geometry is now represented by stable tokens rather than fragile continuous values.
The same tokenized geometry can be added to generic multi-object trackers without animal-specific tuning and still yields competitive scores.
End-to-end training benefits because the discrete tokens reduce sensitivity of the attention mechanism to coordinate noise.
Heuristic geometric post-processing steps become less necessary once the codebook supplies structured motion cues inside the queries.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The discretization step could be tested on non-visual sensors such as GPS or depth to see whether the same codebook structure stabilizes fusion across modalities.
If the residual codebook generalizes, future trackers might drop separate motion-prediction heads and rely on the tokenized geometry for both association and short-term prediction.
Quantization artifacts may appear first in very slow or very fast motions; measuring error rates on velocity-binned subsets of the test sets would expose any such limits.
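One way such a velocity-binned check could be set up, as a sketch: given a per-detection speed and a flag marking identity errors, group detections by speed and report the error rate per bin. The function name, the bin edges, and the sample data are all hypothetical.

```python
import numpy as np

def error_rate_by_speed(speeds, is_error, edges):
    """Return {bin_index: identity-error rate} for detections grouped by speed.

    Bin 0 holds speeds below edges[0]; the last bin holds speeds at or above edges[-1].
    Empty bins get NaN.
    """
    speeds = np.asarray(speeds, dtype=float)
    is_error = np.asarray(is_error, dtype=bool)
    bins = np.digitize(speeds, edges)
    rates = {}
    for b in range(len(edges) + 1):
        mask = bins == b
        rates[b] = float(is_error[mask].mean()) if mask.any() else float("nan")
    return rates

# Made-up data: speeds in pixels per frame and whether the identity was wrong.
speeds = [0.1, 0.2, 1.5, 2.0, 9.0, 12.0]
errors = [1, 0, 0, 0, 1, 1]
rates = error_rate_by_speed(speeds, errors, edges=[1.0, 5.0])
```

A markedly higher rate in the slowest or fastest bin than in the middle one would be the signature of the quantization limit described above.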

Load-bearing premise

Small coordinate perturbations in continuous geometric embeddings disproportionately degrade cross-frame attention weights, and a learned residual codebook can replace those embeddings with tokens that preserve motion information without introducing harmful quantization artifacts.
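The premise can be illustrated with a toy example that is not taken from the paper: attention over two candidate positions, scored by negative squared distance with a hypothetical `gain`, once with raw coordinates and once with coordinates snapped to a uniform grid. The grid quantizer `snap` stands in for a learned codebook, and all numbers are made up.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention(query_pos, key_pos, gain):
    """Softmax over negative squared distances (a stand-in for dot-product attention)."""
    d2 = np.sum((key_pos - query_pos) ** 2, axis=1)
    return softmax(-gain * d2)

def snap(x, step):
    """Uniform grid quantizer, a hypothetical stand-in for a learned codebook."""
    return np.round(np.asarray(x, dtype=float) / step) * step

keys = np.array([[0.50, 0.50], [0.60, 0.50]])  # two candidate positions
q = np.array([0.52, 0.50])                      # track position
q_jit = np.array([0.54, 0.50])                  # same track, slightly perturbed
gain, step = 500.0, 0.1

# Largest change in any attention weight caused by the perturbation.
shift_continuous = np.abs(attention(q, keys, gain) - attention(q_jit, keys, gain)).max()
shift_snapped = np.abs(
    attention(snap(q, step), snap(keys, step), gain)
    - attention(snap(q_jit, step), snap(keys, step), gain)
).max()
```

Here the jitter changes the continuous weights noticeably while the snapped weights are identical. A perturbation that crosses a grid boundary would instead flip the snapped input entirely, which is the kind of quantization artifact the premise has to rule out.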

What would settle it

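Replacing the discrete tokens with the original continuous geometric embeddings inside the same query-based tracker and measuring whether HOTA and IDF1 on AnimalTrack or BFT drop, stay flat, or rise. A drop would support the claim; flat or higher scores would point to other implementation choices as the source of the gains.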

Figures

Figures reproduced from arXiv: 2607.00494 by Bohao Chen, Chengyang Zhang, Chenxun Deng, Hang Zhou, Hongying Yan, Hua Han, Xi Chen, Ye Yuan, Yifan Zhang, Zhongde Zhang.

**Figure 1.** (a) Typical conditions in multi-animal tracking, including uniform appearance, high density, and irregular motion, which reduce appearance discriminability. (b) Quantization transforms fluctuating continuous signals into stable discrete states, suppressing perturbation-induced variability. Inspired by quantization principles in signal processing, where noisy analog signals are mapped to stable discrete st…

**Figure 2.** Overview of the HieDG framework. Appearance and geometric features are extracted via Deformable DETR [50]. The geometric vectors are first projected through MLPs and independently quantized by two-stage residual codebooks for position, size, and velocity. The resulting discrete geometric embeddings are concatenated and aligned to match the visual space. The combined representation is finally passed to the …

**Figure 3.** t-SNE [23] visualization of track embeddings across all frames in the BFT dataset. Different IDs are marked by distinct colors and shapes.

**Figure 4.** t-SNE [23] visualization of track embeddings across all frames in the BFT dataset. A unique combination of color and shape denotes each ID. We compare geometric embeddings, appearance embeddings, and their fused trajectory embeddings, highlighting the discriminative capacity of each representation.

**Figure 5.** Tracking visualization of HieDG on the AnimalTrack dataset: six consecutive-frame visualizations of representative scenes, including chicken, deer, and dolphins.

Original abstract

Multi-animal tracking (MAT) is critical for wildlife monitoring and behavioral analysis, yet remains challenging due to uniform appearance, high density, and irregular motion. Existing methods typically follow heuristic- or query-based paradigms: the former relies on handcrafted geometric associations without end-to-end optimization, whereas the latter enables joint optimization but relies heavily on appearance embeddings. In such conditions, continuous geometric embeddings can be unstable, as small coordinate perturbations may disproportionately alter cross-frame attention weights, degrading identity association performance. To address this limitation, we propose HieDG, a Hierarchical Discrete Geometry-guided tracking framework that reformulates geometric dynamics as structured discrete representations within a query-based tracker. Instead of directly using raw geometric signals, HieDG employs a two-stage residual codebook to discretize position, scale, and velocity cues, transforming unstable continuous geometry into structured, stable discrete tokens. These tokens are aligned with visual embeddings and integrated into the tracking queries to enhance identity consistency. Extensive experiments on animal-specific benchmarks (AnimalTrack, BFT, and BuckTales) demonstrate state-of-the-art association performance with significant improvements in HOTA, AssA, and IDF1. Additional evaluations on generic multi-object tracking benchmarks, including DanceTrack and SportsMOT, show competitive performance, indicating the broader applicability of discretized geometric modeling beyond animal-specific scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

HieDG adds a two-stage residual codebook to turn continuous position/scale/velocity into discrete tokens inside a query-based tracker, claiming better identity association on animal benchmarks.

The main thing here is that the authors discretize geometric cues with a hierarchical residual codebook so the tokens stay stable across frames when animals look alike and move irregularly. They plug those tokens into the tracking queries alongside appearance features, which is a direct response to the known sensitivity of attention weights to small coordinate shifts.

The paper does a reasonable job spelling out the motivation and showing how the discretization slots into an existing query architecture without circular fitting. The experiments target the right datasets for the use case (AnimalTrack, BFT, and BuckTales) and also check general MOT sets like DanceTrack and SportsMOT.

The soft spots are the missing details. The abstract states SOTA association numbers but gives no deltas, no ablations on codebook size or stage count, and no check on whether quantization erodes useful motion signal. Without those, it is hard to tell how much the discretization itself drives the gains versus other implementation choices. The assumption that discrete tokens avoid the instability without new artifacts still needs the full results to hold up.

This is for people who build or adapt trackers for wildlife monitoring or dense scenes where appearance cues are weak. A reader who already works with query-based methods could test the codebook idea on their own data.

I would send it to peer review. The construction is clear and the benchmarks are appropriate, even if the paper will need more controls and numbers before acceptance.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes HieDG, a query-based multi-animal tracking framework that reformulates geometric dynamics (position, scale, velocity) via a two-stage residual codebook to produce discrete tokens. These tokens are aligned with visual embeddings and integrated into tracking queries to stabilize identity association. The paper reports state-of-the-art results on AnimalTrack, BFT, and BuckTales (in HOTA, AssA, IDF1) and competitive performance on DanceTrack and SportsMOT.

Significance. If the empirical gains hold under rigorous evaluation, the discretization strategy offers a concrete mechanism for mitigating attention-weight sensitivity to small geometric perturbations, which could improve robustness in dense, low-appearance-variation tracking scenarios. The extension to generic MOT benchmarks suggests the approach is not narrowly animal-specific.

minor comments (3)

[Abstract] The claims of 'significant improvements' and 'state-of-the-art association performance' are stated without any numerical deltas, error bars, or reference to specific table rows, forcing the reader to locate the quantitative evidence later in the manuscript.
[Method] The integration step that aligns discrete geometric tokens with visual embeddings is described at a high level; a concrete description of the alignment loss or projection layer (e.g., in the method section) would improve reproducibility.
[Experiments] No dataset statistics (number of sequences, average density, occlusion rates) are supplied for the animal benchmarks, which makes it harder to contextualize the reported gains relative to prior work.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of HieDG, the recognition of its potential to improve robustness via discretization, and the recommendation for minor revision. We will incorporate any minor clarifications in the revised version.

Circularity Check

0 steps flagged

No significant circularity

The paper proposes HieDG as a modeling choice: a two-stage residual codebook that discretizes position/scale/velocity cues to stabilize attention weights in query-based trackers. No equations, derivations, or self-citation chains are presented that reduce the claimed performance gains to inputs by construction. The discretization step is motivated directly by the stated instability of continuous embeddings and is evaluated via empirical results on external benchmarks (AnimalTrack, BFT, BuckTales, DanceTrack, SportsMOT). This is an independent architectural decision with no load-bearing self-definition, fitted-input-as-prediction, or uniqueness theorem imported from prior author work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that continuous geometry is the dominant source of instability and on the introduction of a learned two-stage residual codebook whose size and training procedure are not specified in the abstract.

free parameters (1)

residual codebook size and number of stages
The discretization mechanism requires choosing or learning the codebook vocabulary size and residual depth, which are not reported in the abstract.
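For reference, a standard two-stage residual quantization can be written as follows. This is the generic formulation, assumed here rather than confirmed as the paper's exact design; the symbols $z$ (a projected cue), $c^{(1)}_k$ and $c^{(2)}_k$ (code vectors of the two codebooks), and $K_1$, $K_2$ (codebook sizes) are introduced only for illustration.

```latex
\begin{aligned}
k_1 &= \arg\min_{k \in \{1,\dots,K_1\}} \lVert z - c^{(1)}_{k} \rVert_2, & r &= z - c^{(1)}_{k_1}, \\
k_2 &= \arg\min_{k \in \{1,\dots,K_2\}} \lVert r - c^{(2)}_{k} \rVert_2, & \hat{z} &= c^{(1)}_{k_1} + c^{(2)}_{k_2}.
\end{aligned}
```

Under this formulation each cue can take at most $K_1 K_2$ distinct values while storing only $K_1 + K_2$ code vectors, which is why the codebook sizes and the stage count are the parameters a reader would want reported.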

axioms (1)

domain assumption: Continuous geometric embeddings are unstable because small coordinate perturbations disproportionately alter cross-frame attention weights.
Explicitly stated in the abstract as the core limitation being addressed.

Reference graph

Works this paper leans on

49 extracted references · 3 canonical work pages

[1] Bergmann, P., Meinhardt, T., Leal-Taixe, L.: Tracking without bells and whistles. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (October 2019)
[2] Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: the clear mot metrics. EURASIP Journal on Image and Video Processing 2008(1), 246309 (2008)
[3] Bertorelle, G., Raffini, F., Bosse, M., Bortoluzzi, C., Iannucci, A., Trucchi, E., Morales, H.E., Van Oosterhout, C.: Genetic load: genomic estimates and applications in non-model animals. Nature Reviews Genetics 23(8), 492–503 (2022)
[4] Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple online and realtime tracking. In: Proceedings of the International Conference on Image Processing. pp. 3464–3468 (2016)
[5] Bochinski, E., Eiselein, V., Sikora, T.: High-speed tracking-by-detection without using image information. In: Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance. pp. 1–6 (2017). https://doi.org/10.1109/AVSS.2017.8078516
[6] Cao, J., Pang, J., Weng, X., Khirodkar, R., Kitani, K.: Observation-centric sort: Rethinking sort for robust multi-object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9686–9696 (June 2023)
[7] Cui, Y., Zeng, C., Zhao, X., Yang, Y., Wu, G., Wang, L.: Sportsmot: A large multi-object tracking dataset in multiple sports scenes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 9921–9931 (October 2023)
[9] Deng, C., Li, D., Ji, L., Zhang, C., Li, B., Yan, H., Zheng, J., Wang, L., Zhang, J.: Chatdiff: A chatgpt-based diffusion model for long-tailed classification. Neural Networks 181, 106794 (2025). https://doi.org/10.1016/j.neunet.2024.106794, https://www.sciencedirect.com/science/article/pii/S0893608024007184
[10] Fiquet, P.É.H., Simoncelli, E.P.: A polar prediction model for learning to represent visual transformations. In: Advances in Neural Information Processing Systems (2023), https://openreview.net/forum?id=hyPUZX03Ks
[11] Gao, R., Qi, J., Wang, L.: Multiple object tracking as id prediction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 27883–27893 (June 2025)
[12] Gao, R., Wang, L.: Memotr: Long-term memory-augmented transformer for multi-object tracking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9901–9910 (October 2023)
[13] Guo, S., Wang, J., Wang, X., Tao, D.: Online multiple object tracking with cross-task synergy. In: Proceedings of the Conference on Computer Vision and Pattern Recognition. pp. 8132–8141 (2021)
[14] Han, G., Lim, S.N.: Few-shot object detection with foundation models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 28608–28618 (June 2024)
[15] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
[16] Jetz, W., Tertitski, G., Kays, R., Mueller, U., Wikelski, M., Åkesson, S., Anisimov, Y., Antonov, A., Arnold, W., Bairlein, F., et al.: Biological earth observation with animal sensors. Trends in Ecology & Evolution 37(4), 293–298 (2022)
[17] Lee, D., Kim, C., Kim, S., Cho, M., Han, W.S.: Autoregressive image generation using residual quantization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11523–11532 (June 2022)
[18] Liang, C., Zhang, Z., Zhou, X., Li, B., Zhu, S., Hu, W.: Rethinking the competition between detection and reid in multiobject tracking. IEEE Transactions on Image Processing 31, 3182–3196 (2022)
[19] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Proceedings of the European Conference on Computer Vision. pp. 740–755. Springer (2014)
[20] Liu, D., Li, W., Ma, C., Zheng, W., Yao, Y., Tso, C.F., Zhong, P., Chen, X., Song, J.H., Choi, W., et al.: A common hub for sleep and motor control in the substantia nigra. Science 367(6476), 440–445 (2020)
[21] Liu, Y., Li, W., Liu, X., Li, Z., Yue, J.: Deep learning in multiple animal tracking: A survey. Computers and Electronics in Agriculture 224, 109161 (2024)
[22] Luiten, J., Osep, A., Dendorfer, P., Torr, P., Geiger, A., Leal-Taixé, L., Leibe, B.: Hota: A higher order metric for evaluating multi-object tracking. International Journal of Computer Vision 129(2), 548–578 (2021)
[23] van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9, 2579–2605 (Nov 2008)
[24] Maggiolino, G., Ahmad, A., Cao, J., Kitani, K.: Deep oc-sort: Multi-pedestrian tracking by adaptive re-identification. In: Proceedings of the International Conference on Image Processing. pp. 3025–3029 (2023)
[25] Meinhardt, T., Kirillov, A., Leal-Taixé, L., Feichtenhofer, C.: Trackformer: Multi-object tracking with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8844–8854 (June 2022)
[26] Mills-Tettey, G.A., Stentz, A., Dias, M.B.: The dynamic hungarian algorithm for the assignment problem with changing costs. Robotics Institute, Pittsburgh, PA, Tech. Rep. CMU-RI-TR-07-27 7 (2007)
[27] Naik, H., Yang, J., Das, D., Crofoot, M.C., Rathore, A., Sridhar, V.H.: Bucktales: A multi-uav dataset for multi-object tracking and re-identification of wild antelopes. In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C. (eds.) Advances in Neural Information Processing Systems. vol. 37, pp. 81992–82009. Curran Associates, Inc. (2024)
[28] van den Oord, A., Vinyals, O., Kavukcuoglu, K.: Neural discrete representation learning. In: Advances in Neural Information Processing Systems. vol. 30. Curran Associates, Inc. (2017)
[29] Pang, J., Qiu, L., Li, X., Chen, H., Li, Q., Darrell, T., Yu, F.: Quasi-dense similarity learning for multiple object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 164–173 (June 2021)
[30] Pereira, T.D., Tabris, N., Matsliah, A., Turner, D.M., Li, J., Ravindranath, S., Papadoyannis, E.S., Normand, E., Deutsch, D.S., Wang, Z.Y., et al.: Sleap: A deep learning system for multi-animal pose tracking. Nature Methods 19(4), 486–495 (2022)
[31] Ristani, E., Solera, F., Zou, R., Cucchiara, R., Tomasi, C.: Performance measures and a data set for multi-target, multi-camera tracking. In: Proceedings of the European Conference on Computer Vision. pp. 17–35. Springer (2016)
[32] Shim, K., Ko, K., Yang, Y., Kim, C.: Focusing on tracks for online multi-object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11687–11696 (June 2025)
[33] Sun, P., Cao, J., Jiang, Y., Yuan, Z., Bai, S., Kitani, K., Luo, P.: Dancetrack: Multi-object tracking in uniform appearance and diverse motion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 20993–21002 (June 2022)
[34] Sun, P., Cao, J., Jiang, Y., Zhang, R., Xie, E., Yuan, Z., Wang, C., Luo, P.: Transtrack: Multiple object tracking with transformer. arXiv preprint arXiv:2012.15460 (2020)
[35] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L.u., Polosukhin, I.: Attention is all you need. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems. vol. 30. Curran Associates, Inc. (2017)
[36] Wang, A., Chen, H., Liu, L., Chen, K., Lin, Z., Han, J., Ding, G.: Yolov10: Real-time end-to-end object detection. In: Advances in Neural Information Processing Systems. vol. 37, pp. 107984–108011. Curran Associates, Inc. (2024)
[37] Wang, J., Jiang, Y., Yuan, Z., PENG, B., Wu, Z., Jiang, Y.G.: Omnitokenizer: A joint image-video tokenizer for visual generation. In: Advances in Neural Information Processing Systems (2024)
[38] Wang, Z., Zheng, L., Liu, Y., Li, Y., Wang, S.: Towards real-time multi-object tracking. In: Proceedings of the European Conference on Computer Vision. pp. 107–122. Springer (2020)
[39] Welch, G., Bishop, G., et al.: An introduction to the kalman filter (1995)
[40] Xu, Y., Ban, Y., Delorme, G., Gan, C., Rus, D., Alameda-Pineda, X.: Transcenter: Transformers with dense representations for multiple-object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(6), 7820–7835 (2023)
[41] Yan, F., Luo, W., Zhong, Y., Gan, Y., Ma, L.: CO-MOT: Boosting end-to-end transformer-based multi-object tracking via coopetition label assignment and shadow sets. In: International Conference on Learning Representations (ICLR) (2025)
[42] Yu, J., Li, X., Koh, J.Y., Zhang, H., Pang, R., Qin, J., Ku, A., Xu, Y., Baldridge, J., Wu, Y.: Vector-quantized image modeling with improved VQGAN. In: Proceedings of the International Conference on Learning Representations (2022)
[43] Zeng, F., Dong, B., Zhang, Y., Wang, T., Zhang, X., Wei, Y.: Motr: End-to-end multiple-object tracking with transformer. In: Proceedings of the European Conference on Computer Vision. pp. 659–675. Springer (2022)
[44] Zhang, L., Gao, J., Xiao, Z., Fan, H.: Animaltrack: A benchmark for multi-animal tracking in the wild. International Journal of Computer Vision 131(2), 496–513 (2023)
[45] Zhang, Y., Sun, P., Jiang, Y., Yu, D., Weng, F., Yuan, Z., Luo, P., Liu, W., Wang, X.: Bytetrack: Multi-object tracking by associating every detection box. In: Proceedings of the European Conference on Computer Vision. pp. 1–21. Springer (2022)
[46] Zhang, Y., Wang, C., Wang, X., Zeng, W., Liu, W.: Fairmot: On the fairness of detection and re-identification in multiple object tracking. International Journal of Computer Vision 129(11), 3069–3087 (2021)
[47] Zhang, Y., Wang, T., Zhang, X.: Motrv2: Bootstrapping end-to-end multi-object tracking by pretrained object detectors. In: Proceedings of the Conference on Computer Vision and Pattern Recognition. pp. 22056–22065 (2023)
[48] Zheng, G., Lin, S., Zuo, H., Fu, C., Pan, J.: Nettrack: Tracking highly dynamic objects with a net. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19145–19155 (June 2024)
[49] Zhou, X., Koltun, V., Krähenbühl, P.: Tracking objects as points. In: Proceedings of the European Conference on Computer Vision. pp. 474–490. Springer (2020)
[50] Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: Deformable transformers for end-to-end object detection. In: Proceedings of the International Conference on Learning Representations (2021)

[1] [1]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (October 2019)

Bergmann, P., Meinhardt, T., Leal-Taixe, L.: Tracking without bells and whistles. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (October 2019)

2019

[2] [2]

EURASIP Journal on Image and Video Processing2008(1), 246309 (2008)

Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: the clear mot metrics. EURASIP Journal on Image and Video Processing2008(1), 246309 (2008)

2008

[3] [3]

Nature Reviews Genetics23(8), 492–503 (2022)

Bertorelle, G., Raffini, F., Bosse, M., Bortoluzzi, C., Iannucci, A., Trucchi, E., Morales, H.E., Van Oosterhout, C.: Genetic load: genomic estimates and applica- tions in non-model animals. Nature Reviews Genetics23(8), 492–503 (2022)

2022

[4] [4]

In: Proceedings of the International Conference on Image Processing

Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple online and realtime tracking. In: Proceedings of the International Conference on Image Processing. pp. 3464–3468 (2016)

2016

[5] [5]

In: Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance

Bochinski, E., Eiselein, V., Sikora, T.: High-speed tracking-by-detection without using image information. In: Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance. pp. 1–6 (2017).https://doi.org/ 10.1109/AVSS.2017.8078516

work page doi:10.1109/avss.2017.8078516 2017

[6] [6]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Cao, J., Pang, J., Weng, X., Khirodkar, R., Kitani, K.: Observation-centric sort: Rethinking sort for robust multi-object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9686–9696 (June 2023)

2023

[7] [7]

In: Proceedings of the IEEE/CVF HieDG for Multi-Animal Tracking 7 International Conference on Computer Vision (ICCV)

Cui, Y., Zeng, C., Zhao, X., Yang, Y., Wu, G., Wang, L.: Sportsmot: A large multi- object tracking dataset in multiple sports scenes. In: Proceedings of the IEEE/CVF HieDG for Multi-Animal Tracking 7 International Conference on Computer Vision (ICCV). pp. 9921–9931 (October 2023)

2023

[8] [9]

Sepasdar, M

Deng, C., Li, D., Ji, L., Zhang, C., Li, B., Yan, H., Zheng, J., Wang, L., Zhang, J.: Chatdiff: A chatgpt-based diffusion model for long-tailed classification. Neural Networks181, 106794 (2025).https://doi.org/https://doi.org/10.1016/j. neunet.2024.106794,https://www.sciencedirect.com/science/article/pii/ S0893608024007184

work page doi:10.1016/j 2025

[9] [10]

In: Advances in Neural Information Processing Systems (2023),https://openreview.net/forum?id=hyPUZX03Ks

Fiquet, P.É.H., Simoncelli, E.P.: A polar prediction model for learning to represent visual transformations. In: Advances in Neural Information Processing Systems (2023),https://openreview.net/forum?id=hyPUZX03Ks

2023

[10] [11]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Gao, R., Qi, J., Wang, L.: Multiple object tracking as id prediction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 27883–27893 (June 2025)

2025

[11] [12]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Gao, R., Wang, L.: Memotr: Long-term memory-augmented transformer for multi- object tracking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9901–9910 (October 2023)

2023

[12] [13]

Guo, S., Wang, J., Wang, X., Tao, D.: Online multiple object tracking with cross-task synergy. In: Proceedings of the Conference on Computer Vision and Pattern Recognition. pp. 8132–8141 (2021)

[13] [14]

Han, G., Lim, S.N.: Few-shot object detection with foundation models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 28608–28618 (June 2024)

[14] [15]

He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)

[15] [16]

Jetz, W., Tertitski, G., Kays, R., Mueller, U., Wikelski, M., Åkesson, S., Anisimov, Y., Antonov, A., Arnold, W., Bairlein, F., et al.: Biological earth observation with animal sensors. Trends in Ecology & Evolution 37(4), 293–298 (2022)

[16] [17]

Lee, D., Kim, C., Kim, S., Cho, M., Han, W.S.: Autoregressive image generation using residual quantization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11523–11532 (June 2022)

[17] [18]

Liang, C., Zhang, Z., Zhou, X., Li, B., Zhu, S., Hu, W.: Rethinking the competition between detection and reid in multiobject tracking. IEEE Transactions on Image Processing 31, 3182–3196 (2022)

[18] [19]

Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Proceedings of the European Conference on Computer Vision. pp. 740–755. Springer (2014)

[19] [20]

Liu, D., Li, W., Ma, C., Zheng, W., Yao, Y., Tso, C.F., Zhong, P., Chen, X., Song, J.H., Choi, W., et al.: A common hub for sleep and motor control in the substantia nigra. Science 367(6476), 440–445 (2020)

[20] [21]

Liu, Y., Li, W., Liu, X., Li, Z., Yue, J.: Deep learning in multiple animal tracking: A survey. Computers and Electronics in Agriculture 224, 109161 (2024)

[21] [22]

Luiten, J., Osep, A., Dendorfer, P., Torr, P., Geiger, A., Leal-Taixé, L., Leibe, B.: Hota: A higher order metric for evaluating multi-object tracking. International Journal of Computer Vision 129(2), 548–578 (2021)

[22] [23]

van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9, 2579–2605 (Nov 2008)

[23] [24]

Maggiolino, G., Ahmad, A., Cao, J., Kitani, K.: Deep oc-sort: Multi-pedestrian tracking by adaptive re-identification. In: Proceedings of the International Conference on Image Processing. pp. 3025–3029 (2023)

[24] [25]

Meinhardt, T., Kirillov, A., Leal-Taixé, L., Feichtenhofer, C.: Trackformer: Multi-object tracking with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8844–8854 (June 2022)

[25] [26]

Mills-Tettey, G.A., Stentz, A., Dias, M.B.: The dynamic hungarian algorithm for the assignment problem with changing costs. Robotics Institute, Pittsburgh, PA, Tech. Rep. CMU-RI-TR-07-277 (2007)

[26] [27]

Naik, H., Yang, J., Das, D., Crofoot, M.C., Rathore, A., Sridhar, V.H.: Bucktales: A multi-uav dataset for multi-object tracking and re-identification of wild antelopes. In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C. (eds.) Advances in Neural Information Processing Systems. vol. 37, pp. 81992–82009. Curran Associates, Inc. (2024)

[27] [28]

van den Oord, A., Vinyals, O., Kavukcuoglu, K.: Neural discrete representation learning. In: Advances in Neural Information Processing Systems. vol. 30. Curran Associates, Inc. (2017)

[28] [29]

Pang, J., Qiu, L., Li, X., Chen, H., Li, Q., Darrell, T., Yu, F.: Quasi-dense similarity learning for multiple object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 164–173 (June 2021)

[29] [30]

Pereira, T.D., Tabris, N., Matsliah, A., Turner, D.M., Li, J., Ravindranath, S., Papadoyannis, E.S., Normand, E., Deutsch, D.S., Wang, Z.Y., et al.: Sleap: A deep learning system for multi-animal pose tracking. Nature Methods 19(4), 486–495 (2022)

[30] [31]

Ristani, E., Solera, F., Zou, R., Cucchiara, R., Tomasi, C.: Performance measures and a data set for multi-target, multi-camera tracking. In: Proceedings of the European Conference on Computer Vision. pp. 17–35. Springer (2016)

[31] [32]

Shim, K., Ko, K., Yang, Y., Kim, C.: Focusing on tracks for online multi-object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11687–11696 (June 2025)

[32] [33]

Sun, P., Cao, J., Jiang, Y., Yuan, Z., Bai, S., Kitani, K., Luo, P.: Dancetrack: Multi-object tracking in uniform appearance and diverse motion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 20993–21002 (June 2022)

[33] [34]

Sun, P., Cao, J., Jiang, Y., Zhang, R., Xie, E., Yuan, Z., Wang, C., Luo, P.: Transtrack: Multiple object tracking with transformer. arXiv preprint arXiv:2012.15460 (2020)

[34] [35]

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems. vol. 30. Curran Associates, Inc. (2017)

[35] [36]

Wang, A., Chen, H., Liu, L., Chen, K., Lin, Z., Han, J., Ding, G.: Yolov10: Real-time end-to-end object detection. In: Advances in Neural Information Processing Systems. vol. 37, pp. 107984–108011. Curran Associates, Inc. (2024)

[36] [37]

Wang, J., Jiang, Y., Yuan, Z., Peng, B., Wu, Z., Jiang, Y.G.: Omnitokenizer: A joint image-video tokenizer for visual generation. In: Advances in Neural Information Processing Systems (2024)

[37] [38]

Wang, Z., Zheng, L., Liu, Y., Li, Y., Wang, S.: Towards real-time multi-object tracking. In: Proceedings of the European Conference on Computer Vision. pp. 107–122. Springer (2020)

[38] [39]

Welch, G., Bishop, G., et al.: An introduction to the kalman filter (1995)

[39] [40]

Xu, Y., Ban, Y., Delorme, G., Gan, C., Rus, D., Alameda-Pineda, X.: Transcenter: Transformers with dense representations for multiple-object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(6), 7820–7835 (2023)

[40] [41]

Yan, F., Luo, W., Zhong, Y., Gan, Y., Ma, L.: CO-MOT: Boosting end-to-end transformer-based multi-object tracking via coopetition label assignment and shadow sets. In: International Conference on Learning Representations (ICLR) (2025)

[41] [42]

Yu, J., Li, X., Koh, J.Y., Zhang, H., Pang, R., Qin, J., Ku, A., Xu, Y., Baldridge, J., Wu, Y.: Vector-quantized image modeling with improved VQGAN. In: Proceedings of the International Conference on Learning Representations (2022)

[42] [43]

Zeng, F., Dong, B., Zhang, Y., Wang, T., Zhang, X., Wei, Y.: Motr: End-to-end multiple-object tracking with transformer. In: Proceedings of the European Conference on Computer Vision. pp. 659–675. Springer (2022)

[43] [44]

Zhang, L., Gao, J., Xiao, Z., Fan, H.: Animaltrack: A benchmark for multi-animal tracking in the wild. International Journal of Computer Vision 131(2), 496–513 (2023)

[44] [45]

Zhang, Y., Sun, P., Jiang, Y., Yu, D., Weng, F., Yuan, Z., Luo, P., Liu, W., Wang, X.: Bytetrack: Multi-object tracking by associating every detection box. In: Proceedings of the European Conference on Computer Vision. pp. 1–21. Springer (2022)

[45] [46]

Zhang, Y., Wang, C., Wang, X., Zeng, W., Liu, W.: Fairmot: On the fairness of detection and re-identification in multiple object tracking. International Journal of Computer Vision 129(11), 3069–3087 (2021)

[46] [47]

Zhang, Y., Wang, T., Zhang, X.: Motrv2: Bootstrapping end-to-end multi-object tracking by pretrained object detectors. In: Proceedings of the Conference on Computer Vision and Pattern Recognition. pp. 22056–22065 (2023)

[47] [48]

Zheng, G., Lin, S., Zuo, H., Fu, C., Pan, J.: Nettrack: Tracking highly dynamic objects with a net. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19145–19155 (June 2024)

[48] [49]

Zhou, X., Koltun, V., Krähenbühl, P.: Tracking objects as points. In: Proceedings of the European Conference on Computer Vision. pp. 474–490. Springer (2020)

[49] [50]

Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: Deformable transformers for end-to-end object detection. In: Proceedings of the International Conference on Learning Representations (2021)