HieDG: A Hierarchical Discrete Geometry-Guided Framework for Multi-Animal Tracking
Pith reviewed 2026-07-02 14:53 UTC · model grok-4.3
The pith
A two-stage residual codebook converts continuous position, scale, and velocity signals into stable discrete tokens that strengthen identity association inside query-based multi-animal trackers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that geometric dynamics can be reformulated as structured discrete representations inside a query-based tracker. Position, scale, and velocity cues pass through a two-stage residual codebook, which yields stable tokens that are aligned with visual embeddings and integrated into the tracking queries. The claimed result is better identity consistency under uniform appearance, high density, and irregular motion.
What carries the argument
The two-stage residual codebook that discretizes position, scale, and velocity cues into tokens aligned with visual embeddings and fed into tracking queries.
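To make the mechanism concrete, here is a minimal sketch of two-stage residual quantization in NumPy. It is an illustration under stated assumptions, not the paper's implementation: the six-dimensional geometry layout, the codebook sizes, and the random (untrained) codebooks are all hypothetical, whereas the paper describes a learned codebook whose details are not given on this page.

```python
import numpy as np

def nearest_code(x, codebook):
    """Index of the codebook row closest to x (squared Euclidean distance)."""
    return int(np.argmin(((codebook - x) ** 2).sum(axis=1)))

def residual_quantize(x, codebooks):
    """Quantize x stage by stage; each stage encodes what the previous one missed."""
    residual = np.asarray(x, dtype=float)
    recon = np.zeros_like(residual)
    indices = []
    for codebook in codebooks:
        k = nearest_code(residual, codebook)
        indices.append(k)
        recon = recon + codebook[k]
        residual = residual - codebook[k]
    return indices, recon

# Hypothetical geometry vector (cx, cy, w, h, vx, vy); layout and scaling are assumed.
rng = np.random.default_rng(0)
coarse = rng.uniform(-1.0, 1.0, size=(64, 6))   # stage 1: coarse cells (random, not learned)
fine = rng.uniform(-0.1, 0.1, size=(64, 6))     # stage 2: small corrections (random, not learned)
fine[0] = 0.0                                   # zero entry, so stage 2 cannot increase the error
geom = np.array([0.42, 0.57, 0.08, 0.12, 0.01, -0.02])
tokens, approx = residual_quantize(geom, [coarse, fine])
print(tokens, np.linalg.norm(geom - approx))
```

The pair of indices in `tokens` is the discrete representation; `approx` is what a downstream embedding lookup would effectively see. Including a zero vector in the second codebook is a design choice of this sketch that guarantees the second stage never makes the reconstruction worse.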
If this is right
- Identity association improves in scenes where appearance cues are weak because geometry is now represented by stable tokens rather than fragile continuous values.
- The same tokenized geometry can be added to generic multi-object trackers without animal-specific tuning and still yields competitive scores.
- End-to-end training benefits because the discrete tokens reduce sensitivity of the attention mechanism to coordinate noise.
- Heuristic geometric post-processing steps become less necessary once the codebook supplies structured motion cues inside the queries.
Where Pith is reading between the lines
- The discretization step could be tested on non-visual sensors such as GPS or depth to see whether the same codebook structure stabilizes fusion across modalities.
- If the residual codebook generalizes, future trackers might drop separate motion-prediction heads and rely on the tokenized geometry for both association and short-term prediction.
- Quantization artifacts may appear first in very slow or very fast motions; measuring error rates on velocity-binned subsets of the test sets would expose any such limits.
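The velocity-binned check suggested in the last bullet could be sketched as follows. The function name, the per-detection inputs, and the bin edges are hypothetical; a real evaluation would derive speeds and identity correctness from the benchmark's ground truth and the tracker's matched output.

```python
import numpy as np

def id_error_rate_by_speed(speeds, id_correct, bin_edges):
    """Fraction of wrong identity assignments within each speed bin.

    speeds:     per-detection ground-truth speed (e.g. pixels per frame)
    id_correct: per-detection flag, True if the predicted identity was right
    bin_edges:  increasing edges; bin 0 is below the first edge,
                the last bin is at or above the last edge
    """
    speeds = np.asarray(speeds, dtype=float)
    id_correct = np.asarray(id_correct, dtype=bool)
    which = np.digitize(speeds, bin_edges)
    rates = {}
    for b in range(len(bin_edges) + 1):
        mask = which == b
        if mask.any():  # skip empty bins rather than report a misleading 0
            rates[b] = float(1.0 - id_correct[mask].mean())
    return rates

# Made-up example: slow, medium, and fast bins split at 1 and 10 pixels per frame.
rates = id_error_rate_by_speed(
    speeds=[0.1, 0.2, 5.0, 6.0, 50.0],
    id_correct=[True, False, True, True, False],
    bin_edges=[1.0, 10.0],
)
print(rates)
```

An error rate that rises sharply in the slowest or fastest bin, relative to a continuous-geometry baseline, would point to quantization limits at those speeds.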
Load-bearing premise
Small coordinate perturbations in continuous geometric embeddings disproportionately degrade cross-frame attention weights, and a learned residual codebook can replace those embeddings with tokens that preserve motion information without introducing harmful quantization artifacts.
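A toy calculation can show what the premise asserts. The sketch below uses scalar positions, a distance-based attention logit, a sharp temperature, and a uniform one-stage binning as a stand-in for a codebook; all of these are assumptions chosen for illustration and none of them comes from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_weights(query_pos, key_pos, temperature):
    """Toy attention: logits fall off with squared distance between scalar positions."""
    return softmax(-((key_pos - query_pos) ** 2) / temperature)

def quantize(pos, bin_width):
    """Snap a position to the centre of its bin (one-stage stand-in for a codebook)."""
    return (np.floor(pos / bin_width) + 0.5) * bin_width

keys = np.array([0.39, 0.43])    # two nearby animals in the previous frame
tau = 1e-4                       # sharp attention, assumed
clean, noisy = 0.405, 0.415      # the same query with 0.01 of coordinate jitter

w_clean = attention_weights(clean, keys, tau)
w_noisy = attention_weights(noisy, keys, tau)

bin_w = 0.04
wq_clean = attention_weights(quantize(clean, bin_w), quantize(keys, bin_w), tau)
wq_noisy = attention_weights(quantize(noisy, bin_w), quantize(keys, bin_w), tau)

print("continuous:", w_clean.round(3), w_noisy.round(3))
print("quantized: ", wq_clean.round(3), wq_noisy.round(3))
```

In this setup the continuous weights flip from the first key to the second under the jitter, while the quantized weights are identical because both queries land in the same bin. The same toy also exposes the cost: the quantized query prefers the second key even in the clean case, and a perturbation that crosses a bin boundary would still change the token. Those are the quantization artifacts the premise has to rule out.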
What would settle it
Replacing the discrete tokens with the original continuous geometric embeddings inside the same query-based tracker and measuring whether HOTA and IDF1 on AnimalTrack or BFT drop, stay flat, or rise.
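The bookkeeping for that ablation could look like the sketch below. IDF1 is computed from identity-level counts with its standard definition, 2·IDTP / (2·IDTP + IDFP + IDFN); the `verdict` helper, its tolerance, and the counts in the example are hypothetical and are not results from the paper.

```python
def idf1(idtp, idfp, idfn):
    """IDF1 from identity true positives, false positives, and false negatives."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0

def verdict(discrete_score, continuous_score, tol):
    """Describe what happens to the score when continuous embeddings replace the tokens."""
    delta = continuous_score - discrete_score
    if delta < -tol:
        return "drop"    # consistent with the paper's claim
    if delta > tol:
        return "rise"    # would contradict the claim
    return "flat"        # discretization makes no measurable difference

# Made-up counts for the two variants of the same tracker.
with_tokens = idf1(idtp=80, idfp=10, idfn=30)
with_continuous = idf1(idtp=70, idfp=20, idfn=40)
print(with_tokens, with_continuous, verdict(with_tokens, with_continuous, tol=0.01))
```

The tolerance should reflect run-to-run variance, which the page does not report, so any threshold used in practice would need to be estimated from repeated training runs.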
Original abstract
Multi-animal tracking (MAT) is critical for wildlife monitoring and behavioral analysis, yet remains challenging due to uniform appearance, high density, and irregular motion. Existing methods typically follow heuristic- or query-based paradigms: the former relies on handcrafted geometric associations without end-to-end optimization, whereas the latter enables joint optimization but relies heavily on appearance embeddings. In such conditions, continuous geometric embeddings can be unstable, as small coordinate perturbations may disproportionately alter cross-frame attention weights, degrading identity association performance. To address this limitation, we propose HieDG, a Hierarchical Discrete Geometry-guided tracking framework that reformulates geometric dynamics as structured discrete representations within a query-based tracker. Instead of directly using raw geometric signals, HieDG employs a two-stage residual codebook to discretize position, scale, and velocity cues, transforming unstable continuous geometry into structured, stable discrete tokens. These tokens are aligned with visual embeddings and integrated into the tracking queries to enhance identity consistency. Extensive experiments on animal-specific benchmarks (AnimalTrack, BFT, and BuckTales) demonstrate state-of-the-art association performance with significant improvements in HOTA, AssA, and IDF1. Additional evaluations on generic multi-object tracking benchmarks, including DanceTrack and SportsMOT, show competitive performance, indicating the broader applicability of discretized geometric modeling beyond animal-specific scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes HieDG, a query-based multi-animal tracking framework that reformulates geometric dynamics (position, scale, velocity) via a two-stage residual codebook to produce discrete tokens. These tokens are aligned with visual embeddings and integrated into tracking queries to stabilize identity association. The paper reports state-of-the-art results on AnimalTrack, BFT, and BuckTales (in HOTA, AssA, IDF1) and competitive performance on DanceTrack and SportsMOT.
Significance. If the empirical gains hold under rigorous evaluation, the discretization strategy offers a concrete mechanism for mitigating attention-weight sensitivity to small geometric perturbations, which could improve robustness in dense, low-appearance-variation tracking scenarios. The extension to generic MOT benchmarks suggests the approach is not narrowly animal-specific.
minor comments (3)
- [Abstract] The claims of 'significant improvements' and 'state-of-the-art association performance' are stated without numerical deltas, error bars, or references to specific table rows, forcing the reader to locate the quantitative evidence later in the manuscript.
- [Method] The integration step that aligns discrete geometric tokens with visual embeddings is described only at a high level; a concrete description of the alignment loss or projection layer would improve reproducibility.
- [Experiments] No dataset statistics (number of sequences, average density, occlusion rates) are supplied for the animal benchmarks, which makes it harder to contextualize the reported gains relative to prior work.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of HieDG, the recognition of its potential to improve robustness via discretization, and the recommendation for minor revision. We will incorporate any minor clarifications in the revised version.
Circularity Check
No significant circularity
The paper proposes HieDG as a modeling choice: a two-stage residual codebook that discretizes position/scale/velocity cues to stabilize attention weights in query-based trackers. No equations, derivations, or self-citation chains are presented that reduce the claimed performance gains to inputs by construction. The discretization step is motivated directly by the stated instability of continuous embeddings and is evaluated via empirical results on external benchmarks (AnimalTrack, BFT, BuckTales, DanceTrack, SportsMOT). This is an independent architectural decision with no load-bearing self-definition, fitted-input-as-prediction, or uniqueness theorem imported from prior author work.
Axiom & Free-Parameter Ledger
free parameters (1)
- residual codebook size and number of stages
axioms (1)
- domain assumption: Continuous geometric embeddings are unstable because small coordinate perturbations disproportionately alter cross-frame attention weights.