arxiv: 2604.12665 · v1 · submitted 2026-04-14 · 💻 cs.CV

Hypergraph-State Collaborative Reasoning for Multi-Object Tracking

Zikai Song , Junqing Yu , Yi-Ping Phoebe Chen , Wei Yang , Xinchao Wang

Pith reviewed 2026-05-10 15:38 UTC · model grok-4.3

keywords multi-object tracking · hypergraph · state space model · motion reasoning · occlusion handling · collaborative inference

The pith

Objects with similar motion states mutually refine trajectories to stabilize multi-object tracking under occlusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes letting objects with similar motion states mutually constrain and refine each other's motion estimates in multi-object tracking. This approach aims to reduce instability from noisy predictions and prevent trajectory fragmentation when objects are occluded. It realizes this through HyperSSM, an architecture combining hypergraph computation for spatial correlations via dynamic hyperedges with a state space model for temporal smoothness. The design optimizes spatial consensus and temporal coherence together. Experiments on MOT17, MOT20, DanceTrack, and SportsMOT demonstrate state-of-the-art tracking performance.

Core claim

By allowing objects with similar motion states to mutually constrain and refine each other, the collaborative reasoning framework stabilizes noisy trajectories and infers plausible motion continuity even when targets are occluded, using HyperSSM to integrate hypergraph spatial reasoning and state space temporal modeling.

What carries the argument

HyperSSM architecture that integrates hypergraph computation capturing spatial motion correlations through dynamic hyperedges and a state space model enforcing temporal smoothness via structured state transitions.
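
As orientation for the temporal half, a generic discrete linear state space recurrence is written out here. The symbols $A$, $B$, $C$, $h_t$, $x_t$, and $y_t$ are the usual textbook ones and are assumptions for illustration; the abstract does not give the paper's own parameterization, so this should not be read as the HyperSSM update rule.

```latex
h_t = A\,h_{t-1} + B\,x_t, \qquad y_t = C\,h_t
```

Here $x_t$ stands for an object's motion feature at frame $t$, $h_t$ for the hidden state carried between frames, and $y_t$ for the smoothed output. "Structured state transitions" would then refer to constraints placed on $A$, under this reading.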

If this is right

Mutual constraints from similar objects stabilize noisy trajectories.
Plausible motion continuity is inferred during occlusions.
Spatial consensus and temporal coherence are optimized simultaneously.
State-of-the-art performance is achieved on four diverse MOT benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Extending this collaborative approach to incorporate appearance features could further improve accuracy in ambiguous motion cases.
The framework might apply to related tasks like multi-person pose tracking where group dynamics matter.
Future work could test the method's robustness by simulating erroneous hyperedge connections.

Load-bearing premise

Dynamic hyperedges reliably identify objects with truly similar motion states without creating spurious constraints that propagate errors.
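
To make the premise concrete, here is a minimal sketch of one way hyperedges could be formed from motion states. The rule used (one hyperedge centred on each object, membership by cosine similarity of 2-D velocities above a fixed threshold) and the names `cosine`, `build_incidence`, and `threshold` are hypothetical; the abstract does not describe the paper's construction, which is said to be learned and dynamic.

```python
import math

def cosine(u, v):
    """Cosine similarity of two 2-D velocity vectors; 0.0 if either is zero."""
    nu = math.hypot(*u)
    nv = math.hypot(*v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return (u[0] * v[0] + u[1] * v[1]) / (nu * nv)

def build_incidence(velocities, threshold=0.9):
    """Return an N x N incidence matrix H (hypothetical rule, not the paper's).

    Hyperedge j is centred on object j and contains every object i whose
    velocity has cosine similarity >= threshold with object j's velocity.
    """
    n = len(velocities)
    return [[1 if cosine(velocities[i], velocities[j]) >= threshold else 0
             for j in range(n)]
            for i in range(n)]

# Two objects moving right together, one moving left.
velocities = [(1.0, 0.0), (0.9, 0.1), (-1.0, 0.0)]
H = build_incidence(velocities)
```

With a rule like this, a noisy velocity estimate can cross the threshold and join the wrong group, which is exactly the failure mode the premise has to rule out.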

What would settle it

Showing that the performance gains disappear when hyperedges are constructed from random motion similarities instead of learned dynamic ones would falsify the benefit of collaborative reasoning.
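
A control of this kind could be set up roughly as follows. This is a sketch under stated assumptions: the incidence matrix is a nested list with rows as objects and columns as hyperedges, and `random_control` is a hypothetical helper that keeps each hyperedge's size but draws its members at random, so any remaining gain cannot come from motion similarity.

```python
import random

def random_control(H, seed=0):
    """Randomize hyperedge membership while keeping each hyperedge's size.

    H is an N x M incidence matrix as nested lists (rows: objects,
    columns: hyperedges). Hypothetical control, not the paper's protocol.
    """
    rng = random.Random(seed)
    n, m = len(H), len(H[0])
    control = [[0] * m for _ in range(n)]
    for j in range(m):
        size = sum(H[i][j] for i in range(n))
        # Pick `size` distinct objects uniformly at random for hyperedge j.
        for i in rng.sample(range(n), size):
            control[i][j] = 1
    return control

H_example = [[1, 0], [1, 0], [0, 1], [0, 1]]
H_random = random_control(H_example, seed=1)
```

Keeping the hyperedge sizes fixed separates the effect of which objects are grouped from the effect of how many are grouped.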

Figures

Figures reproduced from arXiv: 2604.12665 by Junqing Yu, Wei Yang, Xinchao Wang, Yi-Ping Phoebe Chen, Zikai Song.

**Figure 1.** Hypergraph-based collaborative motion estimation. We first construct a motion-aware hypergraph, represented by an incidence matrix, the hyperedge (e) selectively associating strongly correlated targets based on motion states (gathering), the gathered group is then dispersed back to individual nodes for fine-grained motion refinement (scattering), achieving both collaborative inference and precise object-s…

**Figure 2.** HyperSSM block and Hyper Convolution. The HyperSSM block integrates hypergraph computation into the SSM. The input is the motion feature X of all N objects within L frames. First, collaborative reasoning is performed via hyper convolution (HConv), involving a vertex-to-edge (V2E) gathering from vertices to hyperedges, followed by an edge-to-vertex (E2V) scattering back to vertices. The block then computes…

**Figure 3.** The architecture of the motion estimation model. The motion estimation module comprises multiple cascaded HyperSSM layers. Given the multi-frame multi-object position information PLt , we encode it into trajectory embeddings to guide each layer. The resulting motion features are processed through a feed-forward network (FFN), generating the position information PLt+1 for the next sliding window at one-fram…

**Figure 4.** Ablation study on the number of HyperSSM layers (left)…

**Figure 5.** Visualizations under the two key challenge scenarios: a…
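
The gather and scatter steps named in the captions of Figures 1 and 2 can be sketched as a parameter-free pass over an incidence matrix. This is an illustration only: plain mean aggregation is assumed for both steps, and `hyper_conv` is a hypothetical name; the paper's HConv presumably has learned weights that the captions do not spell out.

```python
def hyper_conv(X, H):
    """One parameter-free gather/scatter pass over a hypergraph.

    X: N feature vectors (list of lists), one per object.
    H: N x M incidence matrix; H[i][j] == 1 if object i is in hyperedge j.
    Mean aggregation is an assumption, not the paper's stated operator.
    """
    n, m, d = len(X), len(H[0]), len(X[0])

    # V2E gathering: each hyperedge feature is the mean of its members.
    edges = []
    for j in range(m):
        members = [i for i in range(n) if H[i][j]]
        edges.append([sum(X[i][k] for i in members) / len(members)
                      for k in range(d)] if members else [0.0] * d)

    # E2V scattering: each object averages the hyperedges it belongs to;
    # an object in no hyperedge keeps its own feature.
    out = []
    for i in range(n):
        mine = [j for j in range(m) if H[i][j]]
        out.append([sum(edges[j][k] for j in mine) / len(mine)
                    for k in range(d)] if mine else list(X[i]))
    return out

# Objects 0 and 1 share a hyperedge; object 2 sits alone in its own.
refined = hyper_conv([[1.0], [3.0], [10.0]], [[1, 0], [1, 0], [0, 1]])
```

In this toy case the two grouped objects are pulled to their shared mean, while the isolated object is unchanged.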

Original abstract

Motion reasoning serves as the cornerstone of multi-object tracking (MOT), as it enables consistent association of targets across frames. However, existing motion estimation approaches face two major limitations: (1) instability caused by noisy or probabilistic predictions, and (2) vulnerability under occlusion, where trajectories often fragment once visual cues disappear. To overcome these issues, we propose a collaborative reasoning framework that enhances motion estimation through joint inference among multiple correlated objects. By allowing objects with similar motion states to mutually constrain and refine each other, our framework stabilizes noisy trajectories and infers plausible motion continuity even when targets are occluded. To realize this concept, we design HyperSSM, an architecture that integrates Hypergraph computation and a State Space Model (SSM) for unified spatial-temporal reasoning. The Hypergraph module captures spatial motion correlations through dynamic hyperedges, while the SSM enforces temporal smoothness via structured state transitions. This synergistic design enables simultaneous optimization of spatial consensus and temporal coherence, resulting in robust and stable motion estimation. Extensive experiments on four mainstream and diverse benchmarks (MOT17, MOT20, DanceTrack, and SportsMOT), covering various motion patterns and scene complexities, demonstrate that our approach achieves state-of-the-art performance across a wide range of tracking scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HyperSSM pairs hypergraphs with SSMs for group motion reasoning in MOT, but the abstract gives no mechanism to keep noisy initial estimates from seeding bad hyperedges.

The core idea is a collaborative setup where objects with similar motion states refine each other through dynamic hyperedges in a hypergraph, then get smoothed over time with a state space model. This is positioned as a fix for the usual problems of noisy predictions and broken tracks under occlusion. The architecture itself is the new piece: a single module that does spatial consensus and temporal coherence together instead of bolting separate graph and sequence components on top of an existing tracker. That integration is straightforward and worth looking at if you work on graph-based tracking models. The paper also picks the right benchmarks (MOT17, MOT20, DanceTrack, SportsMOT) to test across different motion patterns and densities, which is better than the usual single-dataset claims.

The stress-test concern lands. Dynamic hyperedges are formed from spatial motion correlations, yet the description does not say how the system avoids pulling in wrong neighbors when the initial motion estimates are already noisy or when visual features disappear. If the first grouping step is unreliable, the mutual refinement can just average errors instead of correcting them. No ablations or concrete construction rules are mentioned to show this does not happen. The SOTA claim is stated without numbers, variance, or comparison tables in the abstract, so it is impossible to judge whether the gains come from the collaborative part or from other implementation choices.

This paper is aimed at MOT people who already use graphs or SSMs and want to try a joint spatial-temporal version. A reader could pull the high-level design for their own work, but would need the full implementation details and controlled experiments before treating the results as settled.

I would send it to peer review. The problem is real, the architecture is coherent on paper, and the benchmarks are appropriate; the missing pieces on error control and evidence are exactly what referees can ask for.
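
The worry that mutual refinement can average errors instead of correcting them is easy to see with a toy calculation. The numbers and the plain-mean consensus rule in `group_mean` are made up for illustration and are not taken from the paper.

```python
def group_mean(values):
    """Consensus estimate for one hyperedge: the plain mean of its members."""
    return sum(values) / len(values)

true_velocity = 1.0
clean_group = [1.25, 0.75, 1.0]      # noisy estimates of the same motion
bad_group = clean_group + [-2.0]     # one wrongly grouped object (hypothetical)

# Error of the consensus with and without the spurious member.
clean_error = abs(group_mean(clean_group) - true_velocity)
bad_error = abs(group_mean(bad_group) - true_velocity)
print(clean_error, bad_error)
```

With correct grouping the individual errors cancel; a single spurious member drags the shared estimate away from the truth for every object in the hyperedge.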

Referee Report

3 major / 2 minor

Summary. The manuscript proposes HyperSSM, a collaborative reasoning architecture for multi-object tracking that combines a hypergraph module for capturing spatial motion correlations via dynamic hyperedges with a state space model (SSM) for temporal smoothness. It claims this mutual constraint among objects with similar motion states stabilizes noisy trajectories and infers plausible continuity under occlusion, achieving state-of-the-art results on MOT17, MOT20, DanceTrack, and SportsMOT.

Significance. If the empirical claims hold, the work would be significant for MOT by offering a unified spatial-temporal framework that leverages inter-object collaboration to address longstanding issues with noisy motion estimates and occlusions. The synergistic design of hypergraphs for consensus and SSM for coherence represents a novel direction, and the evaluation across four diverse benchmarks covering varied motion patterns provides a reasonable testbed for generality.

major comments (3)

[Abstract] The central claim that dynamic hyperedges enable reliable mutual refinement by connecting objects with truly similar motion states is load-bearing, yet the description provides no mechanism or safeguard against seeding hyperedges from noisy initial motion predictions, which could propagate errors rather than stabilize trajectories (directly relevant to the occlusion handling premise).
[Abstract and Methods description] No ablation studies or quantitative metrics are reported to isolate the contribution of the hypergraph-based spatial consensus versus the SSM temporal component, or to demonstrate robustness when initial motion estimates are deliberately noisy; without these, the SOTA claims cannot be attributed to the collaborative reasoning.
[Abstract] The assertion of 'extensive experiments' demonstrating stabilization and inference is made without any reported implementation details, hyperedge construction algorithm, or controlled tests under occlusion, leaving the weakest assumption (reliable identification of similar motion states) unverified.

minor comments (2)

Notation for hyperedge weights and SSM state transitions could be formalized with equations for reproducibility.
The manuscript would benefit from additional citations to prior hypergraph applications in vision tasks.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to provide greater clarity, supporting experiments, and implementation details.

Referee: [Abstract] The central claim that dynamic hyperedges enable reliable mutual refinement by connecting objects with truly similar motion states is load-bearing, yet the description provides no mechanism or safeguard against seeding hyperedges from noisy initial motion predictions, which could propagate errors rather than stabilize trajectories (directly relevant to the occlusion handling premise).

Authors: We agree that the abstract is too concise to convey the hyperedge construction details and any safeguards. We will revise the abstract to briefly note that dynamic hyperedges are formed via a motion-state similarity function incorporating consistency filtering to reduce sensitivity to initial noise. In the methods section of the revision, we will add an explicit discussion of the safeguards, including how initial predictions are vetted before hyperedge seeding to limit error propagation. revision: yes

Referee: [Abstract and Methods description] No ablation studies or quantitative metrics are reported to isolate the contribution of the hypergraph-based spatial consensus versus the SSM temporal component, or to demonstrate robustness when initial motion estimates are deliberately noisy; without these, the SOTA claims cannot be attributed to the collaborative reasoning.

Authors: We acknowledge that the current manuscript does not include dedicated ablations isolating the hypergraph and SSM contributions or testing robustness to noisy initials. We will add these in the revised experiments section, including quantitative comparisons of the full model against hypergraph-only and SSM-only variants, as well as controlled tests where Gaussian noise is injected into initial motion predictions to measure impact on tracking stability. revision: yes

Referee: [Abstract] The assertion of 'extensive experiments' demonstrating stabilization and inference is made without any reported implementation details, hyperedge construction algorithm, or controlled tests under occlusion, leaving the weakest assumption (reliable identification of similar motion states) unverified.

Authors: We recognize the need for more explicit details on implementation and targeted tests. We will expand Section 4 with the full hyperedge construction algorithm (including pseudocode) and additional hyperparameters. We will also add controlled occlusion experiments, such as performance breakdowns on high-occlusion subsequences and qualitative trajectory inference examples, to directly verify the motion-state identification assumption. revision: yes
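
The noise-injection test promised in the second response could be scaffolded along these lines. This is a sketch under assumptions: tracks are lists of (x, y) points, and `inject_noise` and `mean_displacement` are hypothetical helpers; the actual protocol, noise levels, and stability metric would be whatever the revised manuscript reports.

```python
import random

def inject_noise(track, sigma, seed=0):
    """Add zero-mean Gaussian noise to each (x, y) point of a track."""
    rng = random.Random(seed)
    return [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma))
            for x, y in track]

def mean_displacement(a, b):
    """Mean Euclidean distance between corresponding points of two tracks."""
    return sum(((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5
               for (ax, ay), (bx, by) in zip(a, b)) / len(a)

# Hypothetical straight-line track and illustrative noise levels.
track = [(float(t), 2.0 * t) for t in range(50)]
for sigma in (0.0, 1.0, 4.0):
    noisy = inject_noise(track, sigma)
    print(sigma, round(mean_displacement(track, noisy), 3))
```

Feeding the perturbed tracks to the motion model at each noise level, then comparing tracking metrics against the clean run, would show how quickly the collaborative refinement degrades.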

Circularity Check

0 steps flagged

No significant circularity; novel architecture validated externally

The paper proposes HyperSSM as a new design combining hypergraph spatial correlations with SSM temporal modeling to enable collaborative motion reasoning. All load-bearing claims (stabilization under noise/occlusion, SOTA results) rest on experimental validation against independent benchmarks (MOT17, MOT20, DanceTrack, SportsMOT) rather than any self-referential fitting, self-citation chain, or redefinition of inputs as outputs. No equations or steps in the provided text reduce the claimed predictions to the framework's own fitted quantities by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Only the abstract is available, preventing full extraction of parameters or assumptions; the central contribution is the new HyperSSM architecture itself.

invented entities (1)

HyperSSM architecture (no independent evidence)
purpose: Integrates hypergraph computation and state space model for unified spatial-temporal reasoning in MOT
Presented as the core novel design enabling collaborative inference

pith-pipeline@v0.9.0 · 5522 in / 1054 out tokens · 46614 ms · 2026-05-10T15:38:42.820353+00:00 · methodology

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OmniTrend: Content-Context Modeling for Scalable Social Popularity Prediction
cs.CV · 2026-04 · unverdicted · novelty 6.0

OmniTrend predicts popularity by combining separate content attractiveness and contextual exposure predictors using cross-modal and exogenous signals.

HotComment: A Benchmark for Evaluating Popularity of Online Comments
cs.AI · 2026-04 · unverdicted · novelty 6.0

HotComment is a new multimodal benchmark that quantifies online comment popularity via content quality assessment, interaction-based prediction, and agent-simulated user engagement, accompanied by the StyleCmt stylist...

INTENT: Invariance and Discrimination-aware Noise Mitigation for Robust Composed Image Retrieval
cs.CV · 2026-04 · unverdicted · novelty 6.0

INTENT mitigates cross-modal correspondence noise and modality-inherent noise in composed image retrieval via FFT-based visual invariant composition and bi-objective discriminative learning.

HABIT: Chrono-Synergia Robust Progressive Learning Framework for Composed Image Retrieval
cs.CV · 2026-04 · unverdicted · novelty 6.0

HABIT improves robustness in composed image retrieval under noisy triplets by quantifying sample cleanliness via mutual information transition rates and applying dual-consistency progressive learning to retain good pa...

ReTrack: Evidence-Driven Dual-Stream Directional Anchor Calibration Network for Composed Video Retrieval
cs.CV · 2026-04 · unverdicted · novelty 6.0

ReTrack calibrates directional bias in composed video features using semantic disentanglement and bidirectional evidence alignment to improve retrieval performance on CVR and CIR tasks.

Seeing Further and Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction
cs.MM · 2026-04 · unverdicted · novelty 5.0

A new joint spatio-temporal enlargement model for micro-video popularity prediction using frame scoring for long sequences and a topology-aware memory bank for unbounded historical associations.

CurEvo: Curriculum-Guided Self-Evolution for Video Understanding
cs.CV · 2026-04 · unverdicted · novelty 4.0

CurEvo integrates curriculum guidance into self-evolution to structure autonomous improvement of video understanding models, yielding gains on VideoQA benchmarks.

Reference graph

Works this paper leans on

74 extracted references · 20 canonical work pages · cited by 7 Pith papers · 4 internal anchors

[1]

Bot-sort: R obust associations multi-pedestrian tracking

Nir Aharon, Roy Orfaig, and Ben-Zion Bobrovsky. Bot-sort: Robust associations multi-pedestrian tracking.arXiv preprint arXiv:2206.14651, 2022. 3, 6, 7

work page arXiv 2022
[2]

Star: Spatial-temporal tracklet matching for multi- object tracking

Xuewei Bai, Yongcai Wang, Deying Li, Haodi Ping, and LI Chunxu. Star: Spatial-temporal tracklet matching for multi- object tracking. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems. 3
[3]

Simple online and realtime tracking

Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, and Ben Upcroft. Simple online and realtime tracking. In2016 IEEE international conference on image processing (ICIP), pages 3464–3468. IEEE, 2016. 3

2016
[4]

Memot: multi-object track- ing with memory

Jiarui Cai, Mingze Xu, Wei Li, Yuanjun Xiong, Wei Xia, Zhuowen Tu, and Stefano Soatto. Memot: multi-object track- ing with memory. InProceedings of the CVPR, pages 8090– 8100, 2022. 6

2022
[5]

Observation-centric sort: Rethink- ing sort for robust multi-object tracking.arXiv preprint arXiv:2203.14360, 2022

Jinkun Cao, Xinshuo Weng, Rawal Khirodkar, Jiangmiao Pang, and Kris Kitani. Observation-centric sort: Rethink- ing sort for robust multi-object tracking.arXiv preprint arXiv:2203.14360, 2022. 6, 7

work page arXiv 2022
[6]

End-to- end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to- end object detection with transformers. InProceedings of the ECCV, pages 213–229. Springer, 2020. 2

2020
[7]

Uni- fying short and long-term tracking with graph hierarchies

Orcun Cetintas, Guillem Brasó, and Laura Leal-Taixé. Uni- fying short and long-term tracking with graph hierarchies. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22877–22887, 2023. 3

2023
[8]

Sportsmot: A large multi-object tracking dataset in multiple sports scenes

Yutao Cui, Chenkai Zeng, Xiaoyu Zhao, Yichun Yang, Gang- shan Wu, and Limin Wang. Sportsmot: A large multi-object tracking dataset in multiple sports scenes. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9921–9931, 2023. 2, 6, 7

2023
[9]

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality.arXiv preprint arXiv:2405.21060, 2024. 3

work page internal anchor Pith review arXiv 2024
[10]

arXiv preprint arXiv:2003.09003 (2020)

Patrick Dendorfer, Hamid Rezatofighi, Anton Milan, Javen Shi, Daniel Cremers, Ian Reid, Stefan Roth, Konrad Schindler, and Laura Leal-Taixé. Mot20: A benchmark for multi object tracking in crowded scenes.arXiv preprint arXiv:2003.09003,

work page arXiv 2003
[11]

Strongsort: Make deepsort great again.arXiv preprint arXiv:2202.13514, 2022

Yunhao Du, Yang Song, Bo Yang, and Yanyun Zhao. Strongsort: Make deepsort great again.arXiv preprint arXiv:2202.13514, 2022. 6, 7

work page arXiv 2022
[12]

Sset: a dataset for shot segmentation, event detection, player tracking in soccer videos.Multimedia Tools and Applications, 79(1):28971–28992, 2020

Na Feng, Zikai Song, Junqing Yu, Yi Ping Phoebe Chen, and Tao Guan. Sset: a dataset for shot segmentation, event detection, player tracking in soccer videos.Multimedia Tools and Applications, 79(1):28971–28992, 2020. 2

2020
[13]

Ma-vlad: a fine-grained local feature ag- gregation scheme for action recognition.Multimedia Systems, 30(3):139, 2024

Na Feng, Ying Tang, Zikai Song, Junqing Yu, Yi-Ping Phoebe Chen, and Wei Yang. Ma-vlad: a fine-grained local feature ag- gregation scheme for action recognition.Multimedia Systems, 30(3):139, 2024. 2

2024
[14]

Hypergraph neural networks

Yifan Feng, Haoxuan You, Zizhao Zhang, Rongrong Ji, and Yue Gao. Hypergraph neural networks. InProceedings of the AAAI conference on artificial intelligence, pages 3558–3565,
[15]

Hyper-yolo: When visual object detection meets hyper- graph computation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

Yifan Feng, Jiangang Huang, Shaoyi Du, Shihui Ying, Jun- Hai Yong, Yipeng Li, Guiguang Ding, Rongrong Ji, and Yue Gao. Hyper-yolo: When visual object detection meets hyper- graph computation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 3

2024
[16]

Memotr: Long-term memory- augmented transformer for multi-object tracking

Ruopeng Gao and Limin Wang. Memotr: Long-term memory- augmented transformer for multi-object tracking. InProceed- ings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9901–9910, 2023. 6, 7

2023
[17]

Multiple object track- ing as id prediction.arXiv preprint arXiv:2403.16848, 2024

Ruopeng Gao, Ji Qi, and Limin Wang. Multiple object track- ing as id prediction.arXiv preprint arXiv:2403.16848, 2024. 1, 2, 6, 7

work page arXiv 2024
[18]

Hgnn+: General hypergraph neural networks.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3181–3199,

Yue Gao, Yifan Feng, Shuyi Ji, and Rongrong Ji. Hgnn+: General hypergraph neural networks.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3181–3199,
[19]

YOLOX: Exceeding YOLO Series in 2021

Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. Yolox: Exceeding yolo series in 2021.arXiv preprint arXiv:2107.08430, 2021. 3, 5

work page internal anchor Pith review arXiv 2021
[20]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces.arXiv preprint arXiv:2312.00752, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[21]

Efficiently mod- eling long sequences with structured state spaces

Albert Gu, Karan Goel, and Christopher Re. Efficiently mod- eling long sequences with structured state spaces. 3
[22]

On the parameterization and initialization of diagonal state space models.Advances in Neural Information Processing Systems, 35:35971–35983, 2022

Albert Gu, Karan Goel, Ankit Gupta, and Christopher Re. On the parameterization and initialization of diagonal state space models.Advances in Neural Information Processing Systems, 35:35971–35983, 2022. 3

2022
[23]

Ettrack: enhanced temporal motion predictor for multi-object tracking

Xudong Han, Nobuyuki Oishi, Yueying Tian, Elif Ucurum, Rupert Young, Chris Chatwin, and Philip Birch. Ettrack: enhanced temporal motion predictor for multi-object tracking. Applied Intelligence, 55(1):1–17, 2025. 6, 7

2025
[24]

Learnable graph matching: Incorporating graph parti- tioning with deep feature learning for multiple object tracking

Jiawei He, Zehao Huang, Naiyan Wang, and Zhaoxiang Zhang. Learnable graph matching: Incorporating graph parti- tioning with deep feature learning for multiple object tracking. InProceedings of the CVPR, pages 5299–5309, 2021. 3

2021
[25]

arXiv preprint arXiv:2409.00487 (2024)

Bin Hu, Run Luo, Zelin Liu, Cheng Wang, and Wenyu Liu. Trackssm: A general motion predictor by state-space model. arXiv preprint arXiv:2409.00487, 2024. 1, 3, 6, 7

work page arXiv 2024
[26]

Exploiting multimodal spatial-temporal patterns for video object tracking

Xiantao Hu, Ying Tai, Xu Zhao, Chen Zhao, Zhenyu Zhang, Jun Li, Bineng Zhong, and Jian Yang. Exploiting multimodal spatial-temporal patterns for video object tracking. InPro- ceedings of the AAAI Conference on Artificial Intelligence, pages 3581–3589, 2025. 1

2025
[27]

Adaptive perception for unified visual multi-modal object tracking.IEEE Transactions on Artificial Intelligence, 2025

Xiantao Hu, Bineng Zhong, Qihua Liang, Liangtao Shi, Zhiyi Mo, Ying Tai, and Jian Yang. Adaptive perception for unified visual multi-modal object tracking.IEEE Transactions on Artificial Intelligence, 2025. 1

2025
[28]

Exploring learning- based motion models in multi-object tracking.arXiv preprint arXiv:2403.10826, 2024

Hsiang-Wei Huang, Cheng-Yen Yang, Wenhao Chai, Zhongyu Jiang, and Jenq-Neng Hwang. Exploring learning- based motion models in multi-object tracking.arXiv preprint arXiv:2403.10826, 2024. 3, 6, 7

work page arXiv 2024
[29]

Sam2mot: A novel paradigm of multi-object tracking by segmentation,

Junjie Jiang, Zelin Wang, Manqi Zhao, Yin Li, and Dong- Sheng Jiang. Sam2mot: A novel paradigm of multi-object tracking by segmentation.arXiv preprint arXiv:2504.04519,

work page arXiv
[30]

Dual-path temporal decoder for end-to-end multi-object tracking

Hyunseop Kim, Juheon Jeong, Hanul Kim, and Yeong Jun Koh. Dual-path temporal decoder for end-to-end multi-object tracking. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems. 6, 7
[31]

Cadtrack: Learning contextual aggregation with deformable alignment for robust rgbt tracking.arXiv preprint arXiv:2511.17967, 2025

Hao Li, Yuhao Wang, Xiantao Hu, Wenning Hao, Pingping Zhang, Dong Wang, and Huchuan Lu. Cadtrack: Learning contextual aggregation with deformable alignment for robust rgbt tracking.arXiv preprint arXiv:2511.17967, 2025. 1

work page arXiv 2025
[32]

Coupled mamba: Enhanced multimodal fusion with coupled state space model.Advances in Neural Information Processing Systems, 37:59808–59832, 2024

Wenbing Li, Hang Zhou, Junqing Yu, Zikai Song, and Wei Yang. Coupled mamba: Enhanced multimodal fusion with coupled state space model.Advances in Neural Information Processing Systems, 37:59808–59832, 2024. 2

2024
[33]

Sparsetrack: Multi-object tracking by performing scene decomposition based on pseudo-depth.arXiv preprint arXiv:2306.05238, 2023

Zelin Liu, Xinggang Wang, Cheng Wang, Wenyu Liu, and Xiang Bai. Sparsetrack: Multi-object tracking by performing scene decomposition based on pseudo-depth.arXiv preprint arXiv:2306.05238, 2023. 1, 6, 7

work page arXiv 2023
[34]

Diffusiontrack: Diffusion model for multi-object tracking.arXiv preprint arXiv:2308.09905, 2023

Run Luo, Zikai Song, Lintao Ma, Jinlin Wei, Wei Yang, and Min Yang. Diffusiontrack: Diffusion model for multi-object tracking.arXiv preprint arXiv:2308.09905, 2023. 6, 7

work page arXiv 2023
[35]

Diffmot: A real-time diffusion-based multiple object tracker with non-linear prediction

Weiyi Lv, Yuhang Huang, Ning Zhang, Ruei-Sung Lin, Mei Han, and Dan Zeng. Diffmot: A real-time diffusion-based multiple object tracker with non-linear prediction. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19321–19330, 2024. 6, 7

2024
[36]

Trackformer: Multi-object tracking with transformers

Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, and Christoph Feichtenhofer. Trackformer: Multi-object tracking with transformers. InProceedings of the CVPR, pages 8844– 8854, 2022. 2

2022
[37]

MOT16: A Benchmark for Multi-Object Tracking

Anton Milan, Laura Leal-Taixé, Ian Reid, Stefan Roth, and Konrad Schindler. Mot16: A benchmark for multi-object tracking.arXiv preprint arXiv:1603.00831, 2016. 2, 6, 8

work page Pith review arXiv 2016
[38]

Trackmpnn: A message passing graph neural architecture for multi-object tracking.arXiv preprint arXiv:2101.04206, 2021

Akshay Rangesh, Pranav Maheshwari, Mez Gebre, Siddhesh Mhatre, Vahid Ramezani, and Mohan M Trivedi. Trackmpnn: A message passing graph neural architecture for multi-object tracking.arXiv preprint arXiv:2101.04206, 2021. 3

work page arXiv 2021
[39]

Faster r-cnn: Towards real-time object detection with region proposal networks.Advances in neural information process- ing systems, 28, 2015

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks.Advances in neural information process- ing systems, 28, 2015. 3

2015
[40]

Weihong Ren, Xinchao Wang, Jiandong Tian, Yandong Tang, and Antoni B. Chan. Tracking-by-counting: Using net- work flows on crowd density maps for tracking multiple tar- gets.IEEE Transactions on Image Processing, 30:1439–1452,
[41]

Generalized in- tersection over union: A metric and a loss for bounding box regression

Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized in- tersection over union: A metric and a loss for bounding box regression. InProceedings of the CVPR, pages 658–666,
[42]

arXiv preprint arXiv:2410.01806 (2024)

Mattia Segu, Luigi Piccinelli, Siyuan Li, Yung-Hsu Yang, Bernt Schiele, and Luc Van Gool. Samba: Synchronized set-of-sequences modeling for multiple object tracking.arXiv preprint arXiv:2410.01806, 2024. 2, 6, 7

work page arXiv 2024
[43]

Improving weakly supervised object localization via causal intervention

Feifei Shao, Yawei Luo, Li Zhang, Lu Ye, Siliang Tang, Yi Yang, and Jun Xiao. Improving weakly supervised object localization via causal intervention. InProceedings of the 29th ACM International Conference on Multimedia, pages 3321–3329, 2021. 1

2021
[44]

Deep learning for weakly-supervised object detection and localization: A survey

Feifei Shao, Long Chen, Jian Shao, Wei Ji, Shaoning Xiao, Lu Ye, Yueting Zhuang, and Jun Xiao. Deep learning for weakly-supervised object detection and localization: A survey. Neurocomputing, 496:192–207, 2022

2022
[45]

Knowledge-guided causal intervention for weakly-supervised object localization.IEEE Transactions on Knowledge and Data Engineering, 36(11):6477–6489, 2024

Feifei Shao, Yawei Luo, Fei Gao, Yi Yang, and Jun Xiao. Knowledge-guided causal intervention for weakly-supervised object localization.IEEE Transactions on Knowledge and Data Engineering, 36(11):6477–6489, 2024

2024
[46]

Counterfactual co-occurring learning for bias mitigation in weakly-supervised object localization

Feifei Shao, Yawei Luo, Lei Chen, Ping Liu, Wei Yang, Yi Yang, and Jun Xiao. Counterfactual co-occurring learning for bias mitigation in weakly-supervised object localization. IEEE Transactions on Multimedia, 2026. 1

2026
[47]

Focusing on tracks for online multi-object tracking

Kyujin Shim, Kangwook Ko, Yujin Yang, and Changick Kim. Focusing on tracks for online multi-object tracking. InPro- ceedings of the Computer Vision and Pattern Recognition Conference, pages 11687–11696, 2025. 5

[48]

Zikai Song, Junqing Yu, Hengyou Cai, Yangliu Hu, and Yi-Ping Phoebe Chen. Fine-grain level sports video search engine. In International Conference on Multimedia Modeling, pages 519–531. Springer, 2019. 2

[49]

Zikai Song, Zhiwen Wan, Wei Yuan, Ying Tang, Junqing Yu, and Yi-Ping Phoebe Chen. Distractor-aware tracker with a domain-special optimized benchmark for soccer player tracking. In Proceedings of the 2021 International Conference on Multimedia Retrieval, pages 276–284, 2021. 2

[50]

Zikai Song, Junqing Yu, Yi-Ping Phoebe Chen, and Wei Yang. Transformer tracking with cyclic shifting window attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8791–8800,
[51]

Zikai Song, Run Luo, Junqing Yu, Yi-Ping Phoebe Chen, and Wei Yang. Compact transformer tracker with correlative masked modeling. In Proceedings of the AAAI conference on artificial intelligence, pages 2321–2329, 2023. 1

[52]

Zikai Song, Ying Tang, Run Luo, Lintao Ma, Junqing Yu, Yi-Ping Phoebe Chen, and Wei Yang. Autogenic language embedding for coherent point tracking. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 2021–2030, 2024. 1

[53]

Zikai Song, Run Luo, Lintao Ma, Ying Tang, Yi-Ping Phoebe Chen, Junqing Yu, and Wei Yang. Temporal coherent object flow for multi-object tracking. Proceedings of the AAAI Conference on Artificial Intelligence, 39(7):6978–6986, 2025. 1, 3, 6

[54]

Peize Sun, Jinkun Cao, Yi Jiang, Rufeng Zhang, Enze Xie, Zehuan Yuan, Changhu Wang, and Ping Luo. Transtrack: Multiple object tracking with transformer. arXiv preprint arXiv:2012.15460, 2020. 6, 7

[55]

Peize Sun, Jinkun Cao, Yi Jiang, Zehuan Yuan, Song Bai, Kris Kitani, and Ping Luo. Dancetrack: Multi-object tracking in uniform appearance and diverse motion. In Proceedings of the CVPR, pages 20993–21002, 2022. 2, 6

[56]

Xinchao Wang, Engin Türetken, Francois Fleuret, and Pascal Fua. Tracking interacting objects using intertwined flows. IEEE transactions on pattern analysis and machine intelligence, 38(11):2312–2326, 2015. 1

[57]

Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. In 2017 IEEE international conference on image processing (ICIP), pages 3645–3649. IEEE, 2017. 3

[58]

Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems, 32(1):4–24, 2020. 3

[59]

Changcheng Xiao, Qiong Cao, Zhigang Luo, and Long Lan. Mambatrack: a simple baseline for multiple object tracking with state space model. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 4082–4091,
[60]

Yihong Xu, Yutong Ban, Guillaume Delorme, Chuang Gan, Daniela Rus, and Xavier Alameda-Pineda. Transcenter: Transformers with dense representations for multiple-object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022. 6

[61]

Mingzhan Yang, Guangxin Han, Bin Yan, Wenhua Zhang, Jinqing Qi, Huchuan Lu, and Dong Wang. Hybrid-sort: Weak cues matter for online multi-object tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 6504–6512, 2024. 3, 6, 7

[62]

Liliang Ye, Yunyao Zhang, Yafeng Wu, Yi-Ping Phoebe Chen, Junqing Yu, Wei Yang, and Zikai Song. Mvp: Winning solution to smp challenge 2025 video track. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 14079–14085, 2025. 2

[63]

En Yu, Tiancai Wang, Zhuoling Li, Yuang Zhang, Xiangyu Zhang, and Wenbing Tao. Motrv3: Release-fetch supervision for end-to-end multi-object tracking. arXiv preprint arXiv:2305.14298, 2023. 2, 6, 7

[64]

Junqing Yu, Aiping Lei, Zikai Song, Tingting Wang, Hengyou Cai, and Na Feng. Comprehensive dataset of broadcast soccer videos. In 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pages 418–423, 2018. 2

[65]

AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2236–2246, 2018. 3

[66]

Fangao Zeng, Bin Dong, Yuang Zhang, Tiancai Wang, Xiangyu Zhang, and Yichen Wei. Motr: End-to-end multiple-object tracking with transformer. In Proceedings of the ECCV, pages 659–675, 2022. 1, 2, 6, 7

[67]

Jiaming Zhang, Cheng Liang, Yutao Cui, Xiangbo Shu, Gangshan Wu, and Limin Wang. Trackmamba: Mamba-transformer tracking. 3
[68]

Yifu Zhang, Peize Sun, Yi Jiang, Dongdong Yu, Fucheng Weng, Zehuan Yuan, Ping Luo, Wenyu Liu, and Xinggang Wang. Bytetrack: Multi-object tracking by associating every detection box. In Proceedings of the ECCV, pages 1–21. Springer, 2022. 1, 3, 6, 7

[69]

Yuang Zhang, Tiancai Wang, and Xiangyu Zhang. Motrv2: Bootstrapping end-to-end multi-object tracking by pretrained object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22056–22065, 2023. 1, 2, 6, 7

[70]

Yaozong Zheng, Bineng Zhong, Qihua Liang, Guorong Li, Rongrong Ji, and Xianxian Li. Toward unified token learning for vision-language tracking. IEEE Transactions on Circuits and Systems for Video Technology, 34(4):2125–2135, 2023. 1

[71]

Yaozong Zheng, Bineng Zhong, Qihua Liang, Zhiyi Mo, Shengping Zhang, and Xianxian Li. Odtrack: Online dense temporal token learning for visual tracking. In Proceedings of the AAAI conference on artificial intelligence, pages 7588–7596, 2024. 1

[72]

Yaozong Zheng, Bineng Zhong, Qihua Liang, Ning Li, and Shuxiang Song. Decoupled spatio-temporal consistency learning for self-supervised tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 10635–10643,
[73]

Yaozong Zheng, Bineng Zhong, Qihua Liang, Shengping Zhang, Guorong Li, Xianxian Li, and Rongrong Ji. Towards universal modal tracking with online dense temporal token learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 1

[74]

Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020. 2
