PS-MOT: Cultivating Instance Awareness from Point Seeds for Multi-Object Tracking

Fei Teng; Hao Shi; Kailun Yang; Kai Luo; Kunyu Peng; Mengfei Duan; Wanjun Jia; Xu Wang; Zhiyong Li

arXiv: 2606.30476 · v1 · pith:XJBKZLNCnew · submitted 2026-06-29 · cs.CV · cs.RO · eess.IV

Kai Luo, Fei Teng, Mengfei Duan, Wanjun Jia, Xu Wang, Hao Shi, Kunyu Peng, Zhiyong Li, Kailun Yang

Pith reviewed 2026-06-30 06:43 UTC · model grok-4.3

keywords: point-supervised multi-object tracking · pseudo-label generation · temporal prompting · wavelet attention · uncertainty-guided learning · instance awareness

The pith

Point seeds can be evolved into instance-aware multi-object trackers via temporal prompting, wavelet attention, and uncertainty-guided loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that multi-object tracking can be trained effectively from point annotations alone, without full bounding boxes. It introduces a pipeline that turns sparse point seeds into temporally consistent pseudo-labels, activates object boundaries inside the model, and treats the pseudo-labels as probabilistic distributions during training. If the approach holds, tracking systems could be trained with far less annotation effort while maintaining competitive accuracy across varied video domains.

Core claim

PS-Track forms a hierarchical pipeline that cultivates instance awareness from point seeds at data, model, and loss levels. Temporal-Feedback Prompting generates temporally consistent pseudo-labels from points using motion priors and negative cues. Point-Excited Wavelet Attention activates high-frequency components to hallucinate boundaries from semantic correlations. Uncertainty-Guided Gaussian Learning models the pseudo-labels as distributions that dynamically adjust supervision strength, yielding state-of-the-art results for point-supervised tracking on DanceTrack, EmboTrack, SportsMOT, and JRDB.
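To make the loss-level idea concrete, here is a minimal sketch of a Gaussian negative log-likelihood with a predicted variance, in the spirit of the uncertainty modeling of Kendall and Gal [27]. It is an assumption about the general form only, not the paper's UGL formulation. The function name `gaussian_nll`, its arguments, and the toy coordinates are all hypothetical.

```python
import math

def gaussian_nll(pred, target, log_var):
    """Mean per-coordinate Gaussian negative log-likelihood (constant term dropped).

    pred, target: predicted and pseudo-label box coordinates (same length).
    log_var: predicted log-variance per coordinate (hypothetical network output).
    """
    total = 0.0
    for p, t, s in zip(pred, target, log_var):
        # exp(-s) shrinks the squared error where the predicted variance is high;
        # the + s term penalises predicting high variance everywhere.
        total += 0.5 * (math.exp(-s) * (p - t) ** 2 + s)
    return total / len(pred)

# Toy example: the first pseudo-label coordinate is off by 2 pixels.
pred = [10.0, 20.0, 50.0, 80.0]
noisy_label = [12.0, 20.0, 50.0, 80.0]

# Unit variance everywhere: the noisy coordinate is penalised at full strength.
confident = gaussian_nll(pred, noisy_label, [0.0, 0.0, 0.0, 0.0])
# Larger variance on the noisy coordinate: its supervision is softened.
uncertain = gaussian_nll(pred, noisy_label, [math.log(4.0), 0.0, 0.0, 0.0])
```

In this toy case the loss is lower when the model assigns a larger variance to the coordinate whose pseudo-label is unreliable. This is one way supervision strength can adjust dynamically.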

What carries the argument

The argument rests on the PS-Track hierarchical pipeline, which moves from points to instances using Temporal-Feedback Prompting at the data level, Point-Excited Wavelet Attention at the model level, and Uncertainty-Guided Gaussian Learning at the loss level.

If this is right

Point supervision becomes a practical substitute for bounding-box supervision in multi-object tracking.
The same pipeline delivers leading point-supervised results on DanceTrack, EmboTrack, SportsMOT, and JRDB.
Annotation effort for training trackers can shift from drawing boxes to marking centers while accuracy stays competitive.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same point-to-instance cultivation steps could be tested on single-object tracking or video instance segmentation to check transfer.
If the pseudo-label evolution proves stable, the method may substantially reduce the annotation cost of building trackers for new environments.
Extending the uncertainty model to handle long-term occlusions would be a direct next measurement on the same benchmarks.

Load-bearing premise

The three components together can resolve spatial ambiguity and identity drift when the only input is point seeds that carry no explicit geometric or scale information.

What would settle it

On a new video dataset or ablation setting, removing any one of the three components causes the tracker to fall below prior point-supervised methods or to exhibit clear rises in identity switches and localization errors.

Figures

Figures reproduced from arXiv: 2606.30476 by Fei Teng, Hao Shi, Kailun Yang, Kai Luo, Kunyu Peng, Mengfei Duan, Wanjun Jia, Xu Wang, Zhiyong Li.

**Figure 2.** Versatility of PS-Track across mainstream tracking paradigms.

… labels, we transition from deterministic regression to Uncertainty-Guided Gaussian Learning (UGL) at the loss level. Since point annotations lack physical extents, deterministic box regression causes noise memorization. UGL circumvents this by modeling pseudo-labels as Gaussian distributions, naturally accommodating spatial ambiguity. To miti…

**Figure 3.** The overview of our proposed framework. It operates on a coarse-to-fine paradigm across three levels: (a) Data Level: The Temporal-Feedback Prompting mechanism evolves sparse points into consistent pseudo-labels; (b) Model Level: The Point-Excited Wavelet Attention (PEWA) module leverages frequency decomposition to hallucinate boundaries from point cues; (c) Loss Level: The Uncertainty-Guided Gaussian Loss…

**Figure 4.** Visual comparison of pseudo-labels generated by the original SAM

**Figure 5.** Qualitative comparison on the DanceTrack dataset.

Original abstract

We introduce Point-supervised Multi-Object Tracking (PS-MOT) as a cost-effective alternative to traditional bounding box supervision, shifting the focus from spatial fitting to topological center-driven representation. However, PS-MOT faces challenges, e.g., spatial ambiguity and identity drift due to the lack of explicit geometric structure and scale constraints. To address these, we propose PS-Track, a hierarchical pipeline transitioning from points to instances across data, model, and loss levels. At the data level, we introduce Temporal-Feedback Prompting (TFP) to evolve points into temporally consistent pseudo-labels using negative spatial cues and motion priors. At the model level, we design the Point-Excited Wavelet Attention (PEWA) module, which leverages semantic correlations to activate high-frequency components, "hallucinating" object boundaries. At the loss level, Uncertainty-Guided Gaussian Learning (UGL) models pseudo-labels as probabilistic distributions, dynamically calibrating supervision intensity. Experiments on DanceTrack, EmboTrack, SportsMOT, and JRDB demonstrate that PS-Track provides a feasible and effective point-supervised alternative across diverse tracking scenarios, establishing a new state-of-the-art for point-supervised tracking. The source code is available at https://github.com/xifen523/PS-MOT.
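The abstract says TFP uses motion priors and negative spatial cues but does not say how. As a rough illustration only, the sketch below extrapolates each point track with a constant-velocity prior and treats the other objects' predicted points as negative prompts for the target. The helper names `predict_point` and `build_prompts`, the track format, and the constant-velocity choice are all assumptions, and no real segmentation API is called.

```python
def predict_point(prev, curr):
    """Constant-velocity extrapolation of a point track to the next frame."""
    return (2 * curr[0] - prev[0], 2 * curr[1] - prev[1])

def build_prompts(tracks, target_id):
    """Build (positive, negatives) point prompts for one object.

    tracks: dict mapping object id -> [(x, y) at frame t-1, (x, y) at frame t].
    The target's predicted point is the positive prompt; every other object's
    predicted point is a negative prompt, telling a promptable segmenter
    where the target is NOT.
    """
    predicted = {obj_id: predict_point(pts[0], pts[1]) for obj_id, pts in tracks.items()}
    positive = predicted[target_id]
    negatives = [pt for obj_id, pt in sorted(predicted.items()) if obj_id != target_id]
    return positive, negatives

# Two hypothetical tracks: object 1 moves right and down, object 2 moves left.
tracks = {1: [(10, 10), (12, 11)], 2: [(40, 30), (38, 30)]}
positive, negatives = build_prompts(tracks, target_id=1)
```

With these toy tracks, object 1 gets a positive prompt at its extrapolated position and a negative prompt at object 2's extrapolated position. This is the kind of cue that can help keep a mask from leaking onto a neighbor.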

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PS-Track gives a practical point-supervised MOT pipeline with code released and tests on four datasets, but the actual performance numbers and module contributions need checking.

The main thing here is a point-supervised MOT method that turns point seeds into tracked instances using three modules and ships the code. They test it on DanceTrack, EmboTrack, SportsMOT, and JRDB.

What stands out is the hierarchical setup: Temporal-Feedback Prompting builds temporally consistent pseudo-labels from points using motion and negative cues, Point-Excited Wavelet Attention tries to recover boundaries by exciting high-frequency parts from semantic correlations, and Uncertainty-Guided Gaussian Learning treats the pseudo-labels as distributions to soften the supervision. This directly targets the lack of geometry and scale in point annotations.
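For readers unsure what "high-frequency parts" means here, the following sketch runs one level of a plain 2D Haar wavelet transform in pure Python. It is not the PEWA module, and the paper's choice of wavelet is not stated in the abstract. The function name `haar_dwt2` and the toy image are hypothetical.

```python
def haar_dwt2(img):
    """One level of the 2D Haar transform on a list-of-lists image.

    Assumes even height and width. Returns four half-resolution maps:
    a low-pass band plus horizontal, vertical, and diagonal difference
    bands (the high-frequency components).
    """
    low, horiz, vert, diag = [], [], [], []
    for r in range(0, len(img), 2):
        row_low, row_h, row_v, row_d = [], [], [], []
        for c in range(0, len(img[0]), 2):
            a, b = img[r][c], img[r][c + 1]          # top-left, top-right
            p, q = img[r + 1][c], img[r + 1][c + 1]  # bottom-left, bottom-right
            row_low.append((a + b + p + q) / 2)  # scaled local sum (low-pass)
            row_h.append((a - b + p - q) / 2)    # change along x
            row_v.append((a + b - p - q) / 2)    # change along y
            row_d.append((a - b - p + q) / 2)    # diagonal change
        low.append(row_low)
        horiz.append(row_h)
        vert.append(row_v)
        diag.append(row_d)
    return low, horiz, vert, diag

# Toy 4x4 image with a vertical step edge between columns 0 and 1.
image = [[0, 1, 1, 1] for _ in range(4)]
low, horiz, vert, diag = haar_dwt2(image)
```

In this example only the horizontal-difference band is nonzero, and only in the blocks that contain the edge. High-frequency bands respond to boundaries in this way, which is why a module might amplify them to sharpen object extents.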

The paper does well by releasing the implementation and spreading the experiments across DanceTrack, EmboTrack, SportsMOT, and JRDB. That breadth makes the feasibility claim easier to take seriously than a single-dataset result.

The soft spots are straightforward. The abstract supplies no numbers, ablations, or error bars, so the SOTA claim for point-supervised tracking cannot be weighed yet. Without the full comparisons, it is also unclear how much the three modules depart from prior point-based or pseudo-label techniques. The wavelet "hallucination" step could introduce artifacts in cluttered scenes; the multi-dataset tests may have exposed such cases, but the abstract does not say.

This is for tracking researchers who want cheaper supervision options. Someone already running MOT experiments would get value from the code and the specific design choices for handling identity drift.

Send it to peer review. The core idea is coherent, the code is out, and the experiments cover enough ground that a referee can check the details.

Referee Report

0 major / 2 minor

Summary. The paper introduces Point-supervised Multi-Object Tracking (PS-MOT) as a cost-effective alternative to bounding box supervision, focusing on topological center-driven representation from point seeds. It proposes the PS-Track hierarchical pipeline with three components: Temporal-Feedback Prompting (TFP) at the data level to evolve points into temporally consistent pseudo-labels via negative spatial cues and motion priors; Point-Excited Wavelet Attention (PEWA) at the model level to leverage semantic correlations for activating high-frequency components and hallucinating object boundaries; and Uncertainty-Guided Gaussian Learning (UGL) at the loss level to model pseudo-labels as probabilistic distributions for dynamic supervision calibration. Experiments across DanceTrack, EmboTrack, SportsMOT, and JRDB are reported to demonstrate feasibility and establish a new state-of-the-art for point-supervised tracking, with source code released.

Significance. If the empirical claims hold, the work provides a practical reduction in annotation cost for MOT by shifting to point supervision while addressing the specific challenges of spatial ambiguity and identity drift through a structured data-model-loss pipeline. The open release of code supports reproducibility. This could meaningfully expand research directions in weakly-supervised tracking by showing that point seeds can suffice when augmented with the proposed mechanisms for pseudo-label evolution and boundary inference.

minor comments (2)

[Abstract] The SOTA claim is stated without any quantitative metrics, baseline comparisons, or dataset-specific scores; adding one or two key numbers (e.g., HOTA or MOTA deltas) would make the summary self-contained.
[Abstract and §3 (model level)] The phrase 'hallucinating object boundaries' is used without a precise technical definition or reference to how PEWA's wavelet activation produces explicit boundary outputs versus implicit feature enhancement; a short clarification or diagram reference would improve precision.
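For reference, one of the metrics named in the first comment is simple to compute. MOTA, from the CLEAR MOT metrics [4], is one minus the ratio of total errors (false negatives, false positives, identity switches) to the total number of ground-truth objects. The sketch below uses a hypothetical function name and made-up counts; it reports nothing from the paper.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt

# Made-up counts for illustration only; not results from the paper.
score = mota(false_negatives=120, false_positives=60, id_switches=20, num_gt=1000)
```

A "MOTA delta" in the referee's sense would be the difference between two such scores, for example a point-supervised tracker against a box-supervised baseline on the same split.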

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the thorough summary of our work, the positive assessment of its significance, and the recommendation for minor revision. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces PS-Track as a hierarchical pipeline with three explicitly proposed modules (Temporal-Feedback Prompting at data level, Point-Excited Wavelet Attention at model level, and Uncertainty-Guided Gaussian Learning at loss level) to address stated challenges of point supervision. These components are presented as novel designs whose effectiveness is validated through experiments on four external datasets, with source code released. No equations, self-citations, or claims reduce any prediction or result to a fitted parameter or definition internal to the same work by construction; the derivation chain remains self-contained and externally falsifiable via the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no equations, training details, or modeling choices, so no free parameters, axioms, or invented entities can be identified.

Reference graph

Works this paper leans on

72 extracted references · 12 canonical work pages · 3 internal anchors

[1] Aharon, N., Orfaig, R., Bobrovsky, B.Z.: BoT-SORT: Robust associations multi-pedestrian tracking. arXiv preprint arXiv:2206.14651 (2022)
[2] Bar, M.: A cortical mechanism for triggering top-down facilitation in visual object recognition. Journal of Cognitive Neuroscience (2003)
[3] Bearman, A., Russakovsky, O., Ferrari, V., Fei-Fei, L.: What’s the point: Semantic segmentation with point supervision. In: ECCV (2016)
[4] Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: The CLEAR MOT metrics. EURASIP Journal on Image and Video Processing (2008)
[5] Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple online and realtime tracking. In: ICIP (2016)
[6] Bilen, H., Vedaldi, A.: Weakly supervised deep detection networks. In: CVPR (2016)
[7] Cao, J., Pang, J., Weng, X., Khirodkar, R., Kitani, K.: Observation-centric SORT: Rethinking SORT for robust multi-object tracking. In: CVPR (2023)
[8] Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., Rädle, R., Afouras, T., Mavroudi, E., Xu, K., Wu, T.H., Zhou, Y., Momeni, L., Hazra, R., Ding, S., Vaze, S., Porcher, F., Li, F., Li, S., Kamath, A., Cheng, H.K., ...: SAM 3: Segment Anything with Concepts. arXiv (2025)
[9] Cetintas, O., Brasó, G., Leal-Taixé, L.: Unifying short and long-term tracking with graph hierarchies. In: CVPR (2023)
[10] Chan, S., Zhou, W., Lei, Y., Li, C., Hu, J., Hong, F.: Sparse point annotations for remote sensing image segmentation. Scientific Reports (2025)
[11] Chen, P., Yu, X., Han, X., Hassan, N., Wang, K., Li, J., Zhao, J., Shi, H., Han, Z., Ye, Q.: Point-to-box network for accurate object detection via single point supervision. In: ECCV (2022)
[12] Cheng, B., Parkhi, O., Kirillov, A.: Pointly-supervised instance segmentation. In: CVPR (2022)
[13] Cui, Y., Zeng, C., Zhao, X., Yang, Y., Wu, G., Wang, L.: SportsMOT: A large multi-object tracking dataset in multiple sports scenes. In: ICCV (2023)
[14] Dendorfer, P., Osep, A., Milan, A., Schindler, K., Cremers, D., Reid, I., Roth, S., Leal-Taixé, L.: MOTChallenge: A benchmark for single-camera multiple target tracking. International Journal of Computer Vision (2021)
[15] Dendorfer, P., Rezatofighi, H., Milan, A., Shi, J., Cremers, D., Reid, I., Roth, S., Schindler, K., Leal-Taixé, L.: MOT20: A benchmark for multi object tracking in crowded scenes. arXiv preprint arXiv:2003.09003 (2020)
[16] Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In: ICML (2016)
[17] Gao, R., Qi, J., Wang, L.: Multiple object tracking as ID prediction. In: CVPR (2025)
[18] Gao, R., Wang, L.: MeMOTR: Long-term memory-augmented transformer for multi-object tracking. In: ICCV (2023)
[19] Gao, X., Li, Z., Shi, H., Chen, Z., Zhao, P.: Scribble-supervised video object segmentation via scribble enhancement. IEEE Transactions on Circuits and Systems for Video Technology (2025)
[20] Ge, Z., Liu, S., Wang, F., Li, Z., Sun, J.: YOLOX: Exceeding YOLO series in 2021. arXiv preprint arXiv:2107.08430 (2021)
[21] Hayat, M., Aramvith, S.: Superpixel-guided graph-attention boundary GAN for adaptive feature refinement in scribble-supervised medical image segmentation. IEEE Access (2025)
[22] He, J., Huang, Z., Wang, N., Zhang, Z.: Learnable graph matching: Incorporating graph partitioning with deep feature learning for multiple object tracking. In: CVPR (2021)
[23] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV (2017)
[24] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
[25] Hu, Y., Yang, J., Chen, L., Li, K., Sima, C., Zhu, X., Chai, S., Du, S., Lin, T., Wang, W., Lu, L., Jia, X., Liu, Q., Dai, J., Qiao, Y., Li, H.: Planning-oriented autonomous driving. In: CVPR (2023)
[26] Jia, L., Wu, Y., Ran, B., Wang, Y., Wang, L., Lu, H.: AR-MOT: Autoregressive multi-object tracking. arXiv preprint arXiv:2601.01925 (2026)
[27] Kendall, A., Gal, Y.: What uncertainties do we need in Bayesian deep learning for computer vision? In: NeurIPS (2017)
[28] Li, S., Yang, L., Tan, H., Wang, B., Huang, W., Liu, H., Yang, W., Lan, L.: Self-supervised re-identification for online joint multi-object tracking. Knowledge and Information Systems (2025)
[29] Lim, J.S., Luo, Y., Chen, Z., Wei, T., Chapman, S., Huang, Z.: Track any peppers: Weakly supervised sweet pepper tracking using VLMs. arXiv preprint arXiv:2411.06702 (2024)
[30] Lu, Z., Shuai, B., Chen, Y., Xu, Z., Modolo, D.: Self-supervised multi-object tracking with path consistency. In: CVPR (2024)
[31] Luiten, J., Osep, A., Dendorfer, P., Torr, P.H.S., Geiger, A., Leal-Taixé, L., Leibe, B.: HOTA: A higher order metric for evaluating multi-object tracking. International Journal of Computer Vision (2021)
[32] Luo, K., Shi, H., Peng, K., Teng, F., Wu, S., Wang, K., Yang, K.: OmniTrack++: Omnidirectional multi-object tracking by learning large-FoV trajectory feedback. arXiv preprint arXiv:2511.00510 (2025)
[33] Luo, K., Shi, H., Wu, S., Teng, F., Duan, M., Huang, C., Wang, Y., Wang, K., Yang, K.: Omnidirectional multi-object tracking. In: CVPR (2025)
[34] Lv, W., Huang, Y., Zhang, N., Lin, R.S., Han, M., Zeng, D.: DiffMOT: A real-time diffusion-based multiple object tracker with non-linear prediction. In: CVPR (2024)
[35] Mao, J., Niu, M., Jiang, C., Liang, H., Chen, J., Liang, X., Li, Y., Ye, C., Zhang, W., Li, Z., Yu, J., Xu, H., Xu, C.X.: One million scenes for autonomous driving: Once dataset. In: NeurIPS (2021)
[36] Martin-Martin, R., Patel, M., Rezatofighi, H., Shenoi, A., Gwak, J., Frankel, E., Sadeghian, A., Savarese, S.: JRDB: A dataset and benchmark of egocentric robot visual perception of humans in built environments. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
[37] Meinhardt, T., Kirillov, A., Leal-Taixé, L., Feichtenhofer, C.: TrackFormer: Multi-object tracking with transformers. In: CVPR (2022)
[38] Pang, J., Qiu, L., Li, X., Chen, H., Li, Q., Darrell, T., Yu, F.: Quasi-dense similarity learning for multiple object tracking. In: CVPR (2021)
[39] Papadopoulos, D.P., Uijlings, J.R.R., Keller, F., Ferrari, V.: Extreme clicking for efficient object annotation. In: ICCV (2017)
[40] Peng, Y., Li, H., Wu, P., Zhang, Y., Sun, X., Wu, F.: D-FINE: Redefine regression task in detrs as fine-grained distribution refinement. arXiv preprint arXiv:2410.13842 (2024)
[41] Ren, H., Han, S., Ding, H., Zhang, Z., Wang, H., Wang, F.: Focus on details: Online multi-object tracking with diverse fine-grained representation. In: CVPR (2023)
[42] Ristani, E., Solera, F., Zou, R., Cucchiara, R., Tomasi, C.: Performance measures and a data set for multi-target, multi-camera tracking. In: ECCVW (2016)
[43] Shensa, M.J.: The discrete wavelet transform: wedding the a trous and Mallat algorithms. IEEE Transactions on Signal Processing (2002)
[44] Shim, K., Hwang, J., Ko, K., Kim, C.: A confidence-aware matching strategy for generalized multi-object tracking. In: ICIP (2024)
[45] Shim, K., Ko, K., Yang, Y., Kim, C.: Focusing on tracks for online multi-object tracking. In: CVPR (2025)
[46] Shu, Z., Wu, J., Yan, W., Liu, X., Zhang, H., Liu, C., Mao, Y., Chen, J.: WaveFormer: Frequency-time decoupled vision modeling with wave equation. In: AAAI (2026)
[47] Sun, P., Cao, J., Jiang, Y., Yuan, Z., Bai, S., Kitani, K., Luo, P.: DanceTrack: Multi-object tracking in uniform appearance and diverse motion. In: CVPR (2022)
[48] Sun, P., Cao, J., Jiang, Y., Zhang, R., Xie, E., Yuan, Z., Wang, C., Luo, P.: TransTrack: Multiple object tracking with transformer. arXiv preprint arXiv:2012.15460 (2020)
[49] Tan, C., Zhao, Y., Wei, S., Gu, G., Liu, P., Wei, Y.: Frequency-aware deepfake detection: Improving generalizability through frequency space domain learning. In: AAAI (2024)
[50] Tang, P., Wang, X., Bai, X., Liu, W.: Multiple instance detection network with online instance classifier refinement. In: CVPR (2017)
[51] Ullman, S.: Sequence seeking and counter streams: a computational model for bidirectional information flow in the visual cortex. Cerebral Cortex (1995)
[52] Wang, X., Liu, J., Feng, M., Zhang, Z., Yang, X.: 3D multi-object tracking with semi-supervised GRU-Kalman filter. arXiv preprint arXiv:2411.08433 (2024)
[53] Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. In: ICIP (2017)
[54] Wu, J., Cao, J., Song, L., Wang, Y., Yang, M., Yuan, J.: Track to detect and segment: An online multi-object tracker. In: CVPR (2021)
[55] Yang, M., Han, G., Yan, B., Zhang, W., Qi, J., Lu, H., Wang, D.: Hybrid-SORT: Weak cues matter for online multi-object tracking. In: AAAI (2024)
[56] Yi, K., Luo, K., Luo, X., Huang, J., Wu, H., Hu, R., Hao, W.: UCMCTrack: Multi-object tracking with uniform camera motion compensation. In: AAAI (2024)
[57] Yu, X., Chen, P., Wu, D., Hassan, N., Li, G., Yan, J., Shi, H., Ye, Q., Han, Z.: Object localization under single coarse point supervision. In: CVPR (2022)
[58] Yu, Y., Ren, B., Zhang, P., Liu, M., Luo, J., Zhang, S., Da, F., Yan, J., Yang, X.: Point2RBox-v2: Rethinking point-supervised oriented object detection with spatial layout among instances. In: CVPR (2025)
[59] Yu, Y., Yang, X., Li, Q., Da, F., Dai, J., Qiao, Y., Yan, J.: Point2RBox: Combine knowledge from synthetic visual patterns for end-to-end oriented object detection with single point supervision. In: CVPR (2024)
[60] Zeng, F., Dong, B., Zhang, Y., Wang, T., Zhang, X., Wei, Y.: MOTR: End-to-end multiple-object tracking with transformer. In: ECCV (2022)
[61] Zhang, T., Fan, Z., Liu, M., Zhang, X., Lu, X., Li, W., Zhou, Y., Yu, Y., Li, X., Yan, J., Yang, X.: Point2RBox-v3: Self-bootstrapping from point annotations via integrated pseudo-label refinement and utilization. arXiv preprint arXiv:2509.26281 (2025)
[62] Zhang, Y., Sun, P., Jiang, Y., Yu, D., Weng, F., Yuan, Z., Luo, P., Liu, W., Wang, X.: ByteTrack: Multi-object tracking by associating every detection box. In: ECCV (2022)
[63] Zhang, Y., Wang, C., Wang, X., Zeng, W., Liu, W.: FairMOT: On the fairness of detection and re-identification in multiple object tracking. International Journal of Computer Vision (2021)
[64] Zhang, Y., Wang, T., Zhang, X.: MOTRv2: Bootstrapping end-to-end multi-object tracking by pretrained object detectors. In: CVPR (2023)
[65] Zhao, Z., Yu, J., Zhang, L., Zhang, S.: CRTrack: Low-light semi-supervised multi-object tracking based on consistency regularization. arXiv preprint arXiv:2502.16809 (2025)
[66] Zheng, M., Xu, Z., Xia, Q., Wu, H., Wen, C., Wang, C.: Seg2Box: 3D object detection by point-wise semantics supervision. In: AAAI (2025)
[67] Zheng, S., Wu, Z., Xu, Y., Wei, Z., Plaza, A.: Learning orientation information from frequency-domain for oriented object detection in remote sensing images. IEEE Transactions on Geoscience and Remote Sensing (2022)
[68] Zhou, X., Koltun, V., Krähenbühl, P.: Tracking objects as points. In: ECCV (2020)
[69] Zhou, X., Yin, T., Koltun, V., Krähenbühl, P.: Global tracking transformers. In: CVPR (2022)
[70] Zhou, Y., Cai, L., Cheng, X., Zhang, Q., Xue, X., Ding, W., Pu, J.: OpenAnnotate2: Multi-modal auto-annotating for autonomous driving. IEEE Transactions on Intelligent Vehicles (2024)
[71] Zhu, R., Zhao, J., Zhang, D., Wang, G., Chen, X., Zhang, S., Gong, J., Zhou, Q., Zhang, W., Wang, N., Tan, F., Xu, Z., Zhou, H., Yao, H., Zhang, C., Liu, L., Liu, X., Di, X., Li, B.: SparseAD: Sparse query-centric paradigm for efficient end-to-end autonomous driving. IEEE Transactions on Artificial Intelligence (2025)
[72] Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: Deformable transformers for end-to-end object detection. In: ICLR (2021)

A Implementation Details

Detailed Network Architecture. We implement PS-Track built upon the MOTIP [17] framework, adopting Deformable DETR [72] as our core detector. The visual features are extract...

[1] [1]

BoT-SORT: Robust associations multi-pedestrian tracking,

Aharon, N., Orfaig, R., Bobrovsky, B.Z.: BoT-SORT: Robust associations multi- pedestrian tracking. arXiv preprint arXiv:2206.14651 (2022)

work page arXiv 2022

[2] [2]

Journal of Cognitive Neuroscience (2003)

Bar, M.: A cortical mechanism for triggering top-down facilitation in visual object recognition. Journal of Cognitive Neuroscience (2003)

2003

[3] [3]

In: ECCV (2016)

Bearman, A., Russakovsky, O., Ferrari, V., Fei-Fei, L.: What’s the point: Semantic segmentation with point supervision. In: ECCV (2016)

2016

[4] [4]

EURASIP Journal on Image and Video Processing (2008)

Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: The CLEAR MOT metrics. EURASIP Journal on Image and Video Processing (2008)

2008

[5] [5]

In: ICIP (2016)

Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple online and realtime tracking. In: ICIP (2016)

2016

[6] [6]

In: CVPR (2016)

Bilen, H., Vedaldi, A.: Weakly supervised deep detection networks. In: CVPR (2016)

2016

[7] [7]

In: CVPR (2023)

Cao, J., Pang, J., Weng, X., Khirodkar, R., Kitani, K.: Observation-centric SORT: Rethinking SORT for robust multi-object tracking. In: CVPR (2023)

2023

[8] [8]

SAM 3: Segment Anything with Concepts

Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., Rädle, R., Afouras, T., Mavroudi, E., Xu, K., Wu, T.H., Zhou, Y., Momeni, L., Hazra, R., Ding, S., Vaze, S., Porcher, F., Li, F., Li, S., Kamath, A., Cheng, H.K., ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

In: CVPR (2023)

Cetintas, O., Brasó, G., Leal-Taixé, L.: Unifying short and long-term tracking with graph hierarchies. In: CVPR (2023)

2023

[10] [10]

Scientific Reports (2025)

Chan, S., Zhou, W., Lei, Y., Li, C., Hu, J., Hong, F.: Sparse point annotations for remote sensing image segmentation. Scientific Reports (2025)

2025

[11] [11]

In: ECCV (2022)

Chen, P., Yu, X., Han, X., Hassan, N., Wang, K., Li, J., Zhao, J., Shi, H., Han, Z., Ye, Q.: Point-to-box network for accurate object detection via single point supervision. In: ECCV (2022)

2022

[12] [12]

In: CVPR (2022)

Cheng, B., Parkhi, O., Kirillov, A.: Pointly-supervised instance segmentation. In: CVPR (2022)

2022

[13] [13]

In: ICCV (2023)

Cui, Y., Zeng, C., Zhao, X., Yang, Y., Wu, G., Wang, L.: SportsMOT: A large multi-object tracking dataset in multiple sports scenes. In: ICCV (2023)

2023

[14] [14]

International Journal of Computer Vision (2021)

Dendorfer, P., Osep, A., Milan, A., Schindler, K., Cremers, D., Reid, I., Roth, S., Leal-Taixé, L.: MOTChallenge: A benchmark for single-camera multiple target tracking. International Journal of Computer Vision (2021)

2021

[15] [15]

arXiv preprint arXiv:2003.09003 (2020)

Dendorfer, P., Rezatofighi, H., Milan, A., Shi, J., Cremers, D., Reid, I., Roth, S., Schindler, K., Leal-Taixé, L.: MOT20: A benchmark for multi object tracking in crowded scenes. arXiv preprint arXiv:2003.09003 (2020)

work page arXiv 2003

[16] [16]

In: ICML (2016)

Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In: ICML (2016)

2016

[17] [17]

In: CVPR (2025)

Gao, R., Qi, J., Wang, L.: Multiple object tracking as ID prediction. In: CVPR (2025)

2025

[18] [18]

In: ICCV (2023)

Gao, R., Wang, L.: MeMOTR: Long-term memory-augmented transformer for multi-object tracking. In: ICCV (2023)

2023

[19] [19]

IEEE Transactions on Circuits and Systems for Video Technology (2025) PS-MOT 17

Gao, X., Li, Z., Shi, H., Chen, Z., Zhao, P.: Scribble-supervised video object seg- mentation via scribble enhancement. IEEE Transactions on Circuits and Systems for Video Technology (2025) PS-MOT 17

2025

[20] [20]

YOLOX: Exceeding YOLO Series in 2021

Ge, Z., Liu, S., Wang, F., Li, Z., Sun, J.: YOLOX: Exceeding YOLO series in 2021. arXiv preprint arXiv:2107.08430 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021

[21] [21]

IEEE Access (2025)

Hayat, M., Aramvith, S.: Superpixel-guided graph-attention boundary GAN for adaptive feature refinement in scribble-supervised medical image segmentation. IEEE Access (2025)

2025

[22] [22]

In: CVPR (2021)

He, J., Huang, Z., Wang, N., Zhang, Z.: Learnable graph matching: Incorporat- ing graph partitioning with deep feature learning for multiple object tracking. In: CVPR (2021)

2021

[23] [23]

In: ICCV (2017)

He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV (2017)

2017

[24] [24]

In: CVPR (2016)

He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)

2016

[25] Hu, Y., Yang, J., Chen, L., Li, K., Sima, C., Zhu, X., Chai, S., Du, S., Lin, T., Wang, W., Lu, L., Jia, X., Liu, Q., Dai, J., Qiao, Y., Li, H.: Planning-oriented autonomous driving. In: CVPR (2023)

[26] Jia, L., Wu, Y., Ran, B., Wang, Y., Wang, L., Lu, H.: AR-MOT: Autoregressive multi-object tracking. arXiv preprint arXiv:2601.01925 (2026)

[27] Kendall, A., Gal, Y.: What uncertainties do we need in Bayesian deep learning for computer vision? In: NeurIPS (2017)

[28] Li, S., Yang, L., Tan, H., Wang, B., Huang, W., Liu, H., Yang, W., Lan, L.: Self-supervised re-identification for online joint multi-object tracking. Knowledge and Information Systems (2025)

[29] Lim, J.S., Luo, Y., Chen, Z., Wei, T., Chapman, S., Huang, Z.: Track any peppers: Weakly supervised sweet pepper tracking using VLMs. arXiv preprint arXiv:2411.06702 (2024)

[30] Lu, Z., Shuai, B., Chen, Y., Xu, Z., Modolo, D.: Self-supervised multi-object tracking with path consistency. In: CVPR (2024)

[31] Luiten, J., Osep, A., Dendorfer, P., Torr, P.H.S., Geiger, A., Leal-Taixé, L., Leibe, B.: HOTA: A higher order metric for evaluating multi-object tracking. International Journal of Computer Vision (2021)

[32] Luo, K., Shi, H., Peng, K., Teng, F., Wu, S., Wang, K., Yang, K.: OmniTrack++: Omnidirectional multi-object tracking by learning large-FoV trajectory feedback. arXiv preprint arXiv:2511.00510 (2025)

[33] Luo, K., Shi, H., Wu, S., Teng, F., Duan, M., Huang, C., Wang, Y., Wang, K., Yang, K.: Omnidirectional multi-object tracking. In: CVPR (2025)

[34] Lv, W., Huang, Y., Zhang, N., Lin, R.S., Han, M., Zeng, D.: DiffMOT: A real-time diffusion-based multiple object tracker with non-linear prediction. In: CVPR (2024)

[35] Mao, J., Niu, M., Jiang, C., Liang, H., Chen, J., Liang, X., Li, Y., Ye, C., Zhang, W., Li, Z., Yu, J., Xu, H., Xu, C.X.: One million scenes for autonomous driving: ONCE dataset. In: NeurIPS (2021)

[36] Martin-Martin, R., Patel, M., Rezatofighi, H., Shenoi, A., Gwak, J., Frankel, E., Sadeghian, A., Savarese, S.: JRDB: A dataset and benchmark of egocentric robot visual perception of humans in built environments. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)

[37] Meinhardt, T., Kirillov, A., Leal-Taixé, L., Feichtenhofer, C.: TrackFormer: Multi-object tracking with transformers. In: CVPR (2022)

[38] Pang, J., Qiu, L., Li, X., Chen, H., Li, Q., Darrell, T., Yu, F.: Quasi-dense similarity learning for multiple object tracking. In: CVPR (2021)

[39] Papadopoulos, D.P., Uijlings, J.R.R., Keller, F., Ferrari, V.: Extreme clicking for efficient object annotation. In: ICCV (2017)

[40] Peng, Y., Li, H., Wu, P., Zhang, Y., Sun, X., Wu, F.: D-FINE: Redefine regression task in DETRs as fine-grained distribution refinement. arXiv preprint arXiv:2410.13842 (2024)

[41] Ren, H., Han, S., Ding, H., Zhang, Z., Wang, H., Wang, F.: Focus on details: Online multi-object tracking with diverse fine-grained representation. In: CVPR (2023)

[42] Ristani, E., Solera, F., Zou, R., Cucchiara, R., Tomasi, C.: Performance measures and a data set for multi-target, multi-camera tracking. In: ECCVW (2016)

[43] Shensa, M.J.: The discrete wavelet transform: wedding the a trous and Mallat algorithms. IEEE Transactions on Signal Processing (1992)

[44] Shim, K., Hwang, J., Ko, K., Kim, C.: A confidence-aware matching strategy for generalized multi-object tracking. In: ICIP (2024)

[45] Shim, K., Ko, K., Yang, Y., Kim, C.: Focusing on tracks for online multi-object tracking. In: CVPR (2025)

[46] Shu, Z., Wu, J., Yan, W., Liu, X., Zhang, H., Liu, C., Mao, Y., Chen, J.: WaveFormer: Frequency-time decoupled vision modeling with wave equation. In: AAAI (2026)

[47] Sun, P., Cao, J., Jiang, Y., Yuan, Z., Bai, S., Kitani, K., Luo, P.: DanceTrack: Multi-object tracking in uniform appearance and diverse motion. In: CVPR (2022)

[48] Sun, P., Cao, J., Jiang, Y., Zhang, R., Xie, E., Yuan, Z., Wang, C., Luo, P.: TransTrack: Multiple object tracking with transformer. arXiv preprint arXiv:2012.15460 (2020)

[49] Tan, C., Zhao, Y., Wei, S., Gu, G., Liu, P., Wei, Y.: Frequency-aware deepfake detection: Improving generalizability through frequency space domain learning. In: AAAI (2024)

[50] Tang, P., Wang, X., Bai, X., Liu, W.: Multiple instance detection network with online instance classifier refinement. In: CVPR (2017)

[51] Ullman, S.: Sequence seeking and counter streams: a computational model for bidirectional information flow in the visual cortex. Cerebral Cortex (1995)

[52] Wang, X., Liu, J., Feng, M., Zhang, Z., Yang, X.: 3D multi-object tracking with semi-supervised GRU-Kalman filter. arXiv preprint arXiv:2411.08433 (2024)

[53] Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. In: ICIP (2017)

[54] Wu, J., Cao, J., Song, L., Wang, Y., Yang, M., Yuan, J.: Track to detect and segment: An online multi-object tracker. In: CVPR (2021)

[55] Yang, M., Han, G., Yan, B., Zhang, W., Qi, J., Lu, H., Wang, D.: Hybrid-SORT: Weak cues matter for online multi-object tracking. In: AAAI (2024)

[56] Yi, K., Luo, K., Luo, X., Huang, J., Wu, H., Hu, R., Hao, W.: UCMCTrack: Multi-object tracking with uniform camera motion compensation. In: AAAI (2024)

[57] Yu, X., Chen, P., Wu, D., Hassan, N., Li, G., Yan, J., Shi, H., Ye, Q., Han, Z.: Object localization under single coarse point supervision. In: CVPR (2022)

[58] Yu, Y., Ren, B., Zhang, P., Liu, M., Luo, J., Zhang, S., Da, F., Yan, J., Yang, X.: Point2RBox-v2: Rethinking point-supervised oriented object detection with spatial layout among instances. In: CVPR (2025)

[59] Yu, Y., Yang, X., Li, Q., Da, F., Dai, J., Qiao, Y., Yan, J.: Point2RBox: Combine knowledge from synthetic visual patterns for end-to-end oriented object detection with single point supervision. In: CVPR (2024)

[60] Zeng, F., Dong, B., Zhang, Y., Wang, T., Zhang, X., Wei, Y.: MOTR: End-to-end multiple-object tracking with transformer. In: ECCV (2022)

[61] Zhang, T., Fan, Z., Liu, M., Zhang, X., Lu, X., Li, W., Zhou, Y., Yu, Y., Li, X., Yan, J., Yang, X.: Point2RBox-v3: Self-bootstrapping from point annotations via integrated pseudo-label refinement and utilization. arXiv preprint arXiv:2509.26281 (2025)

[62] Zhang, Y., Sun, P., Jiang, Y., Yu, D., Weng, F., Yuan, Z., Luo, P., Liu, W., Wang, X.: ByteTrack: Multi-object tracking by associating every detection box. In: ECCV (2022)

[63] Zhang, Y., Wang, C., Wang, X., Zeng, W., Liu, W.: FairMOT: On the fairness of detection and re-identification in multiple object tracking. International Journal of Computer Vision (2021)

[64] Zhang, Y., Wang, T., Zhang, X.: MOTRv2: Bootstrapping end-to-end multi-object tracking by pretrained object detectors. In: CVPR (2023)

[65] Zhao, Z., Yu, J., Zhang, L., Zhang, S.: CRTrack: Low-light semi-supervised multi-object tracking based on consistency regularization. arXiv preprint arXiv:2502.16809 (2025)

[66] Zheng, M., Xu, Z., Xia, Q., Wu, H., Wen, C., Wang, C.: Seg2Box: 3D object detection by point-wise semantics supervision. In: AAAI (2025)

[67] Zheng, S., Wu, Z., Xu, Y., Wei, Z., Plaza, A.: Learning orientation information from frequency-domain for oriented object detection in remote sensing images. IEEE Transactions on Geoscience and Remote Sensing (2022)

[68] Zhou, X., Koltun, V., Krähenbühl, P.: Tracking objects as points. In: ECCV (2020)

[69] Zhou, X., Yin, T., Koltun, V., Krähenbühl, P.: Global tracking transformers. In: CVPR (2022)

[70] Zhou, Y., Cai, L., Cheng, X., Zhang, Q., Xue, X., Ding, W., Pu, J.: OpenAnnotate2: Multi-modal auto-annotating for autonomous driving. IEEE Transactions on Intelligent Vehicles (2024)

[71] Zhu, R., Zhao, J., Zhang, D., Wang, G., Chen, X., Zhang, S., Gong, J., Zhou, Q., Zhang, W., Wang, N., Tan, F., Xu, Z., Zhou, H., Yao, H., Zhang, C., Liu, L., Liu, X., Di, X., Li, B.: SparseAD: Sparse query-centric paradigm for efficient end-to-end autonomous driving. IEEE Transactions on Artificial Intelligence (2025)

[72] Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: Deformable transformers for end-to-end object detection. In: ICLR (2021)

A Implementation Details

Detailed Network Architecture. We build PS-Track on the MOTIP [17] framework and adopt Deformable DETR [72] as the core detector. The visual features are extract...