
arxiv: 2606.29861 · v1 · pith:G5H3BMWMnew · submitted 2026-06-29 · 💻 cs.CV · cs.AI

SUMO: Segment and Track Any Motion with Nonlinear State Space Models

Pith reviewed 2026-06-30 06:35 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords visual object tracking · moving object segmentation · state space models · nonlinear dynamics · unscented filter · zero-shot tracking

The pith

A nonlinear state space model from robotics enables zero-shot tracking and segmentation of objects with complex motions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SUMO as a unified, training-free framework that combines nonlinear dynamics modeling with vision-based segmentation to solve visual object tracking and moving object segmentation. It claims that prior methods struggle because they depend predominantly on visual cues and ignore the underlying nonlinear motion patterns. SUMO builds a robotics-inspired nonlinear state space model, then uses a Selective Unscented Filter to fuse predictions from multiple sources, and applies a memory selection mechanism to pick reliable memory frames. Together these are meant to produce more consistent object states over time. If correct, the method would deliver stronger performance on real videos without needing task-specific training data.

Core claim

SUMO develops a nonlinear State Space Model to represent object dynamics and introduces a Selective Unscented Filter that applies joint scoring and dynamic fusion of multi-source predictions, together with a memory selection mechanism, to achieve state-of-the-art results on VOT and MOS benchmarks in a zero-shot setting.

What carries the argument

The nonlinear State Space Model (SSM) that encodes object motion dynamics, paired with the Selective Unscented Filter (SUF) that performs state estimation through scoring and fusion of predictions.
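The pairing described above can be illustrated with a generic unscented prediction step. The sketch below is not SUMO's model: this page gives no equations, so the unicycle motion function, its speed and turn-rate values, the noise matrices, and every function name are assumptions chosen for illustration. It uses the standard Julier-Uhlmann sigma-point construction [24] to push a Gaussian state estimate through a nonlinear motion function without linearizing it.

```python
import math

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def sigma_points(mean, cov, kappa=0.0):
    """Return 2n+1 sigma points and their weights (Julier-Uhlmann form)."""
    n = len(mean)
    scale = n + kappa
    root = cholesky([[scale * c for c in row] for row in cov])
    points = [list(mean)]
    weights = [kappa / scale]
    for j in range(n):
        col = [root[i][j] for i in range(n)]
        points.append([m + c for m, c in zip(mean, col)])
        points.append([m - c for m, c in zip(mean, col)])
        weights += [0.5 / scale, 0.5 / scale]
    return points, weights

def unicycle_step(state, v=5.0, omega=0.3, dt=1.0):
    """Hypothetical nonlinear motion model: state is [x, y, heading]."""
    x, y, theta = state
    return [x + v * math.cos(theta) * dt,
            y + v * math.sin(theta) * dt,
            theta + omega * dt]

def unscented_predict(mean, cov, f, process_noise):
    """Propagate mean and covariance through f using sigma points."""
    points, weights = sigma_points(mean, cov)
    moved = [f(p) for p in points]
    n = len(mean)
    # Headings are averaged linearly, which is adequate only for small spreads.
    new_mean = [sum(w * p[i] for w, p in zip(weights, moved)) for i in range(n)]
    new_cov = [[process_noise[i][j]
                + sum(w * (p[i] - new_mean[i]) * (p[j] - new_mean[j])
                      for w, p in zip(weights, moved))
                for j in range(n)] for i in range(n)]
    return new_mean, new_cov

# Illustrative inputs (assumed values, not taken from the paper).
mean0 = [0.0, 0.0, 0.0]
cov0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.01]]
noise = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.01]]
mean1, cov1 = unscented_predict(mean0, cov0, unicycle_step, noise)
print(mean1)
```

With these inputs the predicted mean x-position comes out slightly below the 5.0 that a single straight-line step from the mean would give, because sigma points with different headings curve away from the x-axis. A filter that only propagates the mean through the motion function would miss that shortening.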

Load-bearing premise

The nonlinear state space model accurately captures the motion patterns of objects in real videos.

What would settle it

Videos in which object motion deviates strongly from the assumed nonlinear dynamics and the Selective Unscented Filter produces incorrect state estimates despite clear visual evidence.

Figures

Figures reproduced from arXiv: 2606.29861 by Keshu Wu, Kexin Tian, Sixu Li, Yang Zhou, Zhengzhong Tu.

Figure 1: Comprehensive experiments demonstrate that SUMO
Figure 2: Framework Architecture. 3.1.2. Memory Attention. This module integrates selected memory frames into the current frame via a memory selection mechanism, which will be detailed in Eq. (29), and a transformer-based attention architecture that follows the SAM2 [50] design, which applies self-attention to the current frame features followed by cross-attention to the memory features. 3.1.3. Prompt Encoder. In the i…
Figure 3: Qualitative visualization of VOT performance.
Figure 4: Qualitative visualization of MOS performance.
Original abstract

Visual Object Tracking (VOT) and Moving Object Segmentation (MOS) are two fundamental tasks in computer vision that involve both spatial and temporal object dynamics. Existing methods rely predominantly on visual cues and thus often falter in real-world scenarios where object motions are inherently complex and nonlinear. To address this limitation, we propose SUMO, a zero-shot, training-free, unified framework integrating nonlinear dynamics with vision-based segmentation for accurate and consistent VOT and MOS. Specifically, we develop a nonlinear State Space Model (SSM) inspired by robotics principles to capture the complex object dynamics. Building on this model, we propose a Selective Unscented Filter (SUF) for accurate state estimation, which features a joint scoring mechanism and dynamically fuses multi-source predictions to identify the most plausible object state over time. Furthermore, we apply a memory selection mechanism to evaluate the reliability of memory frames. Our extensive experimental results show that SUMO achieves state-of-the-art performance on both VOT and MOS tasks.
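The abstract names a joint scoring mechanism, dynamic fusion of multi-source predictions, and a memory selection step, but gives no formulas. The sketch below shows one plausible shape such a pipeline could take; it is an assumption throughout, not the paper's method. The score (a weighted sum of motion agreement and visual confidence), the weight `alpha`, the score-weighted box blend, the `memory_threshold`, and all function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def joint_score(candidate_box, visual_conf, motion_box, alpha=0.5):
    """Hypothetical joint score: motion agreement blended with visual confidence."""
    return alpha * iou(candidate_box, motion_box) + (1 - alpha) * visual_conf

def select_and_fuse(candidates, motion_box, alpha=0.5, memory_threshold=0.6):
    """Pick the best-scoring visual candidate and blend it with the motion prediction.

    candidates: list of (box, visual_confidence) pairs from the segmenter.
    motion_box: box predicted by the dynamics model for the current frame.
    """
    scored = [(joint_score(box, conf, motion_box, alpha), box)
              for box, conf in candidates]
    best_score, best_box = max(scored, key=lambda item: item[0])
    # Trust the visual candidate in proportion to its joint score.
    fused = tuple(best_score * v + (1 - best_score) * m
                  for v, m in zip(best_box, motion_box))
    # Only frames with a reliable estimate are admitted to the memory bank.
    keep_in_memory = best_score >= memory_threshold
    return fused, best_score, keep_in_memory

# Illustrative inputs (assumed values).
motion_prediction = (10.0, 10.0, 20.0, 20.0)
mask_candidates = [((11.0, 11.0, 21.0, 21.0), 0.7),   # near the predicted state
                   ((40.0, 40.0, 50.0, 50.0), 0.9)]   # confident but far away
print(select_and_fuse(mask_candidates, motion_prediction))
```

In this toy case the visually more confident candidate loses because it disagrees with the motion prediction, which is the kind of behavior a joint score is meant to produce when a distractor resembles the target.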

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes SUMO, a zero-shot training-free unified framework for Visual Object Tracking (VOT) and Moving Object Segmentation (MOS). It combines a robotics-inspired nonlinear state space model (SSM) to capture complex object dynamics, a Selective Unscented Filter (SUF) with joint scoring and dynamic multi-source fusion for state estimation, and a memory selection mechanism. The central claim is that this yields state-of-the-art performance on both VOT and MOS tasks.

Significance. If the results hold, the work could be significant as a training-free alternative that incorporates nonlinear dynamics from robotics to improve robustness on complex motions where visual-cue-only methods fail. The zero-shot nature and the SUF's joint scoring mechanism are strengths that could generalize across tasks. However, the significance hinges on whether the nonlinear SSM component is demonstrably necessary, which is not yet established.

major comments (2)
  1. [Method description of nonlinear SSM and SUF] The SOTA claim on VOT and MOS rests on the premise that the nonlinear SSM plus SUF delivers superior state estimation, yet the manuscript provides no ablation that replaces the nonlinear dynamics with a linear SSM (or EKF/UKF) while keeping the SUF, joint scoring, and segmentation pipeline fixed. This is load-bearing for the central claim that nonlinear modeling is required to address the limitations of existing methods.
  2. [Experimental results] No quantitative tables, baseline comparisons, or statistical details are referenced to support the SOTA performance assertions, and the experimental evaluation lacks controls that would allow attribution of gains specifically to the robotics-inspired nonlinear component versus the vision backbone.
minor comments (1)
  1. The abstract is dense and would benefit from explicit separation of the three main contributions (nonlinear SSM, SUF, memory selection) for readability.
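The ablation requested in major comment 1 can be pictured with a toy comparison. The sketch below is a synthetic illustration under stated assumptions, not an experiment from the paper: the circular trajectory, the two one-step predictors, and all names and parameter values are hypothetical. It swaps a linear constant-velocity predictor for a turn-aware nonlinear one while holding everything else fixed, which is the structure the referee asks for.

```python
import math

def circular_track(steps, speed=5.0, omega=0.3, dt=1.0):
    """Synthetic ground truth: an object moving at constant speed and turn rate."""
    x, y, theta = 0.0, 0.0, 0.0
    track = [(x, y)]
    for _ in range(steps):
        x += speed * math.cos(theta) * dt
        y += speed * math.sin(theta) * dt
        theta += omega * dt
        track.append((x, y))
    return track

def predict_constant_velocity(p_prev, p_curr):
    """Linear model: repeat the last displacement."""
    return (2 * p_curr[0] - p_prev[0], 2 * p_curr[1] - p_prev[1])

def predict_constant_turn(p_prev2, p_prev, p_curr):
    """Nonlinear model: rotate the last displacement by the last observed turn."""
    d1 = (p_prev[0] - p_prev2[0], p_prev[1] - p_prev2[1])
    d2 = (p_curr[0] - p_prev[0], p_curr[1] - p_prev[1])
    turn = math.atan2(d2[1], d2[0]) - math.atan2(d1[1], d1[0])
    c, s = math.cos(turn), math.sin(turn)
    return (p_curr[0] + c * d2[0] - s * d2[1],
            p_curr[1] + s * d2[0] + c * d2[1])

def mean_error(track, predictor, history):
    """Average one-step-ahead position error using `history` past points."""
    errors = []
    for t in range(history, len(track)):
        predicted = predictor(*track[t - history:t])
        errors.append(math.dist(predicted, track[t]))
    return sum(errors) / len(errors)

track = circular_track(steps=30)
cv_error = mean_error(track, predict_constant_velocity, history=2)
ct_error = mean_error(track, predict_constant_turn, history=3)
print(f"constant velocity: {cv_error:.3f}  constant turn: {ct_error:.3e}")
```

On this synthetic arc the constant-velocity predictor misses each step by 2·speed·sin(omega/2), while the turn-aware predictor is exact up to rounding. Whether a comparable gap appears on real VOT and MOS benchmarks is what the requested ablation would have to show.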

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight the need for stronger empirical isolation of the nonlinear SSM's contribution, which we address by committing to additional experiments in revision. We respond point-by-point below.

Point-by-point responses
  1. Referee: [Method description of nonlinear SSM and SUF] The SOTA claim on VOT and MOS rests on the premise that the nonlinear SSM plus SUF delivers superior state estimation, yet the manuscript provides no ablation that replaces the nonlinear dynamics with a linear SSM (or EKF/UKF) while keeping the SUF, joint scoring, and segmentation pipeline fixed. This is load-bearing for the central claim that nonlinear modeling is required to address the limitations of existing methods.

    Authors: We agree that an explicit ablation isolating the nonlinear dynamics (replacing the nonlinear SSM with a linear SSM or EKF/UKF while freezing SUF, joint scoring, and the segmentation pipeline) is necessary to substantiate the central claim. The current manuscript motivates the nonlinear SSM from robotics principles for complex motions but does not include this controlled comparison. In the revised version we will add the requested ablation on standard VOT and MOS benchmarks and report the resulting performance deltas. revision: yes

  2. Referee: [Experimental results] No quantitative tables, baseline comparisons, or statistical details are referenced to support the SOTA performance assertions, and the experimental evaluation lacks controls that would allow attribution of gains specifically to the robotics-inspired nonlinear component versus the vision backbone.

    Authors: The manuscript contains quantitative results and baseline comparisons in the experimental section; however, we acknowledge that these do not yet include the specific controls needed to attribute gains to the nonlinear SSM versus the vision backbone. The additional ablation described in the response to the first comment will directly address this attribution gap. We will also ensure all tables, statistical details, and controls are clearly referenced in the revised text. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical SOTA claim rests on independent model proposal and experiments

Full rationale

The paper proposes a new zero-shot framework (nonlinear SSM + SUF + memory selection) for VOT/MOS, with performance claims grounded in experimental results on standard benchmarks rather than any derivation that reduces to its own inputs. No equations, fitted parameters, or self-citations appear in the abstract or described structure that would trigger self-definitional, fitted-input, or load-bearing self-citation patterns. The robotics inspiration is presented as motivation for an ansatz, not as a uniqueness theorem imported from prior author work. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; all technical details are absent.



Reference graph

Works this paper leans on

68 extracted references · 7 canonical work pages · 4 internal anchors

  1. [1]

    Progressive-x: Efficient, anytime, multi-model fitting algorithm

    Daniel Barath and Jiri Matas. Progressive-x: Efficient, anytime, multi-model fitting algorithm. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3780–3788, 2019.

  2. [2]

    Time optimal trajectories for a car-like mobile robot

    Joseph Z Ben-Asher and Elon D Rimon. Time optimal trajectories for a car-like mobile robot. IEEE Transactions on Robotics, 38(1):421–432, 2021.

  3. [3]

    Fully-convolutional siamese networks for object tracking

    Luca Bertinetto, Jack Valmadre, Joao F Henriques, Andrea Vedaldi, and Philip HS Torr. Fully-convolutional siamese networks for object tracking. In Computer vision–ECCV 2016 workshops: Amsterdam, the Netherlands, October 8-10 and 15-16, 2016, proceedings, part II 14, pages 850–865. Springer, 2016.

  4. [4]

    Learning discriminative model prediction for tracking

    Goutam Bhat, Martin Danelljan, Luc Van Gool, and Radu Timofte. Learning discriminative model prediction for tracking. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6182–6191, 2019.

  5. [5]

    It’s moving! a probabilistic model for causal motion segmentation in moving camera videos

    Pia Bideau and Erik Learned-Miller. It’s moving! a probabilistic model for causal motion segmentation in moving camera videos. In European Conference on Computer Vision, pages 433–449. Springer, 2016.

  6. [6]

    Deep learning for robust motion segmentation with non-static cameras

    Markus Bosch. Deep learning for robust motion segmentation with non-static cameras. arXiv preprint arXiv:2102.10929, 2021.

  7. [7]

    Robust object modeling for visual tracking

    Yidong Cai, Jie Liu, Jie Tang, and Gangshan Wu. Robust object modeling for visual tracking. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9589–9600, 2023.

  8. [8]

    Linear system theory

    Frank M Callier and Charles A Desoer. Linear system theory. Springer Science & Business Media, 2012.

  9. [9]

    Learning independent object motion from unlabelled stereoscopic videos

    Zhe Cao, Abhishek Kar, Christian Hane, and Jitendra Malik. Learning independent object motion from unlabelled stereoscopic videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5594–5603, 2019.

  10. [10]

    Linear rotate subspace based visual tracking methods with application to uav stand-off target tracking

    Fei Che, Jie Li, Yifeng Niu, Lizhen Wu, Wenchen Yao, and Chao Yan. Linear rotate subspace based visual tracking methods with application to uav stand-off target tracking. In 2019 IEEE International Conference on Unmanned Systems (ICUS), pages 914–919, 2019.

  11. [11]

    Transformer tracking

    Xin Chen, Bin Yan, Jiawen Zhu, Dong Wang, Xiaoyun Yang, and Huchuan Lu. Transformer tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8126–8135, 2021.

  12. [12]

    Seqtrack: Sequence to sequence learning for visual object tracking

    Xin Chen, Houwen Peng, Dong Wang, Huchuan Lu, and Han Hu. Seqtrack: Sequence to sequence learning for visual object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14572–14581, 2023.

  13. [13]

    Siamese box adaptive network for visual tracking

    Zedu Chen, Bineng Zhong, Guorong Li, Shengping Zhang, and Rongrong Ji. Siamese box adaptive network for visual tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6668–6677,

  14. [14]

    Mixformer: End-to-end tracking with iterative mixed attention

    Yutao Cui, Cheng Jiang, Limin Wang, and Gangshan Wu. Mixformer: End-to-end tracking with iterative mixed attention. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13608–13618,

  15. [15]

    Atom: Accurate tracking by overlap maximization

    Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. Atom: Accurate tracking by overlap maximization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4660–4669,

  16. [16]

    Ada-track: End-to-end multi-camera 3d multi-object tracking with alternating detection and association

    Shuxiao Ding, Lukas Schneider, Marius Cordts, and Juergen Gall. Ada-track: End-to-end multi-camera 3d multi-object tracking with alternating detection and association. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15184–15194, 2024.

  17. [17]

    Lester E Dubins. On curves of minimal length with a constraint on average curvature, and with prescribed initial and terminal positions and tangents. American Journal of mathematics, 79(3):497–516, 1957.

  18. [18]

    Lasot: A high-quality large-scale single object tracking benchmark

    Heng Fan, Hexin Bai, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Harshit, Mingzhen Huang, Juehuan Liu, et al. Lasot: A high-quality large-scale single object tracking benchmark. International Journal of Computer Vision, 129(2):439–461, 2021.

  19. [19]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.

  20. [20]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022.

  21. [21]

    Got-10k: A large high-diversity benchmark for generic object tracking in the wild

    Lianghua Huang, Xin Zhao, and Kaiqi Huang. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE transactions on pattern analysis and machine intelligence, 43(5):1562–1577, 2019.

  22. [22]

    Segment any motion in videos

    Nan Huang, Wenzhao Zheng, Chenfeng Xu, Kurt Keutzer, Shanghang Zhang, Angjoo Kanazawa, and Qianqian Wang. Segment any motion in videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3406–3416, 2025.

  23. [23]

    Accelerated reeds-shepp and under-specified reeds-shepp algorithms for mobile robot path planning

    Ibrahim Ibrahim, Wilm Decré, and Jan Swevers. Accelerated reeds-shepp and under-specified reeds-shepp algorithms for mobile robot path planning. IEEE Transactions on Robotics,

  24. [24]

    Unscented filtering and nonlinear estimation

    Simon J Julier and Jeffrey K Uhlmann. Unscented filtering and nonlinear estimation. Proceedings of the IEEE, 92(3):401–422, 2004.

  25. [25]

    Learning segmentation from point trajectories

    Laurynas Karazija, Iro Laina, Christian Rupprecht, and Andrea Vedaldi. Learning segmentation from point trajectories. Advances in Neural Information Processing Systems, 37:112573–112597, 2024.

  26. [26]

    Nonlinear systems

    HK Khalil. Nonlinear systems. 3rd edition, 2002.

  27. [27]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023.

  28. [28]

    The weighted markov-dubins problem

    Deepak Prakash Kumar, Swaroop Darbha, Satyanarayana Gupta Manyam, and David Casbeer. The weighted markov-dubins problem. IEEE Robotics and Automation Letters, 8(3):1563–1570, 2023.

  29. [29]

    Motion segmentation via a sparsity constraint

    Taotao Lai, Hanzi Wang, Yan Yan, Tat-Jun Chin, and Wan-Lei Zhao. Motion segmentation via a sparsity constraint. IEEE Transactions on Intelligent Transportation Systems, 18(4):973–983, 2016.

  30. [30]

    High performance visual tracking with siamese region proposal network

    Bo Li, Junjie Yan, Wei Wu, Zheng Zhu, and Xiaolin Hu. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8971–8980,

  31. [31]

    Siamrpn++: Evolution of siamese visual tracking with very deep networks

    Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, and Junjie Yan. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4282–4291, 2019.

  32. [32]

    Video segmentation by tracking many figure-ground segments

    Fuxin Li, Taeyoung Kim, Ahmad Humayun, David Tsai, and James M Rehg. Video segmentation by tracking many figure-ground segments. In Proceedings of the IEEE international conference on computer vision, pages 2192–2199, 2013.

  33. [33]

    Sixu Li and Yang Zhou. Nonlinear oscillatory response of automated vehicle car-following: Theoretical analysis with traffic state and control input limits. Transportation Research Part B: Methodological, 201:103315, 2025.

  34. [34]

    Sequencing-enabled hierarchical cooperative cav on-ramp merging control with enhanced stability and feasibility

    Sixu Li, Yang Zhou, Xinyue Ye, Jiwan Jiang, and Meng Wang. Sequencing-enabled hierarchical cooperative cav on-ramp merging control with enhanced stability and feasibility. IEEE Transactions on Intelligent Vehicles, 2024.

  35. [35]

    Closed-form generation of paths for motion planning of a convexified reeds-shepp vehicle on a sphere

    Sixu Li, Deepak Prakash Kumar, Swaroop Darbha, and Yang Zhou. Closed-form generation of paths for motion planning of a convexified reeds-shepp vehicle on a sphere. Available at SSRN 5227769, 2025.

  36. [36]

    Time-optimal Convexified Reeds-Shepp Paths on a Sphere

    Sixu Li, Deepak Prakash Kumar, Swaroop Darbha, and Yang Zhou. Time-optimal convexified reeds-shepp paths on a sphere. arXiv preprint arXiv:2504.00966, 2025.

  37. [37]

    Bootstrapping objectness from videos by relaxed common fate and visual grouping

    Long Lian, Zhirong Wu, and Stella X Yu. Bootstrapping objectness from videos by relaxed common fate and visual grouping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14582–14591, 2023.

  38. [38]

    Pointmamba: A simple state space model for point cloud analysis

    Dingkang Liang, Xin Zhou, Wei Xu, Xingkui Zhu, Zhikang Zou, Xiaoqing Ye, Xiao Tan, and Xiang Bai. Pointmamba: A simple state space model for point cloud analysis. arXiv preprint arXiv:2402.10739, 2024.

  39. [39]

    Swintrack: A simple and strong baseline for transformer tracking

    Liting Lin, Heng Fan, Zhipeng Zhang, Yong Xu, and Haibin Ling. Swintrack: A simple and strong baseline for transformer tracking. Advances in Neural Information Processing Systems, 35:16743–16754, 2022.

  40. [40]

    Tracking meets lora: Faster training, larger model, stronger performance

    Liting Lin, Heng Fan, Zhipeng Zhang, Yaowei Wang, Yong Xu, and Haibin Ling. Tracking meets lora: Faster training, larger model, stronger performance. In European Conference on Computer Vision, pages 300–318. Springer, 2024.

  41. [41]

    Vmamba: Visual state space model

    Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Jianbin Jiao, and Yunfan Liu. Vmamba: Visual state space model. Advances in neural information processing systems, 37:103031–103063, 2024.

  42. [42]

    Modern robotics

    Kevin M Lynch and Frank C Park. Modern robotics. Cambridge University Press, 2017.

  43. [43]

    U-Mamba: Enhancing Long-range Dependency for Biomedical Image Segmentation

    Jun Ma, Feifei Li, and Bo Wang. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722, 2024.

  44. [44]

    Transforming model prediction for tracking

    Christoph Mayer, Martin Danelljan, Goutam Bhat, Matthieu Paul, Danda Pani Paudel, Fisher Yu, and Luc Van Gool. Transforming model prediction for tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8731–8740, 2022.

  45. [45]

    Em-driven unsupervised learning for efficient motion segmentation

    Etienne Meunier, Anaïs Badoual, and Patrick Bouthemy. Em-driven unsupervised learning for efficient motion segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4462–4473, 2022.

  46. [46]

    Trackingnet: A large-scale dataset and benchmark for object tracking in the wild

    Matthias Muller, Adel Bibi, Silvio Giancola, Salman Alsubaihi, and Bernard Ghanem. Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. In Proceedings of the European conference on computer vision (ECCV), pages 300–317, 2018.

  47. [47]

    Segmentation of moving objects by long term video analysis

    Peter Ochs, Jitendra Malik, and Thomas Brox. Segmentation of moving objects by long term video analysis. IEEE transactions on pattern analysis and machine intelligence, 36(6):1187–1200, 2013.

  48. [48]

    A benchmark dataset and evaluation methodology for video object segmentation

    Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 724–732,

  49. [49]

    Tracking 3-d motion of dynamic objects using monocular visual-inertial sensing

    Kejie Qiu, Tong Qin, Wenliang Gao, and Shaojie Shen. Tracking 3-d motion of dynamic objects using monocular visual-inertial sensing. IEEE Transactions on Robotics, 35(4):799–816, 2019.

  50. [50]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.

  51. [51]

    Hiera: A hierarchical vision transformer without the bells-and-whistles

    Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, et al. Hiera: A hierarchical vision transformer without the bells-and-whistles. In International conference on machine learning, pages 29441–29454. PMLR, 2023.

  52. [52]

    Explicit visual prompts for visual object tracking

    Liangtao Shi, Bineng Zhong, Qihua Liang, Ning Li, Shengping Zhang, and Xianxian Li. Explicit visual prompts for visual object tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4838–4846, 2024.

  53. [53]

    Shortest paths for the reeds-shepp car: a worked out example of the use of geometric techniques in nonlinear optimal control

    Héctor J Sussmann and Guoqing Tang. Shortest paths for the reeds-shepp car: a worked out example of the use of geometric techniques in nonlinear optimal control, 1991.

  54. [54]

    Nuscenes-spatialqa: A spatial understanding and reasoning benchmark for vision-language models in autonomous driving

    Kexin Tian, Jingrui Mao, Yunlong Zhang, Jiwan Jiang, Yang Zhou, and Zhengzhong Tu. Nuscenes-spatialqa: A spatial understanding and reasoning benchmark for vision-language models in autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4567–4576, 2025.

  55. [55]

    Physically analyzable ai-based nonlinear platoon dynamics modeling during traffic oscillation: A koopman approach

    Kexin Tian, Haotian Shi, Yang Zhou, and Sixu Li. Physically analyzable ai-based nonlinear platoon dynamics modeling during traffic oscillation: A koopman approach. IEEE Transactions on Intelligent Transportation Systems, 2025.

  56. [56]

    The unscented kalman filter for nonlinear estimation

    Eric A Wan and Rudolph Van Der Merwe. The unscented kalman filter for nonlinear estimation. In Proceedings of the IEEE 2000 adaptive systems for signal processing, communications, and control symposium (Cat. No. 00EX373), pages 153–158. IEEE, 2000.

  57. [57]

    Segmenting moving objects via an object-centric layered representation

    Junyu Xie, Weidi Xie, and Andrew Zisserman. Segmenting moving objects via an object-centric layered representation. Advances in neural information processing systems, 35:28023–28036, 2022.

  58. [58]

    Appearance-based refinement for object-centric motion segmentation

    Junyu Xie, Weidi Xie, and Andrew Zisserman. Appearance-based refinement for object-centric motion segmentation. In European Conference on Computer Vision, pages 238–256. Springer, 2024.

  59. [59]

    Moving object segmentation: All you need is sam (and flow)

    Junyu Xie, Charig Yang, Weidi Xie, and Andrew Zisserman. Moving object segmentation: All you need is sam (and flow). In Proceedings of the Asian conference on computer vision, pages 162–178, 2024.

  60. [60]

    Autoregressive queries for adaptive tracking with spatio-temporal transformers

    Jinxia Xie, Bineng Zhong, Zhiyi Mo, Shengping Zhang, Liangtao Shi, Shuxiang Song, and Rongrong Ji. Autoregressive queries for adaptive tracking with spatio-temporal transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19300–19309, 2024.

  61. [61]

    Segmamba: Long-range sequential modeling mamba for 3d medical image segmentation

    Zhaohu Xing, Tian Ye, Yijun Yang, Guang Liu, and Lei Zhu. Segmamba: Long-range sequential modeling mamba for 3d medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 578–588. Springer, 2024.

  62. [62]

    Learning spatio-temporal transformer for visual tracking

    Bin Yan, Houwen Peng, Jianlong Fu, Dong Wang, and Huchuan Lu. Learning spatio-temporal transformer for visual tracking. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10448–10457, 2021.

  63. [63]

    Samurai: Adapting segment anything model for zero-shot visual tracking with motion-aware memory

    Cheng-Yen Yang, Hsiang-Wei Huang, Wenhao Chai, Zhongyu Jiang, and Jenq-Neng Hwang. Samurai: Adapting segment anything model for zero-shot visual tracking with motion-aware memory. arXiv preprint arXiv:2411.11922,

  64. [64]

    Unsupervised moving object detection via contextual information separation

    Yanchao Yang, Antonio Loquercio, Davide Scaramuzza, and Stefano Soatto. Unsupervised moving object detection via contextual information separation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 879–888, 2019.

  65. [65]

    Joint feature learning and relation modeling for tracking: A one-stream framework

    Botao Ye, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen. Joint feature learning and relation modeling for tracking: A one-stream framework. In European conference on computer vision, pages 341–357. Springer, 2022.

  66. [66]

    Deeper and wider siamese networks for real-time visual tracking

    Zhipeng Zhang and Houwen Peng. Deeper and wider siamese networks for real-time visual tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4591–4600, 2019.

  67. [67]

    Mdnet: A semantically and visually interpretable medical image diagnosis network

    Zizhao Zhang, Yuanpu Xie, Fuyong Xing, Mason McGough, and Lin Yang. Mdnet: A semantically and visually interpretable medical image diagnosis network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6428–6436, 2017.

  68. [68]

    Odtrack: Online dense temporal token learning for visual tracking

    Yaozong Zheng, Bineng Zhong, Qihua Liang, Zhiyi Mo, Shengping Zhang, and Xianxian Li. Odtrack: Online dense temporal token learning for visual tracking. In Proceedings of the AAAI conference on artificial intelligence, pages 7588–7596, 2024.