SUMO: Segment and Track Any Motion with Nonlinear State Space Models

Keshu Wu; Kexin Tian; Sixu Li; Yang Zhou; Zhengzhong Tu

arxiv: 2606.29861 · v1 · pith:G5H3BMWMnew · submitted 2026-06-29 · 💻 cs.CV · cs.AI

Pith reviewed 2026-06-30 06:35 UTC · model grok-4.3

keywords: visual object tracking · moving object segmentation · state space models · nonlinear dynamics · unscented filter · zero-shot tracking

The pith

A nonlinear state space model from robotics enables zero-shot tracking and segmentation of objects with complex motions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SUMO as a unified, training-free framework that combines nonlinear dynamics modeling with vision segmentation to solve visual object tracking and moving object segmentation. It claims that prior methods struggle because they depend only on visual cues and ignore the underlying nonlinear motion patterns. SUMO builds a robotics-inspired nonlinear state space model, then uses a selective unscented filter to fuse predictions from multiple sources and select reliable memory frames. This produces more consistent object states across time. If correct, the method would deliver stronger performance on real videos without needing task-specific training data.

Core claim

SUMO develops a nonlinear State Space Model to represent object dynamics and introduces a Selective Unscented Filter that applies joint scoring and dynamic fusion of multi-source predictions, together with a memory selection mechanism, to achieve state-of-the-art results on VOT and MOS benchmarks in a zero-shot setting.

What carries the argument

The nonlinear State Space Model (SSM) that encodes object motion dynamics, paired with the Selective Unscented Filter (SUF) that performs state estimation through scoring and fusion of predictions.
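For orientation, a generic discrete-time nonlinear state space model can be written as below. The unicycle-style transition $f$ is an assumption chosen purely for illustration (the reference list cites Dubins and Reeds-Shepp vehicle models); the paper's actual state vector, dynamics, and noise terms are not given in the abstract, and the symbols $p_x, p_y, \theta, v, \omega, Q, R$ are hypothetical.

```latex
\begin{aligned}
x_{k+1} &= f(x_k) + w_k, & w_k &\sim \mathcal{N}(0, Q),\\
z_k     &= h(x_k) + v_k, & v_k &\sim \mathcal{N}(0, R),
\end{aligned}
\qquad
f\!\begin{pmatrix} p_x \\ p_y \\ \theta \\ v \\ \omega \end{pmatrix}
=
\begin{pmatrix}
p_x + v\,\Delta t\,\cos\theta \\
p_y + v\,\Delta t\,\sin\theta \\
\theta + \omega\,\Delta t \\
v \\
\omega
\end{pmatrix}.
```

The model is nonlinear because position depends on the heading $\theta$ through $\cos$ and $\sin$, which is what a Kalman filter with a fixed linear transition matrix cannot represent exactly.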

Load-bearing premise

The nonlinear state space model accurately captures the motion patterns of objects in real videos.

What would settle it

Videos in which object motion deviates strongly from the assumed nonlinear dynamics and the Selective Unscented Filter produces incorrect state estimates despite clear visual evidence.

Figures

Figures reproduced from arXiv: 2606.29861 by Keshu Wu, Kexin Tian, Sixu Li, Yang Zhou, Zhengzhong Tu.

**Figure 2.** Framework architecture. Adjacent text from the paper (Sec. 3.1.2, Memory Attention): this module integrates selected memory frames into the current frame via a memory selection mechanism, detailed in Eq. (29), and a transformer-based attention architecture that follows the SAM2 [50] design, applying self-attention to the current-frame features followed by cross-attention to the memory features.

**Figure 3.** Qualitative visualization of VOT performance.

**Figure 4.** Qualitative visualization of MOS performance.

Original abstract

Visual Object Tracking (VOT) and Moving Object Segmentation (MOS) are two fundamental tasks in computer vision that involve both spatial and temporal object dynamics. Existing methods rely predominantly on visual cues and thus often falter in real-world scenarios where object motions are inherently complex and nonlinear. To address this limitation, we propose SUMO, a zero-shot, training-free, unified framework integrating nonlinear dynamics with vision-based segmentation for accurate and consistent VOT and MOS. Specifically, we develop a nonlinear State Space Model (SSM) inspired by robotics principles to capture the complex object dynamics. Building on this model, we propose a Selective Unscented Filter (SUF) for accurate state estimation, which features a joint scoring mechanism and dynamically fuses multi-source predictions to identify the most plausible object state over time. Furthermore, we apply a memory selection mechanism to evaluate the reliability of memory frames. Our extensive experimental results show that SUMO achieves state-of-the-art performance on both VOT and MOS tasks.
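The abstract names an unscented filter without giving equations. As a rough illustration of the standard unscented transform (in the Julier and Uhlmann form cited as [24]), the sketch below propagates sigma points through a hypothetical unicycle motion model. The state layout `[px, py, theta, v, omega]`, the function names, and the noise values are all assumptions for this example; this is not SUMO's Selective Unscented Filter, whose scoring and fusion steps are not described in the abstract.

```python
import math

def cholesky(a):
    """Return lower-triangular L with L * L^T == a (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def sigma_points(mean, cov, kappa=0.0):
    """Build the 2n+1 sigma points and weights of the basic unscented transform."""
    n = len(mean)
    scale = n + kappa
    low = cholesky([[scale * c for c in row] for row in cov])
    points = [list(mean)]
    for j in range(n):
        col = [low[i][j] for i in range(n)]
        points.append([m + c for m, c in zip(mean, col)])
        points.append([m - c for m, c in zip(mean, col)])
    weights = [kappa / scale] + [1.0 / (2.0 * scale)] * (2 * n)
    return points, weights

def unicycle_step(state, dt=1.0):
    """Hypothetical nonlinear motion model (assumed, not taken from the paper)."""
    px, py, theta, v, omega = state
    return [px + v * dt * math.cos(theta),
            py + v * dt * math.sin(theta),
            theta + omega * dt,
            v,
            omega]

def unscented_predict(mean, cov, f, process_noise, kappa=0.0):
    """Push sigma points through f, then recover the predicted mean and covariance.

    The heading is averaged linearly, which is only reasonable for small spreads.
    """
    points, weights = sigma_points(mean, cov, kappa)
    moved = [f(p) for p in points]
    n = len(mean)
    new_mean = [sum(w * p[i] for w, p in zip(weights, moved)) for i in range(n)]
    new_cov = [[process_noise[i][j]
                + sum(w * (p[i] - new_mean[i]) * (p[j] - new_mean[j])
                      for w, p in zip(weights, moved))
                for j in range(n)]
               for i in range(n)]
    return new_mean, new_cov
```

Starting from a target at the origin heading along the x-axis with unit speed and a small diagonal covariance, the predicted mean x-position comes out slightly below the point prediction of 1.0, because heading uncertainty bends some sigma points off the axis. A linearized filter would miss that effect.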

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SUMO brings nonlinear SSMs from robotics into zero-shot VOT and MOS but the SOTA claim hangs on an unablated modeling choice.

The punchline is that SUMO offers a new way to fuse nonlinear dynamics modeling with segmentation for tracking and moving object tasks without any training, but the results rest on an untested assumption that the nonlinear state space model is essential.

What the paper does is take ideas from robotics state estimation and apply them to computer vision problems that have mostly ignored explicit motion models. It introduces a Selective Unscented Filter that scores and fuses multi-source predictions, along with a way to pick reliable memory frames. This setup aims to make tracking more robust when objects move in complicated ways that pure visual methods miss. The training-free design is a clear strength for deployment in varied environments.
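To make "scores and fuses multi-source predictions" concrete, here is one minimal way such a step could look: each candidate from the segmentation model carries a box and a visual confidence, and a motion-predicted box from the filter is used to re-rank them. The names `box_iou` and `select_candidate`, the weighted-sum score, and the `motion_weight` parameter are assumptions for illustration only; the abstract does not specify the paper's joint scoring rule.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def select_candidate(candidates, predicted_box, motion_weight=0.5):
    """Pick the candidate with the best combined motion and visual score.

    candidates: list of (box, visual_confidence) pairs.
    predicted_box: box predicted by the motion model for the current frame.
    Returns (index, score) of the winning candidate.
    """
    best_index, best_score = -1, -1.0
    for i, (box, confidence) in enumerate(candidates):
        score = (motion_weight * box_iou(box, predicted_box)
                 + (1.0 - motion_weight) * confidence)
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score
```

With `motion_weight=0.0` the choice falls back to visual confidence alone, which is the kind of visual-cue-only behavior the paper argues against.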

The soft spot is right where the stress test flags it: no ablation that replaces the nonlinear SSM with a linear version or a standard filter to see if performance drops. If the linear version does just as well, then the robotics-inspired part isn't doing the heavy lifting, and the headline result doesn't follow from the claimed innovation. The abstract also skips over the actual equations, so it's difficult to judge how novel the filter really is versus existing unscented approaches.

This paper is aimed at people building systems for real-world video where motion complexity matters, such as robotics or security applications. A reader who wants to experiment with state space ideas in vision could pick up some useful components, but they'd have to fill in the missing controls themselves.

I would send it for peer review because the core idea has potential and the zero-shot claim is worth checking, even though the current version needs more rigorous validation on what makes it work.

Referee Report

2 major / 1 minor

Summary. The paper proposes SUMO, a zero-shot training-free unified framework for Visual Object Tracking (VOT) and Moving Object Segmentation (MOS). It combines a robotics-inspired nonlinear state space model (SSM) to capture complex object dynamics, a Selective Unscented Filter (SUF) with joint scoring and dynamic multi-source fusion for state estimation, and a memory selection mechanism. The central claim is that this yields state-of-the-art performance on both VOT and MOS tasks.
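A memory selection mechanism of the kind summarized above can be sketched as a reliability filter over past frames. Everything here is hypothetical: the per-frame `mask_score` and `motion_score` fields, the thresholds, and the rule of keeping the most recent reliable frames are assumptions for illustration, not the criterion the paper defines in its Eq. (29).

```python
def select_memory_frames(history, max_frames=3,
                         min_mask_score=0.5, min_motion_score=0.5):
    """Return the indices of the most recent frames judged reliable.

    history: list of dicts in temporal order, each with keys
             "frame", "mask_score", and "motion_score".
    A frame is kept only if both scores reach their thresholds.
    """
    if max_frames <= 0:
        return []
    reliable = [h["frame"] for h in history
                if h["mask_score"] >= min_mask_score
                and h["motion_score"] >= min_motion_score]
    return reliable[-max_frames:]
```

Requiring both scores to pass means a frame with a sharp mask but an implausible motion update (for example, a jump onto a distractor) is excluded from memory.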

Significance. If the results hold, the work could be significant as a training-free alternative that incorporates nonlinear dynamics from robotics to improve robustness on complex motions where visual-cue-only methods fail. The zero-shot nature and the SUF's joint scoring mechanism are strengths that could generalize across tasks. However, the significance hinges on whether the nonlinear SSM component is demonstrably necessary, which is not yet established.

major comments (2)

[Method description of nonlinear SSM and SUF] The SOTA claim on VOT and MOS rests on the premise that the nonlinear SSM plus SUF delivers superior state estimation, yet the manuscript provides no ablation that replaces the nonlinear dynamics with a linear SSM (or EKF/UKF) while keeping the SUF, joint scoring, and segmentation pipeline fixed. This is load-bearing for the central claim that nonlinear modeling is required to address the limitations of existing methods.
[Experimental results] No quantitative tables, baseline comparisons, or statistical details are referenced to support the SOTA performance assertions, and the experimental evaluation lacks controls that would allow attribution of gains specifically to the robotics-inspired nonlinear component versus the vision backbone.
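The ablation requested in the first major comment compares a linear motion model against a nonlinear one with everything else fixed. The toy sketch below shows the kind of gap such an ablation would measure: a target turning at a constant rate is extrapolated over a few unobserved frames by a constant-velocity (linear) predictor and by a constant-turn-rate (nonlinear) predictor. The function names and parameter values are assumptions, and the ground truth is generated by the turn model itself, so the nonlinear predictor is exact by construction. It says nothing about how SUMO behaves on real video.

```python
import math

def predict_constant_velocity(px, py, theta, v, steps, dt=1.0):
    """Linear extrapolation: keep moving along the current heading."""
    t = steps * dt
    return px + v * math.cos(theta) * t, py + v * math.sin(theta) * t

def predict_constant_turn(px, py, theta, v, omega, steps, dt=1.0):
    """Nonlinear extrapolation along a circular arc with turn rate omega."""
    if abs(omega) < 1e-9:
        return predict_constant_velocity(px, py, theta, v, steps, dt)
    t = steps * dt
    new_theta = theta + omega * t
    return (px + v / omega * (math.sin(new_theta) - math.sin(theta)),
            py + v / omega * (math.cos(theta) - math.cos(new_theta)))

def gap_after_occlusion(v=5.0, omega=0.2, steps=5):
    """Distance between the two predictions after `steps` unobserved frames."""
    truth = predict_constant_turn(0.0, 0.0, 0.0, v, omega, steps)
    linear = predict_constant_velocity(0.0, 0.0, 0.0, v, steps)
    return math.dist(truth, linear)
```

With the default values the two predictions end up more than 10 units apart after five frames, while for `omega=0.0` they coincide. A real ablation would report whether benchmark accuracy shows a comparable sensitivity.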

minor comments (1)

The abstract is dense and would benefit from explicit separation of the three main contributions (nonlinear SSM, SUF, memory selection) for readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight the need for stronger empirical isolation of the nonlinear SSM's contribution, which we address by committing to additional experiments in revision. We respond point-by-point below.

Referee: [Method description of nonlinear SSM and SUF] The SOTA claim on VOT and MOS rests on the premise that the nonlinear SSM plus SUF delivers superior state estimation, yet the manuscript provides no ablation that replaces the nonlinear dynamics with a linear SSM (or EKF/UKF) while keeping the SUF, joint scoring, and segmentation pipeline fixed. This is load-bearing for the central claim that nonlinear modeling is required to address the limitations of existing methods.

Authors: We agree that an explicit ablation isolating the nonlinear dynamics (replacing the nonlinear SSM with a linear SSM or EKF/UKF while freezing SUF, joint scoring, and the segmentation pipeline) is necessary to substantiate the central claim. The current manuscript motivates the nonlinear SSM from robotics principles for complex motions but does not include this controlled comparison. In the revised version we will add the requested ablation on standard VOT and MOS benchmarks and report the resulting performance deltas. revision: yes
Referee: [Experimental results] No quantitative tables, baseline comparisons, or statistical details are referenced to support the SOTA performance assertions, and the experimental evaluation lacks controls that would allow attribution of gains specifically to the robotics-inspired nonlinear component versus the vision backbone.

Authors: The manuscript contains quantitative results and baseline comparisons in the experimental section; however, we acknowledge that these do not yet include the specific controls needed to attribute gains to the nonlinear SSM versus the vision backbone. The additional ablation described in the response to the first comment will directly address this attribution gap. We will also ensure all tables, statistical details, and controls are clearly referenced in the revised text. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical SOTA claim rests on independent model proposal and experiments

The paper proposes a new zero-shot framework (nonlinear SSM + SUF + memory selection) for VOT/MOS, with performance claims grounded in experimental results on standard benchmarks rather than any derivation that reduces to its own inputs. No equations, fitted parameters, or self-citations appear in the abstract or described structure that would trigger self-definitional, fitted-input, or load-bearing self-citation patterns. The robotics inspiration is presented as motivation for an ansatz, not as a uniqueness theorem imported from prior author work. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; all technical details are absent.

pith-pipeline@v0.9.1-grok · 5709 in / 973 out tokens · 22311 ms · 2026-06-30T06:35:14.733153+00:00 · methodology

Reference graph

Works this paper leans on

68 extracted references · 7 canonical work pages · 4 internal anchors

[1]

Progressive-x: Efficient, anytime, multi-model fitting algorithm

Daniel Barath and Jiri Matas. Progressive-x: Efficient, anytime, multi-model fitting algorithm. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3780–3788, 2019.

2019
[2]

Time optimal trajectories for a car-like mobile robot

Joseph Z Ben-Asher and Elon D Rimon. Time optimal trajectories for a car-like mobile robot. IEEE Transactions on Robotics, 38(1):421–432, 2021.

2021
[3]

Fully-convolutional siamese networks for object tracking

Luca Bertinetto, Jack Valmadre, Joao F Henriques, Andrea Vedaldi, and Philip HS Torr. Fully-convolutional siamese networks for object tracking. In Computer Vision – ECCV 2016 Workshops: Amsterdam, the Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part II 14, pages 850–865. Springer, 2016.

2016
[4]

Learning discriminative model prediction for tracking

Goutam Bhat, Martin Danelljan, Luc Van Gool, and Radu Timofte. Learning discriminative model prediction for tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6182–6191, 2019.

2019
[5]

It’s moving! A probabilistic model for causal motion segmentation in moving camera videos

Pia Bideau and Erik Learned-Miller. It’s moving! A probabilistic model for causal motion segmentation in moving camera videos. In European Conference on Computer Vision, pages 433–449. Springer, 2016.

2016
[6]

Deep learning for robust motion segmentation with non-static cameras

Markus Bosch. Deep learning for robust motion segmentation with non-static cameras. arXiv preprint arXiv:2102.10929, 2021.

work page arXiv 2021
[7]

Robust object modeling for visual tracking

Yidong Cai, Jie Liu, Jie Tang, and Gangshan Wu. Robust object modeling for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9589–9600, 2023.

2023
[8]

Linear system theory

Frank M Callier and Charles A Desoer. Linear system theory. Springer Science & Business Media, 2012.

2012
[9]

Learning independent object motion from unlabelled stereoscopic videos

Zhe Cao, Abhishek Kar, Christian Hane, and Jitendra Malik. Learning independent object motion from unlabelled stereoscopic videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5594–5603, 2019.

2019
[10]

Linear rotate subspace based visual tracking methods with application to UAV stand-off target tracking

Fei Che, Jie Li, Yifeng Niu, Lizhen Wu, Wenchen Yao, and Chao Yan. Linear rotate subspace based visual tracking methods with application to UAV stand-off target tracking. In 2019 IEEE International Conference on Unmanned Systems (ICUS), pages 914–919, 2019.

2019
[11]

Transformer tracking

Xin Chen, Bin Yan, Jiawen Zhu, Dong Wang, Xiaoyun Yang, and Huchuan Lu. Transformer tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8126–8135, 2021.

2021
[12]

Seqtrack: Sequence to sequence learning for visual object tracking

Xin Chen, Houwen Peng, Dong Wang, Huchuan Lu, and Han Hu. Seqtrack: Sequence to sequence learning for visual object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14572–14581, 2023.

2023
[13]

Siamese box adaptive network for visual tracking

Zedu Chen, Bineng Zhong, Guorong Li, Shengping Zhang, and Rongrong Ji. Siamese box adaptive network for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6668–6677,
[14]

Mixformer: End-to-end tracking with iterative mixed attention

Yutao Cui, Cheng Jiang, Limin Wang, and Gangshan Wu. Mixformer: End-to-end tracking with iterative mixed attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13608–13618,
[15]

Atom: Accurate tracking by overlap maximization

Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. Atom: Accurate tracking by overlap maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4660–4669,
[16]

Ada-track: End-to-end multi-camera 3d multi-object tracking with alternating detection and association

Shuxiao Ding, Lukas Schneider, Marius Cordts, and Juergen Gall. Ada-track: End-to-end multi-camera 3d multi-object tracking with alternating detection and association. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15184–15194, 2024.

2024
[17]

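On curves of minimal length with a constraint on average curvature, and with prescribed initial and terminal positions and tangents

Lester E Dubins. On curves of minimal length with a constraint on average curvature, and with prescribed initial and terminal positions and tangents. American Journal of Mathematics, 79(3):497–516, 1957.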

1957
[18]

Lasot: A high-quality large-scale single object tracking benchmark

Heng Fan, Hexin Bai, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Harshit, Mingzhen Huang, Juehuan Liu, et al. Lasot: A high-quality large-scale single object tracking benchmark. International Journal of Computer Vision, 129(2):439–461, 2021.

2021
[19]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.

work page internal anchor Pith review Pith/arXiv arXiv 2023
[20]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.

2022
[21]

Got-10k: A large high-diversity benchmark for generic object tracking in the wild

Lianghua Huang, Xin Zhao, and Kaiqi Huang. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(5):1562–1577, 2019.

2019
[22]

Segment any motion in videos

Nan Huang, Wenzhao Zheng, Chenfeng Xu, Kurt Keutzer, Shanghang Zhang, Angjoo Kanazawa, and Qianqian Wang. Segment any motion in videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3406–3416, 2025.

2025
[23]

Accelerated reeds-shepp and under-specified reeds-shepp algorithms for mobile robot path planning

Ibrahim Ibrahim, Wilm Decré, and Jan Swevers. Accelerated reeds-shepp and under-specified reeds-shepp algorithms for mobile robot path planning. IEEE Transactions on Robotics,
[24]

Unscented filtering and nonlinear estimation

Simon J Julier and Jeffrey K Uhlmann. Unscented filtering and nonlinear estimation. Proceedings of the IEEE, 92(3):401–422, 2004.

2004
[25]

Learning segmentation from point trajectories

Laurynas Karazija, Iro Laina, Christian Rupprecht, and Andrea Vedaldi. Learning segmentation from point trajectories. Advances in Neural Information Processing Systems, 37:112573–112597, 2024.

2024
[26]

Nonlinear systems

HK Khalil. Nonlinear systems. 3rd edition, 2002.

2002
[27]

Segment anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.

2023
[28]

The weighted markov-dubins problem

Deepak Prakash Kumar, Swaroop Darbha, Satyanarayana Gupta Manyam, and David Casbeer. The weighted markov-dubins problem. IEEE Robotics and Automation Letters, 8(3):1563–1570, 2023.

2023
[29]

Motion segmentation via a sparsity constraint

Taotao Lai, Hanzi Wang, Yan Yan, Tat-Jun Chin, and Wan-Lei Zhao. Motion segmentation via a sparsity constraint. IEEE Transactions on Intelligent Transportation Systems, 18(4):973–983, 2016.

2016
[30]

High performance visual tracking with siamese region proposal network

Bo Li, Junjie Yan, Wei Wu, Zheng Zhu, and Xiaolin Hu. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8971–8980,
[31]

Siamrpn++: Evolution of siamese visual tracking with very deep networks

Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, and Junjie Yan. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4282–4291, 2019.

2019
[32]

Video segmentation by tracking many figure-ground segments

Fuxin Li, Taeyoung Kim, Ahmad Humayun, David Tsai, and James M Rehg. Video segmentation by tracking many figure-ground segments. In Proceedings of the IEEE International Conference on Computer Vision, pages 2192–2199, 2013.

2013
[33]

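Nonlinear oscillatory response of automated vehicle car-following: Theoretical analysis with traffic state and control input limits

Sixu Li and Yang Zhou. Nonlinear oscillatory response of automated vehicle car-following: Theoretical analysis with traffic state and control input limits. Transportation Research Part B: Methodological, 201:103315, 2025.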

2025
[34]

Sequencing-enabled hierarchical cooperative CAV on-ramp merging control with enhanced stability and feasibility

Sixu Li, Yang Zhou, Xinyue Ye, Jiwan Jiang, and Meng Wang. Sequencing-enabled hierarchical cooperative CAV on-ramp merging control with enhanced stability and feasibility. IEEE Transactions on Intelligent Vehicles, 2024.

2024
[35]

Closed-form generation of paths for motion planning of a convexified reeds-shepp vehicle on a sphere

Sixu Li, Deepak Prakash Kumar, Swaroop Darbha, and Yang Zhou. Closed-form generation of paths for motion planning of a convexified reeds-shepp vehicle on a sphere. Available at SSRN 5227769, 2025.

2025
[36]

Time-optimal Convexified Reeds-Shepp Paths on a Sphere

Sixu Li, Deepak Prakash Kumar, Swaroop Darbha, and Yang Zhou. Time-optimal convexified reeds-shepp paths on a sphere. arXiv preprint arXiv:2504.00966, 2025.

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Bootstrapping objectness from videos by relaxed common fate and visual grouping

Long Lian, Zhirong Wu, and Stella X Yu. Bootstrapping objectness from videos by relaxed common fate and visual grouping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14582–14591, 2023.

2023
[38]

Pointmamba: A simple state space model for point cloud analysis

Dingkang Liang, Xin Zhou, Wei Xu, Xingkui Zhu, Zhikang Zou, Xiaoqing Ye, Xiao Tan, and Xiang Bai. Pointmamba: A simple state space model for point cloud analysis. arXiv preprint arXiv:2402.10739, 2024.

work page arXiv 2024
[39]

Swintrack: A simple and strong baseline for transformer tracking

Liting Lin, Heng Fan, Zhipeng Zhang, Yong Xu, and Haibin Ling. Swintrack: A simple and strong baseline for transformer tracking. Advances in Neural Information Processing Systems, 35:16743–16754, 2022.

2022
[40]

Tracking meets lora: Faster training, larger model, stronger performance

Liting Lin, Heng Fan, Zhipeng Zhang, Yaowei Wang, Yong Xu, and Haibin Ling. Tracking meets lora: Faster training, larger model, stronger performance. In European Conference on Computer Vision, pages 300–318. Springer, 2024.

2024
[41]

Vmamba: Visual state space model

Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Jianbin Jiao, and Yunfan Liu. Vmamba: Visual state space model. Advances in Neural Information Processing Systems, 37:103031–103063, 2024.

2024
[42]

Modern robotics

Kevin M Lynch and Frank C Park. Modern robotics. Cambridge University Press, 2017.

2017
[43]

U-Mamba: Enhancing Long-range Dependency for Biomedical Image Segmentation

Jun Ma, Feifei Li, and Bo Wang. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722, 2024.

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

Transforming model prediction for tracking

Christoph Mayer, Martin Danelljan, Goutam Bhat, Matthieu Paul, Danda Pani Paudel, Fisher Yu, and Luc Van Gool. Transforming model prediction for tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8731–8740, 2022.

2022
[45]

Em-driven unsupervised learning for efficient motion segmentation

Etienne Meunier, Anaïs Badoual, and Patrick Bouthemy. Em-driven unsupervised learning for efficient motion segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4462–4473, 2022.

2022
[46]

Trackingnet: A large-scale dataset and benchmark for object tracking in the wild

Matthias Muller, Adel Bibi, Silvio Giancola, Salman Alsubaihi, and Bernard Ghanem. Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. In Proceedings of the European Conference on Computer Vision (ECCV), pages 300–317, 2018.

2018
[47]

Segmentation of moving objects by long term video analysis

Peter Ochs, Jitendra Malik, and Thomas Brox. Segmentation of moving objects by long term video analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(6):1187–1200, 2013.

2013
[48]

A benchmark dataset and evaluation methodology for video object segmentation

Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 724–732,
[49]

Tracking 3-d motion of dynamic objects using monocular visual-inertial sensing

Kejie Qiu, Tong Qin, Wenliang Gao, and Shaojie Shen. Tracking 3-d motion of dynamic objects using monocular visual-inertial sensing. IEEE Transactions on Robotics, 35(4):799–816, 2019.

2019
[50]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.

work page internal anchor Pith review Pith/arXiv arXiv 2024
[51]

Hiera: A hierarchical vision transformer without the bells-and-whistles

Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, et al. Hiera: A hierarchical vision transformer without the bells-and-whistles. In International Conference on Machine Learning, pages 29441–29454. PMLR, 2023.

2023
[52]

Explicit visual prompts for visual object tracking

Liangtao Shi, Bineng Zhong, Qihua Liang, Ning Li, Shengping Zhang, and Xianxian Li. Explicit visual prompts for visual object tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4838–4846, 2024.

2024
[53]

Shortest paths for the reeds-shepp car: a worked out example of the use of geometric techniques in nonlinear optimal control

Héctor J Sussmann and Guoqing Tang. Shortest paths for the reeds-shepp car: a worked out example of the use of geometric techniques in nonlinear optimal control, 1991.

1991
[54]

Nuscenes-spatialqa: A spatial understanding and reasoning benchmark for vision-language models in autonomous driving

Kexin Tian, Jingrui Mao, Yunlong Zhang, Jiwan Jiang, Yang Zhou, and Zhengzhong Tu. Nuscenes-spatialqa: A spatial understanding and reasoning benchmark for vision-language models in autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4567–4576, 2025.

2025
[55]

Physically analyzable ai-based nonlinear platoon dynamics modeling during traffic oscillation: A koopman approach

Kexin Tian, Haotian Shi, Yang Zhou, and Sixu Li. Physically analyzable ai-based nonlinear platoon dynamics modeling during traffic oscillation: A koopman approach. IEEE Transactions on Intelligent Transportation Systems, 2025.

2025
[56]

The unscented kalman filter for nonlinear estimation

Eric A Wan and Rudolph Van Der Merwe. The unscented kalman filter for nonlinear estimation. In Proceedings of the IEEE 2000 Adaptive Systems for Signal Processing, Communications, and Control Symposium (Cat. No. 00EX373), pages 153–158. IEEE, 2000.

2000
[57]

Segmenting moving objects via an object-centric layered representation

Junyu Xie, Weidi Xie, and Andrew Zisserman. Segmenting moving objects via an object-centric layered representation. Advances in Neural Information Processing Systems, 35:28023–28036, 2022.

2022
[58]

Appearance-based refinement for object-centric motion segmentation

Junyu Xie, Weidi Xie, and Andrew Zisserman. Appearance-based refinement for object-centric motion segmentation. In European Conference on Computer Vision, pages 238–256. Springer, 2024.

2024
[59]

Moving object segmentation: All you need is sam (and flow)

Junyu Xie, Charig Yang, Weidi Xie, and Andrew Zisserman. Moving object segmentation: All you need is sam (and flow). In Proceedings of the Asian Conference on Computer Vision, pages 162–178, 2024.

2024
[60]

Autoregressive queries for adaptive tracking with spatio-temporal transformers

Jinxia Xie, Bineng Zhong, Zhiyi Mo, Shengping Zhang, Liangtao Shi, Shuxiang Song, and Rongrong Ji. Autoregressive queries for adaptive tracking with spatio-temporal transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19300–19309, 2024.

2024
[61]

Segmamba: Long-range sequential modeling mamba for 3d medical image segmentation

Zhaohu Xing, Tian Ye, Yijun Yang, Guang Liu, and Lei Zhu. Segmamba: Long-range sequential modeling mamba for 3d medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 578–588. Springer, 2024.

2024
[62]

Learning spatio-temporal transformer for visual tracking

Bin Yan, Houwen Peng, Jianlong Fu, Dong Wang, and Huchuan Lu. Learning spatio-temporal transformer for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10448–10457, 2021.

2021
[63]

Samurai: Adapting segment anything model for zero-shot visual tracking with motion-aware memory

Cheng-Yen Yang, Hsiang-Wei Huang, Wenhao Chai, Zhongyu Jiang, and Jenq-Neng Hwang. Samurai: Adapting segment anything model for zero-shot visual tracking with motion-aware memory. arXiv preprint arXiv:2411.11922,

work page arXiv
[64]

Unsupervised moving object detection via contextual information separation

Yanchao Yang, Antonio Loquercio, Davide Scaramuzza, and Stefano Soatto. Unsupervised moving object detection via contextual information separation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 879–888, 2019.

2019
[65]

Botao Ye, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen. Joint feature learning and relation modeling for tracking: A one-stream framework. In European conference on computer vision, pages 341–357. Springer, 2022.

[66]

Zhipeng Zhang and Houwen Peng. Deeper and wider siamese networks for real-time visual tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4591–4600, 2019.

[67]

Zizhao Zhang, Yuanpu Xie, Fuyong Xing, Mason McGough, and Lin Yang. Mdnet: A semantically and visually interpretable medical image diagnosis network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6428–6436, 2017.

[68]

Yaozong Zheng, Bineng Zhong, Qihua Liang, Zhiyi Mo, Shengping Zhang, and Xianxian Li. Odtrack: Online dense temporal token learning for visual tracking. In Proceedings of the AAAI conference on artificial intelligence, pages 7588–7596, 2024.

A. Additional Implementation Details

A.1. Computing Environments

SUMO is a training-free model, with all infe...
