arxiv: 2511.18264 · v3 · submitted 2025-11-23 · 💻 cs.CV

SatSAM2: Motion-Constrained Video Object Tracking in Satellite Imagery using Promptable SAM2 and Kalman Priors

Pith reviewed 2026-05-17 06:06 UTC · model grok-4.3

classification 💻 cs.CV
keywords satellite video tracking · zero-shot object tracking · SAM2 adaptation · Kalman motion priors · occlusion handling · remote sensing · MVOT benchmark · video object tracking

The pith

SatSAM2 adds Kalman motion constraints to SAM2 to track objects in satellite videos without any training or fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SatSAM2, a zero-shot tracker that adapts the promptable SAM2 model to satellite imagery by adding explicit motion modeling. A Kalman Filter-based Constrained Motion Module (KFCMM) enforces consistent object trajectories and reduces drift, and a Motion-Constrained State Machine (MCSM) switches tracking behavior based on motion reliability and occlusion signals. The authors also propose MVOT, a synthetic benchmark with 1,500+ sequences and 157K annotated frames, and state that code and dataset will be publicly released. Experiments on two existing satellite tracking benchmarks and MVOT report gains over both traditional trackers and SAM2 variants, including a 5.84% AUC improvement over state-of-the-art methods on the OOTB dataset.

Core claim

SatSAM2 adapts the SAM2 foundation model to satellite video object tracking by inserting two motion-aware modules: a Kalman Filter-based Constrained Motion Module that supplies temporal priors to limit drift, and a Motion-Constrained State Machine that adjusts the tracker state according to motion dynamics and detection reliability. These additions are intended to let the system maintain tracks through occlusions and viewpoint changes without scenario-specific training. On two satellite tracking benchmarks and the new MVOT dataset of 1,500+ sequences, the resulting tracker is reported to outperform both traditional methods and other SAM2-based approaches, including a 5.84% AUC improvement over state-of-the-art methods on the OOTB dataset.

What carries the argument

The Kalman Filter-based Constrained Motion Module paired with the Motion-Constrained State Machine, which together supply and enforce motion priors inside the SAM2 tracking loop to suppress drift and manage occlusion states.
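
To make the idea of a Kalman motion prior concrete, here is a minimal sketch of a one-dimensional constant-velocity Kalman filter in plain Python. It is not the paper's KFCMM: the class name `ConstantVelocityKF1D`, the noise values `q` and `r`, and the per-axis scalar formulation are all assumptions chosen for illustration. It only shows how such a filter can keep predicting a position when no observation arrives, which is the role the motion prior plays during occlusion.

```python
from dataclasses import dataclass


@dataclass
class ConstantVelocityKF1D:
    """1-D constant-velocity Kalman filter over the state [position, velocity]."""
    pos: float
    vel: float = 0.0
    # Covariance P = [[p00, p01], [p10, p11]]; a large p11 means "velocity unknown".
    p00: float = 1.0
    p01: float = 0.0
    p10: float = 0.0
    p11: float = 100.0
    q: float = 0.01  # process noise added to both diagonal terms (assumed value)
    r: float = 1.0   # measurement noise variance (assumed value)

    def predict(self, dt: float = 1.0) -> float:
        """Propagate the state with F = [[1, dt], [0, 1]]; return the predicted position."""
        self.pos += self.vel * dt
        # P <- F P F^T + Q, expanded by hand for the 2x2 case.
        a = self.p00 + dt * self.p10
        b = self.p01 + dt * self.p11
        self.p00 = a + dt * b + self.q
        self.p01 = b
        self.p10 = self.p10 + dt * self.p11
        self.p11 = self.p11 + self.q
        return self.pos

    def update(self, z: float) -> None:
        """Correct the state with a position measurement z (H = [1, 0])."""
        s = self.p00 + self.r                 # innovation variance
        k0, k1 = self.p00 / s, self.p10 / s   # Kalman gain
        y = z - self.pos                      # innovation
        self.pos += k0 * y
        self.vel += k1 * y
        # P <- (I - K H) P
        p00, p01 = self.p00, self.p01
        self.p00 = (1.0 - k0) * p00
        self.p01 = (1.0 - k0) * p01
        self.p10 = self.p10 - k1 * p00
        self.p11 = self.p11 - k1 * p01


# Target moving at 2 px/frame, observed on frames 1..19.
kf = ConstantVelocityKF1D(pos=0.0)
for t in range(1, 20):
    kf.predict()
    kf.update(2.0 * t)

# Five frames with no observation: predict only, as one might during an occlusion.
occluded_positions = [kf.predict() for _ in range(5)]
```

In this toy run the filter learns the velocity from the visible frames and then coasts along the same line while observations are missing. A real tracker would run one such filter per coordinate (or a joint filter over the box state) and would need the constant-velocity assumption to hold at least approximately.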

If this is right

  • Satellite video trackers can now be deployed across new scenes or sensors without collecting labeled training data for each case.
  • Track loss during temporary occlusions or illumination shifts decreases because the state machine explicitly pauses or reinitializes based on motion consistency.
  • Large-scale synthetic benchmarks like MVOT become viable proxies for comparing methods before real-data validation.
  • The same motion-constraint pattern could be inserted into other promptable video segmentation models for remote-sensing tasks.
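
The pausing and re-initializing behavior mentioned in the list above can be pictured as a small state machine. The sketch below is an assumption-laden illustration, not the paper's MCSM: the state name `LOST`, the function `next_state`, the thresholds, and the exact transition rules are hypothetical. Only the existence of a Stabilizing phase covering the first ten frames and a Stable state that uses both appearance and motion cues is taken from the figure captions on this page.

```python
from enum import Enum


class TrackState(Enum):
    STABILIZING = "stabilizing"  # warm-up: lean on the segmentation output
    STABLE = "stable"            # use both appearance and motion cues
    LOST = "lost"                # hypothetical name: coast on the motion prediction


def next_state(state, frame_idx, s_sam, s_kf,
               warmup_end=10, sam_thresh=0.5, kf_thresh=0.5):
    """Return the next tracker state from two scores in [0, 1].

    s_sam: segmentation confidence; s_kf: agreement with the motion prediction.
    All thresholds are illustrative placeholders, not values from the paper.
    """
    if state is TrackState.STABILIZING:
        # Leave warm-up only after enough frames and a confident mask.
        if frame_idx >= warmup_end and s_sam >= sam_thresh:
            return TrackState.STABLE
        return TrackState.STABILIZING
    if state is TrackState.STABLE:
        # Low appearance confidence suggests occlusion: stop trusting masks.
        if s_sam < sam_thresh:
            return TrackState.LOST
        return TrackState.STABLE
    # LOST: re-acquire only when appearance and motion agree again.
    if s_sam >= sam_thresh and s_kf >= kf_thresh:
        return TrackState.STABLE
    return TrackState.LOST


def run(observations, state=TrackState.STABILIZING):
    """Apply next_state over (frame_idx, s_sam, s_kf) tuples; return the visited states."""
    visited = []
    for frame_idx, s_sam, s_kf in observations:
        state = next_state(state, frame_idx, s_sam, s_kf)
        visited.append(state)
    return visited


states = run([
    (5, 0.9, 0.9),   # still warming up
    (10, 0.9, 0.9),  # warm-up finished, confident mask
    (11, 0.2, 0.8),  # mask confidence collapses
    (12, 0.9, 0.3),  # confident mask, but far from the predicted position
    (13, 0.9, 0.9),  # appearance and motion agree again
])
```

The design point is that a confident mask alone is not enough to recover a track in this sketch: frame 12 stays in the lost state because the candidate disagrees with the motion prediction.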

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the motion modules prove robust on real data, they could serve as a lightweight adapter layer for other foundation models in overhead imagery.
  • The approach suggests a general recipe for injecting domain-specific physics priors into large vision models without full retraining.
  • Wider release of the MVOT dataset may accelerate development of trackers that generalize across orbital altitudes and sensor types.

Load-bearing premise

The motion modules will suppress drift and recover from occlusions in varied real satellite footage without creating new failure modes or requiring case-by-case adjustments.

What would settle it

Performance measurements on a held-out collection of real satellite videos containing sudden maneuvers or extended occlusions that show no improvement or a drop in success rate compared with plain SAM2.

Figures

Figures reproduced from arXiv: 2511.18264 by Huan Chen, Junyan Ye, Ruijie Fan, Weijia Li, Xiaolei Wang, Zilong Huang.

Figure 1: Illustration of satellite video object tracking (SVOT). (a) Challenges in satellite-based tracking tasks. (b) Existing promptable methods either lack motion modeling or fail to account for the complete tracking pipeline. (c) Our approach integrates a Kalman-based motion model with a motion-constrained state machine to enable stable tracking.
… understanding [17, 19, 37, 38, 41]. However, despite recent pro…
Figure 2: Overview of the proposed SatSAM2 framework. (a) SAM2 Observer encodes each frame and retrieves candidate masks via memory matching. (b) Kalman Filter-based Constrained Motion Model (KFCMM) estimates target dynamics and provides predictive guidance under occlusion. (c) Motion-Constrained State Machine (MCSM) adaptively switches between tracking modes based on segmentation confidence and motion consistency.…
Figure 3: Illustration of the MVOT dataset.
In the Stable state, the tracker assumes that the target has been reliably localized and begins making decisions based on both appearance and motion cues. The decision logic is detailed in Algorithm 1. Two scores guide this process: the SAM2 affinity score s_sam, which reflects the confidence of the segmentation output, and the Kalman motion score s_kf, which quantifies the …
Figure 4: Qualitative comparison on four remote sensing datasets. Our method achieves the most accurate tracking results and demonstrates strong robustness in recovering targets after occlusion, enabling fast and reliable re-alignment.
Figure 5: Ablation comparison between Ours, GT, and variants with different modules removed.
Figure 6
Figure 7: Example scenes under different viewing angles (0 degrees, 10 degrees, and 20 degrees). Red boxes indicate the target positions in different frames.
… buildings, leading to degraded performance. The 10° angle, striking a balance between detail visibility and occlusion, yields the best results. A.3. Occlusion Scenarios. MVOT includes 148,700 non-occluded and 9,200 occluded frames. As shown in …
Figure 8: Example scenes under occluded and non-occluded conditions. Red boxes indicate the target positions in different frames.
… KFCMM module (on the right) for initialization. From the second to the tenth frame, the system enters the Stabilizing phase. During this period, as KFCMM has not yet received sufficient observations, the system relies more on the SAM2 Observer. Masks with confidence scores above a certain…
Figure 9: Visualization of Kalman Filter State Dynamics on Sequence car 30. (a-b) Qualitative Results: Representative frames showing stable tracking (a) and a severe occlusion event (b) where the target is obscured by trees (frames 436–496). (c) Spatial Trajectory: The estimated path (blue) remains smooth and continuous during the occlusion, effectively bridging the gap in visual observations (red crosses). (d) Velo…
Figure 10: Hyperparameter Sensitivity Analysis on the OOTB Dataset. (Left) Impact of the Kalman Motion Weight (α_kf). The tracking performance peaks at α_kf = 0.2, confirming that a subtle motion constraint effectively complements visual features. As the weight increases beyond 0.4, the linear motion model begins to override valid visual observations, leading to a significant drop in AUC. (Right) Impact of the Stabil…
Figure 11: Qualitative Analysis of Failure Cases. We identify two primary failure modes in the SatSAM2 framework. (Top Row) Gradual Drift on Elongated Targets: When tracking targets with high aspect ratios (e.g., trains) under low spatial resolution, SAM2 is prone to over-segmentation. Since this mask expansion evolves smoothly over time rather than abruptly, it evades the Kalman filter’s outlier rejection mechanism…
Figure 12: Qualitative comparisons on three datasets among our method and fully supervised methods (AQAtrack and LoRAT with …
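
The captions for Figures 3 and 10 mention two scores, a SAM2 affinity score and a Kalman motion score, and a Kalman motion weight α_kf that works best around 0.2. This page does not state how the two scores are combined, so the sketch below simply assumes a convex combination. The functions `fuse_scores` and `select_candidate` and all candidate values are hypothetical and exist only to show why a small weight can break ties in favor of motion-consistent candidates while a large one can override the visual evidence.

```python
def fuse_scores(s_sam, s_kf, alpha_kf=0.2):
    """Assumed fusion rule: (1 - alpha_kf) * s_sam + alpha_kf * s_kf."""
    if not 0.0 <= alpha_kf <= 1.0:
        raise ValueError("alpha_kf must lie in [0, 1]")
    return (1.0 - alpha_kf) * s_sam + alpha_kf * s_kf


def select_candidate(candidates, alpha_kf=0.2):
    """Pick the (name, s_sam, s_kf) tuple with the highest fused score."""
    return max(candidates, key=lambda c: fuse_scores(c[1], c[2], alpha_kf))


# Made-up candidates: a look-alike distractor with a slightly better mask score
# but poor motion agreement, and the true target.
look_alike = [("distractor", 0.80, 0.10), ("target", 0.75, 0.90)]
appearance_only = select_candidate(look_alike, alpha_kf=0.0)[0]
light_motion = select_candidate(look_alike, alpha_kf=0.2)[0]

# Made-up candidates where the visual match is clearly right but the target
# has deviated from the linear motion prediction.
maneuver = [("visual_match", 0.95, 0.20), ("motion_match", 0.30, 0.90)]
light_weight_pick = select_candidate(maneuver, alpha_kf=0.2)[0]
heavy_weight_pick = select_candidate(maneuver, alpha_kf=0.8)[0]
```

Under these assumptions, a weight of 0.2 is enough to prefer the target over the look-alike, while a weight of 0.8 picks the motion-consistent but visually poor candidate in the maneuver case, which mirrors the qualitative trend described in the Figure 10 caption.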
Original abstract

Existing satellite video tracking methods often struggle with generalization, requiring scenario-specific training to achieve satisfactory performance, and are prone to track loss in the presence of occlusion. To address these challenges, we propose SatSAM2, a zero-shot satellite video tracker built on SAM2, designed to adapt foundation models to the remote sensing domain. SatSAM2 introduces two core modules: a Kalman Filter-based Constrained Motion Module (KFCMM) to exploit temporal motion cues and suppress drift, and a Motion-Constrained State Machine (MCSM) to regulate tracking states based on motion dynamics and reliability. To support large-scale evaluation, we propose MatrixCity Video Object Tracking (MVOT), a synthetic benchmark containing 1,500+ sequences and 157K annotated frames with diverse viewpoints, illumination, and occlusion conditions. Extensive experiments on two satellite tracking benchmarks and MVOT show that SatSAM2 outperforms both traditional and foundation model-based trackers, including SAM2 and its variants. Notably, on the OOTB dataset, SatSAM2 achieves a 5.84% AUC improvement over state-of-the-art methods. Our code and dataset will be publicly released to encourage further research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SatSAM2, a zero-shot satellite video object tracker that adapts the promptable SAM2 foundation model using two new modules: a Kalman Filter-based Constrained Motion Module (KFCMM) to exploit temporal motion cues and suppress drift, and a Motion-Constrained State Machine (MCSM) to regulate tracking states. It introduces the MatrixCity Video Object Tracking (MVOT) synthetic benchmark with 1,500+ sequences and 157K frames, and reports that SatSAM2 outperforms traditional and foundation-model trackers on two satellite benchmarks and MVOT, with a 5.84% AUC gain on the OOTB dataset.

Significance. If the reported gains are substantiated by detailed ablations and robustness tests, the work would be significant for demonstrating practical zero-shot adaptation of large vision foundation models to the remote-sensing domain without scenario-specific training. The public release of code and the MVOT benchmark (with diverse viewpoints, illumination, and occlusions) would be a concrete contribution that enables reproducible research and could accelerate progress on satellite video tracking.

major comments (2)
  1. [§3.2] KFCMM description: The module relies on a standard Kalman filter with constant-velocity assumptions to constrain SAM2 prompts. Satellite imagery frequently exhibits perspective-induced acceleration, parallax, and irregular frame rates; the manuscript provides no quantitative validation (e.g., covariance sensitivity or cross-orbit tests) showing that prediction errors do not lock prompts onto background or trigger erroneous MCSM state switches.
  2. [§4] Experiments: The central claim of a 5.84% AUC improvement on OOTB and consistent outperformance on MVOT is presented without ablation studies isolating KFCMM and MCSM contributions, without error bars or statistical significance tests, and without failure-case analysis. This absence makes it impossible to verify that the gains are load-bearing on the proposed modules rather than on SAM2 base performance or benchmark specifics.
minor comments (2)
  1. [§3.3] Notation for the state-machine thresholds and Kalman process noise should be defined explicitly in a table or appendix so that readers can reproduce the exact configuration used in the reported results.
  2. Figure captions for qualitative results should include the specific sequence identifiers and frame numbers shown, and should note whether the displayed frames contain the occlusion or viewpoint-change cases highlighted in the text.
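
Since the comments above turn on an AUC figure, a short sketch of how a success-plot AUC is commonly computed in single-object tracking may help. It assumes axis-aligned `(x, y, w, h)` boxes and the usual 21 overlap thresholds from 0 to 1. OOTB is described in the reference list as an oriented-box benchmark, so its official protocol may differ; the function names `iou` and `success_auc` are illustrative only.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def success_auc(pred_boxes, gt_boxes, num_thresholds=21):
    """Mean success rate over evenly spaced IoU thresholds in [0, 1]."""
    if len(pred_boxes) != len(gt_boxes) or not gt_boxes:
        raise ValueError("need two non-empty box lists of equal length")
    overlaps = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    # A frame counts as a success at threshold t when its overlap exceeds t.
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(rates) / len(rates)


gt = [(0, 0, 10, 10), (2, 0, 10, 10), (4, 0, 10, 10)]
perfect = success_auc(gt, gt)
missed = success_auc([(100, 100, 10, 10)] * 3, gt)
```

With this strict-inequality convention, even a perfect tracker scores 20/21 rather than 1.0, because no overlap exceeds the final threshold of 1. Details like this are one reason AUC numbers are only comparable when the evaluation protocol is identical.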

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the KFCMM assumptions and experimental rigor. We address each major comment below and will revise the manuscript to incorporate additional validation and analysis.

Point-by-point responses
  1. Referee: [§3.2] KFCMM description: The module relies on a standard Kalman filter with constant-velocity assumptions to constrain SAM2 prompts. Satellite imagery frequently exhibits perspective-induced acceleration, parallax, and irregular frame rates; the manuscript provides no quantitative validation (e.g., covariance sensitivity or cross-orbit tests) showing that prediction errors do not lock prompts onto background or trigger erroneous MCSM state switches.

    Authors: We acknowledge that the constant-velocity model in KFCMM is a simplification that may not capture all satellite-specific effects such as parallax or irregular frame rates. The module was introduced to supply a lightweight temporal constraint for SAM2 prompt generation in a zero-shot setting. In the revised manuscript we will add covariance sensitivity analysis, evaluation on sequences with varying frame rates, and cross-orbit tests to quantify prediction error impact on prompt stability and MCSM transitions. revision: yes

  2. Referee: [§4] Experiments: The central claim of a 5.84% AUC improvement on OOTB and consistent outperformance on MVOT is presented without ablation studies isolating KFCMM and MCSM contributions, without error bars or statistical significance tests, and without failure-case analysis. This absence makes it impossible to verify that the gains are load-bearing on the proposed modules rather than on SAM2 base performance or benchmark specifics.

    Authors: We agree that the experimental section would be strengthened by explicit ablations and statistical reporting. While the manuscript already compares against SAM2 and other baselines, we will add component-wise ablation studies (with and without KFCMM/MCSM), report error bars and significance tests where appropriate, and include a failure-case analysis to demonstrate when the motion constraints are most effective. revision: yes
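
One way the promised error bars could be produced is a paired bootstrap over sequences. The sketch below is a generic illustration, not a procedure described by the authors: the function `paired_bootstrap_ci` is hypothetical, and the per-sequence AUC values are invented purely to exercise the code. It resamples per-sequence score differences between two trackers and reports a percentile interval for the mean difference.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of paired differences (a - b)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample sequences with replacement and record each resample's mean difference.
    means = sorted(
        sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((1.0 - level) / 2.0 * n_resamples)]
    hi = means[int((1.0 + level) / 2.0 * n_resamples) - 1]
    return sum(diffs) / n, lo, hi


# Invented per-sequence AUC values for two trackers on the same eight sequences.
with_motion = [0.62, 0.58, 0.71, 0.66, 0.60, 0.69, 0.64, 0.73]
without_motion = [0.55, 0.54, 0.65, 0.61, 0.57, 0.60, 0.59, 0.66]
mean_gain, ci_low, ci_high = paired_bootstrap_ci(with_motion, without_motion)
```

Pairing by sequence matters here because sequence difficulty varies widely. Comparing the two trackers on the same sequences removes that shared variance, so the interval reflects the difference between methods rather than the spread across scenes.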

Circularity Check

0 steps flagged

No circularity: empirical evaluation on independent benchmarks with standard components

full rationale

The paper presents SatSAM2 as a zero-shot tracker combining publicly available SAM2 with two new modules (KFCMM using Kalman filtering for motion cues and MCSM for state regulation). Performance claims such as the 5.84% AUC gain on OOTB and outperformance on MVOT are reported as outcomes of experiments on external satellite tracking benchmarks plus the authors' synthetic MVOT dataset. No equations, fitted parameters, or self-citations appear in the provided text that would make any prediction or uniqueness claim reduce to the input by construction. The derivation is a standard modular engineering extension evaluated externally and remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The approach rests on the effectiveness of two newly introduced modules and the assumption that synthetic data captures real satellite challenges; these are not drawn from prior independent evidence.

axioms (1)
  • domain assumption · Object motion in satellite videos can be adequately modeled as linear with Gaussian noise for Kalman filtering purposes
    Invoked as the basis for the KFCMM to constrain tracking and suppress drift.
invented entities (2)
  • Kalman Filter-based Constrained Motion Module (KFCMM) · no independent evidence
    purpose: Exploit temporal motion cues and suppress drift during tracking
    New module proposed to adapt SAM2 for satellite domain.
  • Motion-Constrained State Machine (MCSM) · no independent evidence
    purpose: Regulate tracking states based on motion dynamics and reliability
    New component to handle occlusions and track loss.
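
The domain assumption in the ledger above, linear motion with Gaussian noise, corresponds to the standard constant-velocity state-space model. A minimal statement of that model follows. The choice of a four-dimensional state (image-plane position and velocity) is an assumption made here for illustration; this page does not say which quantities the paper's filter actually tracks.

```latex
\[
\begin{aligned}
\mathbf{x}_k &= \begin{bmatrix} p_x & p_y & v_x & v_y \end{bmatrix}^{\top}, \\
\mathbf{x}_k &= F\,\mathbf{x}_{k-1} + \mathbf{w}_{k-1},
  \qquad \mathbf{w}_{k-1} \sim \mathcal{N}(\mathbf{0},\, Q), \\
\mathbf{z}_k &= H\,\mathbf{x}_k + \mathbf{n}_k,
  \qquad \mathbf{n}_k \sim \mathcal{N}(\mathbf{0},\, R), \\
F &= \begin{bmatrix}
  1 & 0 & \Delta t & 0 \\
  0 & 1 & 0 & \Delta t \\
  0 & 0 & 1 & 0 \\
  0 & 0 & 0 & 1
\end{bmatrix},
\qquad
H = \begin{bmatrix}
  1 & 0 & 0 & 0 \\
  0 & 1 & 0 & 0
\end{bmatrix}.
\end{aligned}
\]
```

Here $\Delta t$ is the time between frames, $Q$ is the process noise covariance, and $R$ is the measurement noise covariance. Acceleration, parallax, and irregular frame rates, the effects raised in the referee report, all appear in this model only through $Q$ and $\Delta t$, which is why the sensitivity of results to those two quantities is a natural thing to test.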

pith-pipeline@v0.9.0 · 5531 in / 1457 out tokens · 64059 ms · 2026-05-17T06:06:53.957834+00:00 · methodology

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages

  1. [1] Goutam Bhat, Martin Danelljan, Luc Van Gool, and Radu Timofte. Learning discriminative model prediction for tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
  2. [2] Goutam Bhat, Martin Danelljan, Luc Van Gool, and Radu Timofte. Learning discriminative model prediction for tracking. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 6181–6190, 2019.
  3. [3] Goutam Bhat, Martin Danelljan, Luc Van Gool, and Radu Timofte. Know your surroundings: Exploiting scene information for object tracking. In Computer Vision – ECCV 2020, pages 205–221, Cham, 2020. Springer International Publishing.
  4. [4] Yuzeng Chen, Yuqi Tang, Zhiyong Yin, Te Han, Bin Zou, and Huihui Feng. Single object tracking in satellite videos: A correlation filter-based dual-flow tracker. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 15:6687–6698, 2022.
  5. [5] Yuzeng Chen, Yuqi Tang, Yi Xiao, Qiangqiang Yuan, Yuwei Zhang, Fengqing Liu, Jiang He, and Liangpei Zhang. Satellite video single object tracking: A systematic review and an oriented object tracking benchmark. ISPRS Journal of Photogrammetry and Remote Sensing, 210:212–240, 2024.
  6. [6] Martin Danelljan, Luc Van Gool, and Radu Timofte. Probabilistic regression for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  7. [7] Patrick Dendorfer, Hamid Rezatofighi, Anton Milan, et al. Mot20: A benchmark for multi object tracking in crowded scenes. arXiv preprint arXiv:2003.09003, 2020.
  8. [8] Mohinder S. Grewal. Kalman Filtering, pages 1285–1289. Springer Berlin Heidelberg, Berlin, Heidelberg, 2025.
  9. [9] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16000–16009, 2022.
  10. [10] R. E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45,
  11. [11] Masoud Khodarahmi and Vafa Maihami. A review on kalman filter models. Archives of Computational Methods in Engineering, 30(1):727–747, 2023.
  12. [12] Youngjoo Kim and Hyochoong Bang. Introduction to kalman filter and its applications. In Introduction and implementations of the Kalman filter. IntechOpen, 2018.
  13. [13] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4015–4026, 2023.
  14. [14] Pujian Lai, Meili Zhang, Gong Cheng, Shengyang Li, Xiankai Huang, and Junwei Han. Target-aware transformer for satellite video object tracking. IEEE Transactions on Geoscience and Remote Sensing, 62:1–10, 2024.
  15. [15] Jinpeng Li, Jun He, Weijia Li, Jiabin Chen, and Jinhua Yu. Roadcorrector: A structure-aware road extraction method for road connectivity and topology correction. IEEE Transactions on Geoscience and Remote Sensing, 62:1–18, 2024.
  16. [16] Shengyang Li, Zhuang Zhou, Manqi Zhao, Jian Yang, Weilong Guo, Yixuan Lv, Longxuan Kou, Han Wang, and Yanfeng Gu. A multitask benchmark dataset for satellite video: Object detection, tracking, and segmentation. IEEE Transactions on Geoscience and Remote Sensing, 61:1–21, 2023.
  17. [17] Weijia Li, Yawen Lai, Linning Xu, Yuanbo Xiangli, Jinhua Yu, Conghui He, Gui-Song Xia, and Dahua Lin. Omnicity: Omnipotent city understanding with multi-level and multi-view images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17397–17407, 2023.
  18. [18] Weijia Li, Wenqian Zhao, Jinhua Yu, Juepeng Zheng, Conghui He, Haohuan Fu, and Dahua Lin. Joint semantic–geometric learning for polygonal building segmentation from high-resolution remote sensing images. ISPRS Journal of Photogrammetry and Remote Sensing, 201:26–37, 2023.
  19. [19] Weijia Li, Haote Yang, Zhenghao Hu, Juepeng Zheng, Gui-Song Xia, and Conghui He. 3d building reconstruction from monocular remote sensing images with multi-level supervisions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 27728–27737, 2024.
  20. [20] Yangfan Li, Chunjiang Bian, and Hongzhen Chen. Object tracking in satellite videos: Correlation particle filter tracking method with motion estimation by kalman filter. IEEE Transactions on Geoscience and Remote Sensing, 60:1–12,
  21. [21] Yixuan Li, Lihan Jiang, Linning Xu, Yuanbo Xiangli, Zhenzhi Wang, Dahua Lin, and Bo Dai. Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3205–3215, 2023.
  22. [22] Liting Lin, Heng Fan, Zhipeng Zhang, Yaowei Wang, Yong Xu, and Haibin Ling. Tracking meets lora: Faster training, larger model, stronger performance. In ECCV, 2024.
  23. [23] Christoph Mayer, Martin Danelljan, Danda Pani Paudel, and Luc Van Gool. Learning target candidate association to keep track of what not to track. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 13444–13454, 2021.
  24. [24] Christoph Mayer, Martin Danelljan, Goutam Bhat, Matthieu Paul, Danda Pani Paudel, Fisher Yu, and Luc Van Gool. Transforming model prediction for tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8731–8740, 2022.
  25. [25] Anton Milan, Laura Leal-Taixé, Konrad Schindler, and Ian Reid. Joint tracking and segmentation of multiple targets. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5397–5406, 2015.
  26. [26] Anton Milan, Laura Leal-Taixé, Ian Reid, Stefan Roth, and Konrad Schindler. Motchallenge: A benchmark for single-camera multiple target tracking. International Journal of Computer Vision, 2021.
  27. [27] Matthieu Paul, Martin Danelljan, Christoph Mayer, and Luc Van Gool. Robust visual tracking by segmentation. In Computer Vision – ECCV 2022, pages 571–588, Cham, 2022. Springer Nature Switzerland.
  28. [28] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. In The Thirteenth Int...
  29. [29] Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, and Christoph Feichtenhofer. Hiera: A hierarchical vision transformer without the bells-and-whistles. ICML, 2023.
  30. [30] Zhe Shan, Yang Liu, Lei Zhou, Cheng Yan, Heng Wang, and Xia Xie. Ros-sam: High-quality interactive segmentation for remote sensing moving object. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3625–3635, 2025.
  31. [31] Jia Shao, Bo Du, Chen Wu, Mingming Gong, and Tongliang Liu. Hrsiam: High-resolution siamese network, towards space-borne satellite video tracking. IEEE Transactions on Image Processing, 30:3056–3068, 2021.
  32. [32] Jiahao Wang, Fang Liu, Licheng Jiao, Yingjia Gao, Hao Wang, Lingling Li, Puhua Chen, Xu Liu, and Shuo Li. Satellite video object tracking based on location prompts. IEEE Transactions on Circuits and Systems for Video Technology, 34(7):6253–6264, 2024.
  33. [33] Greg Welch and Gary Bishop. An introduction to the kalman filter. Technical Report TR 95-041, University of North Carolina at Chapel Hill, Department of Computer Science, 1995.
  34. [34] Jinxia Xie, Bineng Zhong, Zhiyi Mo, Shengping Zhang, Liangtao Shi, Shuxiang Song, and Rongrong Ji. Autoregressive queries for adaptive tracking with spatio-temporal transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19300–19309, 2024.
  35. [35] Cheng-Yen Yang, Hsiang-Wei Huang, Wenhao Chai, Zhongyu Jiang, and Jenq-Neng Hwang. Samurai: Adapting segment anything model for zero-shot visual tracking with motion-aware memory, 2024.
  36. [36] Jinyu Yang, Mingqi Gao, Zhe Li, Shanghua Gao, Fang Wang, and Fengcai Zheng. Track anything: Segment anything meets videos. arXiv preprint arXiv:2304.11968, 2023.
  37. [37] Junyan Ye, Qiyan Luo, Jinhua Yu, Huaping Zhong, Zhimeng Zheng, Conghui He, and Weijia Li. Sg-bev: Satellite-guided bev fusion for cross-view semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 27748–27757, 2024.
  38. [38] Qian Yin, Qingyong Hu, Hao Liu, Feng Zhang, Yingqian Wang, Zaiping Lin, Wei An, and Yulan Guo. Detecting and tracking small and dense moving objects in satellite videos: A benchmark. IEEE Transactions on Geoscience and Remote Sensing, 60:1–18, 2021.
  39. [39] Qian Yin, Qingyong Hu, Hao Liu, Feng Zhang, Yingqian Wang, Zaiping Lin, Wei An, and Yulan Guo. Detecting and tracking small and dense moving objects in satellite videos: A benchmark. IEEE Transactions on Geoscience and Remote Sensing, 60:1–18, 2022.
  40. [40] Manqi Zhao, Shengyang Li, Shiyu Xuan, Longxuan Kou, Shuai Gong, and Zhuang Zhou. Satsot: A benchmark dataset for satellite video single object tracking. IEEE Transactions on Geoscience and Remote Sensing, 60:1–11, 2022.
  41. [41] Joey Tianyi Zhou, Jiawei Du, Hongyuan Zhu, Xi Peng, Yong Liu, and Rick Siow Mong Goh. Anomalynet: An anomaly detection network for video surveillance. IEEE Transactions on Information Forensics and Security, 14(10):2537–2550,
  42. [42] SatSAM2: Motion-Constrained Video Object Tracking in Satellite Imagery using Promptable SAM2 and Kalman Priors. Supplementary Material. This document serves as a comprehensive supplement to the main manuscript, providing detailed implementation specifications, extended experimental analysis, and qualitative visualizations that ...