Dynamic Pondering Sparsity-aware Mixture-of-Experts Transformer for Event Stream based Visual Object Tracking
Pith reviewed 2026-05-08 13:52 UTC · model grok-4.3
The pith
A sparsity-aware mixture-of-experts Vision Transformer processes event streams at multiple densities and adapts inference depth to track objects efficiently.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Progressively injecting sparse, medium-density, and dense event search regions into a three-stage Vision Transformer backbone, augmented by a sparsity-aware Mixture-of-Experts module and a dynamic pondering strategy, produces hierarchical multi-density features and allows inference depth to scale with tracking difficulty, resulting in a favorable accuracy-efficiency trade-off on FE240hz, COESOT, and EventVOT.
What carries the argument
The sparsity-aware Mixture-of-Experts module inside the three-stage Vision Transformer that routes event features according to local density while the dynamic pondering gate controls how many stages run per frame.
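A hedged sketch of what density-aware routing could look like: a top-1 mixture-of-experts layer whose gate input is the token features concatenated with a local event-density scalar. The class name, dimensions, and the density feature are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class SparsityAwareMoE:
    """Top-1 mixture-of-experts whose router sees a per-token event-density
    scalar in addition to the token features (illustrative sketch only)."""
    def __init__(self, dim, num_experts):
        self.experts = [rng.normal(0, 0.02, (dim, dim)) for _ in range(num_experts)]
        # Router weights take token features plus one extra density channel.
        self.router = rng.normal(0, 0.02, (dim + 1, num_experts))

    def __call__(self, tokens, density):
        # tokens: (N, dim); density: (N,) fraction of active event pixels per token.
        gate_in = np.concatenate([tokens, density[:, None]], axis=1)
        probs = softmax(gate_in @ self.router)   # (N, num_experts)
        choice = probs.argmax(axis=1)            # top-1 routing decision
        out = np.empty_like(tokens)
        for e, W in enumerate(self.experts):
            mask = choice == e
            # Scaling by the gate probability is what keeps the routing
            # differentiable in a real autograd implementation.
            out[mask] = (tokens[mask] @ W) * probs[mask, e][:, None]
        return out, choice

moe = SparsityAwareMoE(dim=16, num_experts=4)
tokens = rng.normal(size=(8, 16))
density = rng.uniform(0, 1, size=8)
out, choice = moe(tokens, density)
print(out.shape, choice.shape)  # (8, 16) (8,)
```

Feeding density into the gate is the minimal way to let experts specialize on sparsity patterns; whether the paper conditions the router this way or learns it implicitly from the tokens is not stated in the abstract.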
If this is right
- Trackers can avoid the suboptimal fixed temporal window by letting density-aware stages supply the right scale of motion information automatically.
- Average compute drops because the pondering gate can exit early on frames where coarse features already suffice for reliable association.
- Expert specialization under different sparsity patterns improves feature quality for both slow-drift and high-speed cases within the same model.
- The same architecture can be deployed on resource-constrained hardware by capping maximum depth while retaining the accuracy of deeper runs only when needed.
- Event-based tracking becomes viable for continuous operation in robotics or surveillance without constant full-model evaluation.
Where Pith is reading between the lines
- The density-progression idea could transfer directly to other sparse asynchronous sensors such as neuromorphic audio or LiDAR event streams.
- Dynamic depth control suggests a route to energy-aware edge tracking where battery or thermal limits dictate the maximum stages allowed per frame.
- If the three-stage granularity proves insufficient for extreme motions, adding a fourth ultra-dense stage would be a natural, testable extension.
Load-bearing premise
That the progressive injection of event regions at three fixed density levels into successive transformer stages will produce features that generalize across motion speeds and scene types without further per-dataset retuning.
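To make the "three fixed density levels" premise concrete, here is a minimal sketch that accumulates one event frame per temporal window ending at the current timestamp: shorter windows yield sparser frames, longer windows denser ones. The window lengths, resolution, and signed-count representation are assumptions for illustration, not the paper's sampling scheme.

```python
import numpy as np

def multi_density_frames(events, t_ref, windows=(5e3, 20e3, 80e3), hw=(64, 64)):
    """Accumulate an event stream of rows (t, x, y, p) into one frame per
    temporal window ending at t_ref. Window lengths (in microseconds here)
    are illustrative; shorter windows give sparser frames."""
    h, w = hw
    frames = []
    for win in windows:
        frame = np.zeros((h, w), dtype=np.float32)
        sel = (events[:, 0] >= t_ref - win) & (events[:, 0] <= t_ref)
        for t, x, y, p in events[sel]:
            frame[int(y), int(x)] += 1.0 if p > 0 else -1.0  # signed event count
        frames.append(frame)
    return frames  # [sparse, medium, dense]

rng = np.random.default_rng(1)
n = 500
events = np.column_stack([
    rng.uniform(0, 80e3, n),   # timestamps over an 80 ms span
    rng.integers(0, 64, n),    # x coordinate
    rng.integers(0, 64, n),    # y coordinate
    rng.choice([-1, 1], n),    # polarity
])
sparse, medium, dense = multi_density_frames(events, t_ref=80e3)
```

Because the three windows are nested, the dense frame always contains a superset of the events in the sparse one; the premise under test is whether three fixed window lengths suffice across motion speeds without per-dataset retuning.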
What would settle it
Running the tracker on a fourth event dataset whose motion statistics or event density distribution lie well outside the ranges of FE240hz, COESOT, and EventVOT and checking whether accuracy falls below the reported trade-off curve.
Original abstract
Despite significant progress, RGB-based trackers remain vulnerable to challenging imaging conditions, such as low illumination and fast motion. Event cameras offer a promising alternative by asynchronously capturing pixel-wise brightness changes, providing high dynamic range and high temporal resolution. However, existing event-based trackers often neglect the intrinsic spatial sparsity and temporal density of event data, while relying on a single fixed temporal-window sampling strategy that is suboptimal under varying motion dynamics. In this paper, we propose an event sparsity-aware tracking framework that explicitly models event-density variations across multiple temporal scales. Specifically, the proposed framework progressively injects sparse, medium-density, and dense event search regions into a three-stage Vision Transformer backbone, enabling hierarchical multi-density feature learning. Furthermore, we introduce a sparsity-aware Mixture-of-Experts module to encourage expert specialization under different sparsity patterns, and design a dynamic pondering strategy to adaptively adjust the inference depth according to tracking difficulty. Extensive experiments on FE240hz, COESOT, and EventVOT demonstrate that the proposed approach achieves a favorable trade-off between tracking accuracy and computational efficiency. The source code will be released on https://github.com/Event-AHU/OpenEvTracking.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a Dynamic Pondering Sparsity-aware Mixture-of-Experts Transformer for event stream based visual object tracking. It addresses limitations of existing event-based trackers by modeling event-density variations via progressive injection of sparse, medium-density, and dense event search regions into a three-stage Vision Transformer backbone for hierarchical multi-density feature learning. A sparsity-aware MoE module encourages expert specialization under different sparsity patterns, and a dynamic pondering strategy adaptively adjusts inference depth according to tracking difficulty. Experiments on FE240hz, COESOT, and EventVOT are reported to demonstrate a favorable accuracy-efficiency trade-off, with code to be released.
Significance. If the empirical results and ablations hold, the work could meaningfully advance event-based tracking by explicitly handling intrinsic spatial sparsity and temporal density variations, which are often neglected. The multi-stage injection, sparsity-aware MoE, and dynamic depth adjustment provide a principled way to adapt to varying motion dynamics, potentially improving robustness in low-illumination and fast-motion scenarios. Code release is a clear strength for reproducibility.
Major comments (2)
- Abstract: the central claim of a 'favorable trade-off between tracking accuracy and computational efficiency' is stated without any quantitative metrics, error bars, or baseline comparisons; this makes the empirical contribution difficult to assess from the summary alone and requires the full experimental section to carry the load.
- §3 (Method, dynamic pondering): the strategy for adaptively adjusting inference depth is described at a high level but lacks an explicit formulation, threshold, or loss term; without this, it is unclear whether the adaptivity is learned end-to-end or relies on heuristic rules that could require dataset-specific tuning.
Minor comments (3)
- Abstract: consider inserting one or two concrete performance numbers (e.g., success rate or FPS gains) to make the claimed trade-off immediately verifiable.
- Related work: ensure coverage of recent event-based trackers that also exploit sparsity or MoE-style routing; a short comparison table would clarify novelty.
- Figures: captions for the architecture diagram should explicitly label the three event-density injection stages and the MoE routing.
Simulated Author's Rebuttal
We thank the referee for the constructive review and the recommendation for minor revision. We address each major comment below and will update the manuscript accordingly to improve clarity.
Point-by-point responses
- Referee: Abstract: the central claim of a 'favorable trade-off between tracking accuracy and computational efficiency' is stated without any quantitative metrics, error bars, or baseline comparisons; this makes the empirical contribution difficult to assess from the summary alone and requires the full experimental section to carry the load.
  Authors: We agree that the abstract would be strengthened by including quantitative support for the central claim. In the revised manuscript, we will update the abstract to report key results, such as the achieved precision and FPS values with comparisons to baselines on FE240hz, COESOT, and EventVOT, while retaining the high-level summary style typical for abstracts. (revision: yes)
- Referee: §3 (Method, dynamic pondering): the strategy for adaptively adjusting inference depth is described at a high level but lacks an explicit formulation, threshold, or loss term; without this, it is unclear whether the adaptivity is learned end-to-end or relies on heuristic rules that could require dataset-specific tuning.
  Authors: We appreciate this observation. The dynamic pondering mechanism is intended to be fully end-to-end trainable. In the revised version, we will add the explicit mathematical formulation in Section 3, including the threshold computation, the depth adjustment rule, and the auxiliary loss term that enables learning without dataset-specific heuristics. (revision: yes)
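The formulation the authors promise would plausibly resemble Adaptive Computation Time (Graves, 2016): run stages until a cumulative halting probability crosses a threshold, and penalize the resulting "ponder cost" with an auxiliary loss. The sketch below is a simplified, illustrative version of such a rule over three backbone stages; the threshold form and stage granularity are assumptions, not the paper's actual mechanism.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ponder_depth(stage_scores, eps=0.01):
    """ACT-style halting over transformer stages: run stages until the
    cumulative halting probability exceeds 1 - eps. Returns the depth used
    and the ponder cost (stages run + remainder) that an auxiliary training
    loss would penalize. stage_scores are pre-sigmoid halting logits."""
    cum = 0.0
    for n, score in enumerate(stage_scores, start=1):
        h = sigmoid(score)
        if cum + h >= 1.0 - eps or n == len(stage_scores):
            remainder = 1.0 - cum
            return n, n + remainder   # depth used, ponder cost
        cum += h
    # unreachable: the loop always returns on the last stage

# Easy frame: a confident halting signal exits after stage 1.
depth_easy, _ = ponder_depth([6.0, 0.0, 0.0])
# Hard frame: weak halting signals run all three stages.
depth_hard, _ = ponder_depth([-4.0, -4.0, -4.0])
print(depth_easy, depth_hard)  # 1 3
```

Minimizing the ponder cost pushes the gate toward early exits, while the tracking loss pushes it toward deeper runs on hard frames; the balance between the two is exactly the kind of detail the referee asks the revision to spell out.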
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper is an empirical architecture proposal for event-stream tracking. It describes a three-stage ViT backbone with progressive sparse/medium/dense event injection, a sparsity-aware MoE module, and a dynamic pondering strategy for inference depth. The central claim is an experimental accuracy-efficiency trade-off on FE240hz, COESOT, and EventVOT benchmarks. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or method description. The design choices are presented as novel combinations motivated by event data properties, with no reduction of outputs to inputs by construction. This is a standard self-contained engineering contribution evaluated on external benchmarks.