TAPNext++: What's Next for Tracking Any Point (TAP)?
Pith reviewed 2026-05-10 15:24 UTC · model grok-4.3
The pith
A recurrent transformer tracks any point through videos orders of magnitude longer than before by training on 1024-frame clips and adding re-detection-focused augmentations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TAPNext++ extends the recurrent transformer of TAPNext by training on 1024-frame sequences enabled by sequence parallelism, introducing periodic-roll augmentations that simulate points re-entering the frame, and adding supervision for occluded points. These modifications allow the model to track points across sequences orders of magnitude longer than before while keeping the same low memory and compute footprint. The paper also defines the Re-Detection Average Jaccard metric to quantify performance on re-appearing points and shows that the resulting model achieves new state-of-the-art results on multiple point-tracking benchmarks.
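The abstract describes periodic roll only as a geometric augmentation that "simulates point re-entries", so the following is a minimal sketch of one plausible reading, assuming a horizontal roll whose offset grows linearly with time: scene content (and every ground-truth track) drifts out of one image border and wraps back in at the opposite one. The function name and the `speed` parameter are illustrative, not taken from the paper.

```python
import numpy as np

def periodic_roll(video, tracks, speed=2):
    """Hypothetical periodic-roll augmentation (a sketch, not the paper's exact recipe).

    video:  (T, H, W, C) uint8 frames
    tracks: (T, N, 2) ground-truth point coordinates as (x, y)
    speed:  horizontal shift in pixels per frame (illustrative parameter)

    Frame t is rolled horizontally by (speed * t) pixels, so content and every
    tracked point drift out of one border and re-enter from the opposite one.
    """
    T, H, W, _ = video.shape
    rolled_video = np.empty_like(video)
    rolled_tracks = tracks.astype(np.float64).copy()
    for t in range(T):
        shift = (speed * t) % W
        rolled_video[t] = np.roll(video[t], shift, axis=1)   # roll along width
        rolled_tracks[t, :, 0] = (tracks[t, :, 0] + shift) % W
    return rolled_video, rolled_tracks
```

Whether the paper also toggles visibility flags while a point sits near the wrap boundary is not stated here; this sketch leaves visibility untouched.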
What carries the argument
A recurrent video transformer trained on long sequences via sequence parallelism, with periodic-roll augmentation and supervision of occluded points
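"Supervising occluded points" plausibly means the position loss stays active on frames where ground-truth visibility is zero (synthetic datasets such as Kubric and PointOdyssey provide coordinates for occluded points), instead of masking those frames out. A hedged sketch of that contrast, assuming a simple L1-plus-cross-entropy loss form that is illustrative rather than the paper's:

```python
import numpy as np

def tracking_loss(pred_xy, pred_vis_logit, gt_xy, gt_vis, supervise_occluded=True):
    """Sketch of a point-tracking loss with optional occlusion supervision.

    pred_xy:        (T, N, 2) predicted coordinates
    pred_vis_logit: (T, N) predicted visibility logits
    gt_xy:          (T, N, 2) ground-truth coordinates (defined even when occluded)
    gt_vis:         (T, N) ground-truth visibility in {0, 1}
    """
    # Position loss: on all frames (occlusion supervision) or only where the
    # point is visible (the conventional masking).
    err = np.abs(pred_xy - gt_xy).sum(-1)                      # (T, N) L1 error
    mask = np.ones_like(gt_vis) if supervise_occluded else gt_vis
    pos_loss = (err * mask).sum() / np.maximum(mask.sum(), 1)

    # Visibility loss: binary cross-entropy on the logit, always applied.
    p = 1.0 / (1.0 + np.exp(-pred_vis_logit))
    eps = 1e-7
    vis_loss = -(gt_vis * np.log(p + eps) + (1 - gt_vis) * np.log(1 - p + eps)).mean()
    return pos_loss + vis_loss
```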
If this is right
- Points can be tracked reliably across much longer continuous video sequences without increasing memory usage.
- Re-detection of points that leave and re-enter the frame becomes measurably better, as captured by the new AJ_RD metric.
- The same low-latency, frame-by-frame online tracking capability is retained while accuracy improves.
- New state-of-the-art scores are reached on existing TAP benchmarks without changing the core architecture.
- Applications in AR, XR, and robotics gain access to more robust long-term point correspondence.
Where Pith is reading between the lines
- The same sequence-parallel training recipe could be applied to other recurrent video models that currently struggle with long horizons.
- Periodic-roll augmentation may prove useful for any task that requires modeling objects re-entering the field of view.
- If the re-detection improvements hold, downstream systems that rely on stable point tracks over minutes rather than seconds become more feasible.
- Extending the approach to multi-camera or 3D-aware settings would be a direct next test of generality.
Load-bearing premise
The performance gains from long-sequence training and the chosen re-detection augmentations will continue to appear on videos drawn from distributions different from the training data and the evaluated benchmarks.
What would settle it
A controlled test on videos exceeding 1024 frames or containing occlusion patterns absent from the training distribution where TAPNext++ no longer outperforms the original TAPNext or other baselines on the standard TAP metrics.
Original abstract
Tracking-Any-Point (TAP) models aim to track any point through a video which is a crucial task in AR/XR and robotics applications. The recently introduced TAPNext approach proposes an end-to-end, recurrent transformer architecture to track points frame-by-frame in a purely online fashion -- demonstrating competitive performance at minimal latency. However, we show that TAPNext struggles with longer video sequences and also frequently fails to re-detect query points that reappear after being occluded or leaving the frame. In this work, we present TAPNext++, a model that tracks points in sequences that are orders of magnitude longer while preserving the low memory and compute footprint of the architecture. We train the recurrent video transformer using several data-driven solutions, including training on long 1024-frame sequences enabled by sequence parallelism techniques. We highlight that re-detection performance is a blind spot in the current literature and introduce a new metric, Re-Detection Average Jaccard ($AJ_{RD}$), to explicitly evaluate tracking on re-appearing points. To improve re-detection of points, we introduce tailored geometric augmentations, such as periodic roll that simulates point re-entries, and supervising occluded points. We demonstrate that recurrent transformers can be substantially improved for point tracking and set a new state-of-the-art on multiple benchmarks. Model and code can be found at https://tap-next-plus-plus.github.io.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces TAPNext++, an extension of the TAPNext recurrent transformer architecture for Tracking Any Point (TAP) in videos. It addresses limitations in long-sequence tracking and re-detection after occlusions or frame exits by training on 1024-frame clips via sequence parallelism, introducing periodic-roll geometric augmentation to simulate re-entries, adding occlusion supervision, and proposing a new Re-Detection Average Jaccard metric (AJ_RD). The authors claim these changes enable tracking in sequences orders of magnitude longer while preserving low memory/compute footprint, yielding new state-of-the-art results on multiple benchmarks, with code and model released.
Significance. If the performance claims and generalization hold, the work would advance online point tracking for extended videos, a key capability for AR/XR and robotics. The focus on recurrent transformers with explicit handling of re-detection and the new AJ_RD metric fill a noted gap in the literature. Code release supports reproducibility and follow-on work.
Major comments (3)
- [§4] §4 (Experiments): The central SOTA claims rest on comparisons to baselines and ablations, but the manuscript provides no full experimental details, ablation tables isolating sequence-parallel training from the periodic-roll/occlusion augmentations, or quantitative results on sequence lengths substantially exceeding 1024 frames. This prevents verification of whether the reported gains are load-bearing or distribution-specific.
- [§5] §5 (Results and AJ_RD): The new AJ_RD metric is introduced to evaluate re-detection, yet its precise formulation, relation to standard AJ, and how occluded/re-entering points are annotated in the benchmarks are not specified. Without this, it is unclear whether AJ_RD provides an independent signal or simply re-weights existing failure modes.
- [§3.2] §3.2 (Augmentations and Training): The periodic-roll augmentation and occlusion supervision are presented as key to re-detection gains, but no cross-dataset or out-of-distribution evaluation (e.g., videos with different camera motion statistics or re-appearance patterns) is reported. This leaves the generalization assumption—that these techniques produce robust improvements beyond the training distribution—unverified.
Minor comments (2)
- [Abstract] The abstract states the model tracks 'sequences that are orders of magnitude longer' without stating the maximum sequence length evaluated or the memory-footprint numbers relative to TAPNext.
- [§3] Notation for AJ_RD is introduced without an equation or pseudocode definition in the main text, which would aid clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper accordingly to improve experimental transparency, metric clarity, and generalization analysis.
Point-by-point responses
Referee: [§4] §4 (Experiments): The central SOTA claims rest on comparisons to baselines and ablations, but the manuscript provides no full experimental details, ablation tables isolating sequence-parallel training from the periodic-roll/occlusion augmentations, or quantitative results on sequence lengths substantially exceeding 1024 frames. This prevents verification of whether the reported gains are load-bearing or distribution-specific.
Authors: We agree that fuller experimental details and targeted ablations are needed for verification. In the revised manuscript we will add complete experimental protocols, ablation tables that isolate sequence-parallel training from periodic-roll augmentation and occlusion supervision, and quantitative results on sequences longer than 1024 frames (including 2048-frame evaluations) to demonstrate scalability beyond the training regime. revision: yes
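Sequence parallelism for a linear recurrence rests on the fact that each chunk of the sequence applies an affine map to the incoming hidden state, so chunks can be scanned independently and then stitched together with a cheap combine step. A toy sketch of this idea for a scalar recurrence h_t = a_t * h_{t-1} + b_t; the paper's implementation distributes this pattern across GPUs, and nothing below is from its code:

```python
import numpy as np

def scan(a, b, h0=0.0):
    """Sequential reference: h_t = a_t * h_{t-1} + b_t."""
    h, out = h0, []
    for at, bt in zip(a, b):
        h = at * h + bt
        out.append(h)
    return np.array(out)

def chunked_scan(a, b, n_chunks, h0=0.0):
    """Sketch of sequence-parallel evaluation of the same recurrence.

    Each chunk is summarised by the affine map h -> A*h + B that it applies to
    the incoming state (A = product of its a's, B = its local scan started from
    zero). Phases 1 and 3 parallelise across chunks; phase 2 is a small scan
    over one carry value per chunk.
    """
    a_chunks = np.array_split(np.asarray(a), n_chunks)
    b_chunks = np.array_split(np.asarray(b), n_chunks)
    # Phase 1 (parallelisable): per-chunk local scans and transition summaries.
    local = [scan(ac, bc, 0.0) for ac, bc in zip(a_chunks, b_chunks)]
    A = [np.prod(ac) for ac in a_chunks]
    B = [loc[-1] for loc in local]
    # Phase 2 (cheap): propagate the carry state across chunk boundaries.
    carries, h = [], h0
    for Ai, Bi in zip(A, B):
        carries.append(h)
        h = Ai * h + Bi
    # Phase 3 (parallelisable): correct each local scan with its carry,
    # since h_t = local_t + (a_1 * ... * a_t) * carry within a chunk.
    out = [loc + np.cumprod(ac) * c for ac, loc, c in zip(a_chunks, local, carries)]
    return np.concatenate(out)
```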
Referee: [§5] §5 (Results and AJ_RD): The new AJ_RD metric is introduced to evaluate re-detection, yet its precise formulation, relation to standard AJ, and how occluded/re-entering points are annotated in the benchmarks are not specified. Without this, it is unclear whether AJ_RD provides an independent signal or simply re-weights existing failure modes.
Authors: We will expand Section 5 with the exact mathematical definition of AJ_RD, its relation to standard Average Jaccard, and the annotation rules used for occluded and re-entering points in the benchmarks. AJ_RD averages the Jaccard score exclusively over frames in which a point reappears after occlusion or frame exit, thereby isolating re-detection performance rather than re-weighting all failures. revision: yes
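Under the description given in this response (Jaccard averaged exclusively over frames where a point reappears after occlusion or frame exit), one plausible concrete form is the TAP-Vid Jaccard score, averaged over the usual distance thresholds, restricted to re-appearance frames. A sketch in which all function names and the threshold set are assumed rather than taken from the paper:

```python
import numpy as np

def reappearance_mask(gt_vis):
    """Frames where a point is visible after having been hidden at least once
    earlier in the video. gt_vis: (T, N) ground-truth visibility in {0, 1}."""
    was_hidden = np.cumsum(1 - gt_vis, axis=0) > 0
    prev_hidden = np.zeros_like(was_hidden)
    prev_hidden[1:] = was_hidden[:-1]          # exclude the current frame itself
    return (gt_vis == 1) & prev_hidden

def jaccard_at(pred_xy, pred_vis, gt_xy, gt_vis, frame_mask, thresh):
    """TAP-Vid-style Jaccard at one distance threshold, on masked frames only."""
    within = np.linalg.norm(pred_xy - gt_xy, axis=-1) < thresh
    tp = (gt_vis & pred_vis & within & frame_mask).sum()
    fp = (pred_vis & ~(gt_vis & within) & frame_mask).sum()
    fn = (gt_vis & ~(pred_vis & within) & frame_mask).sum()
    return tp / max(tp + fp + fn, 1)

def aj_rd(pred_xy, pred_vis, gt_xy, gt_vis, thresholds=(1, 2, 4, 8, 16)):
    """Sketch of Re-Detection Average Jaccard: Jaccard averaged over distance
    thresholds, restricted to re-appearance frames."""
    mask = reappearance_mask(gt_vis)
    return float(np.mean([jaccard_at(pred_xy, pred_vis.astype(bool),
                                     gt_xy, gt_vis.astype(bool), mask, t)
                          for t in thresholds]))
```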
Referee: [§3.2] §3.2 (Augmentations and Training): The periodic-roll augmentation and occlusion supervision are presented as key to re-detection gains, but no cross-dataset or out-of-distribution evaluation (e.g., videos with different camera motion statistics or re-appearance patterns) is reported. This leaves the generalization assumption—that these techniques produce robust improvements beyond the training distribution—unverified.
Authors: Our current results are reported across multiple benchmarks that already exhibit varied camera motions and re-appearance statistics. We nevertheless acknowledge the value of explicit OOD testing and will add cross-dataset and out-of-distribution experiments in the revision to further substantiate generalization of the augmentations and supervision strategy. revision: partial
Circularity Check
No circularity in the derivation chain
Full rationale
The paper presents empirical improvements to the TAPNext architecture through explicit techniques: sequence-parallel training on 1024-frame clips, periodic-roll geometric augmentations, and occlusion supervision. These are independent additions that produce measured gains on standard TAP benchmarks plus a newly defined AJ_RD metric. No equations or claims reduce by construction to the inputs; the base TAPNext is referenced as prior work rather than a self-citation that bears the load of the new results. The derivation chain consists of standard training modifications and evaluation, remaining self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: the standard deep-learning premise that gradient-based optimization on augmented video data will yield models that generalize to unseen real-world videos.
Reference graph
Works this paper leans on
- [1] Görkay Aydemir, Xiongyi Cai, Weidi Xie, and Fatma Güney. Track-On: Transformer-based online point tracking with memory. In The Thirteenth International Conference on Learning Representations, 2025.
- [2] Görkay Aydemir, Weidi Xie, and Fatma Güney. Track-On2: Enhancing online point tracking with memory. arXiv preprint arXiv:2509.19115, 2025.
- [3] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
- [4] Seokju Cho, Jiahui Huang, Jisu Nam, Honggyu An, Seungryong Kim, and Joon-Young Lee. Local all-pair correspondence for point tracking. In European Conference on Computer Vision, pages 306–325. Springer, 2024.
- [5] Erwin Coumans and Yunfei Bai. PyBullet, a Python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016–2021.
- [6] Soham De, Samuel L Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, et al. Griffin: Mixing gated linear recurrences with local attention for efficient language models. arXiv preprint arXiv:2402.19427, 2024.
- [7] Carl Doersch, Ankush Gupta, Larisa Markeeva, Adria Recasens, Lucas Smaira, Yusuf Aytar, Joao Carreira, Andrew Zisserman, and Yi Yang. TAP-Vid: A benchmark for tracking any point in a video. Advances in Neural Information Processing Systems, 35:13610–13626, 2022.
- [8] Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, and Andrew Zisserman. TAPIR: Tracking any point with per-frame initialization and temporal refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10061–10072, 2023.
- [9] Carl Doersch, Pauline Luc, Yi Yang, Dilara Gokay, Skanda Koppula, Ankush Gupta, Joseph Heyward, Ignacio Rocco, Ross Goroshin, João Carreira, and Andrew Zisserman. BootsTAP: Bootstrapped training for tracking-any-point. In Proceedings of the Asian Conference on Computer Vision (ACCV), pages 3257–3274, 2024.
- [10] Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B McHugh, and Vincent Vanhoucke. Google Scanned Objects: A high-quality dataset of 3D scanned household items. In 2022 International Conference on Robotics and Automation (ICRA), pages 2553–2560. IEEE, 2022.
- [11] Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, et al. Kubric: A scalable dataset generator. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3749–3761, 2022.
- [12] Adam W Harley, Yang You, Xinglong Sun, Yang Zheng, Nikhil Raghuraman, Yunqi Gu, Sheldon Liang, Wen-Hsuan Chu, Achal Dave, Suya You, et al. AllTracker: Efficient dense point tracking at high resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5253–5262, 2025.
- [13] Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. CoTracker: It is better to track together. In European Conference on Computer Vision, pages 18–35. Springer, 2024.
- [14] Nikita Karaev, Yuri Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. CoTracker3: Simpler and better point tracking by pseudo-labelling real videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6013–6022, 2025.
- [15] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision, 128(7):1956–1981, 2020.
- [16] Hongyang Li, Hao Zhang, Shilong Liu, Zhaoyang Zeng, Feng Li, Bohan Li, Tianhe Ren, and Lei Zhang. TAPTRv2: Attention-based position update improves tracking any point. Advances in Neural Information Processing Systems, 37:101074–101095, 2024.
- [17] Hongyang Li, Hao Zhang, Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, and Lei Zhang. TAPTR: Tracking any point with transformers as detection. In European Conference on Computer Vision, pages 57–75. Springer, 2024.
- [18] Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. MegaSaM: Accurate, fast and robust structure and motion from casual dynamic videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10486–10496, 2025.
- [19] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
- [20] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- [21] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
- [22] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 DAVIS Challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.
- [23] Viorica Pătrăucean, Xu Owen He, Joseph Heyward, Chuhan Zhang, Mehdi S. M. Sajjadi, George-Cristian Muraru, Artem Zholus, Mahdi Karami, Ross Goroshin, Yutian Chen, Simon Osindero, João Carreira, and Razvan Pascanu. TRecViT: A recurrent video transformer, 2024.
- [24] Jinyuan Qu, Hongyang Li, Shilong Liu, Tianhe Ren, Zhaoyang Zeng, and Lei Zhang. TAPTRv3: Spatial and temporal context foster robust tracking of any point in long video. arXiv preprint arXiv:2411.18671, 2024.
- [25] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. Advances in Neural Information Processing Systems, 37:68658–68685, 2024.
- [26] Mel Vecerik, Carl Doersch, Yi Yang, Todor Davchev, Yusuf Aytar, Guangyao Zhou, Raia Hadsell, Lourdes Agapito, and Jon Scholz. RoboTAP: Tracking arbitrary points for few-shot visual imitation. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 5397–5403. IEEE, 2024.
- [27] Yuxi Xiao, Jianyuan Wang, Nan Xue, Nikita Karaev, Yuri Makarov, Bingyi Kang, Xing Zhu, Hujun Bao, Yujun Shen, and Xiaowei Zhou. SpatialTrackerV2: Advancing 3D point tracking with explicit camera motion. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6726–6737, 2025.
- [28] Greg Zaal, Rob Tuytel, Rico Cilliers, James Ray Cock, Andreas Mischok, Sergej Majboroda, Dimitrios Savva, and Jurita Burger. Poly Haven: A curated public asset library for visual effects artists and game designers, 2021.
- [29] Yang Zheng, Adam W. Harley, Bokui Shen, Gordon Wetzstein, and Leonidas J. Guibas. PointOdyssey: A large-scale synthetic dataset for long-term point tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19855–19865, 2023.
- [30] Artem Zholus, Carl Doersch, Yi Yang, Skanda Koppula, Viorica Patraucean, Xu Owen He, Ignacio Rocco, Mehdi S. M. Sajjadi, Sarath Chandar, and Ross Goroshin. TAPNext: Tracking any point (TAP) as next token prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9693–9703, 2025.