TAPNext++: What's Next for Tracking Any Point (TAP)?
Pith reviewed 2026-05-10 15:24 UTC · model grok-4.3
The pith
A recurrent transformer tracks any point through videos orders of magnitude longer than before by training on 1024-frame clips and adding re-detection-focused augmentations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TAPNext++ extends the recurrent transformer of TAPNext by training on 1024-frame sequences enabled by sequence parallelism, introducing periodic-roll augmentations that simulate points re-entering the frame, and adding supervision for occluded points. These modifications allow the model to track points across sequences orders of magnitude longer than before while keeping the same low memory and compute footprint. The paper also defines the Re-Detection Average Jaccard metric to quantify performance on re-appearing points and shows that the resulting model achieves new state-of-the-art results on multiple point-tracking benchmarks.
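The abstract describes periodic roll only as a geometric augmentation that "simulates point re-entries", so the following is a minimal sketch of one plausible reading, assuming a horizontal roll whose offset grows linearly with time: scene content (and every ground-truth track) drifts out of one image border and wraps back in at the opposite one. The function name and the `speed` parameter are illustrative, not taken from the paper.

```python
import numpy as np

def periodic_roll(video, tracks, speed=2):
    """Hypothetical periodic-roll augmentation (a sketch, not the paper's exact recipe).

    video:  (T, H, W, C) uint8 frames
    tracks: (T, N, 2) ground-truth point coordinates as (x, y)
    speed:  horizontal shift in pixels per frame (illustrative parameter)

    Frame t is rolled horizontally by (speed * t) pixels, so content and every
    tracked point drift out of one border and re-enter from the opposite one.
    """
    T, H, W, _ = video.shape
    rolled_video = np.empty_like(video)
    rolled_tracks = tracks.astype(np.float64).copy()
    for t in range(T):
        shift = (speed * t) % W
        rolled_video[t] = np.roll(video[t], shift, axis=1)   # roll along width
        rolled_tracks[t, :, 0] = (tracks[t, :, 0] + shift) % W
    return rolled_video, rolled_tracks
```

Whether the paper also toggles visibility flags while a point sits near the wrap boundary is not stated here; this sketch leaves visibility untouched.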
What carries the argument
A recurrent video transformer trained on long sequences via sequence parallelism, with periodic-roll augmentation and supervision of occluded points
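"Supervising occluded points" plausibly means the position loss stays active on frames where ground-truth visibility is zero (synthetic datasets such as Kubric and PointOdyssey provide coordinates for occluded points), instead of masking those frames out. A hedged sketch of that contrast, assuming a simple L1-plus-cross-entropy loss form that is illustrative rather than the paper's:

```python
import numpy as np

def tracking_loss(pred_xy, pred_vis_logit, gt_xy, gt_vis, supervise_occluded=True):
    """Sketch of a point-tracking loss with optional occlusion supervision.

    pred_xy:        (T, N, 2) predicted coordinates
    pred_vis_logit: (T, N) predicted visibility logits
    gt_xy:          (T, N, 2) ground-truth coordinates (defined even when occluded)
    gt_vis:         (T, N) ground-truth visibility in {0, 1}
    """
    # Position loss: on all frames (occlusion supervision) or only where the
    # point is visible (the conventional masking).
    err = np.abs(pred_xy - gt_xy).sum(-1)                      # (T, N) L1 error
    mask = np.ones_like(gt_vis) if supervise_occluded else gt_vis
    pos_loss = (err * mask).sum() / np.maximum(mask.sum(), 1)

    # Visibility loss: binary cross-entropy on the logit, always applied.
    p = 1.0 / (1.0 + np.exp(-pred_vis_logit))
    eps = 1e-7
    vis_loss = -(gt_vis * np.log(p + eps) + (1 - gt_vis) * np.log(1 - p + eps)).mean()
    return pos_loss + vis_loss
```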
If this is right
- Points can be tracked reliably across much longer continuous video sequences without increasing memory usage.
- Re-detection of points that leave and re-enter the frame becomes measurably better, as captured by the new AJ_RD metric.
- The same low-latency, frame-by-frame online tracking capability is retained while accuracy improves.
- New state-of-the-art scores are reached on existing TAP benchmarks without changing the core architecture.
- Applications in AR, XR, and robotics gain access to more robust long-term point correspondence.
Where Pith is reading between the lines
- The same sequence-parallel training recipe could be applied to other recurrent video models that currently struggle with long horizons.
- Periodic-roll augmentation may prove useful for any task that requires modeling objects re-entering the field of view.
- If the re-detection improvements hold, downstream systems that rely on stable point tracks over minutes rather than seconds become more feasible.
- Extending the approach to multi-camera or 3D-aware settings would be a direct next test of generality.
Load-bearing premise
The performance gains from long-sequence training and the chosen re-detection augmentations will continue to appear on videos drawn from distributions different from the training data and the evaluated benchmarks.
What would settle it
A controlled test on videos exceeding 1024 frames or containing occlusion patterns absent from the training distribution where TAPNext++ no longer outperforms the original TAPNext or other baselines on the standard TAP metrics.
Original abstract
Tracking-Any-Point (TAP) models aim to track any point through a video which is a crucial task in AR/XR and robotics applications. The recently introduced TAPNext approach proposes an end-to-end, recurrent transformer architecture to track points frame-by-frame in a purely online fashion -- demonstrating competitive performance at minimal latency. However, we show that TAPNext struggles with longer video sequences and also frequently fails to re-detect query points that reappear after being occluded or leaving the frame. In this work, we present TAPNext++, a model that tracks points in sequences that are orders of magnitude longer while preserving the low memory and compute footprint of the architecture. We train the recurrent video transformer using several data-driven solutions, including training on long 1024-frame sequences enabled by sequence parallelism techniques. We highlight that re-detection performance is a blind spot in the current literature and introduce a new metric, Re-Detection Average Jaccard ($AJ_{RD}$), to explicitly evaluate tracking on re-appearing points. To improve re-detection of points, we introduce tailored geometric augmentations, such as periodic roll that simulates point re-entries, and supervising occluded points. We demonstrate that recurrent transformers can be substantially improved for point tracking and set a new state-of-the-art on multiple benchmarks. Model and code can be found at https://tap-next-plus-plus.github.io.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces TAPNext++, an extension of the TAPNext recurrent transformer architecture for Tracking Any Point (TAP) in videos. It addresses limitations in long-sequence tracking and re-detection after occlusions or frame exits by training on 1024-frame clips via sequence parallelism, introducing periodic-roll geometric augmentation to simulate re-entries, adding occlusion supervision, and proposing a new Re-Detection Average Jaccard metric (AJ_RD). The authors claim these changes enable tracking in sequences orders of magnitude longer while preserving low memory/compute footprint, yielding new state-of-the-art results on multiple benchmarks, with code and model released.
Significance. If the performance claims and generalization hold, the work would advance online point tracking for extended videos, a key capability for AR/XR and robotics. The focus on recurrent transformers with explicit handling of re-detection and the new AJ_RD metric fill a noted gap in the literature. Code release supports reproducibility and follow-on work.
Major comments (3)
- [§4] §4 (Experiments): The central SOTA claims rest on comparisons to baselines and ablations, but the manuscript provides no full experimental details, ablation tables isolating sequence-parallel training from the periodic-roll/occlusion augmentations, or quantitative results on sequence lengths substantially exceeding 1024 frames. This prevents verification of whether the reported gains are load-bearing or distribution-specific.
- [§5] §5 (Results and AJ_RD): The new AJ_RD metric is introduced to evaluate re-detection, yet its precise formulation, relation to standard AJ, and how occluded/re-entering points are annotated in the benchmarks are not specified. Without this, it is unclear whether AJ_RD provides an independent signal or simply re-weights existing failure modes.
- [§3.2] §3.2 (Augmentations and Training): The periodic-roll augmentation and occlusion supervision are presented as key to re-detection gains, but no cross-dataset or out-of-distribution evaluation (e.g., videos with different camera motion statistics or re-appearance patterns) is reported. This leaves the generalization assumption—that these techniques produce robust improvements beyond the training distribution—unverified.
Minor comments (2)
- [Abstract] The abstract states the model tracks 'sequences that are orders of magnitude longer' without stating the maximum sequence length evaluated or the memory-footprint numbers relative to TAPNext.
- [§3] Notation for AJ_RD is introduced without an equation or pseudocode definition in the main text, which would aid clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper accordingly to improve experimental transparency, metric clarity, and generalization analysis.
Point-by-point responses
Referee: [§4] §4 (Experiments): The central SOTA claims rest on comparisons to baselines and ablations, but the manuscript provides no full experimental details, ablation tables isolating sequence-parallel training from the periodic-roll/occlusion augmentations, or quantitative results on sequence lengths substantially exceeding 1024 frames. This prevents verification of whether the reported gains are load-bearing or distribution-specific.
Authors: We agree that fuller experimental details and targeted ablations are needed for verification. In the revised manuscript we will add complete experimental protocols, ablation tables that isolate sequence-parallel training from periodic-roll augmentation and occlusion supervision, and quantitative results on sequences longer than 1024 frames (including 2048-frame evaluations) to demonstrate scalability beyond the training regime. revision: yes
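Sequence parallelism for a linear recurrence rests on the fact that each chunk of the sequence applies an affine map to the incoming hidden state, so chunks can be scanned independently and then stitched together with a cheap combine step. A toy sketch of this idea for a scalar recurrence h_t = a_t * h_{t-1} + b_t; the paper's implementation distributes this pattern across GPUs, and nothing below is from its code:

```python
import numpy as np

def scan(a, b, h0=0.0):
    """Sequential reference: h_t = a_t * h_{t-1} + b_t."""
    h, out = h0, []
    for at, bt in zip(a, b):
        h = at * h + bt
        out.append(h)
    return np.array(out)

def chunked_scan(a, b, n_chunks, h0=0.0):
    """Sketch of sequence-parallel evaluation of the same recurrence.

    Each chunk is summarised by the affine map h -> A*h + B that it applies to
    the incoming state (A = product of its a's, B = its local scan started from
    zero). Phases 1 and 3 parallelise across chunks; phase 2 is a small scan
    over one carry value per chunk.
    """
    a_chunks = np.array_split(np.asarray(a), n_chunks)
    b_chunks = np.array_split(np.asarray(b), n_chunks)
    # Phase 1 (parallelisable): per-chunk local scans and transition summaries.
    local = [scan(ac, bc, 0.0) for ac, bc in zip(a_chunks, b_chunks)]
    A = [np.prod(ac) for ac in a_chunks]
    B = [loc[-1] for loc in local]
    # Phase 2 (cheap): propagate the carry state across chunk boundaries.
    carries, h = [], h0
    for Ai, Bi in zip(A, B):
        carries.append(h)
        h = Ai * h + Bi
    # Phase 3 (parallelisable): correct each local scan with its carry,
    # since h_t = local_t + (a_1 * ... * a_t) * carry within a chunk.
    out = [loc + np.cumprod(ac) * c for ac, loc, c in zip(a_chunks, local, carries)]
    return np.concatenate(out)
```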
Referee: [§5] §5 (Results and AJ_RD): The new AJ_RD metric is introduced to evaluate re-detection, yet its precise formulation, relation to standard AJ, and how occluded/re-entering points are annotated in the benchmarks are not specified. Without this, it is unclear whether AJ_RD provides an independent signal or simply re-weights existing failure modes.
Authors: We will expand Section 5 with the exact mathematical definition of AJ_RD, its relation to standard Average Jaccard, and the annotation rules used for occluded and re-entering points in the benchmarks. AJ_RD averages the Jaccard score exclusively over frames in which a point reappears after occlusion or frame exit, thereby isolating re-detection performance rather than re-weighting all failures. revision: yes
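Under the description given in this response (Jaccard averaged exclusively over frames where a point reappears after occlusion or frame exit), one plausible concrete form is the TAP-Vid Jaccard score, averaged over the usual distance thresholds, restricted to re-appearance frames. A sketch in which all function names and the threshold set are assumed rather than taken from the paper:

```python
import numpy as np

def reappearance_mask(gt_vis):
    """Frames where a point is visible after having been hidden at least once
    earlier in the video. gt_vis: (T, N) ground-truth visibility in {0, 1}."""
    was_hidden = np.cumsum(1 - gt_vis, axis=0) > 0
    prev_hidden = np.zeros_like(was_hidden)
    prev_hidden[1:] = was_hidden[:-1]          # exclude the current frame itself
    return (gt_vis == 1) & prev_hidden

def jaccard_at(pred_xy, pred_vis, gt_xy, gt_vis, frame_mask, thresh):
    """TAP-Vid-style Jaccard at one distance threshold, on masked frames only."""
    within = np.linalg.norm(pred_xy - gt_xy, axis=-1) < thresh
    tp = (gt_vis & pred_vis & within & frame_mask).sum()
    fp = (pred_vis & ~(gt_vis & within) & frame_mask).sum()
    fn = (gt_vis & ~(pred_vis & within) & frame_mask).sum()
    return tp / max(tp + fp + fn, 1)

def aj_rd(pred_xy, pred_vis, gt_xy, gt_vis, thresholds=(1, 2, 4, 8, 16)):
    """Sketch of Re-Detection Average Jaccard: Jaccard averaged over distance
    thresholds, restricted to re-appearance frames."""
    mask = reappearance_mask(gt_vis)
    return float(np.mean([jaccard_at(pred_xy, pred_vis.astype(bool),
                                     gt_xy, gt_vis.astype(bool), mask, t)
                          for t in thresholds]))
```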
Referee: [§3.2] §3.2 (Augmentations and Training): The periodic-roll augmentation and occlusion supervision are presented as key to re-detection gains, but no cross-dataset or out-of-distribution evaluation (e.g., videos with different camera motion statistics or re-appearance patterns) is reported. This leaves the generalization assumption—that these techniques produce robust improvements beyond the training distribution—unverified.
Authors: Our current results are reported across multiple benchmarks that already exhibit varied camera motions and re-appearance statistics. We nevertheless acknowledge the value of explicit OOD testing and will add cross-dataset and out-of-distribution experiments in the revision to further substantiate generalization of the augmentations and supervision strategy. revision: partial
Circularity Check
No circularity in the derivation chain
Full rationale
The paper presents empirical improvements to the TAPNext architecture through explicit techniques: sequence-parallel training on 1024-frame clips, periodic-roll geometric augmentations, and occlusion supervision. These are independent additions that produce measured gains on standard TAP benchmarks plus a newly defined AJ_RD metric. No equations or claims reduce by construction to the inputs; the base TAPNext is referenced as prior work rather than a self-citation that bears the load of the new results. The derivation chain consists of standard training modifications and evaluation, remaining self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: the standard deep-learning premise that gradient-based optimization on augmented video data will yield models that generalize to unseen real-world videos.
Reference graph
Works this paper leans on
- [1] Görkay Aydemir, Xiongyi Cai, Weidi Xie, and Fatma Güney. Track-On: Transformer-based online point tracking with memory. In The Thirteenth International Conference on Learning Representations, 2025.
- [2] Görkay Aydemir, Weidi Xie, and Fatma Güney. Track-On2: Enhancing online point tracking with memory. arXiv preprint arXiv:2509.19115, 2025.
- [3] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
- [4] Seokju Cho, Jiahui Huang, Jisu Nam, Honggyu An, Seungryong Kim, and Joon-Young Lee. Local all-pair correspondence for point tracking. In European Conference on Computer Vision, pages 306–325. Springer, 2024.
- [5] Erwin Coumans and Yunfei Bai. PyBullet, a Python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016–2021.
- [6] Soham De, Samuel L Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, et al. Griffin: Mixing gated linear recurrences with local attention for efficient language models. arXiv preprint arXiv:2402.19427, 2024.
- [7] Carl Doersch, Ankush Gupta, Larisa Markeeva, Adria Recasens, Lucas Smaira, Yusuf Aytar, Joao Carreira, Andrew Zisserman, and Yi Yang. TAP-Vid: A benchmark for tracking any point in a video. Advances in Neural Information Processing Systems, 35:13610–13626, 2022.
- [8] Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, and Andrew Zisserman. TAPIR: Tracking any point with per-frame initialization and temporal refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10061–10072, 2023.
- [9] Carl Doersch, Pauline Luc, Yi Yang, Dilara Gokay, Skanda Koppula, Ankush Gupta, Joseph Heyward, Ignacio Rocco, Ross Goroshin, João Carreira, and Andrew Zisserman. BootsTAP: Bootstrapped training for tracking-any-point. In Proceedings of the Asian Conference on Computer Vision (ACCV), pages 3257–3274, 2024.
- [10] Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B McHugh, and Vincent Vanhoucke. Google Scanned Objects: A high-quality dataset of 3D scanned household items. In 2022 International Conference on Robotics and Automation (ICRA), pages 2553–2560. IEEE, 2022.
- [11] Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, et al. Kubric: A scalable dataset generator. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3749–3761, 2022.
- [12] Adam W Harley, Yang You, Xinglong Sun, Yang Zheng, Nikhil Raghuraman, Yunqi Gu, Sheldon Liang, Wen-Hsuan Chu, Achal Dave, Suya You, et al. AllTracker: Efficient dense point tracking at high resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5253–5262, 2025.
- [13] Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. CoTracker: It is better to track together. In European Conference on Computer Vision, pages 18–35. Springer, 2024.
- [14] Nikita Karaev, Yuri Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. CoTracker3: Simpler and better point tracking by pseudo-labelling real videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6013–6022, 2025.
- [15] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision, 128(7):1956–1981, 2020.
- [16] Hongyang Li, Hao Zhang, Shilong Liu, Zhaoyang Zeng, Feng Li, Bohan Li, Tianhe Ren, and Lei Zhang. TAPTRv2: Attention-based position update improves tracking any point. Advances in Neural Information Processing Systems, 37:101074–101095, 2024.
- [17] Hongyang Li, Hao Zhang, Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, and Lei Zhang. TAPTR: Tracking any point with transformers as detection. In European Conference on Computer Vision, pages 57–75. Springer, 2024.
- [18] Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. MegaSaM: Accurate, fast and robust structure and motion from casual dynamic videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10486–10496, 2025.
- [19] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
- [20] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- [21] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
- [22] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 DAVIS Challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.
- [23] Viorica Pătrăucean, Xu Owen He, Joseph Heyward, Chuhan Zhang, Mehdi S. M. Sajjadi, George-Cristian Muraru, Artem Zholus, Mahdi Karami, Ross Goroshin, Yutian Chen, Simon Osindero, João Carreira, and Razvan Pascanu. TRecViT: A recurrent video transformer, 2024.
- [24] Jinyuan Qu, Hongyang Li, Shilong Liu, Tianhe Ren, Zhaoyang Zeng, and Lei Zhang. TAPTRv3: Spatial and temporal context foster robust tracking of any point in long video. arXiv preprint arXiv:2411.18671, 2024.
- [25] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. Advances in Neural Information Processing Systems, 37:68658–68685, 2024.
- [26] Mel Vecerik, Carl Doersch, Yi Yang, Todor Davchev, Yusuf Aytar, Guangyao Zhou, Raia Hadsell, Lourdes Agapito, and Jon Scholz. RoboTAP: Tracking arbitrary points for few-shot visual imitation. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 5397–5403. IEEE, 2024.
- [27] Yuxi Xiao, Jianyuan Wang, Nan Xue, Nikita Karaev, Yuri Makarov, Bingyi Kang, Xing Zhu, Hujun Bao, Yujun Shen, and Xiaowei Zhou. SpatialTrackerV2: Advancing 3D point tracking with explicit camera motion. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6726–6737, 2025.
- [28] Greg Zaal, Rob Tuytel, Rico Cilliers, James Ray Cock, Andreas Mischok, Sergej Majboroda, Dimitrios Savva, and Jurita Burger. Poly Haven: A curated public asset library for visual effects artists and game designers, 2021.
- [29] Yang Zheng, Adam W. Harley, Bokui Shen, Gordon Wetzstein, and Leonidas J. Guibas. PointOdyssey: A large-scale synthetic dataset for long-term point tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19855–19865, 2023.
- [30] Artem Zholus, Carl Doersch, Yi Yang, Skanda Koppula, Viorica Patraucean, Xu Owen He, Ignacio Rocco, Mehdi S. M. Sajjadi, Sarath Chandar, and Ross Goroshin. TAPNext: Tracking any point (TAP) as next token prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9693–9703, 2025.