TrAction: Action Recognition with Sparse Trajectories

Alexander Ecker; Felix B. Mueller; Jan F. Meier; Timo Lüddecke

arxiv: 2606.03490 · v1 · pith:IXUYRHK7new · submitted 2026-06-02 · 💻 cs.CV

Pith reviewed 2026-06-28 10:31 UTC · model grok-4.3

classification 💻 cs.CV

keywords: action recognition · sparse trajectories · transformer · masked pretraining · motion features · video understanding · Something-Something V2 · EPIC-Kitchens

The pith

Sparse point trajectories let action models focus on motion and boost accuracy when fused with appearance features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that sparse point trajectories provide a low-bias input for action recognition because they carry little appearance or scene information by design. It introduces a transformer that processes these 2.5D trajectories together with a masked-trajectory pretraining stage that improves downstream accuracy. The resulting model reaches 45 percent top-1 on Something-Something V2 and 54 percent on EPIC-Kitchens-100 while using only a fraction of the compute of dense RGB methods. When its features are combined with strong appearance models such as DINOv2 the combined system gains 8.7 points on the same benchmark. The work therefore treats trajectories as a complementary signal rather than a replacement.

Core claim

A simple transformer trained on sparse point trajectories with masked pretraining produces motion-focused features that reach competitive accuracy on standard action benchmarks and improve further when fused with appearance-based models, raising top-1 accuracy on Something-Something V2 by 8.7 points over DINOv2 alone and by 1.6 points over V-JEPA 2.

What carries the argument

Sparse point trajectories processed by a 2.5D trajectory transformer with masked-trajectory pretraining.
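
As a rough illustration of what masked-trajectory pretraining operates on, the sketch below hides a random subset of whole trajectories and keeps the hidden ones as reconstruction targets. The array layout `(N, T, 4)` with `(x, y, depth, visible)` channels, the 75% mask ratio, masking at the level of whole trajectories, and the function name `mask_trajectories` are all assumptions made for illustration; the summary above does not specify the paper's masking scheme.

```python
import numpy as np

def mask_trajectories(tracks, mask_ratio=0.75, seed=0):
    """Split trajectories into visible encoder inputs and masked targets.

    tracks: array of shape (N, T, 4) holding (x, y, depth, visible) per
    point and frame. Layout and mask ratio are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n = tracks.shape[0]
    n_masked = int(round(mask_ratio * n))
    perm = rng.permutation(n)
    masked_idx = np.sort(perm[:n_masked])   # trajectories to reconstruct
    kept_idx = np.sort(perm[n_masked:])     # trajectories the encoder sees
    return tracks[kept_idx], tracks[masked_idx], masked_idx

# Toy input: 256 trajectories over 16 frames with random values.
tracks = np.random.default_rng(1).random((256, 16, 4))
kept, targets, masked_idx = mask_trajectories(tracks)
print(kept.shape, targets.shape)
```

A pretraining loss would then compare a decoder's predictions for `masked_idx` against `targets`; that part depends on model details not given here.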

If this is right

Trajectory features improve time-reversal sensitivity beyond V-JEPA.
Fusion with DINOv2 yields an 8.7-point gain on Something-Something V2.
The method uses far less memory and compute than dense RGB volumes.
Masked pretraining on trajectories measurably raises downstream recognition accuracy.
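
To give a sense of scale for the memory claim, the back-of-the-envelope comparison below counts raw input values for a dense RGB clip and for a sparse trajectory set. The 16 frames and 256 trajectories are the saturation points reported in the paper's Figure 5 caption; the 224×224 resolution and the four values per trajectory point (x, y, depth, visibility) are assumptions, and the count ignores the cost of running the point tracker and depth estimator.

```python
def dense_rgb_values(frames, height, width, channels=3):
    """Number of scalar values in a dense RGB video volume."""
    return frames * height * width * channels

def trajectory_values(num_tracks, frames, values_per_point=4):
    """Number of scalar values in a sparse trajectory set."""
    return num_tracks * frames * values_per_point

dense = dense_rgb_values(frames=16, height=224, width=224)   # assumed resolution
sparse = trajectory_values(num_tracks=256, frames=16)
print(dense, sparse, dense / sparse)  # 2408448 16384 147.0
```

Under these assumptions the trajectory input is roughly two orders of magnitude smaller than the pixel volume.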

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach may reduce reliance on large labeled video datasets if trajectory pretraining scales.
Models built this way could be easier to audit for motion-based decisions rather than object shortcuts.
The same trajectory stream might support real-time applications on resource-limited devices.

Load-bearing premise

Sparse trajectories supply enough distinctive motion information on their own and remain largely free of appearance shortcuts.

What would settle it

A controlled test in which trajectory-only accuracy collapses on action pairs that differ only by object identity or background while fusion with appearance models yields no gain.

Figures

Figures reproduced from arXiv: 2606.03490 by Alexander Ecker, Felix B. Mueller, Jan F. Meier, Timo Lüddecke.

**Figure 2.** Trajectories for action recognition (TrAction) overview. We extract 2.5D trajectories using Cotracker3 and VideoDepthAnything (A). Our trajectory transformer model is first pretrained using self-supervised masked autoencoding (B) before being finetuned for action recognition (C). A visibility flag indicates whether the point is visible at frame t. We sample query points uniformly at random across both space and time. Queryi…

**Figure 3.** Class-wise performance on SSv2. (a) The trajectories only model performs well on actions involving camera motion as well as directional classes. (b) Fusing both DINOv2 as well as V-JEPA 2 with our trajectory model leads to significant gains. Classes with less than 25 samples are excluded and class labels are shortened.

**Figure 4.** Last-layer CLS attention overlay. Top-25 trajectories from one of four heads on a moving X closer to Y sequence. Color encodes attention weight, size scales with weight, alpha with trajectory visibility. The visualized head concentrates on the manipulated object; other heads attend to different regions and motions. See Appendix F for additional examples.

**Figure 5.** Effect of frames and trajectories on SSv2. Performance increases with more frames and saturates beyond 16. Increasing the number of trajectories helps consistently but gains are small beyond 256 trajectories.

**Figure 6.** Last-layer CLS attention overlay over all heads. Different heads focus on different trajectories. Head 1 and 3 focus on the trajectories covering the bottle cap, whereas head 2 and 4 focus on the head movement either directly through trajectories on the hand or through background trajectories which get occluded.

Original abstract

Modern action recognition models operate on memory- and compute-intensive dense RGB video volumes and frequently exploit appearance and background shortcuts, for example, predicting actions from objects or scenes instead of characteristic motion. We investigate an efficient alternative input modality that is largely free of such biases by construction: sparse point trajectories. To this end, we develop a simple transformer architecture for 2.5D trajectory-based recognition together with a masked-trajectory pretraining, which we show to substantially improve downstream action recognition accuracy. Despite using only a fraction of the dense RGB input, our method reaches 45% top-1 on Something-Something V2 and 54% on EPIC-Kitchens-100, and surpasses V-JEPA on time-reversal sensitivity. More importantly, we find trajectory features to be complementary to state-of-the-art appearance-based features. Fusing our pretrained model with DINOv2 and V-JEPA 2 improves top-1 accuracy on Something-Something V2 by 8.7 and 1.6 points, respectively. Code: https://github.com/ecker-lab/TrAction
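
One way to picture the time-reversal probe mentioned in the abstract: playing a clip backwards corresponds to flipping the trajectory array along its time axis, which turns, for example, a rightward motion into a leftward one. The sketch below uses an assumed `(N, T, 4)` layout with `(x, y, depth, visible)` channels and hypothetical helper names; how the paper actually measures time-reversal sensitivity is not described in the abstract.

```python
import numpy as np

def reverse_time(tracks):
    """Flip trajectories of shape (N, T, C) along the time axis."""
    return tracks[:, ::-1, :]

def net_displacement(tracks):
    """Mean (dx, dy) from first to last frame over all trajectories."""
    return (tracks[:, -1, :2] - tracks[:, 0, :2]).mean(axis=0)

# Toy example: 3 points moving right by 0.1 per frame over 5 frames.
tracks = np.zeros((3, 5, 4))
tracks[:, :, 0] = np.arange(5) * 0.1   # x increases over time
tracks[:, :, 3] = 1.0                  # all points marked visible

fwd = net_displacement(tracks)
bwd = net_displacement(reverse_time(tracks))
print(fwd, bwd)
```

A model that is sensitive to temporal order should assign different predictions to `tracks` and `reverse_time(tracks)` for direction-dependent classes.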

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Sparse trajectories give a lighter motion-focused input that fuses well with appearance models, but the 'free of biases by construction' part needs verification against the actual tracker.

The main point is that this paper shows sparse point trajectories can work as a primary input for action recognition, paired with a 2.5D transformer and masked pretraining. It reports 45% top-1 on Something-Something V2 and 54% on EPIC-Kitchens-100, plus clear fusion gains when combined with DINOv2 (+8.7) or V-JEPA 2 (+1.6). That complementarity is the part worth paying attention to.

They do a clean job of keeping the input small and motion-oriented instead of dense RGB volumes. The pretraining step improves downstream accuracy, and the numbers suggest the trajectory stream adds something the appearance models miss. Linking the code is also useful.

The soft spots are the usual ones for an abstract-only view: no error bars, limited ablation detail, and no full protocol. More importantly, the claim that trajectories are largely free of appearance shortcuts by construction is not automatic. Standard trackers run on RGB frames, so consistent point tracks can still carry object identity or scene layout. If that signal is present, the fusion gains could be an ensemble effect rather than proof of orthogonal motion features. The stress-test concern holds until the methods section shows explicit controls or masking of appearance cues during extraction.

This is for people building video models who want lighter inputs or better motion signals. A reader working on shortcut mitigation or multimodal fusion would find the empirical results worth checking. It deserves a serious referee because the idea is straightforward to test and the reported gains are specific enough to evaluate.

Referee Report

1 major / 2 minor

Summary. The paper introduces TrAction, a transformer-based architecture operating on sparse 2.5D point trajectories for action recognition, paired with masked-trajectory pretraining. It reports 45% top-1 accuracy on Something-Something V2 and 54% on EPIC-Kitchens-100, claims superiority to V-JEPA on time-reversal sensitivity, and asserts that trajectory features are complementary to appearance-based models, with fusion yielding +8.7 points (DINOv2) and +1.6 points (V-JEPA 2) on SSv2. The core positioning is that trajectories are largely free of appearance/background shortcuts by construction and offer an efficient alternative to dense RGB inputs.

Significance. If the central claims hold, the work provides a computationally lighter motion-centric pathway for action recognition that could complement dense appearance models. The public code release at https://github.com/ecker-lab/TrAction is a clear strength for reproducibility. The reported fusion gains and time-reversal results, if robust, would support the value of trajectory representations in multimodal settings. Significance is limited by the absence of detailed experimental protocols in the provided abstract and the need to substantiate the bias-free assumption.

major comments (1)

[Abstract] The claim that sparse point trajectories are 'largely free of such biases by construction' is load-bearing for interpreting the fusion gains (+8.7 with DINOv2, +1.6 with V-JEPA 2) as evidence of orthogonal motion features rather than an ensemble effect. Standard RGB-based trajectory extraction (optical flow or learned trackers) can retain appearance cues via consistent pixel tracking; the manuscript must supply explicit controls (e.g., object-category prediction from trajectories alone or background-masked variants) to support the assumption.

minor comments (2)

[Abstract] Concrete accuracy numbers (45% SSv2, 54% EPIC-Kitchens) are stated without error bars, number of runs, or ablation details on the pretraining or fusion protocol.
[Abstract] The fusion mechanism (late fusion, feature concatenation, etc.) and the exact pretrained model variants are not specified, hindering assessment of the complementarity result.
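
The two fusion options named in the last comment differ in where the streams meet. The sketch below contrasts feature concatenation with late fusion of class probabilities. The feature dimensions, the equal 0.5 weighting, the random linear heads, and the function names are placeholders; neither variant is claimed to be the mechanism the paper uses.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def concat_fusion(f_traj, f_app, W):
    """Feature-level fusion: one linear head on concatenated features."""
    return softmax(np.concatenate([f_traj, f_app], axis=-1) @ W)

def late_fusion(f_traj, f_app, W_traj, W_app, alpha=0.5):
    """Decision-level fusion: weighted average of per-stream probabilities."""
    return alpha * softmax(f_traj @ W_traj) + (1 - alpha) * softmax(f_app @ W_app)

rng = np.random.default_rng(0)
batch, d_traj, d_app, n_classes = 2, 8, 16, 5   # placeholder sizes
f_traj = rng.normal(size=(batch, d_traj))       # stand-in trajectory features
f_app = rng.normal(size=(batch, d_app))         # stand-in appearance features

p_concat = concat_fusion(f_traj, f_app, rng.normal(size=(d_traj + d_app, n_classes)))
p_late = late_fusion(f_traj, f_app,
                     rng.normal(size=(d_traj, n_classes)),
                     rng.normal(size=(d_app, n_classes)))
print(p_concat.shape, p_late.shape)
```

Concatenation lets a single head learn cross-stream interactions, while late fusion keeps the streams independent until the final averaging step, which is why the distinction matters for reading the reported gains.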

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed review and constructive feedback on our work. We address the single major comment below regarding the abstract's claim about biases in trajectory representations.

Referee: [Abstract] The claim that sparse point trajectories are 'largely free of such biases by construction' is load-bearing for interpreting the fusion gains (+8.7 with DINOv2, +1.6 with V-JEPA 2) as evidence of orthogonal motion features rather than an ensemble effect. Standard RGB-based trajectory extraction (optical flow or learned trackers) can retain appearance cues via consistent pixel tracking; the manuscript must supply explicit controls (e.g., object-category prediction from trajectories alone or background-masked variants) to support the assumption.

Authors: We agree that the phrasing 'largely free of such biases by construction' is imprecise and could overstate the separation from appearance cues, since standard trackers rely on RGB consistency for point correspondence. While the sparsity and 2.5D nature of the input inherently limit dense appearance and background information relative to full RGB volumes, residual cues may persist. The fusion gains are presented as evidence of complementarity rather than a strict proof of orthogonality. To substantiate the assumption as requested, we will add explicit controls in the revision, including object-category prediction accuracy from trajectory features alone and evaluations on background-masked variants.

Revision: yes
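
The object-category control discussed in this exchange could take the form of a linear probe: train a simple classifier to predict object category from frozen trajectory features and compare its accuracy to chance. The sketch below runs such a probe on synthetic features only. The ridge-regression probe, the `leak` parameter that injects class information, and all sizes are illustrative assumptions, not the authors' protocol or results.

```python
import numpy as np

def fit_linear_probe(X, y, n_classes, ridge=1e-2):
    """Closed-form ridge regression onto one-hot labels (a simple linear probe)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])      # append bias column
    Y = np.eye(n_classes)[y]                       # one-hot targets
    A = Xb.T @ Xb + ridge * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ Y)

def probe_accuracy(W, X, y):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return float(((Xb @ W).argmax(axis=1) == y).mean())

def toy_features(n, n_classes, dim, leak, rng):
    """Synthetic features; `leak` controls how much class identity they carry."""
    y = rng.integers(0, n_classes, size=n)
    X = rng.normal(size=(n, dim))
    X[np.arange(n), y] += leak
    return X, y

def run_probe(leak, seed=0, n_classes=4, dim=16):
    rng = np.random.default_rng(seed)
    Xtr, ytr = toy_features(400, n_classes, dim, leak, rng)
    Xte, yte = toy_features(200, n_classes, dim, leak, rng)
    return probe_accuracy(fit_linear_probe(Xtr, ytr, n_classes), Xte, yte)

acc_leaky = run_probe(leak=3.0)   # features that encode object identity
acc_clean = run_probe(leak=0.0)   # features with no object information
print(acc_leaky, acc_clean)
```

In this toy setting, probe accuracy well above the 25% chance level signals that object identity is recoverable from the features, whereas accuracy near chance would be consistent with the claim that little appearance information survives.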

Circularity Check

0 steps flagged

No circularity in derivation chain; empirical claims rest on benchmarks

The paper advances an empirical architecture and pretraining scheme for trajectory-based action recognition, with the central complementarity claim supported by reported fusion gains on standard benchmarks (SSv2, EPIC-Kitchens). No equations, parameter fits, or self-citations are presented that reduce any prediction or uniqueness result to the authors' own inputs by construction. The 'by construction' phrasing for bias freedom is a modeling assumption rather than a self-referential derivation step. This is the common honest case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no model equations, hyperparameters, or assumptions are detailed enough to enumerate free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5727 in / 1013 out tokens · 29291 ms · 2026-06-28T10:31:35.498102+00:00 · methodology

Reference graph

Works this paper leans on

77 extracted references · 3 canonical work pages

[1]

V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muck- ley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, and others. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

Pith/arXiv arXiv 2025
[2]

Sun, Luke Friedman, Rui Qian, Tobias Weyand, Yue Zhao, Rachel Hornung, Florian Schroff, Ming-Hsuan Yang, David A

Long Zhao, Nitesh Bharadwaj Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, Jennifer J. Sun, Luke Friedman, Rui Qian, Tobias Weyand, Yue Zhao, Rachel Hornung, Florian Schroff, Ming-Hsuan Yang, David A. Ross, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Ting Liu, and Boqing Gong. VideoPrism: A Foundational Visual Encoder for Video Understanding. InForty...

2024
[3]

Masked motion encoding for self-supervised video representation learning

Xinyu Sun, Peihao Chen, Liangwei Chen, Changhao Li, Thomas H Li, Mingkui Tan, and Chuang Gan. Masked motion encoding for self-supervised video representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2235–2245, 2023

2023
[4]

The panaf-fgbg dataset: Understanding the impact of backgrounds in wildlife behaviour recogni- tion

Otto Brookes, Maksim Kukushkin, Majid Mirmehdi, Colleen Stephens, Paula Dieguez, Thurston C Hicks, Sorrel Jones, Kevin Lee, Maureen S McCarthy, Amelia Meier, and others. The panaf-fgbg dataset: Understanding the impact of backgrounds in wildlife behaviour recogni- tion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, p...

2025
[5]

Removing the background by adding the background: Towards background robust self-supervised video representation learning

Jinpeng Wang, Yuting Gao, Ke Li, Yiqi Lin, Andy J Ma, Hao Cheng, Pai Peng, Feiyue Huang, Rongrong Ji, and Xing Sun. Removing the background by adding the background: Towards background robust self-supervised video representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11804–11813, 2021

2021
[6]

Why can’t i dance in the mall? learning to mitigate scene bias in action recognition.Advances in Neural Information Processing Systems, 32, 2019

Jinwoo Choi, Chen Gao, Joseph CE Messou, and Jia-Bin Huang. Why can’t i dance in the mall? learning to mitigate scene bias in action recognition.Advances in Neural Information Processing Systems, 32, 2019

2019
[7]

On the integration of optical flow and action recognition

Laura Sevilla-Lara, Yiyi Liao, Fatma Güney, Varun Jampani, Andreas Geiger, and Michael J Black. On the integration of optical flow and action recognition. InGerman conference on pattern recognition, pages 281–297. Springer, 2018

2018
[8]

Is appearance free action recognition possible? InEuropean Conference on Computer Vision, pages 156–173

Filip Ilic, Thomas Pock, and Richard P Wildes. Is appearance free action recognition possible? InEuropean Conference on Computer Vision, pages 156–173. Springer, 2022

2022
[9]

Cotracker3: Simpler and better point tracking by pseudo-labelling real videos

Nikita Karaev, Yuri Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker3: Simpler and better point tracking by pseudo-labelling real videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6013–6022, 2025

2025
[10]

Tapnext: Tracking any point (tap) as next token prediction

Artem Zholus, Carl Doersch, Yi Yang, Skanda Koppula, Viorica Patraucean, Xu Owen He, Ignacio Rocco, Mehdi SM Sajjadi, Sarath Chandar, and Ross Goroshin. Tapnext: Tracking any point (tap) as next token prediction. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9693–9703, 2025. 10

2025
[11]

Track2act: Predicting point tracks from internet videos enables generalizable robot manipulation

Homanga Bharadhwaj, Roozbeh Mottaghi, Abhinav Gupta, and Shubham Tulsiani. Track2act: Predicting point tracks from internet videos enables generalizable robot manipulation. In European Conference on Computer Vision, pages 306–324. Springer, 2024

2024
[12]

Articulated Object Estimation in the Wild

Abdelrhman Werby, Martin Büchner, Adrian Röfer, Chenguang Huang, Wolfram Burgard, and Abhinav Valada. Articulated Object Estimation in the Wild. In Joseph Lim, Shuran Song, and Hae-Won Park, editors,Proceedings of The 9th Conference on Robot Learning, volume 305 of Proceedings of Machine Learning Research, pages 3828–3849. PMLR, September 2025

2025
[13]

Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance

Ruihang Chu, Yefei He, Zhekai Chen, Shiwei Zhang, Xiaogang Xu, Bin Xia, Dingdong W ANG, Hongwei Yi, Xihui Liu, Hengshuang Zhao, Yu Liu, Yingya Zhang, and Yujiu Yang. Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

2026
[14]

Quo vadis, action recognition? a new model and the kinetics dataset

Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. Inproceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017

2017
[15]

something something

Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, and others. The" something something" video database for learning and evaluating visual common sense. InProceedings of the IEEE international conference on computer vision, pages 584...

2017
[16]

Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100. International Journal of Computer Vision (IJCV), 130:33–55, 2022

2022
[17]

Large-scale video classification with convolutional neural networks

Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014

2014
[18]

Long-term recurrent convolutional networks for visual recognition and description

Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2625–2634, 2015

2015
[19]

Courville

Nicolas Ballas, Li Yao, Chris Pal, and Aaron C. Courville. Delving Deeper into Convolutional Networks for Learning Video Representations. In Yoshua Bengio and Yann LeCun, editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016

2016
[20]

Beyond short snippets: Deep networks for video classification

Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4694–4702, 2015

2015
[21]

Learning spa- tiotemporal features with 3d convolutional networks

Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spa- tiotemporal features with 3d convolutional networks. InProceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015

2015
[22]

Slowfast networks for video recognition

Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. InProceedings of the IEEE/CVF international conference on computer vision, pages 6202–6211, 2019

2019
[23]

X3d: Expanding architectures for efficient video recognition

Christoph Feichtenhofer. X3d: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 203–213, 2020

2020
[24]

Is space-time attention all you need for video understanding? InIcml, volume 2, page 4, 2021

Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? InIcml, volume 2, page 4, 2021. 11

2021
[25]

Vivit: A video vision transformer

Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lu ˇci´c, and Cordelia Schmid. Vivit: A video vision transformer. InProceedings of the IEEE/CVF international conference on computer vision, pages 6836–6846, 2021

2021
[26]

Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training.Advances in neural information processing systems, 35:10078–10093, 2022

Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training.Advances in neural information processing systems, 35:10078–10093, 2022

2022
[27]

Videomae v2: Scaling video masked autoencoders with dual masking

Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14549–14560, 2023

2023
[28]

Masked autoencoders as spatiotemporal learners.Advances in neural information processing systems, 35:35946–35958, 2022

Christoph Feichtenhofer, Yanghao Li, Kaiming He, and others. Masked autoencoders as spatiotemporal learners.Advances in neural information processing systems, 35:35946–35958, 2022

2022
[29]

Recurrent Video Masked Autoencoders.arXiv preprint arXiv:2512.13684, 2025

Daniel Zoran, Nikhil Parthasarathy, Yi Yang, Drew A Hudson, Joao Carreira, and Andrew Zisserman. Recurrent Video Masked Autoencoders.arXiv preprint arXiv:2512.13684, 2025

Pith/arXiv arXiv 2025
[30]

Revisiting Feature Prediction for Learning Visual Representations from Video.Transactions on Machine Learning Research, 2024

Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, and Nicolas Ballas. Revisiting Feature Prediction for Learning Visual Representations from Video.Transactions on Machine Learning Research, 2024. ISSN 2835-8856

2024
[31]

Two-stream convolutional networks for action recognition in videos.Advances in neural information processing systems, 27, 2014

Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos.Advances in neural information processing systems, 27, 2014

2014
[32]

Convolutional two-stream network fusion for video action recognition

Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream network fusion for video action recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1933–1941, 2016

1933
[33]

Memory-augmented dense predictive coding for video representation learning

Tengda Han, Weidi Xie, and Andrew Zisserman. Memory-augmented dense predictive coding for video representation learning. InEuropean conference on computer vision, pages 312–329. Springer, 2020

2020
[34]

Wang, Christopher Hoang, Yuwen Xiong, Yann LeCun, and Mengye Ren

Alex N. Wang, Christopher Hoang, Yuwen Xiong, Yann LeCun, and Mengye Ren. Poo- DLe: Pooled and dense self-supervised learning from naturalistic videos. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[35]

Spatial temporal graph convolutional networks for skeleton-based action recognition

Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. InProceedings of the AAAI conference on artificial intelligence, volume 32, 2018

2018
[36]

Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation

Chao Li, Qiaoyong Zhong, Di Xie, and Shiliang Pu. Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. InProceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI’18, pages 786–792. AAAI Press, 2018. ISBN 978-0-9992411-2-7

2018
[37]

, author Neer, W.C

Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu. Skeleton-Based Action Recognition With Directed Graph Neural Networks. In2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7904–7913, 2019. doi: 10.1109/CVPR.2019.00810

work page doi:10.1109/cvpr.2019.00810 2019
[38]

An end-to-end spatio- temporal attention model for human action recognition from skeleton data

Sijie Song, Cuiling Lan, Junliang Xing, Wenjun Zeng, and Jiaying Liu. An end-to-end spatio- temporal attention model for human action recognition from skeleton data. InProceedings of the AAAI conference on artificial intelligence, volume 31, 2017

2017
[39]

Jun Liu, Gang Wang, Ping Hu, Ling-Yu Duan, and Alex C. Kot. Global Context-Aware Attention LSTM Networks for 3D Action Recognition. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017

2017
[40]

Seeing without Pixels: Perception from Camera Trajectories.arXiv preprint arXiv:2511.21681, 2025

Zihui Xue, Kristen Grauman, Dima Damen, Andrew Zisserman, and Tengda Han. Seeing without Pixels: Perception from Camera Trajectories.arXiv preprint arXiv:2511.21681, 2025. 12

arXiv 2025
[41]

TrackMAE: Video Representation Learning via Track Mask and Predict.arXiv preprint arXiv:2603.27268, 2026

Renaud Vandeghen, Fida Mohammad Thoker, Marc Van Droogenbroeck, and Bernard Ghanem. TrackMAE: Video Representation Learning via Track Mask and Predict.arXiv preprint arXiv:2603.27268, 2026

arXiv 2026
[42]

Trajectory-aligned space-time tokens for few-shot action recognition

Pulkit Kumar, Namitha Padmanabhan, Luke Luo, Sai Saketh Rambhatla, and Abhinav Shri- vastava. Trajectory-aligned space-time tokens for few-shot action recognition. InEuropean Conference on Computer Vision, pages 474–493. Springer, 2024

2024
[43]

Trokens: Semantic-Aware Relational Trajectory Tokens for Few-Shot Action Recognition

Pulkit Kumar, Shuaiyi Huang, Matthew Walmer, Sai Saketh Rambhatla, and Abhinav Shrivas- tava. Trokens: Semantic-Aware Relational Trajectory Tokens for Few-Shot Action Recognition. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 13544– 13556, 2025

2025
[44]

It’s a Matter of Time: Three Lessons on Long-Term Motion for Perception.arXiv preprint arXiv:2602.14705, 2026

Willem Davison, Xinyue Hao, and Laura Sevilla-Lara. It’s a Matter of Time: Three Lessons on Long-Term Motion for Perception.arXiv preprint arXiv:2602.14705, 2026

arXiv 2026
[45]

Raft: Recurrent all-pairs field transforms for optical flow

Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In European conference on computer vision, pages 402–419. Springer, 2020

2020
[46]

W AFT: Warping-Alone Field Transforms for Optical Flow

Yihan Wang and Jia Deng. W AFT: Warping-Alone Field Transforms for Optical Flow. InThe Fourteenth International Conference on Learning Representations, 2026

2026
[47]

Tap-vid: A benchmark for tracking any point in a video.Advances in Neural Information Processing Systems, 35:13610–13626, 2022

Carl Doersch, Ankush Gupta, Larisa Markeeva, Adria Recasens, Lucas Smaira, Yusuf Aytar, Joao Carreira, Andrew Zisserman, and Yi Yang. Tap-vid: A benchmark for tracking any point in a video.Advances in Neural Information Processing Systems, 35:13610–13626, 2022

2022
[48]

Particle video revisited: Tracking through occlusions using point trajectories

Adam W Harley, Zhaoyuan Fang, and Katerina Fragkiadaki. Particle video revisited: Tracking through occlusions using point trajectories. InEuropean Conference on Computer Vision, pages 59–75. Springer, 2022

2022
[49]

Tapir: Tracking any point with per-frame initialization and temporal refinement

Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, and Andrew Zisserman. Tapir: Tracking any point with per-frame initialization and temporal refinement. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10061–10072, 2023

2023
[50]

Cotracker: It is better to track together

Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker: It is better to track together. InEuropean conference on computer vision, pages 18–35. Springer, 2024

2024
[51]

Dense optical tracking: Connecting the dots

Guillaume Le Moing, Jean Ponce, and Cordelia Schmid. Dense optical tracking: Connecting the dots. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19187–19197, 2024

2024
[52]

Local all-pair correspondence for point tracking

Seokju Cho, Jiahui Huang, Jisu Nam, Honggyu An, Seungryong Kim, and Joon-Young Lee. Local all-pair correspondence for point tracking. InEuropean conference on computer vision, pages 306–325. Springer, 2024

2024
[53]

TAPNext++: What’s Next for Tracking Any Point (TAP)?arXiv preprint arXiv:2604.10582, 2026

Sebastian Jung, Artem Zholus, Martin Sundermeyer, Carl Doersch, Ross Goroshin, David Joseph Tan, Sarath Chandar, Rudolph Triebel, and Federico Tombari. TAPNext++: What’s Next for Tracking Any Point (TAP)?arXiv preprint arXiv:2604.10582, 2026

Pith/arXiv arXiv 2026
[54]

Bootstap: Bootstrapped training for tracking-any-point

Carl Doersch, Pauline Luc, Yi Yang, Dilara Gokay, Skanda Koppula, Ankush Gupta, Joseph Heyward, Ignacio Rocco, Ross Goroshin, Joao Carreira, and others. Bootstap: Bootstrapped training for tracking-any-point. InProceedings of the Asian Conference on Computer Vision, pages 3257–3274, 2024

2024
[55]

Spatialtracker: Tracking any 2d pixels in 3d space

Yuxi Xiao, Qianqian Wang, Shangzhan Zhang, Nan Xue, Sida Peng, Yujun Shen, and Xiaowei Zhou. Spatialtracker: Tracking any 2d pixels in 3d space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20406–20417, 2024

2024
[56]

Spatialtrackerv2: Advancing 3d point tracking with explicit camera motion

Yuxi Xiao, Jianyuan Wang, Nan Xue, Nikita Karaev, Yuri Makarov, Bingyi Kang, Xing Zhu, Hujun Bao, Yujun Shen, and Xiaowei Zhou. Spatialtrackerv2: Advancing 3d point tracking with explicit camera motion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6726–6737, 2025

2025
[57]

DELTA: DENSE EFFICIENT LONG-RANGE 3D TRACKING FOR ANY VIDEO

Tuan Duc Ngo, Peiye Zhuang, Evangelos Kalogerakis, Chuang Gan, Sergey Tulyakov, Hsin-Ying Lee, and Chaoyang Wang. DELTA: DENSE EFFICIENT LONG-RANGE 3D TRACKING FOR ANY VIDEO. In The Thirteenth International Conference on Learning Representations, 2025

2025
[58]

DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, XinQiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, and Xin Jin. DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

2026
[59]

History-Aware Visuomotor Policy Learning via Point Tracking

Jingjing Chen, Hongjie Fang, Chenxi Wang, Shiquan Wang, and Cewu Lu. History-Aware Visuomotor Policy Learning via Point Tracking. arXiv preprint arXiv:2509.17141, 2025

arXiv 2025
[60]

Generative Video Motion Editing with 3D Point Tracks

Yao-Chih Lee, Zhoutong Zhang, Jiahui Huang, Jui-Hsien Wang, Joon-Young Lee, Jia-Bin Huang, Eli Shechtman, and Zhengqi Li. Generative Video Motion Editing with 3D Point Tracks. arXiv preprint arXiv:2512.02015, 2025

arXiv 2025
[61]

Vggsfm: Visual geometry grounded deep structure from motion

Jianyuan Wang, Nikita Karaev, Christian Rupprecht, and David Novotny. Vggsfm: Visual geometry grounded deep structure from motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21686–21697, 2024

2024
[62]

Shape of motion: 4d reconstruction from a single video

Qianqian Wang, Vickie Ye, Hang Gao, Weijia Zeng, Jake Austin, Zhengqi Li, and Angjoo Kanazawa. Shape of motion: 4d reconstruction from a single video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9660–9672, 2025

2025
[63]

TrackerSplat: Exploiting Point Tracking for Fast and Robust Dynamic 3D Gaussians Reconstruction

Daheng Yin, Isaac Ding, Yili Jin, Jianxin Shi, and Jiangchuan Liu. TrackerSplat: Exploiting Point Tracking for Fast and Robust Dynamic 3D Gaussians Reconstruction. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–11, 2025

2025
[64]

Segment anything meets point tracking

Frano Rajič, Lei Ke, Yu-Wing Tai, Chi-Keung Tang, Martin Danelljan, and Fisher Yu. Segment anything meets point tracking. In Proceedings of the Winter Conference on Applications of Computer Vision, pages 9284–9293, 2025

2025
[65]

Nettrack: Tracking highly dynamic objects with a net

Guangze Zheng, Shijie Lin, Haobo Zuo, Changhong Fu, and Jia Pan. Nettrack: Tracking highly dynamic objects with a net. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19145–19155, 2024

2024
[66]

Forecasting Motion in the Wild

Neerja Thakkar, Shiry Ginosar, Jacob Walker, Jitendra Malik, Joao Carreira, and Carl Doersch. Forecasting Motion in the Wild, April 2026. arXiv:2604.01015 [cs]

arXiv 2026
[67]

TRec: Learning Hand-Object Interactions through 2D Point Track Motion

Dennis Holzmann and Sven Wachsmuth. TRec: Learning Hand-Object Interactions through 2D Point Track Motion, January 2026. arXiv:2601.03667 [cs]

arXiv 2026
[68]

Video depth anything: Consistent depth estimation for super-long videos

Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, and Bingyi Kang. Video depth anything: Consistent depth estimation for super-long videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22831–22840, 2025

2025
[69]

Invariant recognition drives neural representations of action sequences

Andrea Tacchetti, Leyla Isik, and Tomaso Poggio. Invariant recognition drives neural representations of action sequences. PLOS Computational Biology, 13(12):1–20, December 2017. doi: 10.1371/journal.pcbi.1005859

work page doi:10.1371/journal.pcbi.1005859 2017
[70]

DisMo: Disentangled Motion Representations for Open-World Motion Transfer

Thomas Ressler-Antal, Frank Fundel, Malek Ben Alaya, Stefan Andreas Baumann, Felix Krause, Ming Gui, and Björn Ommer. DisMo: Disentangled Motion Representations for Open-World Motion Transfer. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

2026
[71]

Chirality in Action: Time-Aware Video Representation Learning by Latent Straightening

Piyush Nitin Bagad and Andrew Zisserman. Chirality in Action: Time-Aware Video Representation Learning by Latent Straightening. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

2026
[72]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, and others. DINOv2: Learning Robust Visual Features without Supervision. Transactions on Machine Learning Research Journal, 2024

2024
[73]

Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, and others. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025

Pith/arXiv arXiv 2025
[74]

A deeper dive into what deep spatiotemporal networks encode: Quantifying static vs. dynamic information

Matthew Kowal, Mennatullah Siam, Md Amirul Islam, Neil DB Bruce, Richard P Wildes, and Konstantinos G Derpanis. A deeper dive into what deep spatiotemporal networks encode: Quantifying static vs. dynamic information. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13999–14009, 2022

2022
[75]

RESOUND: Towards Action Recognition without Representation Bias

Yingwei Li, Yi Li, and Nuno Vasconcelos. RESOUND: Towards Action Recognition without Representation Bias. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018

2018
[76]

What Makes a Video a Video: Analyzing Temporal Information in Video Understanding Models and Datasets

De-An Huang, Vignesh Ramanathan, Dhruv Mahajan, Lorenzo Torresani, Manohar Paluri, Li Fei-Fei, and Juan Carlos Niebles. What Makes a Video a Video: Analyzing Temporal Information in Video Understanding Models and Datasets. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7366–7375, 2018. doi: 10.1109/CVPR.2018.00769

work page doi:10.1109/cvpr.2018.00769 2018
[77]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In International Conference on Learning Representations, 2019

2019

A Datasets

We evaluate our method on five action recognition datasets covering a range of video domains, label granularities, and motion characteristics. Table 5 summarizes the key statistics for each dataset. Kinetics-...

[48] [48]

Particle video revisited: Tracking through occlusions using point trajectories

Adam W Harley, Zhaoyuan Fang, and Katerina Fragkiadaki. Particle video revisited: Tracking through occlusions using point trajectories. InEuropean Conference on Computer Vision, pages 59–75. Springer, 2022

2022

[49] [49]

Tapir: Tracking any point with per-frame initialization and temporal refinement

Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, and Andrew Zisserman. Tapir: Tracking any point with per-frame initialization and temporal refinement. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10061–10072, 2023

2023

[50] [50]

Cotracker: It is better to track together

Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker: It is better to track together. InEuropean conference on computer vision, pages 18–35. Springer, 2024

2024

[51] [51]

Dense optical tracking: Connecting the dots

Guillaume Le Moing, Jean Ponce, and Cordelia Schmid. Dense optical tracking: Connecting the dots. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19187–19197, 2024

2024

[52] [52]

Local all-pair correspondence for point tracking

Seokju Cho, Jiahui Huang, Jisu Nam, Honggyu An, Seungryong Kim, and Joon-Young Lee. Local all-pair correspondence for point tracking. InEuropean conference on computer vision, pages 306–325. Springer, 2024

2024

[53] [53]

TAPNext++: What’s Next for Tracking Any Point (TAP)?arXiv preprint arXiv:2604.10582, 2026

Sebastian Jung, Artem Zholus, Martin Sundermeyer, Carl Doersch, Ross Goroshin, David Joseph Tan, Sarath Chandar, Rudolph Triebel, and Federico Tombari. TAPNext++: What’s Next for Tracking Any Point (TAP)?arXiv preprint arXiv:2604.10582, 2026

Pith/arXiv arXiv 2026

[54] [54]

Bootstap: Bootstrapped training for tracking-any-point

Carl Doersch, Pauline Luc, Yi Yang, Dilara Gokay, Skanda Koppula, Ankush Gupta, Joseph Heyward, Ignacio Rocco, Ross Goroshin, Joao Carreira, and others. Bootstap: Bootstrapped training for tracking-any-point. InProceedings of the Asian Conference on Computer Vision, pages 3257–3274, 2024

2024

[55] [55]

Spatialtracker: Tracking any 2d pixels in 3d space

Yuxi Xiao, Qianqian Wang, Shangzhan Zhang, Nan Xue, Sida Peng, Yujun Shen, and Xiaowei Zhou. Spatialtracker: Tracking any 2d pixels in 3d space. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20406–20417, 2024

2024

[56] [56]

Spatialtrackerv2: Advancing 3d point tracking with explicit camera motion

Yuxi Xiao, Jianyuan Wang, Nan Xue, Nikita Karaev, Yuri Makarov, Bingyi Kang, Xing Zhu, Hujun Bao, Yujun Shen, and Xiaowei Zhou. Spatialtrackerv2: Advancing 3d point tracking with explicit camera motion. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6726–6737, 2025. 13

2025

[57] [57]

DELTA: DENSE EFFICIENT LONG-RANGE 3D TRACKING FOR ANY VIDEO

Tuan Duc Ngo, Peiye Zhuang, Evangelos Kalogerakis, Chuang Gan, Sergey Tulyakov, Hsin- Ying Lee, and Chaoyang Wang. DELTA: DENSE EFFICIENT LONG-RANGE 3D TRACKING FOR ANY VIDEO. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[58] [58]

DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, XinQiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, and Xin Jin. DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

2026

[59] [59]

History-Aware Visuomotor Policy Learning via Point Tracking.arXiv preprint arXiv:2509.17141, 2025

Jingjing Chen, Hongjie Fang, Chenxi Wang, Shiquan Wang, and Cewu Lu. History-Aware Visuomotor Policy Learning via Point Tracking.arXiv preprint arXiv:2509.17141, 2025

arXiv 2025

[60] [60]

Generative Video Motion Editing with 3D Point Tracks

Yao-Chih Lee, Zhoutong Zhang, Jiahui Huang, Jui-Hsien Wang, Joon-Young Lee, Jia-Bin Huang, Eli Shechtman, and Zhengqi Li. Generative Video Motion Editing with 3D Point Tracks. arXiv preprint arXiv:2512.02015, 2025

arXiv 2025

[61] [61]

Vggsfm: Visual geometry grounded deep structure from motion

Jianyuan Wang, Nikita Karaev, Christian Rupprecht, and David Novotny. Vggsfm: Visual geometry grounded deep structure from motion. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21686–21697, 2024

2024

[62] [62]

Shape of motion: 4d reconstruction from a single video

Qianqian Wang, Vickie Ye, Hang Gao, Weijia Zeng, Jake Austin, Zhengqi Li, and Angjoo Kanazawa. Shape of motion: 4d reconstruction from a single video. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9660–9672, 2025

2025

[63] [63]

TrackerSplat: Exploiting Point Tracking for Fast and Robust Dynamic 3D Gaussians Reconstruction

Daheng Yin, Isaac Ding, Yili Jin, Jianxin Shi, and Jiangchuan Liu. TrackerSplat: Exploiting Point Tracking for Fast and Robust Dynamic 3D Gaussians Reconstruction. InProceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–11, 2025

2025

[64] [64]

Segment anything meets point tracking

Frano Rajiˇc, Lei Ke, Yu-Wing Tai, Chi-Keung Tang, Martin Danelljan, and Fisher Yu. Segment anything meets point tracking. InProceedings of the Winter Conference on Applications of Computer Vision, pages 9284–9293, 2025

2025

[65] [65]

Nettrack: Tracking highly dynamic objects with a net

Guangze Zheng, Shijie Lin, Haobo Zuo, Changhong Fu, and Jia Pan. Nettrack: Tracking highly dynamic objects with a net. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19145–19155, 2024

2024

[66] [66]

Forecasting Motion in the Wild, April 2026

Neerja Thakkar, Shiry Ginosar, Jacob Walker, Jitendra Malik, Joao Carreira, and Carl Doersch. Forecasting Motion in the Wild, April 2026. arXiv:2604.01015 [cs]

arXiv 2026

[67] [67]

TRec: Learning Hand-Object Interactions through 2D Point Track Motion, January 2026

Dennis Holzmann and Sven Wachsmuth. TRec: Learning Hand-Object Interactions through 2D Point Track Motion, January 2026. arXiv:2601.03667 [cs]

arXiv 2026

[68] [68]

Video depth anything: Consistent depth estimation for super-long videos

Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, and Bingyi Kang. Video depth anything: Consistent depth estimation for super-long videos. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22831–22840, 2025

2025

[69] [69]

Invariant recognition drives neural represen- tations of action sequences.PLOS Computational Biology, 13(12):1–20, December 2017

Andrea Tacchetti, Leyla Isik, and Tomaso Poggio. Invariant recognition drives neural represen- tations of action sequences.PLOS Computational Biology, 13(12):1–20, December 2017. doi: 10.1371/journal.pcbi.1005859

work page doi:10.1371/journal.pcbi.1005859 2017

[70] [70]

DisMo: Disentangled Motion Representations for Open-World Motion Transfer

Thomas Ressler-Antal, Frank Fundel, Malek Ben Alaya, Stefan Andreas Baumann, Felix Krause, Ming Gui, and Björn Ommer. DisMo: Disentangled Motion Representations for Open-World Motion Transfer. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

2026

[71] Piyush Nitin Bagad and Andrew Zisserman. Chirality in Action: Time-Aware Video Representation Learning by Latent Straightening. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

[72] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, and others. DINOv2: Learning Robust Visual Features without Supervision. Transactions on Machine Learning Research Journal, 2024

[73] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, and others. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025

[74] Matthew Kowal, Mennatullah Siam, Md Amirul Islam, Neil DB Bruce, Richard P Wildes, and Konstantinos G Derpanis. A deeper dive into what deep spatiotemporal networks encode: Quantifying static vs. dynamic information. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13999–14009, 2022

[75] Yingwei Li, Yi Li, and Nuno Vasconcelos. RESOUND: Towards Action Recognition without Representation Bias. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018

[76] De-An Huang, Vignesh Ramanathan, Dhruv Mahajan, Lorenzo Torresani, Manohar Paluri, Li Fei-Fei, and Juan Carlos Niebles. What Makes a Video a Video: Analyzing Temporal Information in Video Understanding Models and Datasets. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7366–7375, 2018. doi: 10.1109/CVPR.2018.00769

[77] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In International Conference on Learning Representations, 2019

A Datasets

We evaluate our method on five action recognition datasets covering a range of video domains, label granularities, and motion characteristics. Table 5 summarizes the key statistics for each dataset. Kinetics-...