Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation
Pith reviewed 2026-05-10 16:34 UTC · model grok-4.3
The pith
Learnable Motion-Focused Tokenization improves video unsupervised domain adaptation by discarding low-motion background tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LMFT tokenizes video frames into patch tokens and learns to discard low-motion, redundant tokens, primarily corresponding to background regions, while retaining motion-rich, action-relevant tokens for adaptation. When used within a VUDA framework, this approach achieves state-of-the-art performance on three standard VUDA benchmarks across 21 domain adaptation settings while significantly reducing computational overhead compared with prior methods.
What carries the argument
Learnable Motion-Focused Tokenization (LMFT), which identifies and keeps motion-rich patch tokens from video frames to focus adaptation on action-relevant content.
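To make the selection principle concrete, here is a minimal sketch of motion-based token scoring and top-k retention. The paper trains its selector end-to-end; the frame-difference scoring rule, patch size, and top-k policy below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def motion_token_scores(frames, patch=16):
    """Score each spatial patch token by mean absolute frame difference.

    frames: (T, H, W) grayscale video with H, W divisible by `patch`.
    Returns an (H//patch, W//patch) array of motion scores. This is a
    hand-crafted proxy for LMFT's *learned* selector (an assumption).
    """
    diff = np.abs(np.diff(frames.astype(np.float32), axis=0))  # (T-1, H, W)
    t, h, w = diff.shape
    # Average motion within each non-overlapping patch.
    patches = diff.reshape(t, h // patch, patch, w // patch, patch)
    return patches.mean(axis=(0, 2, 4))

def select_motion_tokens(frames, keep_ratio=0.5, patch=16):
    """Return flat indices of the top-`keep_ratio` highest-motion patches."""
    scores = motion_token_scores(frames, patch).ravel()
    k = max(1, int(round(keep_ratio * scores.size)))
    return np.argsort(scores)[::-1][:k]
```

On a toy clip where only one patch region moves, the selector keeps exactly that patch, which is the behavior the paper attributes to LMFT on background-heavy videos.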
If this is right
- State-of-the-art results on standard VUDA benchmarks across 21 settings.
- Significant reduction in computational overhead during adaptation.
- Better handling of domain shifts caused by differing static backgrounds in source and target videos.
- Retention of action-relevant information while removing redundant background tokens.
Where Pith is reading between the lines
- The same selection principle could extend to supervised video tasks, or to other video problems such as detection and captioning, where processing background tokens is costly.
- Training the motion selector without target labels might transfer to other unsupervised video adaptation settings beyond action recognition.
- The method's success depends on motion reliably signaling action importance, so it may need adjustments for actions that rely on subtle or static cues.
Load-bearing premise
That low-motion tokens primarily correspond to uninformative background regions whose removal will not discard action-relevant information and that the learnable selection process can be trained effectively in the unsupervised target domain.
What would settle it
A set of target-domain videos containing important actions performed with minimal motion, where discarding low-motion tokens causes clear drops in recognition accuracy.
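The falsification test above can be sketched as a small harness that compares accuracy with all tokens against accuracy with motion-pruned tokens on a low-motion subset. The `model(video, token_idx)` and `selector(video, keep_ratio)` interfaces are hypothetical, since the paper does not publish such a protocol; a large gap on minimal-motion actions would undermine the load-bearing premise.

```python
def pruning_sensitivity(model, videos, labels, selector, keep_ratio, n_tokens):
    """Accuracy with all tokens vs. with motion-pruned tokens.

    `model(video, token_idx)` returns a predicted label given the set of
    retained token indices; `selector(video, keep_ratio)` returns the
    indices to keep. Both interfaces are assumptions for illustration.
    """
    full_hits = pruned_hits = 0
    for video, label in zip(videos, labels):
        full_hits += model(video, set(range(n_tokens))) == label
        pruned_hits += model(video, set(selector(video, keep_ratio))) == label
    n = len(labels)
    return full_hits / n, pruned_hits / n
```

A stub where the discriminative cue sits in a static (low-motion) token shows the failure mode directly: full-token accuracy stays high while pruned accuracy collapses.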
Original abstract
Video Unsupervised Domain Adaptation (VUDA) poses a significant challenge in action recognition, requiring the adaptation of a model from a labeled source domain to an unlabeled target domain. Despite recent advances, existing VUDA methods often fall short of fully supervised performance, a key reason being the prevalence of static and uninformative backgrounds that exacerbate domain shifts. Additionally, prior approaches largely overlook computational efficiency, limiting real-world adoption. To address these issues, we propose Learnable Motion-Focused Tokenization (LMFT) for VUDA. LMFT tokenizes video frames into patch tokens and learns to discard low-motion, redundant tokens, primarily corresponding to background regions, while retaining motion-rich, action-relevant tokens for adaptation. Extensive experiments on three standard VUDA benchmarks across 21 domain adaptation settings show that our VUDA framework with LMFT achieves state-of-the-art performance while significantly reducing computational overhead. LMFT thus enables VUDA that is both effective and computationally efficient.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Learnable Motion-Focused Tokenization (LMFT) for Video Unsupervised Domain Adaptation (VUDA). It tokenizes video frames into patch tokens and introduces a learnable mechanism to discard low-motion, redundant tokens (assumed to be background) while retaining motion-rich, action-relevant tokens. The framework is evaluated on three standard VUDA benchmarks across 21 domain adaptation settings, claiming state-of-the-art performance alongside substantial reductions in computational overhead.
Significance. If the central claims hold after addressing the noted concerns, the work would be significant for video domain adaptation: it directly targets the background-induced domain shift problem while simultaneously improving efficiency, an aspect often neglected in prior VUDA literature. The motion-focused token pruning offers a practical route to scalable action recognition adaptation.
Major comments (3)
- [Abstract, §3] Abstract and §3 (LMFT description): The core assumption that low-motion tokens 'primarily correspond to background regions' and can be safely discarded without losing action-relevant information is load-bearing for both the effectiveness and efficiency claims. Yet the unsupervised training of the selector in the target domain provides no validation against ground-truth action regions, and no robustness checks for action classes whose discriminative cues are static or low-motion (e.g., pose-based actions).
- [§4] §4 (Experiments): The SOTA results across 21 settings are asserted without reported ablations that isolate the contribution of the learnable motion selector versus other framework components, or tests that forcibly retain low-motion tokens to measure information loss; this leaves the efficiency-performance tradeoff unsubstantiated.
- [§3.2] §3.2 (Token selection mechanism): The end-to-end training of the motion-based selector in the unlabeled target domain risks domain-shift sensitivity in motion statistics, but no analysis or failure-case discussion is provided for scenarios where motion estimation itself shifts across domains.
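The ablation controls requested in the second major comment can be expressed as three retention policies over per-token motion scores: the LMFT-style motion top-k, a random baseline, and forced retention of the lowest-motion tokens to bound the information they carry. These are the reviewer's suggested controls sketched under assumed interfaces, not experiments reported in the paper.

```python
import random

def retention_variants(scores, keep_ratio, seed=0):
    """Three token-retention policies for the requested ablation.

    `scores` are per-token motion scores (higher = more motion).
    Returns index sets for: motion-based top-k (the LMFT-style policy),
    a random-retention baseline, and forced retention of the
    lowest-motion tokens. Policies and signatures are illustrative.
    """
    rng = random.Random(seed)
    n = len(scores)
    k = max(1, int(round(keep_ratio * n)))
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    motion_topk = set(order[:k])          # keep most motion
    random_k = set(rng.sample(range(n), k))  # chance baseline
    low_motion_k = set(order[-k:])        # keep least motion
    return motion_topk, random_k, low_motion_k
```

Comparing downstream accuracy under these three policies would isolate the selector's contribution and quantify any information loss from discarding low-motion tokens.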
Minor comments (2)
- [Abstract] The abstract would be clearer with explicit naming of the three benchmarks and a one-sentence summary of the tokenization architecture.
- [§3] Notation for the motion estimation and selection thresholds should be defined consistently between text and any equations or pseudocode.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review of our manuscript. The comments highlight important aspects of our assumptions, experimental validation, and potential limitations, which we address point by point below. We will incorporate revisions to strengthen the paper as outlined.
Point-by-point responses
Referee: [Abstract, §3] Abstract and §3 (LMFT description): The core assumption that low-motion tokens 'primarily correspond to background regions' and can be safely discarded without losing action-relevant information is load-bearing for both the effectiveness and efficiency claims. Yet the unsupervised training of the selector in the target domain provides no validation against ground-truth action regions, and no robustness checks for action classes whose discriminative cues are static or low-motion (e.g., pose-based actions).
Authors: We acknowledge that the assumption linking low-motion tokens to background regions is central and that direct ground-truth validation is unavailable in the unsupervised target domain. Our empirical results across 21 settings demonstrate consistent gains, providing indirect support. To address the concern directly, we will revise §3 and the abstract to clarify the assumption's scope, add a limitations discussion for low-motion or pose-based actions, and include qualitative token visualizations in the supplementary material. revision: yes
Referee: [§4] §4 (Experiments): The SOTA results across 21 settings are asserted without reported ablations that isolate the contribution of the learnable motion selector versus other framework components, or tests that forcibly retain low-motion tokens to measure information loss; this leaves the efficiency-performance tradeoff unsubstantiated.
Authors: The referee is correct that dedicated ablations isolating the learnable selector are not reported. While overall SOTA comparisons and efficiency metrics are provided, we will add targeted ablations in the revised §4, including variants with/without the selector, random token retention baselines, and forced retention of low-motion tokens to quantify any information loss and better substantiate the tradeoff. revision: yes
Referee: [§3.2] §3.2 (Token selection mechanism): The end-to-end training of the motion-based selector in the unlabeled target domain risks domain-shift sensitivity in motion statistics, but no analysis or failure-case discussion is provided for scenarios where motion estimation itself shifts across domains.
Authors: We agree that domain shifts in motion statistics represent a potential risk not explicitly analyzed. The end-to-end adaptation and strong cross-domain results provide some evidence of robustness, but we will revise §3.2 to include an analysis of motion statistic differences across domains and a discussion of possible failure cases, drawing on examples from the evaluated benchmarks. revision: yes
Circularity Check
No circularity: LMFT is a proposed method whose performance claims rest on experimental benchmarks rather than self-referential definitions or fitted inputs.
full rationale
The paper introduces LMFT as a learnable token selection process that discards low-motion tokens while retaining action-relevant ones for VUDA. No equations or steps in the abstract or description reduce a claimed prediction or result to its own inputs by construction; the selection is trained end-to-end on the target domain without invoking self-citations for uniqueness or smuggling ansatzes. The SOTA performance is asserted via experiments across 21 settings on three benchmarks, which constitutes independent empirical content rather than a renaming or self-definition. This is a standard method-proposal paper with no load-bearing circular steps.