pith. machine review for the scientific record.

arxiv: 2605.06809 · v1 · submitted 2026-05-07 · 💻 cs.CV · cs.LG

Recognition: 2 theorem links


LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute

Authors on Pith no claims yet

Pith reviewed 2026-05-11 01:32 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords video recognition · token selection · transformer efficiency · distillation · spatiotemporal tokens · action recognition · computational trade-offs

The pith

LookWhen selects unique tokens from downscaled videos to approximate full recognition at lower computation cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that video transformers can reduce expensive computation by factorizing recognition into a fast selector that ranks tokens from a scaled-down input and a deeper extractor that processes only the top-ranked ones. The selector ranks tokens by nearest-neighbor uniqueness, while the extractor is trained to match full-video features via distillation from a video teacher and a frame-normalized image teacher. This approach exploits redundancy in videos without task-specific labels for selection. A reader would care because it directly improves the accuracy-computation trade-off on standard benchmarks like action recognition and gesture classification.

Core claim

LookWhen is a selector-extractor framework that factorizes video recognition into learning when, where, and what to compute. The shallow selector scores all tokens across space-time from a scaled-down video using nearest-neighbor uniqueness ranking, and the deep extractor processes only the top-K selected tokens to approximate full-video representations. Selection is pre-trained with uniqueness scores, and extraction uses distillation from both a video teacher and a normalized image teacher to capture changes within videos, yielding general representations for downstream use.
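To ground the selection step, here is a minimal sketch of a "top1-distance" uniqueness score followed by top-K token selection, written against the description above rather than the authors' code; the token count, feature dimension, plain Euclidean metric, and 70% sparsity level are illustrative assumptions.

```python
import torch

def top1_distance_uniqueness(feats: torch.Tensor) -> torch.Tensor:
    """Score each token by the distance to its nearest neighbor.

    feats: (N, D) space-time patch features from a downscaled clip,
    e.g. produced by an image teacher. Tokens far from every other
    token receive high uniqueness scores.
    """
    # Pairwise Euclidean distances between all tokens: (N, N)
    dists = torch.cdist(feats, feats)
    # Ignore each token's distance to itself when taking the minimum.
    dists.fill_diagonal_(float("inf"))
    return dists.min(dim=1).values  # (N,) nearest-neighbor distance per token

def select_top_k(tokens: torch.Tensor, scores: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k highest-uniqueness tokens for the deep extractor."""
    idx = scores.topk(k).indices
    return tokens[idx]

# Toy usage: 1,568 space-time tokens (8 frames x 14 x 14 patches), 384-dim features.
tokens = torch.randn(1568, 384)
scores = top1_distance_uniqueness(tokens)
kept = select_top_k(tokens, scores, k=int(0.3 * tokens.shape[0]))  # ~70% sparsity
print(kept.shape)
```

In the full method the scores come from the learned shallow selector (trained to predict these targets), and the deep extractor runs only on the kept tokens.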

What carries the argument

The selector-extractor framework, where the selector ranks tokens by nearest-neighbor uniqueness on downscaled video input and the extractor approximates full representations through video and image teacher distillation.

If this is right

  • Achieves a better accuracy-computation trade-off than efficient models and upgraded baselines of similar size.
  • Pareto-dominates accuracy-FLOPs on 9 of 12 cases across 6 tasks and 2 settings.
  • Roughly matches performance on the remaining 3 cases while delivering higher throughput.
  • Yields general representations usable for feature extraction or fine-tuning to specific tasks.
  • Applies across diverse video recognition benchmarks including Kinetics-400, SSv2, Epic-Kitchens, Diving48, Jester, and Charades.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The downscaling-plus-uniqueness approach may extend to other redundant sequential data such as long audio clips or time-series sensor streams.
  • Separating a cheap selector from the heavy extractor opens the possibility of reusing one selector across multiple extractors or tasks.
  • Performance on videos of widely varying length or frame rate would test whether the scaled-down uniqueness ranking remains stable.
  • Real-time edge-device video analysis could benefit if the selector overhead stays negligible relative to the savings in the extractor.

Load-bearing premise

That ranking tokens by nearest-neighbor uniqueness in a scaled-down video and distilling from video and normalized image teachers produces selections that reliably approximate full-video representations without task-specific supervision.

What would settle it

If, on a held-out video dataset, LookWhen accuracy falls below full-video baselines at matched FLOPs levels or if selected tokens systematically miss key motion events in qualitative review.

Figures

Figures reproduced from arXiv: 2605.06809 by Ali Salamatian, Anthony Fuller, Evan Shelhamer, James R. Green, Leonid Sigal, Pritam Sarkar.

Figure 1: LookWhen's shallow selector gets a downscaled video and scores tokens on their feature uniqueness (left). Target uniqueness is from our "top1-distance" algorithm, which computes each patch's distance to its nearest neighbor in an image teacher's feature space (bottom right). LookWhen's extractor gets the top-K input tokens for sparse and deep processing. Target features are from a video teacher (top right). view at source ↗
Figure 2: When and where to compute for efficiency. InternVideo2 space-time attention maps suffer from artifacts. DINOv3 has cleaner attention but is strictly frame-wise. "Top1-dist" (each patch's distance to its nearest neighbor in feature space) finds the unique patches across all frames; our selector predicts it. The wolf is partly visible in frame 1, runs away, then toward the camera. view at source ↗
Figure 3: Example of learned selections. We pre-train on K400+SSv2 data and our selector generalizes to a video of an author's nephew being thrown in a pool and swimming. More in §A.2. view at source ↗
Figure 4: Linear probing (LP) and fine-tuning (FT) accuracy vs. FLOPs across six datasets. Our LookWhen (•) mostly outperforms the baselines in controlled settings. Gains are largest for LP, sometimes surpassing the dense InternVideo2 (⋆). We make these upgraded baselines by applying the sparsification methods vid-TLDR (■) [11] or RLT (▲) [8] to the SOTA ViT-B InternVideo2 [18]. view at source ↗
Figure 5: LookWhen (•) dominates baselines in mean accuracy (over 6 datasets and 2 settings) versus measured throughput. Markers: IV2 (⋆), IV2+vid-TLDR (■), and IV2+RLT (▲). Realized efficiency: LookWhen's efficiency gains increase when measured in practice. We measure throughput on an NVIDIA L40S GPU to check whether theoretical gains (accuracy-FLOPs) translate to practical gains (accuracy-throughput). view at source ↗
Figure 6: Throughput (videos/s) at inference time. All measurements are taken on an NVIDIA L40S GPU with batch size 32 and bfloat16 automatic mixed precision. RLT models with different sparsity have the same throughput because RLT requires token masking through attention masking (not token dropping!) for batch sizes greater than 1. Markers: LookWhen (•), IV2 (⋆), IV2+vid-TLDR (■), and IV2+RLT (▲). view at source ↗
Figure 7: Example from Kinetics-400 (frames 1-8, frames 9-16). view at source ↗
Figure 8: Example from Something-Something-v2 (frames 1-8, frames 9-16). view at source ↗
Figure 9: Example from Epic-Kitchens-100 (frames 1-8, frames 9-16). view at source ↗
Figure 10: Example from Diving48 (frames 1-8, frames 9-16). view at source ↗
Figure 11: Example from Jester (frames 1-8, frames 9-16). view at source ↗
Figure 12: Example from Charades (frames 1-8, frames 9-16). view at source ↗
Figure 13: Fine-tuning efficiency. We plot cumulative fine-tuning cost vs. accuracy. At 70% sparsity, LookWhen (•) reaches a given accuracy faster than the dense InternVideo2 (■) during fine-tuning. Each marker represents 1 epoch for EK-100 and Jester, and 5 epochs for Diving48 and Charades. view at source ↗
read the original abstract

Transformers dominate video recognition. They split videos into tokens, and processing them has expensive superlinear computational cost. Yet videos are filled with redundancy, so we can question the need for this expense. We introduce LookWhen, a selector-extractor framework that factorizes video recognition into learning when, where, and what to compute. Our shallow selector gets a scaled-down video and quickly scores all tokens across space-time, while our deep extractor gets the top-K selected tokens to approximate full-video representations without actually processing all the tokens. A key challenge is defining effective supervision for selection and extraction. For selection pre-training, we introduce a score on representations that ranks tokens by uniqueness using a simple nearest-neighbor distance. For extraction pre-training, we distill both a video teacher and an image teacher, for which we normalize its frame-wise representations to learn what changes within videos. Through these strategies, our selector-extractor learns general and efficient representations for feature extraction or fine-tuning to a task. Through experiments on Kinetics-400, SSv2, Epic-Kitchens, Diving48, Jester, and Charades, we show that LookWhen achieves a better accuracy-computation trade-off than efficient models and upgraded baselines of similar size. LookWhen Pareto-dominates in accuracy-FLOPs on 9 of 12 cases (6 tasks x 2 settings) and roughly matches on 3. In accuracy-throughput, measuring time in practice, LookWhen is more efficient still at 6.7x faster than InternVideo2-B at equal accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces LookWhen, a selector-extractor framework for efficient video recognition with transformers. A shallow selector processes a scaled-down video and ranks tokens via nearest-neighbor uniqueness to select top-K tokens; a deep extractor then processes only those tokens. Selection is pre-trained unsupervised via the uniqueness score, while extraction uses distillation from a video teacher and a frame-normalized image teacher. Experiments across Kinetics-400, SSv2, Epic-Kitchens, Diving48, Jester, and Charades report that LookWhen Pareto-dominates accuracy-FLOPs trade-offs in 9 of 12 cases (6 tasks × 2 settings) and achieves 6.7× higher throughput than InternVideo2-B at matched accuracy.

Significance. If the empirical trade-offs hold under rigorous validation, the factorization into when/where/what computation offers a practical route to lower inference cost in video transformers without task-specific supervision during pre-training. The combination of unsupervised token ranking and dual-teacher distillation is a concrete contribution that could influence follow-on work on adaptive computation, provided the selected tokens prove reliably informative across domains.

major comments (3)
  1. [§3] §3 (Selector pre-training): The central claim that nearest-neighbor uniqueness on a scaled-down video produces top-K tokens whose extractor outputs approximate full-video representations rests on an unverified assumption. No ablation or visualization shows that the ranked tokens correlate with motion or action cues rather than static backgrounds or artifacts; this directly affects whether the reported 9/12 Pareto dominance generalizes.
  2. [Experiments] Experiments section (results tables): The accuracy-FLOPs and accuracy-throughput comparisons lack error bars, multiple random seeds, or statistical tests. Without these, it is impossible to determine whether the claimed dominance over upgraded baselines of similar size is robust or sensitive to post-hoc choices in baseline implementations.
  3. [§4] §4 (Extraction pre-training): The frame-wise normalization of the image teacher is presented as enabling learning of intra-video changes, yet no controlled comparison quantifies its contribution versus a standard image teacher. This detail is load-bearing for the extraction stage that underpins the efficiency gains.
minor comments (2)
  1. [Abstract and §3] The abstract and method sections use 'scaled-down video' without specifying the exact spatial or temporal downsampling factors or the resulting token count; adding these numbers would improve reproducibility.
  2. [Tables in Experiments] Table captions for the 12-case Pareto results should explicitly list the two settings (e.g., fine-tuning vs. linear probing) to avoid ambiguity when comparing to baselines.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address each major comment below, providing clarifications and committing to revisions that strengthen the empirical support for our claims without altering the core contributions.

read point-by-point responses
  1. Referee: [§3] §3 (Selector pre-training): The central claim that nearest-neighbor uniqueness on a scaled-down video produces top-K tokens whose extractor outputs approximate full-video representations rests on an unverified assumption. No ablation or visualization shows that the ranked tokens correlate with motion or action cues rather than static backgrounds or artifacts; this directly affects whether the reported 9/12 Pareto dominance generalizes.

    Authors: The uniqueness score is defined as the nearest-neighbor distance in the representation space of the downscaled video, which by construction identifies tokens that deviate from their local spatio-temporal context. In practice, this tends to surface dynamic elements because static regions produce low uniqueness scores. While the submitted manuscript does not include explicit token visualizations or an ablation against motion heuristics, the consistent Pareto dominance across six datasets with varying background complexity (including Epic-Kitchens and Charades) provides indirect support. In the revision we will add (i) qualitative visualizations of selected tokens on representative videos and (ii) a quantitative ablation comparing uniqueness selection to random and optical-flow-based alternatives, reporting the resulting accuracy-FLOPs curves. revision: yes

  2. Referee: [Experiments] Experiments section (results tables): The accuracy-FLOPs and accuracy-throughput comparisons lack error bars, multiple random seeds, or statistical tests. Without these, it is impossible to determine whether the claimed dominance over upgraded baselines of similar size is robust or sensitive to post-hoc choices in baseline implementations.

    Authors: We agree that variability estimates are necessary to substantiate robustness. The reported numbers were obtained from single training runs per configuration due to the high computational cost of video transformer training. In the revised manuscript we will re-train the primary LookWhen variants and the strongest baselines with three independent random seeds, report mean accuracy together with standard deviation, add error bars to all tables and figures, and include a brief note on statistical significance where differences exceed one standard deviation. revision: yes

  3. Referee: [§4] §4 (Extraction pre-training): The frame-wise normalization of the image teacher is presented as enabling learning of intra-video changes, yet no controlled comparison quantifies its contribution versus a standard image teacher. This detail is load-bearing for the extraction stage that underpins the efficiency gains.

    Authors: Frame-wise normalization subtracts the per-video mean from each frame's representation, thereby directing the image teacher toward temporal differences rather than absolute appearance. Although the current version does not isolate this component with a controlled ablation, the dual-teacher objective (video teacher + normalized image teacher) demonstrably improves the accuracy-FLOPs frontier relative to video-only distillation. We will add an explicit ablation table in the revision that compares the normalized image teacher against an un-normalized image teacher while keeping all other factors fixed, thereby quantifying its isolated contribution to the reported efficiency gains. revision: yes
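To make the normalization concrete, here is a minimal sketch assuming the image teacher emits per-frame patch features of shape (frames, patches, dim) and that "per-video mean" means the average over frames; it illustrates the idea and the proposed ablation, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def normalized_image_targets(frame_feats: torch.Tensor) -> torch.Tensor:
    """Subtract the per-video mean so image-teacher targets emphasize what
    changes across frames rather than static appearance.

    frame_feats: (T, N, D) image-teacher patch features
                 (T frames, N patches per frame, D feature dims).
    """
    video_mean = frame_feats.mean(dim=0, keepdim=True)  # (1, N, D): average appearance over the clip
    return frame_feats - video_mean                      # residual features: intra-video change

# Toy usage: 16 frames, 196 patches, 768-dim features.
teacher_feats = torch.randn(16, 196, 768)
targets = normalized_image_targets(teacher_feats)

# The ablation proposed above would swap `targets` for the raw `teacher_feats`
# while keeping the distillation loss fixed (MSE is an assumed choice here).
student_patch_feats = torch.randn(16, 196, 768)  # placeholder for extractor outputs
distill_loss = F.mse_loss(student_patch_feats, targets)
```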

Circularity Check

0 steps flagged

No significant circularity; empirical results independent of pre-training definitions

full rationale

The paper defines a selector via nearest-neighbor uniqueness ranking on downscaled video inputs and an extractor via distillation from external video and frame-normalized image teachers; these are explicit design choices with independent supervision signals. The central performance claims (Pareto dominance on 9/12 accuracy-FLOPs cases and 6.7x throughput gain) are established through direct experiments on Kinetics-400, SSv2, Epic-Kitchens and other benchmarks rather than by algebraic reduction or renaming of the pre-training quantities. No equations equate the final recognition accuracy to the uniqueness scores or distillation losses by construction, no fitted parameters are relabeled as predictions, and no load-bearing self-citations appear in the derivation. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim depends on the domain assumption that uniqueness (nearest-neighbor distance) provides useful supervision for token selection and that dual distillation yields general representations; the top-K threshold is an explicit free parameter.

free parameters (1)
  • top-K
    Hyperparameter controlling how many tokens the extractor receives; directly sets the accuracy-computation operating point (see the sketch after this ledger).
axioms (1)
  • domain assumption: Video tokens can be ranked by uniqueness using nearest-neighbor distance in representation space without task labels.
    Invoked for selection pre-training.
invented entities (1)
  • LookWhen selector-extractor framework (no independent evidence)
    purpose: Factorizes video recognition into separate selection and extraction stages.
    Core new architecture introduced by the paper.
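As a rough illustration of how this single knob moves the operating point: with attention cost roughly quadratic and MLP cost roughly linear in token count, the extractor's FLOPs shrink faster than the keep ratio. The attention share below is an assumed figure, not one reported in the paper, and the selector's own (small) cost is ignored.

```python
def relative_extractor_cost(keep_ratio: float, attn_fraction: float = 0.3) -> float:
    """Rough relative cost of the deep extractor vs. processing all tokens.

    Attention scales ~quadratically and MLP/projection layers ~linearly with
    token count; attn_fraction is the assumed share of dense-model FLOPs
    spent in attention.
    """
    return attn_fraction * keep_ratio**2 + (1 - attn_fraction) * keep_ratio

for keep in (1.0, 0.5, 0.3):
    print(f"keep {keep:.0%} of tokens -> ~{relative_extractor_cost(keep):.0%} of dense extractor FLOPs")
```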

pith-pipeline@v0.9.0 · 5598 in / 1311 out tokens · 53884 ms · 2026-05-11T01:32:47.673040+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 5 internal anchors

  1. [1]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

  2. [2]

    Vivit: A video vision transformer

    Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. Vivit: A video vision transformer. In International Conference on Computer Vision (ICCV), 2021

  3. [3]

    Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training.Advances in neural information processing systems, 35:10078–10093, 2022

    Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training.Advances in neural information processing systems, 35:10078–10093, 2022

  4. [4]

    Anticipative video transformer

    Rohit Girdhar and Kristen Grauman. Anticipative video transformer. InProceedings of the IEEE/CVF international conference on computer vision, pages 13505–13515, 2021

  5. [5]

    Videomae v2: Scaling video masked autoencoders with dual masking

    Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14549–14560, 2023

  6. [6]

    Lookwhere? efficient visual recognition by learning where to look and what to see from self-supervision

    Anthony Fuller, Yousef Yassin, Junfeng Wen, Tarek Ibrahim, Daniel Kyrollos, James R Green, and Evan Shelhamer. Lookwhere? efficient visual recognition by learning where to look and what to see from self-supervision. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

  7. [7]

    Vision transformers need registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. InThe Twelfth International Conference on Learning Representations, 2024

  8. [8]

    Don’t look twice: Faster video transformers with run-length tokenization.Advances in Neural Information Processing Systems, 37:28127–28149, 2024

    Rohan Choudhury, Guanglei Zhu, Sihan Liu, Koichiro Niinuma, Kris M Kitani, and László A Jeni. Don’t look twice: Faster video transformers with run-length tokenization.Advances in Neural Information Processing Systems, 37:28127–28149, 2024

  9. [9]

    K-centered patch sampling for efficient video recognition

    Seong Hyeon Park, Jihoon Tack, Byeongho Heo, Jung-Woo Ha, and Jinwoo Shin. K-centered patch sampling for efficient video recognition. InEuropean Conference on Computer Vision, pages 160–176. Springer, 2022

  10. [10]

    Token Merging: Your ViT But Faster

    Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster.arXiv preprint arXiv:2210.09461, 2022

  11. [11]

    vid-tldr: Training free token merging for light-weight video transformer

    Joonmyung Choi, Sanghyeok Lee, Jaewon Chu, Minhyuk Choi, and Hyunwoo J Kim. vid-tldr: Training free token merging for light-weight video transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18771–18781, 2024

  12. [12]

    The kinetics human action video dataset, 2017

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijaya- narasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset, 2017

  13. [13]

    The "something something" video database for learning and evaluating visual common sense

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pages 5842–5850, 2017

  14. [14]

    Scaling egocentric vision: The epic-kitchens dataset

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Scaling egocentric vision: The epic-kitchens dataset. InEuropean Conference on Computer Vision (ECCV), 2018

  15. [15]

    Resound: Towards action recognition without representation bias

    Yingwei Li, Yi Li, and Nuno Vasconcelos. Resound: Towards action recognition without representation bias. In Proceedings of the European conference on computer vision (ECCV), pages 513–528, 2018

  16. [16]

    The jester dataset: A large-scale video dataset of human gestures

    Joanna Materzynska, Guillaume Berger, Ingo Bax, and Roland Memisevic. The jester dataset: A large-scale video dataset of human gestures. InProceedings of the IEEE/CVF international conference on computer vision workshops, pages 0–0, 2019

  17. [17]

    Hollywood in homes: Crowdsourcing data collection for activity understanding

    Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ivan Laptev, Ali Farhadi, and Abhinav Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding.ArXiv e-prints, 2016

  18. [18]

    Internvideo2: Scaling foundation models for multimodal video understanding

    Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Zun Wang, Yansong Shi, et al. Internvideo2: Scaling foundation models for multimodal video understanding. InEuropean conference on computer vision, pages 396–416. Springer, 2024

  19. [19]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the International Conference on Computer Vision (ICCV), 2021

  20. [20]

    Do all vision transformers need registers? a cross-architectural reassessment.arXiv preprint arXiv:2603.25803, 2026

    Spiros Baxevanakis, Platon Karageorgis, Ioannis Dravilas, and Konrad Szewczyk. Do all vision transformers need registers? a cross-architectural reassessment.arXiv preprint arXiv:2603.25803, 2026

  21. [21]

    Dvt: Denoising vision transformers

    Jiawei Yang, Katie Z Luo, Jiefeng Li, Congyue Deng, Leonidas J. Guibas, Dilip Krishnan, Kilian Q Weinberger, Yonglong Tian, and Yue Wang. Dvt: Denoising vision transformers. arXiv preprint arXiv:2401.02957, 2024

  22. [22]

    Vision Transformers Need More Than Registers

    Cheng Shi, Yizhou Yu, and Sibei Yang. Vision transformers need more than registers.arXiv preprint arXiv:2602.22394, 2026

  23. [23]

    Vision transformers with self-distilled registers

    Zipeng Yan, Yinjie Chen, Chong Zhou, Bo Dai, and Andrew Luo. Vision transformers with self-distilled registers. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

  24. [24]

    Thicker and quicker: The jumbo token for fast plain vision transformers

    Anthony Fuller, Yousef Yassin, Daniel Kyrollos, Evan Shelhamer, and James R Green. Thicker and quicker: The jumbo token for fast plain vision transformers. InThe Fourteenth International Conference on Learning Representations, 2026

  25. [25]

    V-jepa 2: Self-supervised video models enable understanding, prediction and planning, 2025

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba, Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xia...

  26. [26]

    Principles of visual tokens for efficient video understanding

    Xinyue Hao, Gen Li, Shreyank N Gowda, Robert B Fisher, Jonathan Huang, Anurag Arnab, and Laura Sevilla-Lara. Principles of visual tokens for efficient video understanding. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 21254–21264, 2025

  27. [27]

    Oriane Siméoni, Huy V . V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie...

  28. [28]

    Masked autoencoders as spatiotemporal learners.arXiv:2205.09113, 2022

    Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, and Kaiming He. Masked autoencoders as spatiotemporal learners.arXiv:2205.09113, 2022

  29. [29]

    Unmasked teacher: Towards training-efficient video foundation models

    Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, Limin Wang, and Yu Qiao. Unmasked teacher: Towards training-efficient video foundation models. InProceedings of the IEEE/CVF international conference on computer vision, pages 19948–19960, 2023

  30. [30]

    Videomamba: State space model for efficient video understanding

    Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, and Yu Qiao. Videomamba: State space model for efficient video understanding. In European conference on computer vision, pages 237–255. Springer, 2024

  31. [31]

    Snakes and ladders: Two steps up for videomamba

    Hui Lu, Albert A Salah, and Ronald Poppe. Snakes and ladders: Two steps up for videomamba. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 24234– 24244, 2025

  32. [32]

    Efficient video transformers with spatial-temporal token selection

    Junke Wang, Xitong Yang, Hengduo Li, Li Liu, Zuxuan Wu, and Yu-Gang Jiang. Efficient video transformers with spatial-temporal token selection. InEuropean Conference on Computer Vision, pages 69–86. Springer, 2022

  33. [33]

    FlashAttention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024

  34. [34]

    Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems, 32, 2019

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems, 32, 2019

  35. [35]

    Pointnet++: Deep hierarchical feature learning on point sets in a metric space.Advances in neural information processing systems, 30

    Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space.Advances in neural information processing systems, 30

  36. [36]

    Agglomerative token clustering

    Joakim Bruslund Haurum, Sergio Escalera, Graham W Taylor, and Thomas B Moeslund. Agglomerative token clustering. InEuropean Conference on Computer Vision, pages 200–218. Springer, 2024

  37. [37]

    Learning to merge tokens via decoupled embedding for efficient vision transformers

    Dong Hoon Lee and Seunghoon Hong. Learning to merge tokens via decoupled embedding for efficient vision transformers. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  38. [38]

    Accelerating transformers with spectrum- preserving token merging.Advances in Neural Information Processing Systems, 37:30772– 30810, 2024

    Hoai-Chau Tran, Duy M Nguyen, TrungTin Nguyen, Ngan Le, Pengtao Xie, Daniel Sonntag, James Zou, Binh T Nguyen, and Mathias Niepert. Accelerating transformers with spectrum- preserving token merging.Advances in Neural Information Processing Systems, 37:30772– 30810, 2024

  39. [39]

    Prune spatio-temporal tokens by semantic-aware temporal accumulation

    Shuangrui Ding, Peisen Zhao, Xiaopeng Zhang, Rui Qian, Hongkai Xiong, and Qi Tian. Prune spatio-temporal tokens by semantic-aware temporal accumulation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 16945–16956, 2023

  40. [40]

    Everest: Efficient masked video autoencoder by removing redundant spatiotemporal tokens

    Sunil Hwang, Jaehong Yoon, Youngwan Lee, and Sung Ju Hwang. Everest: Efficient masked video autoencoder by removing redundant spatiotemporal tokens. InInternational Conference on Machine Learning, 2024

  41. [41]

    Attend before attention: Efficient and scalable video understanding via autoregressive gazing.arXiv preprint arXiv:2603.12254, 2026

    Baifeng Shi, Stephanie Fu, Long Lian, Hanrong Ye, David Eigen, Aaron Reite, Boyi Li, Jan Kautz, Song Han, David M Chan, et al. Attend before attention: Efficient and scalable video understanding via autoregressive gazing.arXiv preprint arXiv:2603.12254, 2026

  42. [42]

    Is space-time attention all you need for video understanding

    Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding?arXiv preprint arXiv:2102.05095, 2021

  43. [43]

    Space-time mixing attention for video transformer

    Adrian Bulat, Juan-Manuel Perez-Rua, Swathikiran Sudhakaran, Brais Martinez, and Georgios Tzimiropoulos. Space-time mixing attention for video transformer. InAdvances in Neural Information Processing Systems

  44. [44]

    Video-focalnets: Spatio-temporal focal modulation for video action recognition

    Syed Talal Wasim, Muhammad Uzair Khattak, Muzammal Naseer, Salman Khan, Mubarak Shah, and Fahad Shahbaz Khan. Video-focalnets: Spatio-temporal focal modulation for video action recognition. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 13778–13789, 2023

  45. [45]

    X3d: Expanding architectures for efficient video recognition

    Christoph Feichtenhofer. X3d: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 203–213, 2020

  46. [46]

    Slowfast networks for video recognition

    Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6202–6211, 2019

  47. [47]

    Memvit: Memory-augmented multiscale vision transformer for efficient long-term video recognition

    Chao-Yuan Wu, Yanghao Li, Karttikeya Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. Memvit: Memory-augmented multiscale vision transformer for efficient long-term video recognition. InProceedings of the ieee/cvf conference on computer vision and pattern recognition, pages 13587–13597, 2022

  48. [48]

    Nvila: Efficient frontier visual language models

    Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, et al. Nvila: Efficient frontier visual language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4122–4134, 2025

  49. [49]

    Pumer: Pruning and merging tokens for efficient vision language models

    Qingqing Cao, Bhargavi Paranjape, and Hannaneh Hajishirzi. Pumer: Pruning and merging tokens for efficient vision language models. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12890–12903, 2023

  50. [50]

    Storm: Token-efficient long video understanding for multimodal llms

    Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, et al. Storm: Token-efficient long video understanding for multimodal llms. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5830–5841, 2025

  51. [51]

    Longvu: Spatiotemporal adaptive compression for long video-language understanding. arXiv preprint arXiv:2410.17434, 2024

    Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, et al. Longvu: Spatiotemporal adaptive compression for long video-language understanding. arXiv preprint arXiv:2410.17434, 2024

  52. [52]

    Chat-univi: Unified visual representation empowers large language models with image and video understanding

    Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13700–13710, 2024

  53. [53]

    Testa: Temporal-spatial token aggregation for long-form video-language understanding

    Shuhuai Ren, Sishuo Chen, Shicheng Li, Xu Sun, and Lu Hou. Testa: Temporal-spatial token aggregation for long-form video-language understanding. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 932–947, 2023

  54. [54]

    Slowfast-llava-1.5: A family of token-efficient video large language models for long-form video understanding.arXiv preprint arXiv:2503.18943, 2025

    Mingze Xu, Mingfei Gao, Shiyu Li, Jiasen Lu, Zhe Gan, Zhengfeng Lai, Meng Cao, Kai Kang, Yinfei Yang, and Afshin Dehghan. Slowfast-llava-1.5: A family of token-efficient video large language models for long-form video understanding.arXiv preprint arXiv:2503.18943, 2025

  55. [55]

    Llava-mini: Efficient image and video large mul- timodal models with one vision token.arXiv preprint arXiv:2501.03895, 2025

    Shaolei Zhang, Qingkai Fang, Zhe Yang, and Yang Feng. Llava-mini: Efficient image and video large multimodal models with one vision token.arXiv preprint arXiv:2501.03895, 2025

  56. [56]

    Dycoke: Dynamic compression of tokens for fast video large language models

    Keda Tao, Can Qin, Haoxuan You, Yang Sui, and Huan Wang. Dycoke: Dynamic compression of tokens for fast video large language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18992–19001, 2025

  57. [57]

    Efficient universal perception encoder.arXiv preprint arXiv:2603.22387, 2026

    Chenchen Zhu, Saksham Suri, Cijo Jose, Maxime Oquab, Marc Szafraniec, Wei Wen, Yunyang Xiong, Patrick Labatut, Piotr Bojanowski, Raghuraman Krishnamoorthi, et al. Efficient universal perception encoder.arXiv preprint arXiv:2603.22387, 2026

  58. [58]

    T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability

    Savya Khosla, Sethuraman TV , Aryan Chadha, Alex Schwing, and Derek Hoiem. T-ren: Learning text-aligned region tokens improves dense vision-language alignment and scalability. arXiv preprint arXiv:2604.18573, 2026

  59. [59]

    TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment

    Bingyi Cao, Koert Chen, Kevis-Kokitsi Maninis, Kaifeng Chen, Arjun Karpur, Ye Xia, Sahil Dua, Tanmaya Dabral, Guangxing Han, Bohyung Han, et al. Tipsv2: Advancing vision-language pretraining with enhanced patch-text alignment.arXiv preprint arXiv:2604.12012, 2026

  60. [60]

    V-jepa 2.1: Unlocking dense features in video self-supervised learning

    Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha, Mike Rabbat, Yann LeCun, Nicolas Ballas, and Adrien Bardes. V-jepa 2.1: Unlocking dense features in video self-supervised learning.arXiv preprint arXiv:2603.14482, 2026

  61. [61]

    mixup: Beyond Empirical Risk Minimization

    Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017

  62. [62]

    Non-local neural networks

    Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018

  63. [63]

    A large-scale study on unsupervised spatiotemporal representation learning

    Christoph Feichtenhofer, Haoqi Fan, Bo Xiong, Ross Girshick, and Kaiming He. A large-scale study on unsupervised spatiotemporal representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3299–3309, 2021

  64. [64]

    Temporal segment networks for action recognition in videos.IEEE transactions on pattern analysis and machine intelligence, 41(11):2740–2755, 2018

    Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks for action recognition in videos. IEEE transactions on pattern analysis and machine intelligence, 41(11):2740–2755, 2018