pith. machine review for the scientific record. sign in

arxiv: 2604.20760 · v1 · submitted 2026-04-22 · 💻 cs.CV

Recognition: unknown

Exploring High-Order Self-Similarity for Video Understanding

Authors on Pith no claims yet

Pith reviewed 2026-05-10 00:55 UTC · model grok-4.3

classification 💻 cs.CV
keywords space-time self-similaritymulti-order self-similarityvideo action recognitiontemporal dynamicsmotion modelinglightweight neural modulevideo visual question answering
0
0 comments X

The pith

Integrating multi-order space-time self-similarities via a lightweight module improves motion modeling across video tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores how space-time self-similarity at different orders captures distinct aspects of temporal dynamics in videos. It introduces the Multi-Order Self-Similarity module to learn and combine these features in a neural network. This design targets better motion representation while keeping added computation and memory low. A reader would care because improved temporal modeling could make video systems more accurate for recognition, question answering, and control without heavy resource demands. Experiments across action recognition, video VQA, and robotic tasks show consistent gains.

Core claim

Space-time self-similarity at higher orders reveals distinct aspects of temporal dynamics. The Multi-Order Self-Similarity module is a lightweight neural component that learns and integrates multi-order STSS features to enhance motion modeling capabilities with only marginal computational cost and memory usage. Applied to diverse video tasks, it produces substantial improvements on action recognition, motion-centric video VQA, and real-world robotic tasks.

What carries the argument

The Multi-Order Self-Similarity (MOSS) module, a neural module that learns and integrates multi-order space-time self-similarity features for temporal dynamics.

Load-bearing premise

Higher-order space-time self-similarities supply distinct and complementary information on temporal dynamics that a lightweight integration module can combine effectively without meaningful overhead or loss of accuracy.

What would settle it

Inserting the MOSS module into standard video models and measuring no accuracy gains on action recognition or VQA benchmarks together with increased runtime or memory usage would show the approach does not deliver substantial improvements at marginal cost.

Figures

Figures reproduced from arXiv: 2604.20760 by Heeseung Kwon, Karteek Alahari, Manjin Kim, Minsu Cho.

Figure 1
Figure 1. Figure 1: High-order space-time self-similarities (STSS) for effective video [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: High-order STSS transformation & Multi-Order Self-Similarity [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: STSS map visualizations on a toy video clip. From top to bot￾tom, we visualize RGB frames and 1st- , 2nd-, and 3rd-order STSS maps of the brown query by setting STSS en￾coding function g as vectorization over (L, U, V ) dimensions. The STSS maps progressively capture different tempo￾ral dynamics: motion flow, motion seg￾ments, and overall motion layouts. Unlike the 2nd-order STSS that identifies individual… view at source ↗
Figure 5
Figure 5. Figure 5: STSS visualization. RGB frames at the top where two queries and their spatio￾temporal matching regions are marked in red and green respectively. The subsequent rows show STSS maps for the two queries and L2-norm of feature maps across 1st-, 2nd-, and 3rd-order. Best viewed in pdf. Comparison to Other STSS Learning Methods. In Tab. 3d, we delve into the effectiveness of high-order STSS by comparing differen… view at source ↗
Figure 6
Figure 6. Figure 6: VideoLLaMA3 with MOSS. MOSS is integrated with the vision en￾coder and provides early motion cues for advanced temporal reasoning in LLM. fine-grained motion-level reasoning in Video MLLMs, comprising 1,776 videos and 8,184 multiple-choice QA pairs across 6 motion-related tasks. We use 15K sam￾ples from the publicly released training set, FAVOR-Train, for fine-tuning. Motion￾Bench [24] is another recent be… view at source ↗
Figure 7
Figure 7. Figure 7: Proposed real-world robotic tasks (MoveSense & PongPredict) and [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Visualization of STSS tensors. (a) Input RGB frames, where two different queries and their spatio-temporal matching regions. (b) 1st- to 3rd-order STSS maps of the brown query. (c) 1st- to 3rd-order STSS maps of the yellow query. A Illustration of High-Order STSS We present a toy example with a simplified video clip to clarify the characteristics of high-order STSS in modeling temporal dynamics, as describ… view at source ↗
Figure 9
Figure 9. Figure 9: Real-robot platform. We specify the robot specifications and other environ￾ment settings. B Implementation Details In Tabs. 6 and 7, we provide detailed model configurations and training hyperpa￾rameters across different model scales and datasets. All models are trained using 8 NVIDIA RTX 6000 Ada GPUs. C Experimental Setup for Real-World Robotic Tasks In [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Effects of 2nd-order STSS on Something-Something V1. [PITH_FULL_IMAGE:figures/full_fig_p030_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: STSS visualization. RGB frames at the top show query locations and their spatio-temporal matching regions marked in red and green, respectively. The subsequent rows show STSS maps for the two queries and the L2-norm of feature maps from 1st￾to 3rd-order STSSs. Best viewed in PDF [PITH_FULL_IMAGE:figures/full_fig_p031_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: STSS visualization. RGB frames at the top show query locations and their spatio-temporal matching regions marked in red and green, respectively. The subsequent rows show STSS maps for the two queries and the L2-norm of feature maps from 1st￾to 3rd-order STSSs. Best viewed in PDF [PITH_FULL_IMAGE:figures/full_fig_p032_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: STSS visualization. RGB frames at the top show query locations and their spatio-temporal matching regions marked in red and green, respectively. The subsequent rows show STSS maps for the two queries and the L2-norm of feature maps from 1st￾to 3rd-order STSSs. Best viewed in PDF [PITH_FULL_IMAGE:figures/full_fig_p033_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: STSS visualization. RGB frames at the top show query locations and their spatio-temporal matching regions marked in red and green, respectively. The subsequent rows show STSS maps for the two queries and the L2-norm of feature maps from 1st￾to 3rd-order STSSs. Best viewed in PDF [PITH_FULL_IMAGE:figures/full_fig_p034_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: STSS visualization. RGB frames at the top show query locations and their spatio-temporal matching regions marked in red and green, respectively. The subsequent rows show STSS maps for the two queries and the L2-norm of feature maps from 1st￾to 3rd-order STSSs. Best viewed in PDF [PITH_FULL_IMAGE:figures/full_fig_p035_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Example rollouts of real-world robot tasks [PITH_FULL_IMAGE:figures/full_fig_p036_16.png] view at source ↗
read the original abstract

Space-time self-similarity (STSS), which captures visual correspondences across frames, provides an effective way to represent temporal dynamics for video understanding. In this work, we explore higher-order STSS and demonstrate how STSSs at different orders reveal distinct aspects of these dynamics. We then introduce the Multi-Order Self-Similarity (MOSS) module, a lightweight neural module designed to learn and integrate multi-order STSS features. It can be applied to diverse video tasks to enhance motion modeling capabilities while consuming only marginal computational cost and memory usage. Extensive experiments on video action recognition, motion-centric video VQA, and real-world robotic tasks consistently demonstrate substantial improvements, validating the broad applicability of MOSS as a general temporal modeling module. The source code and checkpoints will be publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper explores higher-order space-time self-similarity (STSS) for representing temporal dynamics in videos, arguing that STSS at different orders capture distinct aspects of motion. It introduces the Multi-Order Self-Similarity (MOSS) module, a lightweight neural module to learn and integrate these multi-order features. The module is presented as a general-purpose component that can be inserted into video architectures to enhance motion modeling at marginal computational and memory cost. Claims are supported by experiments on action recognition, motion-centric video VQA, and real-world robotic tasks showing consistent improvements, with code and checkpoints to be released.

Significance. If the empirical results hold, MOSS offers a practical, efficient temporal modeling primitive with broad applicability across video tasks. Its lightweight design and plug-and-play nature could see adoption in existing pipelines, particularly if gains are reproducible across datasets and architectures. The planned public release of code and checkpoints strengthens the contribution by enabling verification and extension.

minor comments (3)
  1. [Abstract] Abstract: the phrasing 'higher-order STSS' and 'multi-order STSS features' is used interchangeably without an explicit definition of the orders considered (e.g., first-order vs. second-order correspondences); a short clarifying sentence would aid readers.
  2. [§4 or §5] The manuscript states that MOSS consumes 'only marginal computational cost and memory usage'; providing a table or paragraph with exact FLOPs and parameter overhead relative to the backbone (e.g., in §4 or §5) would make this claim more precise and verifiable.
  3. [Experiments] Experiments section: while tables are referenced, ensuring that every reported improvement includes the corresponding baseline value, metric (e.g., top-1 accuracy, mAP), and dataset split would allow direct assessment of effect sizes.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work and the recommendation for minor revision. We appreciate the recognition that MOSS provides a practical, efficient temporal modeling primitive with broad applicability, and we value the note on the planned public release of code and checkpoints.

Circularity Check

0 steps flagged

No significant circularity; MOSS module is an independent architectural contribution

full rationale

The paper presents MOSS as a new lightweight neural module for learning and integrating multi-order space-time self-similarity features, with claims supported directly by its definition, integration details, and empirical results across video tasks. No derivation chain, equations, or predictions are shown that reduce by construction to fitted inputs or prior self-citations. The abstract and context describe an empirical validation approach without self-definitional loops, uniqueness theorems, or ansatz smuggling. This is a standard case of a self-contained neural architecture paper whose central claims rest on experimental tables rather than internal reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities detailed beyond the MOSS module itself as a new component.

invented entities (1)
  • Multi-Order Self-Similarity (MOSS) module no independent evidence
    purpose: Learn and integrate multi-order STSS features for video temporal modeling
    New neural module introduced to combine higher-order self-similarity features

pith-pipeline@v0.9.0 · 5432 in / 1126 out tokens · 46914 ms · 2026-05-10T00:55:43.219326+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RLDX-1 Technical Report

    cs.RO 2026-05 unverdicted novelty 4.0

    RLDX-1 achieves 86.8% success on complex ALLEX humanoid manipulation tasks where prior VLAs reach only around 40%.

  2. RLDX-1 Technical Report

    cs.RO 2026-05 unverdicted novelty 4.0

    RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.

Reference graph

Works this paper leans on

100 extracted references · 32 canonical work pages · cited by 1 Pith paper · 14 internal anchors

  1. [1]

    Vivit: A video vision transformer,

    Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C.: Vivit: A video vision transformer. arXiv preprint arXiv:2103.15691 (2021)

  2. [2]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Muckley, M., Rizvi, A., Roberts, C., Sinha, K., Zholus, A., et al.: V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985 (2025)

  3. [3]

    arXiv preprint arXiv:2312.00826 (2023)

    Bae, K., Ahn, G., Kim, Y., Choi, J.: Devias: Learning disentangled video repre- sentations of action and scene for holistic video understanding. arXiv preprint arXiv:2312.00826 (2023)

  4. [4]

    Revisiting Feature Prediction for Learning Visual Representations from Video

    Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., Ballas, N.: Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471 (2024)

  5. [5]

    Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for video understanding? arXiv preprint arXiv:2102.05095 (2021)

  6. [6]

    In: CVPR

    Bian, Z., Jabri, A., Efros, A.A., Owens, A.: Learning pixel trajectories with multi- scale contrastive random walks. In: CVPR. pp. 6508–6519 (2022)

  7. [7]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y., Fox, D., Hu, F., Huang, S., et al.: Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 (2025)

  8. [8]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.: pi0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164 (2024)

  9. [9]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9650–9660 (2021)

  10. [10]

    In: CVPR (2017)

    Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: CVPR (2017)

  11. [11]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024)

  12. [12]

    In: European Conference on Computer Vision

    Cheng, F., Bertasius, G.: Tallformer: Temporal action localization with a long- memory transformer. In: European Conference on Computer Vision. pp. 503–521. Springer (2022)

  13. [13]

    NeurIPS32(2019)

    Choi, J., Gao, C., Messou, J.C., Huang, J.B.: Why can’t i dance in the mall? learning to mitigate scene bias in action recognition. NeurIPS32(2019)

  14. [14]

    NeurIPS35, 39020–39033 (2022)

    Chung, J., Wu, Y., Russakovsky, O.: Enabling detailed action recognition evaluation through video dataset augmentation. NeurIPS35, 39020–39033 (2022)

  15. [15]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Dosovitskiy,A.,Beyer,L.,Kolesnikov,A.,Weissenborn,D.,Zhai,X.,Unterthiner,T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

  16. [16]

    In: ICCV (2015)

    Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van Der Smagt, P., Cremers, D., Brox, T.: Flownet: Learning optical flow with convolu- tional networks. In: ICCV (2015)

  17. [17]

    arXiv preprint arXiv:2104.11227 (2021) 16 Manjin Kim 1∗, Heeseung Kwon2∗, Karteek Alahari3, and Minsu Cho1

    Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., Feichtenhofer, C.: Multiscale vision transformers. arXiv preprint arXiv:2104.11227 (2021) 16 Manjin Kim 1∗, Heeseung Kwon2∗, Karteek Alahari3, and Minsu Cho1

  18. [18]

    In: CVPR (2020)

    Feichtenhofer, C.: X3d: Expanding architectures for efficient video recognition. In: CVPR (2020)

  19. [19]

    In: ICCV (2019)

    Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: ICCV (2019)

  20. [20]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y., Zhang, M., et al.: Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24108–24118 (2025)

  21. [21]

    something something

    Goyal, R., Kahou, S.E., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., et al.: The" something something" video database for learning and evaluating visual common sense. In: ICCV (2017)

  22. [22]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16000–16009 (2022)

  23. [23]

    In: CVPR

    Herzig, R., Ben-Avraham, E., Mangalam, K., Bar, A., Chechik, G., Rohrbach, A., Darrell, T., Globerson, A.: Object-region video transformers. In: CVPR. pp. 3148–3159 (2022)

  24. [24]

    In: CVPR

    Hong, W., Cheng, Y., Yang, Z., Wang, W., Wang, L., Gu, X., Huang, S., Dong, Y., Tang, J.: Motionbench: Benchmarking and improving fine-grained video motion understanding for vision language models. In: CVPR. pp. 8450–8460 (2025)

  25. [25]

    Qwen2.5-Coder Technical Report

    Hui, B., Yang, J., Cui, Z., Yang, J., Liu, D., Zhang, L., Liu, T., Zhang, J., Yu, B., Lu, K., et al.: Qwen2. 5-coder technical report. arXiv preprint arXiv:2409.12186 (2024)

  26. [26]

    in the wild

    Idrees, H., Zamir, A.R., Jiang, Y.G., Gorban, A., Laptev, I., Sukthankar, R., Shah, M.: The thumos challenge on action recognition for videos “in the wild”. Computer Vision and Image Understanding155, 1–23 (2017)

  27. [27]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Intelligence, P., Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., et al.: pi0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054 (2025)

  28. [28]

    arXiv preprint arXiv:2510.04246 (2025) 13

    Jang, H., Yu, S., Kwon, H., Jeon, H., Seo, Y., Shin, J.: Contextvla: Vision-language- action model with amortized multi-frame context. arXiv preprint arXiv:2510.04246 (2025)

  29. [29]

    IEEE TPAMI (2010)

    Junejo, I.N., Dexter, E., Laptev, I., Perez, P.: View-independent action recognition from temporal self-similarities. IEEE TPAMI (2010)

  30. [30]

    In: ECCV (2008)

    Junejo, I.N., Dexter, E., Laptev, I., PÚrez, P.: Cross-view action recognition from temporal self-similarities. In: ECCV (2008)

  31. [31]

    The Kinetics Human Action Video Dataset

    Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)

  32. [32]

    NeurIPS34, 8046–8059 (2021)

    Kim, M., Kwon, H., Wang, C., Kwak, S., Cho, M.: Relational self-attention: What’s missing in attention for video understanding. NeurIPS34, 8046–8059 (2021)

  33. [33]

    In: CVPR

    Kim, M., Seo, P.H., Schmid, C., Cho, M.: Learning correlation structures for vision transformers. In: CVPR. pp. 18941–18951 (2024)

  34. [34]

    arXiv preprint arXiv:2007.09933 (2020)

    Kwon, H., Kim, M., Kwak, S., Cho, M.: Motionsqueeze: Neural motion feature learning for video understanding. arXiv preprint arXiv:2007.09933 (2020)

  35. [35]

    arXiv preprint arXiv:2102.07092 (2021)

    Kwon, H., Kim, M., Kwak, S., Cho, M.: Learning self-similarity in space and time as generalized motion for action recognition. arXiv preprint arXiv:2102.07092 (2021)

  36. [36]

    arXiv:2208.01897 (2022)

    Leong, M.C., Zhang, H., Tan, H.L., Li, L., Lim, J.H.: Combined cnn trans- former encoder for enhanced fine-grained human action recognition. arXiv preprint arXiv:2208.01897 (2022) Exploring High-Order Self-Similarity for Video Understanding 17

  37. [37]

    arXiv preprint arXiv:2206.02985 (2022)

    Li, C., Wang, X., Hong, D., Wang, Y., Zhang, L., Luo, T., Wen, L.: Struc- tured context transformer for generic event boundary detection. arXiv preprint arXiv:2206.02985 (2022)

  38. [38]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Liu, Y., Wang, Z., Xu, J., Chen, G., Luo, P., et al.: Mvbench: A comprehensive multi-modal video understanding benchmark. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22195–22206 (2024)

  39. [39]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Wang, L., Qiao, Y.: Uniformerv2: Unlocking the potential of image vits for video understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1632–1643 (2023)

  40. [40]

    IEEE Transactions on Pattern Analysis and Machine Intelligence45(10), 12581–12600 (2023)

    Li, K., Wang, Y., Zhang, J., Gao, P., Song, G., Liu, Y., Li, H., Qiao, Y.: Uniformer: Unifying convolution and self-attention for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence45(10), 12581–12600 (2023)

  41. [41]

    Evaluating Real-World Robot Manipulation Policies in Simulation

    Li, X., Hsu, K., Gu, J., Pertsch, K., Mees, O., Walke, H.R., Fu, C., Lunawat, I., Sieh, I., Kirmani, S., et al.: Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941 (2024)

  42. [42]

    In: CVPR

    Li, Y., Wu, C.Y., Fan, H., Mangalam, K., Xiong, B., Malik, J., Feichtenhofer, C.: Mvitv2: Improved multiscale vision transformers for classification and detection. In: CVPR. pp. 4804–4814 (2022)

  43. [43]

    In: ECCV (2018)

    Li, Y., Li, Y., Vasconcelos, N.: Resound: Towards action recognition without representation bias. In: ECCV (2018)

  44. [44]

    In: Proceedings of the 2024 conference on empirical methods in natural language processing

    Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-llava: Learning united visual representation by alignment before projection. In: Proceedings of the 2024 conference on empirical methods in natural language processing. pp. 5971–5984 (2024)

  45. [45]

    In: ICCV (2019)

    Lin, J., Gan, C., Han, S.: Tsm: Temporal shift module for efficient video under- standing. In: ICCV (2019)

  46. [46]

    In: ECCV

    Lin, Z., Geng, S., Zhang, R., Gao, P., De Melo, G., Wang, X., Dai, J., Qiao, Y., Li, H.: Frozen clip models are efficient video learners. In: ECCV. pp. 388–404. Springer (2022)

  47. [47]

    Advances in Neural Information Processing Systems36, 44776–44791 (2023)

    Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., Stone, P.: Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems36, 44776–44791 (2023)

  48. [48]

    Liu, H., Li, X., Li, P., Liu, M., Wang, D., Liu, J., Kang, B., Ma, X., Kong, T., Zhang, H.: Towards generalist robot policies: What matters in building vision- language-action models (2025)

  49. [49]

    arXiv preprint arXiv:2408.06158 (2024)

    Liu, M., Li, B., Yu, Y.: Omniclip: Adapting clip for video recognition with spatial- temporal omni-scale feature learning. arXiv preprint arXiv:2408.06158 (2024)

  50. [50]

    In: CVPR

    Liu, R., Li, C., Ge, Y., Li, T.H., Shan, Y., Li, G.: Bt-adapter: Video conversation is feasible without video instruction tuning. In: CVPR. pp. 13658–13667 (2024)

  51. [51]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Liu, S., Zhang, C.L., Zhao, C., Ghanem, B.: End-to-end temporal action detection with 1b parameters across 1000 frames. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18591–18601 (2024)

  52. [52]

    In: CVPR

    Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., Hu, H.: Video swin transformer. In: CVPR. pp. 3202–3211 (2022)

  53. [53]

    In: Proceedings of the 30th ACM International Conference on Multimedia

    Ma, Y., Xu, G., Sun, X., Yan, M., Zhang, J., Ji, R.: X-clip: End-to-end multi- grained contrastive learning for video-text retrieval. In: Proceedings of the 30th ACM International Conference on Multimedia. pp. 638–647 (2022) 18 Manjin Kim 1∗, Heeseung Kwon2∗, Karteek Alahari3, and Minsu Cho1

  54. [54]

    On the effectiveness of task granularity for transfer learning

    Mahdisoltani, F., Berger, G., Gharbieh, W., Fleet, D., Memisevic, R.: On the effec- tiveness of task granularity for transfer learning. arXiv preprint arXiv:1804.09235 (2018)

  55. [55]

    RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

    Nasiriany, S., Maddukuri, A., Zhang, L., Parikh, A., Lo, A., Joshi, A., Mandlekar, A., Zhu, Y.: Robocasa: Large-scale simulation of everyday tasks for generalist robots. arXiv preprint arXiv:2406.02523 (2024)

  56. [56]

    In: Proc

    Ng, J.Y.H., Choi, J., Neumann, J., Davis, L.S.: Actionflownet: Learning motion representation for action recognition. In: Proc. Winter Conference on Applications of Computer Vision (WACV) (2018)

  57. [57]

    Advances in Neural Information Processing Systems37, 81808–81835 (2024)

    Nie, M., Ding, D., Wang, C., Guo, Y., Han, J., Xu, H., Zhang, L.: Slowfocus: Enhancing fine-grained temporal understanding in video llm. Advances in Neural Information Processing Systems37, 81808–81835 (2024)

  58. [58]

    NeurIPS35, 26462–26477 (2022)

    Pan, J., Lin, Z., Zhu, X., Shao, J., Li, H.: St-adapter: Parameter-efficient image-to- video transfer learning. NeurIPS35, 26462–26477 (2022)

  59. [59]

    In: CVPR

    Park, J., Lee, J., Sohn, K.: Dual-path adaptation from image to video transformers. In: CVPR. pp. 2203–2213 (2023)

  60. [60]

    In: ECCV

    Qian, R., Ding, S., Lin, D.: Rethinking image-to-video adaptation: An object-centric perspective. In: ECCV. pp. 329–348. Springer (2025)

  61. [61]

    In: ICCV

    Qing, Z., Zhang, S., Huang, Z., Zhang, Y., Gao, C., Zhao, D., Sang, N.: Disentangling spatial and temporal learning for efficient image-to-video transfer learning. In: ICCV. pp. 13934–13944 (2023)

  62. [62]

    In: ICML

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML. pp. 8748–8763. PMLR (2021)

  63. [63]

    arXiv preprint arXiv:2510.26027 (2025)

    Rasekh,A.,Soula,E.B.,Daliran,O.,Gottschalk,S.,Fayyaz,M.:Enhancingtemporal understanding in video-llms through stacked temporal attention in vision encoders. arXiv preprint arXiv:2510.26027 (2025)

  64. [64]

    In: CVPR (2020)

    Shao, D., Zhao, Y., Dai, B., Lin, D.: Finegym: A hierarchical video dataset for fine-grained action understanding. In: CVPR (2020)

  65. [65]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Shao, D., Zhao, Y., Dai, B., Lin, D.: Intra-and inter-action understanding via temporal action parsing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 730–739 (2020)

  66. [66]

    In: CVPR

    Shechtman, E., Irani, M.: Space-time behavior based correlation. In: CVPR. vol. 1, pp. 405–412. IEEE (2005)

  67. [67]

    In: CVPR (2007)

    Shechtman, E., Irani, M.: Matching local self-similarities across images and videos. In: CVPR (2007)

  68. [68]

    arXiv preprint arXiv:2309.05590 (2023)

    Shi, D., Cao, Q., Zhong, Y., An, S., Cheng, J., Zhu, H., Tao, D.: Temporal action localization with enhanced instant discriminability. arXiv preprint arXiv:2309.05590 (2023)

  69. [69]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Shou, M.Z., Lei, S.W., Wang, W., Ghadiyaram, D., Feiszli, M.: Generic event boundary detection: A benchmark for event segmentation. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 8075–8084 (2021)

  70. [70]

    In: NeurIPS (2014)

    Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recog- nition in videos. In: NeurIPS (2014)

  71. [71]

    In: CVPR

    Son, J.: Contrastive learning for space-time correspondence via self-cycle consistency. In: CVPR. pp. 14679–14688 (2022)

  72. [72]

    In: CVPR (2018)

    Sun, D., Yang, X., Liu, M.Y., Kautz, J.: Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In: CVPR (2018)

  73. [73]

    EVA-CLIP: Improved Training Techniques for CLIP at Scale

    Sun, Q., Fang, Y., Wu, L., Wang, X., Cao, Y.: Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389 (2023) Exploring High-Order Self-Similarity for Video Understanding 19

  74. [74]

    EVA-CLIP- 18B: Scaling clip to 18 billion parameters.arXiv:2402.04252, 2024

    Sun, Q., Wang, J., Yu, Q., Cui, Y., Zhang, F., Zhang, X., Wang, X.: Eva-clip-18b: Scaling clip to 18 billion parameters. arXiv preprint arXiv:2402.04252 (2024)

  75. [75]

    IEEE Transactions on Pattern Analysis and Machine Intelligence45(10), 12506–12520 (2023)

    Tan, J., Wang, Y., Wu, G., Wang, L.: Temporal perceiver: A general architecture for arbitrary boundary detection. IEEE Transactions on Pattern Analysis and Machine Intelligence45(10), 12506–12520 (2023)

  76. [76]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Tang, J., Liu, Z., Qian, C., Wu, W., Wang, L.: Progressive attention on multi-level dense difference maps for generic event boundary detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3355–3364 (2022)

  77. [77]

    Octo: An Open-Source Generalist Robot Policy

    Team, O.M., Ghosh, D., Walke, H., Pertsch, K., Black, K., Mees, O., Dasari, S., Hejna, J., Kreiman, T., Xu, C., et al.: Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213 (2024)

  78. [78]

    In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16

    Teed, Z., Deng, J.: Raft: Recurrent all-pairs field transforms for optical flow. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16. pp. 402–419. Springer (2020)

  79. [79]

    In: ICCV (2015)

    Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: ICCV (2015)

  80. [80]

    In: ICCV (2019)

    Tran, D., Wang, H., Torresani, L., Feiszli, M.: Video classification with channel- separated convolutional networks. In: ICCV (2019)

Showing first 80 references.