
arxiv: 2605.17165 · v1 · pith:Y6XLWCIDnew · submitted 2026-05-16 · 💻 cs.CV · cs.LG

Factorized Latent Dynamics for Video JEPA: An Empirical Study of Auxiliary Objectives

Pith reviewed 2026-05-20 14:34 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords self-supervised video learning · Video JEPA · auxiliary objectives · latent factorization · hard-region weighting · representation learning · empirical study

The pith

Factorizing Video-JEPA latents into appearance and dynamics subspaces improves performance on appearance and temporal tasks with minimal impact on motion tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines how different auxiliary objectives affect Video-JEPA training on small-scale video data. It finds that many objectives create trade-offs where improving one capability hurts another. The authors then introduce and test a factorized approach that splits the latent space and weights difficult regions, showing specific gains in a mixed-dataset setup. A reader should care because this points to a way to make self-supervised video learning more efficient and balanced without needing massive compute. The results suggest latent factorization as a promising tool for handling conflicting objectives in representation learning.

Core claim

The paper claims that FWM-HW-LD, which separates the latent representation into appearance and dynamics subspaces and applies hard-region weighting to both JEPA prediction errors and latent dynamics errors, improves ImageNet-100 by 5.92 percentage points and Something-Something V2 by 3.21 percentage points over the reference baseline in the mixed-dataset pretraining regime, while staying within 0.30 percentage points of that baseline on Diving-48.

What carries the argument

Factorized World-Model with Hard-Region-Weighted Latent Dynamics (FWM-HW-LD), which splits the latent space into separate appearance and dynamics components and weights errors from hard regions during training.
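The two ingredients named here, a split of the latent into appearance and dynamics parts and a loss that emphasizes hard regions, can be illustrated with a minimal NumPy sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the channel-wise split point `app_dim`, the top-fraction weighting rule, the `hard_weight` value, and the use of absolute error. The function names are hypothetical.

```python
import numpy as np


def split_latent(z, app_dim):
    """Split token embeddings (batch, tokens, dim) along the channel axis.

    Assumption: the first `app_dim` channels form the appearance subspace
    and the remaining channels form the dynamics subspace.
    """
    return z[..., :app_dim], z[..., app_dim:]


def hard_region_weights(per_token_err, top_frac=0.25, hard_weight=2.0):
    """Give the hardest `top_frac` of tokens a larger weight (assumed rule).

    Weights are rescaled to average 1 per sample so the loss scale is stable.
    """
    n_tokens = per_token_err.shape[-1]
    k = max(1, int(round(top_frac * n_tokens)))
    # The k-th largest error in each sample serves as the "hard" threshold.
    thresh = np.sort(per_token_err, axis=-1)[..., -k][..., None]
    w = np.where(per_token_err >= thresh, hard_weight, 1.0)
    return w / w.mean(axis=-1, keepdims=True)


def hard_weighted_error(pred, target, **kwargs):
    """Mean absolute error per token, re-weighted toward hard tokens."""
    err = np.abs(pred - target).mean(axis=-1)  # shape (batch, tokens)
    w = hard_region_weights(err, **kwargs)     # weights treated as constants
    return float((w * err).mean())


# Toy usage with random stand-ins for predictor outputs and targets.
rng = np.random.default_rng(0)
z_pred = rng.normal(size=(2, 8, 16))
z_tgt = rng.normal(size=(2, 8, 16))
_, dyn_pred = split_latent(z_pred, app_dim=10)
_, dyn_tgt = split_latent(z_tgt, app_dim=10)
loss_hw_jepa = hard_weighted_error(z_pred, z_tgt)   # weighted JEPA error
loss_ld_hw = hard_weighted_error(dyn_pred, dyn_tgt) # weighted dynamics error
```

Because the weights average 1 per sample and grow with the error, the weighted loss is never smaller than the plain mean error; how the paper actually defines hard regions and the latent dynamics target may differ from this sketch.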

If this is right

  • Many auxiliary objectives in Video-JEPA exhibit capacity trade-offs across different downstream tasks.
  • Latent factorization into appearance and dynamics subspaces helps mitigate these trade-offs.
  • Hard-region weighting applied to both prediction and dynamics errors improves results in the mixed-dataset setting.
  • The approach stays within 0.30 percentage points of the baseline on the fine-grained motion benchmark (Diving-48) while gaining on the appearance (ImageNet-100) and temporal-reasoning (SSv2) benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar factorization techniques might help in other self-supervised learning frameworks beyond JEPA.
  • Testing this method on larger datasets could reveal if the benefits scale up.
  • Combining this with other objectives might further reduce the observed trade-offs.

Load-bearing premise

The performance gains are specifically due to the latent factorization and hard-region weighting rather than differences in training procedures or random variations.

What would settle it

Rerunning the experiments with identical training settings but without the factorization and hard-region weighting components would settle it: if the gains disappear, the claim is supported; if similar gains persist, the claim is falsified.
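A controlled comparison of this kind amounts to a grid of runs that share every setting and differ only in which components are switched on. The sketch below enumerates such a grid with the standard library. The flag names `fwm`, `hw`, and `ld`, the function name, and the shared settings are hypothetical placeholders, and the paper's own ablation may cover fewer combinations than the full grid.

```python
from itertools import product

# Hypothetical flags: factorization, hard-region weighting, latent dynamics.
COMPONENTS = ("fwm", "hw", "ld")


def ablation_grid(shared):
    """Return (name, config) pairs, one per on/off combination.

    Every config copies the same shared settings, so runs differ only
    in the component flags.
    """
    runs = []
    for flags in product((False, True), repeat=len(COMPONENTS)):
        cfg = dict(shared)
        cfg.update(zip(COMPONENTS, flags))
        enabled = [c.upper() for c, on in zip(COMPONENTS, flags) if on]
        runs.append(("+".join(enabled) or "baseline", cfg))
    return runs


# Placeholder shared settings, not values from the paper.
shared_settings = {"seed": 0, "epochs": 100}
for name, cfg in ablation_grid(shared_settings):
    print(name, cfg)
```

Keeping the shared dictionary in one place makes it harder for an unintended difference, such as a changed seed or schedule, to slip into a single run.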

Figures

Figures reproduced from arXiv: 2605.17165 by Santosh Premi.

Figure 1: FWM-HW-LD training objective overview. The standard V-JEPA latent prediction path is retained. During training, the student encoder output is additionally partitioned into appearance (Z_app) and dynamics (Z_dyn) subspaces. Hard-region weighting is applied to the JEPA prediction error (L_HW-JEPA) and, when latent dynamics is enabled, to the latent dynamics error (L_LD-HW). Additional FWM losses encourage tempor…
Figure 2: Capacity trade-offs across mixed-dataset methods. Grouped bars show percentage-point ∆ accuracy relative to baseline on three benchmarks. FWM-HW-LD (leftmost, highlighted) improves appearance (ImageNet-100, green) and temporal reasoning (SSv2, red) while staying close to the Diving-48 baseline. Pixel-prediction methods (AC-JEPA, FAC-JEPA) perform poorly on ImageNet-100 in this setting.
Figure 3: Ablation of FWM-HW-LD components. Dashed lines indicate the baseline. The tested components interact non-additively: LD alone boosts SSv2 but hurts others; FWM alone boosts ImageNet-100 modestly; FWM+LD without hard weighting performs poorly on ImageNet-100. The full FWM+HW+LD combination (green bars) gives the most balanced result in this single-seed ablation.
Original abstract

Joint-Embedding Predictive Architectures (JEPA) are a promising framework for self-supervised video representation learning, yet the behavior of auxiliary objectives in small-scale Video-JEPA training is not well characterized. We report a small-scale empirical study of 18 auxiliary objective variants for Video-JEPA across two pretraining regimes: single-dataset (UCF-101) and mixed-dataset (UCF-101 + Something-Something V2 + ImageNet-100). We evaluate frozen representations on three complementary benchmarks: Diving-48 (fine-grained motion), Something-Something V2 (temporal reasoning), and ImageNet-100 (appearance). Our experiments suggest that many auxiliary objectives exhibit capacity trade-offs: gains on one downstream capability often coincide with degradation on another. We then study FWM-HW-LD (Factorized World-Model with Hard-Region-Weighted Latent Dynamics), a training-time objective that separates the latent representation into appearance and dynamics subspaces and applies hard-region weighting to both JEPA prediction errors and latent dynamics errors. In our mixed-dataset setting, FWM-HW-LD improves ImageNet-100 by +5.92 and SSv2 by +3.21 percentage points relative to the reference baseline, while remaining within 0.30 percentage points on Diving-48. These results indicate that latent factorization is a useful direction for studying auxiliary-objective trade-offs in Video-JEPA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper reports a small-scale empirical study of 18 auxiliary objective variants for Video-JEPA under single-dataset (UCF-101) and mixed-dataset (UCF-101 + SSv2 + ImageNet-100) pretraining. It introduces FWM-HW-LD, which factorizes latents into appearance and dynamics subspaces and applies hard-region weighting to both JEPA prediction and latent dynamics losses. In the mixed-dataset regime, FWM-HW-LD is reported to improve ImageNet-100 by +5.92 pp and SSv2 by +3.21 pp while staying within 0.30 pp on Diving-48 relative to a reference baseline, suggesting latent factorization helps manage capacity trade-offs across appearance, temporal reasoning, and fine-grained motion tasks.

Significance. If the reported gains are robust, the work provides a useful empirical reference for auxiliary-objective design in Video-JEPA, particularly the value of explicit appearance/dynamics factorization plus hard-region weighting for balancing downstream capabilities. The systematic comparison of 18 variants on complementary benchmarks (Diving-48, SSv2, ImageNet-100) could help future studies avoid unintended trade-offs in small-scale video self-supervision.

major comments (2)
  1. [Abstract] Abstract: The central empirical claim states precise improvements (+5.92 pp on ImageNet-100, +3.21 pp on SSv2) as single-point estimates. No error bars, standard deviations across runs, number of random seeds, or statistical significance tests are mentioned, leaving open the possibility that the deltas arise from training stochasticity rather than the latent factorization and hard-region weighting.
  2. [Abstract / Results] The attribution of gains specifically to FWM-HW-LD (latent factorization plus hard-region weighting of JEPA and dynamics losses) requires that all other training factors (initialization, data ordering, optimizer state) are identical to the reference baseline. The manuscript provides no description of such controls or ablation isolating these components from other unstated procedural differences.
minor comments (2)
  1. The abstract introduces the acronym FWM-HW-LD with its expansion, but subsequent sections should consistently use the full name on first mention in the main text for clarity.
  2. Consider adding a table summarizing all 18 auxiliary objectives with their key design choices (factorization, weighting, etc.) to make the experimental design easier to follow.
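The first major comment asks for variability across seeds. A minimal sketch of that kind of reporting, using only the Python standard library, is shown below. The accuracy lists are made-up placeholders and are not results from the paper; the function names are hypothetical, and Welch's t statistic is one reasonable choice of test, not one the manuscript is known to use.

```python
import math
import statistics


def summarize(accs):
    """Return (mean, sample standard deviation) over per-seed accuracies."""
    return statistics.mean(accs), statistics.stdev(accs)


def welch_t(baseline, variant):
    """Welch's t statistic for the difference variant minus baseline."""
    mean_b, mean_v = statistics.mean(baseline), statistics.mean(variant)
    var_b, var_v = statistics.variance(baseline), statistics.variance(variant)
    std_err = math.sqrt(var_b / len(baseline) + var_v / len(variant))
    return (mean_v - mean_b) / std_err


# Placeholder per-seed accuracies (percent); NOT numbers from the paper.
baseline_runs = [50.0, 51.0, 49.0]
variant_runs = [55.0, 56.0, 57.0]

mean_b, std_b = summarize(baseline_runs)
mean_v, std_v = summarize(variant_runs)
print(f"baseline {mean_b:.2f} +/- {std_b:.2f}")
print(f"variant  {mean_v:.2f} +/- {std_v:.2f}")
print(f"delta {mean_v - mean_b:+.2f} pp, Welch t = {welch_t(baseline_runs, variant_runs):.2f}")
```

With only three seeds per method the test has little power, so the standard deviations next to each delta are likely to be more informative than a p-value.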

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our empirical study. We address the two major comments below regarding statistical reporting and experimental controls. Revisions will be incorporated to improve clarity and robustness without altering the core findings.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central empirical claim states precise improvements (+5.92 pp on ImageNet-100, +3.21 pp on SSv2) as single-point estimates. No error bars, standard deviations across runs, number of random seeds, or statistical significance tests are mentioned, leaving open the possibility that the deltas arise from training stochasticity rather than the latent factorization and hard-region weighting.

    Authors: We acknowledge that the reported deltas are presented as single-point estimates without accompanying measures of variability. Our study systematically compared 18 auxiliary objective variants under fixed computational budgets, which limited repeated runs. To strengthen the claims, the revised manuscript will include results averaged over three random seeds for the baseline and FWM-HW-LD variants, along with standard deviations and error bars on the key tables and in the abstract.
    Revision: yes

  2. Referee: [Abstract / Results] The attribution of gains specifically to FWM-HW-LD (latent factorization plus hard-region weighting of JEPA and dynamics losses) requires that all other training factors (initialization, data ordering, optimizer state) are identical to the reference baseline. The manuscript provides no description of such controls or ablation isolating these components from other unstated procedural differences.

    Authors: All variants, including the reference baseline and FWM-HW-LD, were trained using an identical codebase, the same data loading and preprocessing pipeline, fixed hyperparameters, and consistent initialization procedures. Only the auxiliary objective formulation differs. We agree that explicit documentation of these controls is necessary for reproducibility. The revision will add a dedicated paragraph in the experimental setup section confirming that training factors were held constant and describing the seed handling and optimizer state management.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical comparisons with no derivation chain


The manuscript is an empirical study that trains Video-JEPA variants with 18 auxiliary-objective configurations and reports downstream accuracies on three benchmarks. No mathematical derivation, uniqueness theorem, or first-principles prediction is claimed; the central results are direct experimental deltas (e.g., +5.92 pp on ImageNet-100) obtained by comparing each variant against a fixed reference baseline under the same training regime. Because there is no load-bearing equation or fitted quantity that reduces to itself by construction, and no self-citation is invoked to justify a theoretical step, the work is self-contained against external benchmarks and exhibits no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the empirical effectiveness of the introduced objective; no explicit free parameters, mathematical axioms, or new physical entities are introduced beyond standard deep-learning training assumptions.

invented entities (1)
  • FWM-HW-LD objective (no independent evidence)
    purpose: Separates the latent representation into appearance and dynamics subspaces and applies hard-region weighting to prediction and dynamics errors.
    Presented as a new training-time objective whose benefits are demonstrated experimentally.



A. Pseudocode for Key Methods

Algorithm 1 summarizes the standard V-JEPA training loop. Algorithm 2 describes FWM-HW-LD, which extends the baseline with hard-weighted prediction, factorization losses, and ...

... so that the same data pipeline handles all three sources. The 60% SSv2 weight reflects our explicit goal of strengthening temporal reasoning.

Mask sampling. The default V-JEPA mask is a multi-block spatiotemporal tube. For Motion-Guided Masking, we compute per-frame motion energy as the mean L1 difference between consecutive frames at the patch level,...
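The motion-energy step described for Motion-Guided Masking, the mean L1 difference between consecutive frames at the patch level, can be sketched in NumPy as follows. The array layout `(T, H, W, C)`, the patch size, the cropping of leftover border pixels, and the function name are assumptions made for illustration. How the energy map is then turned into a mask is cut off in the excerpt, so that part is not shown.

```python
import numpy as np


def patch_motion_energy(video, patch=16):
    """Mean absolute frame difference per spatial patch.

    video: float array of shape (T, H, W, C) (assumed layout).
    Returns an array of shape (T - 1, H // patch, W // patch).
    """
    t, h, w, c = video.shape
    grid_h, grid_w = h // patch, w // patch
    # L1 difference between each frame and the previous one.
    diff = np.abs(video[1:] - video[:-1])
    # Drop border pixels that do not fill a whole patch.
    diff = diff[:, : grid_h * patch, : grid_w * patch]
    # Group pixels into patches, then average inside each patch.
    diff = diff.reshape(t - 1, grid_h, patch, grid_w, patch, c)
    return diff.mean(axis=(2, 4, 5))


# Toy clip: one bright 2x2 block appears in frame 1 and vanishes in frame 2.
clip = np.zeros((3, 4, 4, 1))
clip[1, :2, :2] = 1.0
energy = patch_motion_energy(clip, patch=2)
```

In this toy clip only the top-left patch changes between frames, so it is the only patch with non-zero energy in both frame transitions.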