Factorized Latent Dynamics for Video JEPA: An Empirical Study of Auxiliary Objectives

Santosh Premi

arxiv: 2605.17165 · v1 · pith:Y6XLWCIDnew · submitted 2026-05-16 · 💻 cs.CV · cs.LG

Pith reviewed 2026-05-20 14:34 UTC · model grok-4.3

keywords: self-supervised video learning · Video JEPA · auxiliary objectives · latent factorization · hard-region weighting · representation learning · empirical study

The pith

Factorizing Video-JEPA latents into appearance and dynamics subspaces improves performance on appearance and temporal tasks with minimal impact on motion tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines how different auxiliary objectives affect Video-JEPA training on small-scale video data. It finds that many objectives create trade-offs in which improving one capability hurts another. The author then introduces and tests a factorized approach that splits the latent space and weights difficult regions, showing specific gains in a mixed-dataset setup. A reader should care because this points to a way to make self-supervised video learning more efficient and balanced without needing massive compute. The results suggest latent factorization as a promising tool for handling conflicting objectives in representation learning.

Core claim

The paper claims that FWM-HW-LD, which separates the latent representation into appearance and dynamics subspaces and applies hard-region weighting to both JEPA prediction errors and latent dynamics errors, leads to improvements of 5.92 percentage points on ImageNet-100 and 3.21 on Something-Something V2 in the mixed-dataset pretraining regime, while staying within 0.30 points on Diving-48 compared to the baseline.

What carries the argument

Factorized World-Model with Hard-Region-Weighted Latent Dynamics (FWM-HW-LD), which splits the latent space into separate appearance and dynamics components and weights errors from hard regions during training.
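
As a rough illustration of the two ingredients named above, the following NumPy sketch partitions a latent tensor along its feature axis and reweights a per-token prediction error so that high-error ("hard") tokens count more. The split point `app_dim`, the softmax weighting with `temperature`, and all function names are assumptions made for illustration; the paper's actual weighting rule is not given on this page.

```python
import numpy as np


def split_latent(z, app_dim):
    """Partition the feature axis into appearance and dynamics parts (assumed layout)."""
    return z[..., :app_dim], z[..., app_dim:]


def hard_region_weights(per_token_error, temperature=1.0):
    """Softmax over the token axis: tokens with larger error get larger weight.

    The weights sum to 1 for each sample. This rule is a hypothetical choice.
    """
    scaled = per_token_error / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scaled)
    return w / w.sum(axis=-1, keepdims=True)


def hard_weighted_loss(pred, target, temperature=1.0):
    """Weighted L1 prediction error; pred and target are (batch, tokens, dim)."""
    err = np.abs(pred - target).mean(axis=-1)      # (batch, tokens)
    w = hard_region_weights(err, temperature)      # (batch, tokens)
    return float((w * err).sum(axis=-1).mean())


rng = np.random.default_rng(0)
pred = rng.normal(size=(2, 8, 16))
target = rng.normal(size=(2, 8, 16))

z_app, z_dyn = split_latent(pred, app_dim=10)      # shapes (2, 8, 10) and (2, 8, 6)
uniform = float(np.abs(pred - target).mean())      # plain mean error, for comparison
weighted = hard_weighted_loss(pred, target)
print(z_app.shape, z_dyn.shape, uniform, weighted)
```

Because the softmax weights grow with the error, the weighted loss is never smaller than the plain mean error, which is the sense in which hard regions are emphasized. In an autodiff framework the weights would typically be excluded from the gradient; that choice is likewise an assumption here.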

If this is right

Many auxiliary objectives in Video-JEPA exhibit capacity trade-offs across different downstream tasks.
Latent factorization into appearance and dynamics subspaces helps mitigate these trade-offs.
Hard-region weighting applied to both prediction and dynamics errors boosts performance in the mixed-dataset setting.
The approach stays within 0.30 points of the baseline on the fine-grained motion benchmark (Diving-48) while gaining on the appearance and temporal-reasoning benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar factorization techniques might help in other self-supervised learning frameworks beyond JEPA.
Testing this method on larger datasets could reveal if the benefits scale up.
Combining this with other objectives might further reduce the observed trade-offs.

Load-bearing premise

The performance gains are specifically due to the latent factorization and hard-region weighting rather than differences in training procedures or random variations.

What would settle it

Rerunning the experiments with identical training settings but with the factorization and hard-region weighting removed would settle it: if the gains disappear, the claim is supported; if comparable gains persist without those components, the claim is undercut.

Figures

Figures reproduced from arXiv: 2605.17165 by Santosh Premi.

**Figure 1.** FWM-HW-LD training objective overview. The standard V-JEPA latent prediction path is retained. During training, the student encoder output is additionally partitioned into appearance ($Z_{\text{app}}$) and dynamics ($Z_{\text{dyn}}$) subspaces. Hard-region weighting is applied to the JEPA prediction error ($L_{\text{HW-JEPA}}$) and, when latent dynamics is enabled, to the latent dynamics error ($L_{\text{LD-HW}}$). Additional FWM losses encourage tempor…

**Figure 2.** Capacity trade-offs across mixed-dataset methods. Grouped bars show percentage-point Δ accuracy relative to baseline on three benchmarks. FWM-HW-LD (leftmost, highlighted) improves appearance (ImageNet-100, green) and temporal reasoning (SSv2, red) while staying close to the Diving-48 baseline. Pixel-prediction methods (AC-JEPA, FAC-JEPA) perform poorly on ImageNet-100 in this setting.

**Figure 3.** Ablation of FWM-HW-LD components. Dashed lines indicate the baseline. The tested components interact non-additively: LD alone boosts SSv2 but hurts others; FWM alone boosts ImageNet-100 modestly; FWM+LD without hard weighting performs poorly on ImageNet-100. The full FWM+HW+LD combination (green bars) gives the most balanced result in this single-seed ablation.

Original abstract

Joint-Embedding Predictive Architectures (JEPA) are a promising framework for self-supervised video representation learning, yet the behavior of auxiliary objectives in small-scale Video-JEPA training is not well characterized. We report a small-scale empirical study of 18 auxiliary objective variants for Video-JEPA across two pretraining regimes: single-dataset (UCF-101) and mixed-dataset (UCF-101 + Something-Something V2 + ImageNet-100). We evaluate frozen representations on three complementary benchmarks: Diving-48 (fine-grained motion), Something-Something V2 (temporal reasoning), and ImageNet-100 (appearance). Our experiments suggest that many auxiliary objectives exhibit capacity trade-offs: gains on one downstream capability often coincide with degradation on another. We then study FWM-HW-LD (Factorized World-Model with Hard-Region-Weighted Latent Dynamics), a training-time objective that separates the latent representation into appearance and dynamics subspaces and applies hard-region weighting to both JEPA prediction errors and latent dynamics errors. In our mixed-dataset setting, FWM-HW-LD improves ImageNet-100 by +5.92 and SSv2 by +3.21 percentage points relative to the reference baseline, while remaining within 0.30 percentage points on Diving-48. These results indicate that latent factorization is a useful direction for studying auxiliary-objective trade-offs in Video-JEPA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

A solid small-scale empirical sweep of 18 auxiliary objectives for Video JEPA that flags a factorization-plus-weighting variant as one way to ease capacity trade-offs, though the reported gains rest on single-point estimates.

The paper's core finding is that splitting the latent space into appearance and dynamics factors, then applying hard-region weighting to both the JEPA prediction loss and the dynamics loss, lifts frozen performance on ImageNet-100 by roughly 6 points and on Something-Something V2 by roughly 3 points in the mixed-dataset pretraining runs, while staying within 0.3 points on Diving-48. That combination (FWM-HW-LD) comes out of a broader sweep of 18 auxiliary variants run under both single-dataset and mixed-dataset pretraining.

The work maps out the capacity trade-offs directly: many auxiliaries improve one downstream axis at the expense of another. The systematic comparison itself is the main new piece; prior JEPA papers did not lay out this many variants side by side on these exact benchmarks. The author keeps the scope modest and the evaluation focused on frozen linear probes, which makes the results easy to interpret. The writing is clear about what was measured and where the gains appear.

The main limitation is that all deltas are single-run point estimates with no error bars, no seed averages, and no mention of hyperparameter search or training controls. Without those, it is hard to rule out that the observed lifts come from stochastic training differences rather than the factorization and weighting. The mixed-dataset regime also introduces extra variables that are not isolated in the reported numbers.

This paper is useful for groups already running Video JEPA or similar predictive architectures who want concrete auxiliary tweaks to test on modest compute. It gives a practical starting point for further ablations. I would send it to peer review: the empirical map is worth checking with proper variance reporting and a few more controls, and the factorization direction is worth a closer look even if the current numbers need tightening.

Referee Report

2 major / 2 minor

Summary. The paper reports a small-scale empirical study of 18 auxiliary objective variants for Video-JEPA under single-dataset (UCF-101) and mixed-dataset (UCF-101 + SSv2 + ImageNet-100) pretraining. It introduces FWM-HW-LD, which factorizes latents into appearance and dynamics subspaces and applies hard-region weighting to both JEPA prediction and latent dynamics losses. In the mixed-dataset regime, FWM-HW-LD is reported to improve ImageNet-100 by +5.92 pp and SSv2 by +3.21 pp while staying within 0.30 pp on Diving-48 relative to a reference baseline, suggesting latent factorization helps manage capacity trade-offs across appearance, temporal reasoning, and fine-grained motion tasks.

Significance. If the reported gains are robust, the work provides a useful empirical reference for auxiliary-objective design in Video-JEPA, particularly the value of explicit appearance/dynamics factorization plus hard-region weighting for balancing downstream capabilities. The systematic comparison of 18 variants on complementary benchmarks (Diving-48, SSv2, ImageNet-100) could help future studies avoid unintended trade-offs in small-scale video self-supervision.

major comments (2)

[Abstract] The central empirical claim states precise improvements (+5.92 pp on ImageNet-100, +3.21 pp on SSv2) as single-point estimates. No error bars, standard deviations across runs, number of random seeds, or statistical significance tests are mentioned, leaving open the possibility that the deltas arise from training stochasticity rather than the latent factorization and hard-region weighting.
[Abstract / Results] The attribution of gains specifically to FWM-HW-LD (latent factorization plus hard-region weighting of JEPA and dynamics losses) requires that all other training factors (initialization, data ordering, optimizer state) are identical to the reference baseline. The manuscript provides no description of such controls or ablation isolating these components from other unstated procedural differences.
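
The variance reporting requested in these comments can be sketched with the Python standard library. The snippet pairs baseline and variant runs by seed and reports the mean and sample standard deviation of the per-seed deltas. The accuracy values are made-up placeholders, not results from the paper, and the function names are hypothetical.

```python
from statistics import mean, stdev


def paired_deltas(baseline, variant):
    """Per-seed accuracy differences (variant minus baseline), in percentage points."""
    if len(baseline) != len(variant):
        raise ValueError("baseline and variant need one score per shared seed")
    return [v - b for b, v in zip(baseline, variant)]


def summarize(values):
    """Mean and sample standard deviation (needs at least two values)."""
    return mean(values), stdev(values)


# Placeholder accuracies for three seeds; NOT numbers reported in the paper.
baseline_runs = [50.0, 51.0, 49.0]
variant_runs = [53.0, 55.0, 51.0]

deltas = paired_deltas(baseline_runs, variant_runs)
delta_mean, delta_std = summarize(deltas)
print(f"delta = {delta_mean:+.2f} ± {delta_std:.2f} pp over {len(deltas)} seeds")
# prints: delta = +3.00 ± 1.00 pp over 3 seeds
```

Pairing by seed removes the variation shared by both runs of a seed (for example, a common data order), so the spread of the deltas speaks more directly to whether a gain exceeds run-to-run noise.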

minor comments (2)

The abstract expands the acronym FWM-HW-LD; the main text should also give the full name on first mention for clarity.
Consider adding a table summarizing all 18 auxiliary objectives with their key design choices (factorization, weighting, etc.) to make the experimental design easier to follow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our empirical study. We address the two major comments below regarding statistical reporting and experimental controls. Revisions will be incorporated to improve clarity and robustness without altering the core findings.

Referee: [Abstract] The central empirical claim states precise improvements (+5.92 pp on ImageNet-100, +3.21 pp on SSv2) as single-point estimates. No error bars, standard deviations across runs, number of random seeds, or statistical significance tests are mentioned, leaving open the possibility that the deltas arise from training stochasticity rather than the latent factorization and hard-region weighting.

Authors: We acknowledge that the reported deltas are presented as single-point estimates without accompanying measures of variability. Our study systematically compared 18 auxiliary objective variants under fixed computational budgets, which limited repeated runs. To strengthen the claims, the revised manuscript will include results averaged over three random seeds for the baseline and FWM-HW-LD variants, along with standard deviations and error bars on the key tables and in the abstract. revision: yes
Referee: [Abstract / Results] The attribution of gains specifically to FWM-HW-LD (latent factorization plus hard-region weighting of JEPA and dynamics losses) requires that all other training factors (initialization, data ordering, optimizer state) are identical to the reference baseline. The manuscript provides no description of such controls or ablation isolating these components from other unstated procedural differences.

Authors: All variants, including the reference baseline and FWM-HW-LD, were trained using an identical codebase, the same data loading and preprocessing pipeline, fixed hyperparameters, and consistent initialization procedures. Only the auxiliary objective formulation differs. We agree that explicit documentation of these controls is necessary for reproducibility. The revision will add a dedicated paragraph in the experimental setup section confirming that training factors were held constant and describing the seed handling and optimizer state management. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical comparisons with no derivation chain

The manuscript is an empirical study that trains Video-JEPA variants with 18 auxiliary-objective configurations and reports downstream accuracies on three benchmarks. No mathematical derivation, uniqueness theorem, or first-principles prediction is claimed; the central results are direct experimental deltas (e.g., +5.92 pp on ImageNet-100) obtained by comparing each variant against a fixed reference baseline under the same training regime. Because there is no load-bearing equation or fitted quantity that reduces to itself by construction, and no self-citation is invoked to justify a theoretical step, the work is self-contained against external benchmarks and exhibits no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the empirical effectiveness of the introduced objective; no explicit free parameters, mathematical axioms, or new physical entities are introduced beyond standard deep-learning training assumptions.

invented entities (1)

FWM-HW-LD objective · no independent evidence
Purpose: Separates the latent representation into appearance and dynamics subspaces and applies hard-region weighting to prediction and dynamics errors.
Presented as a new training-time objective whose benefits are demonstrated experimentally.

pith-pipeline@v0.9.0 · 5774 in / 1326 out tokens · 68381 ms · 2026-05-20T14:34:15.181412+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "FWM-HW-LD ... separates the latent representation into appearance and dynamics subspaces and applies hard-region weighting to both JEPA prediction errors and latent dynamics errors"

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Paper passage: "$L_{\text{static}} = \mathbb{E}\big[\,|Z_{\text{app}}(t+1) - Z_{\text{app}}(t)|\,\big]$; $L_{\text{orth}} = \frac{1}{N}\,\lVert C_{Z_{\text{app}}}^{\top} C_{Z_{\text{dyn}}} \rVert_F^2$"
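
The two losses in the passage quoted above can be sketched numerically. The snippet below is a minimal NumPy reading of those formulas, under stated assumptions: the passage does not define $C_Z$ or $N$, so the code assumes $C_Z$ is the column-centered feature matrix and $N$ is the number of rows (batch × time). The function names and tensor layout `(batch, time, dim)` are hypothetical.

```python
import numpy as np


def static_loss(z_app):
    """Mean absolute change of appearance latents between consecutive time steps.

    z_app has shape (batch, time, dim) with time >= 2.
    """
    return float(np.abs(z_app[:, 1:] - z_app[:, :-1]).mean())


def orth_loss(z_app, z_dyn):
    """Squared Frobenius norm of the cross-product of centered features, over N.

    Assumption: C_Z is Z with each feature column mean-centered, and N is the
    number of rows after flattening batch and time.
    """
    a = z_app.reshape(-1, z_app.shape[-1])
    d = z_dyn.reshape(-1, z_dyn.shape[-1])
    n = a.shape[0]
    c_app = a - a.mean(axis=0, keepdims=True)
    c_dyn = d - d.mean(axis=0, keepdims=True)
    cross = c_app.T @ c_dyn                      # (dim_app, dim_dyn)
    return float((cross ** 2).sum() / n)


rng = np.random.default_rng(0)
z_app = rng.normal(size=(4, 6, 10))
z_dyn = rng.normal(size=(4, 6, 6))
print(static_loss(z_app), orth_loss(z_app, z_dyn))
```

Under this reading, the static term is zero when the appearance latents do not change over time, and the orthogonality term is zero when every appearance feature is uncorrelated with every dynamics feature.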

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Appendix excerpts

A. Pseudocode for Key Methods. Algorithm 1 summarizes the standard V-JEPA training loop. Algorithm 2 describes FWM-HW-LD, which extends the baseline with hard-weighted prediction, factorization losses, and ...

... so that the same data pipeline handles all three sources. The 60% SSv2 weight reflects our explicit goal of strengthening temporal reasoning.

Mask sampling. The default V-JEPA mask is a multi-block spatiotemporal tube. For Motion-Guided Masking, we compute per-frame motion energy as the mean L1 difference between consecutive frames at the patch level,...
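
The motion-energy computation described in the mask-sampling fragment above can be sketched as follows. This is a minimal NumPy illustration that covers only the stated step (mean L1 difference between consecutive frames, pooled per patch); how masks are then sampled from the energy map is cut off in the source and is not shown. The function name, the patch size of 16, and the array layout are assumptions.

```python
import numpy as np


def patch_motion_energy(video, patch=16):
    """Mean absolute frame difference per patch.

    video: array of shape (T, H, W) or (T, H, W, C) with T >= 2.
    Returns an array of shape (T - 1, H // patch, W // patch).
    """
    video = np.asarray(video, dtype=np.float64)
    if video.ndim == 3:
        video = video[..., None]                 # add a channel axis
    t, h, w, c = video.shape
    if t < 2:
        raise ValueError("need at least two frames")
    gh, gw = h // patch, w // patch
    # L1 difference between consecutive frames, cropped to whole patches.
    diff = np.abs(video[1:] - video[:-1])[:, : gh * patch, : gw * patch]
    # Group pixels into (patch x patch) blocks and average each block.
    diff = diff.reshape(t - 1, gh, patch, gw, patch, c)
    return diff.mean(axis=(2, 4, 5))


# Toy clip: only the top-left patch changes, in the middle frame.
clip = np.zeros((3, 32, 32))
clip[1, :16, :16] = 1.0
energy = patch_motion_energy(clip, patch=16)
print(energy.shape)
print(energy[:, 0, 0])
```

In the toy clip, only the top-left patch has nonzero energy, for both frame transitions, which is the kind of signal a motion-guided mask sampler could use to favor moving regions.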

[1] [1]

ViViT: A video vision transformer

Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lu ˇci´c, and Cordelia Schmid. ViViT: A video vision transformer. InICCV, 2021. 2

work page 2021

[2] [2]

Self-supervised learning from images with a joint-embedding predictive architecture

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bo- janowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. InCVPR, 2023. 1, 2

work page 2023

[3] Mahmoud Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Yann LeCun, Michael Rabbat, and Nicolas Ballas. V-JEPA 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985,

[4] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: Bert pre-training of image transformers. In ICLR, 2022. 2

[5] Adrien Bardes, Jean Ponce, and Yann LeCun. VICReg: Variance-invariance-covariance regularization for self-supervised learning. In ICLR, 2022. 2

[6] Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. V-JEPA: Video joint embedding predictive architecture. arXiv preprint arXiv:2404.08471, 2024. 1, 2, 5

[7] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In ICML, 2021. 2

[8] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS, 2020. 2

[9] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021. 2

[10] João Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR,

[11] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020. 2

[12] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning,

[13] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100. International Journal of Computer Vision, 2022. 2

[14] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015. 2

[15] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021. 2, 4

[16] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. Multiscale vision transformers. In ICCV, 2021. 2

[17] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In ICCV, 2019. 2

[18] Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, and Kaiming He. Masked autoencoders as spatiotemporal learners. In NeurIPS, 2022. 2

[19] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, Florian Hoppe, Christian Thurau, Ingo Bax, and Roland Memisevic. The “something something” video database for learning and evaluating visual common sense. In ICCV, 2017. 1, 2, 4, 5

[20] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised learning. In NeurIPS, 2020. 2

[21] Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Fei-Fei Li, Irfan Essa, Lu Jiang, and José Lezama. Siamese masked autoencoders. In NeurIPS, 2024. 2

[22] Tengda Han, Weidi Xie, and Andrew Zisserman. Self-supervised co-training for video representation learning. In NeurIPS, 2020. 2, 4

[23] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020. 2

[24] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022. 1, 2, 3

[25] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017. 2

[26] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In ICLR, 2019. 2

[27] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014. 2

[28] Quan Kong, Wen Wei, Ziwei Deng, Tomoo Yoshinaga, and Tomokazu Murakami. Cycle-contrast for self-supervised video representation learning. In NeurIPS, 2020. 2, 4

[29] Yann LeCun. A path towards autonomous machine intelligence, 2022. Open review essay. 1, 2

[30] Yingwei Li, Yi Li, and Nuno Vasconcelos. Resound: Towards action recognition without representation bias. In ECCV, 2018. 1, 2, 5

[31] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin transformer. In CVPR,

[32] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019. 4

[33] Luc Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. LeJEPA: Stable joint-embedding predictive architectures with sketched-isotropic-Gaussian regularization. arXiv preprint arXiv:2511.08544,

[34] Ishan Misra, C. Lawrence Zitnick, and Martial Hebert. Shuffle and learn: Unsupervised learning using temporal order verification. In ECCV, 2016. 2

[35] Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha, Mike Rabbat, Yann LeCun, Nicolas Ballas, and Adrien Bardes. V-JEPA 2.1: Unlocking dense features in video self-supervised learning. arXiv preprint arXiv:2603.14482, 2026. 2

[36] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV,

[37] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding, 2018. 2

[38] Tian Pan, Yibing Song, Tianyu Yang, Wenhao Jiang, and Wei Liu. Videomoco: Contrastive video representation learning with temporally adversarial examples. In CVPR, 2021. 2

[39] Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. In CVPR, 2017. 2, 3

[40] Adrià Recasens, Pauline Luc, Jean-Baptiste Alayrac, Luyu Wang, Florian Strub, Corentin Tallec, Mateusz Malinowski, Viorica Pătrăucean, Florent Altché, Michal Valko, Jean-Bastien Grill, Aaron van den Oord, and Andrew Zisserman. Broaden Your Views for Self-Supervised Video Learning. In ICCV, 2021. 2, 4

[41] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. In International Journal of Computer Vision, 2015. 2, 4, 5

[42] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. In CRCV-TR-12-01, 2012. 2, 4

[43] Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. Videobert: A joint model for video and language representation learning. In ICCV, 2019. 2

[44] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In ECCV, 2020. 2

[45] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In NeurIPS, 2022. 1, 2, 3

[46] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018. 2

[47] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017. 2

[48] Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. VideoMAE V2: Scaling video masked autoencoders with dual masking. In CVPR, 2023. 2

[49] Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. SimMIM: A simple framework for masked image modeling. In CVPR, 2022. 2

[50] Dejing Xu, Jun Xiao, Zhou Zhao, Jian Shao, Di Xie, and Yueting Zhuang. Self-supervised spatiotemporal learning via video clip order prediction. In CVPR, 2019. 2

[51] Shen Yan, Xuehan Xiong, Anurag Arnab, Zhicheng Lu, Mi Zhang, Chen Sun, and Cordelia Schmid. Video representation learning using discriminative pooling. In CVPR, 2022. 2
