How You Move Tells What You'll Do: Trajectory-Conditioned Egocentric Prediction

Hai Nguyen-Truong; Lorenzo Torresani; Luigi Seminara; Sejoon Jun

arxiv: 2605.20388 · v1 · pith:S5PMOIDQnew · submitted 2026-05-19 · 💻 cs.CV

How You Move Tells What You'll Do: Trajectory-Conditioned Egocentric Prediction

Sejoon Jun , Hai Nguyen-Truong , Luigi Seminara , Lorenzo Torresani This is my paper

Pith reviewed 2026-05-21 07:10 UTC · model grok-4.3

classification 💻 cs.CV

keywords egocentric videotrajectory predictionaction anticipationprocedural planningintent predictionfirst-person viewcamera motiongoal-free prediction

0 comments

The pith

Future camera trajectories encode operator intent to predict actions more accurately than language in egocentric video.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that the future path of the camera in first-person video carries the person's intent in enough detail to specify how an upcoming action will play out. This would matter because prediction models currently struggle with many possible futures from the same starting view, often averaging them into incorrect outputs. By first predicting likely trajectories from the current context and then using those to guide action prediction in a special embedding space, the approach avoids language conditioning altogether while achieving better results. The gains are especially clear for longer time horizons and when goals are not provided at test time. This setup also works with estimated camera poses from RGB alone.

Core claim

The future camera trajectory carries the operator's intent in a form fine enough to determine how an action will unfold, substantially outperforming language as a conditioning signal. This same intent makes the trajectory itself partially predictable from the context, so TrajPilot predicts candidate future trajectories from egocentric context and uses them to pilot action prediction in an action-aligned embedding space where language shapes the structure but is never used as a conditioning input.

What carries the argument

TrajPilot, which predicts candidate future trajectories from egocentric context and conditions action prediction on them within an action-aligned embedding space.

If this is right

Beats VLM and structured-planner baselines on procedural planning tasks across multiple egocentric datasets including Ego-Exo4D and Ego4D.
The performance advantage over baselines increases as the prediction horizon lengthens.
Maintains gains even when using RGB-only estimated camera poses instead of ground-truth trajectories.
Enables goal-free anticipation that outperforms VLM baselines on Ego-Exo4D atomic tasks.
Extends successfully to datasets like EPIC-Kitchens-100 and basketball shot-outcome prediction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If head trajectories reliably signal intent, this could be used to anticipate user actions in augmented reality or assistive robotics without explicit commands.
Trajectory conditioning might generalize to other video domains where motion paths disambiguate future events better than text descriptions.
Models could be trained to jointly optimize trajectory prediction and action forecasting for better overall performance on long sequences.

Load-bearing premise

The future trajectory can be predicted sufficiently well from the current egocentric context that the model does not need the actual future trajectory provided at test time to achieve most of its improvement.

What would settle it

Training a variant of the model that uses randomly sampled or mean trajectories instead of predicted ones and finding no significant drop in action prediction accuracy would falsify the claim that predicted trajectories provide useful conditioning.

Figures

Figures reproduced from arXiv: 2605.20388 by Hai Nguyen-Truong, Lorenzo Torresani, Luigi Seminara, Sejoon Jun.

**Figure 2.** Figure 2: Latent-space planning fails on EgoExo4D atomic-action planning. CEM (searching trajectory candidates by ℓ1 to the goal latent) underperforms No-Traj at every horizon; Oracle (groundtruth trajectory) shows trajectory carries substantial signal. Full mean accuracy in [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: TrajPilot architecture overview. (A) Training: V-JEPA context features and trajectory embeddings from Eτ feed an additive token construction read by a causal predictor that scores against the EgoVLPv2 action bank (§3.4–3.5). (B) Inference: No-Traj mode runs the predictor with a zero trajectory input; Scorer mode retrieves K candidate trajectories, runs the predictor over each in parallel, and selects via t… view at source ↗

**Figure 4.** Figure 4: Atomic-action planning on Ego-Exo4D. Left and middle: open vocabulary (|V|=8,472); mid-step retrieval (M@1) and exact mid-sequence match (MSeq) versus planning horizon H; baselines are VLMs. Right: closed vocabulary (|V|=1,165); Mid R@1; baselines are structured procedural planners. Oracle (dashed) is a diagnostic upper bound that uses ground-truth middle trajectory at inference. Per-horizon open-vocabular… view at source ↗

**Figure 5.** Figure 5: Open-vocabulary atomic anticipation on Ego-Exo4D atomic [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Latent-space planning under Full mean accuracy (companion to Figure 2). Same predictor and three [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗

**Figure 7.** Figure 7: Gate-then-rank scorer mechanism. Top-K retrieved trajectory candidates are rolled through the frozen trajectory-conditioned predictor (left) to produce candidate plans; in parallel, the frozen No-Traj predictor produces the fallback. The scorer (center) reads all candidate plans plus the fallback and emits a binary gate logit and per-candidate rank scores. At inference (right), the gate routes to either th… view at source ↗

**Figure 8.** Figure 8: G Broader impacts TrajPilot predicts near-future actions and outcomes from first-person video. Positive applications include assistive guidance for skilled procedural tasks (flagging deviations during cooking, assembly, or medical procedures before errors compound), training and coaching feedback in sports and crafts, and accessibility tools that anticipate user intent from movement. Negative applications … view at source ↗

**Figure 8.** Figure 8: Qualitative open-vocabulary planning examples. Each block shows the observed start clip on the left, the observed goal clip on the right, and the predicted action sequences. TrajPilot (Scorer) matches the ground-truth action sequence in these examples, while the Qwen-SFT+LLM baseline often predicts plausible but incorrect intermediate actions (e.g., wrong object/action substitutions or incorrect procedural… view at source ↗

read the original abstract

Predicting how a person's first-person view will evolve (what action will follow, what plan completes a task, whether an in-progress shot will score) is fundamentally under-specified: the same context admits many plausible futures, and a model trained to minimize prediction error is forced to hedge or average across them, getting it wrong either way. Two findings shape our approach. First, the future camera trajectory, the path the head carves through space, lets the model commit to one of those futures: it carries the operator's intent in a form fine enough to determine how an action will unfold, substantially outperforming language as a conditioning signal. Second, this same intent makes the trajectory itself partially predictable from the context at hand, enough that trajectory need not be observed at test time to recover most of the gain. We instantiate these findings as TrajPilot, a model that predicts candidate future trajectories from egocentric context and uses them to pilot action prediction in an action-aligned embedding space where language shapes the structure but is never used as a conditioning input. TrajPilot beats VLM and structured-planner baselines on procedural planning across Ego-Exo4D atomic, Ego-Exo4D Keystep, Ego4D GoalStep, and EgoPER, with the trajectory advantage widening with horizon (exactly where prior planners collapse) and holding under RGB-only camera-pose estimation. With the goal masked at inference, the same model performs goal-free anticipation, beating VLM baselines on Ego-Exo4D atomic and extending to EPIC-Kitchens-100 and basketball shot-outcome prediction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Trajectories encode intent more precisely than language for egocentric action prediction, and the model predicts them from context well enough to capture most of the reported gains without test-time observation.

read the letter

The core finding is that future camera trajectories in first-person video carry enough detail about the operator's intent to improve action and goal prediction over language conditioning. TrajPilot predicts candidate trajectories from egocentric context, then uses them to pilot prediction inside an action-aligned embedding space. Language helps shape the space during training but is absent at inference. This setup beats VLM and structured-planner baselines on Ego-Exo4D atomic and Keystep, Ego4D GoalStep, and EgoPER. The edge grows with longer horizons, where the baselines degrade, and it survives RGB-only pose estimation. The same model also handles goal-free anticipation on EPIC-Kitchens-100 and basketball shot outcomes when the goal is masked at test time.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces TrajPilot, a model that predicts candidate future camera trajectories from egocentric video context and conditions downstream action prediction and procedural planning on those trajectories inside an action-aligned embedding space. It claims that trajectories encode operator intent more finely than language, yielding consistent gains over VLM and structured-planner baselines on Ego-Exo4D atomic, Ego-Exo4D Keystep, Ego4D GoalStep, and EgoPER; that the trajectory advantage widens with horizon; that the gains persist under RGB-only pose estimation; and that the same model supports goal-free anticipation on Ego-Exo4D, EPIC-Kitchens-100, and basketball shot-outcome prediction.

Significance. If the central claims are substantiated, the work would be significant for egocentric vision and action anticipation: it supplies a conditioning signal that demonstrably outperforms language for long-horizon procedural tasks where current planners degrade. The multi-dataset evaluation and the reported robustness to RGB-only pose estimation are concrete strengths. No machine-checked proofs or open reproducible code are mentioned, so the assessment rests entirely on the empirical results.

major comments (1)

[Experiments / Results] The manuscript asserts that predicted trajectories recover most of the performance gain without test-time observation of the trajectory. However, no explicit ablation is presented that directly compares action-prediction accuracy when conditioning on ground-truth trajectories versus on the model's own predicted trajectories. This comparison is load-bearing for the second finding, especially because the reported advantage widens with horizon (where prediction error is expected to grow).

minor comments (1)

[Method] The description of the action-aligned embedding space and how language shapes its structure without being used at inference could be expanded with a diagram or explicit equations in §3.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive assessment of the work's significance and for the constructive feedback on the experimental design. We address the major comment below.

read point-by-point responses

Referee: The manuscript asserts that predicted trajectories recover most of the performance gain without test-time observation of the trajectory. However, no explicit ablation is presented that directly compares action-prediction accuracy when conditioning on ground-truth trajectories versus on the model's own predicted trajectories. This comparison is load-bearing for the second finding, especially because the reported advantage widens with horizon (where prediction error is expected to grow).

Authors: We agree that an explicit side-by-side comparison of action-prediction performance under ground-truth versus predicted trajectory conditioning is important to substantiate the claim that predicted trajectories recover most of the gain. The manuscript reports strong results with predicted trajectories and shows that the advantage over baselines grows with horizon, but does not include the direct GT-versus-predicted ablation requested. In the revised manuscript we will add this ablation, reporting action-prediction accuracy for both conditioning regimes across horizons on Ego-Exo4D and Ego4D GoalStep. This will quantify how much performance is retained when the model must rely on its own trajectory predictions. revision: yes

Circularity Check

0 steps flagged

No circularity: learned predictors and empirical ablations remain independent of inputs

full rationale

The paper trains TrajPilot to predict future trajectories from egocentric context and then conditions action prediction on those outputs in an embedding space shaped by language but without using language as conditioning. Both the trajectory predictor and the downstream action model are trained end-to-end on data; no equation defines the output as a direct function of the input by construction, no parameter is fitted on a subset and then renamed a prediction, and no load-bearing claim rests on a self-citation whose content is unverified. Experiments compare against VLM and planner baselines across multiple datasets and horizons, with the advantage claimed to persist under predicted (not ground-truth) trajectories. This structure is self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are identifiable beyond standard neural network assumptions and dataset-specific training.

pith-pipeline@v0.9.0 · 5825 in / 1012 out tokens · 34152 ms · 2026-05-21T07:10:49.170046+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

future camera trajectory... carries the operator's intent in a form fine enough to determine how an action will unfold, substantially outperforming language as a conditioning signal
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

TrajPilot... predicts candidate future trajectories from egocentric context and uses them to pilot action prediction in an action-aligned embedding space

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 7 internal anchors

[1]

Self-supervised learning from images with a joint- embedding predictive architecture

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint- embedding predictive architecture. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

work page 2023
[2]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Mahmoud Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, et al. V-JEPA 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Walk through paintings: Ego-centric world models from internet priors.arXiv preprint arXiv:2601.15284, 2026

Anurag Bagchi, Zhipeng Bao, Homanga Bharadhwaj, Yu-Xiong Wang, Pavel Tokmakov, and Martial Hebert. Walk through paintings: Ego-centric world models from internet priors.arXiv preprint arXiv:2601.15284, 2026

work page arXiv 2026
[4]

Whole- body conditioned egocentric video prediction

Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell, and Jitendra Malik. Whole- body conditioned egocentric video prediction. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. arXiv:2506.21552

work page arXiv 2025
[5]

Navigation world mod- els

Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world mod- els. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15791–15801, 2025

work page 2025
[6]

Revisiting Feature Prediction for Learning Visual Representations from Video

Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual repre- sentations from video.arXiv preprint arXiv:2404.08471, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

Procedure planning in instructional videos

Chien-Yi Chang, De-An Huang, Danfei Xu, Ehsan Adeli, Li Fei-Fei, and Juan Carlos Niebles. Procedure planning in instructional videos. InEuropean Conference on Computer Vision (ECCV), 2020

work page 2020
[8]

Planning with reasoning using vision language world model.arXiv preprint arXiv:2509.02722, 2025

Delong Chen, Théo Moutakanni, Willy Chung, Yejin Bang, Ziwei Ji, Allen Bolourchi, and Pascale Fung. Planning with reasoning using vision language world model.arXiv preprint arXiv:2509.02722, 2025

work page arXiv 2025
[9]

arXiv preprint arXiv:2512.10942 (2025)

Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Allen Bolourchi, Yann LeCun, and Pascale Fung. VL-JEPA: Joint embedding predictive architecture for vision-language.arXiv preprint arXiv:2512.10942, 2025

work page arXiv 2025
[10]

Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-Kitchens-100.International Journal of Computer Vision, 130:33–55, 2022

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, et al. Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-Kitchens-100.International Journal of Computer Vision, 130:33–55, 2022

work page 2022
[11]

Anticipative video transformer

Rohit Girdhar and Kristen Grauman. Anticipative video transformer. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 13505–13515, 2021

work page 2021
[12]

Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, et al. Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[13]

Mastering Diverse Domains through World Models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

GEM: A generalizable ego-vision multimodal world model for fine-grained ego-motion, object dynamics, and scene composition control.arXiv preprint arXiv:2412.11198, 2024

Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Pedro M B Rezende, Yasaman Haghighi, David Brüggemann, Isinsu Katircioglu, Lin Zhang, Xiaoran Chen, Suman Saha, Marco Cannici, et al. GEM: A generalizable ego-vision multimodal world model for fine-grained ego-motion, object dynamics, and scene composition control.arXiv preprint arXiv:2412.11198, 2024

work page arXiv 2024
[15]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[16]

Error detection in egocentric procedural task videos

Shih-Po Lee, Zijia Lu, and Kristen Grauman. Error detection in egocentric procedural task videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion (CVPR), 2024. 10

work page 2024
[17]

VEDiT: Latent prediction architecture for procedural video representation learning

Han Lin, Tushar Nagarajan, Nicolas Ballas, Mahmoud Assran, Mojtaba Komeili, Mohit Bansal, and Koustuv Sinha. VEDiT: Latent prediction architecture for procedural video representation learning. InInternational Conference on Learning Representations (ICLR), 2025

work page 2025
[18]

Can’t make an omelette without breaking some eggs: Plausible action anticipation using large video-language models

Himangi Mittal, Nakul Agarwal, Shao-Yuan Lo, and Kwonjoon Lee. Can’t make an omelette without breaking some eggs: Plausible action anticipation using large video-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[19]

V-JEPA 2.1: Unlocking dense features in video self-supervised learning.arXiv preprint arXiv:2603.14482, 2026

Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mahmoud Assran, Koustuv Sinha, Michael Rabbat, Yann LeCun, Nicolas Ballas, and Adrien Bardes. V-JEPA 2.1: Unlocking dense features in video self-supervised learning.arXiv preprint arXiv:2603.14482, 2026

work page arXiv 2026
[20]

Schema: State changes matter for pro- cedure planning in instructional videos.arXiv preprint arXiv:2403.01599, 2024

Yulei Niu, Wenliang Guo, Long Chen, Xudong Lin, and Shih-Fu Chang. SCHEMA: State CHangEs MAtter for procedure planning in instructional videos. InInternational Conference on Learning Representations (ICLR), 2024. arXiv:2403.01599

work page arXiv 2024
[21]

Cosmos World Foundation Model Platform for Physical AI

NVIDIA. Cosmos world foundation model technical report.arXiv preprint arXiv:2501.03575, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[22]

EgoVLPv2: Egocentric video-language pre- training with fusion in the backbone

Shraman Pramanick, Yale Song, Sayan Nag, Kevin Qinghong Lin, Hardik Shah, Mike Zheng Shou, Rama Chellappa, and Pengchuan Zhang. EgoVLPv2: Egocentric video-language pre- training with fusion in the backbone. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

work page 2023
[23]

Interaction region visual trans- former for egocentric action anticipation

Debaditya Roy, Ramanathan Rajendiran, and Basura Fernando. Interaction region visual trans- former for egocentric action anticipation. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024

work page 2024
[24]

ViterbiPlanNet: Injecting procedural knowledge via differentiable Viterbi for planning in instructional videos, 2026

Luigi Seminara, Davide Moltisanti, and Antonino Furnari. ViterbiPlanNet: Injecting procedural knowledge via differentiable Viterbi for planning in instructional videos, 2026

work page 2026
[25]

Ego4D Goal-Step: Toward hierarchical understanding of procedural activities

Yale Song, Eugene Byrne, Tushar Nagarajan, Huiyu Wang, Roberto Martín-Martín, and Lorenzo Torresani. Ego4D Goal-Step: Toward hierarchical understanding of procedural activities. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2023

work page 2023
[26]

EgoDistill: Egocentric head motion distillation for efficient video understanding

Shuhan Tan, Tushar Nagarajan, and Kristen Grauman. EgoDistill: Egocentric head motion distillation for efficient video understanding. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

work page 2023
[27]

COIN: A large-scale dataset for comprehensive instructional video analysis

Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. COIN: A large-scale dataset for comprehensive instructional video analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

work page 2019
[28]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Informa- tion Processing Systems (NeurIPS), volume 30, 2017

work page 2017
[29]

PDPP: Projected diffusion for procedure planning in instructional videos

Hanlin Wang, Yilu Wu, Sheng Guo, and Limin Wang. PDPP: Projected diffusion for procedure planning in instructional videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14836–14845, 2023

work page 2023
[30]

$\pi^3$: Permutation-Equivariant Visual Geometry Learning

Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. Scalable permutation-equivariant visual geometry learning.arXiv preprint arXiv:2507.13347, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

Seeing without pixels: Perception from camera trajectories.arXiv preprint arXiv:2511.21681, 2025

Zihui Xue, Kristen Grauman, Dima Damen, Andrew Zisserman, and Tengda Han. Seeing without pixels: Perception from camera trajectories.arXiv preprint arXiv:2511.21681, 2025

work page arXiv 2025
[32]

Video-LLaMA: An instruction-tuned audio-visual language model for video understanding

Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2023. 11

work page 2023
[33]

GeoWorld: Geometric World Models

Zeyu Zhang, Danning Li, Ian Reid, and Richard Hartley. GeoWorld: Geometric world models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026. arXiv:2602.23058

work page internal anchor Pith review Pith/arXiv arXiv 2026
[34]

sequence

Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, and Josef Sivic. Cross-task weakly supervised learning from instructional videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 12 A Implementation details A.1 Datasets Ego-Exo4D atomic.We use the open-vocabulary Eg...

work page arXiv 2019

[1] [1]

Self-supervised learning from images with a joint- embedding predictive architecture

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint- embedding predictive architecture. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

work page 2023

[2] [2]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Mahmoud Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, et al. V-JEPA 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Walk through paintings: Ego-centric world models from internet priors.arXiv preprint arXiv:2601.15284, 2026

Anurag Bagchi, Zhipeng Bao, Homanga Bharadhwaj, Yu-Xiong Wang, Pavel Tokmakov, and Martial Hebert. Walk through paintings: Ego-centric world models from internet priors.arXiv preprint arXiv:2601.15284, 2026

work page arXiv 2026

[4] [4]

Whole- body conditioned egocentric video prediction

Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell, and Jitendra Malik. Whole- body conditioned egocentric video prediction. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. arXiv:2506.21552

work page arXiv 2025

[5] [5]

Navigation world mod- els

Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world mod- els. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15791–15801, 2025

work page 2025

[6] [6]

Revisiting Feature Prediction for Learning Visual Representations from Video

Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual repre- sentations from video.arXiv preprint arXiv:2404.08471, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

Procedure planning in instructional videos

Chien-Yi Chang, De-An Huang, Danfei Xu, Ehsan Adeli, Li Fei-Fei, and Juan Carlos Niebles. Procedure planning in instructional videos. InEuropean Conference on Computer Vision (ECCV), 2020

work page 2020

[8] [8]

Planning with reasoning using vision language world model.arXiv preprint arXiv:2509.02722, 2025

Delong Chen, Théo Moutakanni, Willy Chung, Yejin Bang, Ziwei Ji, Allen Bolourchi, and Pascale Fung. Planning with reasoning using vision language world model.arXiv preprint arXiv:2509.02722, 2025

work page arXiv 2025

[9] [9]

arXiv preprint arXiv:2512.10942 (2025)

Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Allen Bolourchi, Yann LeCun, and Pascale Fung. VL-JEPA: Joint embedding predictive architecture for vision-language.arXiv preprint arXiv:2512.10942, 2025

work page arXiv 2025

[10] [10]

Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-Kitchens-100.International Journal of Computer Vision, 130:33–55, 2022

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, et al. Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-Kitchens-100.International Journal of Computer Vision, 130:33–55, 2022

work page 2022

[11] [11]

Anticipative video transformer

Rohit Girdhar and Kristen Grauman. Anticipative video transformer. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 13505–13515, 2021

work page 2021

[12] [12]

Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, et al. Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024

[13] [13]

Mastering Diverse Domains through World Models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[14] [14]

GEM: A generalizable ego-vision multimodal world model for fine-grained ego-motion, object dynamics, and scene composition control.arXiv preprint arXiv:2412.11198, 2024

Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Pedro M B Rezende, Yasaman Haghighi, David Brüggemann, Isinsu Katircioglu, Lin Zhang, Xiaoran Chen, Suman Saha, Marco Cannici, et al. GEM: A generalizable ego-vision multimodal world model for fine-grained ego-motion, object dynamics, and scene composition control.arXiv preprint arXiv:2412.11198, 2024

work page arXiv 2024

[15] [15]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[16] [16]

Error detection in egocentric procedural task videos

Shih-Po Lee, Zijia Lu, and Kristen Grauman. Error detection in egocentric procedural task videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion (CVPR), 2024. 10

work page 2024

[17] [17]

VEDiT: Latent prediction architecture for procedural video representation learning

Han Lin, Tushar Nagarajan, Nicolas Ballas, Mahmoud Assran, Mojtaba Komeili, Mohit Bansal, and Koustuv Sinha. VEDiT: Latent prediction architecture for procedural video representation learning. InInternational Conference on Learning Representations (ICLR), 2025

work page 2025

[18] [18]

Can’t make an omelette without breaking some eggs: Plausible action anticipation using large video-language models

Himangi Mittal, Nakul Agarwal, Shao-Yuan Lo, and Kwonjoon Lee. Can’t make an omelette without breaking some eggs: Plausible action anticipation using large video-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024

[19] [19]

V-JEPA 2.1: Unlocking dense features in video self-supervised learning.arXiv preprint arXiv:2603.14482, 2026

Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mahmoud Assran, Koustuv Sinha, Michael Rabbat, Yann LeCun, Nicolas Ballas, and Adrien Bardes. V-JEPA 2.1: Unlocking dense features in video self-supervised learning.arXiv preprint arXiv:2603.14482, 2026

work page arXiv 2026

[20] [20]

Schema: State changes matter for pro- cedure planning in instructional videos.arXiv preprint arXiv:2403.01599, 2024

Yulei Niu, Wenliang Guo, Long Chen, Xudong Lin, and Shih-Fu Chang. SCHEMA: State CHangEs MAtter for procedure planning in instructional videos. InInternational Conference on Learning Representations (ICLR), 2024. arXiv:2403.01599

work page arXiv 2024

[21] [21]

Cosmos World Foundation Model Platform for Physical AI

NVIDIA. Cosmos world foundation model technical report.arXiv preprint arXiv:2501.03575, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[22] [22]

EgoVLPv2: Egocentric video-language pre- training with fusion in the backbone

Shraman Pramanick, Yale Song, Sayan Nag, Kevin Qinghong Lin, Hardik Shah, Mike Zheng Shou, Rama Chellappa, and Pengchuan Zhang. EgoVLPv2: Egocentric video-language pre- training with fusion in the backbone. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

work page 2023

[23] [23]

Interaction region visual trans- former for egocentric action anticipation

Debaditya Roy, Ramanathan Rajendiran, and Basura Fernando. Interaction region visual trans- former for egocentric action anticipation. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024

work page 2024

[24] [24]

ViterbiPlanNet: Injecting procedural knowledge via differentiable Viterbi for planning in instructional videos, 2026

Luigi Seminara, Davide Moltisanti, and Antonino Furnari. ViterbiPlanNet: Injecting procedural knowledge via differentiable Viterbi for planning in instructional videos, 2026

work page 2026

[25] [25]

Ego4D Goal-Step: Toward hierarchical understanding of procedural activities

Yale Song, Eugene Byrne, Tushar Nagarajan, Huiyu Wang, Roberto Martín-Martín, and Lorenzo Torresani. Ego4D Goal-Step: Toward hierarchical understanding of procedural activities. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2023

work page 2023

[26] [26]

EgoDistill: Egocentric head motion distillation for efficient video understanding

Shuhan Tan, Tushar Nagarajan, and Kristen Grauman. EgoDistill: Egocentric head motion distillation for efficient video understanding. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

work page 2023

[27] [27]

COIN: A large-scale dataset for comprehensive instructional video analysis

Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. COIN: A large-scale dataset for comprehensive instructional video analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

work page 2019

[28] [28]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Informa- tion Processing Systems (NeurIPS), volume 30, 2017

work page 2017

[29] [29]

PDPP: Projected diffusion for procedure planning in instructional videos

Hanlin Wang, Yilu Wu, Sheng Guo, and Limin Wang. PDPP: Projected diffusion for procedure planning in instructional videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14836–14845, 2023

work page 2023

[30] [30]

$\pi^3$: Permutation-Equivariant Visual Geometry Learning

Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. Scalable permutation-equivariant visual geometry learning.arXiv preprint arXiv:2507.13347, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[31] [31]

Seeing without pixels: Perception from camera trajectories.arXiv preprint arXiv:2511.21681, 2025

Zihui Xue, Kristen Grauman, Dima Damen, Andrew Zisserman, and Tengda Han. Seeing without pixels: Perception from camera trajectories.arXiv preprint arXiv:2511.21681, 2025

work page arXiv 2025

[32] [32]

Video-LLaMA: An instruction-tuned audio-visual language model for video understanding

Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2023. 11

work page 2023

[33] [33]

GeoWorld: Geometric World Models

Zeyu Zhang, Danning Li, Ian Reid, and Richard Hartley. GeoWorld: Geometric world models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026. arXiv:2602.23058

work page internal anchor Pith review Pith/arXiv arXiv 2026

[34] [34]

sequence

Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, and Josef Sivic. Cross-task weakly supervised learning from instructional videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 12 A Implementation details A.1 Datasets Ego-Exo4D atomic.We use the open-vocabulary Eg...

work page arXiv 2019