SFHand: Learning Embodied Manipulation by Streaming Egocentric 3D Hand Forecasting

Caixin Kang; Liangyang Ouyang; Ruicong Liu; Yifei Huang; Yoichi Sato

arxiv: 2511.18127 · v2 · pith:GMBD7KOEnew · submitted 2025-11-22 · 💻 cs.CV

SFHand: Learning Embodied Manipulation by Streaming Egocentric 3D Hand Forecasting

Ruicong Liu , Yifei Huang , Liangyang Ouyang , Caixin Kang , Yoichi Sato This is my paper

Pith reviewed 2026-05-21 17:48 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D hand forecastingstreaming frameworklanguage-guidedegocentric videoembodied manipulationautoregressive predictionROI-enhanced memory

0 comments

The pith

SFHand is the first streaming framework that autoregressively forecasts future 3D hand states from continuous egocentric video and language instructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SFHand to enable real-time 3D hand forecasting for applications like AR and robotics. Existing methods need offline video access and lack language guidance for task intent. SFHand uses a streaming autoregressive architecture with an ROI-enhanced memory layer to predict hand type, 2D bounding box, 3D pose, and trajectory on the fly. It also presents the EgoHaFL dataset with synchronized 3D poses and language. This setup achieves substantial gains in forecasting accuracy and helps improve embodied manipulation performance.

Core claim

SFHand autoregressively predicts a comprehensive set of future 3D hand states, including hand type, 2D bounding box, 3D pose, and trajectory, from a continuous stream of video and language instructions by combining a streaming autoregressive architecture with an ROI-enhanced memory layer that captures temporal context while focusing on salient hand-centric regions.

What carries the argument

Streaming autoregressive architecture combined with an ROI-enhanced memory layer for capturing temporal context in hand-centric regions.

If this is right

Outperforms prior work by up to 35.8% in 3D hand forecasting tasks.
Transfers learned representations to improve embodied manipulation task success rates by up to 13.4% on multiple benchmarks.
Supports real-time applications in AR and assistive robotics by processing live video streams without future frames.
Enables research through the introduction of the EgoHaFL dataset featuring synchronized 3D hand poses and language instructions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such streaming methods could reduce latency in interactive systems by avoiding batch processing of video.
Integrating language instructions might enable more adaptive and intent-aware hand predictions in dynamic environments.
The autoregressive nature suggests potential for long-term forecasting sequences if memory mechanisms scale appropriately.

Load-bearing premise

The approach relies on the ROI-enhanced memory layer and autoregressive design continuing to deliver accurate predictions in live streaming scenarios without any access to future video frames or offline post-processing.

What would settle it

Running the model on a live egocentric video feed in a real-time setting and measuring if forecasting accuracy drops significantly below offline baselines or fails to run at interactive speeds.

Figures

Figures reproduced from arXiv: 2511.18127 by Caixin Kang, Liangyang Ouyang, Ruicong Liu, Yifei Huang, Yoichi Sato.

**Figure 1.** Figure 1: Comparison between previous hand forecasting methods and our proposed approach. (a) Prior 3D hand forecasting models rely [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: The overview of our method. Given a streaming egocentric video, language instruction, and the current 3D hand state, our model [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Illustrations of various tasks in the Franka Kitchen [ [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Function of memory layer. All hands are forecasted from [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative comparison between our method and HaMeR. “HaMeR + In.” indicates HaMeR incorporating language instructions [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Results of robotic manipulation on Adroit. We report the [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

read the original abstract

Real-time 3D hand forecasting is a critical component for fluid human-computer interaction in applications like AR and assistive robotics. However, existing methods are ill-suited for these scenarios, as they typically require offline access to accumulated video sequences and cannot incorporate language guidance that conveys task intent. To overcome these limitations, we introduce SFHand, the first streaming framework for language-guided 3D hand forecasting. SFHand autoregressively predicts a comprehensive set of future 3D hand states, including hand type, 2D bounding box, 3D pose, and trajectory, from a continuous stream of video and language instructions. Our framework combines a streaming autoregressive architecture with an ROI-enhanced memory layer, capturing temporal context while focusing on salient hand-centric regions. To enable this research, we also introduce EgoHaFL, the first large-scale dataset featuring synchronized 3D hand poses and language instructions. We demonstrate that SFHand achieves new state-of-the-art results in 3D hand forecasting, outperforming prior work by a significant margin of up to 35.8%. Furthermore, we show the practical utility of our learned representations by transferring them to downstream embodied manipulation tasks, improving task success rates by up to 13.4% on multiple benchmarks. Dataset page: https://huggingface.co/datasets/ut-vision/EgoHaFL, project page: https://github.com/ut-vision/SFHand.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SFHand offers a streaming autoregressive model for language-guided 3D hand forecasting plus a new dataset, but the reported gains need checking for truly causal online evaluation.

read the letter

Hi, the main thing here is that SFHand introduces an autoregressive setup for predicting hand type, 2D box, 3D pose, and trajectory from continuous egocentric video and language instructions, with an ROI memory layer to keep focus on relevant regions. They also release the EgoHaFL dataset of synced 3D poses and instructions. This moves beyond the offline methods that dominated earlier work. The language conditioning for task intent is a practical addition that prior visual-only forecasting papers largely skipped. The transfer results are the strongest part: using the forecasts improves embodied manipulation success by up to 13.4% on benchmarks, which gives an external signal that the predictions carry usable information rather than just fitting the forecasting metrics. Dataset release on Hugging Face is also useful for anyone who wants to reproduce or build on it. The softer area is the streaming claim itself. The abstract positions the work as real-time and causal, yet it does not spell out whether the 35.8% forecasting gains were measured with strictly past-only context and forward-only memory updates or whether any batch or future-frame processing crept into the test rollouts. That distinction matters for the central motivation, so the methods section needs to show explicit causal inference details and memory state handling. Missing error bars and ablation tables on the ROI size or attention window are noticeable but fixable. This paper is aimed at researchers working on real-time egocentric vision for AR or assistive robotics. A reader who needs hand trajectory estimates inside a control loop would get concrete value from the architecture and the transfer experiments. I would send it to peer review. The new dataset and the downstream manipulation results give it enough substance to justify referee time, even if the authors need to tighten the evaluation protocol description.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces SFHand, the first streaming framework for language-guided 3D hand forecasting from egocentric video. SFHand employs an autoregressive architecture augmented by an ROI-enhanced memory layer to predict hand type, 2D bounding box, 3D pose, and trajectory from a continuous video stream and language instructions. The authors release the EgoHaFL dataset containing synchronized 3D hand poses and language annotations. They report new state-of-the-art forecasting performance with gains of up to 35.8% and demonstrate transfer to embodied manipulation tasks yielding up to 13.4% higher success rates.

Significance. If the streaming evaluation protocol is rigorously causal and the quantitative gains are reproducible under consistent metrics with proper controls, the work would advance real-time 3D hand forecasting for AR and robotics applications. The dataset release constitutes a concrete community contribution, and the downstream transfer results supply an external check on the utility of the learned representations.

major comments (3)

[Evaluation] Evaluation section: The central streaming claim requires that test-time rollouts are strictly causal (frame-by-frame autoregressive inference using only past and current frames with forward-only memory state). The manuscript does not explicitly confirm this protocol or rule out batch/offline evaluation with future-frame access; without such confirmation the reported 35.8% forecasting margin cannot be taken as evidence for the streaming regime advertised in the motivation.
[§4] §4 (Experiments) and Table 2: The paper reports large quantitative improvements but provides no error bars, statistical significance tests, or full ablation controls isolating the ROI-enhanced memory layer and language conditioning. These omissions make it difficult to assess whether the gains are robust or attributable to the proposed architecture.
[Dataset] Dataset section: Construction details for EgoHaFL (annotation protocol, train/val/test splits, language instruction collection) are insufficiently described to allow independent verification or reproduction of the reported results.

minor comments (2)

[Notation] Notation for the four predicted outputs (hand type, 2D bbox, 3D pose, trajectory) should be introduced once and used consistently in equations and figures.
[Figures] Figure captions describing the streaming pipeline would benefit from explicit arrows or labels indicating causal information flow versus any non-causal components.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the streaming evaluation protocol, experimental reporting, and dataset documentation. We address each major comment below and have revised the manuscript to incorporate the suggested improvements for clarity and reproducibility.

read point-by-point responses

Referee: [Evaluation] Evaluation section: The central streaming claim requires that test-time rollouts are strictly causal (frame-by-frame autoregressive inference using only past and current frames with forward-only memory state). The manuscript does not explicitly confirm this protocol or rule out batch/offline evaluation with future-frame access; without such confirmation the reported 35.8% forecasting margin cannot be taken as evidence for the streaming regime advertised in the motivation.

Authors: We agree that explicit confirmation of the causal protocol is essential. In the revised manuscript, we have added a new subsection in Section 4.1 detailing the streaming inference procedure, including confirmation that all test-time rollouts are strictly causal with frame-by-frame autoregressive prediction and forward-only memory state updates. We have also included pseudocode for the rollout process and a figure illustrating the causal data flow to rule out any future-frame access. revision: yes
Referee: [§4] §4 (Experiments) and Table 2: The paper reports large quantitative improvements but provides no error bars, statistical significance tests, or full ablation controls isolating the ROI-enhanced memory layer and language conditioning. These omissions make it difficult to assess whether the gains are robust or attributable to the proposed architecture.

Authors: We acknowledge that the original presentation lacked sufficient statistical controls. We have updated Table 2 to include error bars computed over five independent runs with different random seeds. Additional ablation experiments isolating the ROI-enhanced memory layer and language conditioning have been added to the supplementary material, along with paired t-test p-values comparing SFHand to baselines to assess statistical significance of the reported gains. revision: yes
Referee: [Dataset] Dataset section: Construction details for EgoHaFL (annotation protocol, train/val/test splits, language instruction collection) are insufficiently described to allow independent verification or reproduction of the reported results.

Authors: We thank the referee for pointing out the need for greater detail on reproducibility. Section 3 has been expanded with a full description of the annotation protocol, the crowdsourcing interface and quality control steps used for language instruction collection, the precise criteria and ratios for the train/val/test splits, and additional summary statistics. The complete annotation guidelines have also been added to the supplementary material and linked from the dataset page. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results on new dataset and architecture

full rationale

The paper introduces a new streaming autoregressive architecture (ROI-enhanced memory + decoder) and a new dataset (EgoHaFL) for language-guided 3D hand forecasting, then reports empirical SOTA gains (up to 35.8%) and downstream transfer improvements (up to 13.4%). No derivation chain, first-principles result, or prediction is shown to reduce by construction to its own inputs or fitted parameters; the claims rest on experimental comparison against prior methods on held-out data and external manipulation benchmarks, which constitute independent checks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claims rest on the existence of a sufficiently large and accurately annotated EgoHaFL dataset, the assumption that language instructions reliably encode task intent for hand motion, and standard supervised training of an autoregressive transformer-style model whose hyperparameters are not detailed in the abstract.

free parameters (1)

ROI memory size and attention window
Hyperparameters controlling temporal context length and hand-centric focus that are tuned during training.

axioms (1)

domain assumption Language instructions provide sufficient task intent to condition future hand states
Invoked in the motivation and framework description as the reason for adding language guidance.

pith-pipeline@v0.9.0 · 5799 in / 1329 out tokens · 47420 ms · 2026-05-21T17:48:20.700487+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

SFHand autoregressively predicts ... from a continuous stream of video and language instructions. Our framework combines a streaming autoregressive architecture with an ROI-enhanced memory layer
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_strictMono_of_one_lt unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

ROI-enhanced memory layer ... fixed-size FIFO queue ... additive bias ... α·M

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?
cs.AI 2026-05 unverdicted novelty 7.0

Introduces the Grounded Personality Reasoning task and MM-OCEAN dataset to show that MLLMs frequently produce correct Big Five personality ratings without grounding them in observable video evidence.
SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

SocialDirector uses spatiotemporal actor masking and directional reweighting on cross-attention maps to reduce actor-action mismatches and improve target-directed interactions in generated multi-person videos.

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · cited by 2 Pith papers · 2 internal anchors

[1]

Hierarq: Task-aware hierarchical q-former for enhanced video understanding

Shehreen Azad, Vibhav Vineet, and Yogesh Singh Rawat. Hierarq: Task-aware hierarchical q-former for enhanced video understanding. InProceedings of the Computer Vi- sion and Pattern Recognition Conference, pages 8545–8556,

work page
[2]

Uncertainty-aware state space transformer for egocentric 3d hand trajectory forecasting

Wentao Bao, Lele Chen, Libing Zeng, Zhong Li, Yi Xu, Junsong Yuan, and Yu Kong. Uncertainty-aware state space transformer for egocentric 3d hand trajectory forecasting. In Proceedings of the IEEE/CVF international conference on computer vision, pages 13702–13711, 2023. 1, 2, 5, 6

work page 2023
[3]

Recur- rent memory transformer.Advances in Neural Information Processing Systems, 35:11079–11091, 2022

Aydar Bulatov, Yury Kuratov, and Mikhail Burtsev. Recur- rent memory transformer.Advances in Neural Information Processing Systems, 35:11079–11091, 2022. 3

work page 2022
[4]

What makes pre-trained visual representations successful for robust manipulation? arXiv preprint arXiv:2312.12444, 2023

Kaylee Burns, Zach Witzel, Jubayer Ibn Hamid, Tianhe Yu, Chelsea Finn, and Karol Hausman. What makes pre-trained visual representations successful for robust manipulation? arXiv preprint arXiv:2312.12444, 2023. 3

work page arXiv 2023
[5]

In- n-on: Scaling egocentric manipulation with in-the-wild and on-task data.arXiv preprint arXiv:2511.15704, 2025

Xiongyi Cai, Ri-Zhao Qiu, Geng Chen, Lai Wei, Isabella Liu, Tianshu Huang, Xuxin Cheng, and Xiaolong Wang. In- n-on: Scaling egocentric manipulation with in-the-wild and on-task data.arXiv preprint arXiv:2511.15704, 2025. 1

work page arXiv 2025
[6]

End-to- end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to- end object detection with transformers. InEuropean confer- ence on computer vision, pages 213–229. Springer, 2020. 3, 4

work page 2020
[7]

Vidbot: Learning generalizable 3d actions from in-the-wild 2d human videos for zero-shot robotic manipulation

Hanzhi Chen, Boyang Sun, Anran Zhang, Marc Pollefeys, and Stefan Leutenegger. Vidbot: Learning generalizable 3d actions from in-the-wild 2d human videos for zero-shot robotic manipulation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 27661–27672,

work page
[8]

End-to-end autonomous driving: Challenges and frontiers.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 2024

Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, An- dreas Geiger, and Hongyang Li. End-to-end autonomous driving: Challenges and frontiers.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 2024. 3

work page 2024
[9]

Arcap: Collecting high-quality human demon- strations for robot learning with augmented reality feedback

Sirui Chen, Chen Wang, Kaden Nguyen, Li Fei-Fei, and C Karen Liu. Arcap: Collecting high-quality human demon- strations for robot learning with augmented reality feedback. In2025 IEEE International Conference on Robotics and Au- tomation (ICRA), pages 8291–8298. IEEE, 2025. 1

work page 2025
[10]

Scaling egocentric vision: The epic-kitchens dataset

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European conference on computer vision (ECCV), pages 720–736, 2018. 3

work page 2018
[11]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. 3

work page 2009
[12]

Streaming video question-answering with in-context video kv-cache retrieval.arXiv preprint arXiv:2503.00540, 2025

Shangzhe Di, Zhelun Yu, Guanghao Zhang, Haoyuan Li, Tao Zhong, Hao Cheng, Bolin Li, Wanggui He, Fangxun Shu, and Hao Jiang. Streaming video question-answering with in-context video kv-cache retrieval.arXiv preprint arXiv:2503.00540, 2025. 3

work page arXiv 2025
[13]

Long-term recurrent convolutional net- works for visual recognition and description

Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional net- works for visual recognition and description. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 2625–2634, 2015. 3

work page 2015
[14]

Tripad: Touch input in ar on ordinary surfaces with hand tracking only

Camille Dupr ´e, Caroline Appert, St ´ephanie Rey, Houssem Saidi, and Emmanuel Pietriga. Tripad: Touch input in ar on ordinary surfaces with hand tracking only. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pages 1–18, 2024. 1

work page 2024
[15]

Generalized procrustes analysis.Psychome- trika, 40(1):33–51, 1975

John C Gower. Generalized procrustes analysis.Psychome- trika, 40(1):33–51, 1975. 5

work page 1975
[16]

Ego4d: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 18995–19012, 2022. 3, 5

work page 2022
[17]

Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 193...

work page 2024
[18]

Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning

Abhishek Gupta, Vikash Kumar, Corey Lynch, Sergey Levine, and Karol Hausman. Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning. InConference on Robot Learning, pages 1025–1037. PMLR,

work page
[19]

On pre-training for visuo-motor control: Re- visiting a learning-from-scratch baseline.arXiv preprint arXiv:2212.05749, 2022

Nicklas Hansen, Zhecheng Yuan, Yanjie Ze, Tongzhou Mu, Aravind Rajeswaran, Hao Su, Huazhe Xu, and Xiao- long Wang. On pre-training for visuo-motor control: Re- visiting a learning-from-scratch baseline.arXiv preprint arXiv:2212.05749, 2022. 3

work page arXiv 2022
[20]

Emag: ego-motion aware and generalizable 2d hand forecasting from egocentric videos

Masashi Hatano, Ryo Hachiuma, and Hideo Saito. Emag: ego-motion aware and generalizable 2d hand forecasting from egocentric videos. InEuropean Conference on Com- puter Vision, pages 119–136. Springer, 2024. 2

work page 2024
[21]

The invisible egohand: 3d hand forecasting through egobody pose estimation.arXiv preprint arXiv:2504.08654, 2025

Masashi Hatano, Zhifan Zhu, Hideo Saito, and Dima Damen. The invisible egohand: 3d hand forecasting through egobody pose estimation.arXiv preprint arXiv:2504.08654, 2025. 1, 2, 5, 6

work page arXiv 2025
[22]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 8

work page 2016
[23]

Egoexobench: A benchmark for first-and third-person view video understanding in mllms

Yuping He, Yifei Huang, Guo Chen, Baoqi Pei, Jilan Xu, Tong Lu, and Jiangmiao Pang. Egoexobench: A benchmark for first-and third-person view video understanding in mllms. arXiv preprint arXiv:2507.18342, 2025. 1

work page arXiv 2025
[24]

EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video

Ryan Hoque, Peide Huang, David J Yoon, Mouli Sivapu- rapu, and Jian Zhang. Egodex: Learning dexterous manip- ulation from large-scale egocentric video.arXiv preprint arXiv:2505.11709, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Hoimotion: Forecasting human motion during human-object interactions using egocentric 3d object bounding boxes.IEEE Transactions on Visualization and Computer Graphics, 2024

Zhiming Hu, Zheming Yin, Daniel Haeufle, Syn Schmitt, and Andreas Bulling. Hoimotion: Forecasting human motion during human-object interactions using egocentric 3d object bounding boxes.IEEE Transactions on Visualization and Computer Graphics, 2024. 1

work page 2024
[26]

Predicting gaze in egocentric video by learning task- dependent attention transition

Yifei Huang, Minjie Cai, Zhenqiang Li, and Yoichi Sato. Predicting gaze in egocentric video by learning task- dependent attention transition. InECCV, 2018. 1, 3

work page 2018
[27]

Mutual context network for jointly estimating egocen- tric gaze and action.IEEE Transactions on Image Process- ing, 29:7795–7806, 2020

Yifei Huang, Minjie Cai, Zhenqiang Li, Feng Lu, and Yoichi Sato. Mutual context network for jointly estimating egocen- tric gaze and action.IEEE Transactions on Image Process- ing, 29:7795–7806, 2020. 1

work page 2020
[28]

Improving action segmentation via graph-based temporal reasoning

Yifei Huang, Yusuke Sugano, and Yoichi Sato. Improving action segmentation via graph-based temporal reasoning. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 14024–14034, 2020. 3

work page 2020
[29]

Compound pro- totype matching for few-shot action recognition

Yifei Huang, Lijin Yang, and Yoichi Sato. Compound pro- totype matching for few-shot action recognition. InECCV,

work page
[30]

Egoexolearn: A dataset for bridging asyn- chronous ego-and exo-centric view of procedural activities in real world

Yifei Huang, Guo Chen, Jilan Xu, Mingfang Zhang, Li- jin Yang, Baoqi Pei, Hongjie Zhang, Lu Dong, Yali Wang, Limin Wang, et al. Egoexolearn: A dataset for bridging asyn- chronous ego-and exo-centric view of procedural activities in real world. InCVPR, pages 22072–22086, 2024. 3

work page 2024
[31]

Yifei Huang, Jilan Xu, Baoqi Pei, Lijin Yang, Mingfang Zhang, Yuping He, Guo Chen, Xinyuan Chen, Yaohui Wang, Zheng Nie, et al. Vinci: A real-time smart assistant based on egocentric vision-language model for portable devices.Pro- ceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 9(3):1–33, 2025. 1

work page 2025
[32]

Language-driven representation learning for robotics

Siddharth Karamcheti, Suraj Nair, Annie S Chen, Thomas Kollar, Chelsea Finn, Dorsa Sadigh, and Percy Liang. Language-driven representation learning for robotics. In Robotics: Science and Systems, 2023. 3, 5, 8

work page 2023
[33]

Simple but effective: Clip embed- dings for embodied ai

Apoorv Khandelwal, Luca Weihs, Roozbeh Mottaghi, and Aniruddha Kembhavi. Simple but effective: Clip embed- dings for embodied ai. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 14829–14838, 2022. 3

work page 2022
[34]

Infinipot-v: Memory-constrained kv cache com- pression for streaming video understanding.arXiv preprint arXiv:2506.15745, 2025

Minsoo Kim, Kyuhong Shim, Jungwook Choi, and Simyung Chang. Infinipot-v: Memory-constrained kv cache com- pression for streaming video understanding.arXiv preprint arXiv:2506.15745, 2025. 3

work page arXiv 2025
[35]

The hungarian method for the assignment problem.Naval research logistics quarterly, 2(1-2):83–97,

Harold W Kuhn. The hungarian method for the assignment problem.Naval research logistics quarterly, 2(1-2):83–97,

work page
[36]

Handy ar: Markerless in- spection of augmented reality objects using fingertip track- ing

Taehee Lee and Tobias Hollerer. Handy ar: Markerless in- spection of augmented reality objects using fingertip track- ing. In2007 11th IEEE International Symposium on Wear- able Computers, pages 83–90. IEEE, 2007. 1

work page 2007
[37]

Online human action de- tection using joint classification-regression recurrent neural networks

Yanghao Li, Cuiling Lan, Junliang Xing, Wenjun Zeng, Chunfeng Yuan, and Jiaying Liu. Online human action de- tection using joint classification-regression recurrent neural networks. InEuropean conference on computer vision, pages 203–220. Springer, 2016. 3

work page 2016
[38]

Point cloud classification via learnable memory bank

Lisa Liu, William Y Wang, and Pingping Cai. Point cloud classification via learnable memory bank. InInterna- tional Conference on Multimedia Modeling, pages 216–229. Springer, 2024. 3

work page 2024
[39]

Fore- casting human-object interaction: joint prediction of motor attention and actions in first person video

Miao Liu, Siyu Tang, Yin Li, and James M Rehg. Fore- casting human-object interaction: joint prediction of motor attention and actions in first person video. InEuropean con- ference on computer vision, pages 704–721. Springer, 2020. 2

work page 2020
[40]

Uvagaze: Unsupervised 1-to- 2 views adaptation for gaze estimation

Ruicong Liu and Feng Lu. Uvagaze: Unsupervised 1-to- 2 views adaptation for gaze estimation. InProceedings of the AAAI Conference on Artificial Intelligence, pages 3693– 3701, 2024. 1

work page 2024
[41]

Pnp- ga+: Plug-and-play domain adaptation for gaze estimation using model variants.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(5):3707–3721, 2024

Ruicong Liu, Yunfei Liu, Haofei Wang, and Feng Lu. Pnp- ga+: Plug-and-play domain adaptation for gaze estimation using model variants.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(5):3707–3721, 2024. 1

work page 2024
[42]

Single-to-dual-view adaptation for egocentric 3d hand pose estimation

Ruicong Liu, Takehiko Ohkawa, Mingfang Zhang, and Yoichi Sato. Single-to-dual-view adaptation for egocentric 3d hand pose estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 677–686, 2024. 1

work page 2024
[43]

From gaze jitter to domain adaptation: Generalizing gaze estimation by manip- ulating high-frequency components.International Journal of Computer Vision, 133(3):1290–1305, 2025

Ruicong Liu, Haofei Wang, and Feng Lu. From gaze jitter to domain adaptation: Generalizing gaze estimation by manip- ulating high-frequency components.International Journal of Computer Vision, 133(3):1290–1305, 2025. 1

work page 2025
[44]

Joint hand motion and interaction hotspots prediction from egocentric videos

Shaowei Liu, Subarna Tripathi, Somdeb Majumdar, and Xi- aolong Wang. Joint hand motion and interaction hotspots prediction from egocentric videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3282–3292, 2022. 2

work page 2022
[45]

Goal-oriented gaze estimation for zero-shot learning

Yang Liu, Lei Zhou, Xiao Bai, Yifei Huang, Lin Gu, Jun Zhou, and Tatsuya Harada. Goal-oriented gaze estimation for zero-shot learning. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 3794–3803, 2021. 4

work page 2021
[46]

Realdex: Towards human- like grasping for robotic dexterous hand.arXiv preprint arXiv:2402.13853, 2024

Yumeng Liu, Yaxun Yang, Youzhuo Wang, Xiaofei Wu, Ji- amin Wang, Yichen Yao, S ¨oren Schwertfeger, Sibei Yang, Wenping Wang, Jingyi Yu, et al. Realdex: Towards human- like grasping for robotic dexterous hand.arXiv preprint arXiv:2402.13853, 2024. 1

work page arXiv 2024
[47]

Being-h0: vision-language-action pretraining from large-scale human videos.arXiv preprint arXiv:2507.15597, 2025

Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, and Zongqing Lu. Being-h0: vision-language-action pre- training from large-scale human videos.arXiv preprint arXiv:2507.15597, 2025. 1

work page arXiv 2025
[48]

Madiff: Motion-aware mamba diffusion models for hand trajectory prediction on egocentric videos.arXiv preprint arXiv:2409.02638, 2024

Junyi Ma, Xieyuanli Chen, Wentao Bao, Jingyi Xu, and Hes- heng Wang. Madiff: Motion-aware mamba diffusion models for hand trajectory prediction on egocentric videos.arXiv preprint arXiv:2409.02638, 2024. 2

work page arXiv 2024
[49]

Diff-ip2d: Diffusion-based hand-object interaction predic- tion on egocentric videos.arXiv preprint arXiv:2405.04370,

Junyi Ma, Jingyi Xu, Xieyuanli Chen, and Hesheng Wang. Diff-ip2d: Diffusion-based hand-object interaction predic- tion on egocentric videos.arXiv preprint arXiv:2405.04370,

work page arXiv
[50]

VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training

Yecheng Jason Ma, Shagun Sodhani, Dinesh Jayaraman, Os- bert Bastani, Vikash Kumar, and Amy Zhang. Vip: Towards universal visual reward and representation via value-implicit pre-training.arXiv preprint arXiv:2210.00030, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[51]

Liv: Language-image represen- tations and rewards for robotic control

Yecheng Jason Ma, Vikash Kumar, Amy Zhang, Osbert Bas- tani, and Dinesh Jayaraman. Liv: Language-image represen- tations and rewards for robotic control. InInternational Con- ference on Machine Learning, pages 23301–23320. PMLR,

work page
[52]

R3m: A universal visual repre- sentation for robot manipulation

Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual repre- sentation for robot manipulation. InConference on Robot Learning, pages 892–909. PMLR, 2023. 3, 5, 8

work page 2023
[53]

Install: Context-aware instructional task assistance with multi-modal large language models.arXiv preprint arXiv:2501.12231, 2025

Pha Nguyen, Sailik Sengupta, Girik Malik, Arshit Gupta, and Bonan Min. Install: Context-aware instructional task assistance with multi-modal large language models.arXiv preprint arXiv:2501.12231, 2025. 1

work page arXiv 2025
[54]

Actionvos: Actions as prompts for video object segmentation

Liangyang Ouyang, Ruicong Liu, Yifei Huang, Ryosuke Furuta, and Yoichi Sato. Actionvos: Actions as prompts for video object segmentation. InEuropean Conference on Computer Vision, pages 216–235. Springer, 2024. 1

work page 2024
[55]

The unsurprising effectiveness of pre- trained vision models for control

Simone Parisi, Aravind Rajeswaran, Senthil Purushwalkam, and Abhinav Gupta. The unsurprising effectiveness of pre- trained vision models for control. Ininternational con- ference on machine learning, pages 17359–17371. PMLR,

work page
[56]

Recon- structing hands in 3d with transformers

Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Recon- structing hands in 3d with transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9826–9836, 2024. 5, 6, 7

work page 2024
[57]

Modeling fine-grained hand-object dynamics for egocentric video representation learning

Baoqi Pei, Yifei Huang, Jilan Xu, Guo Chen, Yuping He, Lijin Yang, Yali Wang, Weidi Xie, Yu Qiao, Fei Wu, et al. Modeling fine-grained hand-object dynamics for egocentric video representation learning. InInternational Conference on Learning Representations, 2025. 4, 5

work page 2025
[58]

Egothinker: Un- veiling egocentric reasoning with spatio-temporal cot.arXiv preprint arXiv:2510.23569, 2025

Baoqi Pei, Yifei Huang, Jilan Xu, Yuping He, Guo Chen, Fei Wu, Yu Qiao, and Jiangmiao Pang. Egothinker: Un- veiling egocentric reasoning with spatio-temporal cot.arXiv preprint arXiv:2510.23569, 2025. 1

work page arXiv 2025
[59]

Hand interfaces: Using hands to imitate objects in ar/vr for expressive interactions

Siyou Pei, Alexander Chen, Jaewook Lee, and Yang Zhang. Hand interfaces: Using hands to imitate objects in ar/vr for expressive interactions. InProceedings of the 2022 CHI con- ference on human factors in computing systems, pages 1–16,

work page 2022
[60]

Streaming long video understanding with large language models.Advances in Neu- ral Information Processing Systems, 37:119336–119360,

Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Shuan- grui Ding, Dahua Lin, and Jiaqi Wang. Streaming long video understanding with large language models.Advances in Neu- ral Information Processing Systems, 37:119336–119360,

work page
[61]

Arnnotate: An augmented reality inter- face for collecting custom dataset of 3d hand-object inter- action pose estimation

Xun Qian, Fengming He, Xiyun Hu, Tianyi Wang, and Karthik Ramani. Arnnotate: An augmented reality inter- face for collecting custom dataset of 3d hand-object inter- action pose estimation. InProceedings of the 35th Annual ACM Symposium on User Interface Software and Technol- ogy, pages 1–14, 2022. 1

work page 2022
[62]

Language models are unsu- pervised multitask learners.OpenAI blog, 1(8):9, 2019

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsu- pervised multitask learners.OpenAI blog, 1(8):9, 2019. 4

work page 2019
[63]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 3, 5, 8

work page 2021
[64]

Real-world robot learn- ing with masked visual pre-training

Ilija Radosavovic, Tete Xiao, Stephen James, Pieter Abbeel, Jitendra Malik, and Trevor Darrell. Real-world robot learn- ing with masked visual pre-training. InConference on Robot Learning, pages 416–426. PMLR, 2023. 3, 5, 8

work page 2023
[65]

Learning complex dexterous manipulation with deep reinforcement learning and demonstrations

Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giu- lia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. InRobotics: Science and Systems, 2018. 2, 5

work page 2018
[66]

Generalized in- tersection over union: A metric and a loss for bounding box regression

Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized in- tersection over union: A metric and a loss for bounding box regression. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 658–666,

work page
[67]

Em- bodied hands: Modeling and capturing hands and bodies to- gether

Javier Romero, Dimitrios Tzionas, and Michael J Black. Em- bodied hands: Modeling and capturing hands and bodies to- gether.arXiv preprint arXiv:2201.02610, 2022. 3, 5

work page arXiv 2022
[68]

Neu- ral machine translation of rare words with subword units

Rico Sennrich, Barry Haddow, and Alexandra Birch. Neu- ral machine translation of rare words with subword units. InProceedings of the 54th Annual Meeting of the Associa- tion for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, 2016. 5

work page 2016
[69]

Time-contrastive networks: Self-supervised learn- ing from video

Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine, and Google Brain. Time-contrastive networks: Self-supervised learn- ing from video. In2018 IEEE international conference on robotics and automation (ICRA), pages 1134–1141. IEEE,

work page
[70]

Online detection of action start in untrimmed, streaming videos

Zheng Shou, Junting Pan, Jonathan Chan, Kazuyuki Miyazawa, Hassan Mansour, Anthony Vetro, Xavier Giro-i Nieto, and Shih-Fu Chang. Online detection of action start in untrimmed, streaming videos. InProceedings of the Eu- ropean conference on computer vision (ECCV), pages 534– 551, 2018. 3

work page 2018
[71]

Cliport: What and where pathways for robotic manipulation

Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Cliport: What and where pathways for robotic manipulation. InCon- ference on robot learning, pages 894–906. PMLR, 2022. 3

work page 2022
[72]

Oadtr: Online action detection with transformers

Xiang Wang, Shiwei Zhang, Zhiwu Qing, Yuanjie Shao, Zhengrong Zuo, Changxin Gao, and Nong Sang. Oadtr: Online action detection with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 7565–7575, 2021. 3

work page 2021
[73]

Egovid-5m: A large-scale video-action dataset for egocentric video generation.arXiv preprint arXiv:2411.08380, 2024

Xiaofeng Wang, Kang Zhao, Feng Liu, Jiayu Wang, Gu- osheng Zhao, Xiaoyi Bao, Zheng Zhu, Yingya Zhang, and Xingang Wang. Egovid-5m: A large-scale video-action dataset for egocentric video generation.arXiv preprint arXiv:2411.08380, 2024. 5

work page arXiv 2024
[74]

Memvit: Memory-augmented multiscale vision transformer for efficient long-term video recognition

Chao-Yuan Wu, Yanghao Li, Karttikeya Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. Memvit: Memory-augmented multiscale vision transformer for efficient long-term video recognition. InProceedings of the ieee/cvf conference on computer vision and pattern recognition, pages 13587–13597, 2022. 3

work page 2022
[75]

Policy pre-training for autonomous driv- ing via self-supervised geometric modeling.arXiv preprint arXiv:2301.01006, 2023

Penghao Wu, Li Chen, Hongyang Li, Xiaosong Jia, Junchi Yan, and Yu Qiao. Policy pre-training for autonomous driv- ing via self-supervised geometric modeling.arXiv preprint arXiv:2301.01006, 2023. 3

work page arXiv 2023
[76]

Timegazer: Temporal modeling of predic- tive gaze stabilization for ar interaction.arXiv preprint arXiv:2510.01561, 2025

Yaozheng Xia, Zaiping Zhu, Bo Pang, Shaorong Wang, and Sheng Li. Timegazer: Temporal modeling of predic- tive gaze stabilization for ar interaction.arXiv preprint arXiv:2510.01561, 2025. 1

work page arXiv 2025
[77]

Egoexo- gen: Ego-centric video prediction by watching exo-centric videos.arXiv preprint arXiv:2504.11732, 2025

Jilan Xu, Yifei Huang, Baoqi Pei, Junlin Hou, Qingqiu Li, Guo Chen, Yuejie Zhang, Rui Feng, and Weidi Xie. Egoexo- gen: Ego-centric video prediction by watching exo-centric videos.arXiv preprint arXiv:2504.11732, 2025. 1

work page arXiv 2025
[78]

Temporal recurrent networks for online action detection

Mingze Xu, Mingfei Gao, Yi-Ting Chen, Larry S Davis, and David J Crandall. Temporal recurrent networks for online action detection. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 5532–5541,

work page
[79]

Interact before align: Leveraging cross-modal knowledge for domain adaptive action recognition

Lijin Yang, Yifei Huang, Yusuke Sugano, and Yoichi Sato. Interact before align: Leveraging cross-modal knowledge for domain adaptive action recognition. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14722–14732, 2022. 3

work page 2022
[80]

Deco: Decomposi- tion and reconstruction for compositional temporal ground- ing via coarse-to-fine contrastive ranking

Lijin Yang, Quan Kong, Hsuan-Kung Yang, Wadim Kehl, Yoichi Sato, and Norimasa Kobori. Deco: Decomposi- tion and reconstruction for compositional temporal ground- ing via coarse-to-fine contrastive ranking. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23130–23140, 2023. 3

work page 2023

Showing first 80 references.

[1] [1]

Hierarq: Task-aware hierarchical q-former for enhanced video understanding

Shehreen Azad, Vibhav Vineet, and Yogesh Singh Rawat. Hierarq: Task-aware hierarchical q-former for enhanced video understanding. InProceedings of the Computer Vi- sion and Pattern Recognition Conference, pages 8545–8556,

work page

[2] [2]

Uncertainty-aware state space transformer for egocentric 3d hand trajectory forecasting

Wentao Bao, Lele Chen, Libing Zeng, Zhong Li, Yi Xu, Junsong Yuan, and Yu Kong. Uncertainty-aware state space transformer for egocentric 3d hand trajectory forecasting. In Proceedings of the IEEE/CVF international conference on computer vision, pages 13702–13711, 2023. 1, 2, 5, 6

work page 2023

[3] [3]

Recur- rent memory transformer.Advances in Neural Information Processing Systems, 35:11079–11091, 2022

Aydar Bulatov, Yury Kuratov, and Mikhail Burtsev. Recur- rent memory transformer.Advances in Neural Information Processing Systems, 35:11079–11091, 2022. 3

work page 2022

[4] [4]

What makes pre-trained visual representations successful for robust manipulation? arXiv preprint arXiv:2312.12444, 2023

Kaylee Burns, Zach Witzel, Jubayer Ibn Hamid, Tianhe Yu, Chelsea Finn, and Karol Hausman. What makes pre-trained visual representations successful for robust manipulation? arXiv preprint arXiv:2312.12444, 2023. 3

work page arXiv 2023

[5] [5]

In- n-on: Scaling egocentric manipulation with in-the-wild and on-task data.arXiv preprint arXiv:2511.15704, 2025

Xiongyi Cai, Ri-Zhao Qiu, Geng Chen, Lai Wei, Isabella Liu, Tianshu Huang, Xuxin Cheng, and Xiaolong Wang. In- n-on: Scaling egocentric manipulation with in-the-wild and on-task data.arXiv preprint arXiv:2511.15704, 2025. 1

work page arXiv 2025

[6] [6]

End-to- end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to- end object detection with transformers. InEuropean confer- ence on computer vision, pages 213–229. Springer, 2020. 3, 4

work page 2020

[7] [7]

Vidbot: Learning generalizable 3d actions from in-the-wild 2d human videos for zero-shot robotic manipulation

Hanzhi Chen, Boyang Sun, Anran Zhang, Marc Pollefeys, and Stefan Leutenegger. Vidbot: Learning generalizable 3d actions from in-the-wild 2d human videos for zero-shot robotic manipulation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 27661–27672,

work page

[8] [8]

End-to-end autonomous driving: Challenges and frontiers.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 2024

Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, An- dreas Geiger, and Hongyang Li. End-to-end autonomous driving: Challenges and frontiers.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 2024. 3

work page 2024

[9] [9]

Arcap: Collecting high-quality human demon- strations for robot learning with augmented reality feedback

Sirui Chen, Chen Wang, Kaden Nguyen, Li Fei-Fei, and C Karen Liu. Arcap: Collecting high-quality human demon- strations for robot learning with augmented reality feedback. In2025 IEEE International Conference on Robotics and Au- tomation (ICRA), pages 8291–8298. IEEE, 2025. 1

work page 2025

[10] [10]

Scaling egocentric vision: The epic-kitchens dataset

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European conference on computer vision (ECCV), pages 720–736, 2018. 3

work page 2018

[11] [11]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. 3

work page 2009

[12] [12]

Streaming video question-answering with in-context video kv-cache retrieval.arXiv preprint arXiv:2503.00540, 2025

Shangzhe Di, Zhelun Yu, Guanghao Zhang, Haoyuan Li, Tao Zhong, Hao Cheng, Bolin Li, Wanggui He, Fangxun Shu, and Hao Jiang. Streaming video question-answering with in-context video kv-cache retrieval.arXiv preprint arXiv:2503.00540, 2025. 3

work page arXiv 2025

[13] [13]

Long-term recurrent convolutional net- works for visual recognition and description

Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional net- works for visual recognition and description. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 2625–2634, 2015. 3

work page 2015

[14] [14]

Tripad: Touch input in ar on ordinary surfaces with hand tracking only

Camille Dupr ´e, Caroline Appert, St ´ephanie Rey, Houssem Saidi, and Emmanuel Pietriga. Tripad: Touch input in ar on ordinary surfaces with hand tracking only. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pages 1–18, 2024. 1

work page 2024

[15] [15]

Generalized procrustes analysis.Psychome- trika, 40(1):33–51, 1975

John C Gower. Generalized procrustes analysis.Psychome- trika, 40(1):33–51, 1975. 5

work page 1975

[16] [16]

Ego4d: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 18995–19012, 2022. 3, 5

work page 2022

[17] [17]

Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 193...

work page 2024

[18] [18]

Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning

Abhishek Gupta, Vikash Kumar, Corey Lynch, Sergey Levine, and Karol Hausman. Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning. InConference on Robot Learning, pages 1025–1037. PMLR,

work page

[19] [19]

On pre-training for visuo-motor control: Re- visiting a learning-from-scratch baseline.arXiv preprint arXiv:2212.05749, 2022

Nicklas Hansen, Zhecheng Yuan, Yanjie Ze, Tongzhou Mu, Aravind Rajeswaran, Hao Su, Huazhe Xu, and Xiao- long Wang. On pre-training for visuo-motor control: Re- visiting a learning-from-scratch baseline.arXiv preprint arXiv:2212.05749, 2022. 3

work page arXiv 2022

[20] [20]

Emag: ego-motion aware and generalizable 2d hand forecasting from egocentric videos

Masashi Hatano, Ryo Hachiuma, and Hideo Saito. Emag: ego-motion aware and generalizable 2d hand forecasting from egocentric videos. InEuropean Conference on Com- puter Vision, pages 119–136. Springer, 2024. 2

work page 2024

[21] [21]

The invisible egohand: 3d hand forecasting through egobody pose estimation.arXiv preprint arXiv:2504.08654, 2025

Masashi Hatano, Zhifan Zhu, Hideo Saito, and Dima Damen. The invisible egohand: 3d hand forecasting through egobody pose estimation.arXiv preprint arXiv:2504.08654, 2025. 1, 2, 5, 6

work page arXiv 2025

[22] [22]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 8

work page 2016

[23] [23]

Egoexobench: A benchmark for first-and third-person view video understanding in mllms

Yuping He, Yifei Huang, Guo Chen, Baoqi Pei, Jilan Xu, Tong Lu, and Jiangmiao Pang. Egoexobench: A benchmark for first-and third-person view video understanding in mllms. arXiv preprint arXiv:2507.18342, 2025. 1

work page arXiv 2025

[24] [24]

EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video

Ryan Hoque, Peide Huang, David J Yoon, Mouli Sivapu- rapu, and Jian Zhang. Egodex: Learning dexterous manip- ulation from large-scale egocentric video.arXiv preprint arXiv:2505.11709, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025

[25] [25]

Hoimotion: Forecasting human motion during human-object interactions using egocentric 3d object bounding boxes.IEEE Transactions on Visualization and Computer Graphics, 2024

Zhiming Hu, Zheming Yin, Daniel Haeufle, Syn Schmitt, and Andreas Bulling. Hoimotion: Forecasting human motion during human-object interactions using egocentric 3d object bounding boxes.IEEE Transactions on Visualization and Computer Graphics, 2024. 1

work page 2024

[26] [26]

Predicting gaze in egocentric video by learning task- dependent attention transition

Yifei Huang, Minjie Cai, Zhenqiang Li, and Yoichi Sato. Predicting gaze in egocentric video by learning task- dependent attention transition. InECCV, 2018. 1, 3

work page 2018

[27] [27]

Mutual context network for jointly estimating egocen- tric gaze and action.IEEE Transactions on Image Process- ing, 29:7795–7806, 2020

Yifei Huang, Minjie Cai, Zhenqiang Li, Feng Lu, and Yoichi Sato. Mutual context network for jointly estimating egocen- tric gaze and action.IEEE Transactions on Image Process- ing, 29:7795–7806, 2020. 1

work page 2020

[28] [28]

Improving action segmentation via graph-based temporal reasoning

Yifei Huang, Yusuke Sugano, and Yoichi Sato. Improving action segmentation via graph-based temporal reasoning. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 14024–14034, 2020. 3

work page 2020

[29] [29]

Compound pro- totype matching for few-shot action recognition

Yifei Huang, Lijin Yang, and Yoichi Sato. Compound pro- totype matching for few-shot action recognition. InECCV,

work page

[30] [30]

Egoexolearn: A dataset for bridging asyn- chronous ego-and exo-centric view of procedural activities in real world

Yifei Huang, Guo Chen, Jilan Xu, Mingfang Zhang, Li- jin Yang, Baoqi Pei, Hongjie Zhang, Lu Dong, Yali Wang, Limin Wang, et al. Egoexolearn: A dataset for bridging asyn- chronous ego-and exo-centric view of procedural activities in real world. InCVPR, pages 22072–22086, 2024. 3

work page 2024

[31] [31]

Yifei Huang, Jilan Xu, Baoqi Pei, Lijin Yang, Mingfang Zhang, Yuping He, Guo Chen, Xinyuan Chen, Yaohui Wang, Zheng Nie, et al. Vinci: A real-time smart assistant based on egocentric vision-language model for portable devices.Pro- ceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 9(3):1–33, 2025. 1

work page 2025

[32] [32]

Language-driven representation learning for robotics

Siddharth Karamcheti, Suraj Nair, Annie S Chen, Thomas Kollar, Chelsea Finn, Dorsa Sadigh, and Percy Liang. Language-driven representation learning for robotics. In Robotics: Science and Systems, 2023. 3, 5, 8

work page 2023

[33] [33]

Simple but effective: Clip embed- dings for embodied ai

Apoorv Khandelwal, Luca Weihs, Roozbeh Mottaghi, and Aniruddha Kembhavi. Simple but effective: Clip embed- dings for embodied ai. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 14829–14838, 2022. 3

work page 2022

[34] [34]

Infinipot-v: Memory-constrained kv cache com- pression for streaming video understanding.arXiv preprint arXiv:2506.15745, 2025

Minsoo Kim, Kyuhong Shim, Jungwook Choi, and Simyung Chang. Infinipot-v: Memory-constrained kv cache com- pression for streaming video understanding.arXiv preprint arXiv:2506.15745, 2025. 3

work page arXiv 2025

[35] [35]

The hungarian method for the assignment problem.Naval research logistics quarterly, 2(1-2):83–97,

Harold W Kuhn. The hungarian method for the assignment problem.Naval research logistics quarterly, 2(1-2):83–97,

work page

[36] [36]

Handy ar: Markerless in- spection of augmented reality objects using fingertip track- ing

Taehee Lee and Tobias Hollerer. Handy ar: Markerless in- spection of augmented reality objects using fingertip track- ing. In2007 11th IEEE International Symposium on Wear- able Computers, pages 83–90. IEEE, 2007. 1

work page 2007

[37] [37]

Online human action de- tection using joint classification-regression recurrent neural networks

Yanghao Li, Cuiling Lan, Junliang Xing, Wenjun Zeng, Chunfeng Yuan, and Jiaying Liu. Online human action de- tection using joint classification-regression recurrent neural networks. InEuropean conference on computer vision, pages 203–220. Springer, 2016. 3

work page 2016

[38] [38]

Point cloud classification via learnable memory bank

Lisa Liu, William Y Wang, and Pingping Cai. Point cloud classification via learnable memory bank. InInterna- tional Conference on Multimedia Modeling, pages 216–229. Springer, 2024. 3

work page 2024

[39] [39]

Fore- casting human-object interaction: joint prediction of motor attention and actions in first person video

Miao Liu, Siyu Tang, Yin Li, and James M Rehg. Fore- casting human-object interaction: joint prediction of motor attention and actions in first person video. InEuropean con- ference on computer vision, pages 704–721. Springer, 2020. 2

work page 2020

[40] [40]

Uvagaze: Unsupervised 1-to- 2 views adaptation for gaze estimation

Ruicong Liu and Feng Lu. Uvagaze: Unsupervised 1-to- 2 views adaptation for gaze estimation. InProceedings of the AAAI Conference on Artificial Intelligence, pages 3693– 3701, 2024. 1

work page 2024

[41] [41]

Pnp- ga+: Plug-and-play domain adaptation for gaze estimation using model variants.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(5):3707–3721, 2024

Ruicong Liu, Yunfei Liu, Haofei Wang, and Feng Lu. Pnp- ga+: Plug-and-play domain adaptation for gaze estimation using model variants.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(5):3707–3721, 2024. 1

work page 2024

[42] [42]

Single-to-dual-view adaptation for egocentric 3d hand pose estimation

Ruicong Liu, Takehiko Ohkawa, Mingfang Zhang, and Yoichi Sato. Single-to-dual-view adaptation for egocentric 3d hand pose estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 677–686, 2024. 1

work page 2024

[43] [43]

From gaze jitter to domain adaptation: Generalizing gaze estimation by manip- ulating high-frequency components.International Journal of Computer Vision, 133(3):1290–1305, 2025

Ruicong Liu, Haofei Wang, and Feng Lu. From gaze jitter to domain adaptation: Generalizing gaze estimation by manip- ulating high-frequency components.International Journal of Computer Vision, 133(3):1290–1305, 2025. 1

work page 2025

[44] [44]

Joint hand motion and interaction hotspots prediction from egocentric videos

Shaowei Liu, Subarna Tripathi, Somdeb Majumdar, and Xi- aolong Wang. Joint hand motion and interaction hotspots prediction from egocentric videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3282–3292, 2022. 2

work page 2022

[45] [45]

Goal-oriented gaze estimation for zero-shot learning

Yang Liu, Lei Zhou, Xiao Bai, Yifei Huang, Lin Gu, Jun Zhou, and Tatsuya Harada. Goal-oriented gaze estimation for zero-shot learning. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 3794–3803, 2021. 4

work page 2021

[46] [46]

Realdex: Towards human- like grasping for robotic dexterous hand.arXiv preprint arXiv:2402.13853, 2024

Yumeng Liu, Yaxun Yang, Youzhuo Wang, Xiaofei Wu, Ji- amin Wang, Yichen Yao, S ¨oren Schwertfeger, Sibei Yang, Wenping Wang, Jingyi Yu, et al. Realdex: Towards human- like grasping for robotic dexterous hand.arXiv preprint arXiv:2402.13853, 2024. 1

work page arXiv 2024

[47] [47]

Being-h0: vision-language-action pretraining from large-scale human videos.arXiv preprint arXiv:2507.15597, 2025

Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, and Zongqing Lu. Being-h0: vision-language-action pre- training from large-scale human videos.arXiv preprint arXiv:2507.15597, 2025. 1

work page arXiv 2025

[48] [48]

Madiff: Motion-aware mamba diffusion models for hand trajectory prediction on egocentric videos.arXiv preprint arXiv:2409.02638, 2024

Junyi Ma, Xieyuanli Chen, Wentao Bao, Jingyi Xu, and Hes- heng Wang. Madiff: Motion-aware mamba diffusion models for hand trajectory prediction on egocentric videos.arXiv preprint arXiv:2409.02638, 2024. 2

work page arXiv 2024

[49] [49]

Diff-ip2d: Diffusion-based hand-object interaction predic- tion on egocentric videos.arXiv preprint arXiv:2405.04370,

Junyi Ma, Jingyi Xu, Xieyuanli Chen, and Hesheng Wang. Diff-ip2d: Diffusion-based hand-object interaction predic- tion on egocentric videos.arXiv preprint arXiv:2405.04370,

work page arXiv

[50] [50]

VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training

Yecheng Jason Ma, Shagun Sodhani, Dinesh Jayaraman, Os- bert Bastani, Vikash Kumar, and Amy Zhang. Vip: Towards universal visual reward and representation via value-implicit pre-training.arXiv preprint arXiv:2210.00030, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022

[51] [51]

Liv: Language-image represen- tations and rewards for robotic control

Yecheng Jason Ma, Vikash Kumar, Amy Zhang, Osbert Bas- tani, and Dinesh Jayaraman. Liv: Language-image represen- tations and rewards for robotic control. InInternational Con- ference on Machine Learning, pages 23301–23320. PMLR,

work page

[52] [52]

R3m: A universal visual repre- sentation for robot manipulation

Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual repre- sentation for robot manipulation. InConference on Robot Learning, pages 892–909. PMLR, 2023. 3, 5, 8

work page 2023

[53] [53]

Install: Context-aware instructional task assistance with multi-modal large language models.arXiv preprint arXiv:2501.12231, 2025

Pha Nguyen, Sailik Sengupta, Girik Malik, Arshit Gupta, and Bonan Min. Install: Context-aware instructional task assistance with multi-modal large language models.arXiv preprint arXiv:2501.12231, 2025. 1

work page arXiv 2025

[54] [54]

Actionvos: Actions as prompts for video object segmentation

Liangyang Ouyang, Ruicong Liu, Yifei Huang, Ryosuke Furuta, and Yoichi Sato. Actionvos: Actions as prompts for video object segmentation. InEuropean Conference on Computer Vision, pages 216–235. Springer, 2024. 1

work page 2024

[55] [55]

The unsurprising effectiveness of pre- trained vision models for control

Simone Parisi, Aravind Rajeswaran, Senthil Purushwalkam, and Abhinav Gupta. The unsurprising effectiveness of pre- trained vision models for control. Ininternational con- ference on machine learning, pages 17359–17371. PMLR,

work page

[56] [56]

Recon- structing hands in 3d with transformers

Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Recon- structing hands in 3d with transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9826–9836, 2024. 5, 6, 7

work page 2024

[57] [57]

Modeling fine-grained hand-object dynamics for egocentric video representation learning

Baoqi Pei, Yifei Huang, Jilan Xu, Guo Chen, Yuping He, Lijin Yang, Yali Wang, Weidi Xie, Yu Qiao, Fei Wu, et al. Modeling fine-grained hand-object dynamics for egocentric video representation learning. InInternational Conference on Learning Representations, 2025. 4, 5

work page 2025

[58] [58]

Egothinker: Un- veiling egocentric reasoning with spatio-temporal cot.arXiv preprint arXiv:2510.23569, 2025

Baoqi Pei, Yifei Huang, Jilan Xu, Yuping He, Guo Chen, Fei Wu, Yu Qiao, and Jiangmiao Pang. Egothinker: Un- veiling egocentric reasoning with spatio-temporal cot.arXiv preprint arXiv:2510.23569, 2025. 1

work page arXiv 2025

[59] [59]

Hand interfaces: Using hands to imitate objects in ar/vr for expressive interactions

Siyou Pei, Alexander Chen, Jaewook Lee, and Yang Zhang. Hand interfaces: Using hands to imitate objects in ar/vr for expressive interactions. InProceedings of the 2022 CHI con- ference on human factors in computing systems, pages 1–16,

work page 2022

[60] [60]

Streaming long video understanding with large language models.Advances in Neu- ral Information Processing Systems, 37:119336–119360,

Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Shuan- grui Ding, Dahua Lin, and Jiaqi Wang. Streaming long video understanding with large language models.Advances in Neu- ral Information Processing Systems, 37:119336–119360,

work page

[61] [61]

Arnnotate: An augmented reality inter- face for collecting custom dataset of 3d hand-object inter- action pose estimation

Xun Qian, Fengming He, Xiyun Hu, Tianyi Wang, and Karthik Ramani. Arnnotate: An augmented reality inter- face for collecting custom dataset of 3d hand-object inter- action pose estimation. InProceedings of the 35th Annual ACM Symposium on User Interface Software and Technol- ogy, pages 1–14, 2022. 1

work page 2022

[62] [62]

Language models are unsu- pervised multitask learners.OpenAI blog, 1(8):9, 2019

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsu- pervised multitask learners.OpenAI blog, 1(8):9, 2019. 4

work page 2019

[63] [63]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 3, 5, 8

work page 2021

[64] [64]

Real-world robot learn- ing with masked visual pre-training

Ilija Radosavovic, Tete Xiao, Stephen James, Pieter Abbeel, Jitendra Malik, and Trevor Darrell. Real-world robot learn- ing with masked visual pre-training. InConference on Robot Learning, pages 416–426. PMLR, 2023. 3, 5, 8

work page 2023

[65] [65]

Learning complex dexterous manipulation with deep reinforcement learning and demonstrations

Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giu- lia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. InRobotics: Science and Systems, 2018. 2, 5

work page 2018

[66] [66]

Generalized in- tersection over union: A metric and a loss for bounding box regression

Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized in- tersection over union: A metric and a loss for bounding box regression. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 658–666,

work page

[67] [67]

Em- bodied hands: Modeling and capturing hands and bodies to- gether

Javier Romero, Dimitrios Tzionas, and Michael J Black. Em- bodied hands: Modeling and capturing hands and bodies to- gether.arXiv preprint arXiv:2201.02610, 2022. 3, 5

work page arXiv 2022

[68] [68]

Neu- ral machine translation of rare words with subword units

Rico Sennrich, Barry Haddow, and Alexandra Birch. Neu- ral machine translation of rare words with subword units. InProceedings of the 54th Annual Meeting of the Associa- tion for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, 2016. 5

work page 2016

[69] [69]

Time-contrastive networks: Self-supervised learn- ing from video

Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine, and Google Brain. Time-contrastive networks: Self-supervised learn- ing from video. In2018 IEEE international conference on robotics and automation (ICRA), pages 1134–1141. IEEE,

work page

[70] [70]

Online detection of action start in untrimmed, streaming videos

Zheng Shou, Junting Pan, Jonathan Chan, Kazuyuki Miyazawa, Hassan Mansour, Anthony Vetro, Xavier Giro-i Nieto, and Shih-Fu Chang. Online detection of action start in untrimmed, streaming videos. InProceedings of the Eu- ropean conference on computer vision (ECCV), pages 534– 551, 2018. 3

work page 2018

[71] [71]

Cliport: What and where pathways for robotic manipulation

Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Cliport: What and where pathways for robotic manipulation. InCon- ference on robot learning, pages 894–906. PMLR, 2022. 3

work page 2022

[72] [72]

Oadtr: Online action detection with transformers

Xiang Wang, Shiwei Zhang, Zhiwu Qing, Yuanjie Shao, Zhengrong Zuo, Changxin Gao, and Nong Sang. Oadtr: Online action detection with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 7565–7575, 2021. 3

work page 2021

[73] [73]

Egovid-5m: A large-scale video-action dataset for egocentric video generation.arXiv preprint arXiv:2411.08380, 2024

Xiaofeng Wang, Kang Zhao, Feng Liu, Jiayu Wang, Gu- osheng Zhao, Xiaoyi Bao, Zheng Zhu, Yingya Zhang, and Xingang Wang. Egovid-5m: A large-scale video-action dataset for egocentric video generation.arXiv preprint arXiv:2411.08380, 2024. 5

work page arXiv 2024

[74] [74]

Memvit: Memory-augmented multiscale vision transformer for efficient long-term video recognition

Chao-Yuan Wu, Yanghao Li, Karttikeya Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. Memvit: Memory-augmented multiscale vision transformer for efficient long-term video recognition. InProceedings of the ieee/cvf conference on computer vision and pattern recognition, pages 13587–13597, 2022. 3

work page 2022

[75] [75]

Policy pre-training for autonomous driv- ing via self-supervised geometric modeling.arXiv preprint arXiv:2301.01006, 2023

Penghao Wu, Li Chen, Hongyang Li, Xiaosong Jia, Junchi Yan, and Yu Qiao. Policy pre-training for autonomous driv- ing via self-supervised geometric modeling.arXiv preprint arXiv:2301.01006, 2023. 3

work page arXiv 2023

[76] [76]

Timegazer: Temporal modeling of predic- tive gaze stabilization for ar interaction.arXiv preprint arXiv:2510.01561, 2025

Yaozheng Xia, Zaiping Zhu, Bo Pang, Shaorong Wang, and Sheng Li. Timegazer: Temporal modeling of predic- tive gaze stabilization for ar interaction.arXiv preprint arXiv:2510.01561, 2025. 1

work page arXiv 2025

[77] [77]

Egoexo- gen: Ego-centric video prediction by watching exo-centric videos.arXiv preprint arXiv:2504.11732, 2025

Jilan Xu, Yifei Huang, Baoqi Pei, Junlin Hou, Qingqiu Li, Guo Chen, Yuejie Zhang, Rui Feng, and Weidi Xie. Egoexo- gen: Ego-centric video prediction by watching exo-centric videos.arXiv preprint arXiv:2504.11732, 2025. 1

work page arXiv 2025

[78] [78]

Temporal recurrent networks for online action detection

Mingze Xu, Mingfei Gao, Yi-Ting Chen, Larry S Davis, and David J Crandall. Temporal recurrent networks for online action detection. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 5532–5541,

work page

[79] [79]

Interact before align: Leveraging cross-modal knowledge for domain adaptive action recognition

Lijin Yang, Yifei Huang, Yusuke Sugano, and Yoichi Sato. Interact before align: Leveraging cross-modal knowledge for domain adaptive action recognition. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14722–14732, 2022. 3

work page 2022

[80] [80]

Deco: Decomposi- tion and reconstruction for compositional temporal ground- ing via coarse-to-fine contrastive ranking

Lijin Yang, Quan Kong, Hsuan-Kung Yang, Wadim Kehl, Yoichi Sato, and Norimasa Kobori. Deco: Decomposi- tion and reconstruction for compositional temporal ground- ing via coarse-to-fine contrastive ranking. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23130–23140, 2023. 3

work page 2023