pith. machine review for the scientific record.

arxiv: 2604.23803 · v1 · submitted 2026-04-26 · 💻 cs.CV

Recognition: unknown

Bringing a Personal Point of View: Evaluating Dynamic 3D Gaussian Splatting for Egocentric Scene Reconstruction

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 06:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords: egocentric video · 3D Gaussian Splatting · dynamic scene reconstruction · EgoExo4D dataset · novel view synthesis · static vs dynamic content · viewpoint comparison

The pith

Dynamic 3D Gaussian Splatting models produce lower-quality reconstructions from egocentric video than exocentric video, with the gap traced to static scene content.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates existing dynamic monocular 3D Gaussian Splatting methods using paired egocentric and exocentric videos from the EgoExo4D dataset. It reports consistently lower reconstruction quality in egocentric views. Analysis attributes this shortfall to inaccuracies in static background regions rather than moving elements. This matters for applications such as augmented reality and robotics that rely on personal viewpoints, indicating that current general-purpose techniques may require targeted adaptations for egocentric capture conditions.

Core claim

Using paired ego-exo recordings from the EgoExo4D dataset, we evaluate dynamic monocular 3DGS models and find that reconstruction quality is consistently lower in egocentric views. The difference in reconstruction quality, measured in peak signal-to-noise ratio, stems from the reconstruction of static, not dynamic, content. Our findings underscore current limitations and motivate the development of egocentric-specific approaches, while also highlighting the value of separately evaluating static and dynamic regions of a video.
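
To make the region-separated metric concrete, below is a minimal sketch of masked PSNR in Python with NumPy. The helper names and the per-frame averaging scheme are hypothetical; the paper reports masked PSNR (mPSNR) over static and dynamic regions separately but does not publish this exact implementation.

```python
import numpy as np

def masked_psnr(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray,
                max_val: float = 1.0) -> float:
    """PSNR restricted to pixels where mask is True.

    pred, gt: (H, W, 3) float arrays in [0, max_val].
    mask:     (H, W) boolean array selecting the region to score.
    """
    if not mask.any():
        return float("nan")  # frame contains no pixels of this region type
    mse = np.mean((pred[mask] - gt[mask]) ** 2)
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def static_dynamic_psnr(preds, gts, dynamic_masks):
    """Average masked PSNR over frames, scored separately per region.

    dynamic_masks marks moving content; its complement is the static region.
    """
    static, dynamic = [], []
    for pred, gt, dyn in zip(preds, gts, dynamic_masks):
        static.append(masked_psnr(pred, gt, ~dyn))
        dynamic.append(masked_psnr(pred, gt, dyn))
    return np.nanmean(static), np.nanmean(dynamic)
```

Scoring each region through its own mask is what lets the paper attribute the ego-exo PSNR gap to static content specifically rather than to moving elements.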

What carries the argument

Paired ego-exocentric video recordings from the EgoExo4D dataset, with separate evaluation of static and dynamic scene regions to isolate viewpoint effects.

Load-bearing premise

The paired ego and exo videos from the same scenes allow direct comparison without differences in lighting, camera setup, or scene coverage that could affect results independently of viewpoint.

What would settle it

New recordings of identical scenes from both viewpoints, under matched lighting, calibration, and full coverage, would settle it: equal reconstruction quality across views would show the gap is not due to the egocentric perspective, while a persistent gap would implicate the viewpoint itself.

Figures

Figures reproduced from arXiv: 2604.23803 by Jan van Gemert, Jan Warchocki, Jonas Kulhanek, Xi Wang.

Figure 1: First frame of selected scenes from the exo (left, first exo camera) and ego (right) views. First 4 scenes are random; last 2 are …
Figure 2: Ground truths and model renderings on 2 random and 1 EgoGaussian scene. EgoGaussian could only be evaluated on the ego view.
Figure 3: Camera linear velocity v̄_t plotted against mLPIPS. For legibility, only about 30% of points are shown; trend lines are computed from all points. As linear velocity increases, mLPIPS increases, which corresponds to worse reconstruction quality. (A sketch of one way to compute these velocities follows the figure list.)
Figure 5: First frame of 10 example scenes from the exo (left, first exo camera) and ego (right) views. First 8 scenes are random; last …
Figure 6: Camera linear and angular velocity results for mPSNR and mSSIM. Increasing either velocity correlates with a …
Figure 7: Camera linear and angular baseline results for mPSNR and mSSIM. Increasing any baseline leads to a visible …
Figure 8: Camera angular velocity ω̄_t plotted against mLPIPS. Additional trend lines are plotted. As angular velocity increases, mLPIPS increases, which corresponds to worse reconstruction quality.
Figure 9: Camera angular baseline plotted against mLPIPS (plot compares Def3DGS, 4DGS, and RTGS). Additional trend lines are plotted.
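
The velocity figures above plot per-frame camera linear velocity v̄_t and angular velocity ω̄_t against reconstruction quality. A minimal sketch of how such quantities can be derived from camera-to-world poses follows; the function name, the fps scaling, and the pose convention are assumptions, as the paper does not publish its exact computation.

```python
import numpy as np

def camera_velocities(c2w: np.ndarray, fps: float):
    """Per-frame linear and angular camera velocity from poses.

    c2w: (T, 4, 4) camera-to-world matrices for T consecutive frames.
    Returns (v, omega): linear velocity in scene units per second and
    angular velocity in radians per second, each of length T - 1.
    """
    centers = c2w[:, :3, 3]  # camera positions in world coordinates
    v = np.linalg.norm(np.diff(centers, axis=0), axis=1) * fps

    R = c2w[:, :3, :3]
    # Relative rotation between consecutive frames: R_rel = R_t^T @ R_{t+1}
    R_rel = np.einsum("tij,tik->tjk", R[:-1], R[1:])
    # Rotation angle from the trace: cos(theta) = (tr(R_rel) - 1) / 2
    cos_theta = np.clip((np.trace(R_rel, axis1=1, axis2=2) - 1.0) / 2.0,
                        -1.0, 1.0)
    omega = np.arccos(cos_theta) * fps
    return v, omega
```

Plotting these values against per-frame mLPIPS, as in Figures 3 and 8, is what reveals the correlation between faster camera motion and worse reconstruction.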
Original abstract

Egocentric video provides a unique view into human perception and interaction, with growing relevance for augmented reality, robotics, and assistive technologies. However, rapid camera motion and complex scene dynamics pose major challenges for 3D reconstruction from this perspective. While 3D Gaussian Splatting (3DGS) has become a state-of-the-art method for efficient, high-quality novel view synthesis, variants that focus on reconstructing dynamic scenes from monocular video are rarely evaluated on egocentric video. It remains unclear whether existing models generalize to this setting or if egocentric-specific solutions are needed. In this work, we evaluate dynamic monocular 3DGS models on egocentric and exocentric video using paired ego-exo recordings from the EgoExo4D dataset. We find that reconstruction quality is consistently lower in egocentric views. Analysis reveals that the difference in reconstruction quality, measured in peak signal-to-noise ratio, stems from the reconstruction of static, not dynamic, content. Our findings underscore current limitations and motivate the development of egocentric-specific approaches, while also highlighting the value of separately evaluating static and dynamic regions of a video.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates several dynamic monocular 3D Gaussian Splatting models on paired egocentric and exocentric videos from the EgoExo4D dataset. It reports consistently lower reconstruction quality (PSNR) for egocentric views and, via region-separated analysis, attributes the gap primarily to difficulties in reconstructing static scene content rather than dynamic elements. The work concludes that current methods have limitations in the egocentric setting and calls for egocentric-specific approaches while advocating separate static/dynamic evaluation.

Significance. If the central empirical pattern holds after validation of the paired comparison, the result is significant for computer vision and AR/robotics applications: it provides the first controlled ego-exo benchmark of dynamic 3DGS, demonstrates the value of region-separated metrics, and supplies concrete motivation for viewpoint-aware reconstruction techniques. The public-dataset evaluation and direct measurement against external video data are strengths that support reproducibility.

major comments (2)
  1. [Evaluation on EgoExo4D paired recordings] The attribution of the PSNR gap to static-content reconstruction (region-separated analysis) is load-bearing for the central claim, yet the manuscript provides no verification that lighting, exposure, intrinsic calibration, or scene coverage are equivalent across the paired ego-exo streams in EgoExo4D. Systematic differences in these factors would confound both the overall quality comparison and the static/dynamic separation.
  2. [Region-separated analysis] The region-separation procedure used to isolate static versus dynamic content is not described in sufficient detail (e.g., how masks or motion thresholds are computed and whether they are applied identically to ego and exo views). This detail is required to confirm that the reported static-content attribution is not an artifact of the separation method itself.
minor comments (2)
  1. [Abstract] The abstract states that 'variants that focus on reconstructing dynamic scenes from monocular video are rarely evaluated on egocentric video' but does not name the specific models or baselines used in the experiments; adding this list would improve clarity.
  2. [Results tables and figures] Figure captions and table headers should explicitly indicate whether PSNR values are computed on full frames or masked regions to avoid reader confusion when interpreting the static/dynamic breakdown.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our manuscript. We address each major comment below with clarifications and indicate the revisions we will make to improve the paper.

Point-by-point responses
  1. Referee: [Evaluation on EgoExo4D paired recordings] The attribution of the PSNR gap to static-content reconstruction (region-separated analysis) is load-bearing for the central claim, yet the manuscript provides no verification that lighting, exposure, intrinsic calibration, or scene coverage are equivalent across the paired ego-exo streams in EgoExo4D. Systematic differences in these factors would confound both the overall quality comparison and the static/dynamic separation.

    Authors: We appreciate this observation. The EgoExo4D dataset supplies synchronized ego-exo pairs with provided camera intrinsics and extrinsics, which we used directly. However, we did not include explicit verification or quantitative comparison of lighting/exposure equivalence in the original manuscript, as the dataset captures reflect real-world viewpoint differences. Scene coverage inherently varies between egocentric and exocentric views, which is central to the challenge we study. We will revise the paper to add a dedicated discussion subsection on potential confounding factors (including lighting and exposure variations), sequence selection criteria, and their possible influence on results, while emphasizing that the consistent PSNR gap across multiple sequences supports the static-content attribution. revision: partial

  2. Referee: [Region-separated analysis] The region-separation procedure used to isolate static versus dynamic content is not described in sufficient detail (e.g., how masks or motion thresholds are computed and whether they are applied identically to ego and exo views). This detail is required to confirm that the reported static-content attribution is not an artifact of the separation method itself.

    Authors: We agree that the region-separation method requires fuller description. Static/dynamic masks were derived from per-frame optical flow magnitude combined with absolute frame differencing, using a uniform motion threshold applied identically to both egocentric and exocentric views. These masks then partitioned the PSNR computation. We will expand the methods section in the revised manuscript with the precise algorithm, threshold values, implementation details, and explicit confirmation of identical application across view types to ensure reproducibility and rule out methodological artifacts. revision: yes
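
Below is a minimal sketch of the mask construction the rebuttal describes: optical-flow magnitude combined with absolute frame differencing under a uniform motion threshold, here in Python with OpenCV. The choice of Farneback flow, the threshold values, and the AND combination rule are assumptions; the rebuttal promises the precise algorithm and thresholds in the revision but does not state them here.

```python
import cv2
import numpy as np

def dynamic_mask(prev_gray: np.ndarray, curr_gray: np.ndarray,
                 flow_thresh: float = 1.0,
                 diff_thresh: float = 15.0) -> np.ndarray:
    """Binary mask of dynamic pixels for one frame pair.

    prev_gray, curr_gray: (H, W) uint8 grayscale frames.
    Threshold values are illustrative, not taken from the paper.
    """
    # Dense optical flow (Farneback); magnitude in pixels per frame.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=2)

    # Absolute frame differencing as a second motion cue.
    diff = cv2.absdiff(prev_gray, curr_gray).astype(np.float32)

    # A pixel counts as dynamic when both cues exceed their thresholds.
    return (mag > flow_thresh) & (diff > diff_thresh)
```

Applying the same function with the same thresholds to both ego and exo streams, as the rebuttal asserts was done, is what would rule out the separation method itself as the source of the static-content attribution.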

Circularity Check

0 steps flagged

Empirical evaluation with no derivations or self-referential predictions

full rationale

The paper reports direct PSNR measurements on paired ego-exo videos from the external EgoExo4D dataset, comparing reconstruction quality of existing dynamic 3DGS models between viewpoints and attributing the gap to static rather than dynamic content via region separation. No equations, fitted parameters renamed as predictions, load-bearing self-citations, uniqueness theorems, or ansatzes appear in the derivation chain. All claims rest on external video data and standard metrics rather than reducing to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical evaluation paper. It relies on the standard assumptions of 3D Gaussian Splatting (e.g., Gaussian primitives, differentiable rendering) and the validity of the EgoExo4D paired recordings, but introduces no new free parameters, ad-hoc axioms, or invented entities.

pith-pipeline@v0.9.0 · 5511 in / 1102 out tokens · 29033 ms · 2026-05-08T06:35:34.378125+00:00 · methodology

discussion (0)

