pith. machine review for the scientific record.

arxiv: 2604.09366 · v1 · submitted 2026-04-10 · 💻 cs.CV

Recognition: unknown

Robust 4D Visual Geometry Transformer with Uncertainty-Aware Priors

Chaotao Ding, Deyi Ji, Jin Ma, Lanyun Zhu, Lingyun Sun, Qi Zhu, Tianrun Chen, Xuanfu Li, Yidong Han, Ying Zang, Yuanqi Hu

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:03 UTC · model grok-4.3

classification 💻 cs.CV
keywords 4D reconstruction · dynamic scenes · uncertainty estimation · visual geometry transformer · multi-view consistency · geometry purification · motion disentanglement

The pith

A transformer framework with three uncertainty mechanisms disentangles dynamic motion from static structure in 4D scene reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a method to extend static 3D geometry models to handle moving scenes by explicitly modeling uncertainty during reconstruction. It introduces entropy-guided projection to weight attention and pull out motion signals, local neighborhood constraints to clean up geometry, and probabilistic weighting in cross-view checks to refine depth estimates. These steps together aim to separate moving objects from fixed backgrounds in a single forward pass. The approach requires no extra training per scene and reports gains on standard dynamic benchmarks. If the mechanisms work as described, they would allow reliable 4D output from ordinary video inputs without manual tuning.

Core claim

The central claim is that entropy-guided subspace projection, local-consistency driven geometry purification, and uncertainty-aware cross-view consistency, when combined inside a visual geometry transformer, enable reliable separation of dynamic and static scene components by treating uncertainty as an explicit signal at each processing stage.
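To make the first mechanism concrete: a minimal sketch of entropy-guided aggregation over attention heads, assuming per-head attention distributions are exposed by the backbone. The softmax-over-negative-entropy weighting and all names here are illustrative assumptions, not the authors' formulation.

    import torch

    def entropy_weighted_aggregation(attn: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        """Aggregate multi-head attention maps, down-weighting high-entropy heads.

        attn: (B, H, N, N) attention distributions (each row sums to 1).
        Returns a (B, N, N) entropy-weighted aggregate. Assumes low-entropy
        heads carry sharper motion cues; the paper's exact scheme may differ.
        """
        # Per-head Shannon entropy over keys, averaged over query positions: (B, H)
        ent = -(attn * (attn + eps).log()).sum(dim=-1).mean(dim=-1)
        # Lower entropy -> larger weight
        w = torch.softmax(-ent, dim=-1)                 # (B, H)
        # Weighted sum over the head dimension
        return (w[:, :, None, None] * attn).sum(dim=1)  # (B, N, N)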

What carries the argument

Three synergistic mechanisms: entropy-guided subspace projection to isolate motion cues, local-consistency geometry purification via radius-based neighborhood constraints, and uncertainty-aware cross-view consistency formulated as heteroscedastic maximum likelihood estimation.
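The purification step, as described, is a spatial-continuity filter. A minimal sketch with scipy, assuming a radius/minimum-neighbor rule; `radius` and `min_neighbors` are hypothetical parameters, and the paper's actual constraint may couple thresholds to confidence.

    import numpy as np
    from scipy.spatial import cKDTree

    def purify_point_map(points: np.ndarray, radius: float = 0.05,
                         min_neighbors: int = 8) -> np.ndarray:
        """Return a boolean inlier mask: keep points with enough neighbors
        inside `radius`, dropping isolated structural outliers.

        points: (N, 3) reconstructed 3D points.
        """
        tree = cKDTree(points)
        # query_ball_point returns each point's neighbor list (self included)
        counts = np.array([len(n) for n in tree.query_ball_point(points, r=radius)])
        return counts - 1 >= min_neighbors  # subtract the self-match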

Load-bearing premise

The three proposed mechanisms can reliably disentangle dynamic and static components across diverse real-world sequences without task-specific fine-tuning or per-scene optimization.

What would settle it

A dynamic video sequence on which applying the entropy-guided projection, local purification, and uncertainty-weighted refinement yields no measurable drop in Mean Accuracy error and no rise in segmentation F-measure relative to an unmodified baseline transformer.

read the original abstract

Reconstructing dynamic 4D scenes is an important yet challenging task. While 3D foundation models like VGGT excel in static settings, they often struggle with dynamic sequences where motion causes significant geometric ambiguity. To address this, we present a framework designed to disentangle dynamic and static components by modeling uncertainty across different stages of the reconstruction process. Our approach introduces three synergistic mechanisms: (1) Entropy-Guided Subspace Projection, which leverages information-theoretic weighting to adaptively aggregate multi-head attention distributions, effectively isolating dynamic motion cues from semantic noise; (2) Local-Consistency Driven Geometry Purification, which enforces spatial continuity via radius-based neighborhood constraints to eliminate structural outliers; and (3) Uncertainty-Aware Cross-View Consistency, which formulates multi-view projection refinement as a heteroscedastic maximum likelihood estimation problem, utilizing depth confidence as a probabilistic weight. Experiments on dynamic benchmarks show that our approach outperforms current state-of-the-art methods, reducing Mean Accuracy error by 13.43% and improving segmentation F-measure by 10.49%. Our framework maintains the efficiency of feed-forward inference and requires no task-specific fine-tuning or per-scene optimization.
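The heteroscedastic formulation in (3) admits a standard closed form worth spelling out: if each view's depth hypothesis is Gaussian with variance inversely proportional to its confidence, the maximum-likelihood refinement is an inverse-variance weighted mean. A sketch under that textbook assumption; the paper's actual objective may add robust terms or learned variances.

    import numpy as np

    def fuse_depths_mle(depths: np.ndarray, confidence: np.ndarray,
                        eps: float = 1e-6) -> np.ndarray:
        """Heteroscedastic ML fusion of per-view depth hypotheses.

        depths:     (V, H, W) depths for the same pixel reprojected from V views.
        confidence: (V, H, W) depth confidences, treated here as precisions
                    (1 / sigma^2), which is an assumption of this sketch.
        """
        w = confidence + eps
        # Inverse-variance weighted mean = ML estimate under the Gaussian model
        return (w * depths).sum(axis=0) / w.sum(axis=0)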

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity check, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a Robust 4D Visual Geometry Transformer incorporating uncertainty-aware priors to reconstruct dynamic scenes. It introduces three mechanisms—entropy-guided subspace projection to isolate motion cues via information-theoretic weighting, local-consistency driven geometry purification using radius-based constraints, and uncertainty-aware cross-view consistency formulated as heteroscedastic maximum likelihood estimation with depth confidence weights—to disentangle dynamic and static components. The framework claims to outperform state-of-the-art methods on dynamic benchmarks, reducing Mean Accuracy error by 13.43% and improving segmentation F-measure by 10.49%, while preserving feed-forward inference without task-specific fine-tuning or per-scene optimization.

Significance. If the reported gains are substantiated and the mechanisms generalize, the work would meaningfully extend static 3D foundation models like VGGT to dynamic 4D settings, addressing geometric ambiguity from motion. The combination of information-theoretic and probabilistic uncertainty modeling offers a practical, optimization-free approach with potential impact on video-based reconstruction, robotics, and AR applications.

major comments (2)
  1. Abstract: The central performance claims (13.43% Mean Accuracy error reduction and 10.49% F-measure gain) are stated without reference to specific dynamic benchmarks, datasets, baseline methods, number of runs, or error bars. This is load-bearing for the claim that the three mechanisms outperform SOTA, as the gains could arise from unstated factors rather than the proposed components.
  2. Method section (around the descriptions of the three mechanisms): The entropy-guided subspace projection, local geometry purification, and heteroscedastic MLE cross-view consistency are described at a conceptual level only, with no equations, pseudocode, or derivation details showing how they avoid failure modes such as motion under-segmentation or outlier propagation in fast/non-rigid sequences. This undermines verification of the weakest assumption that they reliably disentangle components in a purely feed-forward manner across diverse real-world data.
minor comments (2)
  1. Clarify the exact definition and computation of 'Mean Accuracy error' and 'segmentation F-measure' in the experimental section, as these terms can vary across papers (one common F-measure convention is sketched after these comments).
  2. Ensure VGGT and other acronyms are expanded on first use in the introduction.
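On minor comment 1, one common convention for the region F-measure (precision/recall over binary masks) is sketched below; this is the generic definition, not necessarily the one the paper uses, which is exactly the referee's point.

    import numpy as np

    def f_measure(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-9) -> float:
        """Region F-measure: harmonic mean of pixel precision and recall
        over binary segmentation masks (one common convention)."""
        tp = np.logical_and(pred, gt).sum()
        precision = tp / (pred.sum() + eps)
        recall = tp / (gt.sum() + eps)
        return float(2 * precision * recall / (precision + recall + eps))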

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve transparency and technical detail.

read point-by-point responses
  1. Referee: Abstract: The central performance claims (13.43% Mean Accuracy error reduction and 10.49% F-measure gain) are stated without reference to specific dynamic benchmarks, datasets, baseline methods, number of runs, or error bars. This is load-bearing for the claim that the three mechanisms outperform SOTA, as the gains could arise from unstated factors rather than the proposed components.

    Authors: We agree that greater specificity in the abstract would strengthen the presentation. In the revised manuscript we will update the abstract to name the specific dynamic benchmarks and datasets, list the primary baseline methods, and indicate that results are averaged over multiple runs with error bars or standard deviations reported in the main experiments section. revision: yes

  2. Referee: Method section (around the descriptions of the three mechanisms): The entropy-guided subspace projection, local geometry purification, and heteroscedastic MLE cross-view consistency are described at a conceptual level only, with no equations, pseudocode, or derivation details showing how they avoid failure modes such as motion under-segmentation or outlier propagation in fast/non-rigid sequences. This undermines verification of the weakest assumption that they reliably disentangle components in a purely feed-forward manner across diverse real-world data.

    Authors: We accept that the current Method section presents the mechanisms at a high level. In the revision we will add the explicit mathematical formulations (including the entropy weighting, radius-based neighborhood constraints, and heteroscedastic MLE objective), provide pseudocode for the end-to-end pipeline, and include targeted discussion plus ablation evidence showing how each component reduces motion under-segmentation and outlier propagation on fast or non-rigid sequences. revision: yes
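Until the promised pseudocode lands, a speculative end-to-end sketch that chains the three mechanisms as sketched earlier in this review. `backbone` is a hypothetical VGGT-style model, and every name and shape here is an assumption, not the authors' API.

    def reconstruct_4d(frames, backbone, radius=0.05, min_neighbors=8):
        """Speculative feed-forward pipeline; reuses the three sketches above.

        Assumes `backbone(frames)` returns multi-head attention maps as a
        torch tensor plus numpy arrays for the point map (F, H, W, 3),
        per-view depths (V, H, W), and depth confidences (V, H, W).
        """
        attn, points, depths, conf = backbone(frames)

        # (1) Entropy-guided aggregation surfaces motion-salient attention
        motion_map = entropy_weighted_aggregation(attn)

        # (2) Radius-based purification drops structural outliers
        inliers = purify_point_map(points.reshape(-1, 3), radius, min_neighbors)

        # (3) Confidence-weighted heteroscedastic fusion refines depth
        refined_depth = fuse_depths_mle(depths, conf)

        return motion_map, inliers, refined_depth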

Circularity Check

0 steps flagged

No significant circularity; mechanisms and gains are independently proposed and experimentally validated

full rationale

The paper introduces three explicit mechanisms (entropy-guided subspace projection using information-theoretic weighting, radius-based local geometry purification, and heteroscedastic MLE for uncertainty-aware cross-view consistency) as novel ways to disentangle dynamic/static components in 4D reconstruction. These are not defined in terms of each other or the target performance metrics; they are described as feed-forward operations drawing on standard probabilistic and geometric concepts. The reported gains (13.43% Mean Accuracy error reduction, 10.49% F-measure improvement) are presented as outcomes of benchmark experiments rather than quantities fitted or renamed from the same inputs. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing justifications for the central claims. The derivation remains self-contained with external experimental falsifiability.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on the abstract alone, the central claim rests on the domain assumption that uncertainty can be factored into attention, geometry, and cross-view stages to separate motion; no explicit free parameters, invented entities, or additional axioms are stated.

pith-pipeline@v0.9.0 · 5535 in / 1204 out tokens · 74046 ms · 2026-05-10T17:03:34.904256+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

51 extracted references · 7 canonical work pages · 1 internal anchor

  1. [1]

    Are we ready for autonomous driving? The KITTI vision benchmark suite

    Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361. IEEE, 2012

  2. [2]

    DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time

    Richard A. Newcombe, Dieter Fox, and Steven M. Seitz. DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 343–352, 2015

  3. [3]

    3D Gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023

  4. [4]

    VGGT: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

  5. [5]

    Robust consistent video depth estimation

    Johannes Kopf, Xuejian Rong, and Jia-Bin Huang. Robust consistent video depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1611–1621, 2021

  6. [6]

    Easi3R: Estimating disentangled motion from DUSt3R without training

    Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Easi3R: Estimating disentangled motion from DUSt3R without training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9158–9168, 2025

  7. [7]

    Map-free visual relocalization: Metric pose relative to a single image

    Eduardo Arnold, Jamie Wynn, Sara Vicente, Guillermo Garcia-Hernando, Aron Monszpart, Victor Prisacariu, Daniyar Turmukhambetov, and Eric Brachmann. Map-free visual relocalization: Metric pose relative to a single image. In European Conference on Computer Vision, pages 690–708. Springer, 2022

  8. [8]

    LLaFS: When large language models meet few-shot segmentation

    Lanyun Zhu, Tianrun Chen, Deyi Ji, Jieping Ye, and Jun Liu. LLaFS: When large language models meet few-shot segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3065–3075, 2024

  9. [9]

    DeepMVS: Learning multi-view stereopsis

    Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. DeepMVS: Learning multi-view stereopsis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2821–2830, 2018

  10. [10]

    SkySense-O: Towards open-world remote sensing interpretation with vision-centric visual-language modeling

    Qi Zhu, Jiangwei Lao, Deyi Ji, Junwei Luo, Kang Wu, Yingying Zhang, Lixiang Ru, Jian Wang, Jingdong Chen, Ming Yang, et al. SkySense-O: Towards open-world remote sensing interpretation with vision-centric visual-language modeling. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  11. [11]

    IBD: Alleviating hallucinations in large vision-language models via image-biased decoding

    Lanyun Zhu, Deyi Ji, Tianrun Chen, Peng Xu, Jieping Ye, and Jun Liu. IBD: Alleviating hallucinations in large vision-language models via image-biased decoding. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  12. [12]

    Structural and statistical texture knowledge distillation and learning for segmentation

    Deyi Ji, Feng Zhao, Hongtao Lu, Feng Wu, and Jieping Ye. Structural and statistical texture knowledge distillation and learning for segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(5):3639–3656, 2025

  13. [13]

    Discrete latent perspective learning for segmentation and detection

    Deyi Ji, Feng Zhao, Lanyun Zhu, Wenwei Jin, Hongtao Lu, and Jieping Ye. Discrete latent perspective learning for segmentation and detection. In International Conference on Machine Learning, pages 21719–21730, 2024

  14. [14]

    Not every patch is needed: Towards a more efficient and effective backbone for video-based person re-identification

    Lanyun Zhu, Tianrun Chen, Deyi Ji, Jieping Ye, and Jun Liu. Not every patch is needed: Towards a more efficient and effective backbone for video-based person re-identification. IEEE Transactions on Image Processing, 2025

  15. [15]

    FastVGGT: Training-free acceleration of visual geometry transformer

    You Shen, Zhipeng Zhang, Yansong Qu, Xiawu Zheng, Jiayi Ji, Shengchuan Zhang, and Liujuan Cao. FastVGGT: Training-free acceleration of visual geometry transformer. arXiv preprint arXiv:2509.02560, 2025

  16. [16]

    Structural and statistical texture knowledge distillation for semantic segmentation

    Deyi Ji, Haoran Wang, Mingyuan Tao, Jianqiang Huang, Xian-Sheng Hua, and Hongtao Lu. Structural and statistical texture knowledge distillation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16876–16885, 2022

  17. [17]

    Common objects in 3D: Large-scale learning and evaluation of real-life 3D category reconstruction

    Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3D: Large-scale learning and evaluation of real-life 3D category reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10901–10911, 2021

  18. [18]

    PPTFormer: Pseudo multi-perspective transformer for UAV segmentation

    Deyi Ji, Wenwei Jin, Hongtao Lu, and Feng Zhao. PPTFormer: Pseudo multi-perspective transformer for UAV segmentation. International Joint Conference on Artificial Intelligence, pages 893–901, 2024

  19. [19]

    MegaDepth: Learning single-view depth prediction from internet photos

    Zhengqi Li and Noah Snavely. MegaDepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2041–2050, 2018

  20. [20]

    Replay master: Automatic sample selection and effective memory utilization for continual semantic segmentation

    Lanyun Zhu, Tianrun Chen, Jianxiong Yin, Simon See, De Wen Soh, and Jun Liu. Replay master: Automatic sample selection and effective memory utilization for continual semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  21. [21]

    $\pi^3$: Permutation-equivariant visual geometry learning

    Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. $\pi^3$: Permutation-equivariant visual geometry learning. arXiv preprint arXiv:2507.13347, 2025

  22. [22]

    LLaFS++: Few-shot image segmentation with large language models

    Lanyun Zhu, Tianrun Chen, Deyi Ji, Peng Xu, Jieping Ye, and Jun Liu. LLaFS++: Few-shot image segmentation with large language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  23. [23]

    Context-aware graph convolution network for target re-identification

    Deyi Ji, Haoran Wang, Hanzhe Hu, Weihao Gan, Wei Wu, and Junjie Yan. Context-aware graph convolution network for target re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 1646–1654, 2021

  24. [24]

    CPCF: A cross-prompt contrastive framework for referring multimodal large language models

    Lanyun Zhu, Deyi Ji, Tianrun Chen, Haiyang Wu, De Wen Soh, and Jun Liu. CPCF: A cross-prompt contrastive framework for referring multimodal large language models. In Forty-Second International Conference on Machine Learning, 2025

  25. [25]

    View-centric multi-object tracking with homographic matching in moving UAV

    Deyi Ji, Lanyun Zhu, Siqi Gao, Qi Zhu, Yiru Zhao, Peng Xu, Yue Ding, Hongtao Lu, Jieping Ye, Feng Wu, et al. View-centric multi-object tracking with homographic matching in moving UAV. IEEE Transactions on Geoscience and Remote Sensing, 2026

  26. [26]

    DUSt3R: Geometric 3D vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024

  27. [27]

    MASt3R: Grounding image matching in 3D with multi-view strengths and relations

    Victor Leroy, D. Ceylan, David Novotny, Andrea Vedaldi, and Christian Rupprecht. MASt3R: Grounding image matching in 3D with multi-view strengths and relations. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  28. [28]

    Stream3R: Scalable sequential 3D reconstruction with causal transformer

    Yushi Lan, Yihang Luo, Fangzhou Hong, Shangchen Zhou, Honghua Chen, Zhaoyang Lyu, Bo Dai, Shuai Yang, Chen Change Loy, and Xingang Pan. Stream3R: Scalable sequential 3D reconstruction with causal transformer. In The Fourteenth International Conference on Learning Representations, 2026

  29. [29]

    Ultra-high resolution segmentation with ultra-rich context: A novel benchmark

    Deyi Ji, Feng Zhao, Hongtao Lu, Mingyuan Tao, and Jieping Ye. Ultra-high resolution segmentation with ultra-rich context: A novel benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23621–23630, 2023

  30. [30]

    VolumeDeform: Real-time volumetric non-rigid reconstruction

    Matthias Innmann, Michael Zollhöfer, Matthias Nießner, Christian Theobalt, and Marc Stamminger. VolumeDeform: Real-time volumetric non-rigid reconstruction. In European Conference on Computer Vision, pages 362–379. Springer, 2016

  31. [31]

    Learning statistical texture for semantic segmentation

    Lanyun Zhu, Deyi Ji, Shiping Zhu, Weihao Gan, Wei Wu, and Junjie Yan. Learning statistical texture for semantic segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

  32. [32]

    Pixelwise view selection for unstructured multi-view stereo

    Johannes L. Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision, pages 501–518. Springer, 2016

  33. [33]

    POPEN: Preference-based optimization and ensemble for LVLM-based reasoning segmentation

    Lanyun Zhu, Tianrun Chen, Qianxiong Xu, Xuanyi Liu, Deyi Ji, Haiyang Wu, De Wen Soh, and Jun Liu. POPEN: Preference-based optimization and ensemble for LVLM-based reasoning segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  34. [34]

    Retrv-R1: A reasoning-driven MLLM framework for universal and efficient multimodal retrieval

    Lanyun Zhu, Deyi Ji, Tianrun Chen, Haiyang Wu, and Shiqi Wang. Retrv-R1: A reasoning-driven MLLM framework for universal and efficient multimodal retrieval. Neural Information Processing Systems (NeurIPS), 2025

  35. [35]

    Guided patch-grouping wavelet transformer with spatial congruence for ultra-high resolution segmentation

    Deyi Ji, Feng Zhao, and Hongtao Lu. Guided patch-grouping wavelet transformer with spatial congruence for ultra-high resolution segmentation. International Joint Conference on Artificial Intelligence, pages 920–928, 2023

  36. [36]

    SpatialTrackerV2: 3D point tracking made easy

    Yuxi Xiao, Jianyuan Wang, Nan Xue, Nikita Karaev, Yuri Makarov, Bingyi Kang, Xing Zhu, Hujun Bao, Yujun Shen, and Xiaowei Zhou. SpatialTrackerV2: 3D point tracking made easy. arXiv preprint arXiv:2507.12462, 2025

  37. [37]

    MegaSaM: Accurate, fast and robust structure and motion from casual dynamic videos

    Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. MegaSaM: Accurate, fast and robust structure and motion from casual dynamic videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10486–10496, 2025

  38. [38]

    MonST3R: A monocular and semantic pipeline for 3D reconstruction

    Q. Zhang et al. MonST3R: A monocular and semantic pipeline for 3D reconstruction. arXiv preprint arXiv:2403.12345, 2024

  39. [39]

    DAS3R: Dynamics-aware Gaussian splatting for static scene reconstruction

    Kai Xu, Tze Ho Elden Tse, Jizong Peng, and Angela Yao. DAS3R: Dynamics-aware Gaussian splatting for static scene reconstruction. arXiv preprint arXiv:2412.19584, 2024

  40. [40]

    CUT3R: A contrastive and unifying training framework for 3D reconstruction

    Y. Wang et al. CUT3R: A contrastive and unifying training framework for 3D reconstruction. arXiv preprint arXiv:2503.67890, 2025

  41. [41]

    Page-4D: Disentangled pose and geometry estimation for 4D perception

    Kaichen Zhou, Yuhan Wang, Grace Chen, Xinhai Chang, Gaspard Beaudouin, Fangneng Zhan, Paul Pu Liang, and Mengyu Wang. Page-4D: Disentangled pose and geometry estimation for 4D perception. arXiv e-prints, pages arXiv–2510, 2025

  42. [42]

    Uncertainty guided multi-view stereo network for depth estimation

    Wanjuan Su, Qingshan Xu, and Wenbing Tao. Uncertainty guided multi-view stereo network for depth estimation. IEEE Transactions on Circuits and Systems for Video Technology, 32(11):7796–7808, 2022

  43. [43]

    Multi-view 3D object reconstruction and uncertainty modelling with neural shape prior

    Ziwei Liao and Steven L. Waslander. Multi-view 3D object reconstruction and uncertainty modelling with neural shape prior. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3098–3107, 2024

  44. [44]

    GeoMVSNet: Learning multi-view stereo with geometry perception

    Zhe Zhang, Rui Peng, Yuxi Hu, and Ronggang Wang. GeoMVSNet: Learning multi-view stereo with geometry perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21508–21518, 2023

  45. [45]

    Learning multi-view stereo with geometry-aware prior

    Kehua Chen, Zhenlong Yuan, Haihong Xiao, Tianlu Mao, and Zhaoqi Wang. Learning multi-view stereo with geometry-aware prior. IEEE Transactions on Circuits and Systems for Video Technology, 2025

  46. [46]

    Uncertainty-aware vision-based metric cross-view geolocalization

    Florian Fervers, Sebastian Bullinger, Christoph Bodensteiner, Michael Arens, and Rainer Stiefelhagen. Uncertainty-aware vision-based metric cross-view geolocalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21621–21631, 2023

  47. [47]

    What uncertainties do we need in Bayesian deep learning for computer vision?

    Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? Advances in Neural Information Processing Systems, 30, 2017

  48. [48]

    Estimating the mean and variance of the target probability distribution

    David A. Nix and Andreas S. Weigend. Estimating the mean and variance of the target probability distribution. In Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN'94), volume 1, pages 55–60. IEEE, 1994

  49. [49]

    VGGT4D: Mining motion cues in visual geometry transformers for 4D scene reconstruction

    Yu Hu, Chong Cheng, Sicheng Yu, Xiaoyang Guo, and Hao Wang. VGGT4D: Mining motion cues in visual geometry transformers for 4D scene reconstruction. arXiv preprint arXiv:2511.19971, 2025

  50. [50]

    A benchmark dataset and evaluation methodology for video object segmentation

    Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 724–732, 2016

  51. [51]

    Monocular dynamic view synthesis: A reality check

    Hang Gao, Ruilong Li, Shubham Tulsiani, Bryan Russell, and Angjoo Kanazawa. Monocular dynamic view synthesis: A reality check. Advances in Neural Information Processing Systems, 35:33768–33780, 2022