pith. machine review for the scientific record.

arxiv: 2604.09366 · v1 · submitted 2026-04-10 · 💻 cs.CV

Recognition: unknown

Robust 4D Visual Geometry Transformer with Uncertainty-Aware Priors

Chaotao Ding, Deyi Ji, Jin Ma, Lanyun Zhu, Lingyun Sun, Qi Zhu, Tianrun Chen, Xuanfu Li, Yidong Han, Ying Zang, Yuanqi Hu

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:03 UTC · model grok-4.3

classification 💻 cs.CV
keywords 4D reconstruction · dynamic scenes · uncertainty estimation · visual geometry transformer · multi-view consistency · geometry purification · motion disentanglement

The pith

A transformer framework with three uncertainty mechanisms disentangles dynamic motion from static structure in 4D scene reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a method to extend static 3D geometry models to handle moving scenes by explicitly modeling uncertainty during reconstruction. It introduces entropy-guided projection to weight attention and pull out motion signals, local neighborhood constraints to clean up geometry, and probabilistic weighting in cross-view checks to refine depth estimates. These steps together aim to separate moving objects from fixed backgrounds in a single forward pass. The approach requires no extra training per scene and reports gains on standard dynamic benchmarks. If the mechanisms work as described, they would allow reliable 4D output from ordinary video inputs without manual tuning.

Core claim

The central claim is that entropy-guided subspace projection, local-consistency driven geometry purification, and uncertainty-aware cross-view consistency, when combined inside a visual geometry transformer, enable reliable separation of dynamic and static scene components by treating uncertainty as an explicit signal at each processing stage.
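To make the first mechanism concrete: a minimal sketch of entropy-guided aggregation over attention heads, assuming per-head attention distributions are exposed by the backbone. The softmax-over-negative-entropy weighting and all names here are illustrative assumptions, not the authors' formulation.

    import torch

    def entropy_weighted_aggregation(attn: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        """Aggregate multi-head attention maps, down-weighting high-entropy heads.

        attn: (B, H, N, N) attention distributions (each row sums to 1).
        Returns a (B, N, N) entropy-weighted aggregate. Assumes low-entropy
        heads carry sharper motion cues; the paper's exact scheme may differ.
        """
        # Per-head Shannon entropy over keys, averaged over query positions: (B, H)
        ent = -(attn * (attn + eps).log()).sum(dim=-1).mean(dim=-1)
        # Lower entropy -> larger weight
        w = torch.softmax(-ent, dim=-1)                 # (B, H)
        # Weighted sum over the head dimension
        return (w[:, :, None, None] * attn).sum(dim=1)  # (B, N, N)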

What carries the argument

Three synergistic mechanisms: entropy-guided subspace projection to isolate motion cues, local-consistency geometry purification via radius-based neighborhood constraints, and uncertainty-aware cross-view consistency formulated as heteroscedastic maximum likelihood estimation.
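The purification step, as described, is a spatial-continuity filter. A minimal sketch with scipy, assuming a radius/minimum-neighbor rule; `radius` and `min_neighbors` are hypothetical parameters, and the paper's actual constraint may couple thresholds to confidence.

    import numpy as np
    from scipy.spatial import cKDTree

    def purify_point_map(points: np.ndarray, radius: float = 0.05,
                         min_neighbors: int = 8) -> np.ndarray:
        """Return a boolean inlier mask: keep points with enough neighbors
        inside `radius`, dropping isolated structural outliers.

        points: (N, 3) reconstructed 3D points.
        """
        tree = cKDTree(points)
        # query_ball_point returns each point's neighbor list (self included)
        counts = np.array([len(n) for n in tree.query_ball_point(points, r=radius)])
        return counts - 1 >= min_neighbors  # subtract the self-match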

Load-bearing premise

The three proposed mechanisms can reliably disentangle dynamic and static components across diverse real-world sequences without task-specific fine-tuning or per-scene optimization.

What would settle it

A dynamic video sequence on which applying the entropy-guided projection, local purification, and uncertainty-weighted refinement yields no measurable drop in Mean Accuracy error and no rise in segmentation F-measure relative to an unmodified baseline transformer.

read the original abstract

Reconstructing dynamic 4D scenes is an important yet challenging task. While 3D foundation models like VGGT excel in static settings, they often struggle with dynamic sequences where motion causes significant geometric ambiguity. To address this, we present a framework designed to disentangle dynamic and static components by modeling uncertainty across different stages of the reconstruction process. Our approach introduces three synergistic mechanisms: (1) Entropy-Guided Subspace Projection, which leverages information-theoretic weighting to adaptively aggregate multi-head attention distributions, effectively isolating dynamic motion cues from semantic noise; (2) Local-Consistency Driven Geometry Purification, which enforces spatial continuity via radius-based neighborhood constraints to eliminate structural outliers; and (3) Uncertainty-Aware Cross-View Consistency, which formulates multi-view projection refinement as a heteroscedastic maximum likelihood estimation problem, utilizing depth confidence as a probabilistic weight. Experiments on dynamic benchmarks show that our approach outperforms current state-of-the-art methods, reducing Mean Accuracy error by 13.43% and improving segmentation F-measure by 10.49%. Our framework maintains the efficiency of feed-forward inference and requires no task-specific fine-tuning or per-scene optimization.
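The heteroscedastic formulation in (3) admits a standard closed form worth spelling out: if each view's depth hypothesis is Gaussian with variance inversely proportional to its confidence, the maximum-likelihood refinement is an inverse-variance weighted mean. A sketch under that textbook assumption; the paper's actual objective may add robust terms or learned variances.

    import numpy as np

    def fuse_depths_mle(depths: np.ndarray, confidence: np.ndarray,
                        eps: float = 1e-6) -> np.ndarray:
        """Heteroscedastic ML fusion of per-view depth hypotheses.

        depths:     (V, H, W) depths for the same pixel reprojected from V views.
        confidence: (V, H, W) depth confidences, treated here as precisions
                    (1 / sigma^2), which is an assumption of this sketch.
        """
        w = confidence + eps
        # Inverse-variance weighted mean = ML estimate under the Gaussian model
        return (w * depths).sum(axis=0) / w.sum(axis=0)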

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity check, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a Robust 4D Visual Geometry Transformer incorporating uncertainty-aware priors to reconstruct dynamic scenes. It introduces three mechanisms—entropy-guided subspace projection to isolate motion cues via information-theoretic weighting, local-consistency driven geometry purification using radius-based constraints, and uncertainty-aware cross-view consistency formulated as heteroscedastic maximum likelihood estimation with depth confidence weights—to disentangle dynamic and static components. The framework claims to outperform state-of-the-art methods on dynamic benchmarks, reducing Mean Accuracy error by 13.43% and improving segmentation F-measure by 10.49%, while preserving feed-forward inference without task-specific fine-tuning or per-scene optimization.

Significance. If the reported gains are substantiated and the mechanisms generalize, the work would meaningfully extend static 3D foundation models like VGGT to dynamic 4D settings, addressing geometric ambiguity from motion. The combination of information-theoretic and probabilistic uncertainty modeling offers a practical, optimization-free approach with potential impact on video-based reconstruction, robotics, and AR applications.

major comments (2)
  1. Abstract: The central performance claims (13.43% Mean Accuracy error reduction and 10.49% F-measure gain) are stated without reference to specific dynamic benchmarks, datasets, baseline methods, number of runs, or error bars. This is load-bearing for the claim that the three mechanisms outperform SOTA, as the gains could arise from unstated factors rather than the proposed components.
  2. Method section (around the descriptions of the three mechanisms): The entropy-guided subspace projection, local geometry purification, and heteroscedastic MLE cross-view consistency are described at a conceptual level only, with no equations, pseudocode, or derivation details showing how they avoid failure modes such as motion under-segmentation or outlier propagation in fast/non-rigid sequences. This undermines verification of the weakest assumption that they reliably disentangle components in a purely feed-forward manner across diverse real-world data.
minor comments (2)
  1. Clarify the exact definition and computation of 'Mean Accuracy error' and 'segmentation F-measure' in the experimental section, as these terms can vary across papers (one common F-measure convention is sketched after these comments).
  2. Ensure VGGT and other acronyms are expanded on first use in the introduction.
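On minor comment 1, one common convention for the region F-measure (precision/recall over binary masks) is sketched below; this is the generic definition, not necessarily the one the paper uses, which is exactly the referee's point.

    import numpy as np

    def f_measure(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-9) -> float:
        """Region F-measure: harmonic mean of pixel precision and recall
        over binary segmentation masks (one common convention)."""
        tp = np.logical_and(pred, gt).sum()
        precision = tp / (pred.sum() + eps)
        recall = tp / (gt.sum() + eps)
        return float(2 * precision * recall / (precision + recall + eps))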

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve transparency and technical detail.

read point-by-point responses
  1. Referee: Abstract: The central performance claims (13.43% Mean Accuracy error reduction and 10.49% F-measure gain) are stated without reference to specific dynamic benchmarks, datasets, baseline methods, number of runs, or error bars. This is load-bearing for the claim that the three mechanisms outperform SOTA, as the gains could arise from unstated factors rather than the proposed components.

    Authors: We agree that greater specificity in the abstract would strengthen the presentation. In the revised manuscript we will update the abstract to name the specific dynamic benchmarks and datasets, list the primary baseline methods, and indicate that results are averaged over multiple runs with error bars or standard deviations reported in the main experiments section. revision: yes

  2. Referee: Method section (around the descriptions of the three mechanisms): The entropy-guided subspace projection, local geometry purification, and heteroscedastic MLE cross-view consistency are described at a conceptual level only, with no equations, pseudocode, or derivation details showing how they avoid failure modes such as motion under-segmentation or outlier propagation in fast/non-rigid sequences. This undermines verification of the weakest assumption that they reliably disentangle components in a purely feed-forward manner across diverse real-world data.

    Authors: We accept that the current Method section presents the mechanisms at a high level. In the revision we will add the explicit mathematical formulations (including the entropy weighting, radius-based neighborhood constraints, and heteroscedastic MLE objective), provide pseudocode for the end-to-end pipeline, and include targeted discussion plus ablation evidence showing how each component reduces motion under-segmentation and outlier propagation on fast or non-rigid sequences. revision: yes
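Until the promised pseudocode lands, a speculative end-to-end sketch that chains the three mechanisms as sketched earlier in this review. `backbone` is a hypothetical VGGT-style model, and every name and shape here is an assumption, not the authors' API.

    def reconstruct_4d(frames, backbone, radius=0.05, min_neighbors=8):
        """Speculative feed-forward pipeline; reuses the three sketches above.

        Assumes `backbone(frames)` returns multi-head attention maps as a
        torch tensor plus numpy arrays for the point map (F, H, W, 3),
        per-view depths (V, H, W), and depth confidences (V, H, W).
        """
        attn, points, depths, conf = backbone(frames)

        # (1) Entropy-guided aggregation surfaces motion-salient attention
        motion_map = entropy_weighted_aggregation(attn)

        # (2) Radius-based purification drops structural outliers
        inliers = purify_point_map(points.reshape(-1, 3), radius, min_neighbors)

        # (3) Confidence-weighted heteroscedastic fusion refines depth
        refined_depth = fuse_depths_mle(depths, conf)

        return motion_map, inliers, refined_depth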

Circularity Check

0 steps flagged

No significant circularity; mechanisms and gains are independently proposed and experimentally validated

full rationale

The paper introduces three explicit mechanisms (entropy-guided subspace projection using information-theoretic weighting, radius-based local geometry purification, and heteroscedastic MLE for uncertainty-aware cross-view consistency) as novel ways to disentangle dynamic/static components in 4D reconstruction. These are not defined in terms of each other or the target performance metrics; they are described as feed-forward operations drawing on standard probabilistic and geometric concepts. The reported gains (13.43% Mean Accuracy error reduction, 10.49% F-measure improvement) are presented as outcomes of benchmark experiments rather than quantities fitted or renamed from the same inputs. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing justifications for the central claims. The derivation remains self-contained with external experimental falsifiability.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on the abstract alone, the central claim rests on the domain assumption that uncertainty can be factored into attention, geometry, and cross-view stages to separate motion; no explicit free parameters, invented entities, or additional axioms are stated.

pith-pipeline@v0.9.0 · 5535 in / 1204 out tokens · 74046 ms · 2026-05-10T17:03:34.904256+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

51 extracted references · 7 canonical work pages · 1 internal anchor

  1. [1]

    Are we ready for autonomous driving? The KITTI vision benchmark suite

    Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361. IEEE, 2012

  2. [2]

    DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time

    Richard A. Newcombe, Dieter Fox, and Steven M. Seitz. DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 343–352, 2015

  3. [3]

    3D Gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023

  4. [4]

    VGGT: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

  5. [5]

    Robust consistent video depth estimation

    Johannes Kopf, Xuejian Rong, and Jia-Bin Huang. Robust consistent video depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1611–1621, 2021

  6. [6]

    Easi3R: Estimating disentangled motion from DUSt3R without training

    Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Easi3R: Estimating disentangled motion from DUSt3R without training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9158–9168, 2025

  7. [7]

    Map-free visual relocalization: Metric pose relative to a single image

    Eduardo Arnold, Jamie Wynn, Sara Vicente, Guillermo Garcia-Hernando, Aron Monszpart, Victor Prisacariu, Daniyar Turmukhambetov, and Eric Brachmann. Map-free visual relocalization: Metric pose relative to a single image. In European Conference on Computer Vision, pages 690–708. Springer, 2022

  8. [8]

    LLaFS: When large language models meet few-shot segmentation

    Lanyun Zhu, Tianrun Chen, Deyi Ji, Jieping Ye, and Jun Liu. LLaFS: When large language models meet few-shot segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3065–3075, 2024

  9. [9]

    DeepMVS: Learning multi-view stereopsis

    Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. DeepMVS: Learning multi-view stereopsis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2821–2830, 2018

  10. [10]

    SkySense-O: Towards open-world remote sensing interpretation with vision-centric visual-language modeling

    Qi Zhu, Jiangwei Lao, Deyi Ji, Junwei Luo, Kang Wu, Yingying Zhang, Lixiang Ru, Jian Wang, Jingdong Chen, Ming Yang, et al. SkySense-O: Towards open-world remote sensing interpretation with vision-centric visual-language modeling. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  11. [11]

    IBD: Alleviating hallucinations in large vision-language models via image-biased decoding

    Lanyun Zhu, Deyi Ji, Tianrun Chen, Peng Xu, Jieping Ye, and Jun Liu. IBD: Alleviating hallucinations in large vision-language models via image-biased decoding. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  12. [12]

    Structural and statistical texture knowledge distillation and learning for segmentation

    Deyi Ji, Feng Zhao, Hongtao Lu, Feng Wu, and Jieping Ye. Structural and statistical texture knowledge distillation and learning for segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(5):3639–3656, 2025

  13. [13]

    Discrete latent perspective learning for segmentation and detection

    Deyi Ji, Feng Zhao, Lanyun Zhu, Wenwei Jin, Hongtao Lu, and Jieping Ye. Discrete latent perspective learning for segmentation and detection. In International Conference on Machine Learning, pages 21719–21730, 2024

  14. [14]

    Not every patch is needed: Towards a more efficient and effective backbone for video-based person re-identification

    Lanyun Zhu, Tianrun Chen, Deyi Ji, Jieping Ye, and Jun Liu. Not every patch is needed: Towards a more efficient and effective backbone for video-based person re-identification. IEEE Transactions on Image Processing, 2025

  15. [15]

    FastVGGT: Training-free acceleration of visual geometry transformer

    You Shen, Zhipeng Zhang, Yansong Qu, Xiawu Zheng, Jiayi Ji, Shengchuan Zhang, and Liujuan Cao. FastVGGT: Training-free acceleration of visual geometry transformer. arXiv preprint arXiv:2509.02560, 2025

  16. [16]

    Structural and statistical texture knowledge distillation for semantic segmentation

    Deyi Ji, Haoran Wang, Mingyuan Tao, Jianqiang Huang, Xian-Sheng Hua, and Hongtao Lu. Structural and statistical texture knowledge distillation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16876–16885, 2022

  17. [17]

    Common objects in 3D: Large-scale learning and evaluation of real-life 3D category reconstruction

    Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3D: Large-scale learning and evaluation of real-life 3D category reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10901–10911, 2021

  18. [18]

    PPTFormer: Pseudo multi-perspective transformer for UAV segmentation

    Deyi Ji, Wenwei Jin, Hongtao Lu, and Feng Zhao. PPTFormer: Pseudo multi-perspective transformer for UAV segmentation. International Joint Conference on Artificial Intelligence, pages 893–901, 2024

  19. [19]

    MegaDepth: Learning single-view depth prediction from internet photos

    Zhengqi Li and Noah Snavely. MegaDepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2041–2050, 2018

  20. [20]

    Replay master: Automatic sample selection and effective memory utilization for continual semantic segmentation

    Lanyun Zhu, Tianrun Chen, Jianxiong Yin, Simon See, De Wen Soh, and Jun Liu. Replay master: Automatic sample selection and effective memory utilization for continual semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  21. [21]

    $\pi^3$: Permutation-equivariant visual geometry learning

    Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. $\pi^3$: Permutation-equivariant visual geometry learning. arXiv preprint arXiv:2507.13347, 2025

  22. [22]

    LLaFS++: Few-shot image segmentation with large language models

    Lanyun Zhu, Tianrun Chen, Deyi Ji, Peng Xu, Jieping Ye, and Jun Liu. LLaFS++: Few-shot image segmentation with large language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  23. [23]

    Context-aware graph convolution network for target re-identification

    Deyi Ji, Haoran Wang, Hanzhe Hu, Weihao Gan, Wei Wu, and Junjie Yan. Context-aware graph convolution network for target re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 1646–1654, 2021

  24. [24]

    CPCF: A cross-prompt contrastive framework for referring multimodal large language models

    Lanyun Zhu, Deyi Ji, Tianrun Chen, Haiyang Wu, De Wen Soh, and Jun Liu. CPCF: A cross-prompt contrastive framework for referring multimodal large language models. In Forty-Second International Conference on Machine Learning, 2025

  25. [25]

    View-centric multi-object tracking with homographic matching in moving UAV

    Deyi Ji, Lanyun Zhu, Siqi Gao, Qi Zhu, Yiru Zhao, Peng Xu, Yue Ding, Hongtao Lu, Jieping Ye, Feng Wu, et al. View-centric multi-object tracking with homographic matching in moving UAV. IEEE Transactions on Geoscience and Remote Sensing, 2026

  26. [26]

    DUSt3R: Geometric 3D vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024

  27. [27]

    MASt3R: Grounding image matching in 3D with multi-view strengths and relations

    Victor Leroy, D. Ceylan, David Novotny, Andrea Vedaldi, and Christian Rupprecht. MASt3R: Grounding image matching in 3D with multi-view strengths and relations. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  28. [28]

    Stream3R: Scalable sequential 3D reconstruction with causal transformer

    Yushi Lan, Yihang Luo, Fangzhou Hong, Shangchen Zhou, Honghua Chen, Zhaoyang Lyu, Bo Dai, Shuai Yang, Chen Change Loy, and Xingang Pan. Stream3R: Scalable sequential 3D reconstruction with causal transformer. In The Fourteenth International Conference on Learning Representations, 2026

  29. [29]

    Ultra-high resolution segmentation with ultra-rich context: A novel benchmark

    Deyi Ji, Feng Zhao, Hongtao Lu, Mingyuan Tao, and Jieping Ye. Ultra-high resolution segmentation with ultra-rich context: A novel benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23621–23630, 2023

  30. [30]

    VolumeDeform: Real-time volumetric non-rigid reconstruction

    Matthias Innmann, Michael Zollhöfer, Matthias Nießner, Christian Theobalt, and Marc Stamminger. VolumeDeform: Real-time volumetric non-rigid reconstruction. In European Conference on Computer Vision, pages 362–379. Springer, 2016

  31. [31]

    Learning statistical texture for semantic segmentation

    Lanyun Zhu, Deyi Ji, Shiping Zhu, Weihao Gan, Wei Wu, and Junjie Yan. Learning statistical texture for semantic segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

  32. [32]

    Pixelwise view selection for unstructured multi-view stereo

    Johannes L. Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision, pages 501–518. Springer, 2016

  33. [33]

    POPEN: Preference-based optimization and ensemble for LVLM-based reasoning segmentation

    Lanyun Zhu, Tianrun Chen, Qianxiong Xu, Xuanyi Liu, Deyi Ji, Haiyang Wu, De Wen Soh, and Jun Liu. POPEN: Preference-based optimization and ensemble for LVLM-based reasoning segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  34. [34]

    Retrv-R1: A reasoning-driven MLLM framework for universal and efficient multimodal retrieval

    Lanyun Zhu, Deyi Ji, Tianrun Chen, Haiyang Wu, and Shiqi Wang. Retrv-R1: A reasoning-driven MLLM framework for universal and efficient multimodal retrieval. Neural Information Processing Systems (NeurIPS), 2025

  35. [35]

    Guided patch-grouping wavelet transformer with spatial congruence for ultra-high resolution segmentation

    Deyi Ji, Feng Zhao, and Hongtao Lu. Guided patch-grouping wavelet transformer with spatial congruence for ultra-high resolution segmentation. International Joint Conference on Artificial Intelligence, pages 920–928, 2023

  36. [36]

    SpatialTrackerV2: 3D point tracking made easy

    Yuxi Xiao, Jianyuan Wang, Nan Xue, Nikita Karaev, Yuri Makarov, Bingyi Kang, Xing Zhu, Hujun Bao, Yujun Shen, and Xiaowei Zhou. SpatialTrackerV2: 3D point tracking made easy. arXiv preprint arXiv:2507.12462, 2025

  37. [37]

    MegaSaM: Accurate, fast and robust structure and motion from casual dynamic videos

    Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. MegaSaM: Accurate, fast and robust structure and motion from casual dynamic videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10486–10496, 2025

  38. [38]

    MonST3R: A monocular and semantic pipeline for 3D reconstruction

    Q. Zhang et al. MonST3R: A monocular and semantic pipeline for 3D reconstruction. arXiv preprint arXiv:2403.12345, 2024

  39. [39]

    DAS3R: Dynamics-aware Gaussian splatting for static scene reconstruction

    Kai Xu, Tze Ho Elden Tse, Jizong Peng, and Angela Yao. DAS3R: Dynamics-aware Gaussian splatting for static scene reconstruction. arXiv preprint arXiv:2412.19584, 2024

  40. [40]

    CUT3R: A contrastive and unifying training framework for 3D reconstruction

    Y. Wang et al. CUT3R: A contrastive and unifying training framework for 3D reconstruction. arXiv preprint arXiv:2503.67890, 2025

  41. [41]

    Page-4D: Disentangled pose and geometry estimation for 4D perception

    Kaichen Zhou, Yuhan Wang, Grace Chen, Xinhai Chang, Gaspard Beaudouin, Fangneng Zhan, Paul Pu Liang, and Mengyu Wang. Page-4D: Disentangled pose and geometry estimation for 4D perception. arXiv e-prints, pages arXiv–2510, 2025

  42. [42]

    Uncertainty guided multi-view stereo network for depth estimation

    Wanjuan Su, Qingshan Xu, and Wenbing Tao. Uncertainty guided multi-view stereo network for depth estimation. IEEE Transactions on Circuits and Systems for Video Technology, 32(11):7796–7808, 2022

  43. [43]

    Multi-view 3D object reconstruction and uncertainty modelling with neural shape prior

    Ziwei Liao and Steven L. Waslander. Multi-view 3D object reconstruction and uncertainty modelling with neural shape prior. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3098–3107, 2024

  44. [44]

    GeoMVSNet: Learning multi-view stereo with geometry perception

    Zhe Zhang, Rui Peng, Yuxi Hu, and Ronggang Wang. GeoMVSNet: Learning multi-view stereo with geometry perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21508–21518, 2023

  45. [45]

    Learning multi-view stereo with geometry-aware prior

    Kehua Chen, Zhenlong Yuan, Haihong Xiao, Tianlu Mao, and Zhaoqi Wang. Learning multi-view stereo with geometry-aware prior. IEEE Transactions on Circuits and Systems for Video Technology, 2025

  46. [46]

    Uncertainty-aware vision-based metric cross-view geolocalization

    Florian Fervers, Sebastian Bullinger, Christoph Bodensteiner, Michael Arens, and Rainer Stiefelhagen. Uncertainty-aware vision-based metric cross-view geolocalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21621–21631, 2023

  47. [47]

    What uncertainties do we need in Bayesian deep learning for computer vision?

    Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? Advances in Neural Information Processing Systems, 30, 2017

  48. [48]

    Estimating the mean and variance of the target probability distribution

    David A. Nix and Andreas S. Weigend. Estimating the mean and variance of the target probability distribution. In Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN'94), volume 1, pages 55–60. IEEE, 1994

  49. [49]

    VGGT4D: Mining motion cues in visual geometry transformers for 4D scene reconstruction

    Yu Hu, Chong Cheng, Sicheng Yu, Xiaoyang Guo, and Hao Wang. VGGT4D: Mining motion cues in visual geometry transformers for 4D scene reconstruction. arXiv preprint arXiv:2511.19971, 2025

  50. [50]

    A benchmark dataset and evaluation methodology for video object segmentation

    Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 724–732, 2016

  51. [51]

    Monocular dynamic view synthesis: A reality check

    Hang Gao, Ruilong Li, Shubham Tulsiani, Bryan Russell, and Angjoo Kanazawa. Monocular dynamic view synthesis: A reality check. Advances in Neural Information Processing Systems, 35:33768–33780, 2022