arxiv: 2604.13596 · v3 · pith:AYRQIWVYnew · submitted 2026-04-15 · 💻 cs.CV

VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation

Pith reviewed 2026-05-25 07:04 UTC · model grok-4.3

classification 💻 cs.CV
keywords cross-view segmentation · egocentric exocentric views · instance segmentation · geometry-aware features · self-supervised training · Union Segmentation Head · Ego-Exo4D benchmark

The pith

VGGT-Segmentor adds a three-stage Union Segmentation Head to VGGT features to produce accurate instance masks across ego and exo views without paired annotations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve instance-level object segmentation between egocentric and exocentric camera views, a task made hard by large shifts in scale, perspective, and occlusion that break direct pixel matching. It starts from VGGT's geometry-aware feature alignment but notes that this alignment alone suffers from pixel-level projection drift. The solution is a Union Segmentation Head that runs mask prompt fusion, point-guided prediction, and iterative refinement to turn object-level consistency into dense masks, paired with single-image self-supervised training that removes the need for correspondence labels. On the Ego-Exo4D benchmark this yields 67.7 percent and 68.0 percent average IoU for the Ego-to-Exo and Exo-to-Ego directions, respectively, which the paper reports as a new state of the art; the correspondence-free pretrained model alone surpasses most fully supervised baselines. If correct, the approach shows that high-level geometric consistency can be turned into usable dense output for embodied AI and remote collaboration tasks.
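The headline numbers are average IoU scores, so it helps to see what that metric computes. The sketch below is a minimal, standard-library illustration of mask IoU and its mean over a set of prediction/ground-truth pairs. The function names (`mask_iou`, `average_iou`) and the convention that two empty masks score 1.0 are assumptions made here for illustration; the exact averaging protocol used by the Ego-Exo4D benchmark is not described on this page.

```python
from typing import Iterable, Sequence, Tuple

# A binary mask is represented as rows of 0/1 values (an assumption for this sketch).
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two same-sized binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumption: two empty masks are treated as a perfect match.
    return 1.0 if union == 0 else inter / union


def average_iou(pairs: Iterable[Tuple[Mask, Mask]]) -> float:
    """Unweighted mean IoU over (prediction, ground truth) pairs."""
    scores = [mask_iou(pred, target) for pred, target in pairs]
    return sum(scores) / len(scores)


pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]

# 2 pixels are shared and 4 pixels are covered by either mask.
print(mask_iou(pred, gt))  # 0.5
```

An average of 67.7 percent therefore means that, on this kind of per-pair score, predicted and ground-truth masks overlap on roughly two thirds of their combined area.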

Core claim

VGGT-Segmentor unifies VGGT's cross-view feature representation with a Union Segmentation Head that operates in three stages (mask prompt fusion, point-guided prediction, and iterative mask refinement) to translate high-level geometric alignment into pixel-accurate segmentation masks. A single-image self-supervised training strategy removes the need for paired annotations, and the full system reports new state-of-the-art results of 67.7 percent and 68.0 percent average IoU on the Ego-to-Exo and Exo-to-Ego tasks, respectively.

What carries the argument

The Union Segmentation Head, a three-stage module that fuses mask prompts, guides predictions with points, and refines masks iteratively to convert object-level feature alignment into precise pixel masks despite projection drift.
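The order of the three stages can be sketched as a small piece of control flow. This is a schematic only: the function `union_head_forward`, its parameters, and the toy stand-in stages are all hypothetical names invented here, and nothing below reflects the authors' actual layers, tensor shapes, or number of refinement iterations. It shows only that fusion runs once, prediction runs once, and refinement repeats.

```python
from typing import Any, Callable, List, Tuple


def union_head_forward(
    fuse: Callable[[Any, Any, Any], Tuple[Any, Any]],
    predict: Callable[[Any, Any], Any],
    refine: Callable[[Any, Any], Any],
    source_feat: Any,
    target_feat: Any,
    source_mask: Any,
    refine_steps: int = 2,  # hypothetical default, not taken from the paper
) -> Any:
    # Stage 1: mask prompt fusion injects the source mask into both feature maps.
    fused_source, fused_target = fuse(source_feat, target_feat, source_mask)
    # Stage 2: point-guided prediction produces an initial target mask.
    target_mask = predict(fused_source, fused_target)
    # Stage 3: iterative refinement updates the mask a fixed number of times.
    for _ in range(refine_steps):
        target_mask = refine(fused_target, target_mask)
    return target_mask


# Toy stand-ins that only record the call order (purely illustrative).
calls: List[str] = []


def toy_fuse(fs: Any, ft: Any, ms: Any) -> Tuple[Any, Any]:
    calls.append("fuse")
    return fs, ft


def toy_predict(fs: Any, ft: Any) -> int:
    calls.append("predict")
    return 0


def toy_refine(ft: Any, mask: int) -> int:
    calls.append("refine")
    return mask + 1


result = union_head_forward(toy_fuse, toy_predict, toy_refine,
                            "Fs", "Ft", "Ms", refine_steps=3)
print(calls)   # ['fuse', 'predict', 'refine', 'refine', 'refine']
print(result)  # 3
```

Passing the stages in as callables keeps the sketch independent of any particular framework; a real implementation would replace the stand-ins with learned modules.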

If this is right

  • Cross-view instance segmentation becomes feasible without paired view annotations or explicit correspondence labels.
  • A correspondence-free pretrained model can surpass most fully supervised methods on the Ego-Exo4D benchmark.
  • The same geometry-to-mask translation supports embodied AI and remote collaboration applications that require consistent object identity across viewpoints.
  • Single-image self-supervised training scales to new view pairs without additional annotation cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same head design could be attached to other geometry-aware backbones to turn their object-level signals into dense outputs.
  • Success here implies that attention consistency at the object level is a more stable signal than pixel-level matching when viewpoints differ sharply.
  • The method may extend to video or multi-camera settings where drift accumulates over time rather than across two static views.

Load-bearing premise

VGGT's internal object-level attention stays consistent enough that the three-stage head can convert it into accurate per-pixel masks even when pixel projections drift.
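One way such a premise could be probed is to measure how much of the attention on the target view lands inside the object's ground-truth mask. The sketch below is a hypothetical proxy metric, not something the paper reports: the name `attention_mass_in_mask`, the list-of-rows data layout, and the toy values are all assumptions made for illustration.

```python
from typing import Sequence


def attention_mass_in_mask(
    attention: Sequence[Sequence[float]],
    mask: Sequence[Sequence[int]],
) -> float:
    """Fraction of total (non-negative) attention weight that falls inside a binary mask."""
    total = 0.0
    inside = 0.0
    for attn_row, mask_row in zip(attention, mask):
        for weight, in_object in zip(attn_row, mask_row):
            total += weight
            if in_object:
                inside += weight
    if total == 0.0:
        raise ValueError("attention map has no mass")
    return inside / total


# Toy, unnormalized attention over a 2x3 target view (illustrative values only).
attention = [[0.0, 1.0, 0.0],
             [1.0, 6.0, 2.0]]
object_mask = [[0, 0, 0],
               [0, 1, 1]]

# 8 of the 10 units of attention fall on object pixels.
print(attention_mass_in_mask(attention, object_mask))  # 0.8
```

A value near 1.0 across many view pairs would support the premise that attention stays on the right object; consistently low values would undercut it.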

What would settle it

On the Ego-Exo4D test set, a version that uses only raw VGGT features or drops the iterative refinement stage produces IoU no higher than existing supervised baselines.

Figures

Figures reproduced from arXiv: 2604.13596 by Bohao Zhang, Jitong Liao, Si Liu, Wenjun Wu, Yulu Gao, Zongheng Tang.

Figure 1: Visualizing VGGT Cross-View Correspondence. Left: source image. Middle: target image with the projections of source-sampled points obtained by directly applying VGGT, which exhibit systematic drift and misalignment. Right: star markers in the source image with the corresponding attention map on the target image, illustrating VGGT's instance-consistent object alignment across views.
Figure 2: (A) Overall Architecture of VGGT-S, which integrates the original VGGT encoder with our Union Segmentation Head. (B) Mask Prompt Fusion stage, which injects the source mask Ms into source feature map Fs and target feature map Ft via convolutional fusion and a Bottleneck Fusion module. (C) Point-Guided Prediction stage, which uses point sets (Ps, Pt) to guide target mask prediction through bidirectional in…
Figure 3: Visualization of VGGT-S vs. DOMR. The first row shows the Ego→Exo task. DOMR incorrectly takes the chopping board as the predicted result, while VGGT-S correctly identifies the pot. The second row illustrates the Exo→Ego task. Two similar bottles are nearby. Due to a lack of geometric information, DOMR mistakenly confuses them, whereas VGGT-S continues to make accurate predictions.
Figure 4: Visualization of the Effect of the Union Segmentation Head. Although VGGT projects points to incorrect locations, our Union Segmentation Head adjusts the predicted mask to geometrically consistent positions. Zooming in provides better results.
Original abstract

Instance-level object segmentation across disparate egocentric and exocentric views is a fundamental challenge in visual understanding, critical for applications in embodied AI and remote collaboration. This task is exceptionally difficult due to severe changes in scale, perspective, and occlusion, which destabilize direct pixel-level matching. While recent geometry-aware models like VGGT provide a strong foundation for feature alignment, we find they often fail at dense prediction tasks due to significant pixel-level projection drift, even when their internal object-level attention remains consistent. To bridge this gap, we introduce VGGT-Segmentor (VGGT-S), a framework that unifies robust geometric modeling with pixel-accurate semantic segmentation. VGGT-S leverages VGGT's powerful cross-view feature representation and introduces a novel Union Segmentation Head. This head operates in three stages: mask prompt fusion, point-guided prediction, and iterative mask refinement, effectively translating high-level feature alignment into a precise segmentation mask. Furthermore, we propose a single-image self-supervised training strategy that eliminates the need for paired annotations and enables strong generalization. On the Ego-Exo4D benchmark, VGGT-S sets a new state-of-the-art, achieving 67.7% and 68.0% average IoU for Ego to Exo and Exo to Ego tasks, respectively, significantly outperforming prior methods. Notably, our correspondence-free pretrained model surpasses most fully-supervised baselines, demonstrating the effectiveness and scalability of our approach. Code is publicly available at: https://github.com/buaa-colalab/VGGT-S.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces VGGT-Segmentor (VGGT-S), a framework that augments VGGT's cross-view geometric features with a Union Segmentation Head operating in three stages (mask prompt fusion, point-guided prediction, iterative refinement) to achieve instance-level object segmentation between egocentric and exocentric views. It employs a single-image self-supervised training strategy that avoids paired annotations and reports new state-of-the-art results on the Ego-Exo4D benchmark: 67.7% average IoU (Ego to Exo) and 68.0% (Exo to Ego), with the correspondence-free pretrained model surpassing most fully-supervised baselines.

Significance. If the reported IoU gains are shown to stem from the proposed head rather than confounding factors, the work would be significant for cross-view segmentation in embodied AI and remote collaboration, particularly due to the self-supervised training that improves scalability and generalization. Public code release is a positive factor for reproducibility.

major comments (1)
  1. [Abstract] The central claim that the Union Segmentation Head converts VGGT's internal object-level attention into pixel-accurate masks despite projection drift rests on the unverified assumption that this attention remains sufficiently consistent across Ego-Exo4D view pairs. No ablation studies, visualizations, quantitative consistency metrics, or error analysis are provided to support this assumption or to attribute the 67.7%/68.0% IoU gains specifically to the three-stage head rather than pretraining or data factors.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the concern about supporting evidence for the consistency of object-level attention and attribution of gains to the Union Segmentation Head below, and will revise the manuscript to incorporate additional analyses.

Point-by-point responses
  1. Referee: [Abstract] The central claim that the Union Segmentation Head converts VGGT's internal object-level attention into pixel-accurate masks despite projection drift rests on the unverified assumption that this attention remains sufficiently consistent across Ego-Exo4D view pairs. No ablation studies, visualizations, quantitative consistency metrics, or error analysis are provided to support this assumption or to attribute the 67.7%/68.0% IoU gains specifically to the three-stage head rather than pretraining or data factors.

    Authors: We agree that the manuscript would benefit from explicit evidence on this point. While the paper notes that VGGT's internal object-level attention remains consistent despite projection drift (based on our empirical observations), dedicated ablation studies, visualizations of attention maps, quantitative consistency metrics (such as cross-view attention similarity), and error analysis are not included. In the revised version, we will add these elements, along with ablations that isolate the contribution of the three-stage Union Segmentation Head from the correspondence-free pretraining and other factors, to more clearly attribute the reported IoU improvements. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation chain is self-contained with external benchmark validation

full rationale

The paper introduces VGGT-Segmentor by extending VGGT features via a novel three-stage Union Segmentation Head and a single-image self-supervised training strategy. Reported 67.7/68.0 IoU results are measured on the external Ego-Exo4D benchmark, not derived from internal fits or self-citations. No equations or steps reduce the claimed performance or alignment to the inputs by construction; the self-supervised claim is explicitly correspondence-free and independent of paired labels. The architecture description and benchmark evaluation constitute independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are described.


