VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation

Bohao Zhang; Jitong Liao; Si Liu; Wenjun Wu; Yulu Gao; Zongheng Tang

arxiv: 2604.13596 · v3 · pith:AYRQIWVYnew · submitted 2026-04-15 · 💻 cs.CV

Yulu Gao, Bohao Zhang, Zongheng Tang, Jitong Liao, Wenjun Wu, Si Liu

Pith reviewed 2026-05-25 07:04 UTC · model grok-4.3

classification 💻 cs.CV

keywords: cross-view segmentation, egocentric exocentric views, instance segmentation, geometry-aware features, self-supervised training, Union Segmentation Head, Ego-Exo4D benchmark

The pith

VGGT-Segmentor adds a three-stage Union Segmentation Head to VGGT features to produce accurate instance masks across ego and exo views without paired annotations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve instance-level object segmentation between egocentric and exocentric camera views, a task made hard by large shifts in scale, perspective, and occlusion that break direct pixel matching. It starts from VGGT's geometry-aware feature alignment but notes that this alignment alone suffers from pixel-level projection drift. The solution is a Union Segmentation Head that runs mask prompt fusion, point-guided prediction, and iterative refinement to turn object-level consistency into dense masks, paired with single-image self-supervised training that removes the need for correspondence labels. On the Ego-Exo4D benchmark this yields 67.7 percent average IoU for Ego-to-Exo and 68.0 percent for Exo-to-Ego, which the paper reports as a new state of the art; the correspondence-free pretrained model on its own beats most fully supervised baselines. If correct, the approach shows that high-level geometric consistency can be turned into usable dense output for embodied AI and remote collaboration tasks.

Core claim

VGGT-Segmentor unifies VGGT's cross-view feature representation with a Union Segmentation Head that operates in three stages (mask prompt fusion, point-guided prediction, and iterative mask refinement) to translate high-level geometric alignment into pixel-accurate segmentation masks. A single-image self-supervised training strategy removes the need for paired annotations, and the resulting model reports new state-of-the-art results of 67.7 percent and 68.0 percent average IoU on the Ego-to-Exo and Exo-to-Ego tasks, respectively.

What carries the argument

The Union Segmentation Head, a three-stage module that fuses mask prompts, guides predictions with points, and refines masks iteratively to convert object-level feature alignment into precise pixel masks despite projection drift.

If this is right

Cross-view instance segmentation becomes feasible without paired view annotations or explicit correspondence labels.
A correspondence-free pretrained model can surpass most fully supervised methods on the Ego-Exo4D benchmark.
The same geometry-to-mask translation supports embodied AI and remote collaboration applications that require consistent object identity across viewpoints.
Single-image self-supervised training scales to new view pairs without additional annotation cost.
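
To make the idea of training without paired views concrete: the available text does not say how the single-image strategy builds its training signal, so the sketch below assumes, purely for illustration, that a second "view" is synthesized from one annotated image by cropping and mirroring both the image and its mask. The function names (`crop`, `hflip`, `make_pseudo_pair`) and the choice of transforms are hypothetical and are not taken from the paper or its code release.

```python
from typing import List, Tuple

Grid = List[List[int]]  # a tiny stand-in for an image or a binary mask


def crop(grid: Grid, top: int, left: int, height: int, width: int) -> Grid:
    """Cut a height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in grid[top:top + height]]


def hflip(grid: Grid) -> Grid:
    """Mirror a grid left to right."""
    return [row[::-1] for row in grid]


def make_pseudo_pair(
    image: Grid, mask: Grid, top: int, left: int, height: int, width: int
) -> Tuple[Grid, Grid, Grid, Grid]:
    """Build (source_image, source_mask, target_image, target_mask).

    ASSUMPTION: the pseudo target view is a mirrored crop of the source.
    The same transform is applied to the mask, so the target mask is known
    without any cross-view correspondence label.
    """
    target_image = hflip(crop(image, top, left, height, width))
    target_mask = hflip(crop(mask, top, left, height, width))
    return image, mask, target_image, target_mask


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    mask = [[0, 1, 0], [0, 1, 1], [0, 0, 0]]
    _, _, t_img, t_mask = make_pseudo_pair(image, mask, 0, 1, 2, 2)
    print(t_img)   # mirrored 2x2 crop of the image
    print(t_mask)  # the matching mirrored crop of the mask
```

The point of the sketch is only that the target mask comes for free when the target view is derived from the source; whatever transforms the paper actually uses would slot in where `crop` and `hflip` appear.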

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same head design could be attached to other geometry-aware backbones to turn their object-level signals into dense outputs.
Success here implies that attention consistency at the object level is a more stable signal than pixel-level matching when viewpoints differ sharply.
The method may extend to video or multi-camera settings where drift accumulates over time rather than across two static views.

Load-bearing premise

VGGT's internal object-level attention stays consistent enough that the three-stage head can convert it into accurate per-pixel masks even when pixel projections drift.
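
One rough way to put a number on projection drift, assuming a ground-truth target mask is available, is the share of projected source points that land outside the target object. This is an illustrative measure, not one defined in the paper; `drift_rate` and the `(x, y)` pixel convention (x is the column, y is the row) are assumptions made for this sketch.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]  # binary mask, mask[row][col]


def drift_rate(points: Sequence[Tuple[float, float]], mask: Mask) -> float:
    """Fraction of projected points that miss the target object mask.

    Points are (x, y) pixel coordinates in the target view. A point counts
    as a miss if it falls outside the image or on a background pixel.
    Returns 0.0 for an empty point set (a convention chosen for this sketch).
    """
    if not points:
        return 0.0
    height = len(mask)
    width = len(mask[0]) if height else 0
    misses = 0
    for x, y in points:
        col, row = int(round(x)), int(round(y))
        inside = 0 <= row < height and 0 <= col < width and mask[row][col]
        if not inside:
            misses += 1
    return misses / len(points)


if __name__ == "__main__":
    target_mask = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
    ]
    projected = [(1.2, 1.1), (2.0, 2.0), (3.4, 0.0), (-1.0, 1.0)]
    print(drift_rate(projected, target_mask))
```

A high miss rate alongside attention that still concentrates on the object would be the pattern the premise describes.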

What would settle it

On the Ego-Exo4D test set, a version that uses only raw VGGT features or drops the iterative refinement stage produces IoU no higher than existing supervised baselines.
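
For reference, the comparison above rests on intersection over union between predicted and ground-truth masks. The sketch below shows the standard per-mask IoU and a plain mean over mask pairs. The exact averaging protocol used for Ego-Exo4D is not stated in the available text, so `average_iou` is an assumption, as is the convention of scoring two empty masks as 1.0.

```python
from typing import Iterable, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask, mask[row][col]


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of the same shape."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # ASSUMPTION: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def average_iou(pairs: Iterable[Tuple[Mask, Mask]]) -> float:
    """Unweighted mean IoU over (prediction, ground truth) pairs."""
    scores = [mask_iou(pred, target) for pred, target in pairs]
    if not scores:
        raise ValueError("average_iou needs at least one mask pair")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    pred = [[1, 1, 0], [0, 1, 0]]
    gt = [[1, 0, 0], [0, 1, 1]]
    print(mask_iou(pred, gt))                    # 2 shared pixels out of 4
    print(average_iou([(pred, gt), (gt, gt)]))
```

An ablation of the kind described would report this score for the full model, for raw VGGT features, and for the head without iterative refinement, on the same test pairs.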

Figures

Figures reproduced from arXiv: 2604.13596 by Bohao Zhang, Jitong Liao, Si Liu, Wenjun Wu, Yulu Gao, Zongheng Tang.

**Figure 1.** Visualizing VGGT Cross-View Correspondence. Left: source image. Middle: target image with the projections of source-sampled points obtained by directly applying VGGT, which exhibit systematic drift and misalignment. Right: star markers in the source image with the corresponding attention map on the target image, illustrating VGGT's instance-consistent object alignment across views.

**Figure 2.** (A) Overall Architecture of VGGT-S, which integrates the original VGGT encoder with our Union Segmentation Head. (B) Mask Prompt Fusion stage, which injects the source mask Ms into source feature map Fs and target feature map Ft via convolutional fusion and a Bottleneck Fusion module. (C) Point-Guided Prediction stage, which uses point sets (Ps, Pt) to guide target mask prediction through bidirectional in…

**Figure 3.** Visualization of VGGT-S vs. DOMR. The first row shows the Ego→Exo task. DOMR incorrectly takes the chopping board as the predicted result, while VGGT-S correctly identifies the pot. The second row illustrates the Exo→Ego task. Two similar bottles are nearby. Due to a lack of geometric information, DOMR mistakenly confuses them, whereas VGGT-S continues to make accurate predictions.

**Figure 4.** Visualization of the Effect of the Union Segmentation Head. Although VGGT projects points to incorrect locations, our Union Segmentation Head adjusts the predicted mask to geometrically consistent positions. Zooming in provides better results.

Original abstract

Instance-level object segmentation across disparate egocentric and exocentric views is a fundamental challenge in visual understanding, critical for applications in embodied AI and remote collaboration. This task is exceptionally difficult due to severe changes in scale, perspective, and occlusion, which destabilize direct pixel-level matching. While recent geometry-aware models like VGGT provide a strong foundation for feature alignment, we find they often fail at dense prediction tasks due to significant pixel-level projection drift, even when their internal object-level attention remains consistent. To bridge this gap, we introduce VGGT-Segmentor (VGGT-S), a framework that unifies robust geometric modeling with pixel-accurate semantic segmentation. VGGT-S leverages VGGT's powerful cross-view feature representation and introduces a novel Union Segmentation Head. This head operates in three stages: mask prompt fusion, point-guided prediction, and iterative mask refinement, effectively translating high-level feature alignment into a precise segmentation mask. Furthermore, we propose a single-image self-supervised training strategy that eliminates the need for paired annotations and enables strong generalization. On the Ego-Exo4D benchmark, VGGT-S sets a new state-of-the-art, achieving 67.7% and 68.0% average IoU for Ego to Exo and Exo to Ego tasks, respectively, significantly outperforming prior methods. Notably, our correspondence-free pretrained model surpasses most fully-supervised baselines, demonstrating the effectiveness and scalability of our approach. Code is publicly available at: https://github.com/buaa-colalab/VGGT-S.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds a three-stage Union Segmentation Head and single-image self-supervision to VGGT for ego-exo instance segmentation, claiming SOTA IoU on Ego-Exo4D, but provides no evidence that object-level attention stays consistent enough to overcome projection drift.

The main takeaway is a new segmentation head built on VGGT plus a self-supervised training trick that avoids paired labels, with reported gains to 67.7/68.0 average IoU on the Ego-Exo4D benchmark that beat prior work including some supervised baselines. Code is released, which helps.

What stands out as new is the specific three-stage Union Segmentation Head (mask prompt fusion, point-guided prediction, iterative refinement) and the single-image self-supervised strategy on top of VGGT features. The framing of the problem is also clear: geometry models like VGGT align at object level but drift at pixels, so a refinement head is needed for dense output. That part is practical for embodied AI settings where annotation is expensive.

The soft spots are more substantial. The abstract states that VGGT's internal object-level attention remains consistent enough for the head to produce accurate masks, yet offers no ablation, visualization, or analysis to support that the attention actually holds across view pairs in Ego-Exo4D. Without that check, the IoU numbers cannot be confidently tied to the proposed head rather than data choices or other implementation details. No error analysis or component breakdowns appear in the available text either.

This work targets researchers doing cross-view dense prediction in robotics or collaboration scenarios. A reader who wants a concrete method to reduce annotation needs could extract value once the full experiments are examined. It deserves a serious referee because the task matters and the high-level approach is reasonable, even though the current evidence is thin and the central assumption needs direct testing.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces VGGT-Segmentor (VGGT-S), a framework that augments VGGT's cross-view geometric features with a Union Segmentation Head operating in three stages (mask prompt fusion, point-guided prediction, iterative refinement) to achieve instance-level object segmentation between egocentric and exocentric views. It employs a single-image self-supervised training strategy that avoids paired annotations and reports new state-of-the-art results on the Ego-Exo4D benchmark: 67.7% average IoU (Ego to Exo) and 68.0% (Exo to Ego), with the correspondence-free pretrained model surpassing most fully-supervised baselines.

Significance. If the reported IoU gains are shown to stem from the proposed head rather than confounding factors, the work would be significant for cross-view segmentation in embodied AI and remote collaboration, particularly due to the self-supervised training that improves scalability and generalization. Public code release is a positive factor for reproducibility.

major comments (1)

[Abstract] The central claim that the Union Segmentation Head converts VGGT's internal object-level attention into pixel-accurate masks despite projection drift rests on the unverified assumption that this attention remains sufficiently consistent across Ego-Exo4D view pairs. No ablation studies, visualizations, quantitative consistency metrics, or error analysis are provided to support this assumption or to attribute the 67.7%/68.0% IoU gains specifically to the three-stage head rather than pretraining or data factors.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the concern about supporting evidence for the consistency of object-level attention and attribution of gains to the Union Segmentation Head below, and will revise the manuscript to incorporate additional analyses.

Referee: [Abstract] The central claim that the Union Segmentation Head converts VGGT's internal object-level attention into pixel-accurate masks despite projection drift rests on the unverified assumption that this attention remains sufficiently consistent across Ego-Exo4D view pairs. No ablation studies, visualizations, quantitative consistency metrics, or error analysis are provided to support this assumption or to attribute the 67.7%/68.0% IoU gains specifically to the three-stage head rather than pretraining or data factors.

Authors: We agree that the manuscript would benefit from explicit evidence on this point. While the paper notes that VGGT's internal object-level attention remains consistent despite projection drift (based on our empirical observations), dedicated ablation studies, visualizations of attention maps, quantitative consistency metrics (such as cross-view attention similarity), and error analysis are not included. In the revised version, we will add these elements, along with ablations that isolate the contribution of the three-stage Union Segmentation Head from the correspondence-free pretraining and other factors, to more clearly attribute the reported IoU improvements.

Revision: yes
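
The "cross-view attention similarity" named in the rebuttal is not defined in the available text. As one simple stand-in, the sketch below measures how much of a target-view attention map falls inside the ground-truth object mask. The function name `attention_mass_in_mask` and the metric itself are assumptions made for illustration, not the authors' definition.

```python
from typing import Sequence

AttentionMap = Sequence[Sequence[float]]  # non-negative weights, attn[row][col]
Mask = Sequence[Sequence[int]]            # binary mask, mask[row][col]


def attention_mass_in_mask(attn: AttentionMap, mask: Mask) -> float:
    """Share of total attention weight that lies on object pixels.

    Returns a value in [0, 1] for non-negative attention maps; 1.0 means all
    attention sits on the object. Returns 0.0 if the map has no weight at
    all (a convention chosen for this sketch).
    """
    total = sum(sum(row) for row in attn)
    if total <= 0:
        return 0.0
    inside = sum(
        weight
        for attn_row, mask_row in zip(attn, mask)
        for weight, is_object in zip(attn_row, mask_row)
        if is_object
    )
    return inside / total


if __name__ == "__main__":
    attn = [[0.0, 0.25], [0.25, 0.5]]
    mask = [[0, 0], [1, 1]]
    print(attention_mass_in_mask(attn, mask))
```

Reported per view pair across the test set, a score like this would show directly whether object-level attention holds up where point projections drift.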

Circularity Check

0 steps flagged

No circularity: derivation chain is self-contained with external benchmark validation

The paper introduces VGGT-Segmentor by extending VGGT features via a novel three-stage Union Segmentation Head and a single-image self-supervised training strategy. Reported 67.7/68.0 IoU results are measured on the external Ego-Exo4D benchmark, not derived from internal fits or self-citations. No equations or steps reduce the claimed performance or alignment to the inputs by construction; the self-supervised claim is explicitly correspondence-free and independent of paired labels. The architecture description and benchmark evaluation constitute independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are described.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 4 internal anchors

[1]

Building rome in a day.Communications of the ACM, 54 (10):105–112, 2011

Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Si- mon, Brian Curless, Steven M Seitz, and Richard Szeliski. Building rome in a day.Communications of the ACM, 54 (10):105–112, 2011. 2

work page 2011
[2]

Ego2top: Matching view- ers in egocentric and top-view videos

Shervin Ardeshir and Ali Borji. Ego2top: Matching view- ers in egocentric and top-view videos. InComputer Vision– ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14, pages 253–268. Springer, 2016. 3

work page 2016
[3]

Self-supervised cross-view correspondence with predictive cycle consistency

Alan Baade and Changan Chen. Self-supervised cross-view correspondence with predictive cycle consistency. InPro- ceedings of the Computer Vision and Pattern Recognition Conference, pages 16753–16763, 2025. 6

work page 2025
[4]

Allison Bayro, Hongju Moon, Yalda Ghasemi, Heejin Jeong, and Jae Yeol Lee. Object manipulation in physically con- strained workplaces: remote collaboration with extended re- ality.IISE Transactions on Occupational Ergonomics and Human Factors, 13(3):177–190, 2025. 1

work page 2025
[5]

Yolact: Real-time instance segmentation

Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee. Yolact: Real-time instance segmentation. InProceedings of the IEEE/CVF international conference on computer vision, pages 9157–9166, 2019. 3

work page 2019
[6]

Brief: Binary robust independent elementary features

Michael Calonder, Vincent Lepetit, Christoph Strecha, and Pascal Fua. Brief: Binary robust independent elementary features. InComputer Vision–ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part IV 11, pages 778–

work page 2010
[7]

Emerg- ing properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 9650–9660, 2021. 3

work page 2021
[8]

Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs

Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Semantic image segmen- tation with deep convolutional nets and fully connected crfs. arXiv preprint arXiv:1412.7062, 2014. 3

work page internal anchor Pith review Pith/arXiv arXiv 2014
[9]

Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolu- tion, and fully connected crfs.IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017

work page 2017
[10]

Rethinking Atrous Convolution for Semantic Image Segmentation

Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for seman- tic image segmentation.arXiv preprint arXiv:1706.05587, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[11]

Encoder-decoder with atrous separable convolution for semantic image segmentation

Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 801–818, 2018. 3

work page 2018
[12]

Masked-attention mask transformer for universal image segmentation

Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexan- der Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1290–1299, 2022. 3

work page 2022
[13]

Aritra Dutta, Srijan Das, Jacob Nielsen, Rajatsubhra Chakraborty, and Mubarak Shah. Multiview aerial visual recognition (mavrec): Can multi-view improve aerial visual perception? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22678–22690, 2024. 8

work page 2024
[14]

Learning by watch- ing: A review of video-based learning approaches for robot manipulation.IEEE Access, 2025

Chrisantus Eze and Christopher Crick. Learning by watch- ing: A review of video-based learning approaches for robot manipulation.IEEE Access, 2025. 1

work page 2025
[15]

Identifying first-person camera wearers in third- person videos

Chenyou Fan, Jangwon Lee, Mingze Xu, Krishna Ku- mar Singh, Yong Jae Lee, David J Crandall, and Michael S Ryoo. Identifying first-person camera wearers in third- person videos. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5125– 5133, 2017. 3

work page 2017
[16]

Geo-neus: Geometry-consistent neural implicit surfaces learning for multi-view reconstruction.Advances in Neural Information Processing Systems, 35:3403–3416, 2022

Qiancheng Fu, Qingshan Xu, Yew Soon Ong, and Wenbing Tao. Geo-neus: Geometry-consistent neural implicit surfaces learning for multi-view reconstruction.Advances in Neural Information Processing Systems, 35:3403–3416, 2022. 2

work page 2022
[17]

Objectrelator: Enabling cross-view object rela- tion understanding across ego-centric and exo-centric per- spectives

Yuqian Fu, Runze Wang, Bin Ren, Guolei Sun, Biao Gong, Yanwei Fu, Danda Pani Paudel, Xuanjing Huang, and Luc Van Gool. Objectrelator: Enabling cross-view object rela- tion understanding across ego-centric and exo-centric per- spectives. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6530–6540, 2025. 1, 3, 6

work page 2025
[18]

Multi-view stereo: A tutorial.Foundations and trends® in Computer Graphics and Vision, 9(1-2):1–148, 2015

Yasutaka Furukawa, Carlos Hern ´andez, et al. Multi-view stereo: A tutorial.Foundations and trends® in Computer Graphics and Vision, 9(1-2):1–148, 2015. 1, 2

work page 2015
[19]

Massively parallel multiview stereopsis by surface normal diffusion

Silvano Galliani, Katrin Lasinger, and Konrad Schindler. Massively parallel multiview stereopsis by surface normal diffusion. InProceedings of the IEEE international confer- ence on computer vision, pages 873–881, 2015. 2

work page 2015
[20]

Self-supervised multi-view multi-human association and tracking

Yiyang Gan, Ruize Han, Liqiang Yin, Wei Feng, and Song Wang. Self-supervised multi-view multi-human association and tracking. InACM MM, 2021. 6

work page 2021
[21]

Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 193...

work page 2024
[22]

A survey on instance segmentation: state of the art.International jour- nal of multimedia information retrieval, 9(3):171–189, 2020

Abdul Mueed Hafiz and Ghulam Mohiuddin Bhat. A survey on instance segmentation: state of the art.International jour- nal of multimedia information retrieval, 9(3):171–189, 2020. 3

work page 2020
[23]

Bridging perspec- tives: A survey on cross-view collaborative intelligence with egocentric-exocentric vision.International Journal of Com- puter Vision, 134(2):62, 2026

Yuping He, Yifei Huang, Guo Chen, Lidong Lu, Baoqi Pei, Jilan Xu, Tong Lu, and Yoichi Sato. Bridging perspec- tives: A survey on cross-view collaborative intelligence with egocentric-exocentric vision.International Journal of Com- puter Vision, 134(2):62, 2026. 1

work page 2026
[24]

Deepmvs: Learning multi- view stereopsis

Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. Deepmvs: Learning multi- view stereopsis. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2821–2830,

work page
[25]

Segmast3r: Geometry grounded segment matching.arXiv preprint arXiv:2510.05051, 2025

Rohit Jayanti, Swayam Agrawal, Vansh Garg, Siddharth Tourani, Muhammad Haris Khan, Sourav Garg, and Mad- hava Krishna. Segmast3r: Geometry grounded segment matching.arXiv preprint arXiv:2510.05051, 2025. 3

work page arXiv 2025
[26]

Egomimic: Scaling imitation learning via egocentric video

Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. Egomimic: Scaling imitation learning via egocentric video. In2025 IEEE International Conference on Robotics and Au- tomation (ICRA), pages 13226–13233. IEEE, 2025. 1

work page 2025
[27]

Panoptic segmentation

Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Doll ´ar. Panoptic segmentation. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9404–9413, 2019. 3

work page 2019
[28]

Segment any- thing

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 4015–4026, 2023. 3, 5

work page 2023
[29]

Ground- ing image matching in 3d with mast3r

Vincent Leroy, Yohann Cabon, and J´erˆome Revaud. Ground- ing image matching in 3d with mast3r. InEuropean Confer- ence on Computer Vision, pages 71–91. Springer, 2024. 3

work page 2024
[30]

Matching anything by segmenting anything

Siyuan Li, Lei Ke, Martin Danelljan, Luigi Piccinelli, Mattia Segu, Luc Van Gool, and Fisher Yu. Matching anything by segmenting anything. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 18963–18973, 2024. 3, 5

work page 2024
[31]

Domr: Establishing cross-view segmentation via dense object matching

Jitong Liao, Yulu Gao, Shaofei Huang, Jialin Gao, Jie Lei, Ronghua Liang, and Si Liu. Domr: Establishing cross-view segmentation via dense object matching. InProceedings of the 33rd ACM International Conference on Multimedia, pages 412–421, 2025. 1, 3, 6

work page 2025
[32]

Path aggregation network for instance segmentation

Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. InPro- ceedings of the IEEE conference on computer vision and pat- tern recognition, pages 8759–8768, 2018. 3

work page 2018
[33]

Least squares quantization in pcm.IEEE trans- actions on information theory, 28(2):129–137, 1982

Stuart Lloyd. Least squares quantization in pcm.IEEE trans- actions on information theory, 28(2):129–137, 1982. 4, 5

work page 1982
[34]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 5

work page internal anchor Pith review Pith/arXiv arXiv 2017
[35]

Distinctive image features from scale- invariant keypoints.International journal of computer vi- sion, 60:91–110, 2004

David G Lowe. Distinctive image features from scale- invariant keypoints.International journal of computer vi- sion, 60:91–110, 2004. 2

work page 2004
[36]

Multiview stereo with cascaded epipolar raft

Zeyu Ma, Zachary Teed, and Jia Deng. Multiview stereo with cascaded epipolar raft. InEuropean Conference on Com- puter Vision, pages 734–750. Springer, 2022. 2

work page 2022
[37]

Differentiable volumetric rendering: Learn- ing implicit 3d representations without 3d supervision

Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric rendering: Learn- ing implicit 3d representations without 3d supervision. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 3504–3515, 2020

work page 2020
[38]

Rethinking depth estimation for multi- view stereo: A unified representation

Rui Peng, Rongjie Wang, Zhenyu Wang, Yawen Lai, and Ronggang Wang. Rethinking depth estimation for multi- view stereo: A unified representation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8645–8654, 2022. 2

work page 2022
[39]

Vi- sion transformers for dense prediction

Ren ´e Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vi- sion transformers for dense prediction. InProceedings of the IEEE/CVF international conference on computer vision, pages 12179–12188, 2021. 3

work page 2021
[40]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024. 3, 5

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Structure- from-motion revisited

Johannes L Schonberger and Jan-Michael Frahm. Structure- from-motion revisited. InProceedings of the IEEE con- ference on computer vision and pattern recognition, pages 4104–4113, 2016. 2

work page 2016
[42]

Photorealistic scene reconstruction by voxel coloring.International journal of computer vision, 35(2):151–173, 1999

Steven M Seitz and Charles R Dyer. Photorealistic scene reconstruction by voxel coloring.International journal of computer vision, 35(2):151–173, 1999. 1

work page 1999
[43]

Dense cross-query-and-support attention weighted mask aggrega- tion for few-shot segmentation

Xinyu Shi, Dong Wei, Yu Zhang, Donghuan Lu, Munan Ning, Jiashun Chen, Kai Ma, and Yefeng Zheng. Dense cross-query-and-support attention weighted mask aggrega- tion for few-shot segmentation. InEuropean Conference on Computer Vision, pages 151–168. Springer, 2022. 6

work page 2022
[44]

Clustergnn: Cluster-based coarse-to- fine graph neural network for efficient feature matching

Yan Shi, Jun-Xiong Cai, Yoli Shavit, Tai-Jiang Mu, Wensen Feng, and Kai Zhang. Clustergnn: Cluster-based coarse-to- fine graph neural network for efficient feature matching. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 12517–12526, 2022. 2

work page 2022
[45]

Vggsfm: Visual geometry grounded deep structure from motion

Jianyuan Wang, Nikita Karaev, Christian Rupprecht, and David Novotny. Vggsfm: Visual geometry grounded deep structure from motion. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 21686–21697, 2024. 2

work page 2024
[46]

Vggt: Vi- sual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Vi- sual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025. 1, 3

work page 2025
[47]

Dust3r: Geometric 3d vi- sion made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vi- sion made easy. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697– 20709, 2024. 3

work page 2024
[48]

Deepsfm: Structure from motion via deep bundle adjustment

Xingkui Wei, Yinda Zhang, Zhuwen Li, Yanwei Fu, and Xi- angyang Xue. Deepsfm: Structure from motion via deep bundle adjustment. InEuropean conference on computer vi- sion, pages 230–247. Springer, 2020. 2

work page 2020
[49]

Seeing the unseen: Pre- dicting the first-person camera wearer’s location and pose in third-person scenes

Yangming Wen, Krishna Kumar Singh, Markham Anderson, Wei-Pang Jan, and Yong Jae Lee. Seeing the unseen: Pre- dicting the first-person camera wearer’s location and pose in third-person scenes. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 3446–3455,

work page
[50]

Joint person segmentation and iden- tification in synchronized first-and third-person videos

Mingze Xu, Chenyou Fan, Yuchen Wang, Michael S Ryoo, and David J Crandall. Joint person segmentation and iden- tification in synchronized first-and third-person videos. In Proceedings of the European Conference on Computer Vi- sion (ECCV), pages 637–652, 2018. 3

work page 2018
[51]

Lift: Learned invariant feature transform

Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. Lift: Learned invariant feature transform. InCom- puter Vision–ECCV 2016: 14th European Conference, Am- 10 sterdam, The Netherlands, October 11-14, 2016, Proceed- ings, Part VI 14, pages 467–483. Springer, 2016. 2

work page 2016
[52]

Cmx: Cross-modal fusion for rgb-x semantic segmentation with transformers.IEEE Transactions on intelligent transportation systems, 24(12): 14679–14694, 2023

Jiaming Zhang, Huayao Liu, Kailun Yang, Xinxin Hu, Ruip- ing Liu, and Rainer Stiefelhagen. Cmx: Cross-modal fusion for rgb-x semantic segmentation with transformers.IEEE Transactions on intelligent transportation systems, 24(12): 14679–14694, 2023. 6

work page 2023
[53]

Ge- omvsnet: Learning multi-view stereo with geometry percep- tion

Zhe Zhang, Rui Peng, Yuxi Hu, and Ronggang Wang. Ge- omvsnet: Learning multi-view stereo with geometry percep- tion. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 21508–21518,

work page
[54]

Psalm: Pixelwise segmentation with large multi-modal model

Zheng Zhang, Yeyao Ma, Enming Zhang, and Xiang Bai. Psalm: Pixelwise segmentation with large multi-modal model. InEuropean Conference on Computer Vision, pages 74–91. Springer, 2024. 1, 3, 6

work page 2024
[55]

Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, and Yong Jae Lee. Segment everything everywhere all at once. Advances in neural information processing systems, 36:19769–19782,

