VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation
Pith reviewed 2026-05-25 07:04 UTC · model grok-4.3
The pith
VGGT-Segmentor adds a three-stage Union Segmentation Head to VGGT features to produce accurate instance masks across ego and exo views without paired annotations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VGGT-Segmentor pairs VGGT's cross-view feature representation with a Union Segmentation Head that operates in three stages (mask prompt fusion, point-guided prediction, and iterative mask refinement) to translate high-level geometric alignment into pixel-accurate segmentation masks. A single-image self-supervised training strategy removes the need for paired annotations, and the paper reports new state-of-the-art results of 67.7% and 68.0% average IoU on the Ego-to-Exo and Exo-to-Ego tasks of the Ego-Exo4D benchmark, respectively.
What carries the argument
The Union Segmentation Head, a three-stage module that fuses mask prompts, guides predictions with points, and refines masks iteratively to convert object-level feature alignment into precise pixel masks despite projection drift.
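The order of the three stages can be sketched as plain control flow. This is a minimal illustration, not the authors' implementation: the class name `UnionHeadSketch`, the stage signatures, the 1-D "image", and the toy stand-in functions are all assumptions made here to show how data might move from prompt fusion to point-guided prediction to repeated refinement.

```python
from dataclasses import dataclass
from typing import Callable, List

Scores = List[float]  # toy 1-D "image": one score per pixel


@dataclass
class UnionHeadSketch:
    """Hypothetical three-stage head; every stage is a pluggable callable."""
    fuse_prompt: Callable[[Scores, Scores], Scores]   # stage 1 (assumed signature)
    predict_from_points: Callable[[Scores], Scores]   # stage 2 (assumed signature)
    refine: Callable[[Scores, Scores], Scores]        # stage 3 (assumed signature)
    num_refine_steps: int = 2                         # assumed, not from the paper

    def __call__(self, features: Scores, prompt: Scores) -> Scores:
        fused = self.fuse_prompt(features, prompt)    # mask prompt fusion
        mask = self.predict_from_points(fused)        # point-guided prediction
        for _ in range(self.num_refine_steps):        # iterative mask refinement
            mask = self.refine(fused, mask)
        return mask


def toy_fuse(features: Scores, prompt: Scores) -> Scores:
    # Stand-in: weight each target-view score by a prompt-derived weight.
    return [f * w for f, w in zip(features, prompt)]


def toy_points(fused: Scores) -> Scores:
    # Stand-in: use the highest-scoring location as a single point prompt.
    peak = max(range(len(fused)), key=fused.__getitem__)
    return [1.0 if i == peak else 0.0 for i in range(len(fused))]


def toy_refine(fused: Scores, mask: Scores) -> Scores:
    # Stand-in: grow the mask into neighbours whose fused score exceeds 0.5.
    out = list(mask)
    for i, m in enumerate(mask):
        if m:
            for j in (i - 1, i + 1):
                if 0 <= j < len(mask) and fused[j] > 0.5:
                    out[j] = 1.0
    return out


head = UnionHeadSketch(toy_fuse, toy_points, toy_refine)
result = head([0.1, 0.7, 0.9, 0.8, 0.2], [1.0, 1.0, 1.0, 1.0, 1.0])
print(result)  # [0.0, 1.0, 1.0, 1.0, 0.0]
```

Keeping each stage as a separate callable also makes it easy to swap one out or disable it, which is the kind of isolation an ablation of the head would need.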
If this is right
- Cross-view instance segmentation becomes feasible without paired view annotations or explicit correspondence labels.
- A correspondence-free pretrained model can surpass most fully supervised methods on the Ego-Exo4D benchmark.
- The same geometry-to-mask translation supports embodied AI and remote collaboration applications that require consistent object identity across viewpoints.
- Single-image self-supervised training scales to new view pairs without additional annotation cost.
Where Pith is reading between the lines
- The same head design could be attached to other geometry-aware backbones to turn their object-level signals into dense outputs.
- Success here implies that attention consistency at the object level is a more stable signal than pixel-level matching when viewpoints differ sharply.
- The method may extend to video or multi-camera settings where drift accumulates over time rather than across two static views.
Load-bearing premise
VGGT's internal object-level attention stays consistent enough that the three-stage head can convert it into accurate per-pixel masks even when pixel projections drift.
What would settle it
On the Ego-Exo4D test set, a version that uses only raw VGGT features or drops the iterative refinement stage produces IoU no higher than existing supervised baselines.
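The comparison above turns on average IoU, so it helps to pin down what is being averaged. The sketch below computes intersection over union for binary masks and a plain mean over mask pairs. The function names and the empty-mask convention are assumptions, and the exact averaging protocol used by the Ego-Exo4D benchmark is not stated on this page.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 values


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def average_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Unweighted mean of per-pair IoU scores."""
    scores = [mask_iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(mask_iou(pred, target))  # 2 shared pixels / 4 pixels in the union = 0.5
```

Under this definition, the ablation described above would score each variant (raw VGGT features, no refinement stage, full head) with `average_iou` on the same mask pairs and compare the resulting numbers.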
Original abstract
Instance-level object segmentation across disparate egocentric and exocentric views is a fundamental challenge in visual understanding, critical for applications in embodied AI and remote collaboration. This task is exceptionally difficult due to severe changes in scale, perspective, and occlusion, which destabilize direct pixel-level matching. While recent geometry-aware models like VGGT provide a strong foundation for feature alignment, we find they often fail at dense prediction tasks due to significant pixel-level projection drift, even when their internal object-level attention remains consistent. To bridge this gap, we introduce VGGT-Segmentor (VGGT-S), a framework that unifies robust geometric modeling with pixel-accurate semantic segmentation. VGGT-S leverages VGGT's powerful cross-view feature representation and introduces a novel Union Segmentation Head. This head operates in three stages: mask prompt fusion, point-guided prediction, and iterative mask refinement, effectively translating high-level feature alignment into a precise segmentation mask. Furthermore, we propose a single-image self-supervised training strategy that eliminates the need for paired annotations and enables strong generalization. On the Ego-Exo4D benchmark, VGGT-S sets a new state-of-the-art, achieving 67.7% and 68.0% average IoU for Ego to Exo and Exo to Ego tasks, respectively, significantly outperforming prior methods. Notably, our correspondence-free pretrained model surpasses most fully-supervised baselines, demonstrating the effectiveness and scalability of our approach. Code is publicly available at: https://github.com/buaa-colalab/VGGT-S.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces VGGT-Segmentor (VGGT-S), a framework that augments VGGT's cross-view geometric features with a Union Segmentation Head operating in three stages (mask prompt fusion, point-guided prediction, iterative refinement) to achieve instance-level object segmentation between egocentric and exocentric views. It employs a single-image self-supervised training strategy that avoids paired annotations and reports new state-of-the-art results on the Ego-Exo4D benchmark: 67.7% average IoU (Ego to Exo) and 68.0% (Exo to Ego), with the correspondence-free pretrained model surpassing most fully-supervised baselines.
Significance. If the reported IoU gains are shown to stem from the proposed head rather than confounding factors, the work would be significant for cross-view segmentation in embodied AI and remote collaboration, particularly due to the self-supervised training that improves scalability and generalization. Public code release is a positive factor for reproducibility.
Major comments (1)
- [Abstract] The central claim that the Union Segmentation Head converts VGGT's internal object-level attention into pixel-accurate masks despite projection drift rests on the unverified assumption that this attention remains sufficiently consistent across Ego-Exo4D view pairs. No ablation studies, visualizations, quantitative consistency metrics, or error analysis are provided to support this assumption or to attribute the 67.7%/68.0% IoU gains specifically to the three-stage head rather than pretraining or data factors.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the concern about supporting evidence for the consistency of object-level attention and attribution of gains to the Union Segmentation Head below, and will revise the manuscript to incorporate additional analyses.
- Referee: [Abstract] The central claim that the Union Segmentation Head converts VGGT's internal object-level attention into pixel-accurate masks despite projection drift rests on the unverified assumption that this attention remains sufficiently consistent across Ego-Exo4D view pairs. No ablation studies, visualizations, quantitative consistency metrics, or error analysis are provided to support this assumption or to attribute the 67.7%/68.0% IoU gains specifically to the three-stage head rather than pretraining or data factors.
  Authors: We agree that the manuscript would benefit from explicit evidence on this point. While the paper notes that VGGT's internal object-level attention remains consistent despite projection drift (based on our empirical observations), dedicated ablation studies, visualizations of attention maps, quantitative consistency metrics (such as cross-view attention similarity), and error analysis are not included. In the revised version, we will add these elements, along with ablations that isolate the contribution of the three-stage Union Segmentation Head from the correspondence-free pretraining and other factors, to more clearly attribute the reported IoU improvements.
  Revision: yes
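One way to make "cross-view attention similarity" concrete is to pool the features that fall inside the object mask in each view and compare the two pooled vectors by cosine similarity. The sketch below is an illustration under that assumption only: the rebuttal does not define the metric, and the function names, the mean pooling, and the toy feature vectors are all hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def masked_mean(features: Sequence[Vector], mask: Sequence[int]) -> Vector:
    """Average the per-pixel feature vectors at positions where mask is 1."""
    dim = len(features[0])
    selected = [f for f, m in zip(features, mask) if m]
    if not selected:
        return [0.0] * dim
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


def cross_view_consistency(ego_feats, ego_mask, exo_feats, exo_mask) -> float:
    """Hypothetical score: similarity of object-pooled features across views."""
    return cosine_similarity(masked_mean(ego_feats, ego_mask),
                             masked_mean(exo_feats, exo_mask))


# Toy example: two pixels per view, 2-D features per pixel.
ego_feats = [[1.0, 0.0], [0.0, 1.0]]
exo_feats = [[2.0, 0.0], [0.0, 3.0]]
same_object = cross_view_consistency(ego_feats, [1, 0], exo_feats, [1, 0])
other_object = cross_view_consistency(ego_feats, [1, 0], exo_feats, [0, 1])
print(same_object, other_object)
```

A score of this kind, reported alongside pixel-level projection error on the same view pairs, would test whether object-level signals stay stable where pixel projections drift.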
Circularity Check
No circularity: derivation chain is self-contained with external benchmark validation
The paper introduces VGGT-Segmentor by extending VGGT features via a novel three-stage Union Segmentation Head and a single-image self-supervised training strategy. The reported 67.7%/68.0% average IoU results are measured on the external Ego-Exo4D benchmark, not derived from internal fits or self-citations. No equations or steps reduce the claimed performance or alignment to the inputs by construction; the self-supervised claim is explicitly correspondence-free and independent of paired labels. The architecture description and benchmark evaluation constitute independent content.