V$^{2}$-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence

· 2025 · cs.CV · arXiv 2511.20886

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

open full Pith review browse 2 citing papers arXiv PDF

abstract

Cross-view object correspondence, exemplified by the representative task of ego-exo object correspondence, aims to establish consistent associations of the same object across different viewpoints (e.g., egocentric and exocentric). This task poses significant challenges due to drastic viewpoint and appearance variations, making existing segmentation models, such as SAM2, difficult to apply directly. To address this, we present V2-SAM, a unified cross-view object correspondence framework that adapts SAM2 from single-view segmentation to cross-view correspondence through two complementary prompt generators. Specifically, the Cross-View Anchor Prompt Generator (V2-Anchor), built upon DINOv3 features, establishes geometry-aware correspondences and, for the first time, enables coordinate-based prompting for SAM2 in cross-view scenarios, while the Cross-View Visual Prompt Generator (V2-Visual) enhances appearance-guided cues via a novel visual prompt matcher that aligns ego-exo representations from both feature and structural perspectives. To effectively exploit the strengths of both prompts, we further adopt a multi-expert design and introduce a Post-hoc Cyclic Consistency Selector (PCCS) that adaptively selects the most reliable expert based on cyclic consistency. Extensive experiments validate the effectiveness of V2-SAM, achieving new state-of-the-art performance on Ego-Exo4D (ego-exo object correspondence), DAVIS-2017 (video object tracking), and HANDAL-X (robotic-ready cross-view correspondence).

representative citing papers

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

cs.CV · 2026-02-15 · unverdicted · novelty 8.0

EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.

CVSearch: Empowering Multimodal LLMs with Cognitive Visual Search for High-Resolution Image Perception

cs.CV · 2026-05-22 · unverdicted · novelty 5.0

CVSearch proposes an Assess-then-Search workflow combining expert-assisted search with Semantic Guided Adaptive Patching and Dynamic Bottom-Up Search to improve efficiency and accuracy on high-resolution image tasks for MLLMs.

citing papers explorer

Showing 2 of 2 citing papers.

EgoSound: Benchmarking Sound Understanding in Egocentric Videos cs.CV · 2026-02-15 · unverdicted · none · ref 26 · internal anchor
EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.
CVSearch: Empowering Multimodal LLMs with Cognitive Visual Search for High-Resolution Image Perception cs.CV · 2026-05-22 · unverdicted · none · ref 9 · internal anchor
CVSearch proposes an Assess-then-Search workflow combining expert-assisted search with Semantic Guided Adaptive Patching and Dynamic Bottom-Up Search to improve efficiency and accuracy on high-resolution image tasks for MLLMs.

V$^{2}$-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence

fields

years

verdicts

representative citing papers

citing papers explorer