pith. machine review for the scientific record.

Harnessing vision foundation models for high-performance, training-free open vocabulary segmentation

3 Pith papers cite this work. Polarity classification is still indexing.


fields: cs.CV (2), cs.RO (1)

years: 2026 (3)

verdicts: UNVERDICTED (3)

representative citing papers

FreeOcc: Training-Free Embodied Open-Vocabulary Occupancy Prediction

cs.RO · 2026-04-30 · unverdicted · novelty 6.0

FreeOcc enables training-free open-vocabulary 3D occupancy prediction from RGB-D sequences by combining SLAM, dense Gaussian maps, off-the-shelf vision-language models, and probabilistic projection, achieving over 2x gains on benchmarks and zero-shot transfer to novel scenes.

Cross-Attentive Multiview Fusion of Vision-Language Embeddings

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

CAMFusion fuses multiview 2D vision-language embeddings via cross-attention and multiview consistency self-supervision to produce better 3D semantic and instance representations, outperforming averaging and reaching SOTA on benchmarks including zero-shot out-of-domain cases.

citing papers explorer

Showing 3 of 3 citing papers.

  • FreeOcc: Training-Free Embodied Open-Vocabulary Occupancy Prediction
    cs.RO · 2026-04-30 · unverdicted · none · ref 62

    FreeOcc enables training-free open-vocabulary 3D occupancy prediction from RGB-D sequences by combining SLAM, dense Gaussian maps, off-the-shelf vision-language models, and probabilistic projection, achieving over 2x gains on benchmarks and zero-shot transfer to novel scenes.

  • RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments
    cs.CV · 2026-04-28 · unverdicted · none · ref 40

    RADIO-ViPE performs online open-vocabulary semantic SLAM directly from monocular RGB video in dynamic environments by tightly coupling vision-language embeddings from foundation models with geometric factor-graph optimization using adaptive robust kernels.

  • Cross-Attentive Multiview Fusion of Vision-Language Embeddings
    cs.CV · 2026-04-14 · unverdicted · none · ref 31

    CAMFusion fuses multiview 2D vision-language embeddings via cross-attention and multiview consistency self-supervision to produce better 3D semantic and instance representations, outperforming averaging and reaching SOTA on benchmarks including zero-shot out-of-domain cases.