SAM 3: Segment Anything with Concepts

Didac Suris, Laura Gustafson, Nicolas Carion, Ronghang Hu, Shoubhik Debnath, Yuan-Ting Hu · 2025 · cs.CV · arXiv 2511.16719

Mixed citation behavior. Most common role is method (58%).

306 Pith papers citing it

Method: 58% of classified citations

abstract

We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., "yellow school bus"), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object instances. To advance PCS, we build a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives, across images and videos. Our model consists of an image-level detector and a memory-based video tracker that share a single backbone. Recognition and localization are decoupled with a presence head, which boosts detection accuracy. SAM 3 doubles the accuracy of existing systems in both image and video PCS, and improves previous SAM capabilities on visual segmentation tasks. We open source SAM 3 along with our new Segment Anything with Concepts (SA-Co) benchmark for promptable concept segmentation.

citation-role summary

method 29 · background 17 · baseline 4
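
The 58% method share quoted for this paper is consistent with the role counts above. The short sketch below reproduces the arithmetic under the assumption that the percentage is taken over the three classified roles only (29 + 17 + 4 = 50 citations); the page does not state its formula, and the names `role_counts` and `role_shares` are illustrative, not part of any Pith tooling.

```python
# Counts copied from the citation-role summary on this page.
role_counts = {"method": 29, "background": 17, "baseline": 4}


def role_shares(counts):
    """Return each role's share of all classified citations, in whole percent.

    Assumption: the denominator is the sum of the listed role counts,
    not the total number of citing papers.
    """
    total = sum(counts.values())
    return {role: round(100 * n / total) for role, n in counts.items()}


shares = role_shares(role_counts)
print(shares)
```

Under this assumption only 50 of the 306 citing papers carry a classified role, so the 58% figure describes the classified subset rather than all citing papers.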

citation-polarity summary

use method 29 · background 15 · baseline 4 · unclear 2

representative citing papers

One Video, One World: Turning Monocular Video into Physical 4D Scenes

cs.CV · 2026-06-30 · unverdicted · novelty 8.0

OVOW reconstructs instance-level, simulation-ready 4D mesh scenes from monocular video via a four-stage training-free pipeline and introduces a new benchmark for structured Video-to-4D evaluation.

PCFootprint: A Large-Scale Dataset and Benchmark for Vectorized Building Footprint Extraction from Aerial LiDAR Point Clouds

cs.CV · 2026-06-18 · accept · novelty 8.0

PCFootprint is the first large-scale public dataset and benchmark for vectorized building footprint extraction from aerial LiDAR point clouds.

SpatialBench: Is Your Spatial Foundation Model an All-Round Player?

cs.CV · 2026-05-26 · unverdicted · novelty 8.0

SpatialBench evaluates 41 spatial foundation models across 6 paradigms and 5 task suites, finds they are not all-round players, and introduces the DA-Next-5M dataset plus DA-Next baseline model.

iMiGUE-3K: A Large-Scale Benchmark for Micro-Gesture Analysis with Self-Supervised Learning

cs.CV · 2026-05-16 · unverdicted · novelty 8.0

iMiGUE-3K is the largest in-the-wild micro-gesture video dataset with 3.4K clips and 37M frames from real interviews, supporting self-supervised foundation models and benchmarks that show micro-gestures improve emotion understanding.

Flame3D: Zero-shot Compositional Reasoning of 3D Scenes with Agentic Language Models

cs.CV · 2026-05-09 · unverdicted · novelty 8.0

Flame3D enables zero-shot compositional 3D scene reasoning by representing scenes as editable visual-textual memories exposed to agentic MLLMs through composable and synthesizable spatial tools.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

Seek to Segment: Active Perception for Panoramic Referring Segmentation

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

Introduces APRS task and PanoSeeker agent using VLM plus EgoSphere memory for active 360° search and segmentation, outperforming baselines on a new benchmark.

LongEgoRefer: A Benchmark for Long-Form Egocentric Video Referring Expression Comprehension

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

LongEgoRefer is a new benchmark of 1,498 referring expressions in egocentric videos averaging 45 minutes in length; it exposes the failure of existing Video REC models on sparse long-form spatio-temporal grounding.

Online Segment 3D Gaussians via Launching Virtual Drones

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

SAGO achieves setup-free interactive 3D Gaussian segmentation by modeling it as an online NBV planning task in a Markov process, delivering sub-second latency and over 50x speedup over prior setup-free methods.

MindEdit-Bench: Benchmarking Object-Level Counterfactual Spatial Reasoning in VLMs from In-the-Wild Photos

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

MindEdit-Bench introduces six spatial reasoning tasks from 120 private indoor photo triplets, with two new counterfactual editing tasks where VLMs score 8-31% against 81-97% human accuracy.

Open-Vocabulary and Referring Segmentation for 3D Gaussians Using 2D Detectors

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

GaussDet enables open-vocabulary and referring segmentation in 3D Gaussians by learning instance features and aggregating votes from 2D detectors, improving referential grounding by 16.7% mIoU in the zero-shot setting.

UnfoldArt: Zero-Shot Recovery of Full Articulated 3D Objects from Text or Image

cs.CV · 2026-06-29 · unverdicted · novelty 7.0 · 2 refs

UnfoldArt uses a two-round structured debate between high-level semantic agents and low-level parameter agents, grounded in generated video, to infer articulation and reconstruct full articulated 3D objects including occluded geometry from text or image inputs.

Beyond 2D Matching: A Unified Single-Stage Framework for Geometry-Aware Cross-View Object Geo-Localization

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

Introduces a new dataset of 220k+ cross-view pairs and GAGeo, a single-stage geometry-aware model based on the π³ 3D foundation model, which outperforms prior methods on object geo-localization with strong generalization and zero-shot ground-to-drone capability.

MemoBench: Benchmarking World Modeling in Dynamically Changing Environments

cs.CV · 2026-06-25 · unverdicted · novelty 7.0 · 4 refs

MemoBench is a new diagnostic benchmark with automated and VQA metrics that evaluates the memory consistency of video models when objects disappear and reappear in dynamic environments.

Memory Retrieval in Visuomotor Policies for Long-Horizon Robot Control

cs.RO · 2026-06-23 · unverdicted · novelty 7.0

HALO distills VLM priors via question-answering objectives and applies sparse attention to enable reliable memory retrieval from up to eight minutes of history in imitation-learned visuomotor policies.

Trustworthy Image Authentication using Forensic Knowledge Graphs

cs.CV · 2026-06-22 · unverdicted · novelty 7.0

Forensic Knowledge Graphs integrate forensic traces, causal dependencies, and scene links via a new authentication network and Iterative Context Refinement to outperform standard detectors and VLMs on detection, localization, and justification.

CheXpercept: A Benchmark for Evaluating Expert-Level Lesion Perception in Chest X-rays

cs.CV · 2026-06-19 · unverdicted · novelty 7.0

CheXpercept is a sequential multi-level perception benchmark showing VLMs perform adequately only on coarse lesion detection in chest X-rays while degrading sharply on finer tasks, with medical VLMs offering no advantage over general models.

Thinking in Boxes: 3D Editing in Real Images Made Easy

cs.CV · 2026-06-18 · unverdicted · novelty 7.0

A method that treats 3D box pairs as exact transformation specs, adds a depth-aware floor reference, and trains an image generator on synthetic scenes plus Objectron videos to perform large 3D edits on real photographs.

Intrinsic 4D Gaussian Segmentation from Scene Cues

cs.CV · 2026-06-17 · unverdicted · novelty 7.0

Intrinsic-GS recovers object-level segmentation in 4D Gaussian scenes from intrinsic cues alone via affinity graph and Leiden partitioning, reaching 0.746 mIoU on Neu3D and 0.575 on HyperNeRF without mask supervision.

Recover, Discover, Plan: Learning Skills and Concepts from Robot Failures

cs.RO · 2026-06-16 · unverdicted · novelty 7.0

ReSYNC learns recovery skills via RL then discovers and refines relational predicates to enable abstract planning that generalizes failure avoidance to unseen long-horizon tasks, outperforming baselines by over 50% in simulation and transferring to real robots.

Million-scale multimodal pollen microscopy with expert-guided foundation models

cs.CV · 2026-06-16 · accept · novelty 7.0

Releases Pollen AI Atlas, a million-scale multimodal pollen microscopy dataset with expert-guided VLM captions and baseline benchmarks for recognition and cross-regional retrieval.

MuseVLA: An Adaptive Multimodal Sensing Vision-Language-Action Model for Robotic Manipulation

cs.RO · 2026-06-16 · unverdicted · novelty 7.0

MuseVLA adds on-demand sensor selection via tokens and converts readings into grounded sensor images for multimodal fusion, reporting 80.6% average success on real-robot dexterous tasks that need non-visual sensing.

Reasoning Text-to-Video Retrieval for Operating Room Clips via Action-Driven Digital Twins

cs.CV · 2026-06-15 · conditional · novelty 7.0

OR3 converts OR clips to action-driven digital twins, uses LLM imagination for hypothetical ActDTs, and achieves 57.6 R@1 and 77.3 R@5 on 276 implicit queries from 386 robotic knee procedure clips, outperforming baselines.

Human Universal Grasping

cs.RO · 2026-06-15 · unverdicted · novelty 7.0

HUG trains a flow-matching model on a new 1M-frame egocentric human grasp dataset to generate retargetable grasps from single RGB-D images, beating baselines by 23-34% on a new 90-object benchmark.

citing papers explorer

Showing 7 of 7 citing papers after filters.

EO-Gym: A Multimodal, Interactive Environment for Earth Observation Agents

cs.AI · 2026-05-02 · unverdicted · none · ref 53

EO-Gym supplies an executable multimodal environment and 9k-trajectory benchmark that turns Earth Observation into a tool-using, multi-step reasoning task, revealing that current VLMs struggle on temporal and cross-sensor workflows while fine-tuning lifts Pass@3 from 0.49 to 0.74.

OCCAM: Open-set Causal Concept explAnation and Ontology induction for black-box vision Models

cs.AI · 2026-05-18 · unverdicted · none · ref 5

OCCAM discovers open-set visual concepts, estimates causal contributions via object-level interventions on black-box vision models, and induces a global concept ontology from aggregated dataset evidence.

Whispers in the Noise: Surrogate-Guided Concept Awakening via a Multi-Agent Framework

cs.AI · 2026-05-18 · unverdicted · none · ref 1

ConceptAgent is a black-box multi-agent system that awakens erased concepts in diffusion models by initializing denoising trajectories from surrogate-guided noisy states.

Low-cost concept-based localized explanations: How far can we get with training-free approaches?

cs.AI · 2026-06-27 · unverdicted · none · ref 13

Mid-scale MLLMs reach 62-88% object-level exact-match accuracy in zero-shot localized concept naming via closed-set prompting and an embedding-based Open-CoNa strategy across datasets.

AlloSpatial: Agentic Harness Framework for Spatial Reasoning in Foundation Models

cs.AI · 2026-06-08 · unverdicted · none · ref 6

AlloSpatial adds structured allocentric priors and a harness for tool-use and arbitration to improve spatial reasoning in foundation models, with 5-18% gains on VSI-Bench and MindCube in training-free settings and further gains after RL internalization.

MapAgent: An Industrial-Grade Agentic Framework for City-scale Lane-level Map Generation

cs.AI · 2026-06-03 · unverdicted · none · ref 5

MapAgent augments vectorized lane mapping with a bounded verification-driven agent loop that diagnoses specification violations and applies minimal edits, reaching over 95% automation in Baidu Maps production across 360+ cities.

DataEvolver: Let Your Data Build and Improve Itself via Goal-Driven Loop Agents

cs.AI · 2026-05-03 · unverdicted · none · ref 7

DataEvolver introduces a reusable framework with generation-time self-correction and validation-time self-expansion loops that improves visual datasets, shown to outperform baselines on an object-rotation task.
