SAM 3: Segment Anything with Concepts

Didac Suris, Laura Gustafson, Nicolas Carion, Ronghang Hu, Shoubhik Debnath, Yuan-Ting Hu · 2025 · cs.CV · arXiv 2511.16719

Mixed citation behavior. Most common role is method (58%).

217 Pith papers citing it

Method 58% of classified citations

abstract

We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., "yellow school bus"), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object instances. To advance PCS, we build a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives, across images and videos. Our model consists of an image-level detector and a memory-based video tracker that share a single backbone. Recognition and localization are decoupled with a presence head, which boosts detection accuracy. SAM 3 doubles the accuracy of existing systems in both image and video PCS, and improves previous SAM capabilities on visual segmentation tasks. We open source SAM 3 along with our new Segment Anything with Concepts (SA-Co) benchmark for promptable concept segmentation.

citation-role summary

method 29 · background 17 · baseline 4
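The 58% figure quoted for the method role appears to follow directly from these counts. A minimal sketch, assuming the percentage is taken over the 50 classified citations listed here (the page does not state the denominator, and the names `role_counts` and `role_shares` are illustrative only):

```python
# Counts copied from the citation-role summary above.
role_counts = {"method": 29, "background": 17, "baseline": 4}


def role_shares(counts):
    """Return each role's share of the classified citations, in whole percent."""
    total = sum(counts.values())  # assumed denominator: all classified citations
    return {role: round(100 * n / total) for role, n in counts.items()}


shares = role_shares(role_counts)
top_role = max(role_counts, key=role_counts.get)
print(f"{top_role}: {shares[top_role]}% of {sum(role_counts.values())} classified citations")
```

Under that assumption the most common role is method, at 29 of 50 classified citations.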

citation-polarity summary

use method 29 · background 15 · baseline 4 · unclear 2

representative citing papers

One Video, One World: Turning Monocular Video into Physical 4D Scenes

cs.CV · 2026-06-30 · unverdicted · novelty 8.0

OVOW reconstructs instance-level, simulation-ready 4D mesh scenes from monocular video via a four-stage training-free pipeline and introduces a new benchmark for structured Video-to-4D evaluation.

SpatialBench: Is Your Spatial Foundation Model an All-Round Player?

cs.CV · 2026-05-26 · unverdicted · novelty 8.0

SpatialBench evaluates 41 spatial foundation models across 6 paradigms and 5 task suites, finds they are not all-round players, and introduces the DA-Next-5M dataset plus DA-Next baseline model.

iMiGUE-3K: A Large-Scale Benchmark for Micro-Gesture Analysis with Self-Supervised Learning

cs.CV · 2026-05-16 · unverdicted · novelty 8.0

iMiGUE-3K is the largest in-the-wild micro-gesture video dataset with 3.4K clips and 37M frames from real interviews, supporting self-supervised foundation models and benchmarks that show micro-gestures improve emotion understanding.

Flame3D: Zero-shot Compositional Reasoning of 3D Scenes with Agentic Language Models

cs.CV · 2026-05-09 · unverdicted · novelty 8.0

Flame3D enables zero-shot compositional 3D scene reasoning by representing scenes as editable visual-textual memories exposed to agentic MLLMs through composable and synthesizable spatial tools.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

Open-Vocabulary and Referring Segmentation for 3D Gaussians Using 2D Detectors

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

GaussDet enables open-vocabulary and referring segmentation in 3D Gaussians by learning instance features and aggregating votes from 2D detectors, improving referential grounding by 16.7% mIoU in the zero-shot setting.

UnfoldArt: Zero-Shot Recovery of Full Articulated 3D Objects from Text or Image

cs.CV · 2026-06-29 · unverdicted · novelty 7.0 · 2 refs

UnfoldArt uses a two-round structured debate between high-level semantic agents and low-level parameter agents, grounded in generated video, to infer articulation and reconstruct full articulated 3D objects including occluded geometry from text or image inputs.

Beyond 2D Matching: A Unified Single-Stage Framework for Geometry-Aware Cross-View Object Geo-Localization

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

A new dataset of 220k+ cross-view pairs is introduced alongside GAGeo, a single-stage geometry-aware model based on the π³ 3D foundation model that outperforms prior methods on object geo-localization, with strong generalization and zero-shot ground-to-drone capability.

MemoBench: Benchmarking World Modeling in Dynamically Changing Environments

cs.CV · 2026-06-25 · unverdicted · novelty 7.0 · 2 refs

MemoBench is a new diagnostic benchmark with 360 synthetic and real clips plus VQA evaluation that tests memory consistency in video models under the disappear-and-reappear paradigm in dynamically changing environments.

Million-scale multimodal pollen microscopy with expert-guided foundation models

cs.CV · 2026-06-16 · accept · novelty 7.0

Releases Pollen AI Atlas, a million-scale multimodal pollen microscopy dataset with expert-guided VLM captions and baseline benchmarks for recognition and cross-regional retrieval.

$A^2$: Smaller Self-Supervised ViTs Localize Better than Larger Ones

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

Smaller self-supervised ViTs localize objects better via attention than larger ViTs, enabling A² to decouple localization from feature extraction for competitive performance on distribution-shifted benchmarks.

AVTrack: Audio-Visual Tracking in Human-centric Complex Scenes

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

Introduces the AVTrack dataset for audio-visual tracking in challenging human-centric scenes and shows that existing methods suffer performance drops on it.

Chameleon: Style-Content Disentangled Framework for Cross-Domain Object Compositing

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

Chameleon proposes the first large-scale cross-domain compositing dataset and a disentangled encoder plus gated diffusion transformer that outperforms prior in-domain and cross-domain methods on plausibility and fidelity.

Remembering by Reconstructing: Domain Incremental Learning With Test-Time Training on Video Streams

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

Domain-incremental video learning that permits forgetting through per-domain LoRA adapters and recovers the matching adapter at inference via test-time training on a self-supervised MAE reconstruction head.

Geometry Matters: 3D Foundation Priors for Learning Semantic Correspondence

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

A 3D-aware framework uses SAM3D geometry and pose estimation plus geodesic filtering to supervise a lightweight adapter on DINO and Stable Diffusion features, improving semantic correspondence with less manual supervision.

Toward Semantic-Agnostic and Shape-Aware Vision-Language Segmentation Models

cs.CV · 2026-05-27 · unverdicted · novelty 7.0

Introduces the SANSA paradigm for semantic-agnostic vision-language segmentation via dictionary or example-based prompts, with finetuning delivering up to 20% mIoU gains on the new task while retaining standard performance.

EchoPilot: Training-Free Ultrasound Video Segmentation via Scale-Space Semantic Prompting and Reliability-Gated Memory

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

EchoPilot delivers state-of-the-art training-free ultrasound video segmentation from a single point prompt by introducing scale-space semantic prompting via S.E.E.D. and reliability-gated memory updates.

Mosaic: Compositional Multi-Concept Erasure via Vector Field Blending

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

Mosaic is a framework for compositional multi-concept erasure in flow-based T2I models via spatial vector field blending without extra optimization, evaluated on the new CoME-Bench benchmark covering intra- and cross-category cases.

EM-Vid: Training-Free Entity-Centric Memory for Efficient and Consistent Multi-Shot Video Generation

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

EM-Vid introduces an entity-centric latent patch memory bank with sparse token conditioning and budgeted updates for training-free consistent multi-shot video generation.

COCOTree: A Dataset and Benchmark for Open Tree-Structured Visual Decomposition

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

COCOTree is a 21K-image benchmark with 1.8M nodes and an OTQ metric for the new task of open tree-structured visual decomposition.

VISTAQA: Benchmarking Joint Visual Question Answering and Pixel-Level Evidence

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

VISTAQA is a new benchmark for joint visual question answering correctness and pixel-level grounding, evaluated with the GROVE metric that uses per-sample geometric mean to require both dimensions to succeed.

Towards Camera-Robust 3D Localization: Equation-Anchored Tool-Use for MLLMs

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Proposes an equation-anchored tool-use method for MLLMs that writes the pinhole back-projection equation in Chain-of-Thought and substitutes retrieved camera intrinsics and depths to achieve robustness in 3D object detection and visual grounding under rescaled intrinsics.

GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions

cs.CV · 2026-05-15 · unverdicted · novelty 7.0

GRASP is a large-scale dataset and benchmark for social reasoning grounded in gaze and gesture events in multi-person videos, with Social Grounding Reward (SGR) proposed to improve model performance on GRASP-Bench.

ELDOR: A Dataset and Benchmark for Illegal Gold Mining in the Amazon Rainforest

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

Introduces the ELDOR UAV dataset and four benchmark tasks for semantic segmentation and classification of mining disturbances and ecological recovery in rainforest imagery.

citing papers explorer

Showing 29 of 29 citing papers after filters.

Flame3D: Zero-shot Compositional Reasoning of 3D Scenes with Agentic Language Models

cs.CV · 2026-05-09 · unverdicted · none · ref 40 · internal anchor

Flame3D enables zero-shot compositional 3D scene reasoning by representing scenes as editable visual-textual memories exposed to agentic MLLMs through composable and synthesizable spatial tools.

Grounding by Remembering: Cross-Scene and In-Scene Memory for 3D Functional Affordances

cs.CV · 2026-05-12 · unverdicted · none · ref 4 · internal anchor

AFFORDMEM improves AP50 by 3.23-3.7 points on SceneFun3D splits by using a reusable cross-scene affordance memory bank and in-scene spatial memory to guide VLMs toward actionable 3D regions.

TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models

cs.CV · 2026-05-11 · conditional · none · ref 5 · 2 links · internal anchor

TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.

OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

cs.RO · 2026-05-07 · unverdicted · none · ref 7 · internal anchor

OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

EO-Gym: A Multimodal, Interactive Environment for Earth Observation Agents

cs.AI · 2026-05-02 · unverdicted · none · ref 53 · internal anchor

EO-Gym supplies an executable multimodal environment and 9k-trajectory benchmark that turns Earth Observation into a tool-using, multi-step reasoning task, revealing that current VLMs struggle on temporal and cross-sensor workflows while fine-tuning lifts Pass@3 from 0.49 to 0.74.

SketchVLM: Vision language models can annotate images to explain thoughts and guide users

cs.CV · 2026-04-23 · unverdicted · none · ref 7 · internal anchor

SketchVLM lets VLMs generate non-destructive SVG annotations on input images to visually explain answers, raising visual reasoning accuracy by up to 28.5 points and annotation quality by 1.48x over baselines.

Geometrically Consistent Multi-View Scene Generation from Freehand Sketches

cs.CV · 2026-04-15 · unverdicted · none · ref 6 · internal anchor

A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in realism and consistency.

Seg2Change: Adapting Open-Vocabulary Semantic Segmentation Model for Remote Sensing Change Detection

cs.CV · 2026-04-13 · conditional · none · ref 8 · internal anchor

Seg2Change adapts open-vocabulary segmentation models to open-vocabulary change detection via a category-agnostic change head and new dataset CA-CDD, delivering +9.52 IoU on WHU-CD and +5.50 mIoU on SECOND.

RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details

cs.CV · 2026-04-08 · unverdicted · none · ref 6 · internal anchor

RefineAnything is a multimodal diffusion model using Focus-and-Refine crop-and-resize with blended paste-back to achieve high-fidelity local image refinement and near-perfect background preservation.

Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning

cs.CV · 2026-04-08 · unverdicted · none · ref 5 · internal anchor

A training-free Visual Chain-of-Thought framework reconstructs high-fidelity 3D meshes from single images and iteratively synthesizes optimal novel views to enhance MLLM spatial comprehension on benchmarks like 3DSRBench.

Focusable Monocular Depth Estimation

cs.CV · 2026-05-12 · unverdicted · none · ref 4 · internal anchor

FocusDepth is a prompt-conditioned framework that fuses SAM3 features into Depth Anything models via Multi-Scale Spatial-Aligned Fusion to improve target-region depth accuracy on the new FDE-Bench.

From Expansion to Consolidation: Socio-Spatial Contagion Dynamics in Off-Grid PV Adoption

econ.GN · 2026-05-10 · unverdicted · none · ref 17 · 2 links · internal anchor

Socio-spatial contagion in off-grid PV adoption is nearly ubiquitous, intensifies over time but peaks within 1-2 years, and shifts from range expansion to contraction as communities move from clustering to consolidation of installations.

SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models

cs.CV · 2026-05-08 · unverdicted · none · ref 11 · 2 links · internal anchor

SARA introduces semantic saliency to guide relational alignment in video diffusion models, improving text following and motion quality over prior alignment methods.

ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring

cs.CV · 2026-05-08 · unverdicted · none · ref 2 · 2 links · internal anchor

ChartREG++ benchmark for multi-target chart referring expression grounding with diverse clues plus a plotting-code synthesis pipeline for accurate masks that improves model performance.

Embody4D: A Generalist Data Engine for Embodied 4D World Modeling

cs.CV · 2026-05-03 · unverdicted · none · ref 8 · 2 links · internal anchor

Embody4D generates novel-view videos from monocular robot videos via a 3D-aware synthesis pipeline, confidence-aware expert modulation, and interaction-aware attention for embodied 4D world modeling.

CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

cs.CV · 2026-04-21 · unverdicted · none · ref 3 · internal anchor

CoInteract adds a human-aware mixture-of-experts and spatially-structured co-generation to a diffusion transformer to synthesize videos with stable structures and physically plausible human-object contacts.

ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment

cs.CV · 2026-04-12 · unverdicted · none · ref 7 · internal anchor

ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.

JARVIS: A Just-in-Time Augmented Reality VLM-Powered Instruction System for Cross-Reality Task Guidance

cs.HC · 2026-04-11 · unverdicted · none · ref 5 · 2 links · internal anchor

JARVIS delivers VLM-powered contextual AR guidance with state verification for cross-reality tasks, improving usability and success rates over baselines in a 14-person study.

PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing

cs.CV · 2026-04-08 · unverdicted · none · ref 8 · internal anchor

PhyEdit improves physical accuracy in image object manipulation by using explicit geometric simulation as 3D-aware guidance combined with joint 2D-3D supervision.

SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing

cs.CV · 2026-04-06 · unverdicted · none · ref 10 · internal anchor

SpatialEdit provides a benchmark, large synthetic dataset, and baseline model for precise object and camera spatial manipulations in images, with the model beating priors on spatial editing.

MG$^2$-RAG: Multi-Granularity Graph for Multimodal Retrieval-Augmented Generation

cs.IR · 2026-04-04 · unverdicted · none · ref 6 · internal anchor

MG²-RAG proposes a multi-granularity graph RAG framework that constructs hierarchical multimodal nodes via entity-driven visual grounding and performs structured retrieval, delivering SOTA results on four multimodal tasks with 43.3× faster graph construction.

EponaV2: Driving World Model with Comprehensive Future Reasoning

cs.CV · 2026-05-14 · unverdicted · none · ref 6 · internal anchor

EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.

iPay: Integrated Payment Action Recognition via Multimodal Networks and Adaptive Spatial Prior Learning

cs.CV · 2026-05-11 · unverdicted · none · ref 15 · internal anchor

iPay fuses RGB and skeleton expert streams via dual-attention and a prior-driven Spatial Difference Discriminator to reach 83.45% accuracy on 500+ real-world payment clips from onboard transit cameras.

MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding

cs.CV · 2026-04-10 · unverdicted · none · ref 5 · internal anchor

MAG-3D is a training-free multi-agent framework that coordinates planning, grounding, and coding agents with off-the-shelf VLMs to achieve grounded 3D reasoning and state-of-the-art benchmark results.

Position: Life-Logging Video Streams Make the Privacy-Utility Trade-off Inevitable

cs.CV · 2026-05-11 · unverdicted · none · ref 6 · internal anchor

Life-logging video streams create an inevitable privacy-utility trade-off that is a foundational challenge for always-on AI systems.

JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

cs.GR · 2026-05-05 · unverdicted · none · ref 12 · 2 links · internal anchor

JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.

Fringe Projection Based Vision Pipeline for Autonomous Hard Drive Disassembly

cs.CV · 2026-04-19 · unverdicted · none · ref 38 · internal anchor

An integrated fringe projection and AI pipeline delivers aligned high-accuracy 3D sensing and instance segmentation for autonomous HDD disassembly at 77.7 FPS.

HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds

cs.CV · 2026-04-15 · unverdicted · none · ref 9 · internal anchor

HY-World 2.0 generates and reconstructs high-fidelity navigable 3D Gaussian Splatting worlds from text, images, or videos via upgraded panorama, planning, expansion, and composition modules, with released code claiming open-source SOTA performance.

The Second Challenge on Cross-Domain Few-Shot Object Detection at NTIRE 2026: Methods and Results

cs.CV · 2026-04-13 · unverdicted · none · ref 9 · internal anchor

The NTIRE 2026 CD-FSOD Challenge report details innovative methods and performance results from 19 teams on cross-domain few-shot object detection in open- and closed-source tracks.
