SAM 3: Segment Anything with Concepts
103 papers cite this work.
abstract
We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., "yellow school bus"), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object instances. To advance PCS, we build a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives, across images and videos. Our model consists of an image-level detector and a memory-based video tracker that share a single backbone. Recognition and localization are decoupled with a presence head, which boosts detection accuracy. SAM 3 doubles the accuracy of existing systems in both image and video PCS, and improves previous SAM capabilities on visual segmentation tasks. We open source SAM 3 along with our new Segment Anything with Concepts (SA-Co) benchmark for promptable concept segmentation.
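The presence head admits a compact illustration. Below is a minimal sketch, assuming a hypothetical ConceptPrompt structure and PresenceGatedHead module (neither is the released SAM 3 API): a concept prompt pairs a short noun phrase with optional image exemplars, and a single image-level presence score gates the per-query localization scores, so recognition ("is the concept in this image at all?") is decoupled from localization ("which object queries match it?").

```python
# Minimal sketch of two ideas from the abstract: concept prompts and a
# presence head that decouples recognition from localization.
# All names below are illustrative assumptions, not the SAM 3 API.
from dataclasses import dataclass, field
from typing import List, Optional

import torch
import torch.nn as nn


@dataclass
class ConceptPrompt:
    """A PCS prompt: a short noun phrase, image exemplars, or both."""
    noun_phrase: Optional[str] = None           # e.g. "yellow school bus"
    exemplar_boxes: List[torch.Tensor] = field(default_factory=list)


class PresenceGatedHead(nn.Module):
    """Scores object queries for a concept, gated by an image-level
    presence score so localization cannot fire when recognition does not."""

    def __init__(self, dim: int):
        super().__init__()
        self.presence = nn.Linear(dim, 1)   # image-level: is the concept here?
        self.per_query = nn.Linear(dim, 1)  # per-query: does this query match?

    def forward(self, image_feat: torch.Tensor, queries: torch.Tensor) -> torch.Tensor:
        # image_feat: (B, dim) pooled image feature; queries: (B, Q, dim)
        presence = torch.sigmoid(self.presence(image_feat))          # (B, 1)
        match = torch.sigmoid(self.per_query(queries)).squeeze(-1)   # (B, Q)
        return presence * match  # final per-instance detection scores


# Toy usage: score 100 object queries against one pooled image feature.
head = PresenceGatedHead(dim=256)
scores = head(torch.randn(1, 256), torch.randn(1, 100, 256))  # shape (1, 100)
```

The gating term is where the abstract's accuracy claim plausibly lives: on a hard negative such as the phrase "yellow school bus" over a bus-free image, the presence score is driven toward zero and suppresses every query score at once, rather than relying on each query to reject the concept independently.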
103 representative citing papers (2026)
- Flame3D: Zero-shot Compositional Reasoning of 3D Scenes with Agentic Language Models
  Flame3D enables zero-shot compositional 3D scene reasoning by representing scenes as editable visual-textual memories exposed to agentic MLLMs through composable and synthesizable spatial tools.
- RevealLayer: Disentangling Hidden and Visible Layers via Occlusion-Aware Image Decomposition
  RevealLayer decomposes natural images into multiple RGBA layers using diffusion models with region-aware attention, occlusion-guided adaptation, and a composite loss, outperforming prior methods on a new benchmark dataset.
- Grounding by Remembering: Cross-Scene and In-Scene Memory for 3D Functional Affordances
  AFFORDMEM improves AP50 by 3.23-3.7 points on SceneFun3D splits by using a reusable cross-scene affordance memory bank and in-scene spatial memory to guide VLMs toward actionable 3D regions.
- ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models
  ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.
- TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
  TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.
- From Pixels to Concepts: Do Segmentation Models Understand What They Segment?
  CAFE benchmark reveals that promptable segmentation models often produce correct masks for misleading prompts, showing a gap between localization accuracy and true concept understanding.
- Relightable Gaussian Splatting for Virtual Production Using Image-Based Illumination
  A relightable Gaussian Splatting method for virtual production decomposes scenes into fixed appearance and variable lighting by parameterizing primitives to directly sample high-resolution background textures, enabling controllable relighting without physically-based rendering or far-field maps.
- ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse Referring Clues and Multi-Target Referring
  ChartREG++ creates a new multi-target chart grounding benchmark with diverse cues and a code-driven synthesis pipeline for accurate masks, yielding a model that outperforms baselines and generalizes to real ChartQA charts.
- Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding
  Qwen3-VL-Seg decodes MLLM bounding boxes into pixel-level referring segmentation via a lightweight box-guided mask decoder, new SA1B-ORS training data, and ORS-Bench evaluation, showing strong open-world performance.
- Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance
  Sparkle supplies a large-scale dataset and benchmark for instruction-driven video background replacement, enabling models that generate more natural and temporally consistent new scenes than earlier approaches.
- OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
  OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
- GA3T: A Ground-Aerial Terrain Traversability Dataset for Heterogeneous Robot Teams in Unstructured Environments
  GA3T is a new dataset of synchronized ground-aerial robot data in unstructured outdoor environments designed to support cross-view perception, traversability estimation, and collaborative scene understanding.
- EO-Gym: A Multimodal, Interactive Environment for Earth Observation Agents
  EO-Gym supplies an executable multimodal environment and 9k-trajectory benchmark that turns Earth Observation into a tool-using, multi-step reasoning task, revealing that current VLMs struggle on temporal and cross-sensor workflows while fine-tuning lifts Pass@3 from 0.49 to 0.74.
- SketchVLM: Vision language models can annotate images to explain thoughts and guide users
  SketchVLM lets VLMs generate non-destructive SVG annotations on input images to visually explain answers, raising visual reasoning accuracy by up to 28.5 points and annotation quality by 1.48x over baselines.
- AnimationBench: Are Video Models Good at Character-Centric Animation?
  AnimationBench is the first benchmark that operationalizes the twelve basic principles of animation and IP preservation into scalable, VLM-assisted metrics for animation-style I2V generation.
- HRDexDB: A Large-Scale Dataset of Dexterous Human and Robotic Hand Grasps
  HRDexDB is a multi-modal dataset of 1.4K human and robotic dexterous grasps across 100 objects, providing aligned 3D kinematics, high-resolution tactile data, and video streams.
- Geometrically Consistent Multi-View Scene Generation from Freehand Sketches
  A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in realism and consistency.
- VERITAS: Verifiable Epistemic Reasoning for Image-Derived Hypothesis Testing via Agentic Systems
  VERITAS is a multi-agent system for verifiable hypothesis testing on multimodal clinical MRI datasets that achieves 81.4% verdict accuracy with frontier models and introduces an epistemic evidence labeling framework.
- Online Reasoning Video Object Segmentation
  The work introduces the ORVOS task, the ORVOSB benchmark with causal annotations across 210 videos, and a baseline using updated prompts plus a temporal token reservoir.
- Seg2Change: Adapting Open-Vocabulary Semantic Segmentation Model for Remote Sensing Change Detection
  Seg2Change adapts open-vocabulary segmentation models to open-vocabulary change detection via a category-agnostic change head and new dataset CA-CDD, delivering +9.52 IoU on WHU-CD and +5.50 mIoU on SECOND.
- Semantic Manipulation Localization
  Defines SML task for localizing semantic edits and proposes TRACE framework with semantic anchoring, perturbation sensing, and constrained reasoning that outperforms prior IML methods on a custom benchmark.
- WildDet3D: Scaling Promptable 3D Detection in the Wild
  WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.
- Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation
  Tarot-SAM3 delivers a training-free pipeline for segmenting images from arbitrary referring expressions via expression reasoning prompts and DINOv3-based mask self-refinement.
- Open-Ended Video Game Glitch Detection with Agentic Reasoning and Temporal Grounding
  Introduces the first benchmark for open-ended video game glitch detection with temporal localization and proposes GliDe, an agentic framework that achieves stronger performance than vanilla multimodal models.
- RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details
  RefineAnything is a multimodal diffusion model using Focus-and-Refine crop-and-resize with blended paste-back to achieve high-fidelity local image refinement and near-perfect background preservation.
- Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning
  A training-free Visual Chain-of-Thought framework reconstructs high-fidelity 3D meshes from single images and iteratively synthesizes optimal novel views to enhance MLLM spatial comprehension on benchmarks like 3DSRBench.
- Are We Recognizing the Jaguar or Its Background? A Diagnostic Framework for Jaguar Re-Identification
  A new diagnostic framework using inpainted context ratios and laterality checks on a Pantanal jaguar benchmark reveals whether re-ID models depend on coat patterns or spurious background evidence.
- Generalized Small Object Detection: A Point-Prompted Paradigm and Benchmark
  The TinySet-9M dataset and DEAL point-prompted framework deliver a 31.4% relative AP75 gain over supervised baselines for small object detection, requiring only one click at inference and generalizing to unseen categories.
- VoxCor: Training-Free Volumetric Features for Multimodal Voxel Correspondence
  VoxCor creates reusable volumetric features from frozen 2D ViT models by combining triplanar inference with a closed-form weighted partial least squares projection, enabling direct voxel correspondence across modalities without training or registration.
- Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models
  GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.
- SID: Sliding into Distribution for Robust Few-Demonstration Manipulation
  SID achieves approximately 90% success on six real-world manipulation tasks with only two demonstrations under out-of-distribution initializations, with less than 10% performance drop under distractors and disturbances.
- Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation
  Pretrained instruction-based image editing models exhibit early foreground-background separability that enables a training-free framework for zero-shot referring image segmentation using a single denoising step.
- What to Ignore, What to React: Visually Robust RL Fine-Tuning of VLA Models
  PAIR-VLA adds invariance and sensitivity objectives over paired visual variants during PPO fine-tuning of VLA models, yielding 9-16% average gains on ManiSkill3 under distractors, textures, poses, viewpoints, and lighting shifts.
- Revealing the Gap in Human and VLM Scene Perception through Counterfactual Semantic Saliency
  VLMs exhibit size, center, and saliency biases in scene understanding, relying less on people than humans do, with size bias as a key driver of divergence.
- From Reaction to Anticipation: Proactive Failure Recovery through Agentic Task Graph for Robotic Manipulation
  AgentChord models manipulation tasks as directed graphs enriched with anticipatory recovery branches, using specialized agents to enable immediate, low-latency failure responses and improve success on long-horizon bimanual tasks.
- Focusable Monocular Depth Estimation
  FocusDepth is a prompt-conditioned framework that fuses SAM3 features into Depth Anything models via Multi-Scale Spatial-Aligned Fusion to improve target-region depth accuracy on the new FDE-Bench.
- Pixal3D: Pixel-Aligned 3D Generation from Images
  Pixal3D performs pixel-aligned 3D generation from images via back-projected multi-scale feature volumes, achieving fidelity close to reconstruction while supporting multi-view and scene synthesis.
- Geometric 4D Stitching for Grounded 4D Generation
  Geometric 4D Stitching explicitly complements missing geometric regions in 4D generated scenes with grounded stitches to achieve consistent 4D representations in under 10 minutes on a single GPU.
- From Expansion to Consolidation: Socio-Spatial Contagion Dynamics in Off-Grid PV Adoption
  Socio-spatial contagion in off-grid PV adoption is nearly ubiquitous, intensifies over time but peaks within 1-2 years, and shifts from range expansion to contraction as communities move from clustering to consolidation of installations.
- SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models
  SARA improves text alignment and motion quality in video diffusion models by routing token-relation distillation supervision to semantically salient pairs using a Stage-1 aligner trained with SAM masks and InfoNCE.
- PathPainter: Transferring the Generalization Ability of Image Generation Models to Embodied Navigation
  PathPainter transfers image generation models to embodied navigation by generating traversability masks from BEV images and language instructions while using cross-view localization to reduce odometry drift.
- 4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding
  4DThinker enables VLMs to perform dynamic spatial reasoning by internally simulating 4D imagery in latent space, outperforming prior text-based and modular approaches.
- ChartZero: Synthetic Priors Enable Zero-Shot Chart Data Extraction
  ChartZero achieves zero-shot line chart data extraction by training only on synthetic mathematical functions, using a Global Orthogonal Instance loss to prevent curve fragmentation and a VLM-guided strategy for legend matching.
- Affordance Agent Harness: Verification-Gated Skill Orchestration
  Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tradeoffs in open-world affordance grounding.
- Training-Free Tunnel Defect Inspection and Engineering Interpretation via Visual Recalibration and Entity Reconstruction
  TunnelMIND recalibrates language-guided defect proposals via dense visual consistency and reconstructs them into structured defect entities with attributes for severity grading and retrieval-grounded engineering reports, reporting F1 scores of 0.68, 0.78, and 0.72 on visible, GPR, and road defect tasks.
- MemOVCD: Training-Free Open-Vocabulary Change Detection via Cross-Temporal Memory Reasoning and Global-Local Adaptive Rectification
  MemOVCD reformulates change detection as cross-temporal memory reasoning with weighted bidirectional propagation and adaptive rectification to improve semantic change identification without task-specific training.
- Last-Layer-Centric Feature Recombination: Unleashing 3D Geometric Knowledge in DINOv3 for Monocular Depth Estimation
  Layer analysis of DINOv3 shows non-uniform 3D geometric knowledge concentrated in deeper layers, enabling a last-layer-centric recombination module that improves monocular depth estimation accuracy to state-of-the-art levels.
- Multiple Consistent 2D-3D Mappings for Robust Zero-Shot 3D Visual Grounding
  MCM-VG achieves state-of-the-art zero-shot 3D visual grounding on ScanRefer and Nr3D by creating consistent 2D-3D mappings across semantic, geometric, and viewpoint dimensions using LLMs and VLMs.
- Seeking Consensus: Geometric-Semantic On-the-Fly Recalibration for Open-Vocabulary Remote Sensing Semantic Segmentation
  SeeCo is a training-free on-the-fly recalibration method using multi-view geometric consistency and adaptive textual calibration to improve open-vocabulary semantic segmentation in remote sensing images.
- BridgeACT: Bridging Human Demonstrations to Robot Actions via Unified Tool-Target Affordances
  BridgeACT learns robot manipulation from human videos alone by predicting task-relevant grasp regions and 3D motion affordances that map directly to robot controllers.