super hub

SAM 3: Segment Anything with Concepts

Didac Suris, Laura Gustafson, Nicolas Carion, Ronghang Hu, Shoubhik Debnath, Yuan-Ting Hu · 2025 · cs.CV · arXiv 2511.16719

103 Pith papers cite this work. Polarity classification is still indexing.

103 Pith papers citing it

open full Pith review browse 103 citing papers more from Didac Suris arXiv PDF

abstract

We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., "yellow school bus"), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object instances. To advance PCS, we build a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives, across images and videos. Our model consists of an image-level detector and a memory-based video tracker that share a single backbone. Recognition and localization are decoupled with a presence head, which boosts detection accuracy. SAM 3 doubles the accuracy of existing systems in both image and video PCS, and improves previous SAM capabilities on visual segmentation tasks. We open source SAM 3 along with our new Segment Anything with Concepts (SA-Co) benchmark for promptable concept segmentation.

hub tools

JSON dossier citing papers JSON arXiv source

claims ledger

abstract We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., "yellow school bus"), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object instances. To advance PCS, we build a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives, across images and videos. Our model consists of

authors

Didac Suris Laura Gustafson Nicolas Carion Ronghang Hu Shoubhik Debnath Yuan-Ting Hu

co-cited works

representative citing papers

Flame3D: Zero-shot Compositional Reasoning of 3D Scenes with Agentic Language Models

cs.CV · 2026-05-09 · unverdicted · novelty 8.0

Flame3D enables zero-shot compositional 3D scene reasoning by representing scenes as editable visual-textual memories exposed to agentic MLLMs through composable and synthesizable spatial tools.

RevealLayer: Disentangling Hidden and Visible Layers via Occlusion-Aware Image Decomposition

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

RevealLayer decomposes natural images into multiple RGBA layers using diffusion models with region-aware attention, occlusion-guided adaptation, and a composite loss, outperforming prior methods on a new benchmark dataset.

Grounding by Remembering: Cross-Scene and In-Scene Memory for 3D Functional Affordances

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

AFFORDMEM improves AP50 by 3.23-3.7 points on SceneFun3D splits by using a reusable cross-scene affordance memory bank and in-scene spatial memory to guide VLMs toward actionable 3D regions.

ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.

TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models

cs.CV · 2026-05-11 · conditional · novelty 7.0 · 2 refs

TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.

From Pixels to Concepts: Do Segmentation Models Understand What They Segment?

cs.CV · 2026-05-10 · unverdicted · novelty 7.0

CAFE benchmark reveals that promptable segmentation models often produce correct masks for misleading prompts, showing a gap between localization accuracy and true concept understanding.

Relightable Gaussian Splatting for Virtual Production Using Image-Based Illumination

cs.CV · 2026-05-09 · unverdicted · novelty 7.0

A relightable Gaussian Splatting method for virtual production decomposes scenes into fixed appearance and variable lighting by parameterizing primitives to directly sample high-resolution background textures, enabling controllable relighting without physically-based rendering or far-field maps.

ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

ChartREG++ creates a new multi-target chart grounding benchmark with diverse cues and a code-driven synthesis pipeline for accurate masks, yielding a model that outperforms baselines and generalizes to real ChartQA charts.

Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

Qwen3-VL-Seg decodes MLLM bounding boxes into pixel-level referring segmentation via a lightweight box-guided mask decoder, new SA1B-ORS training data, and ORS-Bench evaluation, showing strong open-world performance.

Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance

cs.CV · 2026-05-07 · unverdicted · novelty 7.0

Sparkle supplies a large-scale dataset and benchmark for instruction-driven video background replacement, enabling models that generate more natural and temporally consistent new scenes than earlier approaches.

OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

cs.RO · 2026-05-07 · unverdicted · novelty 7.0

OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

GA3T: A Ground-Aerial Terrain Traversability Dataset for Heterogeneous Robot Teams in Unstructured Environments

cs.RO · 2026-05-07 · accept · novelty 7.0

GA3T is a new dataset of synchronized ground-aerial robot data in unstructured outdoor environments designed to support cross-view perception, traversability estimation, and collaborative scene understanding.

EO-Gym: A Multimodal, Interactive Environment for Earth Observation Agents

cs.AI · 2026-05-02 · unverdicted · novelty 7.0

EO-Gym supplies an executable multimodal environment and 9k-trajectory benchmark that turns Earth Observation into a tool-using, multi-step reasoning task, revealing that current VLMs struggle on temporal and cross-sensor workflows while fine-tuning lifts Pass@3 from 0.49 to 0.74.

SketchVLM: Vision language models can annotate images to explain thoughts and guide users

cs.CV · 2026-04-23 · unverdicted · novelty 7.0

SketchVLM lets VLMs generate non-destructive SVG annotations on input images to visually explain answers, raising visual reasoning accuracy by up to 28.5 points and annotation quality by 1.48x over baselines.

AnimationBench: Are Video Models Good at Character-Centric Animation?

cs.CV · 2026-04-16 · unverdicted · novelty 7.0

AnimationBench is the first benchmark that operationalizes the twelve basic principles of animation and IP preservation into scalable, VLM-assisted metrics for animation-style I2V generation.

HRDexDB: A Large-Scale Dataset of Dexterous Human and Robotic Hand Grasps

cs.RO · 2026-04-16 · unverdicted · novelty 7.0

HRDexDB is a multi-modal dataset of 1.4K human and robotic dexterous grasps across 100 objects, providing aligned 3D kinematics, high-resolution tactile data, and video streams.

Geometrically Consistent Multi-View Scene Generation from Freehand Sketches

cs.CV · 2026-04-15 · unverdicted · novelty 7.0

A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in realism and consistency.

VERITAS: Verifiable Epistemic Reasoning for Image-Derived Hypothesis Testing via Agentic Systems

cs.MA · 2026-04-13 · unverdicted · novelty 7.0

VERITAS is a multi-agent system for verifiable hypothesis testing on multimodal clinical MRI datasets that achieves 81.4% verdict accuracy with frontier models and introduces an epistemic evidence labeling framework.

Online Reasoning Video Object Segmentation

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

The work introduces the ORVOS task, the ORVOSB benchmark with causal annotations across 210 videos, and a baseline using updated prompts plus a temporal token reservoir.

Seg2Change: Adapting Open-Vocabulary Semantic Segmentation Model for Remote Sensing Change Detection

cs.CV · 2026-04-13 · conditional · novelty 7.0

Seg2Change adapts open-vocabulary segmentation models to open-vocabulary change detection via a category-agnostic change head and new dataset CA-CDD, delivering +9.52 IoU on WHU-CD and +5.50 mIoU on SECOND.

Semantic Manipulation Localization

cs.CV · 2026-04-11 · unverdicted · novelty 7.0

Defines SML task for localizing semantic edits and proposes TRACE framework with semantic anchoring, perturbation sensing, and constrained reasoning that outperforms prior IML methods on a custom benchmark.

WildDet3D: Scaling Promptable 3D Detection in the Wild

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.

Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Tarot-SAM3 delivers a training-free pipeline for segmenting images from arbitrary referring expressions via expression reasoning prompts and DINOv3-based mask self-refinement.

Open-Ended Video Game Glitch Detection with Agentic Reasoning and Temporal Grounding

cs.MA · 2026-04-09 · unverdicted · novelty 7.0

Introduces the first benchmark for open-ended video game glitch detection with temporal localization and proposes GliDe, an agentic framework that achieves stronger performance than vanilla multimodal models.

citing papers explorer

Showing 50 of 103 citing papers.

Flame3D: Zero-shot Compositional Reasoning of 3D Scenes with Agentic Language Models cs.CV · 2026-05-09 · unverdicted · none · ref 40 · internal anchor
Flame3D enables zero-shot compositional 3D scene reasoning by representing scenes as editable visual-textual memories exposed to agentic MLLMs through composable and synthesizable spatial tools.
RevealLayer: Disentangling Hidden and Visible Layers via Occlusion-Aware Image Decomposition cs.CV · 2026-05-12 · unverdicted · none · ref 37 · internal anchor
RevealLayer decomposes natural images into multiple RGBA layers using diffusion models with region-aware attention, occlusion-guided adaptation, and a composite loss, outperforming prior methods on a new benchmark dataset.
Grounding by Remembering: Cross-Scene and In-Scene Memory for 3D Functional Affordances cs.CV · 2026-05-12 · unverdicted · none · ref 4 · internal anchor
AFFORDMEM improves AP50 by 3.23-3.7 points on SceneFun3D splits by using a reusable cross-scene affordance memory bank and in-scene spatial memory to guide VLMs toward actionable 3D regions.
ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models cs.CV · 2026-05-11 · unverdicted · none · ref 6 · internal anchor
ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.
TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models cs.CV · 2026-05-11 · conditional · none · ref 5 · 2 links · internal anchor
TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.
From Pixels to Concepts: Do Segmentation Models Understand What They Segment? cs.CV · 2026-05-10 · unverdicted · none · ref 1 · internal anchor
CAFE benchmark reveals that promptable segmentation models often produce correct masks for misleading prompts, showing a gap between localization accuracy and true concept understanding.
Relightable Gaussian Splatting for Virtual Production Using Image-Based Illumination cs.CV · 2026-05-09 · unverdicted · none · ref 90 · internal anchor
A relightable Gaussian Splatting method for virtual production decomposes scenes into fixed appearance and variable lighting by parameterizing primitives to directly sample high-resolution background textures, enabling controllable relighting without physically-based rendering or far-field maps.
ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring cs.CV · 2026-05-08 · unverdicted · none · ref 2 · internal anchor
ChartREG++ creates a new multi-target chart grounding benchmark with diverse cues and a code-driven synthesis pipeline for accurate masks, yielding a model that outperforms baselines and generalizes to real ChartQA charts.
Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding cs.CV · 2026-05-08 · unverdicted · none · ref 2 · internal anchor
Qwen3-VL-Seg decodes MLLM bounding boxes into pixel-level referring segmentation via a lightweight box-guided mask decoder, new SA1B-ORS training data, and ORS-Bench evaluation, showing strong open-world performance.
Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance cs.CV · 2026-05-07 · unverdicted · none · ref 4 · internal anchor
Sparkle supplies a large-scale dataset and benchmark for instruction-driven video background replacement, enabling models that generate more natural and temporally consistent new scenes than earlier approaches.
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation cs.RO · 2026-05-07 · unverdicted · none · ref 7 · internal anchor
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
GA3T: A Ground-Aerial Terrain Traversability Dataset for Heterogeneous Robot Teams in Unstructured Environments cs.RO · 2026-05-07 · accept · none · ref 3 · internal anchor
GA3T is a new dataset of synchronized ground-aerial robot data in unstructured outdoor environments designed to support cross-view perception, traversability estimation, and collaborative scene understanding.
EO-Gym: A Multimodal, Interactive Environment for Earth Observation Agents cs.AI · 2026-05-02 · unverdicted · none · ref 53 · internal anchor
EO-Gym supplies an executable multimodal environment and 9k-trajectory benchmark that turns Earth Observation into a tool-using, multi-step reasoning task, revealing that current VLMs struggle on temporal and cross-sensor workflows while fine-tuning lifts Pass@3 from 0.49 to 0.74.
SketchVLM: Vision language models can annotate images to explain thoughts and guide users cs.CV · 2026-04-23 · unverdicted · none · ref 7 · internal anchor
SketchVLM lets VLMs generate non-destructive SVG annotations on input images to visually explain answers, raising visual reasoning accuracy by up to 28.5 points and annotation quality by 1.48x over baselines.
AnimationBench: Are Video Models Good at Character-Centric Animation? cs.CV · 2026-04-16 · unverdicted · none · ref 4 · internal anchor
AnimationBench is the first benchmark that operationalizes the twelve basic principles of animation and IP preservation into scalable, VLM-assisted metrics for animation-style I2V generation.
HRDexDB: A Large-Scale Dataset of Dexterous Human and Robotic Hand Grasps cs.RO · 2026-04-16 · unverdicted · none · ref 31 · internal anchor
HRDexDB is a multi-modal dataset of 1.4K human and robotic dexterous grasps across 100 objects, providing aligned 3D kinematics, high-resolution tactile data, and video streams.
Geometrically Consistent Multi-View Scene Generation from Freehand Sketches cs.CV · 2026-04-15 · unverdicted · none · ref 6 · internal anchor
A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in realism and consistency.
VERITAS: Verifiable Epistemic Reasoning for Image-Derived Hypothesis Testing via Agentic Systems cs.MA · 2026-04-13 · unverdicted · none · ref 5 · internal anchor
VERITAS is a multi-agent system for verifiable hypothesis testing on multimodal clinical MRI datasets that achieves 81.4% verdict accuracy with frontier models and introduces an epistemic evidence labeling framework.
Online Reasoning Video Object Segmentation cs.CV · 2026-04-13 · unverdicted · none · ref 8 · internal anchor
The work introduces the ORVOS task, the ORVOSB benchmark with causal annotations across 210 videos, and a baseline using updated prompts plus a temporal token reservoir.
Seg2Change: Adapting Open-Vocabulary Semantic Segmentation Model for Remote Sensing Change Detection cs.CV · 2026-04-13 · conditional · none · ref 8 · internal anchor
Seg2Change adapts open-vocabulary segmentation models to open-vocabulary change detection via a category-agnostic change head and new dataset CA-CDD, delivering +9.52 IoU on WHU-CD and +5.50 mIoU on SECOND.
Semantic Manipulation Localization cs.CV · 2026-04-11 · unverdicted · none · ref 47 · internal anchor
Defines SML task for localizing semantic edits and proposes TRACE framework with semantic anchoring, perturbation sensing, and constrained reasoning that outperforms prior IML methods on a custom benchmark.
WildDet3D: Scaling Promptable 3D Detection in the Wild cs.CV · 2026-04-09 · unverdicted · none · ref 9 · internal anchor
WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.
Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation cs.CV · 2026-04-09 · unverdicted · none · ref 2 · internal anchor
Tarot-SAM3 delivers a training-free pipeline for segmenting images from arbitrary referring expressions via expression reasoning prompts and DINOv3-based mask self-refinement.
Open-Ended Video Game Glitch Detection with Agentic Reasoning and Temporal Grounding cs.MA · 2026-04-09 · unverdicted · none · ref 8 · internal anchor
Introduces the first benchmark for open-ended video game glitch detection with temporal localization and proposes GliDe, an agentic framework that achieves stronger performance than vanilla multimodal models.
RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details cs.CV · 2026-04-08 · unverdicted · none · ref 6 · internal anchor
RefineAnything is a multimodal diffusion model using Focus-and-Refine crop-and-resize with blended paste-back to achieve high-fidelity local image refinement and near-perfect background preservation.
Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning cs.CV · 2026-04-08 · unverdicted · none · ref 5 · internal anchor
A training-free Visual Chain-of-Thought framework reconstructs high-fidelity 3D meshes from single images and iteratively synthesizes optimal novel views to enhance MLLM spatial comprehension on benchmarks like 3DSRBench.
Are We Recognizing the Jaguar or Its Background? A Diagnostic Framework for Jaguar Re-Identification cs.CV · 2026-04-06 · unverdicted · none · ref 25 · internal anchor
A new diagnostic framework using inpainted context ratios and laterality checks on a Pantanal jaguar benchmark reveals whether re-ID models depend on coat patterns or spurious background evidence.
Generalized Small Object Detection:A Point-Prompted Paradigm and Benchmark cs.CV · 2026-04-03 · unverdicted · none · ref 37 · internal anchor
TinySet-9M dataset and DEAL point-prompted framework deliver 31.4% relative AP75 gain over supervised baselines for small object detection with one click at inference and generalization to unseen categories.
VoxCor: Training-Free Volumetric Features for Multimodal Voxel Correspondence cs.CV · 2026-05-13 · unverdicted · none · ref 18 · internal anchor
VoxCor creates reusable volumetric features from frozen 2D ViT models by combining triplanar inference with a closed-form weighted partial least squares projection, enabling direct voxel correspondence across modalities without training or registration.
Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models cs.RO · 2026-05-13 · conditional · none · ref 5 · internal anchor
GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.
SID: Sliding into Distribution for Robust Few-Demonstration Manipulation cs.RO · 2026-05-13 · unverdicted · none · ref 2 · internal anchor
SID achieves approximately 90% success on six real-world manipulation tasks with only two demonstrations under out-of-distribution initializations, with less than 10% performance drop under distractors and disturbances.
Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation cs.CV · 2026-05-13 · unverdicted · none · ref 4 · internal anchor
Pretrained instruction-based image editing models exhibit early foreground-background separability that enables a training-free framework for zero-shot referring image segmentation using a single denoising step.
What to Ignore, What to React: Visually Robust RL Fine-Tuning of VLA Models cs.RO · 2026-05-13 · conditional · none · ref 42 · internal anchor
PAIR-VLA adds invariance and sensitivity objectives over paired visual variants during PPO fine-tuning of VLA models, yielding 9-16% average gains on ManiSkill3 under distractors, textures, poses, viewpoints, and lighting shifts.
Revealing the Gap in Human and VLM Scene Perception through Counterfactual Semantic Saliency cs.CV · 2026-05-13 · conditional · none · ref 6 · internal anchor
VLMs exhibit size, center, and saliency biases in scene understanding, relying less on people than humans do, with size bias as a key driver of divergence.
From Reaction to Anticipation: Proactive Failure Recovery through Agentic Task Graph for Robotic Manipulation cs.RO · 2026-05-12 · unverdicted · none · ref 6 · internal anchor
AgentChord models manipulation tasks as directed graphs enriched with anticipatory recovery branches, using specialized agents to enable immediate, low-latency failure responses and improve success on long-horizon bimanual tasks.
Focusable Monocular Depth Estimation cs.CV · 2026-05-12 · unverdicted · none · ref 4 · internal anchor
FocusDepth is a prompt-conditioned framework that fuses SAM3 features into Depth Anything models via Multi-Scale Spatial-Aligned Fusion to improve target-region depth accuracy on the new FDE-Bench.
Pixal3D: Pixel-Aligned 3D Generation from Images cs.CV · 2026-05-11 · unverdicted · none · ref 66 · internal anchor
Pixal3D performs pixel-aligned 3D generation from images via back-projected multi-scale feature volumes, achieving fidelity close to reconstruction while supporting multi-view and scene synthesis.
Geometric 4D Stitching for Grounded 4D Generation cs.CV · 2026-05-11 · unverdicted · none · ref 25 · internal anchor
Geometric 4D Stitching explicitly complements missing geometric regions in 4D generated scenes with grounded stitches to achieve consistent 4D representations in under 10 minutes on a single GPU.
From Expansion to Consolidation: Socio-Spatial Contagion Dynamics in Off-Grid PV Adoption econ.GN · 2026-05-10 · unverdicted · none · ref 17 · 2 links · internal anchor
Socio-spatial contagion in off-grid PV adoption is nearly ubiquitous, intensifies over time but peaks within 1-2 years, and shifts from range expansion to contraction as communities move from clustering to consolidation of installations.
SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models cs.CV · 2026-05-08 · unverdicted · none · ref 11 · internal anchor
SARA improves text alignment and motion quality in video diffusion models by routing token-relation distillation supervision to semantically salient pairs using a Stage-1 aligner trained with SAM masks and InfoNCE.
PathPainter: Transferring the Generalization Ability of Image Generation Models to Embodied Navigation cs.RO · 2026-05-08 · unverdicted · none · ref 28 · internal anchor
PathPainter transfers image generation models to embodied navigation by generating traversability masks from BEV images and language instructions while using cross-view localization to reduce odometry drift.
4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding cs.CV · 2026-05-07 · unverdicted · none · ref 3 · internal anchor
4DThinker enables VLMs to perform dynamic spatial reasoning by internally simulating 4D imagery in latent space, outperforming prior text-based and modular approaches.
ChartZero: Synthetic Priors Enable Zero Shot Chart Data Extraction cs.CV · 2026-05-07 · unverdicted · none · ref 5 · internal anchor
ChartZero achieves zero-shot line chart data extraction by training only on synthetic mathematical functions, using a Global Orthogonal Instance loss to prevent curve fragmentation and a VLM-guided strategy for legend matching.
Affordance Agent Harness: Verification-Gated Skill Orchestration cs.RO · 2026-05-01 · unverdicted · none · ref 8 · 2 links · internal anchor
Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tradeoffs in open-world affordance grounding.
Training-Free Tunnel Defect Inspection and Engineering Interpretation via Visual Recalibration and Entity Reconstruction cs.CV · 2026-04-30 · unverdicted · none · ref 3 · internal anchor
TunnelMIND recalibrates language-guided defect proposals via dense visual consistency and reconstructs them into structured defect entities with attributes for severity grading and retrieval-grounded engineering reports, reporting F1 scores of 0.68, 0.78, and 0.72 on visible, GPR, and road defect任务.
MemOVCD: Training-Free Open-Vocabulary Change Detection via Cross-Temporal Memory Reasoning and Global-Local Adaptive Rectification cs.CV · 2026-04-29 · unverdicted · none · ref 1 · internal anchor
MemOVCD reformulates change detection as cross-temporal memory reasoning with weighted bidirectional propagation and adaptive rectification to improve semantic change identification without task-specific training.
Last-Layer-Centric Feature Recombination: Unleashing 3D Geometric Knowledge in DINOv3 for Monocular Depth Estimation cs.CV · 2026-04-29 · unverdicted · none · ref 5 · internal anchor
Layer analysis of DINOv3 shows non-uniform 3D geometric knowledge concentrated in deeper layers, enabling a last-layer-centric recombination module that improves monocular depth estimation accuracy to state-of-the-art levels.
Multiple Consistent 2D-3D Mappings for Robust Zero-Shot 3D Visual Grounding cs.CV · 2026-04-29 · unverdicted · none · ref 4 · internal anchor
MCM-VG achieves state-of-the-art zero-shot 3D visual grounding on ScanRefer and Nr3D by creating consistent 2D-3D mappings across semantic, geometric, and viewpoint dimensions using LLMs and VLMs.
Seeking Consensus: Geometric-Semantic On-the-Fly Recalibration for Open-Vocabulary Remote Sensing Semantic Segmentation cs.CV · 2026-04-29 · unverdicted · none · ref 5 · internal anchor
SeeCo is a training-free on-the-fly recalibration method using multi-view geometric consistency and adaptive textual calibration to improve open-vocabulary semantic segmentation in remote sensing images.
BridgeACT: Bridging Human Demonstrations to Robot Actions via Unified Tool-Target Affordances cs.RO · 2026-04-25 · unverdicted · none · ref 20 · internal anchor
BridgeACT learns robot manipulation from human videos alone by predicting task-relevant grasp regions and 3D motion affordances that map directly to robot controllers.