{"total":13,"items":[{"citing_arxiv_id":"2605.23500","ref_index":62,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"B-GRTO: Bootstrapped Group Relative Tool Optimization for Referring Segmentation","primary_cat":"cs.CV","submitted_at":"2026-05-22T11:04:12+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19410","ref_index":76,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Vision Harnessing Agent for Open Ad-hoc Segmentation","primary_cat":"cs.CV","submitted_at":"2026-05-19T06:04:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16903","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WOW-Seg: A Word-free Open World Segmentation Model","primary_cat":"cs.CV","submitted_at":"2026-05-16T09:28:46+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WOW-Seg proposes a word-free open-world segmentation model using Mask2Token and Cascade Attention Mask modules, reporting 89.7 semantic similarity and 82.4 semantic IoU on LVIS with one-eighth the parameters of prior SOTA plus a new 7,662-class benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15951","ref_index":85,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding","primary_cat":"cs.CV","submitted_at":"2026-05-15T13:41:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A group-revision paradigm for GRPO-based RL fine-tuning of VLMs converts failure responses into improvement signals that refine rewards and advantages, yielding gains on referring segmentation, REC, and counting benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15670","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation","primary_cat":"cs.CV","submitted_at":"2026-04-17T03:48:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The work introduces the UAV Reasoning Segmentation task, the DRSeg benchmark dataset, and PixDLM as a baseline dual-path multimodal language model for reasoning-based segmentation in aerial imagery.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11411","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Online Reasoning Video Object Segmentation","primary_cat":"cs.CV","submitted_at":"2026-04-13T12:55:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The work introduces the ORVOS task, the ORVOSB benchmark with causal annotations across 210 videos, and a baseline using updated prompts plus a temporal token reservoir.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08626","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WildDet3D: Scaling Promptable 3D Detection in the Wild","primary_cat":"cs.CV","submitted_at":"2026-04-09T16:00:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"6, 24.8→47.2). Method Training data AP rare APcommon APfrequent AP3D Text Prompt 3D-MOOD [59] Omni3D 2.4 2.1 2.6 2.3 WildDet3D Omni3D 9.0 6.5 5.2 6.8 WildDet3D w/ depth Omni3D 23.0 21.5 16.1 20.7 WildDet3D Omni3D, Others, WildDet3D-Data 28.3 21.6 18.7 22.6 WildDet3D w/ depth Omni3D, Others, WildDet3D-Data47.4 40.7 37.2 41.6 Box Prompt OVMono3D-LIFT [60] Omni3D 7.4 8.8 5.1 7.7 DetAny3D [64] Omni3D, Others 9.9 7.4 6.3 7.8 WildDet3D Omni3D 12.0 7.9 5.3 8.4 WildDet3D w/ depth Omni3D 26.4 24.4 19.6 23.9 WildDet3D Omni3D, Others, WildDet3D-Data 30.0 24.2 20.3 24.8 WildDet3D w/ depth Omni3D, Others, WildDet3D-Data53.7 46.1 42.5 47.2 Interestingly, with GT depth the text-prompt setting (41.6 AP) is competitive with oracle (47."},{"citing_arxiv_id":"2604.07916","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation","primary_cat":"cs.CV","submitted_at":"2026-04-09T07:37:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Tarot-SAM3 delivers a training-free pipeline for segmenting images from arbitrary referring expressions via expression reasoning prompts and DINOv3-based mask self-refinement.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.16024","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Speak, Segment, Track, Navigate: An Interactive System for Video-Guided Skull-Base Surgery","primary_cat":"cs.CV","submitted_at":"2026-03-17T00:15:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A video-only speech-guided system for skull-base surgery segments and tracks instruments to deliver 2.32 mm tool-tip accuracy and rapid 3D model registration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.03054","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation","primary_cat":"cs.CV","submitted_at":"2026-01-06T14:37:50+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"IBISAgent enables MLLMs to perform iterative pixel-level visual reasoning for biomedical object referring and segmentation via text-based clicks and agentic RL, outperforming prior SOTA methods without model modifications.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.10554","ref_index":75,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Grounding Everything in Tokens for Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2025-12-11T11:38:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GETok partitions images with grid tokens and refines locations via offset tokens to enable better native 2D spatial reasoning in MLLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.12455","ref_index":72,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mitigating Object Hallucinations via Sentence-Level Early Intervention","primary_cat":"cs.CV","submitted_at":"2025-07-16T17:55:43+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SENTINEL reduces MLLM object hallucinations by over 90% via sentence-level early intervention with detector-bootstrapped preference data and C-DPO loss, outperforming prior SOTA on hallucination and capability benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.06520","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement","primary_cat":"cs.CV","submitted_at":"2025-03-09T08:48:51+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}