pith. machine review for the scientific record.

arxiv: 2506.21546 · v4 · submitted 2025-06-26 · 💻 cs.CV · cs.AI · cs.CL · cs.LG

Recognition: unknown

Counterfactual Segmentation Reasoning: Diagnosing and Mitigating Pixel-Grounding Hallucination

Authors on Pith: no claims yet
classification: 💻 cs.CV · cs.AI · cs.CL · cs.LG
keywords: segmentation · counterfactual · hallucinations · hallucination · reasoning · abstain · entirely · evaluations
abstract

Segmentation Vision-Language Models (VLMs) have significantly advanced grounded visual understanding, yet they remain prone to pixel-grounding hallucinations, producing masks for incorrect objects or for objects that are entirely absent. Existing evaluations rely almost entirely on text- or label-based perturbations, which check only whether the predicted mask matches the queried label. Such evaluations overlook the spatial footprint and severity of hallucination and therefore fail to reveal vision-driven hallucinations, which are more challenging and more prevalent. To address this gap, we formalize the task of Counterfactual Segmentation Reasoning (CSR), where a model must segment the referenced object in the factual image and abstain in its counterfactual counterpart. To support this task, we curate HalluSegBench, the first large-scale benchmark to diagnose referring and reasoning expression segmentation hallucinations using controlled visual counterfactuals, alongside new evaluation metrics that measure hallucination severity and disentangle vision- and language-driven failure modes. We further introduce RobustSeg, a segmentation VLM trained with counterfactual fine-tuning (CFT) to learn when to segment and when to abstain. Experimental results confirm RobustSeg reduces hallucinations by 30%, while improving segmentation performance on FP-RefCOCO(+/g).
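The CSR task described in the abstract — segment the referenced object in the factual image, abstain in its counterfactual counterpart — implies a simple per-pair scoring loop. The sketch below is a hypothetical illustration of that protocol, not HalluSegBench's actual metrics; the function and field names (`csr_score`, `hallucination_area`) are assumptions for clarity.

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two boolean masks; defined as 1.0 when both are empty."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union else 1.0

def csr_score(pred_factual: np.ndarray,
              gt_factual: np.ndarray,
              pred_counterfactual: np.ndarray) -> dict:
    """Score one factual/counterfactual image pair.

    - factual image: the object exists, so reward mask quality (IoU)
    - counterfactual image: the object is absent, so every predicted
      pixel is hallucinated; a correct model abstains (empty mask)
    """
    seg_quality = iou(pred_factual, gt_factual)
    hallucinated = pred_counterfactual.sum() / pred_counterfactual.size
    return {
        "iou": seg_quality,
        "hallucination_area": float(hallucinated),  # fraction of image
        "abstained": bool(pred_counterfactual.sum() == 0),
    }
```

Scoring the spatial footprint of the counterfactual prediction (rather than only checking label agreement) is what lets a benchmark in this style measure hallucination severity, not just its presence.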

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. 3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding

    cs.CV 2026-04 unverdicted novelty 8.0

    3D-VCD reduces hallucinations in 3D-LLM embodied agents by contrasting predictions from original and distorted 3D scene representations at inference time.

  2. From Pixels to Concepts: Do Segmentation Models Understand What They Segment?

    cs.CV 2026-05 unverdicted novelty 7.0

    CAFE benchmark reveals that promptable segmentation models often produce correct masks for misleading prompts, showing a gap between localization accuracy and true concept understanding.