Don't Guess, Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification

Haichao Jiang; Jian-Fang Hu; Quan Zhang; Tianming Liang; Yuting Yang

arxiv: 2605.17531 · v2 · pith:IWK3ZIQBnew · submitted 2026-05-17 · 💻 cs.CV


Pith reviewed 2026-05-20 13:33 UTC · model grok-4.3

classification 💻 cs.CV

keywords referring segmentation · ambiguity · multi-turn dialogue · clarification · agentic framework · video object segmentation · hierarchical optimization · intent resolution


The pith

A multi-turn clarification framework resolves ambiguity in referring segmentation by asking questions instead of guessing user intent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that referring segmentation systems need not guess at ambiguous user queries: they can instead hold a multi-turn conversation to clarify the intended target object. This matters because real users often give imprecise descriptions, which lead current models to produce incorrect segmentations. The authors introduce IC-Seg, an agentic system that performs this clarification, and Hi-GRPO, a hierarchical optimization strategy that provides dense supervision at the trajectory, turn, and step levels to keep the dialogue efficient. They also create the Ambi-RVOS benchmark to evaluate such ambiguous scenarios. If correct, this shifts the paradigm from one-shot guessing to interactive intent resolution in vision-language segmentation tasks.

Core claim

IC-Seg is a novel agentic framework that proactively clarifies user intent through multi-turn conversation before performing segmentation on images or videos. To train this capability, Hi-GRPO injects dense and informative supervision signals at the trajectory, turn, and step levels to encourage efficient intent clarification, effectively eliminating redundant interactions and improving overall dialogue quality. This leads to superior performance in resolving ambiguous queries on the new Ambi-RVOS benchmark while retaining state-of-the-art results on standard reasoning segmentation benchmarks.

What carries the argument

IC-Seg agentic framework for multi-turn intent clarification in referring segmentation, driven by the Hi-GRPO hierarchical optimization strategy that provides dense supervision at trajectory, turn, and step levels.
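
As a rough illustration of how three reward levels could feed a GRPO-style update, the Python sketch below sums hypothetical trajectory, turn, and step rewards into one scalar per rollout and then standardizes that scalar within a group of rollouts sampled for the same query. The group standardization is the usual GRPO advantage; the weighted sum, the weights, the function names, and every number are assumptions made for illustration, since the abstract does not say how Hi-GRPO combines or assigns its signals.

```python
from statistics import mean, pstdev


def combine_rewards(r_traj, r_turn, r_step, w_traj=1.0, w_turn=1.0, w_step=1.0):
    """Weighted sum of three reward levels (hypothetical weights, not from the paper)."""
    return w_traj * r_traj + w_turn * r_turn + w_step * r_step


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each rollout's reward against its own group, as in GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for the same ambiguous query; every number is made up.
group = [
    combine_rewards(r_traj=1.0, r_turn=0.5, r_step=0.2),
    combine_rewards(r_traj=1.0, r_turn=0.1, r_step=0.1),
    combine_rewards(r_traj=0.0, r_turn=0.4, r_step=0.2),
    combine_rewards(r_traj=0.0, r_turn=0.0, r_step=0.0),
]
advantages = group_relative_advantages(group)
print(advantages)
```

In this toy setup, two rollouts with the same final segmentation reward still receive different advantages when one of them clarifies more efficiently, which is the kind of denser signal the hierarchical design is said to provide.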

Load-bearing premise

Users will engage with and benefit from multi-turn clarification in practice, and the Hi-GRPO strategy will provide effective dense supervision without introducing dialogue inefficiencies or new failure modes.

What would settle it

If evaluations on Ambi-RVOS show that IC-Seg does not outperform baselines by a large margin, or if dialogue-quality metrics show that clarification adds redundant interactions rather than removing them, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.17531 by Haichao Jiang, Jian-Fang Hu, Quan Zhang, Tianming Liang, Yuting Yang.

**Figure 2.** Overview of the IC-Seg framework. IC-Seg resolves ambiguities via multi-turn dialogues

**Figure 3.** Qualitative comparisons among our IC-Seg and two baselines.

**Figure 4.** Training dynamics of IC-Seg-8B. The localization-related rewards, including $R_{\text{IoU}}$, $R_{\text{box}}$, $R_{\text{point}}$, and $R_{\text{frame}}$, steadily increase during training, indicating that the model gradually improves its final grounding accuracy and keyframe selection. The process reward also rises consistently, suggesting that Hi-GRPO encourages more effective clarification behavior rather than only optimizing the final s…

Original abstract

Referring segmentation aims to segment the target objects in images or videos based on the textual query. Despite remarkable progress over the past years, existing works always assume that the user-provided queries are already precise and clear. However, this assumption is impractical. In real-world scenarios, it is unrealistic to expect all users to thoroughly review their visual content and carefully ensure their queries are unique and unambiguous. When encountering such cases, existing segmentation models tend to arbitrarily guess the user preferences, often resulting in undesired outcomes. To address this limitation, we propose IC-Seg, a novel agentic framework that proactively clarifies user intent through multi-turn conversation before segmentation. To effectively incentivize this capability, we further introduce Hi-GRPO, a new hierarchical optimization strategy that injects dense and informative supervision signals at the trajectory, turn, and step levels. This strategy encourages efficient intent clarification, effectively eliminating redundant interactions and improving overall dialogue quality. For evaluation, we establish Ambi-RVOS, a referring video object segmentation benchmark with ambiguous user queries. Extensive experiments demonstrate that IC-Seg not only outperforms existing methods by a large margin in resolving ambiguous queries, but also maintains state-of-the-art performance on standard reasoning segmentation benchmarks. Code and data will be released at https://github.com/iSEE-Laboratory/IC-Seg.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

IC-Seg adds multi-turn clarification to handle ambiguous queries in referring segmentation and trains it with a hierarchical optimization strategy, but the abstract gives no metrics or ablations, so the gains are hard to judge.


The paper's central move is to stop treating user queries as always precise. Instead of guessing when a referring expression is vague, IC-Seg runs a short conversation to pin down intent before segmenting. They support this with Hi-GRPO, which supplies reward signals at the full trajectory, individual turn, and single-step levels, plus a new Ambi-RVOS benchmark built around deliberately ambiguous video queries. That combination is the actual novelty; prior referring segmentation work stayed inside the single-query setting.
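
To make the ask-before-segmenting idea concrete, here is a minimal Python sketch of a clarification loop. It is not IC-Seg's learned policy: the `Candidate` type, the attribute dictionaries, the "ask about the attribute with the most distinct values" heuristic, and the simulated user are all hypothetical stand-ins chosen so the loop can run on its own.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    attributes: dict


def pick_question(candidates):
    """Pick the attribute with the most distinct values among the candidates."""
    keys = sorted({k for c in candidates for k in c.attributes})
    return max(keys, key=lambda k: len({c.attributes.get(k) for c in candidates}))


def clarify(candidates, answer_fn, max_turns=3):
    """Ask about one attribute per turn until at most one candidate remains."""
    turns = 0
    while len(candidates) > 1 and turns < max_turns:
        key = pick_question(candidates)
        answer = answer_fn(key)  # simulated user reply
        candidates = [c for c in candidates if c.attributes.get(key) == answer]
        turns += 1
    return candidates, turns


# The query "the dog" matches three objects; the simulated user means
# the brown dog on the right.
dogs = [
    Candidate("dog_a", {"color": "brown", "position": "left"}),
    Candidate("dog_b", {"color": "brown", "position": "right"}),
    Candidate("dog_c", {"color": "white", "position": "left"}),
]
intent = {"color": "brown", "position": "right"}
remaining, turns = clarify(dogs, answer_fn=lambda key: intent[key])
print([c.name for c in remaining], turns)
```

A single-query system would have to pick one of the three dogs at this point; the loop instead spends two short turns and hands exactly one target to the segmenter.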

Referee Report

3 major / 2 minor

Summary. The paper proposes IC-Seg, an agentic framework for referring video object segmentation that proactively resolves ambiguous user queries via multi-turn clarification dialogues instead of guessing. It introduces Hi-GRPO, a hierarchical optimization strategy that supplies dense supervision signals at the trajectory, turn, and step levels to promote efficient intent clarification and reduce redundant interactions. A new benchmark Ambi-RVOS is created to evaluate performance on ambiguous queries, with claims of large-margin outperformance on this benchmark while retaining state-of-the-art results on standard reasoning segmentation benchmarks.

Significance. If the empirical claims hold, the work addresses a practical gap in referring segmentation by moving beyond the assumption of unambiguous queries, which is common in real-world use. The hierarchical reward design and the Ambi-RVOS benchmark could serve as useful tools for developing more robust interactive vision-language models, provided the gains are shown to stem from the agentic clarification mechanism rather than optimization artifacts.

major comments (3)

§4.2 (Hi-GRPO description): The central claim that Hi-GRPO delivers dense supervision improving clarification efficiency without new failure modes or dialogue bloat is load-bearing for the large-margin gains on Ambi-RVOS, yet the manuscript provides no ablation that isolates or removes the trajectory/turn/step reward terms individually while reporting turn counts, success rates, and performance on the original non-ambiguous benchmarks.
§5.1 and Table 2 (Ambi-RVOS results): The reported large-margin outperformance is stated without accompanying quantitative metrics, variance across runs, or direct comparison to a non-hierarchical GRPO baseline, making it impossible to verify that the margin is attributable to the multi-turn clarification policy rather than the new optimization or benchmark construction.
§5.2 (standard benchmark retention): The assertion that IC-Seg maintains SOTA performance on existing reasoning segmentation benchmarks while adding clarification capability requires explicit side-by-side tables with the same backbone and training regime; without these, it remains unclear whether the hierarchical terms introduce any degradation on unambiguous queries.
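
The ablation requested in the first major comment amounts to a leave-one-out grid over the three reward levels. A small Python sketch of that grid follows; the variant names and the `leave_one_out` helper are hypothetical and only enumerate configurations, they do not reflect the authors' training code.

```python
REWARD_LEVELS = ("trajectory", "turn", "step")


def leave_one_out(levels=REWARD_LEVELS):
    """Return the full configuration plus one variant per removed reward level."""
    variants = {"full": tuple(levels)}
    for dropped in levels:
        variants[f"no_{dropped}"] = tuple(l for l in levels if l != dropped)
    return variants


variants = leave_one_out()
for name, active in variants.items():
    print(f"{name}: {', '.join(active)}")
```

Each of the four variants would then be reported with turn counts, success rates, and performance on the non-ambiguous benchmarks, as the comment asks.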

minor comments (2)

The abstract and introduction repeatedly use 'large margin' without defining the metric or providing the numerical delta; this should be replaced with concrete numbers (e.g., mIoU improvement) once the tables are referenced.
Notation for the three reward levels in Hi-GRPO (trajectory, turn, step) is introduced without a compact equation summarizing their weighted combination; adding such an equation would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important aspects of experimental rigor that we have addressed through revisions to the manuscript. Below we respond point-by-point to each major comment.


Referee: §4.2 (Hi-GRPO description): The central claim that Hi-GRPO delivers dense supervision improving clarification efficiency without new failure modes or dialogue bloat is load-bearing for the large-margin gains on Ambi-RVOS, yet the manuscript provides no ablation that isolates or removes the trajectory/turn/step reward terms individually while reporting turn counts, success rates, and performance on the original non-ambiguous benchmarks.

Authors: We agree that isolating the contribution of each hierarchical reward level is necessary to substantiate the claims. In the revised manuscript we have added a dedicated ablation subsection in §4.2 (new Table 4) that systematically removes the trajectory-level, turn-level, and step-level reward terms one at a time. For each variant we report average dialogue turns, clarification success rate on Ambi-RVOS, and segmentation performance on the original non-ambiguous benchmarks (RefCOCO, RefCOCO+, DAVIS). The results show that the full three-level hierarchy yields the highest efficiency and accuracy without increasing dialogue length or introducing new failure modes.

revision: yes

Referee: §5.1 and Table 2 (Ambi-RVOS results): The reported large-margin outperformance is stated without accompanying quantitative metrics, variance across runs, or direct comparison to a non-hierarchical GRPO baseline, making it impossible to verify that the margin is attributable to the multi-turn clarification policy rather than the new optimization or benchmark construction.

Authors: We acknowledge the need for statistical reporting and a controlled baseline. The revised Table 2 now includes mean and standard deviation across three independent runs with different random seeds. We have also added a direct comparison row for a non-hierarchical GRPO baseline (trajectory reward only) trained under identical conditions. The updated results confirm that the performance margin on Ambi-RVOS is attributable to the hierarchical supervision enabling more effective multi-turn clarification rather than optimization or benchmark artifacts alone.

revision: yes

Referee: §5.2 (standard benchmark retention): The assertion that IC-Seg maintains SOTA performance on existing reasoning segmentation benchmarks while adding clarification capability requires explicit side-by-side tables with the same backbone and training regime; without these, it remains unclear whether the hierarchical terms introduce any degradation on unambiguous queries.

Authors: We agree that a controlled side-by-side evaluation is required. We have inserted a new Table 3 in §5.2 that compares IC-Seg against prior state-of-the-art methods using exactly the same backbone, training data, and optimization schedule on the standard benchmarks (RefCOCO, RefCOCO+, DAVIS). The table demonstrates that IC-Seg retains or slightly exceeds prior SOTA numbers, indicating that the hierarchical reward terms do not degrade performance on unambiguous queries.

revision: yes

Circularity Check

0 steps flagged

No circularity; new framework and benchmark are self-contained


The paper introduces IC-Seg as a new agentic multi-turn clarification framework and Hi-GRPO as a hierarchical optimization with trajectory/turn/step rewards, plus the Ambi-RVOS benchmark. Claims of outperformance on ambiguous queries and maintained SOTA on standard benchmarks rest on empirical results from these novel elements rather than any self-referential equations, fitted parameters renamed as predictions, or load-bearing self-citations. No derivation reduces to its own inputs by construction; the work is independent of prior fitted quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 3 invented entities

The central claim rests on the effectiveness of the newly introduced IC-Seg framework, Hi-GRPO optimization, and Ambi-RVOS benchmark rather than on pre-existing axioms or fitted parameters described in the abstract.

invented entities (3)

IC-Seg (no independent evidence)
purpose: Agentic framework for proactive multi-turn intent clarification before segmentation
Newly proposed system to address the limitation of ambiguous queries.

Hi-GRPO (no independent evidence)
purpose: Hierarchical optimization injecting supervision at trajectory, turn, and step levels
New strategy to incentivize efficient clarification capability.

Ambi-RVOS (no independent evidence)
purpose: Benchmark for referring video object segmentation with ambiguous user queries
New dataset established to evaluate performance on ambiguous cases.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Hi-GRPO ... injects dense ... supervision signals at the trajectory, turn, and step levels ... $R_{\text{turn}} = R_{\text{ent}} + R_{\text{eff}}$ ... entropy reduction ... $R_{\text{eff}} = \frac{1}{K} \sum_{k} \mathbb{I}(N_k < N_{k-1})$

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: IC-Seg ... multi-turn conversation before segmentation ... Ambi-RVOS benchmark
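
The efficiency term quoted in the first passage above, the average of the indicator I(N_k < N_{k-1}) over K turns, can be computed with a few lines of Python. The sketch assumes that N_k is the number of candidate objects still consistent with the dialogue after turn k and that the sum runs over k = 1..K; the quoted passage elides both details, so treat this reading and the `efficiency_reward` name as hypothetical.

```python
def efficiency_reward(candidate_counts):
    """Fraction of turns that strictly reduced the candidate count.

    candidate_counts[0] is N_0 (before any question) and
    candidate_counts[k] is N_k (after turn k); this indexing is an assumption.
    """
    k_turns = len(candidate_counts) - 1
    if k_turns <= 0:
        return 0.0  # no clarification turns, so nothing to reward
    hits = sum(
        1
        for prev, cur in zip(candidate_counts, candidate_counts[1:])
        if cur < prev
    )
    return hits / k_turns


# Three turns: 4 -> 2 (useful), 2 -> 2 (redundant), 2 -> 1 (useful).
r_eff = efficiency_reward([4, 2, 2, 1])
print(r_eff)
```

Under this reading a redundant question, one that leaves the candidate set unchanged, lowers the reward, which matches the stated goal of eliminating redundant interactions.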

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.