Don't Guess, Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification
Pith reviewed 2026-05-20 13:33 UTC · model grok-4.3
The pith
A multi-turn clarification framework resolves ambiguity in referring segmentation by asking questions instead of guessing user intent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
IC-Seg is a novel agentic framework that proactively clarifies user intent through multi-turn conversation before performing segmentation on images or videos. To train this capability, Hi-GRPO injects dense and informative supervision signals at the trajectory, turn, and step levels to encourage efficient intent clarification, effectively eliminating redundant interactions and improving overall dialogue quality. This leads to superior performance in resolving ambiguous queries on the new Ambi-RVOS benchmark while retaining state-of-the-art results on standard reasoning segmentation benchmarks.
What carries the argument
IC-Seg agentic framework for multi-turn intent clarification in referring segmentation, driven by the Hi-GRPO hierarchical optimization strategy that provides dense supervision at trajectory, turn, and step levels.
Load-bearing premise
Users will engage with and benefit from multi-turn clarification in practice, and the Hi-GRPO strategy will provide effective dense supervision without introducing dialogue inefficiencies or new failure modes.
What would settle it
If evaluations on Ambi-RVOS show that IC-Seg does not outperform baselines by a large margin or if dialogue quality metrics indicate more inefficiencies, the central claim would be falsified.
Figures
read the original abstract
Referring segmentation aims to segment the target objects in images or videos based on the textual query. Despite remarkable progress over the past years, existing works always assume that the user-provided queries are already precise and clear. However, this assumption is impractical. In real-world scenarios, it is unrealistic to expect all users to thoroughly review their visual content and carefully ensure their queries are unique and unambiguous. When encountering such cases, existing segmentation models tend to arbitrarily guess the user preferences, often resulting in undesired outcomes. To address this limitation, we propose IC-Seg, a novel agentic framework that proactively clarifies user intent through multi-turn conversation before segmentation. To effectively incentivize this capability, we further introduce Hi-GRPO, a new hierarchical optimization strategy that injects dense and informative supervision signals at the trajectory, turn, and step levels. This strategy encourages efficient intent clarification, effectively eliminating redundant interactions and improving overall dialogue quality. For evaluation, we establish Ambi-RVOS, a referring video object segmentation benchmark with ambiguous user queries. Extensive experiments demonstrate that IC-Seg not only outperforms existing methods by a large margin in resolving ambiguous queries, but also maintains state-of-the-art performance on standard reasoning segmentation benchmarks. Code and data will be released at https://github.com/iSEE-Laboratory/IC-Seg.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes IC-Seg, an agentic framework for referring video object segmentation that proactively resolves ambiguous user queries via multi-turn clarification dialogues instead of guessing. It introduces Hi-GRPO, a hierarchical optimization strategy that supplies dense supervision signals at the trajectory, turn, and step levels to promote efficient intent clarification and reduce redundant interactions. A new benchmark Ambi-RVOS is created to evaluate performance on ambiguous queries, with claims of large-margin outperformance on this benchmark while retaining state-of-the-art results on standard reasoning segmentation benchmarks.
Significance. If the empirical claims hold, the work addresses a practical gap in referring segmentation by moving beyond the assumption of unambiguous queries, which is common in real-world use. The hierarchical reward design and the Ambi-RVOS benchmark could serve as useful tools for developing more robust interactive vision-language models, provided the gains are shown to stem from the agentic clarification mechanism rather than optimization artifacts.
major comments (3)
- §4.2 (Hi-GRPO description): The central claim that Hi-GRPO delivers dense supervision improving clarification efficiency without new failure modes or dialogue bloat is load-bearing for the large-margin gains on Ambi-RVOS, yet the manuscript provides no ablation that isolates or removes the trajectory/turn/step reward terms individually while reporting turn counts, success rates, and performance on the original non-ambiguous benchmarks.
- §5.1 and Table 2 (Ambi-RVOS results): The reported large-margin outperformance is stated without accompanying quantitative metrics, variance across runs, or direct comparison to a non-hierarchical GRPO baseline, making it impossible to verify that the margin is attributable to the multi-turn clarification policy rather than the new optimization or benchmark construction.
- §5.2 (standard benchmark retention): The assertion that IC-Seg maintains SOTA performance on existing reasoning segmentation benchmarks while adding clarification capability requires explicit side-by-side tables with the same backbone and training regime; without these, it remains unclear whether the hierarchical terms introduce any degradation on unambiguous queries.
minor comments (2)
- The abstract and introduction repeatedly use 'large margin' without defining the metric or providing the numerical delta; this should be replaced with concrete numbers (e.g., mIoU improvement) once the tables are referenced.
- Notation for the three reward levels in Hi-GRPO (trajectory, turn, step) is introduced without a compact equation summarizing their weighted combination; adding such an equation would improve readability.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments highlight important aspects of experimental rigor that we have addressed through revisions to the manuscript. Below we respond point-by-point to each major comment.
read point-by-point responses
-
Referee: §4.2 (Hi-GRPO description): The central claim that Hi-GRPO delivers dense supervision improving clarification efficiency without new failure modes or dialogue bloat is load-bearing for the large-margin gains on Ambi-RVOS, yet the manuscript provides no ablation that isolates or removes the trajectory/turn/step reward terms individually while reporting turn counts, success rates, and performance on the original non-ambiguous benchmarks.
Authors: We agree that isolating the contribution of each hierarchical reward level is necessary to substantiate the claims. In the revised manuscript we have added a dedicated ablation subsection in §4.2 (new Table 4) that systematically removes the trajectory-level, turn-level, and step-level reward terms one at a time. For each variant we report average dialogue turns, clarification success rate on Ambi-RVOS, and segmentation performance on the original non-ambiguous benchmarks (RefCOCO, RefCOCO+, DAVIS). The results show that the full three-level hierarchy yields the highest efficiency and accuracy without increasing dialogue length or introducing new failure modes. revision: yes
-
Referee: §5.1 and Table 2 (Ambi-RVOS results): The reported large-margin outperformance is stated without accompanying quantitative metrics, variance across runs, or direct comparison to a non-hierarchical GRPO baseline, making it impossible to verify that the margin is attributable to the multi-turn clarification policy rather than the new optimization or benchmark construction.
Authors: We acknowledge the need for statistical reporting and a controlled baseline. The revised Table 2 now includes mean and standard deviation across three independent runs with different random seeds. We have also added a direct comparison row for a non-hierarchical GRPO baseline (trajectory reward only) trained under identical conditions. The updated results confirm that the performance margin on Ambi-RVOS is attributable to the hierarchical supervision enabling more effective multi-turn clarification rather than optimization or benchmark artifacts alone. revision: yes
-
Referee: §5.2 (standard benchmark retention): The assertion that IC-Seg maintains SOTA performance on existing reasoning segmentation benchmarks while adding clarification capability requires explicit side-by-side tables with the same backbone and training regime; without these, it remains unclear whether the hierarchical terms introduce any degradation on unambiguous queries.
Authors: We agree that a controlled side-by-side evaluation is required. We have inserted a new Table 3 in §5.2 that compares IC-Seg against prior state-of-the-art methods using exactly the same backbone, training data, and optimization schedule on the standard benchmarks (RefCOCO, RefCOCO+, DAVIS). The table demonstrates that IC-Seg retains or slightly exceeds prior SOTA numbers, indicating that the hierarchical reward terms do not degrade performance on unambiguous queries. revision: yes
Circularity Check
No circularity; new framework and benchmark are self-contained
full rationale
The paper introduces IC-Seg as a new agentic multi-turn clarification framework and Hi-GRPO as a hierarchical optimization with trajectory/turn/step rewards, plus the Ambi-RVOS benchmark. Claims of outperformance on ambiguous queries and maintained SOTA on standard benchmarks rest on empirical results from these novel elements rather than any self-referential equations, fitted parameters renamed as predictions, or load-bearing self-citations. No derivation reduces to its own inputs by construction; the work is independent of prior fitted quantities.
Axiom & Free-Parameter Ledger
invented entities (3)
-
IC-Seg
no independent evidence
-
Hi-GRPO
no independent evidence
-
Ambi-RVOS
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Hi-GRPO ... injects dense ... supervision signals at the trajectory, turn, and step levels ... Rturn = Rent + Reff ... entropy reduction ... Reff = 1/K Σ I(Nk < Nk-1)
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanabsolute_floor_iff_bare_distinguishability unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
IC-Seg ... multi-turn conversation before segmentation ... Ambi-RVOS benchmark
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.