Localizing Input Uncertainty Quantification for Large Language Models via Shapley Values

Changhee Lee; Seongjun Lee; Suwan Yoon

arxiv: 2605.28170 · v1 · pith:77244GMFnew · submitted 2026-05-27 · 💻 cs.AI

Localizing Input Uncertainty Quantification for Large Language Models via Shapley Values

Seongjun Lee , Suwan Yoon , Changhee Lee This is my paper

Pith reviewed 2026-06-29 12:39 UTC · model grok-4.3

classification 💻 cs.AI

keywords Shapley valuesinput uncertainty quantificationlarge language modelsambiguity detectionconditional entropyspan-level attributioncooperative games

0 comments

The pith

ShaQ attributes uncertainty in LLM outputs to specific ambiguous input spans using Shapley values from conditional entropy reductions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ShaQ to quantify input-induced uncertainty at the span level rather than as a single scalar for large language models. It models ambiguous spans as players in a cooperative game whose contributions are measured by Shapley values, each defined as the weighted average of marginal drops in conditional entropy that result from clarifying coalitions of spans. This yields an exact additive decomposition of total input uncertainty that accounts for interactions among spans. The resulting attributions let users identify which parts of an input to clarify in order to improve model reliability, with evaluations showing gains on ambiguity detection tasks and utility in clinical dialogue.

Core claim

ShaQ models ambiguous spans in the input as players in a cooperative game and quantifies their contributions using Shapley values, defined via the weighted average of marginal reductions in conditional entropy obtained by clarifying each span coalition, providing a principled decomposition in which individual attributions sum exactly to the total input-induced uncertainty.

What carries the argument

Shapley values defined via the weighted average of marginal reductions in conditional entropy obtained by clarifying span coalitions

If this is right

Individual span attributions sum exactly to the total input-induced uncertainty.
The formulation captures complex interactions among spans.
ShaQ achieves state-of-the-art performance in ambiguity detection on the AmbigQA and AmbiEnt benchmarks.
ShaQ localizes under-specified clinical utterances on MediTOD and supports targeted clarification in human-AI collaboration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Interactive systems could use high-attribution spans to prompt users for clarification in real time.
The same game setup could be adapted to isolate uncertainty arising from model knowledge gaps rather than input ambiguity.
Integrating ShaQ into model training loops might encourage LLMs to request clarification on ambiguous spans automatically.
Running ShaQ across many different LLM architectures would test whether the entropy-based attributions remain stable.

Load-bearing premise

Marginal reductions in conditional entropy when clarifying coalitions of spans can be used to define Shapley values that additively decompose input-induced uncertainty while capturing interactions without model-specific biases.

What would settle it

A direct check that the sum of all span attributions equals the total conditional entropy of the input, or an experiment showing whether clarifying the highest-attribution spans reduces uncertainty more than clarifying lower-attribution ones.

Figures

Figures reproduced from arXiv: 2605.28170 by Changhee Lee, Seongjun Lee, Suwan Yoon.

**Figure 1.** Figure 1: Comparison of uncertainty methods to LLM-user interaction. (Left) Output-level uncertainty methods provide confidence scores, but do not distinguish whether uncertainty arises from input ambiguity or model hallucination. (Center) Aggregate aleatoric uncertainty methods identify ambiguous inputs, but lack localization. (Right) ShaQ localizes uncertainty to individual spans, enabling targeted clarification o… view at source ↗

**Figure 2.** Figure 2: An overview of the ShaQ pipeline. Given an input query “How tall is the tallest building in the city?”, (Step 1) two ambiguous spans are localized, and then two premises are generated per span, providing four joint clarifications. For each clarification, (Step 2) the Answerer generates ℓ responses, which are then mapped into a shared semantic space by the Clusterer. (Step 3) Bottom-up marginalization is us… view at source ↗

**Figure 3.** Figure 3: Aleatoric uncertainty distributions on AmbigQA (Left) and AmbiEnt (Right). the baselines, ShaQ exhibits the clearest separation between ambiguous and unambiguous inputs, assigning consistently higher uncertainty to ambiguous questions. This separation supports the effectiveness of ShaQ as a fine-grained ambiguity quantification method. Overall, these results show that the uncertainty estimates produced by … view at source ↗

**Figure 4.** Figure 4: Uncertainty-guided clarification. (a) Unclear Symptom Appearances (b) Unclear Symptom Timing [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 6.** Figure 6: Aleatoric uncertainty distributions on AmbiEnt (GPT-5.4-mini). C. Additional Results C.1. Additional Results on Ambiguity Detection Uncertainty Distribution Analysis. Figures 6, 8, 7, 9 visualize the aleatoric uncertainty distributions for ambiguous and unambiguous samples across different Answerer and Clusterer backbone models. Across these backbones, ShaQ exhibits a similar distributional pattern: ambigu… view at source ↗

**Figure 7.** Figure 7: Aleatoric uncertainty distributions on AmbigQA (GPT-4). 19 [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗

**Figure 8.** Figure 8: Aleatoric uncertainty distributions on AmbiEnt. Upper row: Gemini-3.1-flash-preview, Lower row: Gemini-2.5-flash. ShaQ(Total) ShaQ(Max) ICE Deep Ensembles ShaQ(Total) ShaQ(Max) ICE Deep Ensembles ShaQ(Total) ShaQ(Max) ICE Deep Ensembles [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗

**Figure 9.** Figure 9: Aleatoric uncertainty distributions on AmbigQA. First row: GPT-5.4-mini, Mid row: Gemini-3.1-flash-preview, Last row: Gemini-2.5-flash. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_9.png] view at source ↗

**Figure 10.** Figure 10: Qualitative Result of multi span ambiguous questions on AmbigQA (GPT-4). Highlighted span is colored with pink. Qualitative Analysis on AmbigQA We further conduct a qualitative evaluation of ShaQ on three representative multi-span cases identified by the AmbigQA localizer: • (1) cases in which all detected spans must be revised to resolve the ambiguity ( [PITH_FULL_IMAGE:figures/full_fig_p023_10.png] view at source ↗

**Figure 11.** Figure 11: Qualitative Result of multi span ambiguous questions on AmbiEnt (GPT-5.4-mini). Highlighted span is colored with pink. Qualitative Analysis on AmbiEnt In AmbiEnt, ambiguity may arise independently from either the premise or the hypothesis. We therefore conduct a qualitative evaluation of ShaQ on two representative multi-span cases identified by the AmbiEnt localizer: • (1) cases in which all detected span… view at source ↗

**Figure 12.** Figure 12: Extensive Analysis For MediTOD Dialogue Set. Extensive Medical Dialogue Case Study [PITH_FULL_IMAGE:figures/full_fig_p028_12.png] view at source ↗

**Figure 13.** Figure 13: Prompt used for the Localizer. 32 [PITH_FULL_IMAGE:figures/full_fig_p032_13.png] view at source ↗

**Figure 14.** Figure 14: Prompt used for the Premise Generator. 33 [PITH_FULL_IMAGE:figures/full_fig_p033_14.png] view at source ↗

**Figure 15.** Figure 15: Prompt used for the Answerer. 34 [PITH_FULL_IMAGE:figures/full_fig_p034_15.png] view at source ↗

**Figure 16.** Figure 16: Prompt used for the Clusterer. 35 [PITH_FULL_IMAGE:figures/full_fig_p035_16.png] view at source ↗

**Figure 17.** Figure 17: Prompt used for Ask4Conf-D. Uncertainty-guided Question Clarification Prompt for Baseline Methods You rewrite ambiguous QA questions so that answer uncertainty is reduced while the original broad intent is preserved. You are given only an original question and numeric uncertainty scores from one uncertainty method. Original question: {{ORIGINAL_QUESTION}} Numeric uncertainty scores: {{UNCERTAINTY_CONTEXT}… view at source ↗

**Figure 18.** Figure 18: Prompt used for the Uncertainty-guided Clarification. 36 [PITH_FULL_IMAGE:figures/full_fig_p036_18.png] view at source ↗

**Figure 19.** Figure 19: Prompt used for the Uncertainty-guided Clarification with ShaQ. Uncertainty context Numeric uncertainty scores: query_ambiguity: {query_ambiguity} , query_ambiguity_stats: "min": {min} , "max": {max} , "mean": {mean} , "median": {median} , "variance": {variance} , "threshold": {threshold} , [PITH_FULL_IMAGE:figures/full_fig_p037_19.png] view at source ↗

**Figure 20.** Figure 20: Uncertainty Context format used for the Uncertainty-guided Clarification. 37 [PITH_FULL_IMAGE:figures/full_fig_p037_20.png] view at source ↗

**Figure 21.** Figure 21: Fine-grained Uncertainty Context format used for the Uncertainty-guided Clarification with ShaQ. ShaQ Prompt: Medical Dialogue Localizer You are a conservative ambiguity detector for medical dialogues. Your task is to review the Current Utterance in light of the Medical Dialogue Context, and wrap only the truly ambiguous spans in the Current Utterance with XML tags of the form <ambig id="1">...</ambig>. #… view at source ↗

**Figure 22.** Figure 22: Prompt for the Localizer: identifies ambiguous spans in the current utterance given the dialogue history. 38 [PITH_FULL_IMAGE:figures/full_fig_p038_22.png] view at source ↗

**Figure 23.** Figure 23: Prompt for the Premise Generator: produces mutually exclusive clinical interpretations for each ambiguous span. 39 [PITH_FULL_IMAGE:figures/full_fig_p039_23.png] view at source ↗

**Figure 24.** Figure 24: Prompt for the answerer: rewrites the current utterance under a specific assumption about an ambiguous span. 40 [PITH_FULL_IMAGE:figures/full_fig_p040_24.png] view at source ↗

**Figure 25.** Figure 25: Prompt for the Clusterer: groups rewritten utterances by shared clinical or semantic meaning. 41 [PITH_FULL_IMAGE:figures/full_fig_p041_25.png] view at source ↗

read the original abstract

As large language models (LLMs) are increasingly integrated into high-stakes decision-making, the ability to reliably quantify uncertainty has become a critical requirement for safety and trust. However, current uncertainty quantification methods primarily operate at the output level, often failing to distinguish whether uncertainty arises from the model's lack of knowledge or from ambiguity in the user's input. While input-centric uncertainty quantification has recently emerged as a promising direction, it remains relatively underexplored and typically relies on coarse, input-level information. Consequently, users are provided with scalar uncertainty scores that offer little actionable guidance on which parts of the input should be clarified to improve reliability. To address this limitation, we propose Shapley-based input uncertainty Quantification (ShaQ), a framework for span-level attribution of input-induced uncertainty. Our approach models ambiguous spans in the input as players in a cooperative game and quantifies their contributions using Shapley values, defined via the weighted average of marginal reductions in conditional entropy obtained by clarifying each span coalition. Unlike existing input-level approaches, our formulation captures complex interactions among spans and provides a principled decomposition in which individual attributions sum exactly to the total input-induced uncertainty. We evaluate ShaQ on the AmbigQA and AmbiEnt benchmarks, where it achieves state-of-the-art performance in ambiguity detection. We further demonstrate its utility on MediTOD, showing that ShaQ can localize under-specified clinical utterances and facilitate human-AI collaboration in high-stakes settings. Overall, ShaQ improves uncertainty estimation and provides actionable insights for targeted input clarification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ShaQ applies standard Shapley values to span coalitions in LLM inputs using marginal conditional-entropy reductions, giving an additive decomposition with no visible internal contradictions.

read the letter

The core move is to treat spans as players in a cooperative game where the value of a coalition is the drop in conditional entropy once those spans are clarified. This yields per-span attributions that sum exactly to the total input uncertainty by the usual Shapley axioms. That part is clean and matches the stress-test note.

What is actually new is the span-level localization for input ambiguity rather than another output-score method. They run it on AmbigQA and AmbiEnt and report better ambiguity detection than prior input-level baselines, plus a MediTOD case study showing it can flag under-specified clinical utterances. The additive property and interaction capture are real advantages over simple per-span scores.

The soft spots are practical rather than foundational. Exact Shapley computation scales poorly, so any real implementation must approximate coalitions; the abstract gives no detail on the approximation or on how "clarifying" a span is operationalized inside the LLM without introducing new conditioning artifacts. Tokenization boundaries and prompt formatting could also affect the entropy estimates in ways that are not yet stress-tested. Those are fixable with more implementation transparency and ablation checks.

This is for people already working on uncertainty quantification or safety in deployed LLMs. A reader who wants a principled way to point at which input parts need clarification will get something usable here. It is coherent on its own terms and deserves a serious referee to check the experiments and approximations.

Referee Report

2 major / 3 minor

Summary. The paper introduces ShaQ, a framework that models ambiguous spans in LLM inputs as players in a cooperative game and attributes input-induced uncertainty via Shapley values computed from marginal reductions in conditional entropy when coalitions of spans are clarified. It claims this yields an exact additive decomposition of total input uncertainty, captures span interactions, outperforms prior methods on AmbigQA and AmbiEnt for ambiguity detection, and aids human-AI collaboration on MediTOD by localizing under-specified clinical utterances.

Significance. If the value function is shown to isolate input ambiguity without model-specific biases, ShaQ supplies a principled, interaction-aware localization tool that moves beyond scalar input uncertainty scores. The reliance on the standard Shapley construction (additivity holds by definition for any valid v) is a methodological strength, and the application to high-stakes domains like clinical dialogue is a practical contribution.

major comments (2)

[§3] §3 (method): the value function v(S) is defined via marginal entropy reduction upon span clarification, but the manuscript must explicitly address how clarification is realized in the LLM (e.g., prompt rewriting, token masking, or conditioning) to ensure the marginal contributions remain faithful to input-induced uncertainty rather than conflating it with output-generation stochasticity.
[§4] §4 (experiments): the SOTA claim on AmbigQA/AmbiEnt requires reporting of exact metrics, baseline implementations, and statistical significance; without these, it is unclear whether gains stem from the Shapley decomposition itself or from auxiliary modeling choices.

minor comments (3)

[§3] Notation for conditional entropy and the characteristic function v should be introduced with a single consistent equation early in §3 to avoid reader confusion when later referencing coalitions.
The computational complexity of exact Shapley values is exponential; the paper should state whether Monte-Carlo sampling or other approximations are used and report variance or convergence diagnostics.
Figure captions and table headers should explicitly label the uncertainty metric (e.g., “input-induced entropy”) rather than generic “uncertainty score.”

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation for minor revision. We address each major comment below and will update the manuscript accordingly to improve clarity on the value function and experimental details.

read point-by-point responses

Referee: [§3] §3 (method): the value function v(S) is defined via marginal entropy reduction upon span clarification, but the manuscript must explicitly address how clarification is realized in the LLM (e.g., prompt rewriting, token masking, or conditioning) to ensure the marginal contributions remain faithful to input-induced uncertainty rather than conflating it with output-generation stochasticity.

Authors: We agree that the current description in §3 would benefit from greater explicitness on the clarification procedure. Clarification is performed via prompt rewriting: each coalition S is clarified by substituting the spans in S with their ground-truth disambiguations (from benchmark annotations) while leaving all other tokens unchanged; the LLM is then conditioned on this rewritten prompt to compute the conditional entropy of the output distribution. This isolates input-induced uncertainty because the only change is the input content, not the sampling procedure or model parameters. We will add a new paragraph and pseudocode in §3 detailing this exact mechanism, including how we ensure the entropy computation uses the same generation settings across coalitions. revision: yes
Referee: [§4] §4 (experiments): the SOTA claim on AmbigQA/AmbiEnt requires reporting of exact metrics, baseline implementations, and statistical significance; without these, it is unclear whether gains stem from the Shapley decomposition itself or from auxiliary modeling choices.

Authors: We acknowledge that the experimental section would be strengthened by additional reporting. The manuscript already contains comparative tables, but we will expand §4 with: (i) the precise numerical values for all metrics (accuracy, F1, AUC) rather than only relative rankings, (ii) full implementation details and hyperparameter settings for each baseline, and (iii) statistical significance results (paired t-tests over 5 random seeds) comparing ShaQ against the strongest baselines. These additions will make it possible to attribute performance differences more clearly to the Shapley formulation. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper defines ShaQ by applying the standard Shapley value formula (weighted average of marginal contributions) to a value function v based on reductions in conditional entropy when clarifying input span coalitions. The claimed additive decomposition (attributions sum exactly to total input-induced uncertainty) follows directly from the definition of Shapley values for any v, which is the standard mathematical property rather than a derived or fitted result internal to the paper. No self-citations, uniqueness theorems, ansatzes smuggled via prior work, or fitted parameters renamed as predictions appear in the abstract or description. The central construction is an application of cooperative game theory to the entropy-based value function and remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies insufficient technical detail to enumerate free parameters, axioms, or invented entities; the cooperative-game modeling and conditional-entropy definition are invoked but not expanded.

pith-pipeline@v0.9.1-grok · 5804 in / 1092 out tokens · 35196 ms · 2026-06-29T12:39:01.685422+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

54 extracted references

[1]

the flash

Entity overlap — the same name refers to two or more well-known entities that yield different answers. (e.g., “the flash” → 1990 TV series vs 2014 TV series)

1990
[2]

current”, “recent

Role/title holders — role-based questions where MORE THAN ONE person is commonly associated, especially with words like “current”, “recent”, or no time qualifier. (e.g., “Who is the administrator of the SBA?” — changed multiple times)
[3]

Who won the world cup?

Missing specifier — a key dimension (sport, year, country, gender, edition) is absent and there are multiple well-known correct answers across that dimension. (e.g., “Who won the world cup?” — no sport, no year)
[4]

more famous

Temporal — the answer has changed over time AND the question does NOT fix a specific time period. Err toward marking temporal ambiguity when the role or record holder has changed within the past∼20 years, even if one answer is “more famous”. Do NOT mark as ambiguous: • Questions with a single universally dominant answer that virtually everyone would give....
[6]

Each premise must be strongly supported by the wording, not merely theoretically possible
[7]

All premises must point to real, well-known facts, media, people, or events

DO NOT invent fictional or unverifiable entities. All premises must point to real, well-known facts, media, people, or events
[9]

Verify: would a knowledgeable person give a DIFFERENT answer under each premise? If not, do not include both
[10]

Keep each premise concise, one short sentence
[11]

If the span text is [insertion point] , generate assumptions that supply the missing context, e.g., a specific year, country, sport, or edition
[12]

Prefer the MOST COMMON or WELL-KNOWN interpretations first
[13]

Do not use obscure variants, foreign editions, regional releases, minor adaptations, or historical edge cases unless the question wording directly makes that dimension salient
[14]

Do not split by spelling, alias, paraphrase, or answer granularity if experts would still converge on the same factual answer
[15]

premises

When in doubt, be conservative: return fewer premises rather than low-plausibility ones. ## Examples {Examples} ## Task Question:{clean_sentence} Span id:{span_id} Span text:{span_text} Ambiguity reason:{reason} The"premises"array must contain between 1 and{m}items. Add another item only if it is a strong, common, answer-changing interpretation. Figure 14...
[16]

Output ONLY the answer entity

Your answer MUST be≤5 words. Output ONLY the answer entity
[17]

If the assumption specifies a particular year, TV series, edition, or context, answer for THAT specific version — not the most well-known one

STRICTLY follow the assumption. If the assumption specifies a particular year, TV series, edition, or context, answer for THAT specific version — not the most well-known one
[18]

According to

DO NOT explain, hedge, or add qualifiers. No “According to...” or “I think...”
[19]

DO NOT fall back to the most famous answer

If the assumption points to an obscure but real entity, output that entity. DO NOT fall back to the most famous answer
[20]

## Examples {Examples} Question:{clean_sentence} Assumptions: {premises} Answer: Figure 15.Prompt used for the Answerer

If Assumptions say “None.”, answer the question with its most common interpretation. ## Examples {Examples} Question:{clean_sentence} Assumptions: {premises} Answer: Figure 15.Prompt used for the Answerer. 34 Prompt: Clusterer You are a semantic clustering assistant. Your task is to group a list of short answers to a question into clusters based on whethe...
[21]

Paris" and

SAME cluster — surface variations of the same entity: • "Paris" and "paris, france"→same cluster • "1992" and "nineteen ninety-two"→same cluster • "George Washington" and "Washington"→same cluster • "The 2014 CW series" and "2014 Flash"→same cluster • "22 episodes" and "22"→same cluster • "Amy Adams" and "Amy Lou Adams"→same cluster

1992
[22]

Paris" and

DIFFERENT clusters — genuinely distinct entities or facts: • "Paris" and "London"→different clusters • "1990" and "2014"→different clusters • "22" and "23"→different clusters • "France" and "Argentina"→different clusters

1990
[23]

Unknown",

Refusals: "Unknown", "I don’t know", "N/A", or any refusal to answer form their own single cluster, separate from all factual answers
[24]

New York Yankees

Partial overlap: If two answers share some words but refer to different entities, they go in DIFFERENT clusters. - "New York Yankees" and "New York Mets"→different clusters
[26]

## Examples {Examples} ## Task Question: {question} Answers: {answers_numbered_list} Figure 16.Prompt used for the Clusterer

Every answer index must be assigned to exactly one cluster. ## Examples {Examples} ## Task Question: {question} Answers: {answers_numbered_list} Figure 16.Prompt used for the Clusterer. 35 Prompt: Ask4Conf-D Read the following question: Question:{sentence} Provide the probability that this question is ambiguous due to factors such as ambiguous entities, a...
[32]

Figure 18.Prompt used for the Uncertainty-guided Clarification

If the original question is already clear, keep it nearly unchanged. Figure 18.Prompt used for the Uncertainty-guided Clarification. 36 Uncertainty-guided Question Clarification Prompt for ShaQ You rewrite ambiguous QA questions so that answer uncertainty is reduced while the original broad intent is preserved. You are given only an original question, num...
[33]

Return exactly one rewritten direct wh-question
[34]

Preserve the original topic and answer type
[35]

Prefer edits around spans with larger span uncertainty scores
[36]

Do not answer the question
[37]

Do not include a candidate answer in the question unless that text already appeared in the original question
[38]

Do not ask a yes/no question and do not ask the model to list multiple alternatives
[39]

min": {min} ,

If the original question is already clear, keep it nearly unchanged. Figure 19.Prompt used for the Uncertainty-guided Clarification withShaQ. Uncertainty context Numeric uncertainty scores: query_ambiguity: {query_ambiguity} , query_ambiguity_stats: "min": {min} , "max": {max} , "mean": {mean} , "median": {median} , "variance": {variance} , "threshold": {...
[40]

The span has at least two DISTINCT, mutually incompatible interpretations
[41]

Both interpretations are COMMON and strongly supported by the wording plus context
[42]

Choosing between them would materially change the clinical understanding, factual assumption, likely answer, or follow-up question
[43]

If a span is merely imprecise, approximate, broad, or missing detail, that alone is NOT enough

No single interpretation is clearly dominant to a knowledgeable doctor or patient. If a span is merely imprecise, approximate, broad, or missing detail, that alone is NOT enough. MARK as ambiguous only when:
[44]

Ambiguous referents — a pronoun or vague noun could realistically refer to two different symptoms, body parts, time points, or events already active in context
[45]

more detail is needed

Competing medical readings — the wording naturally supports two different concrete clinical meanings, not just “more detail is needed.”
[46]

Missing anchor with competing outcomes — omitted timing, target, or state creates two different plausible medical situations that knowledgeable readers would genuinely disagree between
[47]

cough”, “headache

Clinically meaningful missing degree or duration — severity, frequency, amount, or duration is expressed so loosely that two different COMMON clinical readings remain live, and the context does not make one reading clearly dominant. Do NOT mark as ambiguous: • Clear statements with one dominant clinical interpretation, even if somewhat informal or approxi...
[48]

Each premise describes a specific, real-world interpretation of the span
[49]

Each premise must be strongly supported by the context, not merely theoretically possible
[50]

The premises must be MUTUALLY EXCLUSIVE — they cannot be simultaneously true
[51]

Verify: would a knowledgeable person derive a DIFFERENT medical or semantic meaning under each premise? If not, do not include both
[52]

Keep each premise concise (one short sentence)
[53]

Prefer the MOST COMMON or CLINICALLY RELEV ANT interpretations first
[54]

Do not use obscure variants, highly unlikely edge cases, or unrelated diagnoses unless the context directly makes that dimension salient
[55]

When did the pain start getting worse?

When in doubt, be conservative: return fewer premises rather than low-plausibility ones. ## Example Context:Doctor: “When did the pain start getting worse?” Current Utterance:“It hurt a little every time I exercised.” Span text:“exercised” Ambiguity reason:“The type and intensity of the exercise is unspecified and could change the clinical understanding.”...
[56]

Keep it short and factual

Your answer MUST be≤20 words. Keep it short and factual
[57]

Rewrite the utterance to embed the specific meaning of the assumption into it

STRICTLY follow the assumption. Rewrite the utterance to embed the specific meaning of the assumption into it
[58]

Under the assumption

DO NOT explain, hedge, or add qualifiers (e.g., skip “Under the assumption...”, “This means...”). ONLY output the rewritten meaning
[59]

None.”, answer with the most straightforward interpretation. ## Examples Context:Doctor: “When did the pain start getting worse?

If Assumptions say “None.”, answer with the most straightforward interpretation. ## Examples Context:Doctor: “When did the pain start getting worse?” Current Utterance:“It hurt a little every time I exercised.” Assumptions:“exercised” refers to strenuous weightlifting. Answer:It hurt a little every time I did heavy weightlifting. Context:Doctor: “When did...
[60]

Assign cluster ids as consecutive integers starting from 0
[61]

Every answer index must be assigned to exactly one cluster
[62]

Unknown”, “N/A

Refusals (“Unknown”, “N/A”) go into their own independent cluster. ## TaskMedical Dialogue Context:{context} Current Utterance:{question} Answers:{answers_numbered_list} Respond strictly in the following JSON format. Do not include any explanation or extra text. { "clusters": { "0": <cluster_id>, "1": <cluster_id> } } Figure 25.Prompt for the Clusterer: g...

[1] [1]

the flash

Entity overlap — the same name refers to two or more well-known entities that yield different answers. (e.g., “the flash” → 1990 TV series vs 2014 TV series)

1990

[2] [2]

current”, “recent

Role/title holders — role-based questions where MORE THAN ONE person is commonly associated, especially with words like “current”, “recent”, or no time qualifier. (e.g., “Who is the administrator of the SBA?” — changed multiple times)

[3] [3]

Who won the world cup?

Missing specifier — a key dimension (sport, year, country, gender, edition) is absent and there are multiple well-known correct answers across that dimension. (e.g., “Who won the world cup?” — no sport, no year)

[4] [4]

more famous

Temporal — the answer has changed over time AND the question does NOT fix a specific time period. Err toward marking temporal ambiguity when the role or record holder has changed within the past∼20 years, even if one answer is “more famous”. Do NOT mark as ambiguous: • Questions with a single universally dominant answer that virtually everyone would give....

[5] [6]

Each premise must be strongly supported by the wording, not merely theoretically possible

[6] [7]

All premises must point to real, well-known facts, media, people, or events

DO NOT invent fictional or unverifiable entities. All premises must point to real, well-known facts, media, people, or events

[7] [9]

Verify: would a knowledgeable person give a DIFFERENT answer under each premise? If not, do not include both

[8] [10]

Keep each premise concise, one short sentence

[9] [11]

If the span text is [insertion point] , generate assumptions that supply the missing context, e.g., a specific year, country, sport, or edition

[10] [12]

Prefer the MOST COMMON or WELL-KNOWN interpretations first

[11] [13]

Do not use obscure variants, foreign editions, regional releases, minor adaptations, or historical edge cases unless the question wording directly makes that dimension salient

[12] [14]

Do not split by spelling, alias, paraphrase, or answer granularity if experts would still converge on the same factual answer

[13] [15]

premises

When in doubt, be conservative: return fewer premises rather than low-plausibility ones. ## Examples {Examples} ## Task Question:{clean_sentence} Span id:{span_id} Span text:{span_text} Ambiguity reason:{reason} The"premises"array must contain between 1 and{m}items. Add another item only if it is a strong, common, answer-changing interpretation. Figure 14...

[14] [16]

Output ONLY the answer entity

Your answer MUST be≤5 words. Output ONLY the answer entity

[15] [17]

If the assumption specifies a particular year, TV series, edition, or context, answer for THAT specific version — not the most well-known one

STRICTLY follow the assumption. If the assumption specifies a particular year, TV series, edition, or context, answer for THAT specific version — not the most well-known one

[16] [18]

According to

DO NOT explain, hedge, or add qualifiers. No “According to...” or “I think...”

[17] [19]

DO NOT fall back to the most famous answer

If the assumption points to an obscure but real entity, output that entity. DO NOT fall back to the most famous answer

[18] [20]

## Examples {Examples} Question:{clean_sentence} Assumptions: {premises} Answer: Figure 15.Prompt used for the Answerer

If Assumptions say “None.”, answer the question with its most common interpretation. ## Examples {Examples} Question:{clean_sentence} Assumptions: {premises} Answer: Figure 15.Prompt used for the Answerer. 34 Prompt: Clusterer You are a semantic clustering assistant. Your task is to group a list of short answers to a question into clusters based on whethe...

[19] [21]

Paris" and

SAME cluster — surface variations of the same entity: • "Paris" and "paris, france"→same cluster • "1992" and "nineteen ninety-two"→same cluster • "George Washington" and "Washington"→same cluster • "The 2014 CW series" and "2014 Flash"→same cluster • "22 episodes" and "22"→same cluster • "Amy Adams" and "Amy Lou Adams"→same cluster

1992

[20] [22]

Paris" and

DIFFERENT clusters — genuinely distinct entities or facts: • "Paris" and "London"→different clusters • "1990" and "2014"→different clusters • "22" and "23"→different clusters • "France" and "Argentina"→different clusters

1990

[21] [23]

Unknown",

Refusals: "Unknown", "I don’t know", "N/A", or any refusal to answer form their own single cluster, separate from all factual answers

[22] [24]

New York Yankees

Partial overlap: If two answers share some words but refer to different entities, they go in DIFFERENT clusters. - "New York Yankees" and "New York Mets"→different clusters

[23] [26]

## Examples {Examples} ## Task Question: {question} Answers: {answers_numbered_list} Figure 16.Prompt used for the Clusterer

Every answer index must be assigned to exactly one cluster. ## Examples {Examples} ## Task Question: {question} Answers: {answers_numbered_list} Figure 16.Prompt used for the Clusterer. 35 Prompt: Ask4Conf-D Read the following question: Question:{sentence} Provide the probability that this question is ambiguous due to factors such as ambiguous entities, a...

[24] [32]

Figure 18.Prompt used for the Uncertainty-guided Clarification

If the original question is already clear, keep it nearly unchanged. Figure 18.Prompt used for the Uncertainty-guided Clarification. 36 Uncertainty-guided Question Clarification Prompt for ShaQ You rewrite ambiguous QA questions so that answer uncertainty is reduced while the original broad intent is preserved. You are given only an original question, num...

[25] [33]

Return exactly one rewritten direct wh-question

[26] [34]

Preserve the original topic and answer type

[27] [35]

Prefer edits around spans with larger span uncertainty scores

[28] [36]

Do not answer the question

[29] [37]

Do not include a candidate answer in the question unless that text already appeared in the original question

[30] [38]

Do not ask a yes/no question and do not ask the model to list multiple alternatives

[31] [39]

min": {min} ,

If the original question is already clear, keep it nearly unchanged. Figure 19.Prompt used for the Uncertainty-guided Clarification withShaQ. Uncertainty context Numeric uncertainty scores: query_ambiguity: {query_ambiguity} , query_ambiguity_stats: "min": {min} , "max": {max} , "mean": {mean} , "median": {median} , "variance": {variance} , "threshold": {...

[32] [40]

The span has at least two DISTINCT, mutually incompatible interpretations

[33] [41]

Both interpretations are COMMON and strongly supported by the wording plus context

[34] [42]

Choosing between them would materially change the clinical understanding, factual assumption, likely answer, or follow-up question

[35] [43]

If a span is merely imprecise, approximate, broad, or missing detail, that alone is NOT enough

No single interpretation is clearly dominant to a knowledgeable doctor or patient. If a span is merely imprecise, approximate, broad, or missing detail, that alone is NOT enough. MARK as ambiguous only when:

[36] [44]

Ambiguous referents — a pronoun or vague noun could realistically refer to two different symptoms, body parts, time points, or events already active in context

[37] [45]

more detail is needed

Competing medical readings — the wording naturally supports two different concrete clinical meanings, not just “more detail is needed.”

[38] [46]

Missing anchor with competing outcomes — omitted timing, target, or state creates two different plausible medical situations that knowledgeable readers would genuinely disagree between

[39] [47]

cough”, “headache

Clinically meaningful missing degree or duration — severity, frequency, amount, or duration is expressed so loosely that two different COMMON clinical readings remain live, and the context does not make one reading clearly dominant. Do NOT mark as ambiguous: • Clear statements with one dominant clinical interpretation, even if somewhat informal or approxi...

[40] [48]

Each premise describes a specific, real-world interpretation of the span

[41] [49]

Each premise must be strongly supported by the context, not merely theoretically possible

[42] [50]

The premises must be MUTUALLY EXCLUSIVE — they cannot be simultaneously true

[43] [51]

Verify: would a knowledgeable person derive a DIFFERENT medical or semantic meaning under each premise? If not, do not include both

[44] [52]

Keep each premise concise (one short sentence)

[45] [53]

Prefer the MOST COMMON or CLINICALLY RELEV ANT interpretations first

[46] [54]

Do not use obscure variants, highly unlikely edge cases, or unrelated diagnoses unless the context directly makes that dimension salient

[47] [55]

When did the pain start getting worse?

When in doubt, be conservative: return fewer premises rather than low-plausibility ones. ## Example Context:Doctor: “When did the pain start getting worse?” Current Utterance:“It hurt a little every time I exercised.” Span text:“exercised” Ambiguity reason:“The type and intensity of the exercise is unspecified and could change the clinical understanding.”...

[48] [56]

Keep it short and factual

Your answer MUST be≤20 words. Keep it short and factual

[49] [57]

Rewrite the utterance to embed the specific meaning of the assumption into it

STRICTLY follow the assumption. Rewrite the utterance to embed the specific meaning of the assumption into it

[50] [58]

Under the assumption

DO NOT explain, hedge, or add qualifiers (e.g., skip “Under the assumption...”, “This means...”). ONLY output the rewritten meaning

[51] [59]

None.”, answer with the most straightforward interpretation. ## Examples Context:Doctor: “When did the pain start getting worse?

If Assumptions say “None.”, answer with the most straightforward interpretation. ## Examples Context:Doctor: “When did the pain start getting worse?” Current Utterance:“It hurt a little every time I exercised.” Assumptions:“exercised” refers to strenuous weightlifting. Answer:It hurt a little every time I did heavy weightlifting. Context:Doctor: “When did...

[52] [60]

Assign cluster ids as consecutive integers starting from 0

[53] [61]

Every answer index must be assigned to exactly one cluster

[54] [62]

Unknown”, “N/A

Refusals (“Unknown”, “N/A”) go into their own independent cluster. ## TaskMedical Dialogue Context:{context} Current Utterance:{question} Answers:{answers_numbered_list} Respond strictly in the following JSON format. Do not include any explanation or extra text. { "clusters": { "0": <cluster_id>, "1": <cluster_id> } } Figure 25.Prompt for the Clusterer: g...