Pith · machine review for the scientific record

arxiv: 2604.17274 · v1 · submitted 2026-04-19 · 💻 cs.CV · cs.AI

Recognition: unknown

Instinct vs. Reflection: Unifying Token and Verbalized Confidence in Multimodal Large Models

Anqi Chen, Wenbin Li, Yang Gao, Yifan Jiang, Yizhu Jiang, Yunkai Dang


Pith reviewed 2026-05-10 06:28 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords: multimodal large language models · confidence estimation · calibration · failure prediction · instinct-reflection misalignment · selective prediction

The pith

Multimodal models' token probabilities and verbal self-assessments often conflict, but a monotone fusion rule plus mean alignment produces better-calibrated confidence scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that in multimodal large language models the implicit support from token probabilities frequently diverges from the model's own spoken confidence statements. This instinct-reflection misalignment is addressed by combining the two signals through a monotone fusion that uses their consistency to estimate correctness. A subsequent order-preserving mean alignment step removes global bias while keeping the ordering of scores intact. The result is improved calibration and more accurate failure prediction on both open and closed models. Readers care because better confidence estimates let systems know when to trust or reject an answer in perception and reasoning tasks.

Core claim

Multimodal LLMs exhibit instinct-reflection misalignment between implicit token-level support and verbalized self-assessment. A monotone confidence fusion framework merges these dual-channel signals by leveraging cross-channel consistency to estimate correctness. An order-preserving mean alignment step then corrects global bias, improving calibration while preserving the risk-coverage trade-off for selective prediction.

What carries the argument

Monotone confidence fusion framework that merges token probabilities and verbalized confidence using cross-channel consistency, followed by an order-preserving mean alignment to correct bias.
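The load-bearing machinery can be sketched minimally. This is an editorial illustration, not the paper's actual rule: a convex combination is used because it is monotone in each channel, and a simple agreement score stands in for the paper's cross-channel consistency measure; the function name and the fixed `alpha` weight are assumptions.

```python
import numpy as np

def fuse_confidence(token_conf, verbal_conf, alpha=0.5):
    """Hypothetical monotone fusion of dual-channel confidence.

    A convex combination is non-decreasing in each input for any
    alpha in [0, 1], so the ordering induced by either channel is
    never inverted. The agreement score is a stand-in for the
    paper's cross-channel consistency measure.
    """
    t = np.asarray(token_conf, dtype=float)
    v = np.asarray(verbal_conf, dtype=float)
    fused = alpha * t + (1.0 - alpha) * v
    consistency = 1.0 - np.abs(t - v)  # 1 = channels agree, 0 = maximal conflict
    return fused, consistency
```

A downstream correctness estimator could take `fused` and `consistency` as features; that is where a learned monotone head would replace the fixed `alpha`.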

If this is right

  • The fused scores give more reliable confidence estimates on diverse open-source and closed-source MLLMs.
  • Calibration error decreases while the risk-coverage curve for selective prediction stays intact.
  • Failure prediction improves because the combined signal better tracks actual correctness.
  • The method avoids expensive self-consistency sampling by using only the existing token and verbal outputs.
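The risk-coverage trade-off mentioned in the second point can be computed directly from scored predictions. A minimal editorial sketch, not the paper's code:

```python
import numpy as np

def risk_coverage(confidence, correct):
    """Risk (error rate among accepted answers) at every coverage level,
    accepting answers in order of decreasing confidence. A better
    confidence score pushes the curve down at low coverage."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    order = np.argsort(-confidence)            # most confident first
    errors = (~correct[order]).astype(float)
    accepted = np.arange(1, len(confidence) + 1)
    coverage = accepted / len(confidence)
    risk = np.cumsum(errors) / accepted        # running error rate
    return coverage, risk
```

Because the curve depends only on the ordering of scores, any strictly monotone rescaling (such as the paper's mean alignment) leaves it unchanged.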

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fusion idea could be tested on text-only LLMs where token and verbal confidence also diverge.
  • If the misalignment pattern proves consistent, a single learned fusion function might replace hand-crafted monotone rules.
  • Task-specific alignment steps could further reduce residual bias on domains where global mean correction falls short.

Load-bearing premise

The observed instinct-reflection misalignment stays stable enough across tasks and models to be fixed by one monotone fusion rule and a single global mean alignment step without creating new errors.
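This premise is checkable with cheap diagnostics. A hedged sketch — the metric choices are editorial, not drawn from the paper:

```python
import numpy as np

def misalignment_stats(token_conf, verbal_conf):
    """Two cheap diagnostics for instinct-reflection misalignment:
    the mean absolute gap between channels (the global bias a single
    mean shift could fix) and their Pearson correlation (whether one
    monotone rule can relate them)."""
    t = np.asarray(token_conf, dtype=float)
    v = np.asarray(verbal_conf, dtype=float)
    gap = float(np.mean(np.abs(t - v)))
    corr = float(np.corrcoef(t, v)[0, 1])
    return gap, corr
```

A stable `gap` and high `corr` across tasks and models would support the premise; a task-dependent gap or weak correlation would undermine it.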

What would settle it

The claim would be undercut if, on a held-out collection of MLLMs and tasks, applying the fusion and alignment failed to reduce calibration error, or worsened failure-prediction performance, compared with using either signal alone.

Figures

Figures reproduced from arXiv: 2604.17274 by Anqi Chen, Wenbin Li, Yang Gao, Yifan Jiang, Yizhu Jiang, Yunkai Dang.

Figure 1. Comparison of confidence estimation paradigms. Top/middle: self-consistency methods rely on aggregating multiple sampled outputs via hyperparameter or prompt variations. Bottom (ours): an MLLM-centric, closed-source-compatible framework that extracts dual-channel confidence from a single-pass inference: token confidence (instinct) and verbalized confidence (reflection). These signals are integrated…

Figure 2. Overview of the method. Open-source and proprietary MLLMs are evaluated across diverse benchmarks, employing varying prompt strategies to extract dual-channel uncertainty signals: instinct (implicit token-based probability confidence) and reflection (explicit verbal self-assessment confidence). These complementary cues are integrated via a monotone head followed by order-preserving mean alignment to correct…

Figure 3. Comparison of ECE and AUROC metrics using verbalized confidence, contrasting the performance of closed-source (top) and open-source (bottom) models across six datasets.

Figure 4. (Left) Risk-coverage curves for GPT-4o. Node radius encodes marginal sample volume per confidence bin; only nodes exceeding a minimum threshold are plotted to mitigate small-sample artifacts. (Right) ECE vs. temperature.

Figure 6. Risk–coverage analysis for eight comparative MLLMs: risk (1 − accuracy) at varying coverage levels across six multimodal benchmarks. The top row displays closed-source models (first three) and GLM-4.1V-9B; the bottom row shows open-source models. Across both groups, ScienceQA and MMBench (bottom curves) show high reliability, whereas MMMU-Pro (top curves) remains challengin…

Figure 7. Extended analysis of temperature on calibration: AUPRC-N and ECE metrics for four additional multimodal models (GLM-4.1V-9B, InternVL3.5-8B, MiniCPM-V-2.6, and Phi-3.5-Vision) across six benchmarks, sweeping temperature T ∈ {0.1, 0.5, 0.7, 0.9}. Results indicate that calibration performance remains insensitive to varying temperature.

Figure 8. Extended calibration analysis across ConBench, AI2D, MMStar, and ScienceQA. Each panel displays confidence histograms (top) for correct versus wrong predictions and the corresponding reliability diagrams (bottom); the dashed diagonal represents perfect calibration. Consistent with the findings on MMBench, closed-source models (GPT-4o, GPT-5, Claude-3.7-Sonnet) demonstrate superior calibration with…

Figure 9. Confidence histograms across different models and prompting strategies, comparing the calibration of four MLLMs (columns) under six distinct prompting methods (rows). The x-axis represents confidence intervals and the y-axis the frequency of predictions; blue bars indicate correct predictions, red bars incorrect ones. Top-line metrics report Accuracy (AC…

Figure 10. Empirical confidence distribution (first row) and reliability diagrams (second row) for four MLLMs using Vanilla prompting on MMStar. Columns, left to right: InternVL2.5-4B, Qwen2-VL-7B, Phi-3.5-Vision, and GPT-4o. Red/blue bars in the top row denote wrong and correct predictions, respectively; in the bottom row, the dashed line indicates perfect calibration (y = x).

Figure 11. Prompt templates for the verbalized confidence strategy.

Figure 12. Prompt templates for the top-K and self-probing prompting strategies.

Figure 13. Prompt templates for the role-play, confidence-interval, and CoT confidence strategies.

Figure 14. An illustration of Verbal-Internal Disconnect (VID) in MLLMs during a geography-based QA task (category: General Knowledge; source: MMStar). Question: "Which option describes the object relationship in the image correctly?" Ground truth: A; chosen answer: B. Confidence comparison — A: "The suitcase is on the book" (verbal 35%, token 80%); B: "The suitcase is beneath the cat" (verbal 0%, token 0%); C: "The suitcase …

Figure 15. An illustration of Verbal-Internal Disconnect (VID) in MLLMs during a general knowledge QA task.

Figure 16. An illustration of Under-Confident Correct (UCC) in MLLMs during a general knowledge QA task (category: Public Health; source: MMMU-Pro). Question: "In March 2009, an outbreak of Influenza A (H1N1) occurred... The following table showed the regional distribution... The secondary attack rate of H1N1 influenza in Fuzhou was..." Ground truth: A; chosen answer: A. …

Figure 17. An illustration of Under-Confident Correct (UCC) in MLLMs during a public health QA task.

Figure 18. An illustration of Over-Confident Wrong (OCW) in MLLMs during a geography QA task (category: General Knowledge; source: MMStar). Question: "What is the center of focus in the image?" Ground truth: A; chosen answer: B. Confidence comparison — A: "A man writing in a book" (verbal 0%, token 70%); B: "A boy with his head in his hands surrounded by books" (verbal 100%, token 90%); C: "A cluttered desk with books and a p…

Figure 19. An illustration of Over-Confident Wrong (OCW) in MLLMs during a general knowledge QA task.

Figure 20. An illustration of Well-Calibrated Correct (WCC) in MLLMs during a literary QA task (category: Arts; source: ConBench). Question: "In which form does this artwork exist?" Ground truth: B; chosen answer: B. Confidence comparison — A: Sculpture (verbal 0%, token 20%); B: Painting (verbal 100%, token 90%); C: Digital Art (verbal 0%, token 10%); D: Performance Art (verbal 0%, token 5%). Analysis of discrepancy: this case represents Well-Calibrated C…

Figure 21. An illustration of Well-Calibrated Correct (WCC) in MLLMs during an artistic QA task.

Figure 22. An illustration of Well-Calibrated Wrong (WCW) in MLLMs during an ecological QA task (category: Science / Ecology; source: AI2D). Question: "The diagram below shows a food chain. If the smaller toothed whales ate all the leopard seals, the population of krill would most likely..." Ground truth: C; chosen answer: B. Confidence comparison — A: "remain the same" (verbal 27%, token 20%); B: "decrease" (verbal 44%, token 70…

Figure 23. An illustration of Well-Calibrated Wrong (WCW) in MLLMs during an ecological QA task.
Original abstract

Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in various perception and reasoning tasks. Despite this success, ensuring their reliability in practical deployment necessitates robust confidence estimation. Prior works have predominantly focused on text-only LLMs, often relying on computationally expensive self-consistency sampling. In this paper, we extend this to multimodal settings and conduct a comprehensive evaluation of MLLMs' response confidence estimation. Our analysis reveals a significant instinct-reflection misalignment: the model's implicit token-level support frequently diverges from its verbal self-assessment confidence. To address this misalignment, we propose a monotone confidence fusion framework to merge dual-channel signals and cross-channel consistency to estimate correctness. Subsequently, an order-preserving mean alignment step is applied to correct global bias, which improves calibration while preserving the risk-coverage trade-off for selective prediction. Experiments on diverse open-source and closed-source MLLMs show that our method consistently yields more reliable confidence estimates and improves both calibration and failure prediction. Code will be available at https://github.com/Yunkaidang/Instinct-vs.-Reflection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that MLLMs exhibit a significant instinct-reflection misalignment between implicit token-level confidence and verbalized self-assessment. It proposes a monotone confidence fusion framework to merge these dual-channel signals via cross-channel consistency for correctness estimation, followed by an order-preserving mean alignment step to correct global bias. This is asserted to improve calibration and failure prediction while preserving risk-coverage trade-offs. Experiments across diverse open- and closed-source MLLMs are reported to yield consistently more reliable confidence estimates.

Significance. If the central claims hold, the work provides a lightweight, sampling-free approach to unifying dual confidence signals in multimodal settings, which could meaningfully advance reliable deployment of MLLMs. Strengths include evaluation on both open-source and closed-source models and the planned code release, which aids reproducibility. The focus on preserving selective-prediction properties is a positive design choice.

major comments (3)
  1. [§3] §3 (Method): The monotone confidence fusion framework and subsequent order-preserving mean alignment are described at a high level but without explicit equations, pseudocode, or implementation details for how token probabilities are mapped and fused with verbalized confidence (including determination of any fusion weights or thresholds). This is load-bearing for the central claim, as the reported improvements depend on these steps producing better-calibrated estimates without introducing new errors or violating monotonicity.
  2. [§4] §4 (Experiments): The abstract and summary state that the method 'consistently yields more reliable confidence estimates' and improves calibration/failure prediction, yet no quantitative results (e.g., ECE, Brier score, or AUROC values), ablation studies isolating fusion versus alignment, or details on parameter selection are supplied. This prevents verification that the improvements are not artifacts of the specific benchmarks or that the approach generalizes when misalignment varies across perception versus reasoning tasks.
  3. [§3.2] §3.2: The order-preserving mean alignment step assumes a stable, globally correctable bias between channels. If the misalignment pattern is non-monotonic or task-dependent (as could occur in vision-language settings), the single global shift risks degrading calibration on unseen data; no sensitivity analysis or task-stratified results are provided to support the assumption.
minor comments (2)
  1. [Abstract] The abstract would benefit from including one or two key quantitative metrics to substantiate the 'consistent improvements' claim.
  2. [§1] Define notation for 'instinct' (token-level) and 'reflection' (verbalized) signals explicitly in the introduction or §2 for clarity.
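The ECE metric the referee asks for has a standard binned form; a minimal reference implementation (editorial, not from the paper) makes the quantity concrete:

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """Standard binned ECE: coverage-weighted mean |accuracy - confidence|
    over equal-width confidence bins."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n, ece = len(confidence), 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # First bin is closed on the left so confidence == 0 is counted.
        left = confidence >= lo if i == 0 else confidence > lo
        mask = left & (confidence <= hi)
        if mask.any():
            ece += mask.sum() / n * abs(correct[mask].mean() - confidence[mask].mean())
    return float(ece)
```

Reporting this alongside AUROC and Brier score, per benchmark, would substantiate the "consistent improvements" claim the referee flags.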

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that greater methodological transparency, explicit quantitative results, and additional robustness checks will strengthen the paper. We address each major comment below and will incorporate the suggested revisions.

Point-by-point responses
  1. Referee: [§3] §3 (Method): The monotone confidence fusion framework and subsequent order-preserving mean alignment are described at a high level but without explicit equations, pseudocode, or implementation details for how token probabilities are mapped and fused with verbalized confidence (including determination of any fusion weights or thresholds). This is load-bearing for the central claim, as the reported improvements depend on these steps producing better-calibrated estimates without introducing new errors or violating monotonicity.

    Authors: We appreciate the referee's emphasis on reproducibility. The current description in §3 is indeed high-level. In the revised manuscript we will add the explicit equations for token-to-confidence mapping (normalized answer-token probability), the cross-channel consistency measure used for fusion, the weighted fusion formula, and the order-preserving mean alignment shift. We will also include pseudocode for the full pipeline and specify how fusion weights are derived from consistency and how any thresholds are selected on a validation split. revision: yes

  2. Referee: [§4] §4 (Experiments): The abstract and summary state that the method 'consistently yields more reliable confidence estimates' and improves calibration/failure prediction, yet no quantitative results (e.g., ECE, Brier score, or AUROC values), ablation studies isolating fusion versus alignment, or details on parameter selection are supplied. This prevents verification that the improvements are not artifacts of the specific benchmarks or that the approach generalizes when misalignment varies across perception versus reasoning tasks.

    Authors: We acknowledge that the experimental section would benefit from more granular reporting. While the manuscript already demonstrates consistent gains across open- and closed-source models, the revision will add explicit tables with ECE, Brier score, and AUROC numbers, plus risk-coverage curves. We will further include ablations that isolate the fusion step from the alignment step and report parameter-selection details together with separate results for perception versus reasoning tasks. revision: yes

  3. Referee: [§3.2] §3.2: The order-preserving mean alignment step assumes a stable, globally correctable bias between channels. If the misalignment pattern is non-monotonic or task-dependent (as could occur in vision-language settings), the single global shift risks degrading calibration on unseen data; no sensitivity analysis or task-stratified results are provided to support the assumption.

    Authors: The referee correctly identifies a key assumption. Our empirical observations across the evaluated models and tasks indicated that the channel bias is largely monotonic and globally correctable. To address potential task dependence, the revised version will add sensitivity analysis on the alignment shift parameter and task-stratified results (perception vs. reasoning). These additions will either corroborate the assumption or clearly delineate its limitations. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical method proposed without self-referential reduction

full rationale

The paper reports an observed instinct-reflection misalignment in MLLMs, then introduces a monotone fusion framework plus order-preserving mean alignment as a corrective procedure. No equations, derivations, or self-citations are exhibited that would make the claimed calibration gains equivalent to a fitted parameter or input quantity defined from the same data by construction. The central claim rests on empirical evaluation across models rather than a closed logical loop, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim rests on the existence of a general instinct-reflection misalignment in MLLMs and on the effectiveness of the newly introduced fusion and alignment components; no external benchmarks or formal derivations are cited to support these.

free parameters (2)
  • fusion weights or thresholds
    The monotone fusion framework likely requires parameters to balance the two channels and to enforce monotonicity.
  • mean alignment offset
    Order-preserving mean alignment step uses a global correction whose value is determined from data.
axioms (2)
  • domain assumption Token probabilities constitute an implicit confidence signal distinct from verbalized self-assessment.
    Invoked when the paper defines instinct versus reflection and measures their misalignment.
  • domain assumption Cross-channel consistency is a reliable indicator of correctness.
    Used to justify merging the dual signals for correctness estimation.
invented entities (1)
  • monotone confidence fusion framework no independent evidence
    purpose: Merges token-level and verbalized signals using cross-channel consistency to estimate correctness.
    Newly proposed construct with no independent evidence supplied in the abstract.
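The "mean alignment offset" free parameter admits a simple order-preserving realization: shift every score by one constant in logit space until the mean matches a target. This sketch is an editorial guess at such a step, not the paper's actual procedure:

```python
import numpy as np

def mean_align(conf, target_mean, eps=1e-6):
    """Order-preserving mean alignment: add one constant to all scores in
    logit space (a strictly increasing map) so the mean confidence hits
    target_mean. Ranking, and hence the risk-coverage curve, is unchanged."""
    conf = np.clip(np.asarray(conf, dtype=float), eps, 1.0 - eps)
    logits = np.log(conf / (1.0 - conf))
    lo, hi = -20.0, 20.0                       # bracket for the shift
    for _ in range(80):                        # bisect: mean is increasing in shift
        mid = 0.5 * (lo + hi)
        if (1.0 / (1.0 + np.exp(-(logits + mid)))).mean() < target_mean:
            lo = mid
        else:
            hi = mid
    return 1.0 / (1.0 + np.exp(-(logits + 0.5 * (lo + hi))))
```

Because the shift is a single global constant, it removes mean bias without touching the score ordering, which is exactly the property the ledger attributes to the alignment step.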

pith-pipeline@v0.9.0 · 5501 in / 1367 out tokens · 43638 ms · 2026-05-10T06:28:14.124357+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking

cs.CL · 2026-05 · unverdicted · novelty 7.0

    BICR uses blind-image contrastive ranking on frozen LVLM hidden states to train a lightweight probe that penalizes confidence on blacked-out inputs, yielding top calibration and discrimination across five models and m...

    When you assign the "confidence" value for each option, carefully consider: • The difficulty or complexity of the question, • The availability (or lack) of relevant knowledge, • Any ambiguity in the prompt or the attached image, • Other potential sources of uncertainty (e.g., incomplete reasoning). Your "confidence" should reflect how likely you believe t...