VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation
Pith reviewed 2026-05-07 16:41 UTC · model grok-4.3
The pith
VLM judges can rank responses correctly but fail to assign reliable absolute scores due to strongly task-dependent uncertainty.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VLM judges achieve high ranking correlation while producing wide, uninformative intervals, correctly ordering responses but failing to assign reliable absolute scores. Evaluation uncertainty is strongly task-dependent: intervals cover roughly 40 percent of the score range for aesthetics and natural images but expand to about 70 percent for chart and mathematical reasoning tasks. Interval width is driven primarily by task difficulty and annotation quality, with the same judge and method yielding 4.5 times narrower intervals on a clean, multi-annotator captioning benchmark.
What carries the argument
Conformal prediction intervals built directly from the judge's score-token log-probabilities, converting a point score into a calibrated interval without retraining or additional data assumptions.
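The recipe admits a compact sketch. Assuming the judge exposes log-probabilities over discrete score tokens (say 1–10) and a held-out calibration set with human scores, a split-conformal interval can be built roughly as follows; the function names and the absolute-residual nonconformity score are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def expected_score(logprobs):
    """Point score: probability-weighted mean over the discrete score tokens.

    logprobs maps each score token (e.g. 1..10) to its log-probability.
    """
    tokens = np.array(sorted(logprobs))
    probs = np.exp([logprobs[t] for t in tokens])
    probs /= probs.sum()  # renormalize over the score tokens alone
    return float(tokens @ probs)

def calibrate(cal_logprobs, cal_scores, alpha=0.1):
    """Split conformal: finite-sample quantile of absolute residuals between
    human calibration scores and the judge's expected scores."""
    residuals = np.abs([y - expected_score(lp)
                        for lp, y in zip(cal_logprobs, cal_scores)])
    n = len(residuals)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return float(np.quantile(residuals, level, method="higher"))

def predict_interval(logprobs, q_hat):
    """Calibrated interval around the judge's point score."""
    mu = expected_score(logprobs)
    return mu - q_hat, mu + q_hat
```

No retraining is involved: calibration reduces to one quantile over residuals, which is what makes the method cheap to apply per task category.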
Load-bearing premise
The conformal prediction intervals constructed from score-token log-probabilities are valid and meaningful for VLM outputs without extra calibration or assumptions on the underlying data distribution.
What would settle it
A finding that the constructed intervals stay narrow, cover the true scores at the claimed rate, and show little variation across all 14 task categories would falsify the task-dependent uncertainty and ranking-scoring decoupling claims.
Original abstract
Vision-language models (VLMs) are increasingly used as automated judges for multimodal systems, yet their scores provide no indication of reliability. We study this problem through conformal prediction, a distribution-free framework that converts a judge's point score into a calibrated prediction interval using only score-token log-probabilities, with no retraining. We present the first systematic analysis of conformal prediction for VLM-as-a-Judge across 3 judges and 14 visual task categories. Our results show that evaluation uncertainty is strongly task-dependent: intervals cover ~40% of the score range for aesthetics and natural images but expand to ~70% for chart and mathematical reasoning, yielding a quantitative reliability map for multimodal evaluation. We further identify a failure mode not captured by standard evaluation metrics, ranking-scoring decoupling, where judges achieve high ranking correlation while producing wide, uninformative intervals, correctly ordering responses but failing to assign reliable absolute scores. Finally, we show that interval width is driven primarily by task difficulty and annotation quality, i.e., the same judge and method yield 4.5x narrower intervals on a clean, multi-annotator captioning benchmark. Code: https://github.com/divake/VLM-Judge-Uncertainty
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper applies conformal prediction to convert point scores from VLM judges into calibrated prediction intervals using only score-token log-probabilities, without retraining. Across 3 VLMs and 14 visual task categories, it reports that interval widths are strongly task-dependent (narrower for aesthetics/natural images, wider for charts/math reasoning), identifies a ranking-scoring decoupling where high rank correlation coexists with uninformative absolute-score intervals, and shows that interval width is driven by task difficulty and annotation quality (e.g., 4.5x narrower on clean multi-annotator data). Code is released.
Significance. If the intervals are valid, the work supplies a practical, distribution-free tool for quantifying VLM-judge reliability and produces a task-specific reliability map that could inform evaluation pipelines in multimodal systems. The ranking-scoring decoupling observation and the link to annotation quality are useful distinctions not captured by standard metrics. Releasing reproducible code is a clear strength.
major comments (2)
- §4 (Experiments) and §3.2 (Conformal Prediction Setup): No table, figure, or text reports empirical coverage rates on held-out data to verify that the constructed intervals attain the nominal level (e.g., 90%). The central claims—task-dependent interval widths and ranking-scoring decoupling—rest on these intervals being calibrated; without coverage diagnostics the width comparisons and reliability map are not yet interpretable.
- §5.1 (Ranking-scoring decoupling): The decoupling is demonstrated by juxtaposing Spearman rank correlation against interval width, but the manuscript provides no statistical test for the significance of the observed dissociation and no controls for task-specific variance in score distributions. This weakens the claim that the phenomenon is a distinct failure mode rather than a direct consequence of wide intervals.
minor comments (3)
- Abstract and §2: The claim of a "first systematic analysis" should be qualified by explicit comparison to prior conformal-prediction applications in NLP and vision-language evaluation.
- Figure 2 and Table 1: Axis labels and legends should explicitly state the conformal significance level α used for all reported intervals.
- §3.2 (Notation): The mapping from raw score-token log-probabilities to nonconformity scores is described only in prose; a short equation or pseudocode block would improve clarity.
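A short sketch of what such an equation block could contain (reconstructed from the abstract's description of the method, so the exact form is an assumption, not the paper's own notation):

```latex
% Point score from score-token log-probabilities \ell_k(x), tokens k = 1,\dots,K:
\hat{y}(x) = \sum_{k=1}^{K} k \, p_k(x),
\qquad
p_k(x) = \frac{\exp(\ell_k(x))}{\sum_{j=1}^{K} \exp(\ell_j(x))}

% Nonconformity score on calibration pair (x_i, y_i):
s_i = \lvert y_i - \hat{y}(x_i) \rvert

% Conformal quantile and prediction interval at level 1 - \alpha:
\hat{q} = \mathrm{Quantile}_{\lceil (n+1)(1-\alpha) \rceil / n}
          \bigl( \{ s_i \}_{i=1}^{n} \bigr),
\qquad
C(x) = \bigl[ \hat{y}(x) - \hat{q},\; \hat{y}(x) + \hat{q} \bigr]
```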
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment below and outline revisions that will strengthen the empirical support for our claims while preserving the core contributions of the work.
Point-by-point responses
Referee: §4 (Experiments) and §3.2 (Conformal Prediction Setup): No table, figure, or text reports empirical coverage rates on held-out data to verify that the constructed intervals attain the nominal level (e.g., 90%). The central claims—task-dependent interval widths and ranking-scoring decoupling—rest on these intervals being calibrated; without coverage diagnostics the width comparisons and reliability map are not yet interpretable.
Authors: We agree that explicit empirical coverage diagnostics would improve interpretability. Although the conformal prediction procedure we employ carries a theoretical marginal coverage guarantee under exchangeability, reporting achieved coverage on held-out data is a standard and useful practice. In the revised manuscript we will add a table in §4 (or a dedicated subsection) that reports empirical coverage rates at the nominal 90% level across all three VLMs and the 14 task categories. This addition will directly confirm calibration and support the subsequent width comparisons and reliability map. revision: yes
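The requested diagnostic is mechanical to compute once intervals and held-out human scores exist; a minimal sketch (the intervals and scores below are made-up illustrative numbers, not the paper's data):

```python
import numpy as np

def empirical_coverage(intervals, true_scores):
    """Fraction of held-out human scores that land inside their predicted intervals."""
    hits = [lo <= y <= hi for (lo, hi), y in zip(intervals, true_scores)]
    return float(np.mean(hits))

# toy held-out data for one task category
intervals = [(3.1, 7.4), (5.0, 9.2), (1.8, 6.0), (4.4, 8.1)]
true_scores = [5, 8, 7, 6]
print(empirical_coverage(intervals, true_scores))  # 0.75, below a 90% nominal level
```

Reporting this number per task category and per judge, next to the nominal level, is exactly the table the referee asks for.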
Referee: §5.1 (Ranking-scoring decoupling): The decoupling is demonstrated by juxtaposing Spearman rank correlation against interval width, but the manuscript provides no statistical test for the significance of the observed dissociation and no controls for task-specific variance in score distributions. This weakens the claim that the phenomenon is a distinct failure mode rather than a direct consequence of wide intervals.
Authors: The referee correctly notes the absence of formal statistical controls. To address this, we will augment §5.1 with (i) a test of the null hypothesis that the observed dissociation between rank correlation and interval width is explained solely by interval width (e.g., via partial correlation or a regression that includes task-level score variance as a covariate) and (ii) stratification or regression-based controls for task-specific variance. These additions will clarify that the decoupling is not an artifact of wide intervals alone and will strengthen its presentation as a distinct failure mode. revision: yes
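The proposed control can be sketched as a partial correlation: residualize both per-task variables on the covariate, then correlate the residuals. The vectors below are placeholders, not the paper's measurements:

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation of x and y after linearly regressing out the control z."""
    def residualize(v):
        Z = np.column_stack([np.ones_like(z), z])
        beta, *_ = np.linalg.lstsq(Z, v, rcond=None)
        return v - Z @ beta
    rx, ry = residualize(x), residualize(y)
    return float(np.corrcoef(rx, ry)[0, 1])

# placeholder per-task vectors: e.g. rank correlation (x), interval width (y),
# task-level score variance (z) as the control covariate
z = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x = z + np.array([0.1, -0.1, 0.2, -0.2, 0.0])      # appears correlated with y ...
y = 2 * z - np.array([0.1, -0.1, 0.2, -0.2, 0.0])  # ... but only through z
print(np.corrcoef(x, y)[0, 1])  # strongly positive raw correlation
print(partial_corr(x, y, z))    # ≈ -1: the apparent link reverses once z is removed
```

If the rank-correlation/interval-width dissociation survives this kind of control, it supports reading the decoupling as a distinct failure mode rather than a variance artifact.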
Circularity Check
No significant circularity; standard conformal framework applied empirically
Full rationale
The paper applies the known distribution-free conformal prediction procedure to convert VLM point scores into intervals using score-token log-probabilities as the nonconformity measure. All reported findings (task-dependent interval widths, ranking-scoring decoupling, and the 4.5x width difference on clean benchmarks) are empirical observations obtained by running the fixed method across 14 task categories and 3 judges. No equation reduces a claimed prediction to a fitted parameter by construction, no uniqueness theorem is imported from prior self-work, and no ansatz is smuggled via citation. The derivation chain remains self-contained against the external conformal prediction literature and the experimental data.
Axiom & Free-Parameter Ledger
free parameters (1)
- conformal significance level (α)
axioms (1)
- [standard math] Conformal prediction provides distribution-free, valid prediction intervals under an exchangeability assumption.
Reference graph
Works this paper leans on
- [1] Marah Abdin, Sahil Agarwal, Aman Agrawal, et al. Phi-4-reasoning technical report. arXiv preprint arXiv:2504.21318, 2025.
- [2] Anastasios N. Angelopoulos, Stephen Bates, Emmanuel J. Candès, Michael I. Jordan, and Lihua Lei. Learn then test: Calibrating predictive algorithms to achieve risk control. arXiv preprint arXiv:2110.01052, 2021.
- [3] Nastaran Darabi, Priyesh Shukla, Dinithi Jayasuriya, Divake Kumar, Alex Christopher Stutts, and Amit Ranjan Trivedi. Navigating the unknown: Uncertainty-aware compute-in-memory autonomy of edge robotics. In 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1-6, 2024.
- [4] Dongping Chen, Ruoxi Chen, Shilin Zhang, Yinuo Liu, Yaochen Wang, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, and Lichao Sun. MLLM-as-a-judge: Assessing multimodal LLM-as-a-judge with vision-language benchmark. In International Conference on Machine Learning (ICML), 2024.
- [5] Divake Kumar, Patrick Poggi, Sina Tayebati, Devashri Naik, Nilesh Ahuja, and Amit Ranjan Trivedi. Calibrated decomposition of aleatoric and epistemic uncertainty in deep features for inference-time adaptation. arXiv preprint arXiv:2511.12389, 2025a.
- [6] Google DeepMind. Gemini 2.5: Our newest Gemini model with thinking. https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/, 2025.
- [7] Sina Tayebati, Divake Kumar, Nastaran Darabi, Dinithi Jayasuriya, Ranganath Krishnan, and Amit Ranjan Trivedi. Learning conformal abstention policies for adaptive risk management in large language and vision-language models. arXiv preprint arXiv:2502.06884, 2025a.
- [8] Etash Guha, Shlok Natarajan, Thomas Möllenhoff, Mohammad Emtiyaz Khan, and Eugene Ndiaye. Conformal prediction via regression-as-classification. In International Conference on Learning Representations (ICLR), 2024.
- [9] Hangfeng He, Hongming Zhang, and Dan Roth. SocREval: Large language models with the socratic method for reference-free reasoning evaluation. In Findings of the Association for Computational Linguistics: NAACL 2024, 2024.
- [10] Alex Christopher Stutts, Divake Kumar, Theja Tulabandhula, and Amit Ranjan Trivedi. Conformal inference meets evidential learning: Distribution-free uncertainty quantification with epistemic and aleatoric separability. In Proceedings of the 61st ACM/IEEE Design Automation Conference (DAC), pp. 1-4, 2024.
- [11] Bhawesh Kumar, Charlie Lu, Gauri Gupta, Anil Palepu, David Bellamy, Ramesh Raskar, and Andrew Beam. Conformal prediction with large language models for multi-choice question answering. arXiv preprint arXiv:2305.18404, 2023.
- [12] Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, Lingpeng Kong, and Qi Liu. VL-RewardBench: A challenging benchmark for vision-language generative reward models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
- [13] Nastaran Darabi, Divake Kumar, Sina Tayebati, and Amit Ranjan Trivedi. INTACT: Inducing noise tolerance through adversarial curriculum training for LiDAR-based safety-critical perception and autonomy. arXiv preprint arXiv:2502.01896, 2025.
- [14] Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. Locally valid and discriminative prediction intervals for deep learning models. In Advances in Neural Information Processing Systems, 2021.
- [15] Divake Kumar, Sina Tayebati, Nastaran Darabi, Vita Pi-Ho Hu, and Amit Ranjan Trivedi. Uncertainty-aware LiDAR-camera autonomy via conformal prediction and principled abstention. In 2025 IEEE International Conference on Omni-layer Intelligent Systems (COINS), pp. 1-6. IEEE, 2025b. doi:10.1109/COINS65080.2025.11125785.
- [16] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2511-2522, 2023.
- [17] Charles Lu, Andréanne Lemay, Ken Chang, Katharina Höbel, and Jayashree Kalpathy-Cramer. Fair conformal predictors for applications in medical imaging. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 12008-12016, 2022.
- [18] Devashri Naik, Divake Kumar, Nastaran Darabi, and Amit Ranjan Trivedi. Belief dynamics for detecting behavioral shifts in safe collaborative manipulation. arXiv preprint arXiv:2604.04967, 2026.
- [19] Christopher Mohri and Tatsunori Hashimoto. Language models with conformal factuality guarantees. In International Conference on Machine Learning (ICML), 2024.
- [20] Patrick Poggi, Divake Kumar, Theja Tulabandhula, and Amit Ranjan Trivedi. Uncertainty-guided inference-time depth adaptation for transformer-based visual tracking. arXiv preprint arXiv:2602.16160, 2026.
- [21] Victor Quach, Adam Fisch, Tal Schuster, Adam Yala, Jae Ho Sohn, Tommi S. Jaakkola, and Regina Barzilay. Conformal language modeling. In International Conference on Learning Representations (ICLR), 2024.
- [22] Sina Tayebati, Divake Kumar, Nastaran Darabi, Dinithi Jayasuriya, Theja Tulabandhula, Ranganath Krishnan, and Amit Ranjan Trivedi. CAP: Conformalized abstention policies for context-adaptive risk management for LLMs and VLMs. In Proceedings of the 17th Asian Conference on Machine Learning (ACML), Conference Track, 2025b.
- [23] Yaniv Romano, Evan Patterson, and Emmanuel Candès. Conformalized quantile regression. In Advances in Neural Information Processing Systems, 2019.
- [24] Divake Kumar, Sina Tayebati, Francesco Migliarba, Ranganath Krishnan, and Amit Ranjan Trivedi. Learnable conformal prediction with context-aware nonconformity functions for robotic planning and perception. arXiv preprint arXiv:2509.21955, 2025c.
- [25] Matteo Sesia and Emmanuel J. Candès. A comparison of some conformal quantile regression methods. Stat, 9(1): e261, 2020.
- [26] Matteo Sesia and Yaniv Romano. Conformal prediction using conditional histograms. In Advances in Neural Information Processing Systems, 2021.
- [27] Amit Ranjan Trivedi, Sina Tayebati, Hemant Kumawat, Nastaran Darabi, Divake Kumar, Adarsh Kumar Kosta, Yeshwanth Venkatesha, Dinithi Jayasuriya, Nethmi Jayasinghe, Priyadarshini Panda, Saibal Mukhopadhyay, and Kaushik Roy. Intelligent sensing-to-action for robust autonomy at the edge: Opportunities and challenges. In 2025 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2025.
- [28] Huanxin Sheng, Xinyi Liu, Hangfeng He, Jieyu Zhao, and Jian Kang. Analyzing uncertainty of LLM-as-a-judge: Interval evaluations with conformal prediction. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 11286-11328, 2025.
- [29] Jiayuan Su, Jing Luo, Hongwei Wang, and Lu Cheng. API is enough: Conformal prediction for large language models without logit-access. In Findings of the Association for Computational Linguistics: EMNLP 2024, 2024.
- [30] Sina Tayebati, Divake Kumar, Nastaran Darabi, Davide Ettori, Ranganath Krishnan, and Amit Ranjan Trivedi. TRACER: Trajectory risk aggregation for critical episodes in agentic reasoning. arXiv preprint arXiv:2602.11409, 2026.
- [31] Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D. Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
- [32] Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, 2005.
- [33] Divake Kumar, Sina Tayebati, Devashri Naik, Patrick Poggi, Amanda Sofie Rios, Nilesh Ahuja, and Amit Ranjan Trivedi. TRIAGE: Type-routed interventions via aleatoric-epistemic gated estimation in robotic manipulation and adaptive perception---don't treat all uncertainty the same. arXiv preprint arXiv:2603.08128, 2026.
- [34] Yuiga Wada, Kanta Kaneda, Daichi Saito, and Komei Sugiura. Polos: Multimodal metric learning from human feedback for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
- [35] Nico Wagner, Michael Desmond, Rahul Nair, Zahra Ashktorab, Elizabeth M. Daly, Qian Pan, Martin Santillan Cooper, James M. Johnson, and Werner Geyer. Black-box uncertainty quantification method for LLM-as-a-judge. arXiv preprint arXiv:2410.11594, 2024.
- [36] Ran Xie, Rina Foygel Barber, and Emmanuel J. Candès. Boosted conformal prediction intervals. In Advances in Neural Information Processing Systems, 2024.
- [37] Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. In International Conference on Learning Representations (ICLR), 2024.
- [38] Tianyi Xiong, Xiyao Wang, Dong Guo, Qinghao Ye, Haoqi Fan, Quanquan Gu, Heng Huang, and Chunyuan Li. LLaVA-Critic: Learning to evaluate multimodal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
- [39] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, et al. Judging LLM-as-a-judge with MT-Bench and chatbot arena. In Advances in Neural Information Processing Systems, 2023.