VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation
Pith reviewed 2026-05-07 16:41 UTC · model grok-4.3
The pith
VLM judges can rank responses correctly but fail to assign reliable absolute scores due to strongly task-dependent uncertainty.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VLM judges achieve high ranking correlation while producing wide, uninformative intervals, correctly ordering responses but failing to assign reliable absolute scores. Evaluation uncertainty is strongly task-dependent: intervals cover roughly 40 percent of the score range for aesthetics and natural images but expand to about 70 percent for chart and mathematical reasoning tasks. Interval width is driven primarily by task difficulty and annotation quality, with the same judge and method yielding 4.5 times narrower intervals on a clean, multi-annotator captioning benchmark.
What carries the argument
Conformal prediction intervals built directly from the judge's score-token log-probabilities, converting a point score into a calibrated interval without retraining or additional data assumptions.
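The recipe admits a compact sketch. Assuming the judge exposes log-probabilities over discrete score tokens (say 1–10) and a held-out calibration set with human scores, a split-conformal interval can be built roughly as follows; the function names and the absolute-residual nonconformity score are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def expected_score(logprobs):
    """Point score: probability-weighted mean over the discrete score tokens.

    logprobs maps each score token (e.g. 1..10) to its log-probability.
    """
    tokens = np.array(sorted(logprobs))
    probs = np.exp([logprobs[t] for t in tokens])
    probs /= probs.sum()  # renormalize over the score tokens alone
    return float(tokens @ probs)

def calibrate(cal_logprobs, cal_scores, alpha=0.1):
    """Split conformal: finite-sample quantile of absolute residuals between
    human calibration scores and the judge's expected scores."""
    residuals = np.abs([y - expected_score(lp)
                        for lp, y in zip(cal_logprobs, cal_scores)])
    n = len(residuals)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return float(np.quantile(residuals, level, method="higher"))

def predict_interval(logprobs, q_hat):
    """Calibrated interval around the judge's point score."""
    mu = expected_score(logprobs)
    return mu - q_hat, mu + q_hat
```

No retraining is involved: calibration reduces to one quantile over residuals, which is what makes the method cheap to apply per task category.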
Load-bearing premise
The conformal prediction intervals constructed from score-token log-probabilities are valid and meaningful for VLM outputs without extra calibration or assumptions on the underlying data distribution.
What would settle it
A finding that the constructed intervals stay narrow, cover the true scores at the claimed rate, and show little variation across all 14 task categories would falsify the task-dependent uncertainty and ranking-scoring decoupling claims.
Original abstract
Vision-language models (VLMs) are increasingly used as automated judges for multimodal systems, yet their scores provide no indication of reliability. We study this problem through conformal prediction, a distribution-free framework that converts a judge's point score into a calibrated prediction interval using only score-token log-probabilities, with no retraining. We present the first systematic analysis of conformal prediction for VLM-as-a-Judge across 3 judges and 14 visual task categories. Our results show that evaluation uncertainty is strongly task-dependent: intervals cover ~40% of the score range for aesthetics and natural images but expand to ~70% for chart and mathematical reasoning, yielding a quantitative reliability map for multimodal evaluation. We further identify a failure mode not captured by standard evaluation metrics, ranking-scoring decoupling, where judges achieve high ranking correlation while producing wide, uninformative intervals, correctly ordering responses but failing to assign reliable absolute scores. Finally, we show that interval width is driven primarily by task difficulty and annotation quality, i.e., the same judge and method yield 4.5x narrower intervals on a clean, multi-annotator captioning benchmark. Code: https://github.com/divake/VLM-Judge-Uncertainty
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper applies conformal prediction to convert point scores from VLM judges into calibrated prediction intervals using only score-token log-probabilities, without retraining. Across 3 VLMs and 14 visual task categories, it reports that interval widths are strongly task-dependent (narrower for aesthetics/natural images, wider for charts/math reasoning), identifies a ranking-scoring decoupling where high rank correlation coexists with uninformative absolute-score intervals, and shows that interval width is driven by task difficulty and annotation quality (e.g., 4.5x narrower on clean multi-annotator data). Code is released.
Significance. If the intervals are valid, the work supplies a practical, distribution-free tool for quantifying VLM-judge reliability and produces a task-specific reliability map that could inform evaluation pipelines in multimodal systems. The ranking-scoring decoupling observation and the link to annotation quality are useful distinctions not captured by standard metrics. Releasing reproducible code is a clear strength.
major comments (2)
- §4 (Experiments) and §3.2 (Conformal Prediction Setup): No table, figure, or text reports empirical coverage rates on held-out data to verify that the constructed intervals attain the nominal level (e.g., 90%). The central claims—task-dependent interval widths and ranking-scoring decoupling—rest on these intervals being calibrated; without coverage diagnostics the width comparisons and reliability map are not yet interpretable.
- §5.1 (Ranking-scoring decoupling): The decoupling is demonstrated by juxtaposing Spearman rank correlation against interval width, but the manuscript provides no statistical test for the significance of the observed dissociation and no controls for task-specific variance in score distributions. This weakens the claim that the phenomenon is a distinct failure mode rather than a direct consequence of wide intervals.
minor comments (3)
- Abstract and §2: The claim of a "first systematic analysis" should be qualified by explicit comparison to prior conformal-prediction applications in NLP and vision-language evaluation.
- Figure 2 and Table 1: Axis labels and legends should explicitly state the conformal significance level α used for all reported intervals.
- §3.2 (Notation): The mapping from raw score-token log-probabilities to nonconformity scores is described only in prose; a short equation or pseudocode block would improve clarity.
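A short sketch of what such an equation block could contain (reconstructed from the abstract's description of the method, so the exact form is an assumption, not the paper's own notation):

```latex
% Point score from score-token log-probabilities \ell_k(x), tokens k = 1,\dots,K:
\hat{y}(x) = \sum_{k=1}^{K} k \, p_k(x),
\qquad
p_k(x) = \frac{\exp(\ell_k(x))}{\sum_{j=1}^{K} \exp(\ell_j(x))}

% Nonconformity score on calibration pair (x_i, y_i):
s_i = \lvert y_i - \hat{y}(x_i) \rvert

% Conformal quantile and prediction interval at level 1 - \alpha:
\hat{q} = \mathrm{Quantile}_{\lceil (n+1)(1-\alpha) \rceil / n}
          \bigl( \{ s_i \}_{i=1}^{n} \bigr),
\qquad
C(x) = \bigl[ \hat{y}(x) - \hat{q},\; \hat{y}(x) + \hat{q} \bigr]
```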
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment below and outline revisions that will strengthen the empirical support for our claims while preserving the core contributions of the work.
Point-by-point responses
Referee: §4 (Experiments) and §3.2 (Conformal Prediction Setup): No table, figure, or text reports empirical coverage rates on held-out data to verify that the constructed intervals attain the nominal level (e.g., 90%). The central claims—task-dependent interval widths and ranking-scoring decoupling—rest on these intervals being calibrated; without coverage diagnostics the width comparisons and reliability map are not yet interpretable.
Authors: We agree that explicit empirical coverage diagnostics would improve interpretability. Although the conformal prediction procedure we employ carries a theoretical marginal coverage guarantee under exchangeability, reporting achieved coverage on held-out data is a standard and useful practice. In the revised manuscript we will add a table in §4 (or a dedicated subsection) that reports empirical coverage rates at the nominal 90% level across all three VLMs and the 14 task categories. This addition will directly confirm calibration and support the subsequent width comparisons and reliability map. revision: yes
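The requested diagnostic is mechanical to compute once intervals and held-out human scores exist; a minimal sketch (the intervals and scores below are made-up illustrative numbers, not the paper's data):

```python
import numpy as np

def empirical_coverage(intervals, true_scores):
    """Fraction of held-out human scores that land inside their predicted intervals."""
    hits = [lo <= y <= hi for (lo, hi), y in zip(intervals, true_scores)]
    return float(np.mean(hits))

# toy held-out data for one task category
intervals = [(3.1, 7.4), (5.0, 9.2), (1.8, 6.0), (4.4, 8.1)]
true_scores = [5, 8, 7, 6]
print(empirical_coverage(intervals, true_scores))  # 0.75, below a 90% nominal level
```

Reporting this number per task category and per judge, next to the nominal level, is exactly the table the referee asks for.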
Referee: §5.1 (Ranking-scoring decoupling): The decoupling is demonstrated by juxtaposing Spearman rank correlation against interval width, but the manuscript provides no statistical test for the significance of the observed dissociation and no controls for task-specific variance in score distributions. This weakens the claim that the phenomenon is a distinct failure mode rather than a direct consequence of wide intervals.
Authors: The referee correctly notes the absence of formal statistical controls. To address this, we will augment §5.1 with (i) a test of the null hypothesis that the observed dissociation between rank correlation and interval width is explained solely by interval width (e.g., via partial correlation or a regression that includes task-level score variance as a covariate) and (ii) stratification or regression-based controls for task-specific variance. These additions will clarify that the decoupling is not an artifact of wide intervals alone and will strengthen its presentation as a distinct failure mode. revision: yes
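The proposed control can be sketched as a partial correlation: residualize both per-task variables on the covariate, then correlate the residuals. The vectors below are placeholders, not the paper's measurements:

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation of x and y after linearly regressing out the control z."""
    def residualize(v):
        Z = np.column_stack([np.ones_like(z), z])
        beta, *_ = np.linalg.lstsq(Z, v, rcond=None)
        return v - Z @ beta
    rx, ry = residualize(x), residualize(y)
    return float(np.corrcoef(rx, ry)[0, 1])

# placeholder per-task vectors: e.g. rank correlation (x), interval width (y),
# task-level score variance (z) as the control covariate
z = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x = z + np.array([0.1, -0.1, 0.2, -0.2, 0.0])      # appears correlated with y ...
y = 2 * z - np.array([0.1, -0.1, 0.2, -0.2, 0.0])  # ... but only through z
print(np.corrcoef(x, y)[0, 1])  # strongly positive raw correlation
print(partial_corr(x, y, z))    # ≈ -1: the apparent link reverses once z is removed
```

If the rank-correlation/interval-width dissociation survives this kind of control, it supports reading the decoupling as a distinct failure mode rather than a variance artifact.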
Circularity Check
No significant circularity; standard conformal framework applied empirically
Full rationale
The paper applies the known distribution-free conformal prediction procedure to convert VLM point scores into intervals using score-token log-probabilities as the nonconformity measure. All reported findings (task-dependent interval widths, ranking-scoring decoupling, and the 4.5x width difference on clean benchmarks) are empirical observations obtained by running the fixed method across 14 task categories and 3 judges. No equation reduces a claimed prediction to a fitted parameter by construction, no uniqueness theorem is imported from prior self-work, and no ansatz is smuggled via citation. The derivation chain remains self-contained against the external conformal prediction literature and the experimental data.
Axiom & Free-Parameter Ledger
free parameters (1)
- conformal significance level (α)
axioms (1)
- [standard math] Conformal prediction provides distribution-free, valid prediction intervals under an exchangeability assumption.
Reference graph
Works this paper leans on
- [1] Marah Abdin, Sahil Agarwal, Aman Agrawal, et al. Phi-4-reasoning technical report. arXiv preprint arXiv:2504.21318, 2025.
- [2] Anastasios N. Angelopoulos, Stephen Bates, Emmanuel J. Candès, Michael I. Jordan, and Lihua Lei. Learn then test: Calibrating predictive algorithms to achieve risk control. arXiv preprint arXiv:2110.01052, 2021.
- [3] Nastaran Darabi, Priyesh Shukla, Dinithi Jayasuriya, Divake Kumar, Alex Christopher Stutts, and Amit Ranjan Trivedi. Navigating the unknown: Uncertainty-aware compute-in-memory autonomy of edge robotics. In 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1-6, 2024.
- [4] Dongping Chen, Ruoxi Chen, Shilin Zhang, Yinuo Liu, Yaochen Wang, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, and Lichao Sun. MLLM-as-a-judge: Assessing multimodal LLM-as-a-judge with vision-language benchmark. In International Conference on Machine Learning (ICML), 2024.
- [5] Divake Kumar, Patrick Poggi, Sina Tayebati, Devashri Naik, Nilesh Ahuja, and Amit Ranjan Trivedi. Calibrated decomposition of aleatoric and epistemic uncertainty in deep features for inference-time adaptation. arXiv preprint arXiv:2511.12389, 2025a.
- [6] Google DeepMind. Gemini 2.5: Our newest Gemini model with thinking. https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/, 2025.
- [7] Sina Tayebati, Divake Kumar, Nastaran Darabi, Dinithi Jayasuriya, Ranganath Krishnan, and Amit Ranjan Trivedi. Learning conformal abstention policies for adaptive risk management in large language and vision-language models. arXiv preprint arXiv:2502.06884, 2025a.
- [8] Etash Guha, Shlok Natarajan, Thomas Möllenhoff, Mohammad Emtiyaz Khan, and Eugene Ndiaye. Conformal prediction via regression-as-classification. In International Conference on Learning Representations (ICLR), 2024.
- [9] Hangfeng He, Hongming Zhang, and Dan Roth. SocREval: Large language models with the socratic method for reference-free reasoning evaluation. In Findings of the Association for Computational Linguistics: NAACL 2024, 2024.
- [10] Alex Christopher Stutts, Divake Kumar, Theja Tulabandhula, and Amit Ranjan Trivedi. Conformal inference meets evidential learning: Distribution-free uncertainty quantification with epistemic and aleatoric separability. In Proceedings of the 61st ACM/IEEE Design Automation Conference (DAC), pp. 1-4, 2024.
- [11] Bhawesh Kumar, Charlie Lu, Gauri Gupta, Anil Palepu, David Bellamy, Ramesh Raskar, and Andrew Beam. Conformal prediction with large language models for multi-choice question answering. arXiv preprint arXiv:2305.18404, 2023.
- [12] Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, Lingpeng Kong, and Qi Liu. VL-RewardBench: A challenging benchmark for vision-language generative reward models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
- [13] Nastaran Darabi, Divake Kumar, Sina Tayebati, and Amit Ranjan Trivedi. INTACT: Inducing noise tolerance through adversarial curriculum training for LiDAR-based safety-critical perception and autonomy. arXiv preprint arXiv:2502.01896, 2025.
- [14] Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. Locally valid and discriminative prediction intervals for deep learning models. In Advances in Neural Information Processing Systems, 2021.
- [15] Divake Kumar, Sina Tayebati, Nastaran Darabi, Vita Pi-Ho Hu, and Amit Ranjan Trivedi. Uncertainty-aware LiDAR-camera autonomy via conformal prediction and principled abstention. In 2025 IEEE International Conference on Omni-layer Intelligent Systems (COINS), pp. 1-6. IEEE, 2025b. doi:10.1109/COINS65080.2025.11125785.
- [16] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2511-2522, 2023.
- [17] Charles Lu, Andréanne Lemay, Ken Chang, Katharina Höbel, and Jayashree Kalpathy-Cramer. Fair conformal predictors for applications in medical imaging. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 12008-12016, 2022.
- [18] Devashri Naik, Divake Kumar, Nastaran Darabi, and Amit Ranjan Trivedi. Belief dynamics for detecting behavioral shifts in safe collaborative manipulation. arXiv preprint arXiv:2604.04967, 2026.
- [19] Christopher Mohri and Tatsunori Hashimoto. Language models with conformal factuality guarantees. In International Conference on Machine Learning (ICML), 2024.
- [20] Patrick Poggi, Divake Kumar, Theja Tulabandhula, and Amit Ranjan Trivedi. Uncertainty-guided inference-time depth adaptation for transformer-based visual tracking. arXiv preprint arXiv:2602.16160, 2026.
- [21] Victor Quach, Adam Fisch, Tal Schuster, Adam Yala, Jae Ho Sohn, Tommi S. Jaakkola, and Regina Barzilay. Conformal language modeling. In International Conference on Learning Representations (ICLR), 2024.
- [22] Sina Tayebati, Divake Kumar, Nastaran Darabi, Dinithi Jayasuriya, Theja Tulabandhula, Ranganath Krishnan, and Amit Ranjan Trivedi. CAP: Conformalized abstention policies for context-adaptive risk management for LLMs and VLMs. In Proceedings of the 17th Asian Conference on Machine Learning (ACML), Conference Track, 2025b.
- [23] Yaniv Romano, Evan Patterson, and Emmanuel Candès. Conformalized quantile regression. In Advances in Neural Information Processing Systems, 2019.
- [24] Divake Kumar, Sina Tayebati, Francesco Migliarba, Ranganath Krishnan, and Amit Ranjan Trivedi. Learnable conformal prediction with context-aware nonconformity functions for robotic planning and perception. arXiv preprint arXiv:2509.21955, 2025c.
- [25] Matteo Sesia and Emmanuel J. Candès. A comparison of some conformal quantile regression methods. Stat, 9(1): e261, 2020.
- [26] Matteo Sesia and Yaniv Romano. Conformal prediction using conditional histograms. In Advances in Neural Information Processing Systems, 2021.
- [27] Amit Ranjan Trivedi, Sina Tayebati, Hemant Kumawat, Nastaran Darabi, Divake Kumar, Adarsh Kumar Kosta, Yeshwanth Venkatesha, Dinithi Jayasuriya, Nethmi Jayasinghe, Priyadarshini Panda, Saibal Mukhopadhyay, and Kaushik Roy. Intelligent sensing-to-action for robust autonomy at the edge: Opportunities and challenges. In 2025 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2025.
- [28] Huanxin Sheng, Xinyi Liu, Hangfeng He, Jieyu Zhao, and Jian Kang. Analyzing uncertainty of LLM-as-a-judge: Interval evaluations with conformal prediction. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 11286-11328, 2025.
- [29] Jiayuan Su, Jing Luo, Hongwei Wang, and Lu Cheng. API is enough: Conformal prediction for large language models without logit-access. In Findings of the Association for Computational Linguistics: EMNLP 2024, 2024.
- [30] Sina Tayebati, Divake Kumar, Nastaran Darabi, Davide Ettori, Ranganath Krishnan, and Amit Ranjan Trivedi. TRACER: Trajectory risk aggregation for critical episodes in agentic reasoning. arXiv preprint arXiv:2602.11409, 2026.
- [31] Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D. Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
- [32] Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, 2005.
- [33] Divake Kumar, Sina Tayebati, Devashri Naik, Patrick Poggi, Amanda Sofie Rios, Nilesh Ahuja, and Amit Ranjan Trivedi. TRIAGE: Type-routed interventions via aleatoric-epistemic gated estimation in robotic manipulation and adaptive perception---don't treat all uncertainty the same. arXiv preprint arXiv:2603.08128, 2026.
- [34] Yuiga Wada, Kanta Kaneda, Daichi Saito, and Komei Sugiura. Polos: Multimodal metric learning from human feedback for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
- [35] Nico Wagner, Michael Desmond, Rahul Nair, Zahra Ashktorab, Elizabeth M. Daly, Qian Pan, Martin Santillan Cooper, James M. Johnson, and Werner Geyer. Black-box uncertainty quantification method for LLM-as-a-judge. arXiv preprint arXiv:2410.11594, 2024.
- [36] Ran Xie, Rina Foygel Barber, and Emmanuel J. Candès. Boosted conformal prediction intervals. In Advances in Neural Information Processing Systems, 2024.
- [37] Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. In International Conference on Learning Representations (ICLR), 2024.
- [38] Tianyi Xiong, Xiyao Wang, Dong Guo, Qinghao Ye, Haoqi Fan, Quanquan Gu, Heng Huang, and Chunyuan Li. LLaVA-Critic: Learning to evaluate multimodal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
- [39] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, et al. Judging LLM-as-a-judge with MT-Bench and chatbot arena. In Advances in Neural Information Processing Systems, 2023.