Recognition: no theorem link
Structured Multi-Criteria Evaluation of Large Language Models with Fuzzy Analytic Hierarchy Process and DualJudge
Pith reviewed 2026-05-13 17:08 UTC · model grok-4.3
The pith
Fuzzy AHP with confidence-modulated uncertainty and a DualJudge hybrid produces more consistent LLM evaluations than direct scoring.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By representing pairwise comparisons as triangular fuzzy numbers whose spread is scaled by LLM-generated confidence, the Fuzzy AHP decomposes evaluation into explicit criteria, checks consistency, and aggregates via uncertainty-aware operators; when these outputs are fused with direct holistic scores inside DualJudge through a consistency-weighted mechanism, the resulting judgments outperform both pure direct scoring and standalone AHP on JudgeBench.
What carries the argument
Confidence-aware Fuzzy Analytic Hierarchy Process (FAHP) that modulates triangular fuzzy numbers with LLM-generated confidence scores to aggregate multi-criteria judgments while tracking epistemic uncertainty.
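The mechanics can be sketched concretely. The linear spread-scaling rule, the clamp to Saaty's 1/9–9 scale, and the example comparison values below are illustrative assumptions, not the paper's exact formulas; the aggregation follows Buckley's fuzzy geometric mean:

```python
import math

def fuzzify(crisp, confidence, max_spread=2.0):
    """Turn a crisp pairwise-comparison value into a triangular fuzzy
    number (l, m, u) whose spread shrinks as confidence rises.
    The linear spread rule is an assumption, not the paper's formula."""
    spread = max_spread * (1.0 - confidence)
    l = max(crisp - spread, 1.0 / 9.0)   # clamp to Saaty's 1/9..9 scale
    u = min(crisp + spread, 9.0)
    return (l, crisp, u)

def fuzzy_geometric_mean(row):
    """Buckley-style fuzzy geometric mean of one comparison-matrix row."""
    n = len(row)
    return tuple(math.prod(t[i] for t in row) ** (1.0 / n) for i in range(3))

# One judge row comparing criterion A against (A, B, C), with
# hypothetical LLM-reported confidences for each comparison.
row = [fuzzify(1.0, 1.0), fuzzify(3.0, 0.9), fuzzify(5.0, 0.6)]
l, m, u = fuzzy_geometric_mean(row)
print(f"fuzzy weight for A: ({l:.3f}, {m:.3f}, {u:.3f})")
```

Low confidence widens the triangle, so the downstream uncertainty-aware aggregation sees a wider, less committal weight for that comparison.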
If this is right
- Evaluations remain more stable when comparison pairs contain high uncertainty.
- Explicit criteria make the sources of disagreement between judges traceable.
- Hybrid fusion improves performance on both easy and hard items in JudgeBench splits.
- The same decomposition can be applied to new criteria without retraining the underlying LLM.
Where Pith is reading between the lines
- The method could be ported to non-LLM decision tasks where raters have varying confidence.
- If criteria weights were learned from data rather than set by experts, calibration might improve further.
- Extending the fuzzy aggregation to larger hierarchies might reveal which criteria drive most of the uncertainty.
Load-bearing premise
That LLM confidence scores reliably reflect epistemic uncertainty, and that breaking evaluations into AHP criteria adds accuracy without injecting new systematic biases.
What would settle it
A human-labeled test set in which direct scoring matches human preferences more closely than either AHP or DualJudge, or in which FAHP judgments become less stable precisely when confidence scores are low.
Original abstract
Effective evaluation of large language models (LLMs) remains a critical bottleneck, as conventional direct scoring often yields inconsistent and opaque judgments. In this work, we adapt the Analytic Hierarchy Process (AHP) to LLM-based evaluation and, more importantly, propose a confidence-aware Fuzzy AHP (FAHP) extension that models epistemic uncertainty via triangular fuzzy numbers modulated by LLM-generated confidence scores. Systematically validated on JudgeBench, our structured approach decomposes assessments into explicit criteria and incorporates uncertainty-aware aggregation, producing more calibrated judgments. Extensive experiments demonstrate that both crisp and fuzzy AHP consistently outperform direct scoring across model scales and dataset splits, with FAHP showing superior stability in uncertain comparison scenarios. Building on these insights, we propose DualJudge, a hybrid framework inspired by Dual-Process Theory that adaptively fuses holistic direct scores with structured AHP outputs via consistency-aware weighting. DualJudge achieves state-of-the-art performance, underscoring the complementary strengths of intuitive and deliberative evaluation paradigms. These results establish uncertainty-aware structured reasoning as a principled pathway toward more reliable LLM assessment. Code is available at https://github.com/hreyulog/AHP_llm_judge.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript adapts the Analytic Hierarchy Process (AHP) and its fuzzy extension (FAHP) to LLM evaluation by decomposing judgments into explicit criteria and modeling uncertainty via triangular fuzzy numbers modulated by LLM-generated confidence scores. It introduces DualJudge, a hybrid framework that fuses direct holistic scores with structured AHP outputs through consistency-aware weighting, and reports that both crisp and fuzzy variants outperform direct scoring on JudgeBench, with DualJudge achieving state-of-the-art results.
Significance. If the results hold, the work supplies a structured, uncertainty-aware alternative to direct scoring that could improve consistency and interpretability in LLM assessment. The open-source code at the provided GitHub link is a clear strength that aids reproducibility.
major comments (2)
- Methods section on FAHP: the central claim that modulating triangular fuzzy numbers with LLM confidence scores improves calibration and stability rests on the untested assumption that these scores track epistemic uncertainty rather than overconfidence or bias; no Brier score, expected calibration error, or correlation with judgment accuracy on JudgeBench is reported to support this modulation step.
- Experiments section: the abstract states consistent outperformance across model scales and dataset splits with FAHP showing superior stability, yet provides no details on statistical significance tests, exact train/test splits, or controls for post-hoc hyperparameter choices, undermining verification of the SOTA claim for DualJudge.
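The calibration evidence the referee asks for is cheap to compute. A minimal sketch of the two standard estimators, taking hypothetical (confidence, correct/incorrect) pairs as input:

```python
def brier_score(confidences, outcomes):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(outcomes)

def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Bin predictions by confidence and average |accuracy - confidence|,
    weighted by bin size (the standard ECE estimator)."""
    bins = [[] for _ in range(n_bins)]
    for c, o in zip(confidences, outcomes):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, o))
    n = len(outcomes)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(o for _, o in b) / len(b)
        ece += (len(b) / n) * abs(acc - avg_conf)
    return ece

# Hypothetical judge outputs: confidence in each verdict, and whether it was right.
conf = [0.9, 0.8, 0.95, 0.6, 0.7]
hit  = [1,   1,   0,    1,   0]
print(brier_score(conf, hit), expected_calibration_error(conf, hit))
```

Running either metric over the JudgeBench confidence scores would directly test the premise that confidence tracks epistemic uncertainty rather than overconfidence.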
minor comments (1)
- Abstract: the description of 'consistency-aware fusion weights' should explicitly state whether these are derived solely from the AHP consistency ratio or involve additional fitting.
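The minor comment can be made concrete. The paper's exact fusion rule is not given in this summary; one plausible reading, assuming the weight interpolates linearly in the AHP consistency ratio up to Saaty's 0.1 acceptability threshold, is:

```python
def dual_judge_fuse(direct_score, ahp_score, consistency_ratio, cr_threshold=0.1):
    """Fuse a holistic direct score with a structured AHP score.
    Sketch of a consistency-aware weighting: trust the AHP branch more
    when its pairwise matrix is consistent (low CR), and fall back to
    the direct score as CR reaches the acceptability threshold."""
    w_ahp = max(0.0, 1.0 - consistency_ratio / cr_threshold)
    return w_ahp * ahp_score + (1.0 - w_ahp) * direct_score

print(dual_judge_fuse(0.6, 0.9, 0.02))  # consistent matrix: leans on AHP
print(dual_judge_fuse(0.6, 0.9, 0.12))  # inconsistent matrix: pure direct score
```

If the published rule instead fits the weights to data, the free-parameter ledger below grows accordingly, which is exactly what the referee wants stated.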
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We address the major comments point by point below, and we plan to incorporate revisions to strengthen the paper.
Point-by-point responses
Referee: Methods section on FAHP: the central claim that modulating triangular fuzzy numbers with LLM confidence scores improves calibration and stability rests on the untested assumption that these scores track epistemic uncertainty rather than overconfidence or bias; no Brier score, expected calibration error, or correlation with judgment accuracy on JudgeBench is reported to support this modulation step.
Authors: We agree that direct validation of the confidence scores as measures of epistemic uncertainty would strengthen the methodological justification. Although the original manuscript did not include Brier scores or expected calibration error, the superior performance of FAHP on JudgeBench provides indirect support for the approach. In the revision, we will add a correlation analysis between the LLM confidence scores and judgment accuracy on the benchmark to better substantiate the modulation step.
revision: yes
Referee: Experiments section: the abstract states consistent outperformance across model scales and dataset splits with FAHP showing superior stability, yet provides no details on statistical significance tests, exact train/test splits, or controls for post-hoc hyperparameter choices, undermining verification of the SOTA claim for DualJudge.
Authors: We acknowledge the need for more rigorous experimental details to support the claims. The experiments utilized fixed dataset splits from JudgeBench without post-hoc hyperparameter optimization. We will include statistical significance testing (such as Wilcoxon signed-rank tests) and precise descriptions of the train/test splits and hyperparameter settings in the revised manuscript to allow for better verification of the results and the SOTA performance of DualJudge.
revision: yes
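The promised significance test is straightforward to run over paired per-split accuracies. The rebuttal names the Wilcoxon signed-rank test (scipy.stats.wilcoxon would serve); as a dependency-free stand-in, a paired exact sign test over hypothetical per-split accuracies looks like:

```python
from math import comb

def sign_test_pvalue(scores_a, scores_b):
    """Two-sided exact sign test on paired scores: under H0 each
    method wins a tie-broken pair with probability 1/2.
    (A dependency-free stand-in for scipy.stats.wilcoxon.)"""
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    losses = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins + losses                      # ties are dropped
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical per-split accuracies for DualJudge vs. direct scoring.
dualjudge = [0.71, 0.68, 0.74, 0.70, 0.69, 0.73, 0.72]
direct    = [0.65, 0.66, 0.69, 0.64, 0.70, 0.67, 0.68]
print(f"sign-test p = {sign_test_pvalue(dualjudge, direct):.4f}")
```

The sign test discards effect sizes, which is why the signed-rank variant the authors name is the stronger choice when enough splits are available.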
Circularity Check
Minor risk from the AHP weight derivation, but by construction no equations reduce to fitted parameters.
full rationale
The paper's central claims rest on empirical validation against the external JudgeBench dataset rather than any self-referential derivation. Standard AHP pairwise comparison matrices and triangular fuzzy number modulation are applied without the modulation step being fitted to the target judgments or renamed as a prediction. No self-citation chain is load-bearing for the uniqueness of the DualJudge fusion rule, and the consistency-aware weighting follows directly from the stated Dual-Process inspiration without reducing to prior author results by definition. The only minor elevation above zero arises from the conventional derivation of AHP criterion weights, which is not shown to be circular in the provided text.
Axiom & Free-Parameter Ledger
free parameters (2)
- AHP criteria weights
- consistency-aware fusion weights
axioms (1)
- Domain assumption: standard AHP consistency-ratio assumptions hold when applied to LLM judgments.
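This axiom can be checked per matrix. A minimal sketch of Saaty's consistency ratio, using row geometric means for the weights and a λ_max estimate (the example matrix is hypothetical judge output, not from the paper):

```python
import math

def consistency_ratio(matrix):
    """Saaty consistency ratio for a pairwise comparison matrix.
    Weights come from the row geometric mean; lambda_max is estimated
    as the mean of (A w)_i / w_i. RI values are Saaty's standard table."""
    n = len(matrix)
    gm = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(gm)
    w = [g / total for g in gm]
    aw = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
    lam = sum(aw[i] / w[i] for i in range(n)) / n
    ci = (lam - n) / (n - 1)
    ri = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24}[n]
    return ci / ri if ri else 0.0

# A mildly inconsistent 3-criteria matrix (hypothetical).
A = [[1.0, 3.0, 5.0],
     [1 / 3, 1.0, 2.0],
     [1 / 5, 1 / 2, 1.0]]
print(f"CR = {consistency_ratio(A):.4f}")   # CR < 0.1 => acceptably consistent
```

Whether the CR < 0.1 acceptability cutoff, calibrated on human decision-makers, transfers to LLM-generated matrices is exactly the open question the axiom encodes.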
Reference graph
Works this paper leans on
- [1] Buckley, J.: Fuzzy hierarchical analysis. Fuzzy Sets and Systems 17(3), 233–247 (1985). https://doi.org/10.1016/0165-0114(85)90090-9
- [2] Chiang, C.H., Lee, H.y.: A closer look into using large language models for automatic evaluation. In: Bouamor, H., Pino, J., Bali, K. (eds.) Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 8928–8942. Association for Computational Linguistics, Singapore (Dec 2023). https://doi.org/10.18653/v1/2023.findings-emnlp.599
- [3] Crawford, G., Williams, C.: A note on the analysis of subjective judgment matrices. Journal of Mathematical Psychology 29(4), 387–405 (1985). https://doi.org/10.1016/0022-2496(85)90002-1
- [4] Dubois, D.: Fuzzy Sets and Systems: Theory and Applications. Mathematics in Science and Engineering, Academic Press (1980)
- [5] Emirtekin, E.: Large language model-powered automated assessment: A systematic review. Applied Sciences 15(10) (2025). https://doi.org/10.3390/app15105683
- [6] Alonso, J.A., Lamata, M.T.: Consistency in the analytic hierarchy process: a new approach. Int. J. Uncertain. Fuzziness Knowl. Based Syst. 14, 445–459 (2006). https://api.semanticscholar.org/CorpusID:18104088
- [7]
- [8] Kampourakis, V., Kavallieratos, G., Spathoulas, G., Gkioulos, V., Katsikas, S.: LLM-assisted AHP for explainable cyber range evaluation (2025). https://arxiv.org/abs/2512.10487
- [9]
- [10] Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., Zhu, C.: G-Eval: NLG evaluation using GPT-4 with better human alignment (2023). https://arxiv.org/abs/2303.16634
- [11] Lu, X., Li, J., Takeuchi, K., Kashima, H.: AHP-powered LLM reasoning for multi-criteria evaluation of open-ended responses. In: Al-Onaizan, Y., Bansal, M., Chen, Y.N. (eds.) Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 1847–1856. Association for Computational Linguistics, Miami, Florida, USA (Nov 2024)
- [12] OpenAI: gpt-oss-120b & gpt-oss-20b model card (2025). https://arxiv.org/abs/2508.10925
- [13] Qwen Team: Qwen3.5: Towards native multimodal agents (February 2026). https://qwen.ai/blog?id=qwen3.5
- [14] Saaty, R.: The analytic hierarchy process—what it is and how it is used. Mathematical Modelling 9(3), 161–176 (1987). https://doi.org/10.1016/0270-0255(87)90473-8
- [15] Saaty, T.: The Analytic Hierarchy Process: Planning, Priority Setting, Resource Allocation. Advanced book program, McGraw-Hill International Book Company (1980)
- [16] Saaty, T., Vargas, L.: Models, Methods, Concepts & Applications of the Analytic Hierarchy Process. International Series in Operations Research & Management Science, Springer (2012)
- [17] Tan, S., Zhuang, S., Montgomery, K., Tang, W.Y., Cuadron, A., Wang, C., Popa, R., Stoica, I.: JudgeBench: A benchmark for evaluating LLM-based judges. In: The Thirteenth International Conference on Learning Representations (2025). https://openreview.net/forum?id=G0dksFayVq
- [18]
- [19] Wang, Y., Ma, X., Zhang, G., Ni, Y., Chandra, A., Guo, S., Ren, W., Arulraj, A., He, X., Jiang, Z., Li, T., Ku, M., Wang, K., Zhuang, A., Fan, R., Yue, X., Chen, W.: MMLU-Pro: A more robust and challenging multi-task language understanding benchmark (2024). https://arxiv.org/abs/2406.01574
- [20] White, C., Dooley, S., Roberts, M., Pal, A., Feuer, B., Jain, S., Shwartz-Ziv, R., Jain, N., Saifullah, K., Dey, S., Shubh-Agrawal, Sandha, S.S., Naidu, S., Hegde, C., LeCun, Y., Goldstein, T., Neiswanger, W., Goldblum, M.: LiveBench: A challenging, contamination-limited LLM benchmark (2025). https://arxiv.org/abs/2406.19314
- [21]
- [22] Xie, J., Li, Y., Yin, X., Wan, X.: DSGram: dynamic weighting sub-metrics for grammatical error correction in the era of large language models. In: Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence and Fifteenth Symposium on Educational Advances in Artificial Intelligence (2025)
- [23] Zadeh, L.: Fuzzy sets. Information and Control 8(3), 338–353 (1965). https://doi.org/10.1016/S0019-9958(65)90241-X
- [24] Xu, Z., Wei, C.: A consistency improving method in the analytic hierarchy process. European Journal of Operational Research 116(2), 443–449 (1999). https://doi.org/10.1016/S0377-2217(98)00109-X
discussion (0)