Trust- judge: Inconsistencies of LLM-as-a-judge and how to alleviate them.arXiv preprint arXiv:2509.21117,

[WSZ+25] Yidong Wang, Yunze Song, Tingyuan Zhu, Xuanwang Zhang, Zhuohao Yu, Hao Chen, Chiyu Song, Qiufeng Wang, Cunxiang Wang, Zhen Wu, et al · 2025 · arXiv 2509.21117

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

read on arXiv browse 6 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs

cs.CL · 2026-04-24 · unverdicted · novelty 6.0

AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.

Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge

cs.AI · 2026-04-07 · unverdicted · novelty 6.0

Both humans and LLMs trust content more when labeled human-authored than AI-generated, with LLMs showing denser attention to labels and higher uncertainty under AI labels, mirroring human heuristic patterns.

Can LLMs Rank? A Tale of Triads and Triage

cs.CY · 2026-06-29 · unverdicted · novelty 5.0

LLM ranking reliability for prioritization tasks can be assessed via coefficient of consistency ζ (intra-run circular triads) and Kendall's τ (inter-run distance), with three leading models showing distinct consistency profiles on homelessness allocation and ED triage.

Prompt Perturbation for Reliable LLM Evaluation over Comparison Graphs

cs.CL · 2026-06-16 · unverdicted · novelty 5.0

A prompt perturbation approach builds comparison graphs from LLM judgments, filters inconsistent cycles or ties, and aggregates more reliable rankings.

MedFabric and EtHER: A Data-Centric Framework for Word-Level Fabrication Generation and Detection in Medical LLMs

cs.CL · 2026-05-05 · unverdicted · novelty 5.0

MedFabric dataset and EtHER detector achieve over 15% better word-level fabrication detection in medical LLMs than prior methods by generating stylistically faithful errors and using decomposition-based checking.

A Survey on LLM-as-a-Judge

cs.CL · 2024-11-23 · unverdicted · novelty 4.0

A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.

citing papers explorer

Showing 1 of 1 citing paper after filters.

A Survey on LLM-as-a-Judge cs.CL · 2024-11-23 · unverdicted · none · ref 166
A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.

Trust- judge: Inconsistencies of LLM-as-a-judge and how to alleviate them.arXiv preprint arXiv:2509.21117,

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer