Automated LLM-based evaluation of code review bot comments achieves only moderate agreement (0.44-0.62) with developer labels in an industrial dataset because developer decisions reflect contextual constraints beyond comment quality.
Yu, Qiang Yang, and Xing Xie
2 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts: UNVERDICTED 2
roles: background 1
polarities: background 1
representative citing papers
TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.
citing papers explorer
- Understanding the Limits of Automated Evaluation for Code Review Bots in Practice
  Automated LLM-based evaluation of code review bot comments achieves only moderate agreement (0.44-0.62) with developer labels in an industrial dataset because developer decisions reflect contextual constraints beyond comment quality.
- TrustLLM: Trustworthiness in Large Language Models
  TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.