HalluJudge: A reference-free hallucination detection for context misalignment in code review automation

Kla Tantithamthavorn, Hong Yi Lin, Patanamon Thongtanunam, Wachiraphan Charoenwet, Minwoo Jeong, Ming Wu · 2026 · cs.SE · arXiv 2601.19072

3 Pith papers cite this work. Polarity classification is still indexing.

abstract

Large language models (LLMs) have shown strong capabilities in code review automation, such as review comment generation, yet they suffer from hallucinations, where the generated review comments are ungrounded in the actual code. This poses a significant challenge to the adoption of LLMs in code review workflows. To address this, we explore effective and scalable methods for reference-free hallucination detection in LLM-generated code review comments. In this work, we design HalluJudge, which aims to assess the grounding of generated review comments based on context alignment. HalluJudge includes four key strategies ranging from direct assessment to structured multi-branch reasoning (e.g., Tree-of-Thoughts). We conduct a comprehensive evaluation of these assessment strategies across Atlassian's enterprise-scale software projects to examine the effectiveness and cost-efficiency of HalluJudge. Furthermore, we analyze the alignment between HalluJudge's judgments and developer preference for the actual LLM-generated code review comments in real-world production. Our results show that the hallucination assessment in HalluJudge is cost-effective, with an F1 score of 0.85 and an average cost of $0.009. On average, 67% of the HalluJudge assessments are aligned with the developer preference for the actual LLM-generated review comments in online production. Our results suggest that HalluJudge can serve as a practical safeguard to reduce developers' exposure to hallucinated comments, fostering trust in AI-assisted code reviews.

representative citing papers

Functional Entropy: Predicting Functional Correctness in LLM-Generated Code with Uncertainty Quantification

cs.CL · 2026-05-27 · unverdicted · novelty 6.0

Introduces functional equivalence methods and functional entropy to predict functional correctness of LLM-generated code via uncertainty quantification, outperforming NLI-based baselines in most tested settings.

Empirical Analysis and Detection of Hallucinations in LLM-Generated Bug Report Summaries

cs.SE · 2026-05-22 · unverdicted · novelty 5.0

Develops a section-aware hallucination detection method for LLM bug report summaries using synthetic injection on the BugsRepo dataset from Mozilla projects, reporting up to 0.89 Macro-F1 at report level.

Understanding the Limits of Automated Evaluation for Code Review Bots in Practice

cs.SE · 2026-04-27 · unverdicted · novelty 5.0

Automated LLM-based evaluation of code review bot comments achieves only moderate agreement (0.44-0.62) with developer labels in an industrial dataset because developer decisions reflect contextual constraints beyond comment quality.

