pith. machine review for the scientific record.

arxiv: 2604.15945 · v1 · submitted 2026-04-17 · 💻 cs.CL · cs.LG

Recognition: unknown

RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:36 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords: detection · hallucination · hallucinations · internal · language · ragognizer · closed-domain · current

The pith

RAGognizer adds a detection head to LLMs for joint training on generation and token-level hallucination detection, yielding SOTA detection and fewer hallucinations in RAG while preserving output quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often receive extra facts from a search system before answering, a setup called retrieval-augmented generation. Even with those facts, the models sometimes invent details that the retrieved text does not support. Existing ways to catch these inventions usually look at the finished answer after it is written or examine the model without changing it. The authors instead treat the model's internal signals as a training signal. They create a new dataset of real examples where the model hallucinated despite having the right context, labeled at the level of individual words. They then attach a small extra network head that learns to read the model's hidden states and predict which tokens will be hallucinations. Training happens on two tasks at once: producing fluent answers and correctly labeling the hallucinated tokens. The joint training is meant to push the model's internal representations to become more clearly separated between safe and unsafe tokens. Experiments across several test sets show the method detects hallucinations better than prior approaches and also reduces how often hallucinations appear when the model generates new text, all without lowering the quality or relevance of the answers.
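To make the mechanism concrete, here is a minimal sketch of what such a token-level detection head and joint objective could look like in PyTorch. The head width, the use of binary cross-entropy, the padding handling, and the weighting factor `lambda_det` are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenHallucinationHead(nn.Module):
    """Small MLP that maps per-token hidden states to a hallucination logit."""
    def __init__(self, hidden_size: int, head_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, head_dim),
            nn.GELU(),
            nn.Linear(head_dim, 1),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) -> (batch, seq_len)
        return self.mlp(hidden_states).squeeze(-1)

def joint_loss(lm_logits, labels, det_logits, hall_labels, lambda_det=0.5):
    """Combine next-token LM loss with token-level hallucination detection loss.

    lm_logits: (batch, seq_len, vocab); labels: (batch, seq_len), -100 on padding.
    det_logits: (batch, seq_len); hall_labels: (batch, seq_len) in {0, 1}.
    """
    lm_loss = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
    # In practice, padding tokens would also be masked out of the detection loss.
    det_loss = F.binary_cross_entropy_with_logits(det_logits, hall_labels.float())
    return lm_loss + lambda_det * det_loss
```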

Core claim

Integrating a lightweight detection head into an LLM for joint optimization of language modeling and hallucination detection forces the model to improve the separability of its internal states regarding hallucinations while simultaneously learning to generate well-formed and meaningful responses, achieving state-of-the-art token-level detection and substantially reduced hallucination rates without degrading quality or relevance.
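Stated as an optimization problem, the claim amounts to minimizing a weighted combination of the two objectives. One plausible form, hedged because the paper's exact loss and weighting are not reproduced on this page:

```latex
\mathcal{L}_{\text{joint}}
  = \mathcal{L}_{\text{LM}} + \lambda\,\mathcal{L}_{\text{det}},
\qquad
\mathcal{L}_{\text{det}}
  = -\frac{1}{T}\sum_{t=1}^{T}\Big[\, y_t \log \hat{p}_t + (1 - y_t)\log(1 - \hat{p}_t) \,\Big]
```

Here y_t marks whether token t is annotated as hallucinated, \hat{p}_t is the detection head's probability computed from that token's hidden state, and \lambda trades detection against language modeling; the paper's claim is that optimizing this sum does not degrade generation quality.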

Load-bearing premise

That the internal hidden states of the base LLM can be made meaningfully more separable for hallucination versus non-hallucination tokens through the addition of the detection head and the joint loss, and that the newly introduced RAGognize dataset accurately represents naturally occurring closed-domain hallucinations.

Figures

Figures reproduced from arXiv: 2604.15945 by Fabian Ridder, Laurin Lessel, Malte Schilling.

Figure 1. Distinction of Contextual and Parametric Knowledge: the Venn diagram illustrates possible knowledge scenarios.
Figure 2. Automatic Data Generation and Annotation Pipeline for the RAGognize dataset: Wikipedia facts post-dating …
Figure 3. RAGognizer Architecture: an MLP detection head is integrated at an intermediate layer (e.g., Block 18 for …).
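Figure 3 places the MLP detection head at an intermediate block (Block 18 in the example). A minimal sketch of reading per-token hidden states from such a layer with Hugging Face Transformers follows; the checkpoint, the clamped layer index, and the head dimensions are illustrative assumptions, and the head shown here is untrained.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # illustrative checkpoint; any decoder-only LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

prompt = "Context: <retrieved passage> Question: <user question> Answer:"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.hidden_states is a tuple of (num_layers + 1) tensors; index 0 is the embedding output.
layer = min(18, model.config.num_hidden_layers)  # Block 18 per Figure 3, clamped for small models
h = out.hidden_states[layer]                     # (batch, seq_len, hidden_size)

# Untrained placeholder MLP head; in RAGognizer this head would be trained jointly with the LM.
head = torch.nn.Sequential(
    torch.nn.Linear(h.size(-1), 256),
    torch.nn.GELU(),
    torch.nn.Linear(256, 1),
)
token_scores = head(h).squeeze(-1)               # per-token hallucination logits
```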
Original abstract

Retrieval-Augmented Generation (RAG) is widely used to augment the input to Large Language Models (LLMs) with external information, such as recent or domain-specific knowledge. Nonetheless, current models still produce closed-domain hallucinations and generate content that is unsupported by the retrieved context. Current detection approaches typically treat hallucination as a post-hoc problem, relying on black-box consistency checks or probes over frozen internal representations. In this work, we demonstrate that hallucination detection based on internal state representation can also serve as a direct training signal. We introduce RAGognize, a dataset of naturally occurring closed-domain hallucinations with token-level annotations, and RAGognizer, a hallucination-aware fine-tuning approach that integrates a lightweight detection head into an LLM, allowing for the joint optimization of language modeling and hallucination detection. This joint objective forces the model to improve the separability of its internal states regarding hallucinations while simultaneously learning to generate well-formed and meaningful responses. Across multiple benchmarks, RAGognizer achieves state-of-the-art token-level hallucination detection while substantially reducing hallucination rates during generation, without degrading language quality or relevance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces RAGognize, a new dataset of naturally occurring closed-domain hallucinations with token-level annotations, and RAGognizer, an approach that augments an LLM with a lightweight detection head for joint optimization of the language modeling objective and hallucination detection. It claims this joint training improves the separability of internal hidden states with respect to hallucination vs. non-hallucination tokens, yielding state-of-the-art token-level detection performance and substantially lower hallucination rates during generation across multiple benchmarks, without degrading output quality or relevance.

Significance. If the central mechanism holds, the work would be significant for shifting hallucination detection from post-hoc probing of frozen models to an integrated training signal that directly shapes representations. This could improve reliability in RAG systems. However, the manuscript provides no representation-level diagnostics or ablations to isolate the effect of the joint loss and detection head from the new dataset or added capacity, leaving the core claim unsupported.

major comments (2)
  1. [Abstract / Experiments] The claim that joint LM+detection optimization 'forces the model to improve the separability of its internal states regarding hallucinations' is load-bearing for the contribution, yet the manuscript supplies no direct evidence such as linear probe accuracy, cosine distances, or visualization of last-layer activations comparing the jointly trained model against (a) the base LLM fine-tuned only on the LM loss or (b) a frozen base with a separately trained detection head (a minimal probe of this kind is sketched after this list). Downstream SOTA detection and reduced hallucination rates could arise from the RAGognize data distribution alone.
  2. [Experiments] Quantitative results, benchmark names, exact metrics (e.g., token-level F1 or AUROC for detection, hallucination-rate reductions), ablation studies on loss weighting and head architecture, and implementation details are not reported in sufficient detail to allow reproduction, or to verify that the joint objective, rather than the added parameters or data, is responsible for the gains.
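A minimal version of the representation-level diagnostic requested in major comment 1: fit a linear probe on per-token hidden states and compare AUROC between the base model and the jointly trained model. The variable names, data loading, and train/test split are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def probe_auroc(hidden_states: np.ndarray, labels: np.ndarray, train_frac: float = 0.8) -> float:
    """hidden_states: (num_tokens, hidden_size); labels: (num_tokens,) with 1 = hallucinated token."""
    n_train = int(train_frac * len(labels))
    probe = LogisticRegression(max_iter=1000)
    probe.fit(hidden_states[:n_train], labels[:n_train])
    scores = probe.predict_proba(hidden_states[n_train:])[:, 1]
    return roc_auc_score(labels[n_train:], scores)

# auroc_base  = probe_auroc(states_from_base_model, token_labels)       # hypothetical arrays
# auroc_joint = probe_auroc(states_from_jointly_trained_model, token_labels)
# A clearly higher auroc_joint would directly support the separability claim.
```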

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that internal LLM representations can be made separable for hallucination detection via joint training and that the new dataset faithfully captures real hallucinations. No free parameters or invented physical entities are introduced beyond standard neural-network weights.

axioms (2)
  • domain assumption Internal hidden states of LLMs contain information that can be linearly or lightly transformed into a reliable hallucination detector.
    Invoked when the authors state that the detection head improves separability of states regarding hallucinations.
  • domain assumption Joint optimization of language modeling loss and detection loss will not trade off against generation quality or relevance.
    Stated as an empirical outcome in the abstract but required for the overall claim to hold.

pith-pipeline@v0.9.0 · 5502 in / 1560 out tokens · 36165 ms · 2026-05-10T08:36:18.373975+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

44 extracted references · 39 canonical work pages · 13 internal anchors

  1. [1]

    Language Models are Few-Shot Learners

    Tom B. Brown, Benjamin Mann, Nick Ryder, and et al. Language models are few-shot learners, 2020. URL https://arxiv.org/abs/2005.14165

  2. [2]

    A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1-55, January 2025. ISSN 1558-2868. doi:10.1145/3...

  3. [3]

    Language Models as Knowledge Bases?

    Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. Language models as knowledge bases?, 2019. URL https://arxiv.org/abs/1909.01066

  4. [4]

    Knowledge Conflicts for LLMs: A Survey

    Rongwu Xu, Zehan Qi, Zhijiang Guo, Cunxiang Wang, Hongru Wang, Yue Zhang, and Wei Xu. Knowledge conflicts for LLMs : A survey, 2024. URL https://arxiv.org/abs/2403.08319

  5. [5]

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks, 2021. URL https://arxiv.org/abs/2005.11401

  6. [6]

    Do Language Models Know When They're Hallucinating References?

    Ayush Agrawal, Mirac Suzgun, Lester Mackey, and Adam Tauman Kalai. Do language models know when they're hallucinating references?, 2024. URL https://arxiv.org/abs/2305.18248

  7. [7]

    RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models

    Cheng Niu, Yuanhao Wu, Juno Zhu, Siliang Xu, Kashun Shum, Randy Zhong, Juntong Song, and Tong Zhang. Ragtruth: A hallucination corpus for developing trustworthy retrieval-augmented language models, 2024. URL https://arxiv.org/abs/2401.00396

  8. [8]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, and et al. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388

  9. [9]

    Perplexity: A Measure of the Difficulty of Speech Recognition Tasks

    Frederick Jelinek, Robert L. Mercer, Lalit R. Bahl, and Janet M. Baker. Perplexity---a measure of the difficulty of speech recognition tasks. Journal of the Acoustical Society of America, 62, 1977. URL https://api.semanticscholar.org/CorpusID:121680873

  10. [10]

    Detecting Hallucinations in Large Language Models Using Semantic Entropy

    Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630:625-630, 2024. doi:10.1038/s41586-024-07421-0. URL https://www.nature.com/articles/s41586-024-07421-0

  11. [11]

    INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection

    Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. INSIDE: LLMs' internal states retain the power of hallucination detection, 2024. URL https://arxiv.org/abs/2402.03744

  12. [12]

    Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps

    Yung-Sung Chuang, Linlu Qiu, Cheng-Yu Hsieh, Ranjay Krishna, Yoon Kim, and James Glass. Lookback lens: Detecting and mitigating contextual hallucinations in large language models using only attention maps, 2024. URL https://arxiv.org/abs/2407.07071

  13. [13]

    The Internal State of an LLM Knows When It's Lying

    Amos Azaria and Tom Mitchell. The internal state of an LLM knows when it's lying, 2023. URL https://arxiv.org/abs/2304.13734

  14. [14]

    Real-Time Detection of Hallucinated Entities in Long-Form Generation

    Oscar Obeso, Andy Arditi, Javier Ferrando, Joshua Freeman, Cameron Holmes, and Neel Nanda. Real-time detection of hallucinated entities in long-form generation, 2025. URL https://arxiv.org/abs/2509.03531

  15. [15]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA : Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685

  16. [16]

    Unsupervised Real-Time Hallucination Detection Based on the Internal States of Large Language Models

    Weihang Su, Changyue Wang, Qingyao Ai, Yiran HU, Zhijing Wu, Yujia Zhou, and Yiqun Liu. Unsupervised real-time hallucination detection based on the internal states of large language models, 2024. URL https://arxiv.org/abs/2403.06448

  17. [17]

    LRP4RAG: Detecting Hallucinations in Retrieval-Augmented Generation via Layer-Wise Relevance Propagation

    Haichuan Hu, Congqing He, Xiaochen Xie, and Quanjun Zhang. LRP4RAG: Detecting hallucinations in retrieval-augmented generation via layer-wise relevance propagation. arXiv preprint arXiv:2408.15533, 2025

  18. [18]

    ICR Probe: Tracking Hidden State Dynamics for Reliable Hallucination Detection in LLMs

    Zhenliang Zhang, Xinyu Hu, Huixuan Zhang, Junzhe Zhang, and Xiaojun Wan. ICR Probe: Tracking hidden state dynamics for reliable hallucination detection in LLMs, 2025. URL https://arxiv.org/abs/2507.16488

  19. [19]

    SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models

    Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models, 2023. URL https://arxiv.org/abs/2303.08896

  20. [20]

    DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing

    Pengcheng He, Jianfeng Gao, and Weizhu Chen. DeBERTaV3 : Improving DeBERTa using ELECTRA-Style pre-training with gradient-disentangled embedding sharing, 2023. URL https://arxiv.org/abs/2111.09543

  21. [21]

    MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents

    Liyan Tang, Philippe Laban, and Greg Durrett. MiniCheck : Efficient fact-checking of LLMs on grounding documents, 2024. URL https://arxiv.org/abs/2404.10774

  22. [22]

    Lynx: An Open Source Hallucination Evaluation Model

    Selvan Sunitha Ravi, Bartosz Mielczarek, Anand Kannappan, Douwe Kiela, and Rebecca Qian. Lynx: An open source hallucination evaluation model, 2024. URL https://arxiv.org/abs/2407.08488

  23. [23]

    Granite Guardian

    Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, and et al. Granite guardian, 2024. URL https://arxiv.org/abs/2412.07724

  24. [24]

    HHEM 2.1: A Better Hallucination Detection Model and a New Leaderboard

    Ofer Mendelevitch, Forrest Bao, Miaoran Li, and Rogger Luo. HHEM 2.1 : A better hallucination detection model and a new leaderboard. Vectara blog, Aug 2024. URL https://www.vectara.com/blog/hhem-2-1-a-better-hallucination-detection-model

  25. [25]

    RAGAS: Automated evaluation of retrieval augmented generation

    Shahul Es, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. Ragas: Automated evaluation of retrieval augmented generation, 2025. URL https://arxiv.org/abs/2309.15217

  26. [26]

    HalluciNot: Hallucination Detection Through Context and Common Knowledge Verification

    Bibek Paudel, Alexander Lyzhov, Preetam Joshi, and Puneet Anand. HalluciNot : Hallucination detection through context and common knowledge verification, 2025. URL https://arxiv.org/abs/2504.07069

  27. [27]

    LUMINA: Detecting Hallucinations in RAG System with Context-Knowledge Signals

    Samuel Yeh, Sharon Li, and Tanwi Mallick. LUMINA : Detecting hallucinations in RAG system with context-knowledge signals, 2025. URL https://arxiv.org/abs/2509.21875

  28. [28]

    HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models

    Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. HaluEval : A large-scale hallucination evaluation benchmark for large language models, 2023. URL https://arxiv.org/abs/2305.11747

  29. [29]

    The HalluRAG Dataset: Detecting Closed-Domain Hallucinations in RAG Applications Using an LLM's Internal States

    Fabian Ridder and Malte Schilling. The HalluRAG dataset: Detecting closed-domain hallucinations in RAG applications using an LLM's internal states, 2025. URL https://arxiv.org/abs/2412.17056

  30. [30]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, and et al. LLaMA : Open and efficient foundation language models, 2023. URL https://arxiv.org/abs/2302.13971

  31. [31]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, and et al. The Llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783

  32. [32]

    Mistral 7B

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B, 2023. URL http...

  33. [33]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, and et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URL https://arxiv.org/abs/2507.06261

  34. [34]

    BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

    Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. BGE M3-Embedding : Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2023

  35. [35]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903

  36. [36]

    Do androids know they're only dreaming of electric sheep?, 2024

    Sky CH-Wang, Benjamin Van Durme, Jason Eisner, and Chris Kedzie. Do androids know they're only dreaming of electric sheep?, 2024. URL https://arxiv.org/abs/2312.17249

  37. [37]

    Do LLMs Know About Hallucination? An Empirical Investigation of LLM's Hidden States

    Hanyu Duan, Yi Yang, and Kar Yan Tam. Do LLMs know about hallucination? An empirical investigation of LLM's hidden states, 2024. URL https://arxiv.org/abs/2402.09733

  38. [38]

    Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts

    Jian Xie, Kai Zhang, Jiangjie Chen, Renze Lou, and Yu Su. Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conflicts, 2024. URL https://arxiv.org/abs/2305.13300

  39. [39]

    LettuceDetect: A Hallucination Detection Framework for RAG Applications

    Ádám Kovács and Gábor Recski. LettuceDetect: A hallucination detection framework for RAG applications. arXiv preprint arXiv:2502.17125, 2025

  40. [40]

    HaloScope: Harnessing Unlabeled LLM Generations for Hallucination Detection

    Xuefeng Du, Chaowei Xiao, and Yixuan Li. HaloScope : Harnessing unlabeled LLM generations for hallucination detection, 2024. URL https://arxiv.org/abs/2409.17504

  41. [41]

    Gemma 3 Technical Report

    Gemma Team, Aishwarya Kamath, and Johan Ferret et al. Gemma 3 technical report, 2025. URL https://arxiv.org/abs/2503.19786

  42. [42]

    LFM2 Technical Report

    Alexander Amini, Anna Banaszak, Harold Benoit, and et al. LFM2 technical report, 2025. URL https://arxiv.org/abs/2511.23404

  43. [43]

    LLaMA 3.2 1B language model

    Meta AI. LLaMA 3.2 1B language model. Hugging Face model card, https://huggingface.co/meta-llama/Llama-3.2-1B, 2024. [Online; accessed 5-Nov-2025]

  44. [44]

    Granite 4.0 language models

    IBM Research. Granite 4.0 language models. GitHub, https://github.com/ibm-granite/granite-4.0-language-models, 2025. [Online; accessed 5-Nov-2025]