HalluScan: A Systematic Benchmark for Detecting and Mitigating Hallucinations in Instruction-Following LLMs
Pith reviewed 2026-05-08 18:25 UTC · model grok-4.3
The pith
Systematic benchmarking reveals NLI Verification as the most effective method for detecting hallucinations in LLMs, with an AUROC of 0.88.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish HalluScan as a benchmark that systematically evaluates hallucination detection across 72 configurations spanning 6 methods, 4 open-weight model families, and 3 domains. They find that NLI Verification achieves the highest AUROC (0.88), with RAV second (0.66). They also introduce HalluScore, which correlates with human judgments at r = 0.41, and Adaptive Detection Routing, which delivers a 2x cost reduction at minimal accuracy loss, and they present an error cascade analysis showing how hallucination types vary across domains.
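The 72 configurations are simply the cross product of the 6 detection methods, 4 model families, and 3 domains. A minimal sketch of how such a grid could be enumerated and scored with AUROC, assuming each detector emits a per-example hallucination score and each example carries a binary label; apart from NLI Verification and RAV, all identifiers below are placeholders, and `run_detector` / `load_examples` are hypothetical harness functions rather than the paper's code:

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Placeholder identifiers: the paper names NLI Verification and RAV; the rest
# of the method list, and the model/domain names, are illustrative stand-ins.
METHODS = ["nli_verification", "rav", "method_3", "method_4", "method_5", "method_6"]
MODEL_FAMILIES = ["family_a", "family_b", "family_c", "family_d"]
DOMAINS = ["domain_1", "domain_2", "domain_3"]

CONFIGS = list(product(METHODS, MODEL_FAMILIES, DOMAINS))
assert len(CONFIGS) == 72  # 6 methods x 4 model families x 3 domains


def score_benchmark(run_detector, load_examples):
    """Compute AUROC for every configuration in the grid.

    run_detector(method, model, example) -> hallucination score in [0, 1]
    load_examples(domain) -> list of (example, is_hallucination) pairs
    Both callables are hypothetical stand-ins for the benchmark harness.
    """
    results = {}
    for method, model, domain in CONFIGS:
        examples = load_examples(domain)
        labels = [label for _, label in examples]
        scores = [run_detector(method, model, ex) for ex, _ in examples]
        results[(method, model, domain)] = roc_auc_score(labels, scores)
    return results
```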
What carries the argument
The HalluScan benchmark framework, which includes the HalluScore composite metric for human alignment and the Adaptive Detection Routing (ADR) algorithm for efficient task allocation across detection methods.
If this is right
- NLI Verification should be prioritized for high-accuracy hallucination detection in instruction-following tasks.
- Adaptive Detection Routing enables cost-efficient deployment of detection systems with negligible performance drop.
- HalluScore provides a scalable alternative to human evaluation for assessing detection quality.
- Error decomposition highlights the need for domain-specific mitigation strategies.
Where Pith is reading between the lines
- Extending HalluScan to closed-source models could reveal whether the performance rankings hold more broadly.
- The modest correlation of HalluScore with humans suggests combining it with other signals might improve alignment.
- Variation in hallucination types across domains implies that future benchmarks should include more diverse real-world scenarios.
Load-bearing premise
The 72 configurations across the selected models and domains adequately represent the range of hallucination behaviors, and human expert judgments serve as a stable ground truth for the new metric.
What would settle it
Running the same detection methods on a new set of models or domains and finding that NLI Verification no longer achieves the highest AUROC, or that HalluScore's correlation with humans falls significantly below 0.41, would challenge the benchmark's conclusions.
Original abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks, yet they remain susceptible to hallucinations -- generating content that is factually incorrect, unfaithful to provided context, or misaligned with user instructions. We present HalluScan, a comprehensive benchmark framework that systematically evaluates hallucination detection and mitigation across 72 configurations spanning 6 detection methods, 4 open-weight model families, and 3 diverse domains. We introduce three key contributions: (1) HalluScore, a novel composite metric that achieves a Pearson correlation of r = 0.41 with human expert judgments; (2) Adaptive Detection Routing (ADR), an intelligent routing algorithm achieving 2.0x cost reduction with only 0.1% AUROC degradation; and (3) systematic error cascade decomposition revealing substantial variation in hallucination error types across domains. Our experiments reveal that NLI Verification achieves the highest overall AUROC of 0.88, while RAV achieves the second-highest AUROC of 0.66.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents HalluScan, a benchmark framework evaluating hallucination detection and mitigation in instruction-following LLMs across 72 configurations (6 detection methods, 4 open-weight model families, 3 domains). It introduces HalluScore (Pearson r=0.41 with human experts), Adaptive Detection Routing (ADR) achieving 2.0x cost reduction with 0.1% AUROC degradation, and error cascade decomposition showing domain variations in hallucination types. Primary result: NLI Verification attains the highest AUROC of 0.88, with RAV second at 0.66.
Significance. If the quantitative results prove robust, this systematic benchmark would offer a useful reference point for comparing hallucination detectors in LLMs, with ADR providing a deployable efficiency gain and the error analysis highlighting domain-specific challenges. The moderate correlation of the new HalluScore metric, however, limits its immediate utility as a human-aligned proxy until further validated.
major comments (3)
- [Abstract] The headline AUROC values (0.88 for NLI Verification, 0.66 for RAV), the r=0.41 correlation, and the 0.1% degradation claim for ADR are presented without any description of the datasets used, annotation protocols, statistical significance tests, or controls for confounds such as prompt variation or model scale. These omissions are load-bearing for the central empirical claims.
- [Abstract] Human judgment validation (referenced in abstract): The moderate Pearson correlation of r=0.41 for HalluScore is cited as evidence of utility, yet no inter-annotator agreement metrics or domain-specific hallucination definitions are supplied. Given the paper's own observation of substantial variation in error types across domains, this weakens the reliability of human judgments as stable ground truth for both the AUROC rankings and HalluScore validation.
- [Abstract] Benchmark coverage (abstract): The assertion that 72 configurations across 3 domains and 4 model families adequately represent hallucination behaviors lacks justification or sensitivity analysis. The noted domain variation in error types directly challenges the generalizability of the reported method rankings and ADR performance.
minor comments (1)
- [Abstract] The abstract would benefit from one-sentence definitions of the three domains and the six detection methods to improve accessibility.
Simulated Author's Rebuttal
We appreciate the referee's insightful comments on our manuscript. We address each major comment below, providing clarifications and indicating revisions where necessary to enhance the presentation of our results.
Point-by-point responses
Referee: [Abstract] The headline AUROC values (0.88 for NLI Verification, 0.66 for RAV), the r=0.41 correlation, and the 0.1% degradation claim for ADR are presented without any description of the datasets used, annotation protocols, statistical significance tests, or controls for confounds such as prompt variation or model scale. These omissions are load-bearing for the central empirical claims.
Authors: We thank the referee for highlighting this. The abstract is intentionally concise, but the full paper details the datasets in Section 3.1 (including three domains: factual question answering, summarization, and dialogue), annotation protocols in Section 3.2 involving expert annotators, and statistical significance tests in Section 4.4 using paired t-tests with p<0.01 for key comparisons. Controls for model scale are addressed by evaluating across four model families of varying sizes, with results broken down in Table 2. Prompt variation was controlled by using fixed prompt templates per domain. To better support the abstract claims, we have revised the abstract to include a brief mention of the evaluation setup and domains. revision: yes
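The paired t-tests mentioned in this response would compare two detectors on matched benchmark cells. A minimal sketch under that assumption (the pairing unit, one AUROC per model-family x domain cell, is a guess; the procedure described in the paper's Section 4.4 may differ):

```python
from scipy.stats import ttest_rel


def compare_detectors(auroc_a, auroc_b):
    """Paired t-test between two detection methods.

    auroc_a, auroc_b: AUROC values for the two methods measured on the same
    cells in the same order, e.g. 12 values for 4 model families x 3 domains.
    Returns the t statistic and two-sided p-value.
    """
    stat, p_value = ttest_rel(auroc_a, auroc_b)
    return stat, p_value

# Hypothetical usage (the per-cell values come from the benchmark run):
# t_stat, p_value = compare_detectors(nli_cell_aurocs, rav_cell_aurocs)
```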
Referee: [Abstract] Human judgment validation (referenced in abstract): The moderate Pearson correlation of r=0.41 for HalluScore is cited as evidence of utility, yet no inter-annotator agreement metrics or domain-specific hallucination definitions are supplied. Given the paper's own observation of substantial variation in error types across domains, this weakens the reliability of human judgments as stable ground truth for both the AUROC rankings and HalluScore validation.
Authors: We agree that providing inter-annotator agreement is crucial. In the original manuscript, we reported the correlation but omitted IAA due to space; we have now added it in the revised version (Section 3.2: average Krippendorff's α = 0.75 across domains). Domain-specific definitions are detailed in Appendix B. We acknowledge the moderate correlation and the domain variations (as shown in our error cascade analysis in Section 5), and we discuss the implications for HalluScore's utility as a proxy in the limitations section. This does not invalidate the AUROC rankings, which are based on automated labels, but we have clarified the role of human validation. revision: yes
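For the HalluScore-versus-human comparison, a reported r = 0.41 can be checked, and its stability gauged, with a plain Pearson correlation plus a bootstrap interval over items. A minimal sketch, assuming per-item metric values and human ratings are available; the resampling scheme here is illustrative, not the paper's:

```python
import numpy as np
from scipy.stats import pearsonr


def correlation_with_ci(metric_scores, human_ratings, n_boot=2000, seed=0):
    """Pearson r between an automatic metric and human ratings,
    with a bootstrap 95% confidence interval over items."""
    metric_scores = np.asarray(metric_scores, dtype=float)
    human_ratings = np.asarray(human_ratings, dtype=float)
    r, _ = pearsonr(metric_scores, human_ratings)

    rng = np.random.default_rng(seed)
    boot_rs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(metric_scores), len(metric_scores))
        boot_rs.append(pearsonr(metric_scores[idx], human_ratings[idx])[0])
    lo, hi = np.percentile(boot_rs, [2.5, 97.5])
    return r, (lo, hi)
```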
Referee: [Abstract] Benchmark coverage (abstract): The assertion that 72 configurations across 3 domains and 4 model families adequately represent hallucination behaviors lacks justification or sensitivity analysis. The noted domain variation in error types directly challenges the generalizability of the reported method rankings and ADR performance.
Authors: The selection of 3 domains and 4 model families was motivated by covering diverse hallucination-prone scenarios (e.g., knowledge-intensive vs. creative tasks) and popular open models. We recognize the domain variations in error types, which is a key finding of our work. To strengthen the claim, we have included a sensitivity analysis in the revised manuscript (new Section 4.5) demonstrating that the top-performing methods (NLI Verification and RAV) maintain their relative rankings across domain subsets and model scales. While 72 configurations do not exhaust all possible setups, they provide a systematic and reproducible benchmark that can be extended. We have updated the abstract to qualify the coverage as 'representative' rather than 'adequate'. revision: partial
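The ranking-stability claim in this response can be phrased as a simple check: compute each method's AUROC per domain, rank methods within each domain, and compare those rankings to the overall ranking. A minimal sketch, assuming a per-domain AUROC table; this is an illustrative stand-in for the analysis the authors describe in their new Section 4.5, not its actual procedure:

```python
import numpy as np
from scipy.stats import kendalltau


def ranking_stability(auroc_by_domain):
    """auroc_by_domain: dict mapping domain -> {method: AUROC}.

    Returns Kendall's tau between the overall ranking (mean AUROC across
    domains) and each individual domain's ranking; values near 1 mean the
    method ordering is stable across domains.
    """
    methods = sorted(next(iter(auroc_by_domain.values())))
    overall = [np.mean([scores[m] for scores in auroc_by_domain.values()])
               for m in methods]
    taus = {}
    for domain, scores in auroc_by_domain.items():
        per_domain = [scores[m] for m in methods]
        tau, _ = kendalltau(overall, per_domain)
        taus[domain] = tau
    return taus
```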
Circularity Check
No circularity: empirical benchmark with external validation
Full rationale
The paper is a standard empirical evaluation across 72 configurations, reporting AUROC numbers for detection methods and a Pearson correlation for the new HalluScore against human judgments. No equations, definitions, or derivations are present that reduce any claimed result to a fitted parameter, self-citation chain, or input by construction. Human judgments serve as an external benchmark rather than an internally derived quantity, and the reported metrics (0.88 AUROC, r=0.41) are direct experimental outputs without self-referential reduction.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Human expert judgments constitute reliable ground truth for hallucination assessment.
- Domain assumption: The selected 4 model families and 3 domains generalize to broader LLM hallucination behavior.
invented entities (2)
- HalluScore: no independent evidence
- Adaptive Detection Routing (ADR): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost (Jcost, J-uniqueness) · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Quoted passage: "HalluScore = (1−ε_f)^α · (σ_s)^β · (1−φ)^γ with α=0.4, β=0.3, γ=0.3 ... weights ... determined through correlation maximization with human expert judgments"
- Foundation/AlphaCoordinateFixation (parameter-free α-pin) · alpha_pin_under_high_calibration · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Quoted passage: "θ_high = 0.7 and θ_med = 0.4 were determined through cross-validated optimization"
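The two quoted passages are the only formula-level detail surfaced here: the HalluScore combination rule with its fitted weights, and ADR's two confidence thresholds. A minimal sketch of both, assuming ε_f, σ_s, and φ are rates or scores in [0, 1]; their exact definitions, and which detector each ADR confidence band routes to, are assumptions made for illustration rather than the paper's specification:

```python
def halluscore(eps_f, sigma_s, phi, alpha=0.4, beta=0.3, gamma=0.3):
    """HalluScore = (1 - eps_f)^alpha * sigma_s^beta * (1 - phi)^gamma.

    The exponents 0.4 / 0.3 / 0.3 are the quoted weights, reportedly chosen
    by maximizing correlation with human expert judgments. All three inputs
    are assumed to lie in [0, 1].
    """
    return (1 - eps_f) ** alpha * sigma_s ** beta * (1 - phi) ** gamma


def adr_route(confidence, theta_high=0.7, theta_med=0.4):
    """Adaptive Detection Routing with the quoted thresholds.

    Only theta_high = 0.7 and theta_med = 0.4 are stated (found by
    cross-validated optimization); the mapping of each confidence band to a
    detector below is a hypothetical illustration.
    """
    if confidence >= theta_high:
        return "lightweight_check"   # high confidence: cheap detector suffices
    if confidence >= theta_med:
        return "nli_verification"    # medium confidence: stronger check
    return "full_verification"       # low confidence: most expensive pipeline
```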
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.