Chain-of-Verification Reduces Hallucination in Large Language Models
Pith reviewed 2026-05-18 01:03 UTC · model grok-4.3
The pith
A four-step self-check process lets language models correct factual errors in their own answers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that language models can reduce hallucinations by following a four-step Chain-of-Verification procedure: drafting an initial response, planning verification questions to fact-check the draft, answering those questions independently so the answers are not biased by the draft, and then generating a final verified response that draws on the independent answers.
What carries the argument
The Chain-of-Verification procedure, a four-step sequence of drafting, planning verification questions, independent answering, and final response generation that isolates verification from the initial draft.
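The four steps can be sketched as a short pipeline. This is a minimal illustration, not the authors' implementation: the `LLM` callable, the function name `chain_of_verification`, and all prompt wording are hypothetical, and any real model client would have to be wrapped to match the prompt-in, text-out signature assumed here.

```python
from typing import Callable, Dict, List

# Hypothetical interface: a function that takes a prompt and returns text.
LLM = Callable[[str], str]


def chain_of_verification(query: str, llm: LLM) -> Dict[str, object]:
    """Run the four CoVe steps with illustrative (not paper-sourced) prompts."""
    # (i) Draft an initial response.
    draft = llm(f"Answer the question.\nQuestion: {query}\nAnswer:")

    # (ii) Plan verification questions that fact-check the draft.
    plan = llm(
        "Write one verification question per line that would fact-check "
        f"this answer.\nQuestion: {query}\nAnswer: {draft}\n"
        "Verification questions:"
    )
    stripped = (line.strip("-• \t") for line in plan.splitlines())
    questions: List[str] = [q for q in stripped if q]

    # (iii) Answer each question independently: the prompt is the question
    # alone, so neither the query nor the draft is visible to the model.
    answers = [llm(q) for q in questions]

    # (iv) Generate the final response from the independent answers.
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(questions, answers))
    final = llm(
        f"Original question: {query}\nDraft answer: {draft}\n"
        f"Verification results:\n{evidence}\n"
        "Rewrite the draft, keeping only facts consistent with the "
        "verification results.\nFinal answer:"
    )
    return {
        "draft": draft,
        "questions": questions,
        "verification_answers": answers,
        "final": final,
    }
```

Passing the model as a plain callable keeps the sketch independent of any particular API and makes step (iii) easy to inspect, since each verification call receives only the question string.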
If this is right
- The method lowers the number of incorrect facts produced when answering list-based questions from Wikidata.
- It improves accuracy on closed-book MultiSpanQA by catching unsupported spans through separate verification.
- It decreases the amount of invented content in long-form text generation outputs.
- The four-step process works across these different task types using the same prompting structure.
Where Pith is reading between the lines
- The independent-answer step could be tested on tasks involving logical consistency rather than pure facts to see if the same separation helps.
- Integrating the verification questions into a single forward pass might reduce the number of model calls needed while preserving the benefit.
- The approach might combine with retrieval methods by using the verification answers to decide when to fetch external evidence.
Load-bearing premise
The model can generate verification questions and answer them in a way that remains unaffected by its own initial draft response.
What would settle it
A direct comparison in which the same verification questions are answered once with the draft hidden and once with the draft visible. If hiding the draft yields no reduction in final hallucinations relative to showing it, the independence step is not doing the work the paper attributes to it.
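A rough harness for that comparison might look like the following. It is a sketch under stated assumptions: the names `verification_prompt` and `error_rate`, the prompt layout, and the substring-based scoring are all hypothetical, and it scores the verification answers themselves as a proxy, whereas a full test would score hallucinations in the final response.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interface: a function that takes a prompt and returns text.
LLM = Callable[[str], str]

# Each item is (verification question, draft response, gold answer).
Item = Tuple[str, str, str]


def verification_prompt(question: str, draft: str, draft_visible: bool) -> str:
    """Build the prompt for one condition of the ablation."""
    if draft_visible:
        return f"Draft answer:\n{draft}\n\nQuestion: {question}\nAnswer:"
    return f"Question: {question}\nAnswer:"


def error_rate(llm: LLM, items: Sequence[Item], draft_visible: bool) -> float:
    """Fraction of verification answers that do not contain the gold answer."""
    if not items:
        raise ValueError("items must be non-empty")
    wrong = 0
    for question, draft, gold in items:
        answer = llm(verification_prompt(question, draft, draft_visible))
        # Crude scoring rule (an assumption): case-insensitive substring match.
        if gold.lower() not in answer.lower():
            wrong += 1
    return wrong / len(items)
```

Running `error_rate` twice on the same items, once with `draft_visible=False` and once with `draft_visible=True`, gives the two numbers the comparison needs.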
Original abstract
Generation of plausible yet incorrect factual information, termed hallucination, is an unsolved issue in large language models. We study the ability of language models to deliberate on the responses they give in order to correct their mistakes. We develop the Chain-of-Verification (CoVe) method whereby the model first (i) drafts an initial response; then (ii) plans verification questions to fact-check its draft; (iii) answers those questions independently so the answers are not biased by other responses; and (iv) generates its final verified response. In experiments, we show CoVe decreases hallucinations across a variety of tasks, from list-based questions from Wikidata, closed book MultiSpanQA and longform text generation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Chain-of-Verification (CoVe), a four-step prompting procedure in which an LLM first drafts a response, plans verification questions to fact-check the draft, answers those questions independently, and finally produces a verified response. Experiments are reported to show reduced hallucinations on list-based Wikidata questions, closed-book MultiSpanQA, and longform text generation.
Significance. If the empirical gains hold under rigorous controls, CoVe would constitute a simple, training-free prompting technique that improves factual accuracy across multiple task types without external retrieval or model changes. The multi-task scope and focus on a practical mitigation strategy for a core LLM limitation would make the result noteworthy for the field.
major comments (1)
- [CoVe procedure description] The central claim that CoVe reduces hallucinations rests on step (iii) producing verification answers that are unbiased by the initial draft. The manuscript states that questions are answered 'independently so the answers are not biased by other responses,' yet provides no explicit mechanism (separate sessions, context clearing, or prompt isolation) to enforce this independence. If verification prompts retain the original query or any latent state from the draft, the observed gains could arise from re-generation rather than external fact-checking.
minor comments (2)
- Add the exact prompt templates and any context-management details used in the experiments to support reproducibility.
- Report statistical significance tests and confidence intervals for the hallucination reductions rather than raw percentages alone.
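One way to produce the requested confidence intervals is a paired percentile bootstrap over per-example hallucination counts. The sketch below is illustrative only: the function name, the default of 2000 resamples, and the choice of a percentile bootstrap are assumptions, not details taken from the paper.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    baseline: Sequence[float],
    treated: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Mean per-example reduction (baseline - treated) with a percentile CI.

    Both sequences hold per-example hallucination counts for the same
    examples, in the same order.
    """
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(baseline)
    diffs = [b - t for b, t in zip(baseline, treated)]
    observed = sum(diffs) / n

    # Resample examples with replacement and recompute the mean difference.
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lower, upper
```

An interval that excludes zero would support the claim that the reduction is not a sampling artifact; resampling paired differences keeps each example's baseline and CoVe scores together.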
Simulated Author's Rebuttal
We thank the referee for their constructive review and recommendation for minor revision. We address the major comment below regarding the independence of verification answers in the CoVe procedure.
Point-by-point responses
-
Referee: The central claim that CoVe reduces hallucinations rests on step (iii) producing verification answers that are unbiased by the initial draft. The manuscript states that questions are answered 'independently so the answers are not biased by other responses,' yet provides no explicit mechanism (separate sessions, context clearing, or prompt isolation) to enforce this independence. If verification prompts retain the original query or any latent state from the draft, the observed gains could arise from re-generation rather than external fact-checking.
Authors: We appreciate the referee's observation on this point. In the experimental implementation described in Section 3, each verification question is issued via an independent model call (or separate prompt session) that includes solely the verification question itself, with no inclusion of the original user query, the draft response, or any prior context from the conversation history. This design ensures the verification answers are generated without influence from the initial draft. We have updated the manuscript to explicitly detail this isolation mechanism, including example prompts, to address the concern. revision: yes
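The isolation the authors describe can be checked mechanically by recording every prompt sent during step (iii) and confirming that none contains the original query or the draft. The following is a hypothetical sketch: `RecordingLLM`, `answer_in_isolation`, and `leaked_prompts` are illustrative names, the check catches only verbatim leakage, and it assumes the underlying model call is stateless so that no conversation history carries over between calls.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: a function that takes a prompt and returns text.
LLM = Callable[[str], str]


class RecordingLLM:
    """Wrap a model callable and keep a log of every prompt it receives."""

    def __init__(self, inner: LLM) -> None:
        self.inner = inner
        self.prompts: List[str] = []

    def __call__(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return self.inner(prompt)


def answer_in_isolation(questions: Sequence[str], llm: LLM) -> List[str]:
    """Issue one call per verification question, sending the question alone."""
    return [llm(question) for question in questions]


def leaked_prompts(prompts: Sequence[str], forbidden: Sequence[str]) -> List[str]:
    """Return the prompts that contain any forbidden text verbatim."""
    return [p for p in prompts if any(text and text in p for text in forbidden)]
```

An empty result from `leaked_prompts(recorder.prompts, [query, draft])` is evidence that the prompt text is isolated; it does not rule out paraphrased leakage through the verification questions themselves.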
Circularity Check
CoVe is presented as a procedural recipe with separate empirical tests; no derivation reduces to its inputs by construction.
Full rationale
The paper defines Chain-of-Verification explicitly as a four-step procedure (draft, plan verifications, answer independently, generate final response) and supports its effectiveness via experiments on Wikidata lists, MultiSpanQA, and longform generation. No equations, fitted parameters, or self-referential definitions appear in the method; the independence claim in step (iii) is stated as a design choice rather than derived from prior steps. No self-citation chains or uniqueness theorems are invoked to justify the core procedure. The work is therefore self-contained as an empirical method proposal without circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Large language models can generate relevant verification questions and provide answers to them that are not biased by an initial draft response.
Forward citations
Cited by 18 Pith papers
-
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
-
MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents
MobiBench is the first modular multi-path offline benchmark for mobile GUI agents, achieving 94.72% agreement with human evaluators while allowing component-level analysis.
-
Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents
Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summa...
-
Credo: Declarative Control of LLM Pipelines via Beliefs and Policies
Credo proposes representing LLM agent state as beliefs and regulating pipeline behavior with declarative policies stored in a database for adaptive, auditable control.
-
Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents
PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.
-
Illocutionary Explanation Planning for Source-Faithful Explanations in Retrieval-Augmented Language Models
Chain-of-illocution prompting improves source adherence in RAG explanations for programming education by up to 63% over baselines.
-
Hallucination is Inevitable: An Innate Limitation of Large Language Models
Hallucinations are inevitable in LLMs because they cannot learn all computable functions according to learning theory.
-
Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agentic Systems
Compositional selective specificity (CSS) improves overcommitment-aware utility from 0.846 to 0.913 on LongFact while retaining 0.938 specificity by calibrating claim-level backoffs in agentic AI responses.
-
Perceive, Verify and Understand Long Video: Multi-Granular Perception and Active Verification via Interactive Agents
CogniGPT uses an interactive loop between a Multi-Granular Perception Agent and an Active Verification Agent to identify reliable clues in long videos with high accuracy and low frame usage.
-
Towards an AI co-scientist
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
-
Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs
SEPs approximate semantic entropy from single-generation hidden states to enable cheap and robust hallucination detection in LLMs.
-
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.
-
HalluScan: A Systematic Benchmark for Detecting and Mitigating Hallucinations in Instruction-Following LLMs
HalluScan benchmark tests hallucination detectors on LLMs, identifies NLI Verification as top performer with 0.88 AUROC, and introduces HalluScore (r=0.41 with humans) plus a routing method for 2x cost savings.
-
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
-
Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)
HUMBR reduces LLM hallucinations in enterprise workflows by using a hybrid semantic-lexical utility within minimum Bayes risk decoding to identify consensus outputs, with derived error bounds and reported outperforman...
-
Hallucination Detection and Evaluation of Large Language Model
HHEM delivers fast hallucination detection in LLMs via classification, cutting evaluation time from 8 hours to 10 minutes with up to 82.2% accuracy while adding segment retrieval for summarization.
-
Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models
A literature survey that taxonomizes hallucination phenomena in LLMs, reviews evaluation benchmarks, and analyzes approaches for their detection, explanation, and mitigation.
-
A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications
A systematic survey categorizes prompt engineering methods for LLMs and VLMs by application area, summarizing methodologies, applications, models, datasets, strengths, and limitations for each technique along with a t...