pith. machine review for the scientific record. sign in

arxiv: 2309.11495 · v2 · pith:46MPX235new · submitted 2023-09-20 · 💻 cs.CL · cs.AI

Chain-of-Verification Reduces Hallucination in Large Language Models

Pith reviewed 2026-05-18 01:03 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords hallucination reductionlarge language modelsself-verificationchain of verificationfact checkingresponse generationfactual accuracy
0
0 comments X

The pith

A four-step self-check process lets language models correct factual errors in their own answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a Chain-of-Verification method in which a language model first produces a draft response, then generates a set of verification questions aimed at checking that draft, answers those questions without reference to the draft, and finally produces a revised response that incorporates the verification results. This sequence is meant to catch and remove incorrect factual claims that would otherwise appear in the output. Experiments apply the method to list questions drawn from Wikidata, closed-book multi-span question answering, and long-form text generation, showing lower rates of hallucinated information in each case. If the independent verification step reliably prevents the original draft from influencing the checks, the approach offers an internal way to improve factual accuracy without external retrieval or model retraining.

Core claim

The central claim is that language models can reduce hallucinations by following a four-step Chain-of-Verification procedure: drafting an initial response, planning verification questions to fact-check the draft, answering those questions independently so the answers are not biased by the draft, and then generating a final verified response that draws on the independent answers.

What carries the argument

The Chain-of-Verification procedure, a four-step sequence of drafting, planning verification questions, independent answering, and final response generation that isolates verification from the initial draft.

If this is right

  • The method lowers the number of incorrect facts produced when answering list-based questions from Wikidata.
  • It improves accuracy on closed-book MultiSpanQA by catching unsupported spans through separate verification.
  • It decreases the amount of invented content in long-form text generation outputs.
  • The four-step process works across these different task types using the same prompting structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The independent-answer step could be tested on tasks involving logical consistency rather than pure facts to see if the same separation helps.
  • Integrating the verification questions into a single forward pass might reduce the number of model calls needed while preserving the benefit.
  • The approach might combine with retrieval methods by using the verification answers to decide when to fetch external evidence.

Load-bearing premise

The model can generate verification questions and answer them in a way that remains unaffected by its own initial draft response.

What would settle it

A direct comparison experiment in which verification questions are answered once with the draft hidden and once with the draft visible, showing no reduction in final hallucinations when the draft is hidden.

read the original abstract

Generation of plausible yet incorrect factual information, termed hallucination, is an unsolved issue in large language models. We study the ability of language models to deliberate on the responses they give in order to correct their mistakes. We develop the Chain-of-Verification (CoVe) method whereby the model first (i) drafts an initial response; then (ii) plans verification questions to fact-check its draft; (iii) answers those questions independently so the answers are not biased by other responses; and (iv) generates its final verified response. In experiments, we show CoVe decreases hallucinations across a variety of tasks, from list-based questions from Wikidata, closed book MultiSpanQA and longform text generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes Chain-of-Verification (CoVe), a four-step prompting procedure in which an LLM first drafts a response, plans verification questions to fact-check the draft, answers those questions independently, and finally produces a verified response. Experiments are reported to show reduced hallucinations on list-based Wikidata questions, closed-book MultiSpanQA, and longform text generation.

Significance. If the empirical gains hold under rigorous controls, CoVe would constitute a simple, training-free prompting technique that improves factual accuracy across multiple task types without external retrieval or model changes. The multi-task scope and focus on a practical mitigation strategy for a core LLM limitation would make the result noteworthy for the field.

major comments (1)
  1. [CoVe procedure description] The central claim that CoVe reduces hallucinations rests on step (iii) producing verification answers that are unbiased by the initial draft. The manuscript states that questions are answered 'independently so the answers are not biased by other responses,' yet provides no explicit mechanism (separate sessions, context clearing, or prompt isolation) to enforce this independence. If verification prompts retain the original query or any latent state from the draft, the observed gains could arise from re-generation rather than external fact-checking.
minor comments (2)
  1. Add the exact prompt templates and any context-management details used in the experiments to support reproducibility.
  2. Report statistical significance tests and confidence intervals for the hallucination reductions rather than raw percentages alone.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive review and recommendation for minor revision. We address the major comment below regarding the independence of verification answers in the CoVe procedure.

read point-by-point responses
  1. Referee: The central claim that CoVe reduces hallucinations rests on step (iii) producing verification answers that are unbiased by the initial draft. The manuscript states that questions are answered 'independently so the answers are not biased by other responses,' yet provides no explicit mechanism (separate sessions, context clearing, or prompt isolation) to enforce this independence. If verification prompts retain the original query or any latent state from the draft, the observed gains could arise from re-generation rather than external fact-checking.

    Authors: We appreciate the referee's observation on this point. In the experimental implementation described in Section 3, each verification question is issued via an independent model call (or separate prompt session) that includes solely the verification question itself, with no inclusion of the original user query, the draft response, or any prior context from the conversation history. This design ensures the verification answers are generated without influence from the initial draft. We have updated the manuscript to explicitly detail this isolation mechanism, including example prompts, to address the concern. revision: yes

Circularity Check

0 steps flagged

CoVe presented as procedural recipe with separate empirical tests; no derivation reduces to inputs by construction

full rationale

The paper defines Chain-of-Verification explicitly as a four-step procedure (draft, plan verifications, answer independently, generate final response) and supports its effectiveness via experiments on Wikidata lists, MultiSpanQA, and longform generation. No equations, fitted parameters, or self-referential definitions appear in the method; the independence claim in step (iii) is stated as a design choice rather than derived from prior steps. No self-citation chains or uniqueness theorems are invoked to justify the core procedure. The work is therefore self-contained as an empirical method proposal without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the effectiveness of a prompting strategy rather than new mathematical constructs or fitted parameters.

axioms (1)
  • domain assumption Large language models can generate relevant verification questions and provide answers to them that are not biased by an initial draft response.
    This assumption enables the independent answering step in the CoVe method.

pith-pipeline@v0.9.0 · 5666 in / 1225 out tokens · 37360 ms · 2026-05-18T01:03:08.207655+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

    cs.CL 2026-05 unverdicted novelty 8.0

    REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...

  2. MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents

    cs.AI 2025-12 accept novelty 8.0

    MobiBench is the first modular multi-path offline benchmark for mobile GUI agents, achieving 94.72% agreement with human evaluators while allowing component-level analysis.

  3. Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summa...

  4. Credo: Declarative Control of LLM Pipelines via Beliefs and Policies

    cs.AI 2026-04 unverdicted novelty 7.0

    Credo proposes representing LLM agent state as beliefs and regulating pipeline behavior with declarative policies stored in a database for adaptive, auditable control.

  5. Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.

  6. Illocutionary Explanation Planning for Source-Faithful Explanations in Retrieval-Augmented Language Models

    cs.CL 2026-03 conditional novelty 7.0

    Chain-of-illocution prompting improves source adherence in RAG explanations for programming education by up to 63% over baselines.

  7. Hallucination is Inevitable: An Innate Limitation of Large Language Models

    cs.CL 2024-01 conditional novelty 7.0

    Hallucinations are inevitable in LLMs because they cannot learn all computable functions according to learning theory.

  8. Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agentic Systems

    cs.CL 2026-04 unverdicted novelty 6.0

    Compositional selective specificity (CSS) improves overcommitment-aware utility from 0.846 to 0.913 on LongFact while retaining 0.938 specificity by calibrating claim-level backoffs in agentic AI responses.

  9. Perceive, Verify and Understand Long Video: Multi-Granular Perception and Active Verification via Interactive Agents

    cs.CV 2025-09 unverdicted novelty 6.0

    CogniGPT uses an interactive loop between a Multi-Granular Perception Agent and an Active Verification Agent to identify reliable clues in long videos with high accuracy and low frame usage.

  10. Towards an AI co-scientist

    cs.AI 2025-02 unverdicted novelty 6.0

    A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

  11. Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs

    cs.CL 2024-06 unverdicted novelty 6.0

    SEPs approximate semantic entropy from single-generation hidden states to enable cheap and robust hallucination detection in LLMs.

  12. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

    cs.CL 2023-10 unverdicted novelty 6.0

    Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.

  13. HalluScan: A Systematic Benchmark for Detecting and Mitigating Hallucinations in Instruction-Following LLMs

    cs.CL 2026-05 unverdicted novelty 5.0

    HalluScan benchmark tests hallucination detectors on LLMs, identifies NLI Verification as top performer with 0.88 AUROC, and introduces HalluScore (r=0.41 with humans) plus a routing method for 2x cost savings.

  14. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    cs.CL 2023-11 unverdicted novelty 5.0

    The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

  15. Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)

    cs.LG 2026-04 unverdicted novelty 4.0

    HUMBR reduces LLM hallucinations in enterprise workflows by using a hybrid semantic-lexical utility within minimum Bayes risk decoding to identify consensus outputs, with derived error bounds and reported outperforman...

  16. Hallucination Detection and Evaluation of Large Language Model

    cs.CL 2025-12 unverdicted novelty 4.0

    HHEM delivers fast hallucination detection in LLMs via classification, cutting evaluation time from 8 hours to 10 minutes with up to 82.2% accuracy while adding segment retrieval for summarization.

  17. Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

    cs.CL 2023-09 unverdicted novelty 4.0

    A literature survey that taxonomizes hallucination phenomena in LLMs, reviews evaluation benchmarks, and analyzes approaches for their detection, explanation, and mitigation.

  18. A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications

    cs.AI 2024-02 unverdicted novelty 3.0

    A systematic survey categorizes prompt engineering methods for LLMs and VLMs by application area, summarizing methodologies, applications, models, datasets, strengths, and limitations for each technique along with a t...

Reference graph

Works this paper leans on

264 extracted references · 264 canonical work pages · cited by 18 Pith papers · 24 internal anchors

  1. [6]

    Truth-o-meter: Collaborating with llm in fighting its hallucinations

    Boris A Galitsky. Truth-o-meter: Collaborating with llm in fighting its hallucinations. 2023

  2. [7]

    Rarr: Researching and revising what language models say, using language models

    Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, et al. Rarr: Researching and revising what language models say, using language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 16477--16...

  3. [9]

    Survey of hallucination in natural language generation

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55 0 (12): 0 1--38, 2023

  4. [14]

    Multispanqa: A dataset for multi-span question answering

    Haonan Li, Martin Tomko, Maria Vasardani, and Timothy Baldwin. Multispanqa: A dataset for multi-span question answering. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.\ 1250--1260, 2022

  5. [23]

    Reducing conversational agents’ overconfidence through linguistic calibration

    Sabrina J Mielke, Arthur Szlam, Emily Dinan, and Y-Lan Boureau. Reducing conversational agents’ overconfidence through linguistic calibration. Transactions of the Association for Computational Linguistics, 10: 0 857--872, 2022

  6. [25]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35: 0 27730--27744, 2022

  7. [29]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1 0 (8): 0 9, 2019

  8. [31]

    Measuring attribution in natural language generation models

    Hannah Rashkin, Vitaly Nikolaev, Matthew Lamm, Lora Aroyo, Michael Collins, Dipanjan Das, Slav Petrov, Gaurav Singh Tomar, Iulia Turc, and David Reitter. Measuring attribution in natural language generation models. Computational Linguistics, pp.\ 1--66, 2023

  9. [36]

    Contrastive learning reduces hallucination in conversations

    Weiwei Sun, Zhengliang Shi, Shen Gao, Pengjie Ren, Maarten de Rijke, and Zhaochun Ren. Contrastive learning reduces hallucination in conversations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp.\ 13618--13626, 2023 b

  10. [38]

    Llama 2: Open foundation and fine-tuned chat models, 2023 b

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

  11. [42]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35: 0 24824--24837, 2022

  12. [48]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter , title =. CoRR , volume =. 2017 , url =. 1711.05101 , timestamp =

  13. [49]

    2022 , eprint=

    PaLM: Scaling Language Modeling with Pathways , author=. 2022 , eprint=

  14. [50]

    Conference on Empirical Methods in Natural Language Processing , year=

    Re3: Generating Longer Stories With Recursive Reprompting and Revision , author=. Conference on Empirical Methods in Natural Language Processing , year=

  15. [51]

    2022 , eprint=

    OPT: Open Pre-trained Transformer Language Models , author=. 2022 , eprint=

  16. [52]

    2023 , eprint=

    GPTScore: Evaluate as You Desire , author=. 2023 , eprint=

  17. [53]

    2023 , eprint=

    Least-to-Most Prompting Enables Complex Reasoning in Large Language Models , author=. 2023 , eprint=

  18. [54]

    2020 , eprint=

    Language Models are Few-Shot Learners , author=. 2020 , eprint=

  19. [55]

    2021 , eprint=

    TruthfulQA: Measuring How Models Mimic Human Falsehoods , author=. 2021 , eprint=

  20. [56]

    Training Verifiers to Solve Math Word Problems

    Training Verifiers to Solve Math Word Problems , author=. arXiv preprint arXiv:2110.14168 , year=

  21. [57]

    Thirty-Fourth AAAI Conference on Artificial Intelligence , year =

    Yonatan Bisk and Rowan Zellers and Ronan Le Bras and Jianfeng Gao and Yejin Choi , title =. Thirty-Fourth AAAI Conference on Artificial Intelligence , year =

  22. [58]

    Conference on Empirical Methods in Natural Language Processing , year=

    Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering , author=. Conference on Empirical Methods in Natural Language Processing , year=

  23. [59]

    Naftali Tishby and Noga Zaslavsky

    Talmor, Alon and Herzig, Jonathan and Lourie, Nicholas and Berant, Jonathan. C ommonsense QA : A Question Answering Challenge Targeting Commonsense Knowledge. Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019. doi:10.18653/...

  24. [60]

    2023 , eprint=

    AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback , author=. 2023 , eprint=

  25. [61]

    2023 , eprint=

    G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment , author=. 2023 , eprint=

  26. [62]

    2022 , url=

    ChatGPT: Optimizing Language Models for Dialogue , author=. 2022 , url=

  27. [63]

    2023 , eprint=

    GPT-4 Technical Report , author=. 2023 , eprint=

  28. [64]

    Proceedings of the international AAAI conference on web and social media , volume=

    The pushshift reddit dataset , author=. Proceedings of the international AAAI conference on web and social media , volume=

  29. [65]

    2023 , eprint=

    Self-Refine: Iterative Refinement with Self-Feedback , author=. 2023 , eprint=

  30. [66]

    2023 , howpublished =

    Ye, Seonghyeon and Jo, Yongrae and Kim, Doyoung and Kim, Sungdong and Hwang, Hyeonbin and Seo, Minjoon , title =. 2023 , howpublished =

  31. [67]

    2022 , eprint=

    Large Language Models Can Self-Improve , author=. 2022 , eprint=

  32. [68]

    2023 , eprint=

    CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing , author=. 2023 , eprint=

  33. [69]

    The Eleventh International Conference on Learning Representations , year=

    Generating Sequences by Learning to Self-Correct , author=. The Eleventh International Conference on Learning Representations , year=

  34. [70]

    2022 , eprint=

    Self-critiquing models for assisting human evaluators , author=. 2022 , eprint=

  35. [71]

    2023 , eprint=

    Large Language Models are not Fair Evaluators , author=. 2023 , eprint=

  36. [72]

    2023 , eprint=

    How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources , author=. 2023 , eprint=

  37. [73]

    Hashimoto , title =

    Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , publisher =

  38. [74]

    arXiv preprint arXiv:2212.09968 , year=

    On Improving Summarization Factual Consistency from Natural Language Feedback , author=. arXiv preprint arXiv:2212.09968 , year=

  39. [75]

    EMNLP , year=

    Explaining Answers with Entailment Trees , author=. EMNLP , year=

  40. [76]

    Proofwriter: Generating implications, proofs, and abductive statements over natural language

    Proofwriter: Generating implications, proofs, and abductive statements over natural language , author=. arXiv preprint arXiv:2012.13048 , year=

  41. [77]

    Proceedings of the AAAI conference on artificial intelligence , volume=

    Piqa: Reasoning about physical commonsense in natural language , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

  42. [78]

    arXiv preprint arXiv:1909.00277 , year=

    Cosmos QA: Machine reading comprehension with contextual commonsense reasoning , author=. arXiv preprint arXiv:1909.00277 , year=

  43. [79]

    Camburu, Oana-Maria and Rockt. e-. Advances in Neural Information Processing Systems , volume=

  44. [80]

    Adversarial

    Nie, Yixin and Williams, Adina and Dinan, Emily and Bansal, Mohit and Weston, Jason and Kiela, Douwe , journal=. Adversarial

  45. [81]

    News summarization and evaluation in the era of

    Goyal, Tanya and Li, Junyi Jessy and Durrett, Greg , journal=. News summarization and evaluation in the era of

  46. [82]

    Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021

  47. [83]

    SocialIQA: Commonsense Reasoning about Social Interactions

    Socialiqa: Commonsense reasoning about social interactions , author=. arXiv preprint arXiv:1904.09728 , year=

  48. [84]

    BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

    BoolQ: Exploring the surprising difficulty of natural yes/no questions , author=. arXiv preprint arXiv:1905.10044 , year=

  49. [85]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Think you have solved question answering? try arc, the ai2 reasoning challenge , author=. arXiv preprint arXiv:1803.05457 , year=

  50. [86]

    RACE: Large-scale ReAding Comprehension Dataset From Examinations

    RACE: Large-scale ReAding Comprehension Dataset From Examinations , author=. arXiv preprint arXiv:1704.04683 , year=

  51. [87]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    HellaSwag: Can a machine really finish your sentence? , author=. arXiv preprint arXiv:1905.07830 , year=

  52. [88]

    Communications of the ACM , volume=

    Winogrande: An adversarial winograd schema challenge at scale , author=. Communications of the ACM , volume=. 2021 , publisher=

  53. [89]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension , author=. arXiv preprint arXiv:1705.03551 , year=

  54. [90]

    Transactions of the Association for Computational Linguistics , volume=

    Natural questions: a benchmark for question answering research , author=. Transactions of the Association for Computational Linguistics , volume=. 2019 , publisher=

  55. [91]

    Zhou, Chunting and Liu, Pengfei and Xu, Puxin and Iyer, Srini and Sun, Jiao and Mao, Yuning and Ma, Xuezhe and Efrat, Avia and Yu, Ping and Yu, Lili and others , journal=

  56. [92]

    arXiv preprint arXiv:2303.12767 , year=

    Can we trust the evaluation on ChatGPT? , author=. arXiv preprint arXiv:2303.12767 , year=

  57. [93]

    and Stoica, Ion and Xing, Eric P

    Chiang, Wei-Lin and Li, Zhuohan and Lin, Zi and Sheng, Ying and Wu, Zhanghao and Zhang, Hao and Zheng, Lianmin and Zhuang, Siyuan and Zhuang, Yonghao and Gonzalez, Joseph E. and Stoica, Ion and Xing, Eric P. , month =. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90\ url =

  58. [94]

    2023 , eprint=

    Instruction Tuning with GPT-4 , author=. 2023 , eprint=

  59. [95]

    2023 , eprint=

    Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision , author=. 2023 , eprint=

  60. [96]

    GitHub repository , howpublished =

    Wang, Yidong and Yu, Zhuohao and Zeng, Zhengran and Yang, Linyi and Heng, Qiang and Wang, Cunxiang and Chen, Hao and Jiang, Chaoya and Xie, Rui and Wang, Jindong and Xie, Xing and Ye, Wei and Zhang, Shikun and Zhang, Yue , title =. GitHub repository , howpublished =. 2023 , publisher =

  61. [97]

    Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs , abstractNote=

    Xiong, Miao and Hu, Zhiyuan and Lu, Xinyang and Li, Yifei and Fu, Jie and He, Junxian and Hooi, Bryan , year=. Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs , abstractNote=

  62. [98]

    2022 , eprint=

    ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning , author=. 2022 , eprint=

  63. [99]

    2022 , eprint=

    Generating Sequences by Learning to Self-Correct , author=. 2022 , eprint=

  64. [100]

    2023 , eprint=

    Progressive-Hint Prompting Improves Reasoning in Large Language Models , author=. 2023 , eprint=

  65. [101]

    2023 , eprint=

    Large Language Models are Zero-Shot Reasoners , author=. 2023 , eprint=

  66. [102]

    Distilling Reasoning Capabilities into Smaller Language Models

    Shridhar, Kumar and Stolfo, Alessandro and Sachan, Mrinmaya. Distilling Reasoning Capabilities into Smaller Language Models. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.441

  67. [103]

    2023 , eprint=

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , author=. 2023 , eprint=

  68. [104]

    Automatic Generation of Socratic Subquestions for Teaching Math Word Problems

    Shridhar, Kumar and Macina, Jakub and El-Assady, Mennatallah and Sinha, Tanmay and Kapur, Manu and Sachan, Mrinmaya. Automatic Generation of Socratic Subquestions for Teaching Math Word Problems. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022. doi:10.18653/v1/2022.emnlp-main.277

  69. [105]

    2023 , eprint=

    Llama 2: Open Foundation and Fine-Tuned Chat Models , author=. 2023 , eprint=

  70. [106]

    I-Chun Chern, Steffi Chern, Shiqi Chen, Weizhe Yuan, Kehua Feng, Chunting Zhou, Junxian He, Graham Neubig, and Pengfei Liu

    FacTool: Factuality Detection in Generative AI--A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios , author=. arXiv preprint arXiv:2307.13528 , year=

  71. [107]

    arXiv preprint arXiv:2305.18248 , year=

    Do Language Models Know When They're Hallucinating References? , author=. arXiv preprint arXiv:2305.18248 , year=

  72. [108]

    Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback

    Check your facts and try again: Improving large language models with external knowledge and automated feedback , author=. arXiv preprint arXiv:2302.12813 , year=

  73. [109]

    Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig

    Active retrieval augmented generation , author=. arXiv preprint arXiv:2305.06983 , year=

  74. [110]

    Transactions of the Association for Computational Linguistics , volume=

    Reducing conversational agents’ overconfidence through linguistic calibration , author=. Transactions of the Association for Computational Linguistics , volume=. 2022 , publisher=

  75. [111]

    arXiv preprint arXiv:2306.00186 , year=

    Factually Consistent Summarization via Reinforcement Learning with Textual Entailment Feedback , author=. arXiv preprint arXiv:2306.00186 , year=

  76. [112]

    SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models

    Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models , author=. arXiv preprint arXiv:2303.08896 , year=

  77. [113]

    2023 , publisher=

    Truth-O-Meter: Collaborating with LLM in Fighting its Hallucinations , author=. 2023 , publisher=

  78. [114]

    Computational Linguistics , pages=

    Measuring attribution in natural language generation models , author=. Computational Linguistics , pages=

  79. [115]

    arXiv preprint arXiv:2307.03987 , year=

    A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation , author=. arXiv preprint arXiv:2307.03987 , year=

  80. [116]

    arXiv preprint arXiv:2306.01693 , year=

    Fine-Grained Human Feedback Gives Better Rewards for Language Model Training , author=. arXiv preprint arXiv:2306.01693 , year=

Showing first 80 references.