arxiv: 2309.11495 · v2 · pith:46MPX235new · submitted 2023-09-20 · 💻 cs.CL · cs.AI

Chain-of-Verification Reduces Hallucination in Large Language Models

Shehzaad Dhuliawala , Mojtaba Komeili , Jing Xu , Roberta Raileanu , Xian Li , Asli Celikyilmaz , Jason Weston This is my paper

Pith reviewed 2026-05-18 01:03 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords hallucination reductionlarge language modelsself-verificationchain of verificationfact checkingresponse generationfactual accuracy

0 comments

The pith

A four-step self-check process lets language models correct factual errors in their own answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a Chain-of-Verification method in which a language model first produces a draft response, then generates a set of verification questions aimed at checking that draft, answers those questions without reference to the draft, and finally produces a revised response that incorporates the verification results. This sequence is meant to catch and remove incorrect factual claims that would otherwise appear in the output. Experiments apply the method to list questions drawn from Wikidata, closed-book multi-span question answering, and long-form text generation, showing lower rates of hallucinated information in each case. If the independent verification step reliably prevents the original draft from influencing the checks, the approach offers an internal way to improve factual accuracy without external retrieval or model retraining.

Core claim

The central claim is that language models can reduce hallucinations by following a four-step Chain-of-Verification procedure: drafting an initial response, planning verification questions to fact-check the draft, answering those questions independently so the answers are not biased by the draft, and then generating a final verified response that draws on the independent answers.

What carries the argument

The Chain-of-Verification procedure, a four-step sequence of drafting, planning verification questions, independent answering, and final response generation that isolates verification from the initial draft.

If this is right

The method lowers the number of incorrect facts produced when answering list-based questions from Wikidata.
It improves accuracy on closed-book MultiSpanQA by catching unsupported spans through separate verification.
It decreases the amount of invented content in long-form text generation outputs.
The four-step process works across these different task types using the same prompting structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The independent-answer step could be tested on tasks involving logical consistency rather than pure facts to see if the same separation helps.
Integrating the verification questions into a single forward pass might reduce the number of model calls needed while preserving the benefit.
The approach might combine with retrieval methods by using the verification answers to decide when to fetch external evidence.

Load-bearing premise

The model can generate verification questions and answer them in a way that remains unaffected by its own initial draft response.

What would settle it

A direct comparison experiment in which verification questions are answered once with the draft hidden and once with the draft visible, showing no reduction in final hallucinations when the draft is hidden.

read the original abstract

Generation of plausible yet incorrect factual information, termed hallucination, is an unsolved issue in large language models. We study the ability of language models to deliberate on the responses they give in order to correct their mistakes. We develop the Chain-of-Verification (CoVe) method whereby the model first (i) drafts an initial response; then (ii) plans verification questions to fact-check its draft; (iii) answers those questions independently so the answers are not biased by other responses; and (iv) generates its final verified response. In experiments, we show CoVe decreases hallucinations across a variety of tasks, from list-based questions from Wikidata, closed book MultiSpanQA and longform text generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CoVe is a clean four-step prompting recipe that shows measurable drops in hallucinations on the tested tasks, but the independence of the verification answers is the part that still feels under-specified.

read the letter

The paper introduces Chain-of-Verification as a simple workflow: draft an answer, generate verification questions, answer those questions without the draft in context, then synthesize a final response. That independent-answer step is the main addition over plain chain-of-thought or self-critique methods already in the literature. On the tasks they report—Wikidata list questions, closed-book MultiSpanQA, and long-form generation—the numbers show lower hallucination rates than the baselines they compare against. The method is easy to implement and does not require extra training or external tools, which is a practical plus for anyone already using prompting to improve factual output. The experiments are run across a few model sizes and the gains look consistent enough to be worth trying in similar settings. The soft spot is exactly the one the stress-test flagged. The paper claims the verification answers are produced independently so they are not biased by the draft, but the description does not spell out whether this is done in separate sessions, with context cleared, or just by writing a new prompt that still contains the original query. If the model state carries over at all, the verification step risks becoming a rephrasing of the same internal knowledge rather than an external check. That assumption is load-bearing for the central claim, and the current write-up leaves it implicit. The paper is aimed at practitioners who want a lightweight way to reduce factual errors in LLM outputs without changing the model itself. Readers working on reliability for QA or generation tasks will find it directly usable. It is solid enough on the prompting side and has enough empirical backing that a serious editor should send it out for review rather than desk-reject it; the main revisions would be around clarifying the independence mechanism and adding more statistical detail on the results.

Referee Report

1 major / 2 minor

Summary. The paper proposes Chain-of-Verification (CoVe), a four-step prompting procedure in which an LLM first drafts a response, plans verification questions to fact-check the draft, answers those questions independently, and finally produces a verified response. Experiments are reported to show reduced hallucinations on list-based Wikidata questions, closed-book MultiSpanQA, and longform text generation.

Significance. If the empirical gains hold under rigorous controls, CoVe would constitute a simple, training-free prompting technique that improves factual accuracy across multiple task types without external retrieval or model changes. The multi-task scope and focus on a practical mitigation strategy for a core LLM limitation would make the result noteworthy for the field.

major comments (1)

[CoVe procedure description] The central claim that CoVe reduces hallucinations rests on step (iii) producing verification answers that are unbiased by the initial draft. The manuscript states that questions are answered 'independently so the answers are not biased by other responses,' yet provides no explicit mechanism (separate sessions, context clearing, or prompt isolation) to enforce this independence. If verification prompts retain the original query or any latent state from the draft, the observed gains could arise from re-generation rather than external fact-checking.

minor comments (2)

Add the exact prompt templates and any context-management details used in the experiments to support reproducibility.
Report statistical significance tests and confidence intervals for the hallucination reductions rather than raw percentages alone.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive review and recommendation for minor revision. We address the major comment below regarding the independence of verification answers in the CoVe procedure.

read point-by-point responses

Referee: The central claim that CoVe reduces hallucinations rests on step (iii) producing verification answers that are unbiased by the initial draft. The manuscript states that questions are answered 'independently so the answers are not biased by other responses,' yet provides no explicit mechanism (separate sessions, context clearing, or prompt isolation) to enforce this independence. If verification prompts retain the original query or any latent state from the draft, the observed gains could arise from re-generation rather than external fact-checking.

Authors: We appreciate the referee's observation on this point. In the experimental implementation described in Section 3, each verification question is issued via an independent model call (or separate prompt session) that includes solely the verification question itself, with no inclusion of the original user query, the draft response, or any prior context from the conversation history. This design ensures the verification answers are generated without influence from the initial draft. We have updated the manuscript to explicitly detail this isolation mechanism, including example prompts, to address the concern. revision: yes

Circularity Check

0 steps flagged

CoVe presented as procedural recipe with separate empirical tests; no derivation reduces to inputs by construction

full rationale

The paper defines Chain-of-Verification explicitly as a four-step procedure (draft, plan verifications, answer independently, generate final response) and supports its effectiveness via experiments on Wikidata lists, MultiSpanQA, and longform generation. No equations, fitted parameters, or self-referential definitions appear in the method; the independence claim in step (iii) is stated as a design choice rather than derived from prior steps. No self-citation chains or uniqueness theorems are invoked to justify the core procedure. The work is therefore self-contained as an empirical method proposal without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the effectiveness of a prompting strategy rather than new mathematical constructs or fitted parameters.

axioms (1)

domain assumption Large language models can generate relevant verification questions and provide answers to them that are not biased by an initial draft response.
This assumption enables the independent answering step in the CoVe method.

pith-pipeline@v0.9.0 · 5666 in / 1225 out tokens · 37360 ms · 2026-05-18T01:03:08.207655+00:00 · methodology

discussion (0)

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
cs.CL 2026-05 unverdicted novelty 8.0

REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents
cs.AI 2025-12 accept novelty 8.0

MobiBench is the first modular multi-path offline benchmark for mobile GUI agents, achieving 94.72% agreement with human evaluators while allowing component-level analysis.
Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents
cs.AI 2026-04 unverdicted novelty 7.0

Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summa...
Credo: Declarative Control of LLM Pipelines via Beliefs and Policies
cs.AI 2026-04 unverdicted novelty 7.0

Credo proposes representing LLM agent state as beliefs and regulating pipeline behavior with declarative policies stored in a database for adaptive, auditable control.
Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents
cs.AI 2026-04 unverdicted novelty 7.0

PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.
Illocutionary Explanation Planning for Source-Faithful Explanations in Retrieval-Augmented Language Models
cs.CL 2026-03 conditional novelty 7.0

Chain-of-illocution prompting improves source adherence in RAG explanations for programming education by up to 63% over baselines.
Hallucination is Inevitable: An Innate Limitation of Large Language Models
cs.CL 2024-01 conditional novelty 7.0

Hallucinations are inevitable in LLMs because they cannot learn all computable functions according to learning theory.
Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agentic Systems
cs.CL 2026-04 unverdicted novelty 6.0

Compositional selective specificity (CSS) improves overcommitment-aware utility from 0.846 to 0.913 on LongFact while retaining 0.938 specificity by calibrating claim-level backoffs in agentic AI responses.
Perceive, Verify and Understand Long Video: Multi-Granular Perception and Active Verification via Interactive Agents
cs.CV 2025-09 unverdicted novelty 6.0

CogniGPT uses an interactive loop between a Multi-Granular Perception Agent and an Active Verification Agent to identify reliable clues in long videos with high accuracy and low frame usage.
Towards an AI co-scientist
cs.AI 2025-02 unverdicted novelty 6.0

A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs
cs.CL 2024-06 unverdicted novelty 6.0

SEPs approximate semantic entropy from single-generation hidden states to enable cheap and robust hallucination detection in LLMs.
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
cs.CL 2023-10 unverdicted novelty 6.0

Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.
HalluScan: A Systematic Benchmark for Detecting and Mitigating Hallucinations in Instruction-Following LLMs
cs.CL 2026-05 unverdicted novelty 5.0

HalluScan benchmark tests hallucination detectors on LLMs, identifies NLI Verification as top performer with 0.88 AUROC, and introduces HalluScore (r=0.41 with humans) plus a routing method for 2x cost savings.
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
cs.CL 2023-11 unverdicted novelty 5.0

The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)
cs.LG 2026-04 unverdicted novelty 4.0

HUMBR reduces LLM hallucinations in enterprise workflows by using a hybrid semantic-lexical utility within minimum Bayes risk decoding to identify consensus outputs, with derived error bounds and reported outperforman...
Hallucination Detection and Evaluation of Large Language Model
cs.CL 2025-12 unverdicted novelty 4.0

HHEM delivers fast hallucination detection in LLMs via classification, cutting evaluation time from 8 hours to 10 minutes with up to 82.2% accuracy while adding segment retrieval for summarization.
Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models
cs.CL 2023-09 unverdicted novelty 4.0

A literature survey that taxonomizes hallucination phenomena in LLMs, reviews evaluation benchmarks, and analyzes approaches for their detection, explanation, and mitigation.
A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications
cs.AI 2024-02 unverdicted novelty 3.0

A systematic survey categorizes prompt engineering methods for LLMs and VLMs by application area, summarizing methodologies, applications, models, datasets, strengths, and limitations for each technique along with a t...

Reference graph

Works this paper leans on

264 extracted references · 264 canonical work pages · cited by 18 Pith papers · 24 internal anchors

[6]

Truth-o-meter: Collaborating with llm in fighting its hallucinations

Boris A Galitsky. Truth-o-meter: Collaborating with llm in fighting its hallucinations. 2023

work page 2023
[7]

Rarr: Researching and revising what language models say, using language models

Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, et al. Rarr: Researching and revising what language models say, using language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 16477--16...

work page 2023
[9]

Survey of hallucination in natural language generation

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55 0 (12): 0 1--38, 2023

work page 2023
[14]

Multispanqa: A dataset for multi-span question answering

Haonan Li, Martin Tomko, Maria Vasardani, and Timothy Baldwin. Multispanqa: A dataset for multi-span question answering. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.\ 1250--1260, 2022

work page 2022
[23]

Reducing conversational agents’ overconfidence through linguistic calibration

Sabrina J Mielke, Arthur Szlam, Emily Dinan, and Y-Lan Boureau. Reducing conversational agents’ overconfidence through linguistic calibration. Transactions of the Association for Computational Linguistics, 10: 0 857--872, 2022

work page 2022
[25]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35: 0 27730--27744, 2022

work page 2022
[29]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1 0 (8): 0 9, 2019

work page 2019
[31]

Measuring attribution in natural language generation models

Hannah Rashkin, Vitaly Nikolaev, Matthew Lamm, Lora Aroyo, Michael Collins, Dipanjan Das, Slav Petrov, Gaurav Singh Tomar, Iulia Turc, and David Reitter. Measuring attribution in natural language generation models. Computational Linguistics, pp.\ 1--66, 2023

work page 2023
[36]

Contrastive learning reduces hallucination in conversations

Weiwei Sun, Zhengliang Shi, Shen Gao, Pengjie Ren, Maarten de Rijke, and Zhaochun Ren. Contrastive learning reduces hallucination in conversations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp.\ 13618--13626, 2023 b

work page 2023
[38]

Llama 2: Open foundation and fine-tuned chat models, 2023 b

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

work page 2023
[42]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35: 0 24824--24837, 2022

work page 2022
[48]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter , title =. CoRR , volume =. 2017 , url =. 1711.05101 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv 2017
[49]

2022 , eprint=

PaLM: Scaling Language Modeling with Pathways , author=. 2022 , eprint=

work page 2022
[50]

Conference on Empirical Methods in Natural Language Processing , year=

Re3: Generating Longer Stories With Recursive Reprompting and Revision , author=. Conference on Empirical Methods in Natural Language Processing , year=

work page
[51]

2022 , eprint=

OPT: Open Pre-trained Transformer Language Models , author=. 2022 , eprint=

work page 2022
[52]

2023 , eprint=

GPTScore: Evaluate as You Desire , author=. 2023 , eprint=

work page 2023
[53]

2023 , eprint=

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models , author=. 2023 , eprint=

work page 2023
[54]

2020 , eprint=

Language Models are Few-Shot Learners , author=. 2020 , eprint=

work page 2020
[55]

2021 , eprint=

TruthfulQA: Measuring How Models Mimic Human Falsehoods , author=. 2021 , eprint=

work page 2021
[56]

Training Verifiers to Solve Math Word Problems

Training Verifiers to Solve Math Word Problems , author=. arXiv preprint arXiv:2110.14168 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[57]

Thirty-Fourth AAAI Conference on Artificial Intelligence , year =

Yonatan Bisk and Rowan Zellers and Ronan Le Bras and Jianfeng Gao and Yejin Choi , title =. Thirty-Fourth AAAI Conference on Artificial Intelligence , year =

work page
[58]

Conference on Empirical Methods in Natural Language Processing , year=

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering , author=. Conference on Empirical Methods in Natural Language Processing , year=

work page
[59]

Naftali Tishby and Noga Zaslavsky

Talmor, Alon and Herzig, Jonathan and Lourie, Nicholas and Berant, Jonathan. C ommonsense QA : A Question Answering Challenge Targeting Commonsense Knowledge. Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019. doi:10.18653/...

work page doi:10.18653/v1/n19-1421 2019
[60]

2023 , eprint=

AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback , author=. 2023 , eprint=

work page 2023
[61]

2023 , eprint=

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment , author=. 2023 , eprint=

work page 2023
[62]

2022 , url=

ChatGPT: Optimizing Language Models for Dialogue , author=. 2022 , url=

work page 2022
[63]

2023 , eprint=

GPT-4 Technical Report , author=. 2023 , eprint=

work page 2023
[64]

Proceedings of the international AAAI conference on web and social media , volume=

The pushshift reddit dataset , author=. Proceedings of the international AAAI conference on web and social media , volume=

work page
[65]

2023 , eprint=

Self-Refine: Iterative Refinement with Self-Feedback , author=. 2023 , eprint=

work page 2023
[66]

2023 , howpublished =

Ye, Seonghyeon and Jo, Yongrae and Kim, Doyoung and Kim, Sungdong and Hwang, Hyeonbin and Seo, Minjoon , title =. 2023 , howpublished =

work page 2023
[67]

2022 , eprint=

Large Language Models Can Self-Improve , author=. 2022 , eprint=

work page 2022
[68]

2023 , eprint=

CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing , author=. 2023 , eprint=

work page 2023
[69]

The Eleventh International Conference on Learning Representations , year=

Generating Sequences by Learning to Self-Correct , author=. The Eleventh International Conference on Learning Representations , year=

work page
[70]

2022 , eprint=

Self-critiquing models for assisting human evaluators , author=. 2022 , eprint=

work page 2022
[71]

2023 , eprint=

Large Language Models are not Fair Evaluators , author=. 2023 , eprint=

work page 2023
[72]

2023 , eprint=

How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources , author=. 2023 , eprint=

work page 2023
[73]

Hashimoto , title =

Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , publisher =

work page 2023
[74]

arXiv preprint arXiv:2212.09968 , year=

On Improving Summarization Factual Consistency from Natural Language Feedback , author=. arXiv preprint arXiv:2212.09968 , year=

work page arXiv
[75]

EMNLP , year=

Explaining Answers with Entailment Trees , author=. EMNLP , year=

work page
[76]

Proofwriter: Generating implications, proofs, and abductive statements over natural language

Proofwriter: Generating implications, proofs, and abductive statements over natural language , author=. arXiv preprint arXiv:2012.13048 , year=

work page arXiv 2012
[77]

Proceedings of the AAAI conference on artificial intelligence , volume=

Piqa: Reasoning about physical commonsense in natural language , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

work page
[78]

arXiv preprint arXiv:1909.00277 , year=

Cosmos QA: Machine reading comprehension with contextual commonsense reasoning , author=. arXiv preprint arXiv:1909.00277 , year=

work page arXiv 1909
[79]

Camburu, Oana-Maria and Rockt. e-. Advances in Neural Information Processing Systems , volume=

work page
[80]

Adversarial

Nie, Yixin and Williams, Adina and Dinan, Emily and Bansal, Mohit and Weston, Jason and Kiela, Douwe , journal=. Adversarial

work page
[81]

News summarization and evaluation in the era of

Goyal, Tanya and Li, Junyi Jessy and Durrett, Greg , journal=. News summarization and evaluation in the era of

work page
[82]

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021

work page 2021
[83]

SocialIQA: Commonsense Reasoning about Social Interactions

Socialiqa: Commonsense reasoning about social interactions , author=. arXiv preprint arXiv:1904.09728 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1904
[84]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

BoolQ: Exploring the surprising difficulty of natural yes/no questions , author=. arXiv preprint arXiv:1905.10044 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1905
[85]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Think you have solved question answering? try arc, the ai2 reasoning challenge , author=. arXiv preprint arXiv:1803.05457 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[86]

RACE: Large-scale ReAding Comprehension Dataset From Examinations

RACE: Large-scale ReAding Comprehension Dataset From Examinations , author=. arXiv preprint arXiv:1704.04683 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[87]

HellaSwag: Can a Machine Really Finish Your Sentence?

HellaSwag: Can a machine really finish your sentence? , author=. arXiv preprint arXiv:1905.07830 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1905
[88]

Communications of the ACM , volume=

Winogrande: An adversarial winograd schema challenge at scale , author=. Communications of the ACM , volume=. 2021 , publisher=

work page 2021
[89]

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension , author=. arXiv preprint arXiv:1705.03551 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[90]

Transactions of the Association for Computational Linguistics , volume=

Natural questions: a benchmark for question answering research , author=. Transactions of the Association for Computational Linguistics , volume=. 2019 , publisher=

work page 2019
[91]

Zhou, Chunting and Liu, Pengfei and Xu, Puxin and Iyer, Srini and Sun, Jiao and Mao, Yuning and Ma, Xuezhe and Efrat, Avia and Yu, Ping and Yu, Lili and others , journal=

work page
[92]

arXiv preprint arXiv:2303.12767 , year=

Can we trust the evaluation on ChatGPT? , author=. arXiv preprint arXiv:2303.12767 , year=

work page arXiv
[93]

and Stoica, Ion and Xing, Eric P

Chiang, Wei-Lin and Li, Zhuohan and Lin, Zi and Sheng, Ying and Wu, Zhanghao and Zhang, Hao and Zheng, Lianmin and Zhuang, Siyuan and Zhuang, Yonghao and Gonzalez, Joseph E. and Stoica, Ion and Xing, Eric P. , month =. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90\ url =

work page
[94]

2023 , eprint=

Instruction Tuning with GPT-4 , author=. 2023 , eprint=

work page 2023
[95]

2023 , eprint=

Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision , author=. 2023 , eprint=

work page 2023
[96]

GitHub repository , howpublished =

Wang, Yidong and Yu, Zhuohao and Zeng, Zhengran and Yang, Linyi and Heng, Qiang and Wang, Cunxiang and Chen, Hao and Jiang, Chaoya and Xie, Rui and Wang, Jindong and Xie, Xing and Ye, Wei and Zhang, Shikun and Zhang, Yue , title =. GitHub repository , howpublished =. 2023 , publisher =

work page 2023
[97]

Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs , abstractNote=

Xiong, Miao and Hu, Zhiyuan and Lu, Xinyang and Li, Yifei and Fu, Jie and He, Junxian and Hooi, Bryan , year=. Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs , abstractNote=

work page
[98]

2022 , eprint=

ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning , author=. 2022 , eprint=

work page 2022
[99]

2022 , eprint=

Generating Sequences by Learning to Self-Correct , author=. 2022 , eprint=

work page 2022
[100]

2023 , eprint=

Progressive-Hint Prompting Improves Reasoning in Large Language Models , author=. 2023 , eprint=

work page 2023
[101]

2023 , eprint=

Large Language Models are Zero-Shot Reasoners , author=. 2023 , eprint=

work page 2023
[102]

Distilling Reasoning Capabilities into Smaller Language Models

Shridhar, Kumar and Stolfo, Alessandro and Sachan, Mrinmaya. Distilling Reasoning Capabilities into Smaller Language Models. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.441

work page doi:10.18653/v1/2023.findings-acl.441 2023
[103]

2023 , eprint=

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , author=. 2023 , eprint=

work page 2023
[104]

Automatic Generation of Socratic Subquestions for Teaching Math Word Problems

Shridhar, Kumar and Macina, Jakub and El-Assady, Mennatallah and Sinha, Tanmay and Kapur, Manu and Sachan, Mrinmaya. Automatic Generation of Socratic Subquestions for Teaching Math Word Problems. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022. doi:10.18653/v1/2022.emnlp-main.277

work page doi:10.18653/v1/2022.emnlp-main.277 2022
[105]

2023 , eprint=

Llama 2: Open Foundation and Fine-Tuned Chat Models , author=. 2023 , eprint=

work page 2023
[106]

I-Chun Chern, Steffi Chern, Shiqi Chen, Weizhe Yuan, Kehua Feng, Chunting Zhou, Junxian He, Graham Neubig, and Pengfei Liu

FacTool: Factuality Detection in Generative AI--A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios , author=. arXiv preprint arXiv:2307.13528 , year=

work page arXiv
[107]

arXiv preprint arXiv:2305.18248 , year=

Do Language Models Know When They're Hallucinating References? , author=. arXiv preprint arXiv:2305.18248 , year=

work page arXiv
[108]

Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback

Check your facts and try again: Improving large language models with external knowledge and automated feedback , author=. arXiv preprint arXiv:2302.12813 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[109]

Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig

Active retrieval augmented generation , author=. arXiv preprint arXiv:2305.06983 , year=

work page arXiv
[110]

Transactions of the Association for Computational Linguistics , volume=

Reducing conversational agents’ overconfidence through linguistic calibration , author=. Transactions of the Association for Computational Linguistics , volume=. 2022 , publisher=

work page 2022
[111]

arXiv preprint arXiv:2306.00186 , year=

Factually Consistent Summarization via Reinforcement Learning with Textual Entailment Feedback , author=. arXiv preprint arXiv:2306.00186 , year=

work page arXiv
[112]

SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models

Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models , author=. arXiv preprint arXiv:2303.08896 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[113]

2023 , publisher=

Truth-O-Meter: Collaborating with LLM in Fighting its Hallucinations , author=. 2023 , publisher=

work page 2023
[114]

Computational Linguistics , pages=

Measuring attribution in natural language generation models , author=. Computational Linguistics , pages=

work page
[115]

arXiv preprint arXiv:2307.03987 , year=

A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation , author=. arXiv preprint arXiv:2307.03987 , year=

work page arXiv
[116]

arXiv preprint arXiv:2306.01693 , year=

Fine-Grained Human Feedback Gives Better Rewards for Language Model Training , author=. arXiv preprint arXiv:2306.01693 , year=

work page arXiv

Showing first 80 references.