Teaching language models to support answers with verified quotes
Pith reviewed 2026-05-17 11:41 UTC · model grok-4.3
The pith
A 280 billion parameter model can be trained to answer questions with specific cited evidence from documents and to abstain when uncertain.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our 280 billion parameter model, GopherCite, is able to produce answers with high quality supporting evidence and abstain from answering when unsure. The model's response is found to be high-quality 80% of the time on this Natural Questions subset, and 67% of the time on the ELI5 subset. Abstaining from the third of questions for which it is most unsure improves performance to 90% and 80% respectively, approaching human baselines. Analysis on the adversarial TruthfulQA dataset shows why citation is only one part of an overall strategy for safety and trustworthiness: not all claims supported by evidence are true.
What carries the argument
Reinforcement learning from human preferences (RLHP) that rewards the generation of answers together with direct quotes drawn from multiple retrieved documents or a single user-provided document.
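One way to make "direct quotes" concrete is a post-hoc check that each quote really appears in one of the supplied documents. The sketch below is illustrative only and is not the paper's mechanism: the function names, the whitespace and case normalization, and the example documents are all assumptions introduced here.

```python
import re
from typing import Dict, Optional


def normalize(text: str) -> str:
    """Collapse whitespace runs and lowercase, so line wrapping does not break matching."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_quote_source(quote: str, documents: Dict[str, str]) -> Optional[str]:
    """Return the title of the first document containing the quote verbatim, else None.

    `documents` maps a (hypothetical) title to its body text.
    """
    needle = normalize(quote)
    if not needle:
        return None
    for title, body in documents.items():
        if needle in normalize(body):
            return title
    return None


# Made-up documents for illustration; not data from the paper.
docs = {
    "Mirrors": "Breaking a mirror is said to bring\nseven years of bad luck in folklore.",
    "Glass": "Glass is an amorphous solid.",
}
supported = find_quote_source("seven years of bad luck", docs)
unsupported = find_quote_source("mirrors are portals", docs)
```

A check like this confirms only that a quote exists in a source. It says nothing about whether the quote supports the claim, which is the gap the TruthfulQA analysis points to.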
If this is right
- Users can directly inspect the quoted passages to assess whether an answer is correct.
- Abstention on the most uncertain questions measurably raises the fraction of high-quality responses.
- The same training method works whether evidence comes from a search engine or from a document the user supplies.
- Citation alone does not guarantee truth, as shown by results on adversarial factual questions.
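The abstention effect in the list above can be sketched as selective answering: rank questions by a confidence score, skip the lowest-scoring fraction, and measure quality on the rest. This is a minimal sketch under stated assumptions; the function name, the scores, and the ratings are invented for illustration and do not come from the paper.

```python
def quality_with_abstention(scores, is_high_quality, abstain_fraction):
    """Fraction of high-quality answers after skipping the least confident questions.

    scores: one confidence score per question (higher means more confident).
    is_high_quality: one boolean human rating per question.
    abstain_fraction: share of questions to skip, in [0, 1).
    """
    if len(scores) != len(is_high_quality):
        raise ValueError("scores and ratings must have the same length")
    if not 0 <= abstain_fraction < 1:
        raise ValueError("abstain_fraction must be in [0, 1)")
    n = len(scores)
    n_abstain = round(n * abstain_fraction)  # rounded to the nearest whole question
    order = sorted(range(n), key=lambda i: scores[i])  # least confident first
    answered = order[n_abstain:]
    if not answered:
        raise ValueError("no questions left to answer")
    return sum(is_high_quality[i] for i in answered) / len(answered)


# Toy data (assumed): six questions, half rated high quality.
toy_scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.6]
toy_ratings = [True, False, True, False, True, False]
baseline = quality_with_abstention(toy_scores, toy_ratings, 0.0)
after_abstaining = quality_with_abstention(toy_scores, toy_ratings, 1 / 3)
```

On this toy data the two lowest-scoring questions happen to be low quality, so skipping them raises the rate. Real scores would only produce such a lift if they track answer quality.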
Where Pith is reading between the lines
- This approach could be combined with other verification layers to handle cases where evidence appears to back an incorrect claim.
- The method might extend to domains beyond question answering where users need traceable support for model statements.
- Larger models trained in the same way could further reduce the rate at which unsupported claims appear.
Load-bearing premise
Human preferences expressed during training will reliably translate into citations that actually support the model's claims and into an uncertainty signal that correctly identifies questions the model should skip.
What would settle it
A dataset of model outputs in which a large fraction of the supplied quotes do not actually support the stated answer or in which abstention fails to raise the human-rated quality percentage.
Original abstract
Recent large language models often answer factual questions correctly. But users can't trust any given claim a model makes without fact-checking, because language models can hallucinate convincing nonsense. In this work we use reinforcement learning from human preferences (RLHP) to train "open-book" QA models that generate answers whilst also citing specific evidence for their claims, which aids in the appraisal of correctness. Supporting evidence is drawn from multiple documents found via a search engine, or from a single user-provided document. Our 280 billion parameter model, GopherCite, is able to produce answers with high quality supporting evidence and abstain from answering when unsure. We measure the performance of GopherCite by conducting human evaluation of answers to questions in a subset of the NaturalQuestions and ELI5 datasets. The model's response is found to be high-quality 80% of the time on this Natural Questions subset, and 67% of the time on the ELI5 subset. Abstaining from the third of questions for which it is most unsure improves performance to 90% and 80% respectively, approaching human baselines. However, analysis on the adversarial TruthfulQA dataset shows why citation is only one part of an overall strategy for safety and trustworthiness: not all claims supported by evidence are true.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper describes the development of GopherCite, a 280 billion parameter language model that uses reinforcement learning from human preferences (RLHP) to generate answers to factual questions while providing specific citations to supporting evidence from search results or user documents. The model can also abstain from answering when it is uncertain. Through human evaluations on subsets of the Natural Questions and ELI5 datasets, the authors report that responses are high-quality 80% of the time on Natural Questions and 67% on ELI5, with these figures improving to 90% and 80% respectively when abstaining on the third of questions where the model is most unsure. The paper also discusses limitations using the TruthfulQA dataset, noting that cited evidence does not always ensure truthfulness.
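A back-of-envelope reading of these figures: if the before and after percentages are measured on the same question set and exactly one third of questions is dropped (both are assumptions, not statements from the paper), the quality rate on the abstained third follows from a weighted average. The symbols below are introduced here for illustration.

```latex
% q_all:  quality rate over all questions
% q_kept: quality rate over the answered two thirds
% q_drop: implied quality rate over the abstained third (derived, not reported)
\begin{align*}
q_{\text{all}} &= \tfrac{2}{3}\, q_{\text{kept}} + \tfrac{1}{3}\, q_{\text{drop}}
\quad\Longrightarrow\quad
q_{\text{drop}} = 3\, q_{\text{all}} - 2\, q_{\text{kept}} \\
\text{Natural Questions:}\quad q_{\text{drop}} &= 3(0.80) - 2(0.90) = 0.60 \\
\text{ELI5:}\quad q_{\text{drop}} &= 3(0.67) - 2(0.80) = 0.41
\end{align*}
```

Under those assumptions the skipped questions would be rated high quality roughly 60% and 41% of the time, noticeably below the rates on the questions that are answered.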
Significance. Should the central results hold up under scrutiny, this work is significant for the field of trustworthy AI and natural language processing. It provides a concrete method for large language models to support their claims with verifiable quotes and to selectively abstain, which could substantially increase user trust in model outputs. The empirical gains from abstention and the use of RLHP for citation quality represent practical advances, and the honest discussion of limitations via TruthfulQA adds value. Strengths include the scale of the model and evaluation on standard datasets with clear percentage improvements.
major comments (2)
- [Results on abstention] The headline result that abstaining on the most uncertain third of questions improves performance from 80% to 90% (Natural Questions subset) and 67% to 80% (ELI5 subset) is load-bearing for the claim that the model can 'abstain when unsure.' The manuscript does not provide a calibration analysis (e.g., human quality ratings binned by uncertainty levels on held-out data) to confirm that the uncertainty signal (log-probabilities, RLHP reward model, or auxiliary head) is monotonically related to actual error rate rather than surface features such as question length or retrieval score.
- [Human evaluation] Human evaluation results are reported on standard datasets with clear percentage improvements from abstention; however, details on the uncertainty estimation method, inter-rater agreement, and exact rating criteria for 'high quality supporting evidence' are insufficient. This information is necessary to assess whether the 80%/67% baselines and the abstention lifts are robust.
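The calibration analysis requested in the first major comment could look roughly like the following: sort questions by the uncertainty-derived score, split them into equal-count bins, and check whether the human-rated quality rate rises from bin to bin. This is a hedged sketch; the function names, bin scheme, and toy data are assumptions and are not taken from the manuscript.

```python
def binned_quality(scores, is_high_quality, n_bins):
    """Quality rate per equal-count bin, with bins ordered from lowest to highest score."""
    n = len(scores)
    if n != len(is_high_quality):
        raise ValueError("scores and ratings must have the same length")
    if n_bins < 1 or n < n_bins:
        raise ValueError("need at least one example per bin")
    order = sorted(range(n), key=lambda i: scores[i])
    rates = []
    for b in range(n_bins):
        lo = b * n // n_bins
        hi = (b + 1) * n // n_bins
        members = order[lo:hi]
        rates.append(sum(is_high_quality[i] for i in members) / len(members))
    return rates


def is_monotone_non_decreasing(values):
    """True if each value is at least as large as the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Toy data (assumed): nine questions with scores 0.1 .. 0.9.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
toy_ratings = [False, False, True, False, True, True, True, True, True]
rates = binned_quality(toy_scores, toy_ratings, n_bins=3)
monotone = is_monotone_non_decreasing(rates)
```

The same binning could be repeated with question length or retrieval score as the sort key to probe the surface-feature concern the referee raises.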
minor comments (2)
- [Abstract] The abstract could more explicitly summarize the key takeaway from the TruthfulQA analysis regarding the limitations of citation for ensuring overall truthfulness.
- [Presentation] Ensure all acronyms (e.g., RLHP) are defined on first use and that figure captions clearly indicate the subsets used for the reported percentages.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments on our manuscript. We address each major comment below and indicate where we will revise the paper to incorporate the feedback.
Point-by-point responses
-
Referee: [Results on abstention] The headline result that abstaining on the most uncertain third of questions improves performance from 80% to 90% (Natural Questions subset) and 67% to 80% (ELI5 subset) is load-bearing for the claim that the model can 'abstain when unsure.' The manuscript does not provide a calibration analysis (e.g., human quality ratings binned by uncertainty levels on held-out data) to confirm that the uncertainty signal (log-probabilities, RLHP reward model, or auxiliary head) is monotonically related to actual error rate rather than surface features such as question length or retrieval score.
Authors: We agree that a calibration analysis would strengthen the interpretation of the abstention results. In the manuscript the uncertainty signal is taken from the RLHP reward model, and abstaining on the bottom third of questions by this score yields a clear lift in human-rated quality on the questions that remain. We did not include binned calibration on held-out data, primarily because of the expense of additional human ratings. In the revision we will add an explicit description of how the uncertainty threshold is derived from the reward model, and we will note the lack of a full monotonicity check as a limitation, while emphasizing that the observed performance improvement provides empirical support for the signal's utility. revision: yes
-
Referee: [Human evaluation] Human evaluation results are reported on standard datasets with clear percentage improvements from abstention; however, details on the uncertainty estimation method, inter-rater agreement, and exact rating criteria for 'high quality supporting evidence' are insufficient. This information is necessary to assess whether the 80%/67% baselines and the abstention lifts are robust.
Authors: We acknowledge that greater transparency on the evaluation protocol is warranted. The current manuscript outlines the human rating process but does not report inter-rater agreement statistics or the precise rubric used for 'high quality supporting evidence.' We will expand the relevant methods and results sections to include (i) the exact formulation of the uncertainty score from the reward model, (ii) inter-rater agreement figures, and (iii) the detailed rating guidelines provided to annotators. These additions will allow readers to better judge the robustness of the reported percentages and the abstention gains. revision: yes
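The rebuttal does not say which inter-rater agreement statistic will be reported. Cohen's kappa for two raters is one common choice, sketched below as an assumption; the function name and the toy ratings are invented for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy ratings (assumed): 1 = "high quality", 0 = "not high quality".
toy_a = [1, 1, 0, 0]
toy_b = [1, 0, 0, 0]
kappa = cohens_kappa(toy_a, toy_b)
```

Here the raters agree on three of four items, but half of that agreement is expected by chance, so kappa comes out at 0.5 rather than 0.75. With more than two raters per item a different statistic would be needed.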
Circularity Check
Empirical human evaluations on held-out data are independent of training objectives
Full rationale
The paper's central results consist of human-rated quality scores (80% on Natural Questions subset, 67% on ELI5) and abstention improvements (to 90% and 80%) measured by external raters on held-out questions. These metrics are not derived from or reduced to the RLHP training process, any fitted parameters, or self-citations by construction. The uncertainty signal selects a subset for abstention, but the reported performance is independently assessed and falsifiable via the external evaluations. No equations, self-definitional loops, or load-bearing self-citations reduce the claims to inputs.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 19 Pith papers
-
Discovering Latent Knowledge in Language Models Without Supervision
An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average acros...
-
Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs
RL on binary rewards boosts LLM factual recall by ~27% relative across models by redistributing probability mass to latent correct answers rather than acquiring new knowledge.
-
ArbGraph: Conflict-Aware Evidence Arbitration for Reliable Long-Form Retrieval-Augmented Generation
ArbGraph resolves conflicts in RAG evidence by constructing a conflict-aware graph of atomic claims and applying intensity-driven iterative arbitration to suppress unreliable claims prior to generation.
-
Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback
Introduces robust estimators for linear Markov games in offline MARLHF that achieve O(ε^{1-o(1)}) or O(√ε) bounds on Nash or CCE gaps under uniform or unilateral coverage.
-
A Generalist Agent
Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
-
Flamingo: a Visual Language Model for Few-Shot Learning
Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
-
From Citation Selection to Citation Absorption: A Measurement Framework for Generative Engine Optimization Across AI Search Platforms
A measurement study of 602 prompts across ChatGPT, Google AI Overview, and Perplexity finds that citation selection breadth and absorption depth diverge, with high-influence pages being longer, structured, and evidence-rich.
-
PRISM: Probing Reasoning, Instruction, and Source Memory in LLM Hallucinations
PRISM benchmark disentangles LLM hallucinations into knowledge missing, knowledge errors, reasoning errors, and instruction-following errors across three generation stages, revealing trade-offs when testing 24 models.
-
Preregistered Belief Revision Contracts
PBRC is a contract protocol that enforces evidential belief updates in deliberative multi-agent systems and proves it prevents conformity-driven false cascades under conservative fallbacks.
-
A Human-Centric Framework for Data Attribution in Large Language Models
Introduces a parameter-driven framework for data attribution in LLMs that enables negotiation among creators, users, and intermediaries to meet stakeholder goals within the data economy.
-
ZeroSearch: Incentivize the Search Capability of LLMs without Searching
ZeroSearch simulates search engine interactions via supervised fine-tuning of a retrieval module and curriculum-based RL degradation of document quality, achieving comparable or superior performance to real search eng...
-
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.
-
Chain-of-Verification Reduces Hallucination in Large Language Models
Chain-of-Verification reduces hallucinations in large language models by drafting responses, planning independent verification questions, answering them separately, and generating a final verified output.
-
Training Diffusion Models with Reinforcement Learning
DDPO uses policy gradients on the denoising process to optimize diffusion models for arbitrary rewards like human feedback or compressibility.
-
Language Models can Solve Computer Tasks
Pre-trained LLMs using recursive criticism and improvement prompting achieve state-of-the-art results on the MiniWoB++ computer task benchmark with only a handful of demonstrations and no task-specific reward function.
-
Language Models (Mostly) Know What They Know
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
-
LLM-Oriented Information Retrieval: A Denoising-First Perspective
Denoising to maximize usable evidence density and verifiability is becoming the primary bottleneck in LLM-oriented information retrieval, conceptualized via a four-stage framework and addressed through a pipeline taxo...
-
Budget-Constrained Online Retrieval-Augmented Generation: The Chunk-as-a-Service Model
Chunk-as-a-Service with the UCOSA online algorithm enables budget-constrained selection of prompts for chunk enrichment in RAG, outperforming random selection by 52% on a combined performance metric and delivering hig...
-
Learning from AVA: Early Lessons from a Curated and Trustworthy Generative AI for Policy and Development Research
AVA is a specialized GenAI platform for development policy research that provides verifiable syntheses from World Bank reports and is associated with 2.4-3.9 hours of weekly time savings in a large-scale user evaluation.