arxiv: 2203.11147 · v1 · pith:W5AW73QHnew · submitted 2022-03-21 · 💻 cs.CL · cs.LG

Teaching language models to support answers with verified quotes

Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, Nat McAleese

Pith reviewed 2026-05-17 11:41 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords: language models, question answering, citations, reinforcement learning, human preferences, factuality, abstention, evidence retrieval

The pith

A 280 billion parameter model can be trained to answer questions with specific cited evidence from documents and to abstain when uncertain.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that reinforcement learning from human preferences can teach language models to generate answers while also retrieving and quoting supporting passages from search results or user documents. This training produces responses judged high quality by human raters 80 percent of the time on a Natural Questions subset and 67 percent on an ELI5 subset. When the model abstains on the third of questions where its uncertainty signal is strongest, those rates rise to 90 and 80 percent. The work addresses the core problem that language models can produce plausible-sounding claims without any attached evidence that users can check.

Core claim

Our 280 billion parameter model, GopherCite, is able to produce answers with high quality supporting evidence and abstain from answering when unsure. The model's response is found to be high-quality 80% of the time on this Natural Questions subset, and 67% of the time on the ELI5 subset. Abstaining from the third of questions for which it is most unsure improves performance to 90% and 80% respectively, approaching human baselines. Analysis on the adversarial TruthfulQA dataset shows why citation is only one part of an overall strategy for safety and trustworthiness: not all claims supported by evidence are true.
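
The abstention arithmetic behind this claim can be illustrated with a small sketch. It assumes each answer carries a scalar confidence score and a binary human rating; the function name `quality_with_abstention` and the toy data are hypothetical and are not taken from the paper. The function drops the lowest-scoring fraction of questions and reports the high-quality rate on the rest.

```python
from typing import List, Tuple

def quality_with_abstention(
    scored: List[Tuple[float, bool]], abstain_fraction: float
) -> float:
    """Fraction of high-quality answers after abstaining on the
    lowest-confidence `abstain_fraction` of questions."""
    if not 0.0 <= abstain_fraction < 1.0:
        raise ValueError("abstain_fraction must be in [0, 1)")
    # Sort most confident first, then keep the top (1 - fraction) share.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    n_keep = max(1, round(len(ranked) * (1.0 - abstain_fraction)))
    kept = ranked[:n_keep]
    return sum(1 for _, good in kept if good) / len(kept)

# Toy data (illustrative only): (confidence score, rated high-quality?)
toy = [
    (0.95, True), (0.90, True), (0.85, True),
    (0.80, True), (0.70, False), (0.60, True),
    (0.40, False), (0.30, True), (0.20, False),
]
full = quality_with_abstention(toy, 0.0)         # answer everything
selective = quality_with_abstention(toy, 1 / 3)  # skip the least confident third
```

On this toy data the rate rises when the least confident third is skipped, but only because the made-up scores happen to track quality; the same computation on a poorly calibrated score would show little or no lift.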

What carries the argument

Reinforcement learning from human preferences (RLHP) that rewards the generation of answers together with direct quotes drawn from multiple retrieved documents or a single user-provided document.
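
One common way to turn pairwise human comparisons into a training signal is a Bradley-Terry style loss on a reward model. The sketch below shows that loss for a single comparison; it is an assumption made for illustration, since this summary does not state the exact objective the paper uses, and the function names are hypothetical.

```python
import math

def preference_probability(reward_preferred: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the preferred response wins."""
    return 1.0 / (1.0 + math.exp(-(reward_preferred - reward_rejected)))

def pairwise_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human's choice under Bradley-Terry."""
    return -math.log(preference_probability(reward_preferred, reward_rejected))

tied = pairwise_loss(0.0, 0.0)        # reward model is indifferent
agrees = pairwise_loss(2.0, -1.0)     # reward model agrees with the rater
disagrees = pairwise_loss(-1.0, 2.0)  # reward model disagrees with the rater
```

The loss is small when the reward model ranks the human-preferred response higher and large when it ranks it lower, which is the property a policy trained against such a reward would rely on.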

If this is right

Users can directly inspect the quoted passages to assess whether an answer is correct.
Abstention on the most uncertain questions measurably raises the fraction of high-quality responses.
The same training method works whether evidence comes from a search engine or from a document the user supplies.
Citation alone does not guarantee truth, as shown by results on adversarial factual questions.
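
The first and last points above can be made concrete with a minimal sketch of quote checking. It assumes a quote counts as verified when it appears verbatim, up to whitespace, in one of the source documents; the helper names and the example documents are hypothetical. A match shows only that the quote exists in a source, not that the quote makes the claim true.

```python
import re
from typing import Dict, Optional

def _normalize(text: str) -> str:
    """Collapse whitespace runs so line wrapping does not break matching."""
    return re.sub(r"\s+", " ", text).strip()

def find_quote_source(quote: str, documents: Dict[str, str]) -> Optional[str]:
    """Return the title of the first document containing `quote` verbatim
    (up to whitespace), or None if no document contains it."""
    needle = _normalize(quote)
    if not needle:
        return None
    for title, body in documents.items():
        if needle in _normalize(body):
            return title
    return None

# Hypothetical documents, for illustration only.
docs = {
    "Mirror": "A mirror is an object that\nreflects an image.",
    "Glass": "Glass is a non-crystalline solid.",
}
supported = find_quote_source("an object that reflects an image", docs)
unsupported = find_quote_source("mirrors bring seven years of bad luck", docs)
```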

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach could be combined with other verification layers to handle cases where evidence appears to back an incorrect claim.
The method might extend to domains beyond question answering where users need traceable support for model statements.
Larger models trained in the same way could further reduce the rate at which unsupported claims appear.

Load-bearing premise

Human preferences expressed during training will reliably translate into citations that actually support the model's claims and into an uncertainty signal that correctly identifies questions the model should skip.

What would settle it

A dataset of model outputs in which a large fraction of the supplied quotes do not actually support the stated answer or in which abstention fails to raise the human-rated quality percentage.

Original abstract

Recent large language models often answer factual questions correctly. But users can't trust any given claim a model makes without fact-checking, because language models can hallucinate convincing nonsense. In this work we use reinforcement learning from human preferences (RLHP) to train "open-book" QA models that generate answers whilst also citing specific evidence for their claims, which aids in the appraisal of correctness. Supporting evidence is drawn from multiple documents found via a search engine, or from a single user-provided document. Our 280 billion parameter model, GopherCite, is able to produce answers with high quality supporting evidence and abstain from answering when unsure. We measure the performance of GopherCite by conducting human evaluation of answers to questions in a subset of the NaturalQuestions and ELI5 datasets. The model's response is found to be high-quality 80% of the time on this Natural Questions subset, and 67% of the time on the ELI5 subset. Abstaining from the third of questions for which it is most unsure improves performance to 90% and 80% respectively, approaching human baselines. However, analysis on the adversarial TruthfulQA dataset shows why citation is only one part of an overall strategy for safety and trustworthiness: not all claims supported by evidence are true.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GopherCite uses RL from human preferences to add citations and abstention to open-book QA. Human ratings show clear lifts from abstaining on the least confident third of questions, though the uncertainty signal lacks direct calibration evidence.

This paper's main result is that RL from human preferences can train a 280B model to answer factual questions while citing specific evidence from retrieved documents or a user-provided one, and to abstain when it detects low confidence. On subsets of Natural Questions and ELI5, human raters judged the responses high-quality 80% and 67% of the time; abstaining on the most uncertain third raised those figures to 90% and 80%. The authors also run a useful check on TruthfulQA that shows citations alone do not guarantee truth.

Referee Report

2 major / 2 minor

Summary. This paper describes the development of GopherCite, a 280 billion parameter language model that uses reinforcement learning from human preferences (RLHP) to generate answers to factual questions while providing specific citations to supporting evidence from search results or user documents. The model can also abstain from answering when it is uncertain. Through human evaluations on subsets of the Natural Questions and ELI5 datasets, the authors report that responses are high-quality 80% of the time on Natural Questions and 67% on ELI5, with these figures improving to 90% and 80% respectively when abstaining on the third of questions where the model is most unsure. The paper also discusses limitations using the TruthfulQA dataset, noting that cited evidence does not always ensure truthfulness.

Significance. Should the central results hold up under scrutiny, this work is significant for the field of trustworthy AI and natural language processing. It provides a concrete method for large language models to support their claims with verifiable quotes and to selectively abstain, which could substantially increase user trust in model outputs. The empirical gains from abstention and the use of RLHP for citation quality represent practical advances, and the honest discussion of limitations via TruthfulQA adds value. Strengths include the scale of the model and evaluation on standard datasets with clear percentage improvements.

major comments (2)

[Results on abstention] The headline result that abstaining on the most uncertain third of questions improves performance from 80% to 90% (Natural Questions subset) and 67% to 80% (ELI5 subset) is load-bearing for the claim that the model can 'abstain when unsure.' The manuscript does not provide a calibration analysis (e.g., human quality ratings binned by uncertainty levels on held-out data) to confirm that the uncertainty signal (log-probabilities, RLHP reward model, or auxiliary head) is monotonically related to actual error rate rather than surface features such as question length or retrieval score.
[Human evaluation] Human evaluation results are reported on standard datasets with clear percentage improvements from abstention; however, details on the uncertainty estimation method, inter-rater agreement, and exact rating criteria for 'high quality supporting evidence' are insufficient. This information is necessary to assess whether the 80%/67% baselines and the abstention lifts are robust.
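
The calibration analysis requested in the first major comment could be sketched as follows. This is a minimal illustration, assuming a per-answer confidence score and a binary human rating; `quality_by_bin`, `is_non_decreasing`, and the toy data are hypothetical names and values, not anything from the manuscript.

```python
from typing import List, Tuple

def quality_by_bin(scored: List[Tuple[float, bool]], n_bins: int) -> List[float]:
    """Split answers into `n_bins` equal-count bins by ascending score and
    return the fraction rated high-quality in each bin."""
    if n_bins < 1 or n_bins > len(scored):
        raise ValueError("need 1 <= n_bins <= number of answers")
    ranked = sorted(scored, key=lambda pair: pair[0])
    rates = []
    for b in range(n_bins):
        lo = b * len(ranked) // n_bins
        hi = (b + 1) * len(ranked) // n_bins
        chunk = ranked[lo:hi]
        rates.append(sum(1 for _, good in chunk if good) / len(chunk))
    return rates

def is_non_decreasing(values: List[float]) -> bool:
    """True when each bin's rate is at least the previous bin's rate."""
    return all(a <= b for a, b in zip(values, values[1:]))

# Toy data (illustrative only): (confidence score, rated high-quality?)
toy = [
    (0.1, False), (0.2, False), (0.3, True),
    (0.4, False), (0.5, True), (0.6, True),
    (0.7, True), (0.8, True), (0.9, True),
]
rates = quality_by_bin(toy, 3)
monotone = is_non_decreasing(rates)
```

A score that passes this check on held-out ratings would support the reading that abstention tracks error rate. Ruling out confounds such as question length would need the same binning repeated on those features.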

minor comments (2)

[Abstract] The abstract could more explicitly summarize the key takeaway from the TruthfulQA analysis regarding the limitations of citation for ensuring overall truthfulness.
[Presentation] Ensure all acronyms (e.g., RLHP) are defined on first use and that figure captions clearly indicate the subsets used for the reported percentages.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our manuscript. We address each major comment below and indicate where we will revise the paper to incorporate the feedback.

Referee: [Results on abstention] The headline result that abstaining on the most uncertain third of questions improves performance from 80% to 90% (Natural Questions subset) and 67% to 80% (ELI5 subset) is load-bearing for the claim that the model can 'abstain when unsure.' The manuscript does not provide a calibration analysis (e.g., human quality ratings binned by uncertainty levels on held-out data) to confirm that the uncertainty signal (log-probabilities, RLHP reward model, or auxiliary head) is monotonically related to actual error rate rather than surface features such as question length or retrieval score.

Authors: We agree that a calibration analysis would strengthen the interpretation of the abstention results. In the manuscript the uncertainty signal is taken from the RLHP reward model, and abstaining on the bottom third of questions by this score yields a clear lift in human-rated quality. We did not include binned calibration on held-out data, primarily because of the expense of additional human ratings. In the revision we will add an explicit description of how the uncertainty threshold is derived from the reward model, and we will note the lack of a full monotonicity check as a limitation while emphasizing that the observed performance improvement provides empirical support for the signal's utility. revision: yes

Referee: [Human evaluation] Human evaluation results are reported on standard datasets with clear percentage improvements from abstention; however, details on the uncertainty estimation method, inter-rater agreement, and exact rating criteria for 'high quality supporting evidence' are insufficient. This information is necessary to assess whether the 80%/67% baselines and the abstention lifts are robust.

Authors: We acknowledge that greater transparency on the evaluation protocol is warranted. The current manuscript outlines the human rating process but does not report inter-rater agreement statistics or the precise rubric used for 'high quality supporting evidence.' We will expand the relevant methods and results sections to include (i) the exact formulation of the uncertainty score from the reward model, (ii) inter-rater agreement figures, and (iii) the detailed rating guidelines provided to annotators. These additions will allow readers to better judge the robustness of the reported percentages and the abstention gains. revision: yes
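
As an illustration of the inter-rater agreement figures mentioned in item (ii), the sketch below computes Cohen's kappa for two raters. The choice of kappa and the two-rater setup are assumptions made here for illustration; the rebuttal does not say which statistic would be reported, and the ratings are made up.

```python
from collections import Counter
from typing import Hashable, Sequence

def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if both raters labelled independently at their own base rates.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Toy ratings (illustrative only): True means "high quality".
a = [True, True, True, False, False, True, False, True]
b = [True, True, False, False, False, True, True, True]
kappa = cohens_kappa(a, b)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when most responses are rated high-quality and raw agreement is high by default.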

Circularity Check

0 steps flagged

Empirical human evaluations on held-out data are independent of training objectives

The paper's central results consist of human-rated quality scores (80% on Natural Questions subset, 67% on ELI5) and abstention improvements (to 90% and 80%) measured by external raters on held-out questions. These metrics are not derived from or reduced to the RLHP training process, any fitted parameters, or self-citations by construction. The uncertainty signal selects a subset for abstention, but the reported performance is independently assessed and falsifiable via the external evaluations. No equations, self-definitional loops, or load-bearing self-citations reduce the claims to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are introduced; the work relies on standard RL from human preferences and retrieval from an external search engine or user document.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Discovering Latent Knowledge in Language Models Without Supervision
cs.CL 2022-12 conditional novelty 8.0

An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average acros...
Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs
cs.CL 2026-05 unverdicted novelty 7.0

RL on binary rewards boosts LLM factual recall by ~27% relative across models by redistributing probability mass to latent correct answers rather than acquiring new knowledge.
ArbGraph: Conflict-Aware Evidence Arbitration for Reliable Long-Form Retrieval-Augmented Generation
cs.CL 2026-04 unverdicted novelty 7.0

ArbGraph resolves conflicts in RAG evidence by constructing a conflict-aware graph of atomic claims and applying intensity-driven iterative arbitration to suppress unreliable claims prior to generation.
Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback
cs.LG 2026-03 unverdicted novelty 7.0

Introduces robust estimators for linear Markov games in offline MARLHF that achieve O(ε^{1-o(1)}) or O(√ε) bounds on Nash or CCE gaps under uniform or unilateral coverage.
A Generalist Agent
cs.AI 2022-05 accept novelty 7.0

Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
Flamingo: a Visual Language Model for Few-Shot Learning
cs.CV 2022-04 unverdicted novelty 7.0

Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
From Citation Selection to Citation Absorption: A Measurement Framework for Generative Engine Optimization Across AI Search Platforms
cs.IR 2026-04 unverdicted novelty 6.0

A measurement study of 602 prompts across ChatGPT, Google AI Overview, and Perplexity finds that citation selection breadth and absorption depth diverge, with high-influence pages being longer, structured, and evidence-rich.
PRISM: Probing Reasoning, Instruction, and Source Memory in LLM Hallucinations
cs.CL 2026-04 unverdicted novelty 6.0

PRISM benchmark disentangles LLM hallucinations into knowledge missing, knowledge errors, reasoning errors, and instruction-following errors across three generation stages, revealing trade-offs when testing 24 models.
Preregistered Belief Revision Contracts
cs.AI 2026-04 unverdicted novelty 6.0

PBRC is a contract protocol that enforces evidential belief updates in deliberative multi-agent systems and proves it prevents conformity-driven false cascades under conservative fallbacks.
A Human-Centric Framework for Data Attribution in Large Language Models
cs.CY 2026-02 unverdicted novelty 6.0

Introduces a parameter-driven framework for data attribution in LLMs that enables negotiation among creators, users, and intermediaries to meet stakeholder goals within the data economy.
ZeroSearch: Incentivize the Search Capability of LLMs without Searching
cs.CL 2025-05 conditional novelty 6.0

ZeroSearch simulates search engine interactions via supervised fine-tuning of a retrieval module and curriculum-based RL degradation of document quality, achieving comparable or superior performance to real search eng...
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
cs.CL 2023-10 unverdicted novelty 6.0

Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.
Chain-of-Verification Reduces Hallucination in Large Language Models
cs.CL 2023-09 unverdicted novelty 6.0

Chain-of-Verification reduces hallucinations in large language models by drafting responses, planning independent verification questions, answering them separately, and generating a final verified output.
Training Diffusion Models with Reinforcement Learning
cs.LG 2023-05 unverdicted novelty 6.0

DDPO uses policy gradients on the denoising process to optimize diffusion models for arbitrary rewards like human feedback or compressibility.
Language Models can Solve Computer Tasks
cs.CL 2023-03 accept novelty 6.0

Pre-trained LLMs using recursive criticism and improvement prompting achieve state-of-the-art results on the MiniWoB++ computer task benchmark with only a handful of demonstrations and no task-specific reward function.
Language Models (Mostly) Know What They Know
cs.CL 2022-07 unverdicted novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
LLM-Oriented Information Retrieval: A Denoising-First Perspective
cs.IR 2026-05 unverdicted novelty 5.0

Denoising to maximize usable evidence density and verifiability is becoming the primary bottleneck in LLM-oriented information retrieval, conceptualized via a four-stage framework and addressed through a pipeline taxo...
Budget-Constrained Online Retrieval-Augmented Generation: The Chunk-as-a-Service Model
cs.IR 2026-04 unverdicted novelty 5.0

Chunk-as-a-Service with the UCOSA online algorithm enables budget-constrained selection of prompts for chunk enrichment in RAG, outperforming random selection by 52% on a combined performance metric and delivering hig...
Learning from AVA: Early Lessons from a Curated and Trustworthy Generative AI for Policy and Development Research
cs.HC 2026-04 unverdicted novelty 5.0

AVA is a specialized GenAI platform for development policy research that provides verifiable syntheses from World Bank reports and is associated with 2.4-3.9 hours of weekly time savings in a large-scale user evaluation.
