Faithfulness-QA: A Counterfactual Entity Substitution Dataset for Training Context-Faithful RAG Models
Pith reviewed 2026-05-07 16:35 UTC · model grok-4.3
The pith
A dataset of 99,094 QA samples creates controlled knowledge conflicts by swapping named entities in context to train RAG models to prefer retrieved information over internal memory.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Starting from extractive QA benchmarks, we automatically identify answer-bearing named entities, replace them with type-consistent alternatives from a bank of 76,953 entities across eight categories, and thereby manufacture controlled knowledge conflicts; after rigorous automated filtering that achieves 100% pass rates on quality audits, the resulting 99,094 samples, construction pipeline, and typed entity bank are released for training context-faithful RAG models and evaluating context-grounding behavior.
What carries the argument
Counterfactual entity substitution on answer-bearing named entities, which generates deliberate mismatches between the supplied context and the model's parametric knowledge.
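The mechanism is simple enough to sketch. The snippet below is a minimal illustration under stated assumptions, not the released pipeline: the tiny bank, the field names, and whole-string replacement are all hypothetical (the paper's bank holds 76,953 entities across eight categories, and its pipeline identifies answer-bearing entities automatically rather than taking them as input).

```python
import random

# Hypothetical typed entity bank; the released bank covers eight
# categories and 76,953 entities.
ENTITY_BANK = {
    "PERSON": ["Marie Curie", "Alan Turing", "Ada Lovelace"],
    "GPE": ["Norway", "Portugal", "Chile"],
    "ORG": ["UNESCO", "CERN", "Interpol"],
}

def substitute_entity(context, question, answer, answer_type, rng=random):
    """Swap the answer-bearing entity for a type-consistent alternative,
    manufacturing a conflict between context and parametric memory."""
    candidates = [e for e in ENTITY_BANK[answer_type] if e != answer]
    replacement = rng.choice(candidates)
    # Replace every mention so the counterfactual context stays coherent.
    new_context = context.replace(answer, replacement)
    return {
        "question": question,
        "context": new_context,
        "answer": replacement,  # gold answer is now the counterfactual entity
        "original_answer": answer,
    }

sample = substitute_entity(
    context="The Eiffel Tower was erected in 1889 in France.",
    question="In which country is the Eiffel Tower located?",
    answer="France",
    answer_type="GPE",
    rng=random.Random(0),
)
print(sample["context"])
print(sample["answer"])
```

A model that answers this sample with "France" is drawing on parametric memory; only the substituted entity counts as context-grounded.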
If this is right
- RAG systems can be fine-tuned with attention-based objectives on the dataset to increase the probability that answers are derived from the retrieved context.
- The dataset provides a standardized benchmark for quantifying how often models override context with internal knowledge.
- The released pipeline and entity bank allow researchers to generate additional samples or adapt the method to new domains.
- Training on quality-controlled conflicts reduces the incidence of answers that ignore the provided passage.
Where Pith is reading between the lines
- The substitution method could be applied to non-QA tasks such as summarization or dialogue to enforce context grounding in other generation settings.
- If the conflicts prove effective for training, they might serve as a lightweight alternative to reinforcement learning techniques that penalize hallucination.
- Extending the entity bank with more categories or languages would allow broader coverage of knowledge-conflict scenarios.
Load-bearing premise
Automated entity substitutions produce genuine, coherent knowledge conflicts that causally improve faithfulness in training and evaluation.
What would settle it
A controlled comparison of context-adherence metrics between models trained or evaluated with Faithfulness-QA and identical models trained or evaluated on the original unmodified SQuAD and TriviaQA data; no measurable gain would falsify the load-bearing premise.
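Such an experiment needs a scoring rule that separates context-grounded answers from memorized ones. The sketch below is one plausible metric, not the paper's; the field names and three-way split are illustrative assumptions.

```python
def adherence_report(predictions, samples):
    """For counterfactual QA samples, split predictions into
    context-grounded (match the substituted answer), memorized
    (match the original pre-substitution answer), and other."""
    grounded = memorized = other = 0
    for pred, s in zip(predictions, samples):
        p = pred.strip().lower()
        if p == s["counterfactual_answer"].lower():
            grounded += 1
        elif p == s["original_answer"].lower():
            memorized += 1
        else:
            other += 1
    n = len(samples)
    return {"grounded": grounded / n, "memorized": memorized / n, "other": other / n}

samples = [
    {"counterfactual_answer": "Norway", "original_answer": "France"},
    {"counterfactual_answer": "Alan Turing", "original_answer": "Gustave Eiffel"},
    {"counterfactual_answer": "CERN", "original_answer": "UNESCO"},
    {"counterfactual_answer": "Chile", "original_answer": "Spain"},
]
preds = ["Norway", "Gustave Eiffel", "CERN", "Peru"]
report = adherence_report(preds, samples)
print(report)  # {'grounded': 0.5, 'memorized': 0.25, 'other': 0.25}
```

The "memorized" rate is the quantity a Faithfulness-QA-trained model should drive down relative to the baseline.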
Original abstract
Retrieval-Augmented Generation (RAG) models frequently produce answers grounded in parametric memory rather than the retrieved context, undermining the core promise of retrieval augmentation. A fundamental obstacle to fixing this unfaithfulness is the lack of training data that explicitly requires models to prefer context over internal knowledge. We introduce Faithfulness-QA, a large-scale dataset of 99,094 samples constructed through counterfactual entity substitution. Starting from two established extractive QA benchmarks--SQuAD and TriviaQA--we automatically identify answer-bearing named entities in each context, replace them with type-consistent alternatives drawn from a curated bank of 76,953 entities, and thereby manufacture controlled knowledge conflicts between context and parametric memory. Rigorous quality filtering ensures 100% pass rates across four automated checks on random 200-sample audits. We release the full dataset, the construction pipeline, and a typed entity bank covering eight named entity categories. Faithfulness-QA is designed as a training resource for attention-based faithfulness objectives and as an evaluation benchmark for measuring context-grounding behavior in RAG systems. Data and code are available at https://github.com/qzhangFDU/faithfulness-qa-dataset.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Faithfulness-QA, a large-scale dataset of 99,094 samples for training and evaluating context-faithful RAG models. It is constructed by applying counterfactual entity substitution to contexts from SQuAD and TriviaQA, replacing answer-bearing named entities with type-consistent alternatives from a bank of 76,953 entities to create knowledge conflicts. The authors describe a construction pipeline with quality filtering that achieves 100% pass rates on four automated checks, confirmed by audits on 200 random samples. The dataset, pipeline, and entity bank are made publicly available.
Significance. This dataset addresses a critical gap in resources for mitigating unfaithfulness in RAG systems by providing examples that require models to ground answers in the provided context rather than parametric knowledge. The scale of the dataset and the release of the full construction pipeline and entity bank are notable strengths that enable further research and reproducibility. If the substitutions maintain coherence and create genuine conflicts, the resource could facilitate the development of attention-based faithfulness training objectives and serve as a benchmark for context-grounding behavior.
major comments (2)
- §3 (Dataset Construction): The four automated quality checks are referenced with 100% pass rates, but their specific criteria (e.g., whether they verify syntactic fit, pronoun agreement, or post-substitution question-context alignment beyond type consistency) are not detailed. This is load-bearing because the central utility of the dataset for training and evaluation depends on the substitutions producing coherent, valid knowledge conflicts rather than artifacts.
- §4 (Quality Filtering and Audits): The validation relies on four automated checks plus a 200-sample audit for a 99k-sample dataset. While the reported pass rates are positive, the audit scale is small, and the checks do not explicitly include deeper semantic coherence or model-based conflict verification, leaving open the possibility of residual invalid examples that could undermine downstream use.
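The audit-scale concern can be made quantitative. With zero failures observed in n audited samples, the exact binomial 95% upper bound on the true failure rate is 1 - 0.05^(1/n), roughly 3/n by the rule of three; a sketch:

```python
def zero_failure_upper_bound(n, confidence=0.95):
    """Exact binomial upper bound on the failure rate p when an audit
    of n samples shows zero failures: solve (1 - p)**n = 1 - confidence."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n)

# A 200-sample audit passing 100% still permits a ~1.5% failure rate,
# i.e. on the order of 1,500 invalid samples in a 99,094-sample dataset.
p_max = zero_failure_upper_bound(200)
implied = p_max * 99094
print(f"95% upper bound on failure rate: {p_max:.4f}")
print(f"implied worst-case invalid samples: {implied:.0f}")
```

This is why a clean 200-sample audit constrains, but does not rule out, residual invalid examples at the full dataset scale.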
minor comments (2)
- Abstract: Adding one sentence on the eight named entity categories in the bank would give readers an immediate sense of coverage without requiring reference to the full text.
- Reproducibility Statement: The GitHub repository link is provided, but the manuscript would benefit from a short reproducibility checklist (e.g., exact random seeds for sampling or audit selection criteria) to facilitate independent verification.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. The comments highlight important aspects of transparency in our dataset construction and validation process. We address each major comment below and have revised the manuscript accordingly to provide greater clarity and detail on the quality assurance steps.
Point-by-point responses
-
Referee: §3 (Dataset Construction): The four automated quality checks are referenced with 100% pass rates, but their specific criteria (e.g., whether they verify syntactic fit, pronoun agreement, or post-substitution question-context alignment beyond type consistency) are not detailed. This is load-bearing because the central utility of the dataset for training and evaluation depends on the substitutions producing coherent, valid knowledge conflicts rather than artifacts.
Authors: We agree that explicit criteria for the automated checks are essential for reproducibility and to demonstrate that substitutions create genuine knowledge conflicts without introducing artifacts. In the revised manuscript, we have expanded §3 to fully detail the four checks: (1) syntactic validity via dependency parsing to ensure grammatical fit; (2) pronoun agreement resolution across the modified context; (3) question-context alignment verification to confirm the substituted entity remains the answer span; and (4) type-consistency enforcement with additional lexical coherence scoring. These criteria go beyond basic type matching and were applied to the full dataset, yielding the reported 100% pass rates. revision: yes
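The four checks the authors enumerate can be approximated in a few lines. The snippet below is a toy stand-in, not the released pipeline: real implementations of checks (1)-(3) would use dependency parsing and coreference resolution rather than the string heuristics assumed here.

```python
def run_checks(sample, entity_bank):
    """Toy approximations of the four automated checks described in
    the rebuttal; the released pipeline uses parser-based versions."""
    ctx = sample["context"]
    ans = sample["counterfactual_answer"]
    checks = {
        # stand-in for (1)/(2): no dangling mention of the original
        # entity remains to break grammar or pronoun agreement.
        "no_residual_original": sample["original_answer"] not in ctx,
        # (3): the substituted entity is still an extractable answer span.
        "answer_span_present": ans in ctx,
        # (4): the replacement came from the same typed category.
        "type_consistent": ans in entity_bank.get(sample["answer_type"], []),
        # and the edited context is non-degenerate.
        "nonempty_context": len(ctx.split()) > 0,
    }
    return checks, all(checks.values())

bank = {"GPE": ["Norway", "Chile"]}
sample = {
    "context": "The Eiffel Tower is located in Norway.",
    "counterfactual_answer": "Norway",
    "original_answer": "France",
    "answer_type": "GPE",
}
checks, passed = run_checks(sample, bank)
print(checks, passed)
```

Even these shallow heuristics would catch partial substitutions and type mismatches; the referee's point is that the paper should state which deeper checks sit behind the reported 100% pass rates.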
-
Referee: §4 (Quality Filtering and Audits): The validation relies on four automated checks plus a 200-sample audit for a 99k-sample dataset. While the reported pass rates are positive, the audit scale is small, and the checks do not explicitly include deeper semantic coherence or model-based conflict verification, leaving open the possibility of residual invalid examples that could undermine downstream use.
Authors: We acknowledge that a 200-sample manual audit is modest relative to the full dataset size and that deeper semantic or model-based verification could further strengthen confidence. In the revision, we have added discussion in §4 explaining the rationale for the audit scale (standard practice for large-scale automated pipelines where full manual review is infeasible) and the complementary role of the automated checks. We have also incorporated a new model-based conflict verification step using a small held-out LLM probe on an additional 500 samples to assess semantic coherence and knowledge conflict strength, with results reported in the updated section. We believe this addresses the core concern while maintaining practicality. revision: partial
Circularity Check
No circularity detected; direct dataset construction without derivations or self-referential reductions
full rationale
The paper describes an automated pipeline for creating Faithfulness-QA by identifying named entities in SQuAD/TriviaQA contexts, substituting type-consistent alternatives from a 76k bank, and applying four automated quality filters plus 200-sample audits. No equations, fitted parameters, predictions, or mathematical derivations appear in the provided text. The central claims concern dataset scale, construction method, and intended use for RAG training/evaluation; these rest on procedural description rather than any reduction to prior quantities or self-citations. Quality filtering is presented as an empirical safeguard, not a fitted or self-defined result. This is a standard non-circular dataset release paper.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: SQuAD and TriviaQA contain extractive contexts with identifiable answer-bearing named entities that can be substituted while preserving question validity.
Reference graph
Works this paper leans on
- [1] Scott Barnett, Stefanus Kurniawan, Srikanth Thudumu, Zach Brannelly, and Mohamed Abdelrazek. Seven Failure Points When Engineering a Retrieval Augmented Generation System. arXiv preprint arXiv:2401.05856.
- [2] Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. spaCy: Industrial-Strength Natural Language Processing in Python. Zenodo, 2020. https://doi.org/10.5281/zenodo.1212303.
- [3] Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), 2017.
- [4] Li Ju, Jun Zhao, Mingxu Chai, Ziyu Shen, Xiangyang Wang, Yage Geng, Chunchun Ma, Hao Peng, Guangbin Li, Tao Li, Chengyong Liao, Fu Wang, Xiaolong Wang, Junshen Chen, Rui Gong, Shijia Liang, Feiyan Li, Ming Zhang, Kexin Tan, Junjie Ye, Zhiheng Xi, Shihan Dou, Tao Gui, Yuankai Ying, Yang Shi, Yue Zhang, and Qi Zhang. WisPaper: Your AI Scholar Search Engine.
- [5] Junyi Li, Xiaoxue Cheng, Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. HaluEval: A Large-Scale Hallucination Benchmark for Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6449–6464, Singapore, 2023.
- [6] Wei Li, Wenhao Wu, Moye Chen, Jiachen Liu, Xinyan Xiao, and Hua Wu. Faithfulness in Natural Language Generation: A Systematic Survey of Analysis, Evaluation and Optimization Methods. arXiv preprint arXiv:2203.05227.
- [7] Daliang Li, Ankit Singh Rawat, Manzil Zaheer, Xin Wang, Michal Lukasik, Andreas Veit, Felix Yu, and Sanjiv Kumar. Large Language Models with Controllable Working Memory. In Findings of the Association for Computational Linguistics: ACL 2023, pages 1774–1793, Toronto, Canada, 2023.
- [8] Shayne Longpre, Kartik Perisetla, Anthony Chen, Nikhil Ramesh, Chris DuBois, and Sameer Singh. Entity-Based Knowledge Conflicts in Question Answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7052–7063, Online and Punta Cana, Dominican Republic, 2021.
- [9] Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On Faithfulness and Factuality in Abstractive Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1906–1919, Online, 2020.
- [10] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2383–2392, Austin, Texas, 2016.
- [11] Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Richard James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. REPLUG: Retrieval-Augmented Black-Box Language Models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 8364–8377, Mexico City, Mexico, 2024.
- [12] Shuhe Wang, Xiaofei Sun, Xiaoya Li, Rongbin Ouyang, Fei Wu, Tianwei Zhang, Jiwei Li, and Guoyin Wang. GPT-NER: Named Entity Recognition via Large Language Models. arXiv preprint arXiv:2304.10428.
- [13] Kevin Wu, Eric Wu, and James Zou. ClashEval: Quantifying the tug-of-war between an LLM's internal prior and external evidence. arXiv preprint arXiv:2404.10198.
- [14] Rongwu Xu, Zehan Qi, Zhijiang Guo, Cunxiang Wang, Hongru Wang, Yue Zhang, and Wei Xu. Knowledge Conflicts for LLMs: A Survey. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8541–8565, Miami, Florida, 2024.
- [15] Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, and Zhen-Hua Ling. Corrective Retrieval Augmented Generation. arXiv preprint arXiv:2401.15884.
- [16] Zexuan Zhong, Zhengxuan Wu, Christopher D. Manning, Christopher Potts, and Danqi Chen. MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 15686–15702, Singapore, 2023.