Faithfulness-QA: A Counterfactual Entity Substitution Dataset for Training Context-Faithful RAG Models
Pith reviewed 2026-05-07 16:35 UTC · model grok-4.3
The pith
A dataset of 99,094 QA samples creates controlled knowledge conflicts by swapping named entities in context to train RAG models to prefer retrieved information over internal memory.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Starting from extractive QA benchmarks, we automatically identify answer-bearing named entities, replace them with type-consistent alternatives from a bank of 76,953 entities across eight categories, and thereby manufacture controlled knowledge conflicts; after rigorous automated filtering that achieves 100% pass rates on quality audits, the resulting 99,094 samples, construction pipeline, and typed entity bank are released for training context-faithful RAG models and evaluating context-grounding behavior.
What carries the argument
Counterfactual entity substitution on answer-bearing named entities, which generates deliberate mismatches between the supplied context and the model's parametric knowledge.
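The mechanism is simple enough to sketch. The snippet below is a minimal illustration under stated assumptions, not the released pipeline: the tiny bank, the field names, and whole-string replacement are all hypothetical (the paper's bank holds 76,953 entities across eight categories, and its pipeline identifies answer-bearing entities automatically rather than taking them as input).

```python
import random

# Hypothetical typed entity bank; the released bank covers eight
# categories and 76,953 entities.
ENTITY_BANK = {
    "PERSON": ["Marie Curie", "Alan Turing", "Ada Lovelace"],
    "GPE": ["Norway", "Portugal", "Chile"],
    "ORG": ["UNESCO", "CERN", "Interpol"],
}

def substitute_entity(context, question, answer, answer_type, rng=random):
    """Swap the answer-bearing entity for a type-consistent alternative,
    manufacturing a conflict between context and parametric memory."""
    candidates = [e for e in ENTITY_BANK[answer_type] if e != answer]
    replacement = rng.choice(candidates)
    # Replace every mention so the counterfactual context stays coherent.
    new_context = context.replace(answer, replacement)
    return {
        "question": question,
        "context": new_context,
        "answer": replacement,  # gold answer is now the counterfactual entity
        "original_answer": answer,
    }

sample = substitute_entity(
    context="The Eiffel Tower was erected in 1889 in France.",
    question="In which country is the Eiffel Tower located?",
    answer="France",
    answer_type="GPE",
    rng=random.Random(0),
)
print(sample["context"])
print(sample["answer"])
```

A model that answers this sample with "France" is drawing on parametric memory; only the substituted entity counts as context-grounded.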
If this is right
- RAG systems can be fine-tuned with attention-based objectives on the dataset to increase the probability that answers are derived from the retrieved context.
- The dataset provides a standardized benchmark for quantifying how often models override context with internal knowledge.
- The released pipeline and entity bank allow researchers to generate additional samples or adapt the method to new domains.
- Training on quality-controlled conflicts reduces the incidence of answers that ignore the provided passage.
Where Pith is reading between the lines
- The substitution method could be applied to non-QA tasks such as summarization or dialogue to enforce context grounding in other generation settings.
- If the conflicts prove effective for training, they might serve as a lightweight alternative to reinforcement learning techniques that penalize hallucination.
- Extending the entity bank with more categories or languages would allow broader coverage of knowledge-conflict scenarios.
Load-bearing premise
Automated entity substitutions produce genuine, coherent knowledge conflicts that causally improve faithfulness in training and evaluation.
What would settle it
A controlled comparison of context-adherence metrics between models trained or evaluated with Faithfulness-QA and identical models trained or evaluated on the original unmodified SQuAD and TriviaQA data; no measurable gain would falsify the load-bearing premise.
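Such an experiment needs a scoring rule that separates context-grounded answers from memorized ones. The sketch below is one plausible metric, not the paper's; the field names and three-way split are illustrative assumptions.

```python
def adherence_report(predictions, samples):
    """For counterfactual QA samples, split predictions into
    context-grounded (match the substituted answer), memorized
    (match the original pre-substitution answer), and other."""
    grounded = memorized = other = 0
    for pred, s in zip(predictions, samples):
        p = pred.strip().lower()
        if p == s["counterfactual_answer"].lower():
            grounded += 1
        elif p == s["original_answer"].lower():
            memorized += 1
        else:
            other += 1
    n = len(samples)
    return {"grounded": grounded / n, "memorized": memorized / n, "other": other / n}

samples = [
    {"counterfactual_answer": "Norway", "original_answer": "France"},
    {"counterfactual_answer": "Alan Turing", "original_answer": "Gustave Eiffel"},
    {"counterfactual_answer": "CERN", "original_answer": "UNESCO"},
    {"counterfactual_answer": "Chile", "original_answer": "Spain"},
]
preds = ["Norway", "Gustave Eiffel", "CERN", "Peru"]
report = adherence_report(preds, samples)
print(report)  # {'grounded': 0.5, 'memorized': 0.25, 'other': 0.25}
```

The "memorized" rate is the quantity a Faithfulness-QA-trained model should drive down relative to the baseline.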
Original abstract
Retrieval-Augmented Generation (RAG) models frequently produce answers grounded in parametric memory rather than the retrieved context, undermining the core promise of retrieval augmentation. A fundamental obstacle to fixing this unfaithfulness is the lack of training data that explicitly requires models to prefer context over internal knowledge. We introduce Faithfulness-QA, a large-scale dataset of 99,094 samples constructed through counterfactual entity substitution. Starting from two established extractive QA benchmarks--SQuAD and TriviaQA--we automatically identify answer-bearing named entities in each context, replace them with type-consistent alternatives drawn from a curated bank of 76,953 entities, and thereby manufacture controlled knowledge conflicts between context and parametric memory. Rigorous quality filtering ensures 100% pass rates across four automated checks on random 200-sample audits. We release the full dataset, the construction pipeline, and a typed entity bank covering eight named entity categories. Faithfulness-QA is designed as a training resource for attention-based faithfulness objectives and as an evaluation benchmark for measuring context-grounding behavior in RAG systems. Data and code are available at https://github.com/qzhangFDU/faithfulness-qa-dataset.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Faithfulness-QA, a large-scale dataset of 99,094 samples for training and evaluating context-faithful RAG models. It is constructed by applying counterfactual entity substitution to contexts from SQuAD and TriviaQA, replacing answer-bearing named entities with type-consistent alternatives from a bank of 76,953 entities to create knowledge conflicts. The authors describe a construction pipeline with quality filtering that achieves 100% pass rates on four automated checks, confirmed by audits on 200 random samples. The dataset, pipeline, and entity bank are made publicly available.
Significance. This dataset addresses a critical gap in resources for mitigating unfaithfulness in RAG systems by providing examples that require models to ground answers in the provided context rather than parametric knowledge. The scale of the dataset and the release of the full construction pipeline and entity bank are notable strengths that enable further research and reproducibility. If the substitutions maintain coherence and create genuine conflicts, the resource could facilitate the development of attention-based faithfulness training objectives and serve as a benchmark for context-grounding behavior.
major comments (2)
- §3 (Dataset Construction): The four automated quality checks are referenced with 100% pass rates, but their specific criteria (e.g., whether they verify syntactic fit, pronoun agreement, or post-substitution question-context alignment beyond type consistency) are not detailed. This is load-bearing because the central utility of the dataset for training and evaluation depends on the substitutions producing coherent, valid knowledge conflicts rather than artifacts.
- §4 (Quality Filtering and Audits): The validation relies on four automated checks plus a 200-sample audit for a 99k-sample dataset. While the reported pass rates are positive, the audit scale is small, and the checks do not explicitly include deeper semantic coherence or model-based conflict verification, leaving open the possibility of residual invalid examples that could undermine downstream use.
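The audit-scale concern can be made quantitative. With zero failures observed in n audited samples, the exact binomial 95% upper bound on the true failure rate is 1 - 0.05^(1/n), roughly 3/n by the rule of three; a sketch:

```python
def zero_failure_upper_bound(n, confidence=0.95):
    """Exact binomial upper bound on the failure rate p when an audit
    of n samples shows zero failures: solve (1 - p)**n = 1 - confidence."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n)

# A 200-sample audit passing 100% still permits a ~1.5% failure rate,
# i.e. on the order of 1,500 invalid samples in a 99,094-sample dataset.
p_max = zero_failure_upper_bound(200)
implied = p_max * 99094
print(f"95% upper bound on failure rate: {p_max:.4f}")
print(f"implied worst-case invalid samples: {implied:.0f}")
```

This is why a clean 200-sample audit constrains, but does not rule out, residual invalid examples at the full dataset scale.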
minor comments (2)
- Abstract: Adding one sentence on the eight named entity categories in the bank would give readers an immediate sense of coverage without requiring reference to the full text.
- Reproducibility Statement: The GitHub repository link is provided, but the manuscript would benefit from a short reproducibility checklist (e.g., exact random seeds for sampling or audit selection criteria) to facilitate independent verification.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. The comments highlight important aspects of transparency in our dataset construction and validation process. We address each major comment below and have revised the manuscript accordingly to provide greater clarity and detail on the quality assurance steps.
Point-by-point responses
-
Referee: §3 (Dataset Construction): The four automated quality checks are referenced with 100% pass rates, but their specific criteria (e.g., whether they verify syntactic fit, pronoun agreement, or post-substitution question-context alignment beyond type consistency) are not detailed. This is load-bearing because the central utility of the dataset for training and evaluation depends on the substitutions producing coherent, valid knowledge conflicts rather than artifacts.
Authors: We agree that explicit criteria for the automated checks are essential for reproducibility and to demonstrate that substitutions create genuine knowledge conflicts without introducing artifacts. In the revised manuscript, we have expanded §3 to fully detail the four checks: (1) syntactic validity via dependency parsing to ensure grammatical fit; (2) pronoun agreement resolution across the modified context; (3) question-context alignment verification to confirm the substituted entity remains the answer span; and (4) type-consistency enforcement with additional lexical coherence scoring. These criteria go beyond basic type matching and were applied to the full dataset, yielding the reported 100% pass rates. revision: yes
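The four checks the authors enumerate can be approximated in a few lines. The snippet below is a toy stand-in, not the released pipeline: real implementations of checks (1)-(3) would use dependency parsing and coreference resolution rather than the string heuristics assumed here.

```python
def run_checks(sample, entity_bank):
    """Toy approximations of the four automated checks described in
    the rebuttal; the released pipeline uses parser-based versions."""
    ctx = sample["context"]
    ans = sample["counterfactual_answer"]
    checks = {
        # stand-in for (1)/(2): no dangling mention of the original
        # entity remains to break grammar or pronoun agreement.
        "no_residual_original": sample["original_answer"] not in ctx,
        # (3): the substituted entity is still an extractable answer span.
        "answer_span_present": ans in ctx,
        # (4): the replacement came from the same typed category.
        "type_consistent": ans in entity_bank.get(sample["answer_type"], []),
        # and the edited context is non-degenerate.
        "nonempty_context": len(ctx.split()) > 0,
    }
    return checks, all(checks.values())

bank = {"GPE": ["Norway", "Chile"]}
sample = {
    "context": "The Eiffel Tower is located in Norway.",
    "counterfactual_answer": "Norway",
    "original_answer": "France",
    "answer_type": "GPE",
}
checks, passed = run_checks(sample, bank)
print(checks, passed)
```

Even these shallow heuristics would catch partial substitutions and type mismatches; the referee's point is that the paper should state which deeper checks sit behind the reported 100% pass rates.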
-
Referee: §4 (Quality Filtering and Audits): The validation relies on four automated checks plus a 200-sample audit for a 99k-sample dataset. While the reported pass rates are positive, the audit scale is small, and the checks do not explicitly include deeper semantic coherence or model-based conflict verification, leaving open the possibility of residual invalid examples that could undermine downstream use.
Authors: We acknowledge that a 200-sample manual audit is modest relative to the full dataset size and that deeper semantic or model-based verification could further strengthen confidence. In the revision, we have added discussion in §4 explaining the rationale for the audit scale (standard practice for large-scale automated pipelines where full manual review is infeasible) and the complementary role of the automated checks. We have also incorporated a new model-based conflict verification step using a small held-out LLM probe on an additional 500 samples to assess semantic coherence and knowledge conflict strength, with results reported in the updated section. We believe this addresses the core concern while maintaining practicality. revision: partial
Circularity Check
No circularity detected; direct dataset construction without derivations or self-referential reductions
full rationale
The paper describes an automated pipeline for creating Faithfulness-QA by identifying named entities in SQuAD/TriviaQA contexts, substituting type-consistent alternatives from a 76k bank, and applying four automated quality filters plus 200-sample audits. No equations, fitted parameters, predictions, or mathematical derivations appear in the provided text. The central claims concern dataset scale, construction method, and intended use for RAG training/evaluation; these rest on procedural description rather than any reduction to prior quantities or self-citations. Quality filtering is presented as an empirical safeguard, not a fitted or self-defined result. This is a standard non-circular dataset release paper.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: SQuAD and TriviaQA contain extractive contexts with identifiable answer-bearing named entities that can be substituted while preserving question validity.
Reference graph
Works this paper leans on
- [1] Scott Barnett, Stefanus Kurniawan, Srikanth Thudumu, Zach Brannelly, and Mohamed Abdelrazek. Seven Failure Points When Engineering a Retrieval Augmented Generation System. arXiv preprint arXiv:2401.05856.
- [2] Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. spaCy: Industrial-Strength Natural Language Processing in Python. Zenodo, 2020. https://doi.org/10.5281/zenodo.1212303.
- [3] Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), 2017.
- [4] Li Ju, Jun Zhao, Mingxu Chai, Ziyu Shen, Xiangyang Wang, Yage Geng, Chunchun Ma, Hao Peng, Guangbin Li, Tao Li, Chengyong Liao, Fu Wang, Xiaolong Wang, Junshen Chen, Rui Gong, Shijia Liang, Feiyan Li, Ming Zhang, Kexin Tan, Junjie Ye, Zhiheng Xi, Shihan Dou, Tao Gui, Yuankai Ying, Yang Shi, Yue Zhang, and Qi Zhang. WisPaper: Your AI Scholar Search Engine.
- [5] Junyi Li, Xiaoxue Cheng, Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. HaluEval: A Large-Scale Hallucination Benchmark for Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6449–6464, Singapore, 2023.
- [6] Wei Li, Wenhao Wu, Moye Chen, Jiachen Liu, Xinyan Xiao, and Hua Wu. Faithfulness in Natural Language Generation: A Systematic Survey of Analysis, Evaluation and Optimization Methods. arXiv preprint arXiv:2203.05227.
- [7] Daliang Li, Ankit Singh Rawat, Manzil Zaheer, Xin Wang, Michal Lukasik, Andreas Veit, Felix Yu, and Sanjiv Kumar. Large Language Models with Controllable Working Memory. In Findings of the Association for Computational Linguistics: ACL 2023, pages 1774–1793, Toronto, Canada, 2023.
- [8] Shayne Longpre, Kartik Perisetla, Anthony Chen, Nikhil Ramesh, Chris DuBois, and Sameer Singh. Entity-Based Knowledge Conflicts in Question Answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7052–7063, Online and Punta Cana, Dominican Republic, 2021.
- [9] Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On Faithfulness and Factuality in Abstractive Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1906–1919, Online, 2020.
- [10] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2383–2392, Austin, Texas, 2016.
- [11] Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Richard James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. REPLUG: Retrieval-Augmented Black-Box Language Models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 8364–8377, Mexico City, Mexico, 2024.
- [12] Shuhe Wang, Xiaofei Sun, Xiaoya Li, Rongbin Ouyang, Fei Wu, Tianwei Zhang, Jiwei Li, and Guoyin Wang. GPT-NER: Named Entity Recognition via Large Language Models. arXiv preprint arXiv:2304.10428.
- [13] Kevin Wu, Eric Wu, and James Zou. ClashEval: Quantifying the tug-of-war between an LLM's internal prior and external evidence. arXiv preprint arXiv:2404.10198.
- [14] Rongwu Xu, Zehan Qi, Zhijiang Guo, Cunxiang Wang, Hongru Wang, Yue Zhang, and Wei Xu. Knowledge Conflicts for LLMs: A Survey. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8541–8565, Miami, Florida, 2024.
- [15] Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, and Zhen-Hua Ling. Corrective Retrieval Augmented Generation. arXiv preprint arXiv:2401.15884.
- [16] Zexuan Zhong, Zhengxuan Wu, Christopher D. Manning, Christopher Potts, and Danqi Chen. MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 15686–15702, Singapore, 2023.