FEVER: a large-scale dataset for Fact Extraction and VERification
Pith reviewed 2026-05-13 19:46 UTC · model grok-4.3
The pith
The FEVER dataset introduces 185,445 claims to benchmark fact verification against textual sources.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FEVER consists of 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from. The claims are classified as Supported, Refuted or NotEnoughInfo by annotators achieving 0.6841 in Fleiss κ. For the first two classes, the annotators also recorded the sentence(s) forming the necessary evidence for their judgment. To characterize the challenge, a pipeline approach achieves 31.87% accuracy when the predicted label must be paired with the correct evidence, and 50.91% on labels alone.
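The agreement figure is a single Fleiss' κ aggregated over all claims. As a reference point for how that statistic is computed, here is a minimal sketch; the three-way label scheme is the paper's, but the function and any example counts are illustrative, not the authors' code:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table of per-item category counts.

    ratings: list of rows; each row gives, for one claim, how many
    annotators chose each label (e.g. [Supported, Refuted, NotEnoughInfo]).
    Every row must sum to the same number of raters r.
    """
    n_items = len(ratings)
    r = sum(ratings[0])           # raters per item
    n_cats = len(ratings[0])
    total = n_items * r
    # marginal proportion of each category across all ratings
    p = [sum(row[j] for row in ratings) / total for j in range(n_cats)]
    # observed pairwise agreement on each item
    P_i = [(sum(c * c for c in row) - r) / (r * (r - 1)) for row in ratings]
    P_bar = sum(P_i) / n_items    # mean observed agreement
    P_e = sum(x * x for x in p)   # chance agreement
    return (P_bar - P_e) / (1 - P_e)
```

A value of 0.6841 sits in the range usually read as substantial agreement, which is why the referee section below asks for class-wise breakdowns rather than dismissing the labels outright.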
What carries the argument
The FEVER dataset of Wikipedia-altered claims with blind verification labels and evidence annotations.
If this is right
- Automated claim verification systems must handle both evidence retrieval and classification to succeed on this benchmark.
- The gap between pipeline performance and potential oracles shows that current methods have substantial room for improvement.
- The dataset can drive research on verifying claims in large text corpora like Wikipedia.
- High annotator agreement establishes a reliable human baseline for the task.
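The retrieval-plus-classification structure in the first bullet can be sketched as a minimal pipeline interface. The retrieve, select, and classify components below are hypothetical stand-ins for whatever models a system plugs in, not the authors' implementation:

```python
LABELS = ("Supported", "Refuted", "NotEnoughInfo")

def verify(claim, corpus, retrieve, select, classify, k_docs=5, k_sents=5):
    """Two-stage verification: evidence retrieval, then claim classification.

    retrieve/select/classify are pluggable components (e.g. TF-IDF
    document retrieval, sentence selection, and an entailment model);
    only the interface is fixed here.
    """
    docs = retrieve(claim, corpus, k_docs)    # candidate documents
    evidence = select(claim, docs, k_sents)   # candidate evidence sentences
    label = classify(claim, evidence)         # one of LABELS
    if label not in LABELS:
        raise ValueError(f"unknown label: {label}")
    return label, evidence
```

The design point is that the benchmark judges the returned label and the returned evidence together, so errors in either stage cap end-to-end accuracy.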
Where Pith is reading between the lines
- End-to-end models might close the performance gap by learning retrieval and verification jointly.
- The approach could be extended to claims requiring multiple evidence sentences or cross-document verification.
- This benchmark may help evaluate systems for detecting misinformation in real-world textual sources.
Load-bearing premise
Annotators without knowledge of the original sentence can reliably determine the correct label and evidence for claims created by altering Wikipedia sentences.
What would settle it
If new annotators given the original sentences produce substantially different labels or evidence sets, the blind annotation process would not be a valid test of verification.
read the original abstract
In this paper we introduce a new publicly available dataset for verification against textual sources, FEVER: Fact Extraction and VERification. It consists of 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from. The claims are classified as Supported, Refuted or NotEnoughInfo by annotators achieving 0.6841 in Fleiss $\kappa$. For the first two classes, the annotators also recorded the sentence(s) forming the necessary evidence for their judgment. To characterize the challenge of the dataset presented, we develop a pipeline approach and compare it to suitably designed oracles. The best accuracy we achieve on labeling a claim accompanied by the correct evidence is 31.87%, while if we ignore the evidence we achieve 50.91%. Thus we believe that FEVER is a challenging testbed that will help stimulate progress on claim verification against textual sources.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the FEVER dataset of 185,445 claims generated by altering Wikipedia sentences and verified by annotators without access to the source sentences. Claims are labeled Supported, Refuted, or NotEnoughInfo with an overall Fleiss' κ of 0.6841; evidence sentences are recorded for the first two classes. A pipeline baseline achieves 31.87% accuracy when a correct label must be accompanied by the correct evidence, and 50.91% when evidence is ignored, leading the authors to conclude that FEVER is a challenging testbed for claim verification against textual sources.
Significance. If the annotation quality is adequately demonstrated, the public release of this large-scale dataset with associated evidence would constitute a substantial contribution to NLP research on fact extraction and verification. The concrete reporting of inter-annotator agreement and two baseline accuracies provides a clear starting point for future work and could stimulate measurable progress on the task.
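The two baseline figures correspond to two scoring regimes: plain label accuracy, and a stricter score that only credits a correct label when the submitted evidence covers the gold evidence. A hypothetical sketch of that distinction (the paper's exact evidence-matching rules may differ):

```python
def score(predictions, gold):
    """Label accuracy with and without the evidence requirement.

    predictions/gold: lists of (label, evidence_set) pairs; evidence sets
    are frozensets of (page, sentence_id) identifiers, empty for
    NotEnoughInfo claims, which carry no evidence requirement.
    Returns (strict_accuracy, label_only_accuracy).
    """
    label_only = strict = 0
    for (p_label, p_ev), (g_label, g_ev) in zip(predictions, gold):
        if p_label == g_label:
            label_only += 1
            # strict credit also requires the gold evidence to be covered
            if not g_ev or g_ev <= p_ev:
                strict += 1
    n = len(gold)
    return strict / n, label_only / n
```

Under this reading, the gap between 50.91% and 31.87% measures how often the baseline gets the label right while missing the evidence that justifies it.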
major comments (1)
- [Abstract] The claim that FEVER forms a valid benchmark rests on the reliability of the three-way labels and evidence annotations. Only an aggregate Fleiss' κ of 0.6841 is reported; no class-wise agreement, no evidence-sentence selection agreement, and no information on the number of annotators per claim or disagreement resolution are provided. Because claims are artificially generated by sentence alteration, this moderate agreement level directly affects the interpretation of the 31.87% and 50.91% baseline figures and requires additional quantitative support.
Simulated Author's Rebuttal
We thank the referee for their constructive review and for acknowledging the potential significance of the FEVER dataset. We address the single major comment below and will make the requested revisions to strengthen the reporting of annotation quality.
read point-by-point responses
- Referee: [Abstract] The claim that FEVER forms a valid benchmark rests on the reliability of the three-way labels and evidence annotations. Only an aggregate Fleiss' κ of 0.6841 is reported; no class-wise agreement, no evidence-sentence selection agreement, and no information on the number of annotators per claim or disagreement resolution are provided. Because claims are artificially generated by sentence alteration, this moderate agreement level directly affects the interpretation of the 31.87% and 50.91% baseline figures and requires additional quantitative support.
  Authors: We agree that the abstract would benefit from expanded reporting on annotation reliability. The full manuscript (Section 3) details that each claim was annotated by three crowd workers, with final labels determined by majority vote and evidence sentences selected by the same workers; disagreements were resolved via discussion among annotators. To directly address the concern, we will revise the abstract to report class-wise Fleiss' κ values, inter-annotator agreement on evidence sentence selection, and the per-claim annotator count. We will also add a brief clarification that the moderate aggregate κ is consistent with the inherent subjectivity of claim verification and does not invalidate the baselines: the 31.87% accuracy with gold evidence already demonstrates the difficulty of the verification step, while the 50.91% figure without evidence underscores the need for retrieval.
  Revision: yes
Circularity Check
No circularity: empirical dataset construction with independent baselines
full rationale
The paper constructs a dataset by altering Wikipedia sentences and collecting human annotations (Fleiss κ = 0.6841), then reports direct empirical baseline accuracies from a pipeline (31.87% with evidence, 50.91% without). No equations, derivations, fitted parameters renamed as predictions, or self-citations appear in the provided text. The central claim that FEVER is a challenging testbed follows from these reported numbers rather than from assumptions built into the dataset's construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Wikipedia sentences can be altered to create claims that annotators can classify as Supported, Refuted or NotEnoughInfo without seeing the original sentence.
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.RealityFromDistinction.reality_from_one_distinction (relevance: unclear). Matched claim text: "It consists of 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from. The claims are classified as Supported, Refuted or NotEnoughInfo by annotators achieving 0.6841 in Fleiss κ."
Forward citations
Cited by 22 Pith papers
- Needle-in-RAG: Prompt-Conditioned Character-Level Traceback of Poisoned Spans in Retrieved Evidence
  RAGCharacter localizes poisoned character spans in RAG evidence via prompt-conditioned counterfactual masking and achieves the best accuracy-over-attribution trade-off across tested attacks and models.
- InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees
  InvEvolve evolves white-box inventory policies from LLMs with statistical safety guarantees and outperforms classical and deep learning methods on synthetic and real retail data.
- HeadRank: Decoding-Free Passage Reranking via Preference-Aligned Attention Heads
  HeadRank improves decoding-free passage reranking by preference-aligning attention heads to increase discriminability in middle-context documents, outperforming baselines on 14 benchmarks with only 211 training queries.
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
  RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.
- From Articles to Premises: Building PrimeFacts, an Extraction Methodology and Resource for Fact-Checking Evidence
  PrimeFacts extracts decontextualized premises from fact-check articles, raising evidence retrieval MRR by up to 30% and verdict prediction Macro-F1 by 10-20 points over baselines.
- CAR: Query-Guided Confidence-Aware Reranking for Retrieval-Augmented Generation
  CAR reranks documents in RAG by promoting those that increase generator confidence (via answer consistency sampling) and demoting those that decrease it, yielding NDCG@5 gains on BEIR datasets that correlate with F1 i...
- When Embedding-Based Defenses Fail: Rethinking Safety in LLM-Based Multi-Agent Systems
  Embedding-based defenses fail against attacks that align malicious message embeddings with benign ones in LLM multi-agent systems, but token-level confidence scores improve robustness by enabling better pruning of sus...
- InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees
  InvEvolve uses LLMs and RL to generate certified inventory policies that outperform classical and deep learning methods on synthetic and real data while providing multi-period performance guarantees.
- Learning to Route Queries to Heads for Attention-based Re-ranking with Large Language Models
  RouteHead trains a lightweight router to dynamically select optimal LLM attention heads per query for improved attention-based document re-ranking.
- Guaranteeing Knowledge Integration with Joint Decoding for Retrieval-Augmented Generation
  GuarantRAG improves RAG accuracy up to 12.1% and cuts hallucinations 16.3% by decoupling parametric reasoning from evidence integration via contrastive DPO and joint decoding.
- Data, Not Model: Explaining Bias toward LLM Texts in Neural Retrievers
  Bias toward LLM texts in neural retrievers arises from artifact imbalances between positive and negative documents in training data that are absorbed during contrastive learning.
- NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
  NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.
- Unsupervised Dense Information Retrieval with Contrastive Learning
  Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.
- An Information-Geometric Framework for Stability Analysis of Large Language Models under Entropic Stress
  A thermodynamic-inspired information-geometric framework defines a composite LLM stability score that outperforms a utility-entropy baseline by 0.0299 on average across 80 observations, with gains increasing at higher...
- Learning Uncertainty from Sequential Internal Dispersion in Large Language Models
  SIVR detects LLM hallucinations by learning from token-wise and layer-wise variance patterns in internal hidden states, outperforming baselines with better generalization and less training data.
- Verify Before You Commit: Towards Faithful Reasoning in LLM Agents via Self-Auditing
  SAVeR adds self-auditing of internal beliefs in LLM agents via persona-based candidates and constraint-guided repairs, improving faithfulness on six benchmarks without hurting task performance.
- Hallucination Basins: A Dynamic Framework for Understanding and Controlling LLM Hallucinations
  LLM hallucinations arise from task-dependent basins in latent space, with separability varying by task and geometry-aware steering reducing their probability.
- Multilingual E5 Text Embeddings: A Technical Report
  Open-source multilingual E5 embedding models are trained via contrastive pre-training on 1 billion text pairs and fine-tuning, with an instruction-tuned model matching English SOTA performance.
- Text Embeddings by Weakly-Supervised Contrastive Pre-training
  E5 text embeddings trained with weakly-supervised contrastive pre-training on CCPairs outperform BM25 on BEIR zero-shot and achieve top results on MTEB, beating much larger models.
- Trust but Verify: Introducing DAVinCI -- A Framework for Dual Attribution and Verification in Claim Inference for Language Models
  DAVinCI combines claim attribution to model internals and external sources with entailment-based verification to improve LLM factual reliability by 5-20% on fact-checking datasets.
- Understanding the planning of LLM agents: A survey
  A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.
- Hypencoder Revisited: Reproducibility and Analysis of Non-Linear Scoring for First-Stage Retrieval
  Reproducibility study confirms Hypencoder's non-linear query-specific scoring improves retrieval over bi-encoders on standard benchmarks but standard methods remain faster and hard-task results are mixed due to implem...