arxiv: 2604.03141 · v1 · submitted 2026-04-03 · 💻 cs.CL

Recognition: 2 theorem links

· Lean Theorem

Beyond Precision: Importance-Aware Recall for Factuality Evaluation in Long-Form LLM Generation

James Allan, Mohit Iyyer, Nazanin Jafari

Authors on Pith no claims yet

Pith reviewed 2026-05-13 20:16 UTC · model grok-4.3

classification 💻 cs.CL

keywords factuality evaluationlong-form generationprecision and recallLLM outputsimportance weightingfactual completenessreference facts

0 comments

The pith

LLMs achieve high factual precision but substantially lower recall in long-form generation, leaving out many relevant facts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that factuality evaluation for long-form LLM outputs must measure both precision, the accuracy of stated claims, and recall, the coverage of relevant facts that should appear. It introduces a framework that extracts reference facts from external sources like Wikipedia and checks which ones the generated response includes, with weights based on relevance and salience. Analysis across models shows they perform much better on precision than recall, pointing to incompleteness as a core limitation rather than just hallucination. Models also cover highly important facts more reliably than the full set of relevant ones. This joint metric reveals gaps that precision-only methods overlook in open-ended responses.

Core claim

By building reference facts from external knowledge sources and applying importance-aware weighting for relevance and salience, the evaluation shows current LLMs substantially outperform on precision compared to recall, with factual incompleteness remaining a major limitation of long-form generation and models covering highly important facts better than the full set of relevant facts.

What carries the argument

The importance-aware recall framework, which constructs a set of reference facts from external sources and weights them by relevance and salience to measure how completely generated text covers them.

Load-bearing premise

External knowledge sources such as Wikipedia provide a sufficiently complete and unbiased set of reference facts against which generated responses can be evaluated for recall.

What would settle it

Running the same evaluation on human-written reference long-form articles for the same topics and checking whether the measured recall gap between LLMs and humans shrinks or disappears.

Figures

Figures reproduced from arXiv: 2604.03141 by James Allan, Mohit Iyyer, Nazanin Jafari.

**Figure 1.** Figure 1: Percentage of claims labeled as supported, not supported, or contradicted across models and datasets. Contradiction rates. Beyond assessing the extent to which generated claims were supported by the retrieved evidence, we also analyzed whether any claims directly contradicted the evidence. Specifically, for each dataset, we calculated the rates of supported, not supported, and contradicted claims, defined… view at source ↗

**Figure 2.** Figure 2: Recall comparison for fact reference sets formed using combined importance scoring (α = β = 1) versus relevance-only(α = 1, β = 0) and salience-only scoring (α = 0, β = 1). The first column in each panel reports recall for the combined score (Co), while the next two columns show differences relative to relevance-only (∆(Co-Sal)) and salience-only (∆(Co-Rel)) rankings. 4.2.2 Factuality in LLMs (Varying Impo… view at source ↗

read the original abstract

Evaluating the factuality of long-form output generated by large language models (LLMs) remains challenging, particularly when responses are open-ended and contain many fine-grained factual statements. Existing evaluation methods primarily focus on precision: they decompose a response into atomic claims and verify each claim against external knowledge sources such as Wikipedia. However, this overlooks an equally important dimension of factuality: recall, whether the generated response covers the relevant facts that should be included. We propose a comprehensive factuality evaluation framework that jointly measures precision and recall. Our method leverages external knowledge sources to construct reference facts and determine whether they are captured in generated text. We further introduce an importance-aware weighting scheme based on relevance and salience. Our analysis reveals that current LLMs perform substantially better on precision than on recall, suggesting that factual incompleteness remains a major limitation of long-form generation and that models are generally better at covering highly important facts than the full set of relevant facts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper adds a recall dimension to factuality evaluation for long-form LLM outputs and shows models miss more facts than they hallucinate, but the approach depends on how complete the Wikipedia reference sets really are.

read the letter

The key point is that this work highlights how existing factuality checks for long-form LLM responses focus too much on whether the claims made are correct, while ignoring whether the response includes all the relevant facts it should. They propose adding a recall measure, weighted by importance, using external knowledge like Wikipedia to build a set of reference facts. This is new in the sense that prior methods stopped at precision. The importance-aware scheme tries to prioritize salient facts, which makes sense for open-ended generation where not every detail carries the same weight. The finding that models cover important facts better than the full set is interesting and could guide future work on what to prompt for or fine-tune on. That said, the method's reliability rests on those reference facts being a good proxy for what should be included. Wikipedia often falls short on specialized, recent, or nuanced topics, which could artificially lower the recall scores and exaggerate the precision-recall gap. Without details on how they extract and verify the references or handle cases where the source is incomplete, it's difficult to trust the quantitative claims fully. The abstract mentions a performance gap but doesn't show the actual numbers or any ablation on the weighting. Overall, this is aimed at researchers in LLM evaluation and factuality. A reader working on benchmarks or generation quality would find the framework worth trying, though they'd need to adapt it for domains where Wikipedia isn't sufficient. I think it deserves peer review. The core idea is sound and fills a gap, even if the current validation is limited and the reference source choice needs more justification. Referees could help strengthen the empirical side and suggest better ways to construct the ground truth.

Referee Report

1 major / 0 minor

Summary. The paper proposes a factuality evaluation framework for long-form LLM outputs that jointly measures precision (verifying atomic claims against external sources) and recall (coverage of reference facts extracted from sources such as Wikipedia). It introduces an importance-aware weighting scheme based on relevance and salience to prioritize key facts. Analysis of current LLMs shows substantially higher precision than recall, indicating that factual incompleteness is a major limitation and that models cover important facts better than the full set of relevant ones.

Significance. If the metrics prove robust, the work could meaningfully advance factuality evaluation by moving beyond precision-only approaches, providing a more balanced assessment that highlights coverage gaps in long-form generation. The importance weighting offers a practical way to focus on salient facts, which may guide future model improvements if the reference construction is shown to be reliable.

major comments (1)

[Evaluation Framework and Reference Fact Construction] The recall computation relies on reference facts extracted from Wikipedia (as described in the evaluation framework). For open-ended or specialized prompts, Wikipedia frequently omits fine-grained, emerging, or domain-specific facts that a complete answer should include; this incompleteness inflates the denominator while leaving the numerator unchanged for facts the model does produce, systematically depressing measured recall and undermining the central claim that 'factual incompleteness remains a major limitation.'

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address the major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: The recall computation relies on reference facts extracted from Wikipedia (as described in the evaluation framework). For open-ended or specialized prompts, Wikipedia frequently omits fine-grained, emerging, or domain-specific facts that a complete answer should include; this incompleteness inflates the denominator while leaving the numerator unchanged for facts the model does produce, systematically depressing measured recall and undermining the central claim that 'factual incompleteness remains a major limitation.'

Authors: We appreciate the referee raising this point about reference construction. We agree that Wikipedia is not exhaustive and may omit fine-grained or emerging facts for certain specialized prompts; this is a genuine limitation of relying on any single external source. However, the described effect on the metric is inverted: omitting facts from the reference reduces the denominator, which increases (rather than depresses) measured recall. Thus our reported recall values would be optimistic upper bounds, which would only strengthen the central claim that factual incompleteness is a major limitation. Our experiments focus on prompts and topics for which Wikipedia provides reliable and reasonably complete coverage, consistent with standard practice in factuality benchmarks. In the revised manuscript we will add an explicit limitations subsection discussing the choice of knowledge sources, the conditions under which our reference construction is appropriate, and directions for extending the framework to multiple or domain-specific sources. revision: partial

Circularity Check

0 steps flagged

No circularity detected in factuality evaluation framework

full rationale

The paper's core contribution is an evaluation framework that constructs reference facts from independent external sources (Wikipedia) and computes precision and recall against LLM-generated text, with an added importance-weighting scheme. No parameters are fitted to the evaluation outputs in a way that renders the reported precision-recall gap tautological, and the central claim (precision substantially exceeds recall) follows directly from applying this externally grounded metric rather than from any self-referential definition or self-citation chain. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework depends on the domain assumption that reference facts extracted from external sources accurately represent what should be covered in a response.

axioms (1)

domain assumption External knowledge sources such as Wikipedia contain the relevant facts needed to evaluate recall of generated responses
Used to construct reference facts and determine coverage in generated text.

pith-pipeline@v0.9.0 · 5463 in / 1065 out tokens · 36481 ms · 2026-05-13T20:16:31.287308+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
We propose a comprehensive factuality evaluation framework that jointly measures precision and recall... importance-aware weighting scheme based on relevance and salience... Recw_F = ∑ imp(fk;q) 1[ycov_k = COVERED] / ∑ imp(fk;q)
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear
reference fact set F(q) extracted from retrieved evidence Eq via VeriScore-style atomic fact extraction

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 3 internal anchors

[1]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Comprehensiveness Metrics for Automatic Evaluation of Factual Recall in Text Generation

Adam Dejl, James Barry, Alessandra Pascale, and Javier Carnerero Cano. Comprehensive- ness metrics for automatic evaluation of factual recall in text generation.arXiv preprint arXiv:2510.07926,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

The Llama 3 Herd of Models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Clatter: Comprehensive entailment reasoning for hallucination detection

Ron Eliav, Arie Cattan, Eran Hirsch, Shahaf Bassan, Elias Stengel-Eskin, Mohit Bansal, and Ido Dagan. Clatter: Comprehensive entailment reasoning for hallucination detection. arXiv preprint arXiv:2506.05243,

work page arXiv
[5]

Medscore: Generalizable factuality evaluation of free-form medical answers by domain-adapted claim decomposition and verification.arXiv preprint arXiv:2505.18452, 2025a

Heyuan Huang, Alexandra DeLucia, Vijay Murari Tiyyala, and Mark Dredze. Medscore: Generalizable factuality evaluation of free-form medical answers by domain-adapted claim decomposition and verification.arXiv preprint arXiv:2505.18452, 2025a. 9 Preprint. Under review. Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong ...

work page arXiv
[6]

Verifact: Enhancing long- form factuality evaluation with refined fact extraction and reference facts

Xin Liu, Lechen Zhang, Sheza Munir, Yiyang Gu, and Lu Wang. Verifact: Enhancing long- form factuality evaluation with refined fact extraction and reference facts. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 17919–17936,

work page 2025
[7]

FActScore: Fine-grained atomic evaluation of factual precision in long form text generation

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.),Proceedings of the 2023 Conference on Empirical Methods in Natural Language Proces...

work page 2023
[8]

doi: 10.18653/v1/2023.emnlp-main.741

Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.741. URL https: //aclanthology.org/2023.emnlp-main.741. OpenAI. Gpt-4o-mini model. https://platform.openai.com/docs/models#gpt-4o-mini,

work page doi:10.18653/v1/2023.emnlp-main.741 2023
[9]

long 2rag: Evaluating long-context & long-form retrieval-augmented generation with key point recall

Zehan Qi, Rongwu Xu, Zhijiang Guo, Cunxiang Wang, Hao Zhang, and Wei Xu. long 2rag: Evaluating long-context & long-form retrieval-augmented generation with key point recall. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.),Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 4852–4872, Miami, Florida, USA, November

work page 2024
[10]

Rethinking tabular data understanding with large language models

Association for Computational Linguistics. doi: 10.18653/v1/2024. findings-emnlp.279. URLhttps://aclanthology.org/2024.findings-emnlp.279/. Chris Samarinas, Alexander Krubner, Alireza Salemi, Youngwoo Kim, and Hamed Zamani. Beyond factual accuracy: Evaluating coverage of diverse factual information in long-form text generation. InFindings of the Associati...

work page doi:10.18653/v1/2024 2024
[11]

VeriScore: Evaluating the factuality of verifiable claims in long-form text generation

Yixiao Song, Yekyung Kim, and Mohit Iyyer. VeriScore: Evaluating the factuality of verifiable claims in long-form text generation. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.),Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 9447–9474, Miami, Florida, USA, November

work page 2024
[12]

doi: 10.18653/v1/2024.findings-emnlp.552

Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp.552. URL https: //aclanthology.org/2024.findings-emnlp.552/. Qwen Team. Qwen2.5: A party of foundation models, September

work page doi:10.18653/v1/2024.findings-emnlp.552 2024
[13]

SM Tonmoy, SM Zaman, Vinija Jain, Anku Rani, Vipula Rawte, Aman Chadha, and Amitava Das

URL https: //qwenlm.github.io/blog/qwen2.5/. SM Tonmoy, SM Zaman, Vinija Jain, Anku Rani, Vipula Rawte, Aman Chadha, and Amitava Das. A comprehensive survey of hallucination mitigation techniques in large language models.arXiv preprint arXiv:2401.01313,

work page arXiv
[14]

A closer look at claim decomposition

Miriam Wanner, Seth Ebner, Zhengping Jiang, Mark Dredze, and Benjamin Van Durme. A closer look at claim decomposition. In Danushka Bollegala and Vered Shwartz (eds.), Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024), pp. 153–175, Mexico City, Mexico, June

work page 2024
[15]

10 Preprint

Association for Computational Linguistics. 10 Preprint. Under review. doi: 10.18653/v1/2024.starsem-1.13. URL https://aclanthology.org/2024.starsem-1. 13/. Miriam Wanner, Leif Azzopardi, Paul Thomas, Soham Dan, Benjamin Van Durme, and Nick Craswell. All claims are equal, but some claims are more equal than others: Importance- sensitive factuality evaluati...

work page doi:10.18653/v1/2024.starsem-1.13 2024
[16]

Long-form factuality in large language models

Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, et al. Long-form factuality in large language models. arXiv preprint arXiv:2403.18802,

work page arXiv
[17]

His breakout role was as Seth Cohen on the Fox television series The O.C

is an American actor. His breakout role was as Seth Cohen on the Fox television series The O.C. (2003–2007). For his performance as Noah in the Netflix romantic comedy series Nobody Wants This (2024), he earned a nomination for the Golden Globe Award for Best Actor in a Television Series (Musical/Comedy) and won the Critics’ Choice Television Award for Be...

work page 2003
[18]

label":

• Adam Brody’s breakout role was as Seth Cohen on the Fox television series The O.C. (2003–2007). • He earned a nomination for the Golden Globe Award for Best Actor in a Television Series (Musical/Comedy) for his performance as Noah in the Netflix romantic comedy series Nobody Wants This (2024). • Brody won the Critics’ Choice Television Award for Best Ac...

work page 2003