pith. machine review for the scientific record.

arxiv: 2604.03141 · v1 · submitted 2026-04-03 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

Beyond Precision: Importance-Aware Recall for Factuality Evaluation in Long-Form LLM Generation

James Allan, Mohit Iyyer, Nazanin Jafari

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:16 UTC · model grok-4.3

classification 💻 cs.CL
keywords: factuality evaluation · long-form generation · precision and recall · LLM outputs · importance weighting · factual completeness · reference facts

The pith

LLMs achieve high factual precision but substantially lower recall in long-form generation, leaving out many relevant facts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that factuality evaluation for long-form LLM outputs must measure both precision, the accuracy of stated claims, and recall, the coverage of relevant facts that should appear. It introduces a framework that extracts reference facts from external sources like Wikipedia and checks which ones the generated response includes, with weights based on relevance and salience. Analysis across models shows they perform much better on precision than recall, pointing to incompleteness as a core limitation rather than just hallucination. Models also cover highly important facts more reliably than the full set of relevant ones. This joint metric reveals gaps that precision-only methods overlook in open-ended responses.

Core claim

By building reference facts from external knowledge sources and applying importance-aware weighting for relevance and salience, the evaluation shows that current LLMs perform substantially better on precision than on recall, that factual incompleteness remains a major limitation of long-form generation, and that models cover highly important facts better than the full set of relevant facts.

What carries the argument

The importance-aware recall framework, which constructs a set of reference facts from external sources and weights them by relevance and salience to measure how completely generated text covers them.
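The page does not reproduce the paper's exact scoring functions, so the following is a minimal sketch of importance-weighted recall under stated assumptions: per the Figure 2 caption, relevance and salience enter through weights α and β, which the sketch combines linearly. The linear form, the [0, 1] score ranges, and the boolean coverage judgment are illustrative assumptions, not the authors' published formulation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ReferenceFact:
    text: str
    relevance: float  # assumed in [0, 1]: pertinence to the prompt
    salience: float   # assumed in [0, 1]: centrality to the topic

def importance_weighted_recall(
    facts: List[ReferenceFact],
    covered: List[bool],   # covered[i]: response judged to include facts[i]
    alpha: float = 1.0,    # weight on relevance
    beta: float = 1.0,     # weight on salience
) -> float:
    """Recall over a reference fact set, each fact weighted by
    alpha * relevance + beta * salience.

    alpha = beta = 1 mirrors the 'combined' scoring in Figure 2;
    beta = 0 gives relevance-only scoring, alpha = 0 salience-only.
    """
    weights = [alpha * f.relevance + beta * f.salience for f in facts]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w for w, c in zip(weights, covered) if c) / total

# Toy example: the response covers two of three reference facts.
facts = [
    ReferenceFact("F1 (central)",         relevance=0.9, salience=0.9),
    ReferenceFact("F2 (peripheral)",      relevance=0.6, salience=0.2),
    ReferenceFact("F3 (central, missed)", relevance=0.8, salience=0.7),
]
print(importance_weighted_recall(facts, [True, True, False]))  # ≈ 0.63
```

Unweighted recall here would be 2/3 ≈ 0.67; missing the heavily weighted F3 pulls the importance-aware score down to about 0.63, which is the behavior the weighting is meant to capture.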

Load-bearing premise

External knowledge sources such as Wikipedia provide a sufficiently complete and unbiased set of reference facts against which generated responses can be evaluated for recall.

What would settle it

Running the same evaluation on human-written reference long-form articles for the same topics and checking whether the measured recall gap between LLMs and humans shrinks or disappears.

Figures

Figures reproduced from arXiv: 2604.03141 by James Allan, Mohit Iyyer, Nazanin Jafari.

Figure 1. Percentage of claims labeled as supported, not supported, or contradicted across models and datasets. view at source ↗
Figure 2. Recall comparison for fact reference sets formed using combined importance scoring (α = β = 1) versus relevance-only (α = 1, β = 0) and salience-only (α = 0, β = 1) scoring. The first column in each panel reports recall for the combined score (Co); the next two columns show differences relative to relevance-only (Δ(Co−Rel)) and salience-only (Δ(Co−Sal)) rankings. view at source ↗
read the original abstract

Evaluating the factuality of long-form output generated by large language models (LLMs) remains challenging, particularly when responses are open-ended and contain many fine-grained factual statements. Existing evaluation methods primarily focus on precision: they decompose a response into atomic claims and verify each claim against external knowledge sources such as Wikipedia. However, this overlooks an equally important dimension of factuality: recall, i.e., whether the generated response covers the relevant facts that should be included. We propose a comprehensive factuality evaluation framework that jointly measures precision and recall. Our method leverages external knowledge sources to construct reference facts and determine whether they are captured in generated text. We further introduce an importance-aware weighting scheme based on relevance and salience. Our analysis reveals that current LLMs perform substantially better on precision than on recall, suggesting that factual incompleteness remains a major limitation of long-form generation and that models are generally better at covering highly important facts than the full set of relevant facts.
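The precision side the abstract describes (decompose the response into atomic claims, verify each against an external source) is equally compact to sketch. The decomposer and verifier below are stubs; in practice each would be an LLM or a retrieval-plus-entailment component, and these names and signatures are assumptions for illustration rather than the paper's interface.

```python
from typing import Callable, List

def factual_precision(
    response: str,
    decompose: Callable[[str], List[str]],  # response -> atomic claims
    verify: Callable[[str], bool],          # claim -> supported by the source?
) -> float:
    """Fraction of atomic claims in the response that the knowledge
    source supports (FActScore-style precision)."""
    claims = decompose(response)
    if not claims:
        return 0.0
    return sum(verify(claim) for claim in claims) / len(claims)
```

Joint evaluation then reports this precision alongside a reference-based recall like the one sketched earlier; comparing the two is what exposes the precision-recall gap the paper highlights.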

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes a factuality evaluation framework for long-form LLM outputs that jointly measures precision (verifying atomic claims against external sources) and recall (coverage of reference facts extracted from sources such as Wikipedia). It introduces an importance-aware weighting scheme based on relevance and salience to prioritize key facts. Analysis of current LLMs shows substantially higher precision than recall, indicating that factual incompleteness is a major limitation and that models cover important facts better than the full set of relevant ones.

Significance. If the metrics prove robust, the work could meaningfully advance factuality evaluation by moving beyond precision-only approaches, providing a more balanced assessment that highlights coverage gaps in long-form generation. The importance weighting offers a practical way to focus on salient facts, which may guide future model improvements if the reference construction is shown to be reliable.

major comments (1)
  1. [Evaluation Framework and Reference Fact Construction] The recall computation relies on reference facts extracted from Wikipedia (as described in the evaluation framework). For open-ended or specialized prompts, Wikipedia frequently omits fine-grained, emerging, or domain-specific facts that a complete answer should include; this incompleteness inflates the denominator while leaving the numerator unchanged for facts the model does produce, systematically depressing measured recall and undermining the central claim that 'factual incompleteness remains a major limitation.'

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our work. We address the major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: The recall computation relies on reference facts extracted from Wikipedia (as described in the evaluation framework). For open-ended or specialized prompts, Wikipedia frequently omits fine-grained, emerging, or domain-specific facts that a complete answer should include; this incompleteness inflates the denominator while leaving the numerator unchanged for facts the model does produce, systematically depressing measured recall and undermining the central claim that 'factual incompleteness remains a major limitation.'

    Authors: We appreciate the referee raising this point about reference construction. We agree that Wikipedia is not exhaustive and may omit fine-grained or emerging facts for certain specialized prompts; this is a genuine limitation of relying on any single external source. However, the described effect on the metric is inverted: omitting facts from the reference reduces the denominator, which increases (rather than depresses) measured recall. Thus our reported recall values would be optimistic upper bounds, which would only strengthen the central claim that factual incompleteness is a major limitation. Our experiments focus on prompts and topics for which Wikipedia provides reliable and reasonably complete coverage, consistent with standard practice in factuality benchmarks. In the revised manuscript we will add an explicit limitations subsection discussing the choice of knowledge sources, the conditions under which our reference construction is appropriate, and directions for extending the framework to multiple or domain-specific sources.

    revision: partial
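To make the rebuttal's direction-of-bias argument concrete, a toy calculation; the numbers are hypothetical, chosen only to show which way an incomplete reference set moves the metric.

```python
# A truly complete answer should cover 10 facts; the model states 4 of them.
true_recall = 4 / 10        # 0.40

# If Wikipedia records only 5 of the 10 facts, and the model's 4 are among
# them, the denominator shrinks while the numerator stays at 4:
measured_recall = 4 / 5     # 0.80

# Dropping any reference fact removes at most 1 from the numerator and
# exactly 1 from the denominator, so whenever recall < 1 the measured value
# can only rise: an incomplete reference makes recall look better, not worse.
assert measured_recall > true_recall
```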

Circularity Check

0 steps flagged

No circularity detected in factuality evaluation framework

full rationale

The paper's core contribution is an evaluation framework that constructs reference facts from independent external sources (Wikipedia) and computes precision and recall against LLM-generated text, with an added importance-weighting scheme. No parameters are fitted to the evaluation outputs in a way that renders the reported precision-recall gap tautological, and the central claim (precision substantially exceeds recall) follows directly from applying this externally grounded metric rather than from any self-referential definition or self-citation chain. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework depends on the domain assumption that reference facts extracted from external sources accurately represent what should be covered in a response.

axioms (1)
  • domain assumption: External knowledge sources such as Wikipedia contain the relevant facts needed to evaluate recall of generated responses.
    Used to construct reference facts and determine coverage in generated text.

pith-pipeline@v0.9.0 · 5463 in / 1065 out tokens · 36481 ms · 2026-05-13T20:16:31.287308+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 3 internal anchors

  1. [1]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.

  2. [2]

    Comprehensiveness Metrics for Automatic Evaluation of Factual Recall in Text Generation

    Adam Dejl, James Barry, Alessandra Pascale, and Javier Carnerero Cano. Comprehensiveness metrics for automatic evaluation of factual recall in text generation. arXiv preprint arXiv:2510.07926, 2025.

  3. [3]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

  4. [4]

    Clatter: Comprehensive entailment reasoning for hallucination detection

    Ron Eliav, Arie Cattan, Eran Hirsch, Shahaf Bassan, Elias Stengel-Eskin, Mohit Bansal, and Ido Dagan. CLATTER: Comprehensive entailment reasoning for hallucination detection. arXiv preprint arXiv:2506.05243, 2025.

  5. [5]

    MedScore: Generalizable factuality evaluation of free-form medical answers by domain-adapted claim decomposition and verification

    Heyuan Huang, Alexandra DeLucia, Vijay Murari Tiyyala, and Mark Dredze. MedScore: Generalizable factuality evaluation of free-form medical answers by domain-adapted claim decomposition and verification. arXiv preprint arXiv:2505.18452, 2025.

  6. [6]

    VeriFact: Enhancing long-form factuality evaluation with refined fact extraction and reference facts

    Xin Liu, Lechen Zhang, Sheza Munir, Yiyang Gu, and Lu Wang. VeriFact: Enhancing long-form factuality evaluation with refined fact extraction and reference facts. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 17919–17936, 2025.

  7. [7]

    FActScore: Fine-grained atomic evaluation of factual precision in long form text generation

    Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.emnlp-main.741.

  8. [8]

    GPT-4o-mini model

    OpenAI. GPT-4o-mini model. https://platform.openai.com/docs/models#gpt-4o-mini.

  9. [9]

    Long²RAG: Evaluating long-context & long-form retrieval-augmented generation with key point recall

    Zehan Qi, Rongwu Xu, Zhijiang Guo, Cunxiang Wang, Hao Zhang, and Wei Xu. Long²RAG: Evaluating long-context & long-form retrieval-augmented generation with key point recall. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 4852–4872, Miami, Florida, USA, November 2024. doi: 10.18653/v1/2024.findings-emnlp.279.

  10. [10]

    Beyond factual accuracy: Evaluating coverage of diverse factual information in long-form text generation

    Chris Samarinas, Alexander Krubner, Alireza Salemi, Youngwoo Kim, and Hamed Zamani. Beyond factual accuracy: Evaluating coverage of diverse factual information in long-form text generation. In Findings of the Association for Computational Linguistics.

  11. [11]

    VeriScore: Evaluating the factuality of verifiable claims in long-form text generation

    Yixiao Song, Yekyung Kim, and Mohit Iyyer. VeriScore: Evaluating the factuality of verifiable claims in long-form text generation. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 9447–9474, Miami, Florida, USA, November 2024. doi: 10.18653/v1/2024.findings-emnlp.552.

  12. [12]

    Qwen2.5: A party of foundation models

    Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/.

  13. [13]

    A comprehensive survey of hallucination mitigation techniques in large language models

    SM Tonmoy, SM Zaman, Vinija Jain, Anku Rani, Vipula Rawte, Aman Chadha, and Amitava Das. A comprehensive survey of hallucination mitigation techniques in large language models. arXiv preprint arXiv:2401.01313, 2024.

  14. [14]

    A closer look at claim decomposition

    Miriam Wanner, Seth Ebner, Zhengping Jiang, Mark Dredze, and Benjamin Van Durme. A closer look at claim decomposition. In Danushka Bollegala and Vered Shwartz (eds.), Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024), pp. 153–175, Mexico City, Mexico, June 2024. doi: 10.18653/v1/2024.starsem-1.13.

  15. [15]

    All claims are equal, but some claims are more equal than others: Importance-sensitive factuality evaluation

    Miriam Wanner, Leif Azzopardi, Paul Thomas, Soham Dan, Benjamin Van Durme, and Nick Craswell. All claims are equal, but some claims are more equal than others: Importance-sensitive factuality evaluation of LLM generations.

  16. [16]

    Long-form factuality in large language models

    Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, et al. Long-form factuality in large language models. arXiv preprint arXiv:2403.18802, 2024.
