Beyond Precision: Importance-Aware Recall for Factuality Evaluation in Long-Form LLM Generation
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-13 20:16 UTC · model grok-4.3
The pith
LLMs achieve high factual precision but substantially lower recall in long-form generation, leaving out many relevant facts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By constructing reference facts from external knowledge sources and applying importance-aware weighting for relevance and salience, the evaluation shows that current LLMs perform substantially better on precision than on recall: factual incompleteness remains a major limitation of long-form generation, and models cover highly important facts better than the full set of relevant facts.
What carries the argument
The importance-aware recall framework, which constructs a set of reference facts from external sources and weights them by relevance and salience to measure how completely generated text covers them.
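As a rough illustration of the machinery, the sketch below computes an importance-weighted recall over a toy reference set. It is a minimal reconstruction under our reading of the framework, not the authors' implementation; the fact list, importance values, and coverage flags are hypothetical.

```python
# Minimal sketch of importance-weighted recall (hypothetical, not the authors' code).
# Each reference fact carries an importance weight (e.g., combining relevance and salience)
# and a flag saying whether a coverage judge found it in the generated response.

from dataclasses import dataclass

@dataclass
class ReferenceFact:
    text: str
    importance: float   # imp(f_k; q), assumed in [0, 1]
    covered: bool       # 1[y_k^cov = COVERED], judged against the response

def weighted_recall(facts: list[ReferenceFact]) -> float:
    """Importance-weighted recall: covered importance mass over total importance mass."""
    total = sum(f.importance for f in facts)
    if total == 0.0:
        return 0.0
    covered = sum(f.importance for f in facts if f.covered)
    return covered / total

# Toy reference set for a biography prompt (illustrative values only).
facts = [
    ReferenceFact("Breakout role as Seth Cohen on The O.C.", importance=0.9, covered=True),
    ReferenceFact("Golden Globe nomination for Nobody Wants This", importance=0.7, covered=True),
    ReferenceFact("Critics' Choice Television Award win", importance=0.6, covered=False),
    ReferenceFact("Minor early-career guest appearances", importance=0.2, covered=False),
]

print(f"importance-weighted recall = {weighted_recall(facts):.3f}")  # 1.6 / 2.4 ≈ 0.667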
Load-bearing premise
External knowledge sources such as Wikipedia provide a sufficiently complete and unbiased set of reference facts against which generated responses can be evaluated for recall.
What would settle it
Running the same evaluation on human-written reference long-form articles for the same topics and checking whether the measured recall gap between LLMs and humans shrinks or disappears.
Original abstract
Evaluating the factuality of long-form output generated by large language models (LLMs) remains challenging, particularly when responses are open-ended and contain many fine-grained factual statements. Existing evaluation methods primarily focus on precision: they decompose a response into atomic claims and verify each claim against external knowledge sources such as Wikipedia. However, this overlooks an equally important dimension of factuality: recall, whether the generated response covers the relevant facts that should be included. We propose a comprehensive factuality evaluation framework that jointly measures precision and recall. Our method leverages external knowledge sources to construct reference facts and determine whether they are captured in generated text. We further introduce an importance-aware weighting scheme based on relevance and salience. Our analysis reveals that current LLMs perform substantially better on precision than on recall, suggesting that factual incompleteness remains a major limitation of long-form generation and that models are generally better at covering highly important facts than the full set of relevant facts.
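To make the two quantities concrete, here is one plausible formalization: the recall expression follows the fragment quoted in the Lean-theorem section below, while the precision expression is the standard atomic-claim form and is our paraphrase rather than the paper's exact definition.

```latex
% Precision over atomic claims c_1..c_m extracted from the response
% (standard atomic-claim form; our paraphrase, not necessarily the paper's exact definition)
\mathrm{Prec}(y, q) = \frac{1}{m}\sum_{j=1}^{m}
  \mathbf{1}\!\left[c_j \text{ is supported by the knowledge source}\right]

% Importance-weighted recall over reference facts f_1..f_n for query q
\mathrm{Rec}_{w}(F) =
  \frac{\sum_{k=1}^{n} \mathrm{imp}(f_k; q)\,\mathbf{1}\!\left[y_k^{\mathrm{cov}} = \text{COVERED}\right]}
       {\sum_{k=1}^{n} \mathrm{imp}(f_k; q)}
```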
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a factuality evaluation framework for long-form LLM outputs that jointly measures precision (verifying atomic claims against external sources) and recall (coverage of reference facts extracted from sources such as Wikipedia). It introduces an importance-aware weighting scheme based on relevance and salience to prioritize key facts. Analysis of current LLMs shows substantially higher precision than recall, indicating that factual incompleteness is a major limitation and that models cover important facts better than the full set of relevant ones.
Significance. If the metrics prove robust, the work could meaningfully advance factuality evaluation by moving beyond precision-only approaches, providing a more balanced assessment that highlights coverage gaps in long-form generation. The importance weighting offers a practical way to focus on salient facts, which may guide future model improvements if the reference construction is shown to be reliable.
major comments (1)
- [Evaluation Framework and Reference Fact Construction] The recall computation relies on reference facts extracted from Wikipedia (as described in the evaluation framework). For open-ended or specialized prompts, Wikipedia frequently omits fine-grained, emerging, or domain-specific facts that a complete answer should include; this incompleteness inflates the denominator while leaving the numerator unchanged for facts the model does produce, systematically depressing measured recall and undermining the central claim that 'factual incompleteness remains a major limitation.'
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address the major comment below and outline the revisions we will make to strengthen the manuscript.
Point-by-point responses
- Referee: The recall computation relies on reference facts extracted from Wikipedia (as described in the evaluation framework). For open-ended or specialized prompts, Wikipedia frequently omits fine-grained, emerging, or domain-specific facts that a complete answer should include; this incompleteness inflates the denominator while leaving the numerator unchanged for facts the model does produce, systematically depressing measured recall and undermining the central claim that 'factual incompleteness remains a major limitation.'
  Authors: We appreciate the referee raising this point about reference construction. We agree that Wikipedia is not exhaustive and may omit fine-grained or emerging facts for certain specialized prompts; this is a genuine limitation of relying on any single external source. However, the described effect on the metric is inverted: omitting facts from the reference reduces the denominator, which increases (rather than depresses) measured recall. Thus our reported recall values would be optimistic upper bounds, which would only strengthen the central claim that factual incompleteness is a major limitation. Our experiments focus on prompts and topics for which Wikipedia provides reliable and reasonably complete coverage, consistent with standard practice in factuality benchmarks. In the revised manuscript we will add an explicit limitations subsection discussing the choice of knowledge sources, the conditions under which our reference construction is appropriate, and directions for extending the framework to multiple or domain-specific sources.
  Revision: partial
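A quick arithmetic check of the direction the authors describe, with purely illustrative numbers (not from the paper):

```python
# Direction check for the rebuttal's claim (illustrative numbers, not from the paper).
# Suppose a "complete" reference would contain 10 facts and the model covers 4 of them.
true_recall = 4 / 10                      # 0.40 against the complete reference

# If Wikipedia omits 4 fine-grained facts that the model also fails to produce,
# the measured reference shrinks to 6 facts, of which the model still covers 4.
measured_recall = 4 / 6                   # ≈ 0.67 against the incomplete reference

assert measured_recall > true_recall      # an incomplete reference inflates, not depresses, recall
```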
Circularity Check
No circularity detected in factuality evaluation framework
full rationale
The paper's core contribution is an evaluation framework that constructs reference facts from independent external sources (Wikipedia) and computes precision and recall against LLM-generated text, with an added importance-weighting scheme. No parameters are fitted to the evaluation outputs in a way that renders the reported precision-recall gap tautological, and the central claim (precision substantially exceeds recall) follows directly from applying this externally grounded metric rather than from any self-referential definition or self-citation chain. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: External knowledge sources such as Wikipedia contain the relevant facts needed to evaluate the recall of generated responses.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · connection unclear. Matched passage: "We propose a comprehensive factuality evaluation framework that jointly measures precision and recall... importance-aware weighting scheme based on relevance and salience... Rec_w(F) = Σ_k imp(f_k; q) · 1[y_k^cov = COVERED] / Σ_k imp(f_k; q)"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · connection unclear. Matched passage: "reference fact set F(q) extracted from retrieved evidence E_q via VeriScore-style atomic fact extraction"
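The pipeline named in the matched passage (retrieve evidence for the query, extract atomic reference facts VeriScore-style, judge coverage against the response) could look roughly like the sketch below. The `retrieve` and `llm` callables and the prompts are hypothetical stand-ins, not the authors' implementation.

```python
# Sketch of the reference-side pipeline (our reading of the framework, not the authors' code).
from typing import Callable

def build_reference_facts(query: str,
                          retrieve: Callable[[str], list[str]],
                          llm: Callable[[str], str]) -> list[str]:
    """Extract atomic, verifiable reference facts from retrieved evidence passages."""
    facts: list[str] = []
    for passage in retrieve(query):
        prompt = (
            "Break the following passage into short, self-contained, verifiable facts, "
            "one per line. Only include facts relevant to the question.\n"
            f"Question: {query}\nPassage: {passage}\nFacts:"
        )
        facts.extend(line.strip("-• ").strip()
                     for line in llm(prompt).splitlines() if line.strip())
    return facts

def judge_coverage(fact: str, response: str, llm: Callable[[str], str]) -> bool:
    """Ask a judge model whether the generated response conveys the reference fact."""
    prompt = (
        "Does the response below state or clearly entail the following fact? "
        "Answer COVERED or NOT_COVERED.\n"
        f"Fact: {fact}\nResponse: {response}\nAnswer:"
    )
    return llm(prompt).strip().upper().startswith("COVERED")
```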
Reference graph
Works this paper leans on
- [1] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
- [2] Adam Dejl, James Barry, Alessandra Pascale, and Javier Carnerero Cano. Comprehensiveness metrics for automatic evaluation of factual recall in text generation. arXiv preprint arXiv:2510.07926.
- [3] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- [4] Ron Eliav, Arie Cattan, Eran Hirsch, Shahaf Bassan, Elias Stengel-Eskin, Mohit Bansal, and Ido Dagan. CLATTER: Comprehensive entailment reasoning for hallucination detection. arXiv preprint arXiv:2506.05243.
- [5] Heyuan Huang, Alexandra DeLucia, Vijay Murari Tiyyala, and Mark Dredze. MedScore: Generalizable factuality evaluation of free-form medical answers by domain-adapted claim decomposition and verification. arXiv preprint arXiv:2505.18452, 2025.
- [6] Xin Liu, Lechen Zhang, Sheza Munir, Yiyang Gu, and Lu Wang. VeriFact: Enhancing long-form factuality evaluation with refined fact extraction and reference facts. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 17919–17936, 2025.
- [7] Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.emnlp-main.741. URL https://aclanthology.org/2023.emnlp-main.741.
- [8] OpenAI. GPT-4o-mini model. https://platform.openai.com/docs/models#gpt-4o-mini.
- [9] Zehan Qi, Rongwu Xu, Zhijiang Guo, Cunxiang Wang, Hao Zhang, and Wei Xu. Long2RAG: Evaluating long-context & long-form retrieval-augmented generation with key point recall. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 4852–4872, Miami, Florida, USA, November 2024. doi: 10.18653/v1/2024.findings-emnlp.279. URL https://aclanthology.org/2024.findings-emnlp.279/.
- [10] Chris Samarinas, Alexander Krubner, Alireza Salemi, Youngwoo Kim, and Hamed Zamani. Beyond factual accuracy: Evaluating coverage of diverse factual information in long-form text generation. In Findings of the Association for Computational Linguistics.
- [11] Yixiao Song, Yekyung Kim, and Mohit Iyyer. VeriScore: Evaluating the factuality of verifiable claims in long-form text generation. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 9447–9474, Miami, Florida, USA, November 2024. doi: 10.18653/v1/2024.findings-emnlp.552. URL https://aclanthology.org/2024.findings-emnlp.552/.
- [12] Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/.
- [13] SM Tonmoy, SM Zaman, Vinija Jain, Anku Rani, Vipula Rawte, Aman Chadha, and Amitava Das. A comprehensive survey of hallucination mitigation techniques in large language models. arXiv preprint arXiv:2401.01313.
- [14] Miriam Wanner, Seth Ebner, Zhengping Jiang, Mark Dredze, and Benjamin Van Durme. A closer look at claim decomposition. In Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024), pp. 153–175, Mexico City, Mexico, June 2024. doi: 10.18653/v1/2024.starsem-1.13. URL https://aclanthology.org/2024.starsem-1.13/.
- [15] Miriam Wanner, Leif Azzopardi, Paul Thomas, Soham Dan, Benjamin Van Durme, and Nick Craswell. All claims are equal, but some claims are more equal than others: Importance-sensitive factuality evaluation.
- [16] Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, et al. Long-form factuality in large language models. arXiv preprint arXiv:2403.18802.
- [17] Example source passage: "... is an American actor. His breakout role was as Seth Cohen on the Fox television series The O.C. (2003–2007). For his performance as Noah in the Netflix romantic comedy series Nobody Wants This (2024), he earned a nomination for the Golden Globe Award for Best Actor in a Television Series (Musical/Comedy) and won the Critics' Choice Television Award for Be..."
- [18] Example extracted atomic facts: "• Adam Brody's breakout role was as Seth Cohen on the Fox television series The O.C. (2003–2007). • He earned a nomination for the Golden Globe Award for Best Actor in a Television Series (Musical/Comedy) for his performance as Noah in the Netflix romantic comedy series Nobody Wants This (2024). • Brody won the Critics' Choice Television Award for Best Ac..."