pith. machine review for the scientific record.

arxiv: 2510.07926 · v2 · submitted 2025-10-09 · 💻 cs.CL


Comprehensiveness Metrics for Automatic Evaluation of Factual Recall in Text Generation

classification 💻 cs.CL
keywords: comprehensiveness, LLMs, metrics, missing, across, end-to-end, evaluation, factual
abstract

Despite demonstrating remarkable performance across a wide range of tasks, large language models (LLMs) have also been found to frequently produce outputs that are incomplete or selectively omit key information. In sensitive domains, such omissions can result in significant harm comparable to that posed by factual inaccuracies, including hallucinations. In this study, we address the challenge of evaluating the comprehensiveness of LLM-generated texts, focusing on the detection of missing information or underrepresented viewpoints. We investigate three automated evaluation metrics: (1) an NLI-based method that decomposes texts into atomic statements and uses natural language inference (NLI) to identify missing facts, (2) a Q&A-based metric that extracts question-answer pairs and compares responses across sources, and (3) an end-to-end approach that directly identifies missing content using LLMs. Our experiments demonstrate the surprising effectiveness of the simple end-to-end metric compared to more complex metrics, though at the cost of reduced robustness, interpretability and result granularity. We further assess the comprehensiveness of responses from several popular open-weight LLMs when answering user queries based on multiple sources.
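The first metric described in the abstract, decomposing texts into atomic statements and checking which source facts the generated text entails, can be sketched as a recall-style score. The sketch below is hypothetical and not from the paper: a real implementation would use a trained NLI model for the entailment check, whereas here `entails` is a toy lexical-overlap stand-in so the sketch runs without model weights, and all function names and the threshold are illustrative assumptions.

```python
# Hypothetical sketch of an NLI-style comprehensiveness (factual recall) metric.
# NOTE: `entails` is a toy lexical-overlap proxy, NOT a real NLI model.

def entails(premise: str, hypothesis: str, threshold: float = 0.6) -> bool:
    """Toy entailment check: fraction of hypothesis words present in the premise."""
    premise_words = {w.strip(".,").lower() for w in premise.split()}
    hyp_words = [w.strip(".,").lower() for w in hypothesis.split()]
    if not hyp_words:
        return False
    return sum(w in premise_words for w in hyp_words) / len(hyp_words) >= threshold

def comprehensiveness(source_facts: list[str], generated_text: str) -> tuple[float, list[str]]:
    """Share of atomic source facts entailed by the generated text,
    plus the list of facts judged missing (the interpretable output
    that the end-to-end metric sacrifices)."""
    missing = [f for f in source_facts if not entails(generated_text, f)]
    covered = len(source_facts) - len(missing)
    score = covered / len(source_facts) if source_facts else 1.0
    return score, missing

# Illustrative example: one source fact is covered, one is omitted.
facts = [
    "the drug reduces blood pressure",
    "the drug causes dizziness in some patients",
]
summary = "The drug reduces blood pressure effectively."
score, missing = comprehensiveness(facts, summary)
# score -> 0.5, missing -> ["the drug causes dizziness in some patients"]
```

Unlike the end-to-end metric, this decomposition-based approach returns the specific omitted facts, which is the granularity and interpretability advantage the abstract attributes to the more complex metrics.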

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond Precision: Importance-Aware Recall for Factuality Evaluation in Long-Form LLM Generation

    cs.CL · 2026-04 · unverdicted · novelty 6.0

    An importance-aware recall metric for LLM factuality evaluation reveals models are better at avoiding false claims than covering all relevant facts.