Unifying Human and Statistical Evaluation for Natural Language Generation

Tatsunori B. Hashimoto, Hugh Zhang, Percy Liang


classification cs.CL, cs.AI, stat.ML

keywords evaluation, quality, diversity, human, HUSE, statistical, captures, error


How can we measure whether a natural language generation system produces both high quality and diverse outputs? Human evaluation captures quality but not diversity, as it does not catch models that simply plagiarize from the training set. On the other hand, statistical evaluation (i.e., perplexity) captures diversity but not quality, as models that occasionally emit low quality samples would be insufficiently penalized. In this paper, we propose a unified framework which evaluates both diversity and quality, based on the optimal error rate of predicting whether a sentence is human- or machine-generated. We demonstrate that this error rate can be efficiently estimated by combining human and statistical evaluation, using an evaluation metric which we call HUSE. On summarization and chit-chat dialogue, we show that (i) HUSE detects diversity defects which fool pure human evaluation and that (ii) techniques such as annealing for improving quality actually decrease HUSE due to decreased diversity.
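To make the central quantity concrete, here is a minimal sketch of estimating how well a classifier can tell human sentences from machine ones. The abstract does not say which estimator is used, so everything specific below is an assumption for illustration: each sentence is represented by two hypothetical features (a human judgment score and a model log-probability), the classifier is a leave-one-out k-nearest-neighbor vote, and the function names and toy data are invented. The score is twice the estimated error rate, so values near 0 mean the two sources are easy to tell apart and values near 1 mean they are close to indistinguishable.

```python
import math
from collections import Counter


def loo_knn_error(features, labels, k=3):
    """Leave-one-out error rate of a k-nearest-neighbor classifier.

    Assumption: k-NN stands in for the 'optimal' discriminator; the
    abstract does not name a specific estimator.
    """
    n = len(features)
    errors = 0
    for i in range(n):
        # Rank every other point by Euclidean distance to the held-out point.
        neighbors = sorted(
            (math.dist(features[i], features[j]), j)
            for j in range(n)
            if j != i
        )[:k]
        votes = Counter(labels[j] for _, j in neighbors)
        predicted = votes.most_common(1)[0][0]
        errors += predicted != labels[i]
    return errors / n


def discrimination_score(features, labels, k=3):
    """Twice the leave-one-out error rate of the human-vs-machine classifier."""
    return 2.0 * loo_knn_error(features, labels, k)


# Hypothetical toy data: (human judgment score, model log-probability).
features = [
    (5.0, -10.0), (4.8, -11.0), (5.1, -9.5), (4.9, -10.5),  # human-written
    (2.0, -3.0), (2.2, -2.5), (1.9, -3.5), (2.1, -2.8),     # machine-generated
]
labels = ["human"] * 4 + ["machine"] * 4

print(discrimination_score(features, labels, k=3))
```

In this toy data the two groups form well-separated clusters, so the classifier makes no leave-one-out mistakes and the score is at its minimum. In a real setting the two features sit on different scales and would likely need normalizing before distances are computed.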

