Why We Need New Evaluation Metrics for NLG

Amanda Cercas Curry; Jekaterina Novikova; Ondřej Dušek; Verena Rieser

arxiv: 1707.06875 · v1 · pith:LMEEBE27new · submitted 2017-07-21 · 💻 cs.CL

Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, Verena Rieser

classification 💻 cs.CL

keywords metrics, automatic, evaluation, system, need, novel, bleu, cases

The majority of NLG evaluation relies on automatic metrics, such as BLEU. In this paper, we motivate the need for novel, system- and data-independent automatic evaluation methods: We investigate a wide range of metrics, including state-of-the-art word-based and novel grammar-based ones, and demonstrate that they only weakly reflect human judgements of system outputs as generated by data-driven, end-to-end NLG. We also show that metric performance is data- and system-specific. Nevertheless, our results also suggest that automatic metrics perform reliably at system-level and can support system development by finding cases where a system performs poorly.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction
cs.CL 2026-06 unverdicted novelty 6.0

MedGuards introduces a multi-agent in-context learning framework for medical error detection and correction plus the KPCS metric, reporting improvements on four multilingual clinical note datasets.

MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction
cs.CL 2026-06 unverdicted novelty 5.0

MedGuards proposes a multi-agent system for medical error detection and correction plus the KPCS metric, reporting gains on four multilingual clinical-note datasets.

A Pilot Study on Curator-Guided Multilingual Art Description for Blind and Low-Vision Audiences with Small Vision-Language Models
cs.MM 2026-05 unverdicted novelty 5.0

Pilot evaluation of language-specific versus multilingual LoRA adapters on Qwen2.5-VL-3B for curator-guided BLV art descriptions in three languages.

Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering
cs.CL 2026-05 unverdicted novelty 5.0

Mainstream UQ for LLMs reduces to unsupervised clustering of internal generation consistency and therefore cannot detect confident hallucinations or provide reliable safety signals.