Why We Need New Evaluation Metrics for NLG
The majority of NLG evaluation relies on automatic metrics, such as BLEU. In this paper, we motivate the need for novel, system- and data-independent automatic evaluation methods: We investigate a wide range of metrics, including state-of-the-art word-based and novel grammar-based ones, and demonstrate that they only weakly reflect human judgements of system outputs as generated by data-driven, end-to-end NLG. We also show that metric performance is data- and system-specific. Nevertheless, our results also suggest that automatic metrics perform reliably at system-level and can support system development by finding cases where a system performs poorly.
Forward citations
Cited by 4 Pith papers
-
MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction
MedGuards introduces a multi-agent in-context learning framework for medical error detection and correction plus the KPCS metric, reporting improvements on four multilingual clinical note datasets.
-
MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction
MedGuards proposes a multi-agent system for medical error detection and correction plus the KPCS metric, reporting gains on four multilingual clinical-note datasets.
-
A Pilot Study on Curator-Guided Multilingual Art Description for Blind and Low-Vision Audiences with Small Vision-Language Models
Pilot evaluation of language-specific versus multilingual LoRA adapters on Qwen2.5-VL-3B for curator-guided BLV art descriptions in three languages.
-
Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering
Mainstream UQ for LLMs reduces to unsupervised clustering of internal generation consistency and therefore cannot detect confident hallucinations or provide reliable safety signals.