pith. machine review for the scientific record.

arxiv: 1603.08023 · v2 · submitted 2016-03-25 · 💻 cs.CL · cs.AI · cs.LG · cs.NE

Recognition: unknown

How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

Authors on Pith: no claims yet
classification 💻 cs.CL · cs.AI · cs.LG · cs.NE
keywords metrics · response · dialogue · evaluation · generation · domain · systems · adopted
0 comments
read the original abstract

We investigate evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available. Recent works in response generation have adopted metrics from machine translation to compare a model's generated response to a single target response. We show that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain. We provide quantitative and qualitative results highlighting specific weaknesses in existing metrics, and provide recommendations for future development of better automatic evaluation metrics for dialogue systems.
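A minimal sketch of the comparison the abstract describes, assuming NLTK's sentence-level BLEU as the adopted machine-translation metric and SciPy's Spearman rank correlation against human ratings; the three (reference, response, rating) examples below are hypothetical placeholders, not the paper's data.

    # Score each generated response against its single reference with an
    # MT metric (sentence-level BLEU), then check how well those scores
    # track human judgements of response quality.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
    from scipy.stats import spearmanr

    # (reference tokens, model response tokens, human rating) -- illustrative only
    examples = [
        ("try sudo apt-get update first".split(), "run sudo apt-get update".split(), 4.0),
        ("what error do you see ?".split(), "have you tried rebooting ?".split(), 2.0),
        ("haha same here".split(), "lol me too".split(), 3.5),
    ]

    smooth = SmoothingFunction().method1  # avoid zero BLEU on short responses
    bleu_scores = [sentence_bleu([ref], hyp, smoothing_function=smooth)
                   for ref, hyp, _ in examples]
    human_scores = [rating for _, _, rating in examples]

    # Rank correlation between the metric and human judgements; the paper
    # finds this is weak on Twitter and near zero on Ubuntu.
    rho, p_value = spearmanr(bleu_scores, human_scores)
    print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")

A single target response is the crux: a perfectly good dialogue reply can share no n-grams with the one reference, so word-overlap metrics like BLEU assign it a low score regardless of quality.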

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Breakdowns in Conversational AI: Interactional Failures in Emotionally and Ethically Sensitive Contexts

cs.CL · 2026-04 · unverdicted · novelty 5.0

    Mainstream conversational models show escalating affective misalignments and ethical guidance failures during staged emotional trajectories, organized into a taxonomy of interactional breakdowns.