Evaluation Revisited: A Taxonomy of Evaluation Concerns in Natural Language Processing

· 2026 · cs.CL · arXiv 2604.25923

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

open full Pith review browse 1 citing papers arXiv PDF

abstract

Recent advances in large language models (LLMs) have prompted a growing body of work that questions the methodology of prevailing evaluation practices. However, many such critiques have already been extensively debated in natural language processing (NLP): a field with a long history of methodological reflection on evaluation. We conduct a scoping review of research on evaluation concerns in NLP and develop a taxonomy, synthesizing recurring positions and trade-offs within each area. We also discuss practical implications of the taxonomy, including a structured checklist to support more deliberate evaluation design and interpretation. By situating contemporary debates within their historical context, this work provides a consolidated reference for reasoning about evaluation practices.

representative citing papers

The Harder Text Embedding Benchmark (HTEB): Beyond One-dimensional Static Robustness

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

HTEB introduces dynamic, multi-axis evaluation of text embedding robustness using LLM transformations, finding decoupled profiles across models and that scaling does not close all robustness gaps.

citing papers explorer

Showing 1 of 1 citing paper.

The Harder Text Embedding Benchmark (HTEB): Beyond One-dimensional Static Robustness cs.CL · 2026-05-27 · unverdicted · none · ref 11 · internal anchor
HTEB introduces dynamic, multi-axis evaluation of text embedding robustness using LLM transformations, finding decoupled profiles across models and that scaling does not close all robustness gaps.

Evaluation Revisited: A Taxonomy of Evaluation Concerns in Natural Language Processing

fields

years

verdicts

representative citing papers

citing papers explorer