A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text
4 Pith papers cite this work. Polarity classification is still indexing.
years
2026 4representative citing papers
Qualitative study of 19 practitioners reveals ten LLM product evaluation practices and introduces the results-actionability gap as a key barrier to turning findings into improvements.
Introduces LLM-FACETS, a privacy-preserving open-source framework for LLM evaluation using deterministic metrics locally, LLM-judge metrics with user-controlled APIs, and mechanisms for uncertainty visualization and hallucination detection.
Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.
citing papers explorer
-
Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity
A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
-
Results-Actionability Gap: Understanding How Practitioners Evaluate LLM Products in the Wild
Qualitative study of 19 practitioners reveals ten LLM product evaluation practices and introduces the results-actionability gap as a key barrier to turning findings into improvements.
-
LLM-FACETS: A Privacy-Preserving Framework for Evaluating LLM Transparency and Accountability
Introduces LLM-FACETS, a privacy-preserving open-source framework for LLM evaluation using deterministic metrics locally, LLM-judge metrics with user-controlled APIs, and mechanisms for uncertainty visualization and hallucination detection.
-
Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility
Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.