Human readers prefer human literary translations over AI-generated ones for immersion and clarity despite finding MT adequate and struggling to identify the source.
Stop uploading test data in plain text: Practical strategies for mitigating data contamination by evaluation benchmarks
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
EnactToM is an evolving benchmark of embodied multi-agent tasks that tests functional Theory of Mind by requiring agents to act optimally on implicit beliefs in partially observable 3D environments.
ATLAS is a length-dependent benchmarking framework that evaluates 26 models on 8 capability dimensions and shows substantial rank changes when moving from 128K to 1M token ranges.
Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.
A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment approaches.
citing papers explorer
-
Benchmark Data Contamination of Large Language Models: A Survey
A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment approaches.