Human readers prefer human literary translations over AI-generated ones for immersion and clarity despite finding MT adequate and struggling to identify the source.
Stop uploading test data in plain text: Practical strategies for mitigating data contamination by evaluation benchmarks
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
EnactToM is an evolving benchmark of embodied multi-agent tasks that tests functional Theory of Mind by requiring agents to act optimally on implicit beliefs in partially observable 3D environments.
ATLAS is a length-dependent benchmarking framework that evaluates 26 models on 8 capability dimensions and shows substantial rank changes when moving from 128K to 1M token ranges.
Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.
A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment approaches.
citing papers explorer
-
AI translation of literary texts is "fine", but readers still prefer human translations
Human readers prefer human literary translations over AI-generated ones for immersion and clarity despite finding MT adequate and struggling to identify the source.
-
EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents
EnactToM is an evolving benchmark of embodied multi-agent tasks that tests functional Theory of Mind by requiring agents to act optimally on implicit beliefs in partially observable 3D environments.
-
ATLAS: All-round Testing of Long-context Abilities across Scales
ATLAS is a length-dependent benchmarking framework that evaluates 26 models on 8 capability dimensions and shows substantial rank changes when moving from 128K to 1M token ranges.
-
Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility
Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.