Stop uploading test data in plain text: Practical strategies for mitigating data contamination by evaluation benchmarks

Alon Jacovi, Avi Caciularu, Omer Goldman, Yoav Goldberg · 2023 · DOI 10.18653/v1/2023.emnlp-main.308

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

AI translation of literary texts is "fine", but readers still prefer human translations

cs.CL · 2026-06-24 · unverdicted · novelty 6.0

Human readers prefer human literary translations over AI-generated ones for immersion and clarity despite finding MT adequate and struggling to identify the source.

EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents

cs.AI · 2026-05-11 · conditional · novelty 6.0 · 2 refs

EnactToM is an evolving benchmark of embodied multi-agent tasks that tests functional Theory of Mind by requiring agents to act optimally on implicit beliefs in partially observable 3D environments.

ATLAS: All-round Testing of Long-context Abilities across Scales

cs.CL · 2026-05-27 · unverdicted · novelty 5.0

ATLAS is a length-dependent benchmarking framework that evaluates 26 models on 8 capability dimensions and shows substantial rank changes when moving from 128K to 1M token ranges.

Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility

cs.LG · 2026-05-07 · unverdicted · novelty 4.0 · 2 refs

Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.

Benchmark Data Contamination of Large Language Models: A Survey

cs.CL · 2024-06-06 · unverdicted · novelty 3.0

A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment approaches.

citing papers explorer

Showing 4 of 4 citing papers after filters.

AI translation of literary texts is "fine", but readers still prefer human translations

cs.CL · 2026-06-24 · unverdicted · none · ref 103

EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents

cs.AI · 2026-05-11 · conditional · none · ref 23 · 2 links

ATLAS: All-round Testing of Long-context Abilities across Scales

cs.CL · 2026-05-27 · unverdicted · none · ref 14

Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility

cs.LG · 2026-05-07 · unverdicted · none · ref 69 · 2 links
