TSFMAudit detects pretraining contamination in time series foundation models via probe adaptation dynamics (faster loss drop, smaller backbone shift), tested on 6 models and 187 datasets against 10 LLM-derived baselines.
hub
Benchmark Data Contamination of Large Language Models: A Survey
28 Pith papers cite this work. Polarity classification is still indexing.
abstract
The rapid development of Large Language Models (LLMs) like GPT-4, Claude-3, and Gemini has transformed the field of natural language processing. However, it has also resulted in a significant issue known as Benchmark Data Contamination (BDC). This occurs when language models inadvertently incorporate evaluation benchmark information from their training data, leading to inaccurate or unreliable performance during the evaluation phase of the process. This paper reviews the complex challenge of BDC in LLM evaluation and explores alternative assessment methods to mitigate the risks associated with traditional benchmarks. The paper also examines challenges and future directions in mitigating BDC risks, highlighting the complexity of the issue and the need for innovative solutions to ensure the reliability of LLM evaluation in real-world applications.
hub tools
citation-role summary
citation-polarity summary
roles
background 3polarities
background 3representative citing papers
AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.
JECS aggregates per-model conformal p-values via their maximum and reconstructs a conservative envelope of the max-p null distribution to select benchmarks with global contamination rate control.
LLM generative error correction improves low-resource Frisian ASR performance, with comparable gains on a contamination-controlled offline dataset confirming true correction ability.
A new watermarking method for closed LLMs boosts random word-pair co-occurrences via rephrasing and detects the signal statistically in outputs, working reliably even when the watermarked data is only 1% of fine-tuning tokens while preserving utility.
A graphlet-anchored framework generates 119,856 factually grounded biomedical QA pairs that improve accuracy on PubMedQA and MedQA benchmarks.
A controlled formal language task reveals fine-tuning outperforms in-context learning on in-distribution generalization but equals it on out-of-distribution, with ICL showing greater sensitivity to model size and tokenization.
A new auditing framework reveals widespread behavioral entanglement among LLMs and shows that reweighting ensembles based on measured independence improves verification accuracy by up to 4.5%.
LiveFact is a new time-aware benchmark that evaluates LLMs on reasoning with dynamic and incomplete information for fake news detection, identifying a significant reasoning gap in model behavior.
The paper delivers a taxonomy of seven LLM study types in software engineering along with eight guidelines that separate mandatory requirements from recommended practices to address reproducibility challenges.
Models benchmarking as principal-agent game, derives welfare loss from welfare alignment, improvability and variance, and applies an audit framework to OLMES items.
ReverseMath uses answer inversion to generate paired original and reversed math problems with known answers for detecting memorization and improving LLM reasoning via data augmentation.
An empirical study of 57 ML evaluation harnesses shows 41.4% of operational issues occur in the specification stage, driven mainly by unimplemented features, documentation gaps, and missing input validation.
TRACER presents a semantic-aware framework and the first benchmark for fine-grained code contamination detection across three levels of overlap, reporting F1 scores of 0.91-0.92 and large gains over prior methods.
First unified survey formalizing Pretraining Data Exposure across exposure levels and reviewing attack, defense, and contamination methods for LLMs.
NeuroState-Bench supplies human-calibrated tasks and probes that measure commitment integrity in LLM agents and shows this measure diverges from ordinary task success.
ActuBench is a multi-agent LLM pipeline for generating and evaluating actuarial reasoning tasks, with evaluations of 50 models showing effective verification, competitive local open-weights models, and differing rankings between MCQ and LLM-judge scoring.
Ultra-compact 8-30M parameter models start contextually grounded responses on-device while cloud models seamlessly continue them, enabling responsive AI on power-constrained hardware.
SWE-Bench Pro is a new benchmark with 1,865 long-horizon tasks from 41 repositories designed to evaluate AI agents on realistic enterprise-level software engineering problems beyond prior benchmarks.
League of LLMs organizes LLMs into a self-governed mutual evaluation league using dynamic, transparent, objective, and professional criteria to distinguish model capabilities with 70.7% top-k ranking stability.
Empirical study of eight LLMs finds overuse of popular libraries like NumPy in up to 45% of unnecessary cases and strong default preference for Python even when suboptimal.
ZCP detects direct and evasive data contamination in LLMs by truncating CoT reasoning and contrasting zero-CoT accuracy on original versus perturbed isomorphic datasets, plus a Contamination Confidence metric.
Authors call for contamination-resistant LLM benchmarks that exploit Transformer training-inference asymmetry and require new mathematical methods for cross-architecture interoperability.
Compiled AI generates deterministic code artifacts from LLMs in a one-time compilation step, enabling reliable workflow execution with zero runtime tokens after break-even.