arxiv: 2606.03305 · v1 · pith:ZM3OD6UNnew · submitted 2026-06-02 · 💻 cs.AI

The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection

Pith reviewed 2026-06-28 09:46 UTC · model grok-4.3

classification 💻 cs.AI
keywords benchmark contamination · LLM auditing · distribution shift · dataset inference · data provenance · contamination detection · model evaluation

The pith

Statistical methods for detecting LLM benchmark contamination fail in realistic auditing due to distribution shift and scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests three contamination detection approaches on 27 models spanning multiple families and sizes up to 27B parameters, including frontier industry models. It runs 335 evaluations and finds only 199 produce correct results. LLM Dataset Inference yields false positives when suspect and validation sets are not identically distributed. Post-Hoc Dataset Inference lacks statistical power because benchmarks are tiny relative to pre-training corpora. CoDeC supplies only coarse signals that cannot confirm whether a specific benchmark split was seen during training. These patterns show that current tools cannot reliably replace direct knowledge of training data provenance in practical settings.

Core claim

Across 335 evaluations, only 199 yield correct outcomes. LLM Dataset Inference results in false positives under distribution shift, Post-Hoc Dataset Inference is underpowered at benchmark scale, and CoDeC provides only coarse provenance signals that are insufficient to verify individual benchmark splits.
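As a rough illustration of what the headline count implies, the sketch below turns 199 correct outcomes out of 335 into an accuracy figure with a Wilson score interval. The two counts come from the paper. The interval itself is an assumption added here for illustration: it treats the 335 evaluations as independent trials, which they likely are not, since many of them share models and benchmarks.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half

# Counts reported in the paper; independence of trials is an assumption of this sketch.
correct, total = 199, 335
accuracy = correct / total
low, high = wilson_interval(correct, total)
print(f"accuracy = {accuracy:.3f}, 95% Wilson interval = [{low:.3f}, {high:.3f}]")
```

The point estimate is about 59.4%. Under the independence assumption the interval should be read as a loose indication of uncertainty, not as a result from the paper.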

What carries the argument

The argument rests on three detection paradigms (LLM Dataset Inference, Post-Hoc Dataset Inference, and CoDeC), each evaluated for robustness against distribution shift between suspect and validation sets and against the small size of benchmarks relative to pre-training data.

If this is right

  • Statistical detection cannot yet replace transparent data provenance for confirming benchmark validity.
  • Methods must handle cases where suspect and validation sets violate the IID assumption.
  • Benchmark size limits the power of post-hoc membership inference at realistic scales.
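The IID point can be made concrete with a small simulation. The sketch below is a deliberately simplified stand-in, not the LLM Dataset Inference procedure itself: it assumes a single hypothetical per-example score (think of a loss value), draws a suspect set and a validation set that were both never trained on, and applies a one-sided Welch test. The score distributions, the shift size, and all function names are assumptions made for illustration.

```python
import math
import random
import statistics

def welch_t(a, b):
    """Welch t statistic for mean(a) - mean(b)."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va / len(a) + vb / len(b))

def false_positive_rate(shift, n=200, trials=500, z_crit=1.645, seed=0):
    """Fraction of trials in which a one-sided test flags a non-member suspect set."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(trials):
        # Neither set was trained on; the suspect set is only "easier" by `shift`.
        suspect = [rng.gauss(3.0 - shift, 1.0) for _ in range(n)]
        validation = [rng.gauss(3.0, 1.0) for _ in range(n)]
        # Flag when suspect scores are significantly lower than validation scores
        # (normal critical value used as an approximation for large n).
        if welch_t(suspect, validation) < -z_crit:
            flagged += 1
    return flagged / trials

iid_rate = false_positive_rate(shift=0.0)
shifted_rate = false_positive_rate(shift=0.3)
print(f"flag rate, IID sets: {iid_rate:.2f}; flag rate, shifted sets: {shifted_rate:.2f}")
```

When the two sets share a distribution, the flag rate stays near the nominal level of the test. A modest shift in the score distribution is enough to make the test flag most non-member suspect sets, which is the false-positive pattern the paper attributes to IID violations.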

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Auditing pipelines could combine statistical signals with provenance logs to compensate for individual method weaknesses.
  • Specialized models in medicine or culture may require domain-specific calibration of detection thresholds.
  • Releasing the evaluation benchmark enables direct comparison of new detection algorithms against the identified failure cases.

Load-bearing premise

The assumed ground truth labels for whether contamination occurred in each of the 27 models are accurate enough to label the 335 detection outcomes as correct or incorrect.

What would settle it

Independent verification of actual training-data membership for the frontier models that contradicts the ground-truth contamination labels used to score the 335 outcomes.

Figures

Figures reproduced from arXiv: 2606.03305 by Jan Dubiński, Sebastian Cygert, Wojciech Zarzecki.

Figure 1: Summary of our evaluation of methods for detecting whether a model was trained on a benchmark, across three tasks: Task 1 evaluates vulnerability to limited reference data (…
Figure 2: CoDeC scores for Task 2. CoDeC contamination scores for benchmarks. Hatched bars indicate that the split was used as a training split. A trend across model sizes is visible; however, training splits do not receive higher scores. … provide clear evidence of split-level membership. CoDeC scores decrease consistently with model size, but they are indifferent to whether a given split of the benchmark was use…
Figure 3: Application of CoDeC and LLM Dataset Inference to industry mod…
Figure 4: CoDeC contamination scores on Pythia across multiple datasets. Evaluation-only benchmarks consistently receive lower scores than pre-training corpora, reproducing the separation reported in prior work. However, train and test splits within the same corpus yield nearly identical scores, indicating that CoDeC cannot distinguish split-level membership. CoDeC. CoDeC exhibits a different limitation than the D…
Figure 5: CoDeC scores for OLMo 2 (instruction-tuned) grouped by data…
Original abstract

Benchmark contamination, where evaluation examples appear in a model's training data, threatens the validity of LLM assessment. Statistical tools for detecting training-data membership exist, but have been validated almost exclusively in controlled academic regimes: large, homogeneous pre-training corpora and transparent, single-stage training pipelines. Whether these methods remain reliable in realistic auditing scenarios remains unclear. We identify two under-studied failure modes: distribution shift, which arises when suspect and validation sets violate the IID assumption, and scale constraints, which arise because benchmarks are orders of magnitude smaller than pre-training corpora. We systematically evaluate three leading paradigms (LLM Dataset Inference, Post-Hoc Dataset Inference, and CoDeC) across 27 models from multiple families (including Pythia, OLMo 2, and specialised cultural and medical LLMs) and scales (up to 27B). We then further extend our analysis to frontier industry models. Across 335 evaluations, only 199 yield correct outcomes. LLM Dataset Inference results in false positives under distribution shift, Post-Hoc Dataset Inference is underpowered at benchmark scale, and CoDeC provides only coarse provenance signals that are insufficient to verify individual benchmark splits. Our results reveal a systematic reliability gap between controlled validation and practical benchmark auditing, and show that statistical detection cannot yet replace transparent data provenance. We open-source our benchmark for further research.
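The scale constraint named in the abstract can be illustrated with a textbook power calculation. The sketch below uses the normal approximation for a one-sided two-sample test. It is not the Post-Hoc Dataset Inference procedure, and the standardized effect size of 0.05 is a hypothetical value chosen only to show how power depends on the number of examples available.

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def approx_power(effect_size: float, n_per_group: int, z_crit: float = 1.645) -> float:
    """Normal-approximation power of a one-sided two-sample test at alpha of about 0.05."""
    noncentrality = effect_size * math.sqrt(n_per_group / 2.0)
    return normal_cdf(noncentrality - z_crit)

effect = 0.05  # hypothetical standardized gap between member and non-member scores
for n in (100, 1_000, 10_000, 100_000):
    print(f"n = {n:>7}: power ~ {approx_power(effect, n):.2f}")
```

With a small per-example signal, a test over a few hundred or a few thousand examples rarely rejects, while the same signal becomes easy to detect over corpus-sized samples. That asymmetry is one plausible reading of why a method validated on large pre-training corpora loses power on benchmark-sized sets.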

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates three statistical paradigms for detecting LLM benchmark contamination (LLM Dataset Inference, Post-Hoc Dataset Inference, and CoDeC) across 27 models spanning multiple families and scales up to 27B, with extension to frontier industry models. It reports that only 199 of 335 evaluations produce correct outcomes, attributing specific failure modes—false positives under distribution shift for LLM Dataset Inference, underpowered detection at benchmark scale for Post-Hoc, and only coarse provenance signals for CoDeC—and concludes that statistical detection cannot yet replace transparent data provenance.

Significance. If the empirical results hold, the work is significant for documenting a practical reliability gap between controlled validation settings and realistic auditing scenarios. Strengths include the systematic multi-method, multi-family, multi-scale evaluation (including industry models) and the open-sourcing of the benchmark, which supports reproducibility and follow-on work. The concrete counts (199/335) provide falsifiable, quantitative evidence rather than purely theoretical claims.

major comments (2)
  1. [Methodology / Experimental Setup (ground-truth labeling subsection)] The central claim—that only 199/335 outcomes are correct and that specific failure modes exist—rests entirely on per-model ground-truth contamination labels for the 27 models (including frontier industry models). No section details the exact procedure, proxies, or heuristics used to establish these labels for models where training data membership cannot be directly observed; this is load-bearing because any systematic error in the labels would artifactually produce the reported false-positive rates and underpoweredness conclusions.
  2. [Results (aggregate outcomes and per-method breakdowns)] Table or results section reporting the 199/335 aggregate: the manuscript provides no sensitivity analysis or alternative labelings to show how the headline count and per-method failure-mode attributions change under plausible variations in ground-truth assignment for the industry models.
minor comments (2)
  1. [Abstract] Abstract states the 335 evaluations and 199 correct outcomes but does not define 'correct outcome' or reference the ground-truth procedure, reducing standalone clarity.
  2. [Introduction / Background] Notation for the three methods is introduced without a consolidated table of acronyms and key assumptions, which would aid comparison across sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and for highlighting these important methodological points. We agree that the ground-truth labeling procedures and robustness of the 199/335 aggregate require clearer documentation. We address each comment below and commit to revisions that strengthen the manuscript without altering its core claims.

  1. Referee: [Methodology / Experimental Setup (ground-truth labeling subsection)] The central claim—that only 199/335 outcomes are correct and that specific failure modes exist—rests entirely on per-model ground-truth contamination labels for the 27 models (including frontier industry models). No section details the exact procedure, proxies, or heuristics used to establish these labels for models where training data membership cannot be directly observed; this is load-bearing because any systematic error in the labels would artifactually produce the reported false-positive rates and underpoweredness conclusions.

    Authors: We agree that explicit documentation of the labeling procedure is essential. For the open-weight models (Pythia, OLMo 2, and the specialised cultural/medical models), labels are derived directly from their publicly released training data documentation and pre-training corpus descriptions. For the frontier industry models, labels combine official model cards, stated training data cutoffs, and cross-references to independent contamination reports in the literature. In the revised manuscript we will add a dedicated subsection under Methodology that enumerates these sources, the decision rules applied to ambiguous cases, and any limitations of the proxies. This addition will make the load-bearing assumptions transparent to readers. revision: yes

  2. Referee: [Results (aggregate outcomes and per-method breakdowns)] Table or results section reporting the 199/335 aggregate: the manuscript provides no sensitivity analysis or alternative labelings to show how the headline count and per-method failure-mode attributions change under plausible variations in ground-truth assignment for the industry models.

    Authors: We concur that a sensitivity analysis would increase confidence in the headline count. Because alternative labelings for industry models would rest on additional untestable assumptions, a full quantitative sensitivity table is not feasible without introducing speculative scenarios. We will add a concise discussion in the Results section that (a) states the sources of label uncertainty for the industry models and (b) qualitatively assesses how plausible mislabelings would affect the reported failure-mode attributions. If the referee can suggest concrete alternative label sets, we will incorporate the corresponding quantitative checks. revision: partial
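One way the sensitivity check discussed in this exchange could be run is sketched below. Everything in it is hypothetical: the six toy records, the `label_is_uncertain` flag, and the helper name are invented for illustration and do not come from the paper's data. The idea is to recompute the number of correct outcomes under every combination of flipped labels among the uncertain cases and report the resulting range.

```python
from itertools import combinations

# Hypothetical toy records, not the paper's data:
# (detector_says_contaminated, assumed_label, label_is_uncertain)
records = [
    (True, True, False),
    (False, False, False),
    (True, False, False),
    (True, True, True),
    (False, True, True),
    (True, False, True),
]

def correct_count(rows, flipped=frozenset()):
    """Count outcomes that match the label, flipping labels at the given indices."""
    total = 0
    for i, (pred, label, _) in enumerate(rows):
        effective = (not label) if i in flipped else label
        total += pred == effective
    return total

# Indices whose ground-truth label is treated as uncertain (e.g. industry models).
uncertain = [i for i, row in enumerate(records) if row[2]]
counts = [
    correct_count(records, frozenset(subset))
    for k in range(len(uncertain) + 1)
    for subset in combinations(uncertain, k)
]
baseline = correct_count(records)
print(f"baseline correct: {baseline}/{len(records)}")
print(f"range under label flips: {min(counts)} to {max(counts)}")
```

Exhaustive enumeration grows exponentially with the number of uncertain labels, so a real analysis would more plausibly sample flip patterns or bound the count analytically. The sketch only shows the shape of the check the referee asks for.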

Circularity Check

0 steps flagged

Empirical evaluation study with external ground truth; no derivations or self-referential reductions

The paper performs an empirical audit of three contamination detection methods across 335 evaluations on 27 models, classifying outcomes as correct/incorrect by direct comparison to per-model ground truth labels on contamination status. No equations, fitted parameters, or derivations are present that could reduce predictions to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked. The central claims rest on external labels rather than self-referential quantities, satisfying the self-contained-against-external-benchmarks criterion for a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Empirical benchmarking study with no mathematical derivations; relies on standard assumptions about contamination ground truth and IID violations.

axioms (1)
  • domain assumption: Ground-truth contamination status can be reliably determined for the evaluated models, including frontier industry models.
    The classification of the 335 outcomes as correct or incorrect depends on this.
