Benchmarking benchmark leakage in large language models · 2024 · arXiv preprint arXiv:2404.18824

19 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Meta-Benchmarks for Financial-Services LLM Evaluation

cs.AI · 2026-07-02 · unverdicted · novelty 7.0

A meta-benchmarking framework organizes 452 LLM benchmarks into 41 O*NET Generalized Work Activities and 38 BIAN domains, using discrimination-coverage-recency weights to scale K-factors in an Elo tournament for comparable financial-services scores.

OpenFinGym: A Verifiable Multi-Task Gym Environment for Evaluating Quant Agents

cs.AI · 2026-06-24 · unverdicted · novelty 7.0

OpenFinGym is a multi-task verifiable gym environment for quant-finance agents with automated task construction from publications, containerised runtime, paper trading engine, and support for SFT/RL training.

Bridging Functional Correctness and Runtime Efficiency Gaps in LLM-Based Code Translation

cs.CL · 2026-06-16 · unverdicted · novelty 7.0

SwiftTrans improves both functional correctness and runtime efficiency of LLM code translations via multi-perspective exploration with hierarchical guidance and difference-aware selection with ordinal guidance on extended benchmarks including new SwiftBench.

Can AI Agents Synthesize Scientific Conclusions?

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.

LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs

cs.AI · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

LGMT is a logic-grounded metamorphic testing framework that detects hidden reasoning defects in LLMs by checking consistency on semantically invariant inputs derived from FOL equivalences.

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.

AutoRISE: Agent-Driven Strategy Evolution for Red-Teaming Large Language Models

cs.CR · 2026-04-23 · unverdicted · novelty 7.0

AutoRISE evolves red-teaming attack strategies as editable executable programs via an agent, yielding 17-point higher average attack success rates than baselines across 11 models.

When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors

cs.CL · 2026-04-23 · unverdicted · novelty 7.0

New RPS and AGS metrics show within-family distilled LLM agents have 5.9 pp higher tool-use graph similarity than cross-family pairs, with some models exceeding their teachers.

MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events

cs.CL · 2026-04-16 · unverdicted · novelty 7.0

MADE creates a contamination-resistant living benchmark for multi-label classification of medical device adverse events, with evaluations revealing model-specific trade-offs in accuracy and uncertainty quantification.

Large Reasoning Models Are (Not Yet) Multilingual Latent Reasoners

cs.CL · 2026-01-06 · unverdicted · novelty 7.0

Large reasoning models exhibit multilingual latent reasoning that is uneven across languages but internally consistent and English-centered.

LLMs Judge Themselves: A Game-Theoretic Framework for Human-Aligned Evaluation

cs.CL · 2025-10-17 · unverdicted · novelty 7.0

A mutual evaluation system for LLMs that uses game-theoretic aggregation of peer reviews and validates alignment with human voting on subjective outputs.

GeoLaux: A Benchmark for Evaluating MLLMs' Geometry Performance on Long-Step Problems Requiring Auxiliary Lines

cs.AI · 2025-08-08 · accept · novelty 7.0

GeoLaux is a new benchmark of 2186 long-step geometry problems requiring auxiliary lines, used to evaluate 23 MLLMs and reveal major drops in performance on complex tasks.

Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

cs.CL · 2024-10-10 · conditional · novelty 7.0

Omni-MATH supplies 4428 human-verified Olympiad math problems that expose top LLMs achieving only 52.55% to 60.54% accuracy on the most difficult items.

Wait, am I Being Fair? Characterizing Deductive Stereotyping and Mitigating It with Fair-GCG

cs.CL · 2026-06-30 · unverdicted · novelty 6.0

The paper characterizes deductive stereotyping in LLMs and introduces Fair-GCG to discover injection phrases that improve fairness across benchmarks, reasoning, and real-world tasks.

SrDetection: A Self-Referential Framework for Data Leakage Detection in Code Large Language Models

cs.CL · 2026-06-29 · unverdicted · novelty 6.0

SrDetection detects data leakage in Code LLMs via contrast between original benchmark samples and their semantic variants, reporting F1 gains of 21.52 (gray-box) and 14.46 (black-box) over baselines in a controlled testbed.

Uncertainty-based Debiasing and Unlearning for Decontamination

cs.CY · 2026-06-22 · unverdicted · novelty 6.0

UBD leverages ensemble uncertainty to estimate per-sample memorization and construct debiased targets for post-hoc correction or unlearning, yielding output distributions closer to uncontaminated models on MMLU-Pro and MATH-MCQA than baselines.

VeriScale: Adversarial Test-Suite Scaling for Verifiable Code Generation

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

VeriScale adversarially scales test suites for the Verina benchmark into VerinaPlus (83x larger) and VerinaLite (14x variant) that expose hidden LLM weaknesses on SpecGen and CodeGen tasks.

SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks

cs.CL · 2026-04-20 · unverdicted · novelty 6.0

SPENCE shows older NL2SQL benchmarks like Spider have high performance sensitivity to syntactic changes, indicating likely training contamination, while newer ones like BIRD show little sensitivity and appear largely clean.

PITMuS: A Tool for Automated Bug Dataset Generation via Source-Level Mutant Reconstruction

cs.SE · 2026-05-21 · conditional · novelty 5.0

PITMuS automates source-level bug dataset generation by mapping PIT bytecode mutants back to Java source using debug information, producing structured pairs and metadata evaluated on eight open-source systems.

citing papers explorer

Showing 10 of 10 citing papers after filters.

Bridging Functional Correctness and Runtime Efficiency Gaps in LLM-Based Code Translation

cs.CL · 2026-06-16 · unverdicted · none · ref 11

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

cs.CL · 2026-05-07 · unverdicted · none · ref 17

When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors

cs.CL · 2026-04-23 · unverdicted · none · ref 7

MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events

cs.CL · 2026-04-16 · unverdicted · none · ref 8

Large Reasoning Models Are (Not Yet) Multilingual Latent Reasoners

cs.CL · 2026-01-06 · unverdicted · none · ref 10

LLMs Judge Themselves: A Game-Theoretic Framework for Human-Aligned Evaluation

cs.CL · 2025-10-17 · unverdicted · none · ref 4

Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

cs.CL · 2024-10-10 · conditional · none · ref 72

Wait, am I Being Fair? Characterizing Deductive Stereotyping and Mitigating It with Fair-GCG

cs.CL · 2026-06-30 · unverdicted · none · ref 97

SrDetection: A Self-Referential Framework for Data Leakage Detection in Code Large Language Models

cs.CL · 2026-06-29 · unverdicted · none · ref 7

SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks

cs.CL · 2026-04-20 · unverdicted · none · ref 15
