A meta-benchmarking framework organizes 452 LLM benchmarks into 41 O*NET Generalized Work Activities and 38 BIAN domains, using discrimination-coverage-recency weights to scale K-factors in an Elo tournament for comparable financial-services scores.
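As a rough illustration of how benchmark weights could scale the K-factor in an Elo update, here is a minimal sketch. The summary above does not say how the discrimination, coverage, and recency weights are combined, so the simple mean used here, the base K of 32, and all names (`BenchmarkWeights`, `k_scale`, `elo_update`) are assumptions for illustration, not the framework's actual method.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkWeights:
    """Hypothetical per-benchmark weights, each assumed to lie in [0, 1]."""
    discrimination: float
    coverage: float
    recency: float

    def k_scale(self) -> float:
        # Assumption: combine the three weights with a plain mean.
        return (self.discrimination + self.coverage + self.recency) / 3.0


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, score_a: float,
               weights: BenchmarkWeights, base_k: float = 32.0):
    """Return updated ratings after one pairwise comparison on a benchmark.

    score_a is 1.0 if model A wins, 0.5 for a tie, 0.0 if it loses.
    The benchmark's weights shrink or keep the base K-factor.
    """
    k = base_k * weights.k_scale()
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


# A fully weighted benchmark moves ratings twice as far as a half-weighted one.
strong = BenchmarkWeights(1.0, 1.0, 1.0)
weak = BenchmarkWeights(0.5, 0.5, 0.5)
print(elo_update(1500.0, 1500.0, 1.0, strong))
print(elo_update(1500.0, 1500.0, 1.0, weak))
```

Under these assumptions, a win on a low-weight benchmark shifts both ratings less than the same win on a high-weight one, while the sum of the two ratings stays constant.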
17 Pith papers cite this work.
representative citing papers
SwiftTrans improves both functional correctness and runtime efficiency of LLM code translations via multi-perspective exploration with hierarchical guidance and difference-aware selection with ordinal guidance on extended benchmarks including new SwiftBench.
A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.
A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
AutoRISE evolves red-teaming attack strategies as editable executable programs via an agent, yielding 17-point higher average attack success rates than baselines across 11 models.
New RPS and AGS metrics show within-family distilled LLM agents have 5.9 pp higher tool-use graph similarity than cross-family pairs, with some models exceeding their teachers.
MADE creates a contamination-resistant living benchmark for multi-label classification of medical device adverse events, with evaluations revealing model-specific trade-offs in accuracy and uncertainty quantification.
Large reasoning models exhibit multilingual latent reasoning that is uneven across languages but internally consistent and English-centered.
A mutual evaluation system for LLMs that uses game-theoretic aggregation of peer reviews and validates alignment with human voting on subjective outputs.
GeoLaux is a new benchmark of 2186 long-step geometry problems requiring auxiliary lines, used to evaluate 23 MLLMs and reveal major drops in performance on complex tasks.
Omni-MATH supplies 4428 human-verified Olympiad math problems that expose top LLMs achieving only 52.55% to 60.54% accuracy on the most difficult items.
The paper characterizes deductive stereotyping in LLMs and introduces Fair-GCG to discover injection phrases that improve fairness across benchmarks, reasoning, and real-world tasks.
SrDetection detects data leakage in Code LLMs via contrast between original benchmark samples and their semantic variants, reporting F1 gains of 21.52 (gray-box) and 14.46 (black-box) over baselines in a controlled testbed.
VeriScale adversarially scales test suites for the Verina benchmark into VerinaPlus (83x larger) and VerinaLite (14x variant) that expose hidden LLM weaknesses on SpecGen and CodeGen tasks.
SPENCE shows older NL2SQL benchmarks like Spider have high performance sensitivity to syntactic changes, indicating likely training contamination, while newer ones like BIRD show little sensitivity and appear largely clean.
PITMuS automates source-level bug dataset generation by mapping PIT bytecode mutants back to Java source using debug information, producing structured pairs and metadata evaluated on eight open-source systems.
citing papers explorer
- Bridging Functional Correctness and Runtime Efficiency Gaps in LLM-Based Code Translation
  SwiftTrans improves both functional correctness and runtime efficiency of LLM code translations via multi-perspective exploration with hierarchical guidance and difference-aware selection with ordinal guidance on extended benchmarks including new SwiftBench.
- Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity
  A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
- When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors
  New RPS and AGS metrics show within-family distilled LLM agents have 5.9 pp higher tool-use graph similarity than cross-family pairs, with some models exceeding their teachers.
- MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events
  MADE creates a contamination-resistant living benchmark for multi-label classification of medical device adverse events, with evaluations revealing model-specific trade-offs in accuracy and uncertainty quantification.
- Large Reasoning Models Are (Not Yet) Multilingual Latent Reasoners
  Large reasoning models exhibit multilingual latent reasoning that is uneven across languages but internally consistent and English-centered.
- LLMs Judge Themselves: A Game-Theoretic Framework for Human-Aligned Evaluation
  A mutual evaluation system for LLMs that uses game-theoretic aggregation of peer reviews and validates alignment with human voting on subjective outputs.
- Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models
  Omni-MATH supplies 4428 human-verified Olympiad math problems that expose top LLMs achieving only 52.55% to 60.54% accuracy on the most difficult items.
- Wait, am I Being Fair? Characterizing Deductive Stereotyping and Mitigating It with Fair-GCG
  The paper characterizes deductive stereotyping in LLMs and introduces Fair-GCG to discover injection phrases that improve fairness across benchmarks, reasoning, and real-world tasks.
- SrDetection: A Self-Referential Framework for Data Leakage Detection in Code Large Language Models
  SrDetection detects data leakage in Code LLMs via contrast between original benchmark samples and their semantic variants, reporting F1 gains of 21.52 (gray-box) and 14.46 (black-box) over baselines in a controlled testbed.
- SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks
  SPENCE shows older NL2SQL benchmarks like Spider have high performance sensitivity to syntactic changes, indicating likely training contamination, while newer ones like BIRD show little sensitivity and appear largely clean.