Title resolution pending

The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance? CoRR, abs/2412 · 2024 · arXiv 2412.03597

5 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

Deceive, Detect, and Disclose: Large Language Models Play Mini-Mafia

cs.AI · 2025-09-27 · unverdicted · novelty 7.0

Mini-Mafia supplies an analytical model logit(p) = v*(m-d) for mafia win probability in LLM role interactions and uses Bayesian inference to estimate per-model parameters that predict tournament results with a 76.6% Brier-score improvement over random.

Training a General Purpose Automated Red Teaming Model

cs.CR · 2026-04-24 · unverdicted · novelty 6.0

A pipeline trains general-purpose red teaming models by fine-tuning small LLMs such as Qwen3-8B to generate attacks for both seen and unseen adversarial objectives without relying on existing evaluators.

LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models

cs.CL · 2025-08-07 · conditional · novelty 6.0

LLMEval-Fair introduces a dynamic, contamination-resistant evaluation framework for LLMs based on a large question bank and validates it via a 30-month study of nearly 60 models showing performance ceilings and hidden contamination issues.

Empirical Evidence of Complexity-Induced Limits in Large Language Models on Finite Discrete State-Space Problems with Explicit Validity Constraints

cs.CL · 2026-04-15 · unverdicted · novelty 5.0

Large reasoning models exhibit reasoning collapse, with accuracy dropping sharply beyond task-specific complexity thresholds in controlled versions of nine classical reasoning tasks using strict validity validators.

Human-aligned AI Model Cards with Weighted Hierarchy Architecture

cs.SE · 2025-10-08 · unverdicted · novelty 4.0

Introduces CRAI-MCF, an eight-module framework distilling 217 parameters from 240 projects into a quantitative sufficiency criterion for cross-model LLM comparison grounded in Value Sensitive Design.

citing papers explorer

Showing 5 of 5 citing papers.

Deceive, Detect, and Disclose: Large Language Models Play Mini-Mafia

cs.AI · 2025-09-27 · unverdicted · none · ref 3

Training a General Purpose Automated Red Teaming Model

cs.CR · 2026-04-24 · unverdicted · none · ref 1

LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models

cs.CL · 2025-08-07 · conditional · none · ref 1

Empirical Evidence of Complexity-Induced Limits in Large Language Models on Finite Discrete State-Space Problems with Explicit Validity Constraints

cs.CL · 2026-04-15 · unverdicted · none · ref 36

Human-aligned AI Model Cards with Weighted Hierarchy Architecture

cs.SE · 2025-10-08 · unverdicted · none · ref 5
