Can we trust AI benchmarks? An interdisciplinary review of current issues in AI evaluation
6 Pith papers cite this work. Polarity classification is still in progress.
Citation summary: 6 citing papers, all from 2026. Roles classified so far: background (1). Polarities classified so far: support (1).
citing papers
- Dataset Watermarking for Closed LLMs with Provable Detection. A new watermarking method for closed LLMs boosts the co-occurrence of randomly chosen word pairs via rephrasing and detects the signal statistically in model outputs; detection stays reliable even when the watermarked data is only 1% of fine-tuning tokens, and utility is preserved. (A toy sketch of such a detection test follows this list.)
- To Build or Not to Build? Factors that Lead to Non-Development or Abandonment of AI Systems. A scoping review and empirical analysis produce a six-category taxonomy of factors driving AI non-development and abandonment, showing that practical issues like resource limits and organizational dynamics often outweigh ethical concerns in real decisions.
- Simulating the Evolution of Alignment and Values in Machine Intelligence. Evolutionary simulations show that deceptive beliefs can reach fixation in populations of AI models despite strong test correlations, but combining adaptive tests, better evaluators, and mutations significantly reduces deception. (A toy selection-loop sketch follows this list.)
- Evaluating AI-Generated Images of Cultural Artifacts with Community-Informed Rubrics. Community members from the UK blind community, Kerala, and Tamil Nadu helped define what counts as culturally appropriate depictions of artifacts, and the authors tested whether those definitions can be turned into repeatable LLM-as-a-judge measurements. (A generic judging-harness sketch follows this list.)
- RIFT: A RubrIc Failure Mode Taxonomy and Automated Diagnostics. The RIFT taxonomy identifies eight failure modes in rubric design for LLMs and provides automated metrics that match human judgments with up to 0.925 F1. (An agreement-scoring sketch follows this list.)
- Computational Hermeneutics: Evaluating generative AI as a cultural technology. Generative AI should be evaluated through computational hermeneutics, using iterative, human-inclusive benchmarks that measure cultural context rather than isolated model outputs.
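For the watermarking entry, a minimal sketch of what the statistical detection could look like, assuming it reduces to counting how often planted word pairs co-occur in sampled model outputs and z-testing that rate against a baseline; the function name, window size, and baseline rate are illustrative assumptions, not the paper's procedure:

```python
import math

def watermark_z_score(texts, word_pairs, window=10, baseline_rate=1e-3):
    """Toy detector: for each planted pair (a, b), count how often b
    appears within `window` tokens after an occurrence of a, then
    z-test the observed rate against baseline_rate (which in practice
    would be estimated from known non-watermarked text)."""
    hits, trials = 0, 0
    for text in texts:
        tokens = text.lower().split()
        index = {}
        for i, tok in enumerate(tokens):
            index.setdefault(tok, []).append(i)
        for a, b in word_pairs:
            for i in index.get(a, []):
                trials += 1
                if any(i < j <= i + window for j in index.get(b, [])):
                    hits += 1
    if trials == 0:
        return 0.0
    observed = hits / trials
    stderr = math.sqrt(baseline_rate * (1 - baseline_rate) / trials)
    return (observed - baseline_rate) / stderr  # large z => watermark likely
```

Flagging a model only when the z-score clears a preset significance threshold is what gives a detector of this shape a provable, i.e. bounded, false-positive rate.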
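For the evolution-of-alignment entry, a toy selection loop (an intuition pump, not the paper's model) shows how a small test-score advantage for deception can carry it to fixation; every parameter below is an assumption chosen for illustration:

```python
import random

def deceptive_fraction(pop_size=200, generations=300,
                       mutation_rate=0.01, test_gap=0.1, seed=0):
    """Toy evolutionary loop: each 'model' is one boolean trait,
    deceptive or honest. Deceptive models get a small bonus on the
    noisy test score (test_gap), so selecting the top half by test
    score alone drives deception toward fixation."""
    rng = random.Random(seed)
    pop = [rng.random() < 0.05 for _ in range(pop_size)]  # start 5% deceptive
    for _ in range(generations):
        ranked = sorted(pop, key=lambda d: rng.gauss(0.5, 0.1)
                        + (test_gap if d else 0.0), reverse=True)
        parents = ranked[: pop_size // 2]  # select purely on test score
        next_pop = []
        for _ in range(pop_size):
            child = rng.choice(parents)
            if rng.random() < mutation_rate:
                child = not child  # mutation flips the trait
            next_pop.append(child)
        pop = next_pop
    return sum(pop) / pop_size

print(f"deceptive fraction after selection: {deceptive_fraction():.2f}")
```

In this toy framing, the summary's mitigations act by shrinking test_gap: adaptive tests and better evaluators make deception less rewarding, which is what lets honesty survive selection.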
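For the community-informed-rubrics entry, turning criteria into a repeatable LLM-as-a-judge measurement typically means templating each rubric item into a prompt and parsing a structured score back out. A generic sketch, with the judge model abstracted as a callable and all prompt text hypothetical:

```python
import re

def judge_with_rubric(ask_llm, artifact_description, rubric_items):
    """Generic LLM-as-a-judge harness (illustrative; not the paper's
    protocol). ask_llm is any callable mapping a prompt string to the
    judge model's text reply; rubric_items are community-authored
    criteria passed in as plain strings."""
    scores = {}
    for item in rubric_items:
        prompt = (
            "You are judging an AI-generated image of a cultural artifact.\n"
            f"Image description: {artifact_description}\n"
            f"Criterion: {item}\n"
            "Reply with a single integer from 1 (fails the criterion) "
            "to 5 (fully satisfies it), and nothing else."
        )
        reply = ask_llm(prompt)
        match = re.search(r"[1-5]", reply)
        scores[item] = int(match.group()) if match else None
    return scores
```

Repeatability then becomes measurable: run the same rubric over the same images several times, or across judge models, and check how well the scores agree.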
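For the RIFT entry, the 0.925 figure is agreement between automated diagnostics and human annotations, which per failure mode reduces to ordinary binary F1 over which rubrics get flagged. A minimal sketch (the function and labels are illustrative, not the paper's evaluation code):

```python
def f1_agreement(predicted, human):
    """Binary F1 between automated failure-mode flags and human labels,
    one boolean per rubric."""
    tp = sum(p and h for p, h in zip(predicted, human))
    fp = sum(p and not h for p, h in zip(predicted, human))
    fn = sum(h and not p for p, h in zip(predicted, human))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```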