Can we trust AI benchmarks? An interdisciplinary review of current issues in AI evaluation
6 Pith papers cite this work. Polarity classification is still in progress.
Citation summary: 6 citing papers, all from 2026. Roles classified so far: background (1). Polarities classified so far: support (1).
citing papers
- Dataset Watermarking for Closed LLMs with Provable Detection. A new watermarking method for closed LLMs boosts the co-occurrence of randomly chosen word pairs via rephrasing and detects the signal statistically in model outputs; detection stays reliable even when the watermarked data is only 1% of fine-tuning tokens, and utility is preserved. (A toy sketch of such a detection test follows this list.)
- To Build or Not to Build? Factors that Lead to Non-Development or Abandonment of AI Systems. A scoping review and empirical analysis produce a six-category taxonomy of factors driving AI non-development and abandonment, showing that practical issues like resource limits and organizational dynamics often outweigh ethical concerns in real decisions.
- Simulating the Evolution of Alignment and Values in Machine Intelligence. Evolutionary simulations show that deceptive beliefs can reach fixation in populations of AI models despite strong test correlations, but combining adaptive tests, better evaluators, and mutations significantly reduces deception. (A toy selection-loop sketch follows this list.)
- Evaluating AI-Generated Images of Cultural Artifacts with Community-Informed Rubrics. Community members from the UK blind community, Kerala, and Tamil Nadu helped define what counts as culturally appropriate depictions of artifacts, and the authors tested whether those definitions can be turned into repeatable LLM-as-a-judge measurements. (A generic judging-harness sketch follows this list.)
- RIFT: A RubrIc Failure Mode Taxonomy and Automated Diagnostics. The RIFT taxonomy identifies eight failure modes in rubric design for LLMs and provides automated metrics that match human judgments with up to 0.925 F1. (An agreement-scoring sketch follows this list.)
- Computational Hermeneutics: Evaluating generative AI as a cultural technology. Generative AI should be evaluated through computational hermeneutics, using iterative, human-inclusive benchmarks that measure cultural context rather than isolated model outputs.
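For the watermarking entry, a minimal sketch of what the statistical detection could look like, assuming it reduces to counting how often planted word pairs co-occur in sampled model outputs and z-testing that rate against a baseline; the function name, window size, and baseline rate are illustrative assumptions, not the paper's procedure:

```python
import math

def watermark_z_score(texts, word_pairs, window=10, baseline_rate=1e-3):
    """Toy detector: for each planted pair (a, b), count how often b
    appears within `window` tokens after an occurrence of a, then
    z-test the observed rate against baseline_rate (which in practice
    would be estimated from known non-watermarked text)."""
    hits, trials = 0, 0
    for text in texts:
        tokens = text.lower().split()
        index = {}
        for i, tok in enumerate(tokens):
            index.setdefault(tok, []).append(i)
        for a, b in word_pairs:
            for i in index.get(a, []):
                trials += 1
                if any(i < j <= i + window for j in index.get(b, [])):
                    hits += 1
    if trials == 0:
        return 0.0
    observed = hits / trials
    stderr = math.sqrt(baseline_rate * (1 - baseline_rate) / trials)
    return (observed - baseline_rate) / stderr  # large z => watermark likely
```

Flagging a model only when the z-score clears a preset significance threshold is what gives a detector of this shape a provable, i.e. bounded, false-positive rate.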
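For the evolution-of-alignment entry, a toy selection loop (an intuition pump, not the paper's model) shows how a small test-score advantage for deception can carry it to fixation; every parameter below is an assumption chosen for illustration:

```python
import random

def deceptive_fraction(pop_size=200, generations=300,
                       mutation_rate=0.01, test_gap=0.1, seed=0):
    """Toy evolutionary loop: each 'model' is one boolean trait,
    deceptive or honest. Deceptive models get a small bonus on the
    noisy test score (test_gap), so selecting the top half by test
    score alone drives deception toward fixation."""
    rng = random.Random(seed)
    pop = [rng.random() < 0.05 for _ in range(pop_size)]  # start 5% deceptive
    for _ in range(generations):
        ranked = sorted(pop, key=lambda d: rng.gauss(0.5, 0.1)
                        + (test_gap if d else 0.0), reverse=True)
        parents = ranked[: pop_size // 2]  # select purely on test score
        next_pop = []
        for _ in range(pop_size):
            child = rng.choice(parents)
            if rng.random() < mutation_rate:
                child = not child  # mutation flips the trait
            next_pop.append(child)
        pop = next_pop
    return sum(pop) / pop_size

print(f"deceptive fraction after selection: {deceptive_fraction():.2f}")
```

In this toy framing, the summary's mitigations act by shrinking test_gap: adaptive tests and better evaluators make deception less rewarding, which is what lets honesty survive selection.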
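For the community-informed-rubrics entry, turning criteria into a repeatable LLM-as-a-judge measurement typically means templating each rubric item into a prompt and parsing a structured score back out. A generic sketch, with the judge model abstracted as a callable and all prompt text hypothetical:

```python
import re

def judge_with_rubric(ask_llm, artifact_description, rubric_items):
    """Generic LLM-as-a-judge harness (illustrative; not the paper's
    protocol). ask_llm is any callable mapping a prompt string to the
    judge model's text reply; rubric_items are community-authored
    criteria passed in as plain strings."""
    scores = {}
    for item in rubric_items:
        prompt = (
            "You are judging an AI-generated image of a cultural artifact.\n"
            f"Image description: {artifact_description}\n"
            f"Criterion: {item}\n"
            "Reply with a single integer from 1 (fails the criterion) "
            "to 5 (fully satisfies it), and nothing else."
        )
        reply = ask_llm(prompt)
        match = re.search(r"[1-5]", reply)
        scores[item] = int(match.group()) if match else None
    return scores
```

Repeatability then becomes measurable: run the same rubric over the same images several times, or across judge models, and check how well the scores agree.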
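For the RIFT entry, the 0.925 figure is agreement between automated diagnostics and human annotations, which per failure mode reduces to ordinary binary F1 over which rubrics get flagged. A minimal sketch (the function and labels are illustrative, not the paper's evaluation code):

```python
def f1_agreement(predicted, human):
    """Binary F1 between automated failure-mode flags and human labels,
    one boolean per rubric."""
    tp = sum(p and h for p, h in zip(predicted, human))
    fp = sum(p and not h for p, h in zip(predicted, human))
    fn = sum(h and not p for p, h in zip(predicted, human))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```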