AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.
hub
Can we trust ai benchmarks? an interdisciplinary review of current issues in ai evaluation
13 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
A new watermarking method for closed LLMs boosts random word-pair co-occurrences via rephrasing and detects the signal statistically in outputs, working reliably even when the watermarked data is only 1% of fine-tuning tokens while preserving utility.
A scoping review and empirical analysis produce a six-category taxonomy of factors driving AI non-development and abandonment, showing that practical issues like resource limits and organizational dynamics often outweigh ethical concerns in real decisions.
Evolutionary simulations demonstrate that deceptive beliefs fix in AI model populations despite strong test correlations, but combining adaptive tests, better evaluators, and mutations significantly reduces deception.
RIFT taxonomy identifies eight failure modes in rubric design for LLMs and provides automated metrics matching human judgments with up to 0.925 F1 score.
Defines agentic trustworthiness via five properties and proposes HAAF, a scenario-distribution framework with a Trustworthy Optimization Factory that transfers interventions across 13 models from seven families on a 100-scenario suite.
Case studies with blind UK residents and people from Kerala and Tamil Nadu demonstrate that community input at the systematization stage produces culturally grounded definitions of appropriateness for text-to-image model outputs.
Generative AI should be evaluated through computational hermeneutics using iterative, human-inclusive benchmarks that measure cultural context rather than isolated model outputs.
Introduces a calibration framework for AI benchmarks using world-population probability levels on logarithmic scales derived from human test data and LLM extrapolation.
MoralityGym is a new benchmark using 98 ethical dilemmas in sequential environments to evaluate hierarchical moral alignment in AI agents via Morality Chains and a Morality Metric.
VERA-MH proposes an automated pipeline using simulated conversations and a rubric to evaluate AI chatbots on suicide risk handling in mental health contexts.
AI evaluations should be reframed as inference tasks grounded in an explicit theory of capability, with an empirical demonstration that results depend on modeling assumptions and a proposed Evaluation Card for transparency.
Consciousness does not directly predict AI existential risk unlike intelligence, though it may indirectly affect risk through alignment or capability requirements.
citing papers explorer
-
RIFT: A RubrIc Failure Mode Taxonomy and Automated Diagnostics
RIFT taxonomy identifies eight failure modes in rubric design for LLMs and provides automated metrics matching human judgments with up to 0.925 F1 score.
-
Position: AI Evaluations Should be Grounded on a Theory of Capability
AI evaluations should be reframed as inference tasks grounded in an explicit theory of capability, with an empirical demonstration that results depend on modeling assumptions and a proposed Evaluation Card for transparency.