{"total":13,"items":[{"citing_arxiv_id":"2605.14164","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Unsteady Metrics and Benchmarking Cultures of AI Model Builders","primary_cat":"cs.AI","submitted_at":"2026-05-13T22:39:10+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06865","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Dataset Watermarking for Closed LLMs with Provable Detection","primary_cat":"cs.LG","submitted_at":"2026-05-07T19:06:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new watermarking method for closed LLMs boosts random word-pair co-occurrences via rephrasing and detects the signal statistically in outputs, working reliably even when the watermarked data is only 1% of fine-tuning tokens while preserving utility.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.28053","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"To Build or Not to Build? Factors that Lead to Non-Development or Abandonment of AI Systems","primary_cat":"cs.CY","submitted_at":"2026-04-30T16:00:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A scoping review and empirical analysis produce a six-category taxonomy of factors driving AI non-development and abandonment, showing that practical issues like resource limits and organizational dynamics often outweigh ethical concerns in real decisions.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"ACM, Seoul Republic of Korea, 1305-1317. doi:10.1145/3531146.3533186 [43] Maria Eriksson, Erasmo Purificato, Arman Noroozian, Joao Vinagre, Guillaume Chaslot, Emilia Gomez, and David Fernandez-Llorca. 2025. Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation. doi:10.48550/arXiv.2502.06559 arXiv:2502.06559 [cs]. [44] Andre Esteva, Brett Kuprel, Roberto A. Novoa, Justin Ko, Susan M. Swetter, Helen M. Blau, and Sebastian Thrun. 2017. Dermatologist- level classification of skin cancer with deep neural networks.Nature542, 7639 (Feb. 2017), 115-118. doi:10.1038/nature21056 [45] Sheryl Estrada. 2025. MIT report: 95% of generative AI pilots at companies are failing. https://fortune."},{"citing_arxiv_id":"2604.05274","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Simulating the Evolution of Alignment and Values in Machine Intelligence","primary_cat":"cs.AI","submitted_at":"2026-04-07T00:18:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Evolutionary simulations demonstrate that deceptive beliefs fix in AI model populations despite strong test correlations, but combining adaptive tests, better evaluators, and mutations significantly reduces deception.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02406","ref_index":36,"ref_count":4,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Evaluating AI-Generated Images of Cultural Artifacts with Community-Informed Rubrics","primary_cat":"cs.CY","submitted_at":"2026-04-02T17:17:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Case studies with blind UK residents and people from Kerala and Tamil Nadu demonstrate that community input at the systematization stage produces culturally grounded definitions of appropriateness for text-to-image model outputs.","context_count":2,"top_context_role":"other","top_context_polarity":"unclear","context_text":"Computing Machinery, New York, NY, USA, 72-76. https://doi.org/10.1145/3715668.3735629 [35] Maria Eriksson, Erasmo Purificato, Arman Noroozian, Joao Vinagre, Guillaume Chaslot, Emilia Gomez, and David Fernandez-Llorca. 2025. Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation. https://arxiv.org/abs/2502.06559 [36] Yannick Exner, Jochen Hartmann, Oded Netzer, and Shunyuan Zhang. 2025. AI in Disguise - How AI-Generated Ads' Visual Cues Shape Consumer Perception and Performance. doi:10.2139/ssrn.5096969 [37] Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. 2009. Describing objects by their attributes. In2009 IEEE Conference on Computer Vision and Pattern Recognition."},{"citing_arxiv_id":"2604.01375","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RIFT: A RubrIc Failure Mode Taxonomy and Automated Diagnostics","primary_cat":"cs.AI","submitted_at":"2026-04-01T20:34:43+00:00","verdict":"CONDITIONAL","verdict_confidence":"UNKNOWN","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RIFT taxonomy identifies eight failure modes in rubric design for LLMs and provides automated metrics matching human judgments with up to 0.925 F1 score.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16403","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Computational Hermeneutics: Evaluating generative AI as a cultural technology","primary_cat":"cs.AI","submitted_at":"2026-03-31T12:18:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Generative AI should be evaluated through computational hermeneutics using iterative, human-inclusive benchmarks that measure cultural context rather than isolated model outputs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Towards a rigorous science of interpretable machine learning.stat, 1050:2, 2017. [30] Brian D Earp, Sebastian Porsdam Mann, Mateo Aboy, Edmond Awad, Monika Betzler, Marietjie Botes, Rachel Calcott, Mina Caraccio, Nick Chater, Mark Coeckelbergh, et al. Relational norms for human-AI cooperation.arXiv preprint arXiv:2502.12102, 2025. [31] William Empson.Seven Types of Ambiguity. 1930. [32] Maria Eriksson, Erasmo Purificato, Arman Noroozian, Joao Vinagre, Guillaume Chaslot, Emilia Gomez, and David Fernandez-Llorca. Can we trust AI benchmarks? An interdisciplinary review of current issues in AI evaluation.arXiv preprint arXiv:2502.06559, 2025. [33] Kawin Ethayarajh. How contextual are contextualized word representations? Comparing the geometry of"},{"citing_arxiv_id":"2603.14987","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI","primary_cat":"cs.CL","submitted_at":"2026-03-16T08:51:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Defines agentic trustworthiness via five properties and proposes HAAF, a scenario-distribution framework with a Trustworthy Optimization Factory that transfers interventions across 13 models from seven families on a 100-scenario suite.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.18911","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Human-Level AI Tales to AI Leveling Human Scales","primary_cat":"cs.LG","submitted_at":"2026-02-21T17:27:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Introduces a calibration framework for AI benchmarks using world-population probability levels on logarithmic scales derived from human test data and LLM extrapolation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.13372","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents","primary_cat":"cs.AI","submitted_at":"2026-02-13T15:40:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MoralityGym is a new benchmark using 98 ethical dilemmas in sequential environments to evaluate hierarchical moral alignment in AI agents via Morality Chains and a Morality Metric.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.19115","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AI Consciousness and Existential Risk","primary_cat":"cs.AI","submitted_at":"2025-11-24T13:48:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Consciousness does not directly predict AI existential risk unlike intelligence, though it may indirectly affect risk through alignment or capability requirements.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.15297","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VERA-MH Concept Paper","primary_cat":"cs.CY","submitted_at":"2025-10-17T04:07:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VERA-MH proposes an automated pipeline using simulated conversations and a rubric to evaluate AI chatbots on suicide risk handling in mental health contexts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.19590","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Position: AI Evaluations Should be Grounded on a Theory of Capability","primary_cat":"cs.AI","submitted_at":"2025-09-23T21:29:04+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AI evaluations should be reframed as inference tasks grounded in an explicit theory of capability, with an empirical demonstration that results depend on modeling assumptions and a proposed Evaluation Card for transparency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}