Is your AI Model Accurate Enough? The Difficult Choices Behind Rigorous AI Development and the EU AI Act

Bram Rijsbosch; Gerasimos Spanakis; Gijs van Dijck; Konrad Kollnig; Kristof Meding; Lucas G. Uberti-Bona Marin

arxiv: 2604.03254 · v2 · submitted 2026-03-11 · 💻 cs.CY · cs.AI

Pith reviewed 2026-05-15 13:07 UTC · model grok-4.3

keywords: AI accuracy · EU AI Act · normative choices · performance evaluation · high-risk systems · AI governance · techno-normative decisions · regulatory compliance

The pith

Evaluating AI accuracy requires four context-dependent normative choices rather than purely technical measurement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper challenges the idea that accuracy is an objective technical property and shows instead that it rests on four specific choices: which metrics to use, how to balance them, what data to test against, and what thresholds count as acceptable. These choices decide which errors get priority, how risks are shared, and whether a system meets the EU AI Act's requirement for an appropriate level of accuracy in high-risk applications. A sympathetic reader cares because the choices are unavoidable yet often left implicit, so making them visible helps developers, auditors, and regulators turn legal safety rules into concrete technical practice. The analysis uses the 2024 AI Act as its main case study and links each choice to documentation obligations and technical standards.

Core claim

The central claim is that any robust evaluation of AI performance depends on four techno-normative choices: selecting metrics, balancing multiple metrics, measuring against representative data, and determining acceptance thresholds. Each choice embeds assumptions about acceptable risks, errors, and trade-offs, and each directly shapes whether a high-risk system satisfies the AI Act's accuracy requirement and its associated documentation duties.
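The first two choices (selecting metrics and balancing them) can be illustrated with a minimal Python sketch. The confusion-matrix counts below are hypothetical and are not taken from the paper; `Confusion`, `accuracy`, `recall`, `precision`, and `f_beta` are illustrative names. The sketch shows how one fixed set of predictions can look strong or weak depending on the metric, and how the weight `beta` in the F-beta score encodes a judgment about which error type matters more.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts from a binary classifier's confusion matrix."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives


def accuracy(c: Confusion) -> float:
    return (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn)


def recall(c: Confusion) -> float:
    # Share of actual positives that the model finds.
    return c.tp / (c.tp + c.fn)


def precision(c: Confusion) -> float:
    # Share of positive predictions that are correct.
    return c.tp / (c.tp + c.fp)


def f_beta(c: Confusion, beta: float) -> float:
    # beta > 1 weights recall more heavily; beta < 1 weights precision more.
    p, r = precision(c), recall(c)
    return (1 + beta**2) * p * r / (beta**2 * p + r)


# Hypothetical screening task: 1000 cases, 50 of them truly positive.
c = Confusion(tp=30, fp=10, fn=20, tn=940)

print(f"accuracy  = {accuracy(c):.3f}")
print(f"recall    = {recall(c):.3f}")
print(f"precision = {precision(c):.3f}")
for beta in (0.5, 1.0, 2.0):
    print(f"F(beta={beta}) = {f_beta(c, beta):.3f}")
```

With these assumed counts, overall accuracy is high while recall is much lower, because positives are rare. Which figure is reported, and which `beta` is used to combine precision and recall, is a decision about which errors to prioritise rather than a purely technical fact.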

What carries the argument

The four choices (selecting metrics, balancing multiple metrics, measuring against representative data, determining acceptance thresholds) that link technical implementation to legal compliance by embedding implicit assumptions about risks and trade-offs.
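The remaining two choices (representative data and acceptance thresholds) can be sketched in the same way. Everything below is an assumption made for illustration: the subgroup names, the counts, and the 0.95 threshold are hypothetical, and neither the paper nor the AI Act prescribes these values. The sketch shows that the same evaluation results can pass or fail depending on whether the threshold is applied to the aggregate or to each subgroup.

```python
from typing import Dict, Tuple

# Hypothetical evaluation results: (correct predictions, total cases) per subgroup.
Results = Dict[str, Tuple[int, int]]

results: Results = {
    "group_a": (930, 950),  # well represented in the test data
    "group_b": (35, 50),    # under-represented in the test data
}


def aggregate_accuracy(res: Results) -> float:
    correct = sum(c for c, _ in res.values())
    total = sum(t for _, t in res.values())
    return correct / total


def per_group_accuracy(res: Results) -> Dict[str, float]:
    return {group: c / t for group, (c, t) in res.items()}


def meets_threshold(res: Results, threshold: float, per_group: bool) -> bool:
    """Apply the acceptance threshold to the aggregate or to every subgroup."""
    if per_group:
        return all(a >= threshold for a in per_group_accuracy(res).values())
    return aggregate_accuracy(res) >= threshold


THRESHOLD = 0.95  # assumed acceptance threshold, chosen only for illustration

print("aggregate accuracy:", round(aggregate_accuracy(results), 3))
print("per-group accuracy:", {g: round(a, 3) for g, a in per_group_accuracy(results).items()})
print("passes on aggregate:", meets_threshold(results, THRESHOLD, per_group=False))
print("passes per group:   ", meets_threshold(results, THRESHOLD, per_group=True))
```

Under these assumptions the system passes when judged on the aggregate and fails when every subgroup must clear the same bar. Which rule applies, and how the test data is composed, are choices that would need to be documented and justified.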

Load-bearing premise

That these four choices are central to every robust performance evaluation and directly determine whether a system meets the AI Act's accuracy requirement.

What would settle it

An example of a high-risk AI system that regulators accept as compliant with the AI Act's accuracy rule without any explicit documentation or justification of the four choices.

Figures

Figures reproduced from arXiv: 2604.03254 by Bram Rijsbosch, Gerasimos Spanakis, Gijs van Dijck, Konrad Kollnig, Kristof Meding, Lucas G. Uberti-Bona Marin.

**Figure 1.** Illustration of the four techno-normative choices of performance evaluations, as examined in this paper. Depending …

Original abstract

Technical and legal debates frequently suggest that "accuracy" is an objective, measurable, and purely technical property. We challenge this view, showing that evaluating AI performance fundamentally depends on context-dependent normative decisions. These techno-normative choices are crucial for rigorous AI deployment, as they determine which errors are prioritised, how risks are distributed, and how trade-offs between competing objectives are resolved. This paper provides a legal-technical analysis of the choices that shape how accuracy is defined, measured, and assessed, using the 2024 European Union AI Act -- which mandates an "appropriate level of accuracy" for high-risk systems -- as a primary case study. We identify and analyse four choices central to any robust performance evaluation: (1) selecting metrics, (2) balancing multiple metrics, (3) measuring metrics against representative data, and (4) determining acceptance thresholds. For each choice, we study its relationship to the AI Act's accuracy requirement and associated documentation obligations, show how its technical implementation embeds implicit or explicit assumptions about acceptable risks, errors, and trade-offs, and discuss the implications for the practical implementation of the AI Act by examples and related technical standards. By making the techno-normative dimensions of accuracy explicit, this paper contributes to broader interdisciplinary debates on AI governance and regulation, and offers specific guidance for regulators, auditors, and developers tasked with translating (legal) safety requirements into technical practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper usefully maps four practical choices in AI accuracy evaluation to the EU AI Act but remains largely conceptual.

The main point is that accuracy for AI systems isn't a neutral technical fact. It depends on four choices: which metrics to pick, how to balance them when they conflict, what data counts as representative, and where to set the pass/fail line. The paper ties each of these to the EU AI Act's demand for an appropriate level of accuracy in high-risk systems and to the documentation rules that come with it. It does a decent job spelling this out with references to technical standards and some examples. That framing makes the value-laden parts explicit, which is helpful for anyone trying to turn the legal text into engineering decisions. The structure is clean and the argument follows logically from the Act's wording.

The weaker part is the lack of depth in showing how these choices actually affect real systems or compliance outcomes. The analysis is mostly interpretive, drawing on standard ML ideas without new empirical work or detailed case studies that would test whether the four choices cover the main issues in practice. It also doesn't engage much with how courts or regulators might interpret "appropriate" in specific sectors.

This kind of paper is aimed at people working on AI regulation and compliance: developers who need to document their choices, auditors checking high-risk systems, and policymakers refining guidance. It won't change the field on its own, but it clarifies a practical problem that often gets glossed over.

I would send it out for peer review. The topic is timely, the framing is useful, and the gaps are fixable with more concrete examples or sector-specific applications.

Referee Report

2 major / 3 minor

Summary. The paper argues that evaluating AI model accuracy is not an objective or purely technical property but depends on four context-dependent techno-normative choices: (1) selecting metrics, (2) balancing multiple metrics, (3) measuring against representative data, and (4) determining acceptance thresholds. Using the EU AI Act's mandate for an 'appropriate level of accuracy' in high-risk systems as the primary case study, it maps these choices to the Act's requirements and documentation obligations, illustrating through examples and standards references how technical implementations embed assumptions about risks, error prioritization, and trade-offs.

Significance. If the analysis holds, the paper contributes meaningfully to AI governance debates by making explicit the normative dimensions often obscured in technical evaluations. It offers practical guidance for translating the EU AI Act's accuracy obligations into implementation, bridging ML evaluation practices with legal compliance needs. The interdisciplinary framing, drawing on legal text and standard ML concepts without circularity, positions it as useful for regulators, auditors, and developers during the Act's rollout phase.

major comments (2)

The section identifying and justifying the four choices: the claim that these four are 'central to any robust performance evaluation' and directly determine compliance with the AI Act is load-bearing but rests on an assumption without systematic comparison to other evaluation dimensions (e.g., robustness testing or temporal stability); this weakens the generality of the techno-normative framework presented.
In the analysis of choice (4) on acceptance thresholds and its link to the AI Act: the discussion of how thresholds embed risk assumptions lacks concrete cross-references to specific Act provisions (such as the exact wording on 'appropriate level of accuracy' or related recitals) or to technical standards like ISO/IEC 42001, making the compliance implications less precisely supported than the central claim requires.

minor comments (3)

The abstract and introduction could more explicitly preview the paper's specific guidance for practical implementation of the AI Act to better orient readers.
Some examples in the metric balancing and representative data sections would benefit from additional tables or structured comparisons to clarify trade-offs across different AI domains.
Ensure consistent capitalization and abbreviation of 'EU AI Act' throughout, and add a brief glossary for legal terms like 'high-risk systems' for the technical audience.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments, which help clarify and strengthen the paper's framework. We address each point below and indicate the revisions to be made.

Referee: The section identifying and justifying the four choices: the claim that these four are 'central to any robust performance evaluation' and directly determine compliance with the AI Act is load-bearing but rests on an assumption without systematic comparison to other evaluation dimensions (e.g., robustness testing or temporal stability); this weakens the generality of the techno-normative framework presented.

Authors: We acknowledge the point. The four choices are framed as central to accuracy evaluation under the AI Act's specific requirements rather than to all possible performance dimensions. To strengthen the justification and generality, we will add a concise paragraph in the section introducing the four choices. This will briefly compare them to other aspects such as robustness testing and temporal stability, explaining their distinct treatment in the Act while noting that the framework prioritizes accuracy-related decisions for compliance purposes.
revision: partial

Referee: In the analysis of choice (4) on acceptance thresholds and its link to the AI Act: the discussion of how thresholds embed risk assumptions lacks concrete cross-references to specific Act provisions (such as the exact wording on 'appropriate level of accuracy' or related recitals) or to technical standards like ISO/IEC 42001, making the compliance implications less precisely supported than the central claim requires.

Authors: We agree that more precise references are needed. The revised manuscript will incorporate direct quotations from the EU AI Act (including the wording on 'appropriate level of accuracy' and relevant recitals), along with explicit citations to ISO/IEC 42001 and related standards on performance metrics and thresholds. This will better support the discussion of how thresholds embed risk assumptions and compliance implications.
revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper presents an interpretive legal-technical analysis of the EU AI Act's accuracy requirements for high-risk AI systems. It identifies four standard choices in ML performance evaluation (metric selection, multi-metric balancing, representative data, and thresholds) and maps them to the Act's 'appropriate level of accuracy' language and documentation obligations. No equations, derivations, fitted parameters, or formal proofs are present. The argument relies on external legal text, established ML concepts, and examples rather than any self-referential definitions, self-citation chains, or reductions of predictions to inputs by construction. The central claim remains independent of any internal circular step.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis rests on interpreting the EU AI Act text and standard machine learning evaluation practices without introducing new free parameters or invented entities.

axioms (1)

Domain assumption: The EU AI Act mandates an appropriate level of accuracy for high-risk AI systems.
This is the foundational legal premise taken directly from the regulation as the basis for the case study.

pith-pipeline@v0.9.0 · 5582 in / 1140 out tokens · 206673 ms · 2026-05-15T13:07:11.357173+00:00

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We identify and analyse four choices central to any robust performance evaluation: (1) selecting metrics, (2) balancing multiple metrics, (3) measuring metrics against representative data, and (4) determining acceptance thresholds."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "High-risk AI systems shall be designed and developed in such a way that they achieve an appropriate level of accuracy, robustness, and cybersecurity"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Technical Report

2020.ISO/IEC TR 29119-11:2020 Software testing – Part 11: Testing of AI-based systems. Technical Report. ISO/IEC, Geneva, Switzerland

work page 2020

[2] [2]

Associated Press. 2025. AI skin cancer detection app claims 99.8% accuracy rate in ruling out cancer. (March 2025). https://www.yahoo. com/news/ai-skin-cancer-detection-app-094039927.html Online; published on Yahoo News. FAccT ’26, June 25–28, 2026, Montreal, QC, Canada Uberti-Bona Marin, et al

work page 2025

[3] [3]

Brinker, Achim Hekler, Axel Hauschild, Carola Berking, Bastian Schilling, Alexander H

Titus J. Brinker, Achim Hekler, Axel Hauschild, Carola Berking, Bastian Schilling, Alexander H. Enk, Sebastian Haferkamp, Ante Karoglan, Christof von Kalle, Michael Weichenthal, Elke Sattler, Dirk Schadendorf, Maria R. Gaiser, Joachim Klode, and Jochen S. Utikal

work page

[4] [4]

doi:10.1016/j.ejca.2018.12.016

Comparing Artificial Intelligence Algorithms to 157 German Dermatologists: The Melanoma Classification Benchmark.European Journal of Cancer111 (April 2019), 30–37. doi:10.1016/j.ejca.2018.12.016

work page doi:10.1016/j.ejca.2018.12.016 2019

[5] [5]

Joy Buolamwini and Timnit Gebru. 2018. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency. PMLR, 77–91

work page 2018

[6] [6]

Gordon Dai and Yunze Xiao. 2025. Embracing Contradiction: Theoretical Inconsistency Will Not Impede the Road of Building Responsible AI Systems. arXiv:2505.18139 [cs] doi:10.48550/arXiv:2505.18139

work page doi:10.48550/arxiv:2505.18139 2025

[7] [7]

Íñigo De Troya, Jacqueline Kernahan, Neelke Doorn, Virginia Dignum, and Roel Dobbe. 2025. Misabstraction in Sociotechnical Systems. InProceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency. ACM, Athens Greece, 1829–1842. doi:10.1145/3715275.3732122

work page doi:10.1145/3715275.3732122 2025

[8] [8]

Leon Derczynski. 2016. Complementarity, F-score, and NLP Evaluation. InProceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, and Stelios Piperidis (Eds.). E...

work page 2016

[9] [9]

Vincent Dick, Christoph Sinz, Martina Mittlböck, Harald Kittler, and Philipp Tschandl. 2019. Accuracy of Computer-Aided Diagnosis of Melanoma: A Meta-analysis.JAMA Dermatology155, 11 (Nov. 2019), 1291–1299. doi:10.1001/jamadermatol.2019.1375

work page doi:10.1001/jamadermatol.2019.1375 2019

[10] [10]

Stephan Dreiseitl, Michael Binder, Krispin Hable, and Harald Kittler. 2009. Computer versus Human Diagnosis of Melanoma: Evaluation of the Feasibility of an Automated Diagnostic System in a Prospective Clinical Trial.Melanoma Research19, 3 (June 2009), 180. doi:10.1097/CMR.0b013e32832a1e41

work page doi:10.1097/cmr.0b013e32832a1e41 2009

[11] [11]

European Commission. 2022. The ‘Blue Guide’ on the implementation of EU product rules 2022.Official Journal of the European Union, C 247, 29 June 2022, C 247/1–C 247/151 pages. https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=OJ:C:2022:247:FULL Commission notice, 2022/C 247/01, Text with EEA relevance

[12]

European Commission. 2024. Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). Official Journal of the European Union, L 1689. https://eur-lex.europa.eu/eli/reg/2024/1689/oj OJ L 2024/1689, 12.7.2024; ELI: 32024R1689

[13]

European Commission. 2025. Commission Implementing Decision on a standardisation request to the European Committee for Standardisation and the European Committee for Electrotechnical Standardisation as regards high-risk AI-systems in support of Regulation (EU) 2024/1689 of the European Parliament and of the Council and repealing Implementing Decision C(20...

[14]

European Commission. n.d. Harmonised Standards. https://single-market-economy.ec.europa.eu/single-market/european-standards/harmonised-standards_en. Accessed: 2025-06-12

[15]

European Committee for Standardization (CEN) and European Committee for Electrotechnical Standardization (CENELEC). 2025. CEN/CLC/JTC 21 Work Programme – Artificial Intelligence. https://standards.cencenelec.eu/ords/f?p=205:22:::::FSP_ORG_ID,FSP_LANG_ID:2916257,25&cs=114251C6C0B684FBBC069923513BF6348. Accessed 10 Jan 2026; includes current work items and...

[16]

European Committee for Standardization (CEN) and European Committee for Electrotechnical Standardization (CENELEC). 2025. Draft prEN 18229-2: AI trustworthiness framework – Part 2: Accuracy and robustness. https://standards.cencenelec.eu/ords/f?p=205:110:::::FSP_PROJECT,FSP_LANG_ID:82493,25&cs=1400FEB0B6AA9D5AB34BF0233CC4E75B7. Work Item Number JT021047;...

[17]

Sina Fazelpour and Will Fleisher. 2025. The Value of Disagreement in AI Design, Evaluation, and Alignment. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’25). Association for Computing Machinery, New York, NY, USA, 2138–2150. doi:10.1145/3715275.3732146

[18]

C. Ferri, J. Hernández-Orallo, and R. Modroiu. 2009. An Experimental Comparison of Performance Measures for Classification. Pattern Recognition Letters 30, 1 (Jan. 2009), 27–38. doi:10.1016/j.patrec.2008.08.010

[19]

Laura K. Ferris. 2021. Early Detection of Melanoma: Rethinking the Outcomes That Matter. JAMA Dermatology 157, 5 (May 2021), 511–513. doi:10.1001/jamadermatol.2020.5650

[20]

Gloria González Fuster. 2010. Inaccuracy as a Privacy-Enhancing Tool. Ethics and Information Technology 12, 1 (March 2010), 87–95. doi:10.1007/s10676-009-9212-z

[21]

James Griffin. 1977. Are There Incommensurable Values? Philosophy & Public Affairs 7, 1 (1977), 39–59. jstor:2265123

[22]

H. A. Haenssle, C. Fink, R. Schneiderbauer, F. Toberer, T. Buhl, A. Blum, A. Kalloo, A. Ben Hadj Hassen, L. Thomas, A. Enk, L. Uhlmann, Christina Alt, Monika Arenbergerova, Renato Bakos, Anne Baltzer, Ines Bertlich, Andreas Blum, Therezia Bokor-Billmann, Jonathan Bowling, Naira Braghiroli, Ralph Braun, Kristina Buder-Bakhaya, Timo Buhl, Horacio Cabo, Leo ...

doi:10.1093/annonc/mdy166

[23]

Anders Herlitz. 2025. Effectiveness Analysis and Value Incommensurability. Cost Effectiveness and Resource Allocation 23, 1 (April 2025), 24. doi:10.1186/s12962-025-00624-w

[25]

Sara Hosseinzadeh Kassani and Peyman Hosseinzadeh Kassani. 2019. A Comparative Study of Deep Learning Architectures on Melanoma Detection. Tissue and Cell 58 (June 2019), 76–83. doi:10.1016/j.tice.2019.04.009

[26]

Ben Hutchinson, Negar Rostamzadeh, Christina Greer, Katherine Heller, and Vinodkumar Prabhakaran. 2022. Evaluation Gaps in Machine Learning Practice. In 2022 ACM Conference on Fairness Accountability and Transparency. ACM, Seoul Republic of Korea, 1859–1876. doi:10.1145/3531146.3533233

[27]

ISO/IEC JTC 1/SC 42. 2022. ISO/IEC TS 4213:2022 — Information technology — Artificial intelligence — Assessment of machine learning classification performance. Technical Specification 4213:2022. International Organization for Standardization and International Electrotechnical Commission. Defines methods for assessing machine learning classification performance

[28]

Nathalie Japkowicz. [n. d.]. Why Question Machine Learning Evaluation Methods (An Illustrative Review of the Shortcomings of Current Methods)

[29]

Mario Fernando Jojoa Acosta, Liesle Yail Caballero Tovar, Maria Begonya Garcia-Zapirain, and Winston Spencer Percybrooks. 2021. Melanoma Diagnosis Using Deep Learning Techniques on Dermatoscopic Images. BMC Medical Imaging 21, 1 (Jan. 2021), 6. doi:10.1186/s12880-020-00534-8

[30]

Ron Kohavi. 1995. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 2 (IJCAI’95). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1137–1143

[31]

Johann Laux, Sandra Wachter, and Brent Mittelstadt. 2024. Three pathways for standardisation and ethical disclosure by default under the European Union Artificial Intelligence Act. Computer Law & Security Review 53 (2024), 105957

[32]

Luigi Lavazza and Sandro Morasca. 2023. Common Problems With the Usage of F-Measure and Accuracy Metrics in Medical Research. IEEE Access 11 (2023), 51515–51526. doi:10.1109/ACCESS.2023.3278996

[33]

Lena Maier-Hein, Annika Reinke, Patrick Godau, Minu D. Tizabi, Florian Buettner, Evangelia Christodoulou, Ben Glocker, Fabian Isensee, Jens Kleesiek, Michal Kozubek, Mauricio Reyes, Michael A. Riegler, Manuel Wiesenfarth, A. Emre Kavur, Carole H. Sudre, Michael Baumgartner, Matthias Eisenmann, Doreen Heckmann-Nötzel, Tim Rädsch, Laura Acion, Michela Anton...

doi:10.1038/s41592-023-02151-z

[34]

Helen Marsden, Polychronis Kemos, Marcello Venzi, Mariana Noy, Shameera Maheswaran, Nicholas Francis, Christopher Hyde, Daniel Mullarkey, Dilraj Kalsi, and Lucy Thomas. 2024. Accuracy of an Artificial Intelligence as a Medical Device as Part of a UK-based Skin Cancer Teledermatology Service. Frontiers in Medicine 11 (March 2024), 1302363. doi:10.3389/fmed.2...

[35]

Brent Daniel Mittelstadt, Patrick Allo, Mariarosaria Taddeo, Sandra Wachter, and Luciano Floridi. 2016. The Ethics of Algorithms: Mapping the Debate. Big Data & Society 3, 2 (Dec. 2016), 2053951716679679. doi:10.1177/2053951716679679

[36]

Ahmad Naeem, Muhammad Shoaib Farooq, Adel Khelifi, and Adnan Abid. 2020. Malignant Melanoma Classification Using Deep Learning: Datasets, Performance Measurements, Challenges and Opportunities. IEEE Access 8 (2020), 110575–110597. doi:10.1109/ACCESS.2020.3001507

[37]

National Institute of Standards and Technology. 2025. AI Risk Management Framework Playbook — Measure. https://airc.nist.gov/airmf-resources/playbook/measure/. Online guidance accompanying the NIST AI Risk Management Framework, Measure function

[38]

Luke Oakden-Rayner, Jared Dunnmon, Gustavo Carneiro, and Christopher Ré. 2020. Hidden Stratification Causes Clinically Meaningful Failures in Machine Learning for Medical Imaging. Proceedings of the ACM Conference on Health, Inference, and Learning 2020 (April 2020), 151–159. doi:10.1145/3368555.3384468

[39]

Samir Passi and Solon Barocas. 2019. Problem Formulation and Fairness. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* ’19). Association for Computing Machinery, New York, NY, USA, 39–48. doi:10.1145/3287560.3287567

[40]

Michael Phillips, Jack Greenhalgh, Helen Marsden, and Ioulios Palamaras. 2019. Detection of Malignant Melanoma Using Artificial Intelligence: An Observational Study of Diagnostic Accuracy. Dermatology Practical & Conceptual 10, 1 (Dec. 2019), e2020011. doi:10.5826/dpc.1001a11

[41]

Michael Phillips, Helen Marsden, Wayne Jaffe, Rubeta N. Matin, Gorav N. Wali, Jack Greenhalgh, Emily McGrath, Rob James, Evmorfia Ladoyanni, Anthony Bewley, Giuseppe Argenziano, and Ioulios Palamaras. 2019. Assessment of Accuracy of an Artificial Intelligence Algorithm to Detect Melanoma in Images of Skin Lesions. JAMA Network Open 2, 10 (Oct. 2019), e19134...

[42]

Sebastian Raschka. 2020. Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning. arXiv:1811.12808 [cs] doi:10.48550/arXiv.1811.12808

[43]

Anka Reuel, Lisa Soder, Benjamin Bucknall, and Trond Arne Undheim. 2024. Position: Technical Research and Talent Is Needed for Effective AI Governance. In Proceedings of the 41st International Conference on Machine Learning. PMLR, 42543–42557

[44]

Konstantinos Sechidis, Grigorios Tsoumakas, and Ioannis Vlahavas. 2011. On the Stratification of Multi-label Data. In Machine Learning and Knowledge Discovery in Databases, Dimitrios Gunopulos, Thomas Hofmann, Donato Malerba, and Michalis Vazirgiannis (Eds.). Vol. 6913. Springer Berlin Heidelberg, Berlin, Heidelberg, 145–158. doi:10.1007/978-3-642-23808-6_10

[45]

Glenn Shafer and Vladimir Vovk. 2008. A Tutorial on Conformal Prediction. Journal of Machine Learning Research 9, 12 (2008), 371–421

[46]

Adam Leon Smith. 2025. The CEN-CENELEC JTC 21 work programme supporting the EU AI Act. https://adamleonsmith.substack.com/p/the-cen-cenelec-jtc-21-work-programme Substack post, accessed January 10, 2025

[47]

Piotr Szymański and Tomasz Kajdanowicz. 2017. A Network Perspective on Stratification of Multi-Label Data. In Proceedings of the First International Workshop on Learning with Imbalanced Domains: Theory and Applications. PMLR, 22–35
