Is your AI Model Accurate Enough? The Difficult Choices Behind Rigorous AI Development and the EU AI Act
Pith reviewed 2026-05-15 13:07 UTC · model grok-4.3
The pith
Evaluating AI accuracy requires four context-dependent normative choices rather than purely technical measurement.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that any robust evaluation of AI performance depends on four techno-normative choices: selecting metrics, balancing multiple metrics, measuring against representative data, and determining acceptance thresholds. Each choice embeds assumptions about acceptable risks, errors, and trade-offs, and each directly shapes whether a high-risk system satisfies the AI Act's accuracy requirement and its associated documentation duties.
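To make the first and fourth choices concrete, here is a minimal Python sketch that computes several common metrics from one hypothetical confusion matrix and checks them against two hypothetical acceptance rules. The counts, the function names, and the 0.95 and 0.90 cut-offs are illustrative assumptions; none of them come from the paper or from the AI Act.

```python
# Hypothetical confusion-matrix counts for a rare-condition screening model.
# All numbers are illustrative assumptions, not results from the paper.
TP, FN, FP, TN = 5, 5, 10, 980


def accuracy(tp, fn, fp, tn):
    """Share of all cases classified correctly."""
    return (tp + tn) / (tp + fn + fp + tn)


def sensitivity(tp, fn):
    """Share of actual positives that were detected (recall)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Share of actual negatives that were correctly cleared."""
    return tn / (tn + fp)


metrics = {
    "accuracy": accuracy(TP, FN, FP, TN),
    "sensitivity": sensitivity(TP, FN),
    "specificity": specificity(TN, FP),
}

# Two hypothetical acceptance rules: same model, different metric choice.
rule_a_passes = metrics["accuracy"] >= 0.95     # "at least 95% accurate"
rule_b_passes = metrics["sensitivity"] >= 0.90  # "miss at most 10% of positives"

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print("rule A (accuracy >= 0.95):", rule_a_passes)
print("rule B (sensitivity >= 0.90):", rule_b_passes)
```

Under these assumed counts the model is 98.5% accurate yet detects only half of the positive cases, so the verdict depends on which metric the acceptance rule names.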
What carries the argument
The four choices (selecting metrics, balancing multiple metrics, measuring against representative data, determining acceptance thresholds) that link technical implementation to legal compliance by embedding implicit assumptions about risks and trade-offs.
Load-bearing premise
That these four choices are central to every robust performance evaluation and directly determine whether a system meets the AI Act's accuracy requirement.
What would settle it
An example of a high-risk AI system that regulators accept as compliant with the AI Act's accuracy rule without any explicit documentation or justification of the four choices.
Original abstract
Technical and legal debates frequently suggest that "accuracy" is an objective, measurable, and purely technical property. We challenge this view, showing that evaluating AI performance fundamentally depends on context-dependent normative decisions. These techno-normative choices are crucial for rigorous AI deployment, as they determine which errors are prioritised, how risks are distributed, and how trade-offs between competing objectives are resolved. This paper provides a legal-technical analysis of the choices that shape how accuracy is defined, measured, and assessed, using the 2024 European Union AI Act -- which mandates an "appropriate level of accuracy" for high-risk systems -- as a primary case study. We identify and analyse four choices central to any robust performance evaluation: (1) selecting metrics, (2) balancing multiple metrics, (3) measuring metrics against representative data, and (4) determining acceptance thresholds. For each choice, we study its relationship to the AI Act's accuracy requirement and associated documentation obligations, show how its technical implementation embeds implicit or explicit assumptions about acceptable risks, errors, and trade-offs, and discuss the implications for the practical implementation of the AI Act by examples and related technical standards. By making the techno-normative dimensions of accuracy explicit, this paper contributes to broader interdisciplinary debates on AI governance and regulation, and offers specific guidance for regulators, auditors, and developers tasked with translating (legal) safety requirements into technical practice.
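The second choice, balancing multiple metrics, can be sketched with the F-beta score, a standard way to combine precision and recall into one number. The two models and their precision and recall values below are hypothetical, and F-beta is used here only as one familiar balancing scheme, not as the method the paper prescribes.

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily; beta < 1 weights precision more.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical (precision, recall) pairs for two candidate models.
models = {"model_a": (0.9, 0.6), "model_b": (0.6, 0.9)}


def best_model(beta):
    """Return the name of the model with the highest F-beta score."""
    return max(models, key=lambda name: f_beta(*models[name], beta))


for beta in (0.5, 2.0):
    scores = {name: round(f_beta(p, r, beta), 3) for name, (p, r) in models.items()}
    print(f"beta={beta}: {scores} -> preferred: {best_model(beta)}")
```

The ranking of the two models reverses between beta = 0.5 and beta = 2, which is the sense in which the weighting is a normative decision about which error type matters more.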
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that evaluating AI model accuracy is not an objective or purely technical property but depends on four context-dependent techno-normative choices: (1) selecting metrics, (2) balancing multiple metrics, (3) measuring against representative data, and (4) determining acceptance thresholds. Using the EU AI Act's mandate for an 'appropriate level of accuracy' in high-risk systems as the primary case study, it maps these choices to the Act's requirements and documentation obligations, illustrating through examples and standards references how technical implementations embed assumptions about risks, error prioritization, and trade-offs.
Significance. If the analysis holds, the paper contributes meaningfully to AI governance debates by making explicit the normative dimensions often obscured in technical evaluations. It offers practical guidance for translating the EU AI Act's accuracy obligations into implementation, bridging ML evaluation practices with legal compliance needs. The interdisciplinary framing, drawing on legal text and standard ML concepts without circularity, positions it as useful for regulators, auditors, and developers during the Act's rollout phase.
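Choice (3) in the summary above, measuring against representative data, can be illustrated with a small Python sketch that compares accuracy pooled over a test set with accuracy reweighted to an assumed deployment population. The subgroup names, counts, and deployment shares are hypothetical and chosen only to show the effect.

```python
# Hypothetical per-subgroup test results: (number of test cases, number correct).
test_results = {"group_a": (900, 864), "group_b": (100, 70)}

# Hypothetical share of each subgroup in the population the system will serve.
deployment_mix = {"group_a": 0.5, "group_b": 0.5}


def pooled_accuracy(results):
    """Accuracy over the test set as collected, ignoring subgroup structure."""
    total = sum(n for n, _ in results.values())
    correct = sum(c for _, c in results.values())
    return correct / total


def reweighted_accuracy(results, mix):
    """Accuracy expected if subgroups occurred in the proportions given by mix."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("deployment shares must sum to 1")
    return sum(mix[g] * (c / n) for g, (n, c) in results.items())


print(f"pooled test accuracy:    {pooled_accuracy(test_results):.3f}")
print(f"deployment-mix accuracy: {reweighted_accuracy(test_results, deployment_mix):.3f}")
```

With these assumed numbers the same model scores 0.934 on the test set as collected and 0.830 once the subgroups are weighted as in deployment, so the reported figure depends on which population the test data is taken to represent.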
major comments (2)
- The section identifying and justifying the four choices: the claim that these four are 'central to any robust performance evaluation' and directly determine compliance with the AI Act is load-bearing but rests on an assumption without systematic comparison to other evaluation dimensions (e.g., robustness testing or temporal stability); this weakens the generality of the techno-normative framework presented.
- In the analysis of choice (4) on acceptance thresholds and its link to the AI Act: the discussion of how thresholds embed risk assumptions lacks concrete cross-references to specific Act provisions (such as the exact wording on 'appropriate level of accuracy' or related recitals) or to technical standards like ISO/IEC 42001, making the compliance implications less precisely supported than the central claim requires.
minor comments (3)
- The abstract and introduction could more explicitly preview the paper's specific guidance for practical implementation of the AI Act to better orient readers.
- Some examples in the metric balancing and representative data sections would benefit from additional tables or structured comparisons to clarify trade-offs across different AI domains.
- Ensure consistent capitalization and abbreviation of 'EU AI Act' throughout, and add a brief glossary for legal terms like 'high-risk systems' for the technical audience.
Simulated Author's Rebuttal
We thank the referee for these constructive comments, which help clarify and strengthen the paper's framework. We address each point below and indicate the revisions to be made.
Point-by-point responses
- Referee: The section identifying and justifying the four choices: the claim that these four are 'central to any robust performance evaluation' and directly determine compliance with the AI Act is load-bearing but rests on an assumption without systematic comparison to other evaluation dimensions (e.g., robustness testing or temporal stability); this weakens the generality of the techno-normative framework presented.
  Authors: We acknowledge the point. The four choices are framed as central to accuracy evaluation under the AI Act's specific requirements rather than to all possible performance dimensions. To strengthen the justification and generality, we will add a concise paragraph in the section introducing the four choices. This will briefly compare them to other aspects such as robustness testing and temporal stability, explaining their distinct treatment in the Act while noting that the framework prioritizes accuracy-related decisions for compliance purposes.
  Revision: partial
- Referee: In the analysis of choice (4) on acceptance thresholds and its link to the AI Act: the discussion of how thresholds embed risk assumptions lacks concrete cross-references to specific Act provisions (such as the exact wording on 'appropriate level of accuracy' or related recitals) or to technical standards like ISO/IEC 42001, making the compliance implications less precisely supported than the central claim requires.
  Authors: We agree that more precise references are needed. The revised manuscript will incorporate direct quotations from the EU AI Act (including the wording on 'appropriate level of accuracy' and relevant recitals), along with explicit citations to ISO/IEC 42001 and related standards on performance metrics and thresholds. This will better support the discussion of how thresholds embed risk assumptions and compliance implications.
  Revision: yes
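The referee's point that acceptance thresholds embed risk assumptions can be illustrated with a short Python sketch. It checks one hypothetical sensitivity result against one hypothetical threshold in two ways: using the point estimate, and using the lower end of a 95% Wilson score interval. The Wilson interval is a standard statistical tool chosen here for illustration; neither it nor the numbers are taken from the paper, the AI Act, or any cited standard.

```python
import math


def wilson_lower_bound(successes, n, z=1.96):
    """Lower end of the Wilson score interval for a binomial proportion.

    z=1.96 corresponds to a two-sided 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    z2 = z * z
    centre = p + z2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n))
    return (centre - margin) / (1 + z2 / n)


# Hypothetical evaluation: 92 of 100 positive cases detected.
detected, positives = 92, 100
threshold = 0.90  # hypothetical acceptance threshold for sensitivity

point_estimate = detected / positives
lower_bound = wilson_lower_bound(detected, positives)

print(f"point estimate:  {point_estimate:.3f} -> passes: {point_estimate >= threshold}")
print(f"95% lower bound: {lower_bound:.3f} -> passes: {lower_bound >= threshold}")
```

In this assumed case the point estimate clears the threshold while the lower confidence bound does not, so whether the system counts as "accurate enough" depends on how much statistical uncertainty the acceptance rule is willing to tolerate.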
Circularity Check
No significant circularity
The paper presents an interpretive legal-technical analysis of the EU AI Act's accuracy requirements for high-risk AI systems. It identifies four standard choices in ML performance evaluation (metric selection, multi-metric balancing, representative data, and thresholds) and maps them to the Act's 'appropriate level of accuracy' language and documentation obligations. No equations, derivations, fitted parameters, or formal proofs are present. The argument relies on external legal text, established ML concepts, and examples rather than any self-referential definitions, self-citation chains, or reductions of predictions to inputs by construction. The central claim remains independent of any internal circular step.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The EU AI Act mandates an appropriate level of accuracy for high-risk AI systems.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We identify and analyse four choices central to any robust performance evaluation: (1) selecting metrics, (2) balancing multiple metrics, (3) measuring metrics against representative data, and (4) determining acceptance thresholds."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "High-risk AI systems shall be designed and developed in such a way that they achieve an appropriate level of accuracy, robustness, and cybersecurity"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
2020. ISO/IEC TR 29119-11:2020 Software testing – Part 11: Testing of AI-based systems. Technical Report. ISO/IEC, Geneva, Switzerland
-
[2]
Associated Press. 2025. AI skin cancer detection app claims 99.8% accuracy rate in ruling out cancer. (March 2025). https://www.yahoo.com/news/ai-skin-cancer-detection-app-094039927.html Online; published on Yahoo News. FAccT ’26, June 25–28, 2026, Montreal, QC, Canada Uberti-Bona Marin, et al
-
[3]
Titus J. Brinker, Achim Hekler, Axel Hauschild, Carola Berking, Bastian Schilling, Alexander H. Enk, Sebastian Haferkamp, Ante Karoglan, Christof von Kalle, Michael Weichenthal, Elke Sattler, Dirk Schadendorf, Maria R. Gaiser, Joachim Klode, and Jochen S. Utikal. 2019. Comparing Artificial Intelligence Algorithms to 157 German Dermatologists: The Melanoma Classification Benchmark. European Journal of Cancer 111 (April 2019), 30–37. doi:10.1016/j.ejca.2018.12.016
-
[5]
Joy Buolamwini and Timnit Gebru. 2018. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency. PMLR, 77–91
-
[6]
Gordon Dai and Yunze Xiao. 2025. Embracing Contradiction: Theoretical Inconsistency Will Not Impede the Road of Building Responsible AI Systems. arXiv:2505.18139 [cs] doi:10.48550/arXiv.2505.18139
-
[7]
Íñigo De Troya, Jacqueline Kernahan, Neelke Doorn, Virginia Dignum, and Roel Dobbe. 2025. Misabstraction in Sociotechnical Systems. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency. ACM, Athens Greece, 1829–1842. doi:10.1145/3715275.3732122
-
[8]
Leon Derczynski. 2016. Complementarity, F-score, and NLP Evaluation. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, and Stelios Piperidis (Eds.). E...
-
[9]
Vincent Dick, Christoph Sinz, Martina Mittlböck, Harald Kittler, and Philipp Tschandl. 2019. Accuracy of Computer-Aided Diagnosis of Melanoma: A Meta-analysis. JAMA Dermatology 155, 11 (Nov. 2019), 1291–1299. doi:10.1001/jamadermatol.2019.1375
-
[10]
Stephan Dreiseitl, Michael Binder, Krispin Hable, and Harald Kittler. 2009. Computer versus Human Diagnosis of Melanoma: Evaluation of the Feasibility of an Automated Diagnostic System in a Prospective Clinical Trial. Melanoma Research 19, 3 (June 2009), 180. doi:10.1097/CMR.0b013e32832a1e41
-
[11]
European Commission. 2022. The ‘Blue Guide’ on the implementation of EU product rules 2022. Official Journal of the European Union, C 247, 29 June 2022, C 247/1–C 247/151 pages. https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=OJ:C:2022:247:FULL Commission notice, 2022/C 247/01, Text with EEA relevance
-
[12]
European Commission. 2024. Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). Official Journal of the European Union, L 1689. https://eur-lex.europa.eu/eli/reg/2024/1689/oj OJ L 2024/1689, 12.7.2024; ELI: 32024R1689
-
[13]
European Commission. 2025. Commission Implementing Decision on a standardisation request to the European Committee for Standardisation and the European Committee for Electrotechnical Standardisation as regards high-risk AI-systems in support of Regulation (EU) 2024/1689 of the European Parliament and of the Council and repealing Implementing Decision C(20...
-
[14]
European Commission. n.d. Harmonised Standards. https://single-market-economy.ec.europa.eu/single-market/european-standards/harmonised-standards_en. Accessed: 2025-06-12
-
[15]
European Committee for Standardization (CEN) and European Committee for Electrotechnical Standardization (CENELEC). 2025. CEN/CLC/JTC 21 Work Programme – Artificial Intelligence. https://standards.cencenelec.eu/ords/f?p=205:22:::::FSP_ORG_ID,FSP_LANG_ID:2916257,25&cs=114251C6C0B684FBBC069923513BF6348. Accessed 10 Jan 2026; includes current work items and...
-
[16]
European Committee for Standardization (CEN) and European Committee for Electrotechnical Standardization (CENELEC). 2025. Draft prEN 18229-2: AI trustworthiness framework – Part 2: Accuracy and robustness. https://standards.cencenelec.eu/ords/f?p=205:110:::::FSP_PROJECT,FSP_LANG_ID:82493,25&cs=1400FEB0B6AA9D5AB34BF0233CC4E75B7. Work Item Number JT021047;...
-
[17]
Sina Fazelpour and Will Fleisher. 2025. The Value of Disagreement in AI Design, Evaluation, and Alignment. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’25). Association for Computing Machinery, New York, NY, USA, 2138–2150. doi:10.1145/3715275.3732146
-
[18]
C. Ferri, J. Hernández-Orallo, and R. Modroiu. 2009. An Experimental Comparison of Performance Measures for Classification. Pattern Recognition Letters 30, 1 (Jan. 2009), 27–38. doi:10.1016/j.patrec.2008.08.010
-
[19]
Laura K. Ferris. 2021. Early Detection of Melanoma: Rethinking the Outcomes That Matter. JAMA Dermatology 157, 5 (May 2021), 511–513. doi:10.1001/jamadermatol.2020.5650
-
[20]
Gloria González Fuster. 2010. Inaccuracy as a Privacy-Enhancing Tool. Ethics and Information Technology 12, 1 (March 2010), 87–95. doi:10.1007/s10676-009-9212-z
-
[21]
James Griffin. 1977. Are There Incommensurable Values? Philosophy & Public Affairs 7, 1 (1977), 39–59. jstor:2265123
-
[22]
H. A. Haenssle, C. Fink, R. Schneiderbauer, F. Toberer, T. Buhl, A. Blum, A. Kalloo, A. Ben Hadj Hassen, L. Thomas, A. Enk, L. Uhlmann, Christina Alt, Monika Arenbergerova, Renato Bakos, Anne Baltzer, Ines Bertlich, Andreas Blum, Therezia Bokor-Billmann, Jonathan Bowling, Naira Braghiroli, Ralph Braun, Kristina Buder-Bakhaya, Timo Buhl, Horacio Cabo, Leo ...
-
[23]
Anders Herlitz. 2025. Effectiveness Analysis and Value Incommensurability. Cost Effectiveness and Resource Allocation 23, 1 (April 2025), 24. doi:10.1186/s12962-025-00624-w
-
[25]
Sara Hosseinzadeh Kassani and Peyman Hosseinzadeh Kassani. 2019. A Comparative Study of Deep Learning Architectures on Melanoma Detection. Tissue and Cell 58 (June 2019), 76–83. doi:10.1016/j.tice.2019.04.009
-
[26]
Ben Hutchinson, Negar Rostamzadeh, Christina Greer, Katherine Heller, and Vinodkumar Prabhakaran. 2022. Evaluation Gaps in Machine Learning Practice. In 2022 ACM Conference on Fairness Accountability and Transparency. ACM, Seoul Republic of Korea, 1859–1876. doi:10.1145/3531146.3533233
-
[27]
ISO/IEC JTC 1/SC 42. 2022. ISO/IEC TS 4213:2022 – Information technology – Artificial intelligence – Assessment of machine learning classification performance. Technical Specification 4213:2022. International Organization for Standardization and International Electrotechnical Commission. Defines methods for assessing machine learning classification performance
-
[28]
Nathalie Japkowicz. [n. d.]. Why Question Machine Learning Evaluation Methods (An Illustrative Review of the Shortcomings of Current Methods)
-
[29]
Mario Fernando Jojoa Acosta, Liesle Yail Caballero Tovar, Maria Begonya Garcia-Zapirain, and Winston Spencer Percybrooks. 2021. Melanoma Diagnosis Using Deep Learning Techniques on Dermatoscopic Images. BMC Medical Imaging 21, 1 (Jan. 2021), 6. doi:10.1186/s12880-020-00534-8
-
[30]
Ron Kohavi. 1995. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 2 (IJCAI’95). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1137–1143
-
[31]
Johann Laux, Sandra Wachter, and Brent Mittelstadt. 2024. Three pathways for standardisation and ethical disclosure by default under the European Union Artificial Intelligence Act. Computer Law & Security Review 53 (2024), 105957
-
[32]
Luigi Lavazza and Sandro Morasca. 2023. Common Problems With the Usage of F-Measure and Accuracy Metrics in Medical Research. IEEE Access 11 (2023), 51515–51526. doi:10.1109/ACCESS.2023.3278996
-
[33]
Lena Maier-Hein, Annika Reinke, Patrick Godau, Minu D. Tizabi, Florian Buettner, Evangelia Christodoulou, Ben Glocker, Fabian Isensee, Jens Kleesiek, Michal Kozubek, Mauricio Reyes, Michael A. Riegler, Manuel Wiesenfarth, A. Emre Kavur, Carole H. Sudre, Michael Baumgartner, Matthias Eisenmann, Doreen Heckmann-Nötzel, Tim Rädsch, Laura Acion, Michela Anton...
-
[34]
Helen Marsden, Polychronis Kemos, Marcello Venzi, Mariana Noy, Shameera Maheswaran, Nicholas Francis, Christopher Hyde, Daniel Mullarkey, Dilraj Kalsi, and Lucy Thomas. 2024. Accuracy of an Artificial Intelligence as a Medical Device as Part of a UK-based Skin Cancer Teledermatology Service. Frontiers in Medicine 11 (March 2024), 1302363. doi:10.3389/fmed.2...
-
[35]
Brent Daniel Mittelstadt, Patrick Allo, Mariarosaria Taddeo, Sandra Wachter, and Luciano Floridi. 2016. The Ethics of Algorithms: Mapping the Debate. Big Data & Society 3, 2 (Dec. 2016), 2053951716679679. doi:10.1177/2053951716679679
-
[37]
National Institute of Standards and Technology. 2025. AI Risk Management Framework Playbook – Measure. https://airc.nist.gov/airmf-resources/playbook/measure/. Online guidance accompanying the NIST AI Risk Management Framework, Measure function
-
[38]
Luke Oakden-Rayner, Jared Dunnmon, Gustavo Carneiro, and Christopher Ré. 2020. Hidden Stratification Causes Clinically Meaningful Failures in Machine Learning for Medical Imaging. Proceedings of the ACM Conference on Health, Inference, and Learning 2020 (April 2020), 151–159. doi:10.1145/3368555.3384468
-
[39]
Samir Passi and Solon Barocas. 2019. Problem Formulation and Fairness. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* ’19). Association for Computing Machinery, New York, NY, USA, 39–48. doi:10.1145/3287560.3287567
-
[40]
Michael Phillips, Jack Greenhalgh, Helen Marsden, and Ioulios Palamaras. 2019. Detection of Malignant Melanoma Using Artificial Intelligence: An Observational Study of Diagnostic Accuracy. Dermatology Practical & Conceptual 10, 1 (Dec. 2019), e2020011. doi:10.5826/dpc.1001a11
-
[41]
Michael Phillips, Helen Marsden, Wayne Jaffe, Rubeta N. Matin, Gorav N. Wali, Jack Greenhalgh, Emily McGrath, Rob James, Evmorfia Ladoyanni, Anthony Bewley, Giuseppe Argenziano, and Ioulios Palamaras. 2019. Assessment of Accuracy of an Artificial Intelligence Algorithm to Detect Melanoma in Images of Skin Lesions. JAMA Network Open 2, 10 (Oct. 2019), e19134...
-
[42]
Sebastian Raschka. 2020. Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning. arXiv:1811.12808 [cs] doi:10.48550/arXiv.1811.12808
-
[43]
Anka Reuel, Lisa Soder, Benjamin Bucknall, and Trond Arne Undheim. 2024. Position: Technical Research and Talent Is Needed for Effective AI Governance. In Proceedings of the 41st International Conference on Machine Learning. PMLR, 42543–42557
-
[44]
Konstantinos Sechidis, Grigorios Tsoumakas, and Ioannis Vlahavas. 2011. On the Stratification of Multi-label Data. In Machine Learning and Knowledge Discovery in Databases, Dimitrios Gunopulos, Thomas Hofmann, Donato Malerba, and Michalis Vazirgiannis (Eds.). Vol. 6913. Springer Berlin Heidelberg, Berlin, Heidelberg, 145–158. doi:10.1007/978-3-642-23808-6_10
-
[45]
Glenn Shafer and Vladimir Vovk. 2008. A Tutorial on Conformal Prediction. Journal of Machine Learning Research 9, 12 (2008), 371–421
-
[46]
Adam Leon Smith. 2025. The CEN-CENELEC JTC 21 work programme supporting the EU AI Act. https://adamleonsmith.substack.com/p/the-cen-cenelec-jtc-21-work-programme Substack post, accessed January 10, 2025
-
[47]
Piotr Szymański and Tomasz Kajdanowicz. 2017. A Network Perspective on Stratification of Multi-Label Data. In Proceedings of the First International Workshop on Learning with Imbalanced Domains: Theory and Applications. PMLR, 22–35