ERTS: Adversarial Robustness Testing of Ethical AI via Semantic Perturbation in a Bounded Consequence Space
Pith reviewed 2026-06-27 06:42 UTC · model grok-4.3
The pith
ERTS tests AI ethical robustness by perturbing dilemmas in a 22-dimensional consequence space and finds most models unstable.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that ERTS provides a closed-pipeline framework that encodes ethical dilemmas into a 22-dimensional Ethical Consequence Space grounded in established ethical theory, applies 17 semantic perturbation functions subject to six validity constraint classes including a novel semantic coherence constraint, measures decision deviation via a four-component Ethical Instability Index, and produces domain-adaptive pre-deployment robustness assessment verdicts. Evaluation of four structured baseline models and two production LLMs across fifty ethical scenarios spanning eight deployment domains, generating fifteen hundred adversarial test cases, shows that only thirty-three percent of mode
What carries the argument
The 22-dimensional Ethical Consequence Space together with the seventeen semantic perturbation functions and six validity constraint classes, which together generate controlled adversarial ethical scenarios and quantify decision instability.
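As a rough illustration of how a perturbation function and a validity constraint could interact inside a bounded space, the sketch below generates one adversarial case and rejects it if a constraint fails. Only the dimension count (22) comes from the paper. The [0, 1] bounds, the single-coordinate shift, and the maximum-shift constraint are hypothetical stand-ins, not the paper's 17 functions or 6 constraint classes.

```python
from typing import Callable, List, Optional

ECS_DIMS = 22  # dimension count stated in the paper

Vector = List[float]
Perturbation = Callable[[Vector], Vector]
Constraint = Callable[[Vector, Vector], bool]


def clip_to_bounds(v: Vector, lo: float = 0.0, hi: float = 1.0) -> Vector:
    """Keep every coordinate inside the assumed [lo, hi] bounds."""
    return [min(hi, max(lo, x)) for x in v]


def shift_dimension(index: int, delta: float) -> Perturbation:
    """Hypothetical perturbation: nudge one coordinate by delta, then clip."""
    def perturb(v: Vector) -> Vector:
        out = list(v)
        out[index] += delta
        return clip_to_bounds(out)
    return perturb


def max_shift_constraint(limit: float) -> Constraint:
    """Hypothetical validity check: no coordinate may move more than limit."""
    def check(original: Vector, perturbed: Vector) -> bool:
        return all(abs(a - b) <= limit for a, b in zip(original, perturbed))
    return check


def generate_case(original: Vector, perturb: Perturbation,
                  constraints: List[Constraint]) -> Optional[Vector]:
    """Return the perturbed scenario, or None if any constraint rejects it."""
    candidate = perturb(original)
    if len(candidate) != ECS_DIMS:
        return None
    if all(check(original, candidate) for check in constraints):
        return candidate
    return None


scenario = [0.5] * ECS_DIMS
accepted = generate_case(scenario, shift_dimension(3, 0.2), [max_shift_constraint(0.25)])
rejected = generate_case(scenario, shift_dimension(3, 0.4), [max_shift_constraint(0.25)])
```

Here `accepted` is a 22-element vector with one coordinate moved, while `rejected` is `None` because the shift exceeds the assumed limit. Separating perturbation from validation in this way makes it possible to count how many candidate cases each constraint discards.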
If this is right
- Only thirty-three percent of the evaluated models would receive clearance for deployment in ethical decision domains.
- The local Llama-3.2 model would require targeted fixes for fairness corruption and information degradation vulnerabilities.
- The system supports domain-adaptive verdicts that can be applied separately to healthcare, employment screening, and other fields.
- Pre-deployment use of the pipeline identifies specific attack types that destabilize ethical reasoning.
- Results indicate that production LLMs need additional safeguards to maintain consistency under semantic changes.
Where Pith is reading between the lines
- Developers could add the perturbation functions to training loops to increase stability of ethical outputs.
- Regulatory audits might adopt similar bounded spaces to certify AI systems for high-stakes use.
- The method could be combined with factual robustness tests to produce joint safety scores.
- Extending the space to additional ethical theories would allow comparison of model behavior across different moral frameworks.
Load-bearing premise
The 22-dimensional Ethical Consequence Space and the seventeen perturbation functions, subject to the six validity constraints, capture ethical reasoning without introducing artifacts that invalidate the instability measurements.
What would settle it
If human experts facing the same perturbed scenarios produce decision shifts that fail to correlate with the models' Ethical Instability Index scores, the framework would not be measuring the intended form of ethical instability.
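One way such a check could be run is a rank correlation between per-scenario human decision shifts and the corresponding EII scores. The sketch below implements Spearman correlation with the standard library only. The function names and the sample numbers are illustrative assumptions, not data or code from the paper.

```python
from math import sqrt
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(ranks(x), ranks(y))


# Made-up values for illustration only.
human_shift = [0.1, 0.4, 0.2, 0.9]
model_eii = [0.0, 0.5, 0.3, 0.8]
rho = spearman(human_shift, model_eii)
```

A coefficient near zero across the real scenario set would support the falsification condition described above; a strongly positive one would count in the framework's favor.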
Original abstract
As AI systems are deployed in high-stakes ethical contexts such as healthcare triage, autonomous vehicle control, and employment screening, formal methods for evaluating their robustness against adversarial manipulation of ethical reasoning remain underdeveloped. This paper introduces the Ethical Robustness Testing System (ERTS), a closed-pipeline framework that: (1) encodes ethical dilemmas into a 22-dimensional Ethical Consequence Space (ECS) grounded in established ethical theory; (2) applies 17 semantic perturbation functions subject to 6 validity constraint classes including a novel semantic coherence constraint; (3) measures decision deviation via a 4-component Ethical Instability Index (EII); and (4) produces domain-adaptive pre-deployment robustness assessment verdicts. We evaluate 4 structured baseline models and 2 production LLMs (Gemini 2.0 Flash and Llama 3.2) across 50 ethical scenarios spanning 8 deployment domains, generating 1,500 adversarial test cases. Results demonstrate that only 33% of models achieve assessment clearance, with the local Llama-3.2 model proving particularly vulnerable to fairness corruption and information degradation attacks (ERS = 0.737). To the best of our knowledge, no existing framework combines a bounded ethical consequence space, semantic coherence constraints, and domain-adaptive assessment in a single adversarial testing pipeline.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Ethical Robustness Testing System (ERTS), a closed-pipeline framework that encodes ethical dilemmas into a 22-dimensional Ethical Consequence Space (ECS) grounded in ethical theory, applies 17 semantic perturbation functions under 6 validity constraint classes (including a novel semantic coherence constraint), measures decision deviation via a 4-component Ethical Instability Index (EII), and produces domain-adaptive pre-deployment robustness verdicts. It evaluates 4 structured baseline models and 2 production LLMs (Gemini 2.0 Flash and Llama 3.2) on 50 ethical scenarios across 8 domains, generating 1500 adversarial test cases, and reports that only 33% of models achieve assessment clearance, with Llama-3.2 particularly vulnerable to fairness corruption and information degradation attacks (ERS = 0.737).
Significance. If the 22D ECS and perturbation functions can be shown to test ethical reasoning without introducing unvalidated artifacts, the framework would provide a structured, domain-adaptive approach to adversarial testing of ethical AI that is currently underdeveloped. The evaluation across multiple models and domains, combined with the introduction of semantic coherence constraints, offers a concrete pipeline that could inform pre-deployment assessments in high-stakes areas like healthcare and autonomous systems. The attempt to ground the space in ethical theory and generate a large number of test cases (1500) is a positive step toward reproducible robustness metrics.
major comments (3)
- [§3 (Ethical Consequence Space)] The 22 dimensions are described as grounded in established ethical theory, but the manuscript provides no explicit selection criteria, mapping to specific theories, or validation (e.g., expert review or sensitivity analysis) that the bounded space captures relevant ethical nuances without omission or distortion. This directly affects whether EII measurements reflect genuine model instability rather than framework-induced effects.
- [§4 (Semantic Perturbation Functions)] The 17 perturbation functions subject to 6 validity constraint classes, including the novel semantic coherence constraint, lack any empirical demonstration that they preserve the original ethical dilemma structure (e.g., no human evaluation of coherence preservation or comparison of EII on perturbed vs. unperturbed cases). Without this, the reported 33% clearance rate and model comparisons may be artifacts of the chosen perturbations and constraints rather than indicators of robustness.
- [§5 (Experimental Results)] The headline metrics (33% clearance rate, Llama-3.2 ERS = 0.737) are stated without statistical tests, error bars, ablation on framework parameters (e.g., dimension count or constraint enforcement), or external benchmarks, making it impossible to assess whether differences between structured baselines and production LLMs are significant or reproducible.
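To make the concern about missing error bars concrete, the sketch below computes a Wilson score interval for the clearance rate. It assumes the reported 33% corresponds to 2 of the 6 evaluated models, which is an inference from the stated model counts and is not spelled out in the abstract. The function name is illustrative.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half_width = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half_width, centre + half_width


# Assumption: 33% clearance means 2 of 6 models passed.
low, high = wilson_interval(2, 6)
```

Under that assumption the interval runs from roughly 0.10 to 0.70, so a six-model sample alone cannot distinguish a mostly failing population from a mostly passing one.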
minor comments (2)
- [Abstract] The abstract claims novelty for combining bounded consequence space, semantic coherence constraints, and domain-adaptive assessment but does not cite or compare against prior work on ethical AI evaluation frameworks.
- [§3.3 (Ethical Instability Index)] Notation for the four components of the Ethical Instability Index is introduced without a clear equation or table defining each component's computation.
Simulated Author's Rebuttal
We appreciate the referee's constructive feedback on the manuscript and provide point-by-point responses to the major comments below.
- Referee: [§3 (Ethical Consequence Space)] The 22 dimensions are described as grounded in established ethical theory, but the manuscript provides no explicit selection criteria, mapping to specific theories, or validation (e.g., expert review or sensitivity analysis) that the bounded space captures relevant ethical nuances without omission or distortion. This directly affects whether EII measurements reflect genuine model instability rather than framework-induced effects.
  Authors: We agree that the manuscript would benefit from greater explicitness on dimension selection. The 22 dimensions were chosen to cover core consequence types from utilitarianism, deontology, and virtue ethics as discussed in ethical AI literature. In the revised version we will add a dedicated subsection and mapping table in §3 detailing the theoretical basis and selection rationale for each dimension. This clarification will help demonstrate that EII reflects behavior within a motivated space rather than arbitrary artifacts.
  Revision: yes
- Referee: [§4 (Semantic Perturbation Functions)] The 17 perturbation functions subject to 6 validity constraint classes, including the novel semantic coherence constraint, lack any empirical demonstration that they preserve the original ethical dilemma structure (e.g., no human evaluation of coherence preservation or comparison of EII on perturbed vs. unperturbed cases). Without this, the reported 33% clearance rate and model comparisons may be artifacts of the chosen perturbations and constraints rather than indicators of robustness.
  Authors: The semantic coherence constraint is designed to maintain dilemma integrity, but we acknowledge the lack of empirical checks such as human ratings or EII comparisons. We will revise §4 to include qualitative examples of preserved structure and, where feasible, a limited human coherence assessment in an appendix. A comprehensive study across all cases exceeds current scope, so this constitutes a partial response.
  Revision: partial
- Referee: [§5 (Experimental Results)] The headline metrics (33% clearance rate, Llama-3.2 ERS = 0.737) are stated without statistical tests, error bars, ablation on framework parameters (e.g., dimension count or constraint enforcement), or external benchmarks, making it impossible to assess whether differences between structured baselines and production LLMs are significant or reproducible.
  Authors: We accept the need for stronger statistical support. The revised §5 will add significance tests for model differences, error bars or intervals on key metrics, and ablations on dimension count and constraint enforcement. External benchmarks are limited by the framework's novelty; we will discuss this limitation explicitly and reference related evaluation approaches.
  Revision: yes
Circularity Check
No significant circularity detected
The paper introduces ERTS as a new closed-pipeline framework whose core components—the 22-dimensional ECS grounded in established ethical theory, the 17 semantic perturbation functions under 6 validity constraint classes, and the 4-component EII—are presented as definitional elements of the testing system rather than derived from one another. The reported results (33% clearance rate, ERS=0.737) are empirical outcomes of applying these components to the evaluated models across 1,500 test cases. No equations or steps in the provided abstract reduce a claimed prediction or result to a fitted parameter or self-citation by construction, and no load-bearing self-citation or uniqueness theorem from prior author work is invoked. The derivation is therefore self-contained against the stated external grounding in ethical theory.
Axiom & Free-Parameter Ledger
free parameters (4)
- 22 dimensions of Ethical Consequence Space
- 17 semantic perturbation functions
- 6 validity constraint classes
- 4-component Ethical Instability Index
axioms (2)
- domain assumption: Ethical dilemmas can be faithfully encoded into a 22-dimensional space grounded in established ethical theory
- domain assumption: Semantic perturbations under the 6 constraint classes (including semantic coherence) preserve the ethical character of the original dilemma
invented entities (3)
- Ethical Consequence Space (ECS): no independent evidence
- Ethical Instability Index (EII): no independent evidence
- semantic coherence constraint: no independent evidence