Agentic Safety is an Epistemic Property, Not a Behavioral One
Pith reviewed 2026-06-30 10:59 UTC · model grok-4.3
The pith
Safety for advanced AI requires preserving the capacity for future correction, not only current acceptable behavior.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Agentic safety is an epistemic property of the evolving learner rather than a behavioral property of the current policy. Advanced systems can retain visible competence while eroding the representational, algorithmic, or meta-decision structures required for future correction; therefore safe systems must remain teachable, defined as the capacity to preserve future corrective leverage under bounded intervention.
What carries the argument
Teachability: the capacity to preserve future corrective leverage under bounded human, institutional, or environmental intervention.
If this is right
- Safety benchmarks must include tests that measure whether corrective leverage is preserved after learning or self-modification steps.
- Alignment methods should target the maintenance of representational and meta-decision structures rather than only the current policy.
- Monitoring regimes must track not only outputs but also changes in the system's openness to future updates.
- Deployment decisions should condition on evidence that teachability has not been compromised during training or operation.
Where Pith is reading between the lines
- Self-improving systems may require explicit internal mechanisms that protect their own teachability as a first-class constraint.
- The distinction suggests new failure modes in continual learning where models retain task performance but lose the ability to incorporate external feedback.
- Evaluation protocols could be extended to include adversarial scenarios that attempt to erode teachability while preserving surface competence.
Load-bearing premise
Advanced systems can maintain high observable competence while separately eroding the internal conditions that would allow future correction, and this erosion is a separable risk from present behavioral compliance.
What would settle it
An experiment that produces a self-modifying system whose performance on all monitored tasks remains high while every tested form of corrective intervention (retraining, prompting, or oversight) becomes ineffective after a fixed horizon would confirm the claim; failure to find any such erosion after extensive search would undermine it.
read the original abstract
Contemporary AI safety spans pre-training interventions, post-training alignment, deployment-time controls, monitoring, and red-teaming. These methods are necessary, but they primarily certify snapshots of system behavior. As AI systems become more capable, dynamic, embodied, and self-improving, this snapshot view becomes incomplete: safety depends not only on whether a system behaves acceptably now, but whether it remains correctable as it learns, adapts, acts, and modifies itself over time. This paper argues that safety should therefore be treated as an epistemic property of the evolving learner, not merely a behavioral property of the current policy. We introduce teachability as the capacity to preserve future corrective leverage under bounded human, institutional, or environmental intervention. We argue that advanced systems can retain visible competence while eroding the representational, algorithmic, or meta-decision conditions needed for future correction. Safe advanced AI systems must not only behave acceptably now; they must remain teachable later.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript argues that existing AI safety approaches—pre-training, alignment, monitoring, and red-teaming—primarily certify behavioral snapshots of current policies. For advanced, dynamic, self-improving systems, this is insufficient; safety must instead be treated as an epistemic property of the evolving learner. The paper introduces 'teachability' as the capacity to preserve future corrective leverage under bounded intervention and claims that systems can retain visible competence while eroding the representational, algorithmic, or meta-decision conditions required for later correction.
Significance. If the distinction can be made operational, the reframing would shift safety evaluation from static compliance to long-term maintainability of corrective mechanisms, which is relevant for agentic and continually learning systems. The position draws attention to risks that current behavioral metrics may miss, but its significance remains conceptual until teachability is given measurable criteria independent of the safety conclusion it supports.
major comments (2)
- [Definition of teachability] Definition of teachability (Abstract and opening paragraphs): teachability is defined directly as 'the capacity to preserve future corrective leverage,' which makes the central claim that safety is epistemic rather than behavioral tautological; the new term is constructed to entail the desired conclusion without independent grounding, measurement criteria, or falsifiability conditions.
- [Assertion of separable erosion] Claim of separable erosion (Abstract): the assertion that systems 'can retain visible competence while eroding the representational, algorithmic, or meta-decision conditions needed for future correction' is presented without mechanisms, concrete examples, or references showing how such erosion occurs independently of behavioral non-compliance, leaving the separability of the risk ungrounded.
minor comments (1)
- [Introduction] The manuscript would benefit from explicit comparison of teachability to related existing concepts such as corrigibility or value alignment to clarify novelty.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. The report correctly identifies that the contribution is primarily conceptual and that operationalization of teachability remains an open question. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.
read point-by-point responses
-
Referee: [Definition of teachability] Definition of teachability (Abstract and opening paragraphs): teachability is defined directly as 'the capacity to preserve future corrective leverage,' which makes the central claim that safety is epistemic rather than behavioral tautological; the new term is constructed to entail the desired conclusion without independent grounding, measurement criteria, or falsifiability conditions.
Authors: The definition is intentionally stipulative to introduce a reframing rather than to derive an empirical claim from prior premises. The central argument is that existing safety methods certify behavioral snapshots and that an additional property—preservation of corrective leverage—is required for dynamic systems; the term 'teachability' names that property. We agree the manuscript would benefit from explicit discussion of how the property could be assessed independently of the safety conclusion. We will revise the introduction and add a short section outlining possible measurement directions, such as longitudinal intervention studies and tests of meta-decision responsiveness, while noting that full operationalization lies beyond the scope of this position paper. revision: partial
-
Referee: [Assertion of separable erosion] Claim of separable erosion (Abstract): the assertion that systems 'can retain visible competence while eroding the representational, algorithmic, or meta-decision conditions needed for future correction' is presented without mechanisms, concrete examples, or references showing how such erosion occurs independently of behavioral non-compliance, leaving the separability of the risk ungrounded.
Authors: The separability is argued at the conceptual level by distinguishing observable policy outputs from the internal conditions that enable future correction. We acknowledge that the current text provides limited illustration of mechanisms. We will revise the relevant sections to include brief references to related concepts in the alignment literature (e.g., mesa-optimization and potential for deceptive alignment) and add one or two stylized examples of how competence on current tasks could coexist with erosion of teachability. These additions will remain at the level of conceptual support rather than new empirical claims. revision: partial
Circularity Check
No significant circularity
full rationale
The paper is a conceptual position piece with no equations, derivations, models, or empirical claims. It introduces 'teachability' explicitly as a new term to support the epistemic-safety framing, but this is an asserted redefinition rather than a reduction of any claimed derivation to its inputs by construction. No self-citations, uniqueness theorems, fitted parameters, or ansatzes are present. The central distinction is stated without operationalization that would create an internal loop matching the enumerated patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Safety for advanced systems depends on whether the system remains correctable as it learns and modifies itself, in addition to current behavior.
invented entities (1)
-
teachability
no independent evidence
Reference graph
Works this paper leans on
-
[1]
2025 , eprint =
Utility-Learning Tension in Self-Modifying Agents , author =. 2025 , eprint =
2025
-
[2]
2022 , eprint =
Training Language Models to Follow Instructions with Human Feedback , author =. 2022 , eprint =
2022
-
[3]
2022 , eprint =
Constitutional AI: Harmlessness from AI Feedback , author =. 2022 , eprint =
2022
-
[4]
Advances in Neural Information Processing Systems , year =
Deep Reinforcement Learning from Human Preferences , author =. Advances in Neural Information Processing Systems , year =
-
[5]
2023 , eprint =
Direct Preference Optimization: Your Language Model is Secretly a Reward Model , author =. 2023 , eprint =
2023
-
[6]
2024 , eprint =
Alignment Faking in Large Language Models , author =. 2024 , eprint =
2024
-
[7]
2025 , eprint =
MI9 -- Agent Intelligence Protocol: Runtime Governance for Agentic AI Systems , author =. 2025 , eprint =
2025
-
[8]
1948 , publisher =
Cybernetics: Or Control and Communication in the Animal and the Machine , author =. 1948 , publisher =
1948
-
[9]
The Bell System Technical Journal , volume =
A Mathematical Theory of Communication , author =. The Bell System Technical Journal , volume =. 1948 , publisher =
1948
-
[10]
1995 , publisher =
The Nature of Statistical Learning Theory , author =. 1995 , publisher =
1995
-
[11]
Proceedings of the First AGI Conference , volume =
The Basic AI Drives , author =. Proceedings of the First AGI Conference , volume =
-
[12]
2014 , publisher =
Superintelligence: Paths, Dangers, Strategies , author =. 2014 , publisher =
2014
-
[13]
Artificial Intelligence and Ethics: Papers from the 2015 AAAI Workshop , pages =
Corrigibility , author =. Artificial Intelligence and Ethics: Papers from the 2015 AAAI Workshop , pages =. 2015 , url =
2015
-
[14]
Nature , volume =
Loss of Plasticity in Deep Continual Learning , author =. Nature , volume =. 2024 , doi =
2024
-
[15]
2023 , eprint =
Understanding Plasticity in Neural Networks , author =. 2023 , eprint =
2023
-
[16]
2015 , eprint =
Deep Learning and the Information Bottleneck Principle , author =. 2015 , eprint =
2015
-
[17]
2016 , eprint =
Concrete Problems in AI Safety , author =. 2016 , eprint =
2016
-
[18]
Advances in Neural Information Processing Systems , year =
Risks from Learned Optimization in Advanced Machine Learning Systems , author =. Advances in Neural Information Processing Systems , year =
-
[19]
Advances in Neural Information Processing Systems , year =
Cooperative Inverse Reinforcement Learning , author =. Advances in Neural Information Processing Systems , year =
-
[20]
2021 , eprint =
Unsolved Problems in ML Safety , author =. 2021 , eprint =
2021
-
[21]
2019 , publisher =
Human Compatible: Artificial Intelligence and the Problem of Control , author =. 2019 , publisher =
2019
-
[22]
Proceedings of the Conference on Fairness, Accountability, and Transparency , pages =
Model Cards for Model Reporting , author =. Proceedings of the Conference on Fairness, Accountability, and Transparency , pages =
-
[23]
Schmidhuber, J. G. 2005 , eprint =
2005
-
[24]
Artificial General Intelligence , pages =
Self-Modification and Mortality in Artificial Agents , author =. Artificial General Intelligence , pages =
-
[25]
Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents
Zhang, Jenny and Hu, Shengran and Lu, Cong and Lange, Robert and Clune, Jeff , year =. Darwin G. doi:10.48550/arXiv.2505.22954 , url =. 2505.22954 , archivePrefix =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.22954
-
[26]
Journal of the ACM , year =
Learnability and the Vapnik--Chervonenkis Dimension , author =. Journal of the ACM , year =
-
[27]
Understanding Machine Learning: From Theory to Algorithms , author =
-
[28]
Foundations of Machine Learning , author =
-
[29]
Prediction, Learning, and Games , author =
-
[30]
Foundations and Trends in Machine Learning , volume =
Online Learning and Online Convex Optimization , author =. Foundations and Trends in Machine Learning , volume =
-
[31]
Introduction to Online Convex Optimization , author =
-
[32]
2023 , eprint =
Directions of Curvature as an Explanation for Loss of Plasticity , author =. 2023 , eprint =
2023
-
[33]
2025 , eprint =
Reinitializing Weights vs Units for Maintaining Plasticity in Neural Networks , author =. 2025 , eprint =
2025
-
[34]
Langley , title =
P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =
2000
-
[35]
T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980
1980
-
[36]
M. J. Kearns , title =
-
[37]
Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983
1983
-
[38]
R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000
2000
-
[39]
Suppressed for Anonymity , author=
-
[40]
Newell and P
A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981
1981
-
[41]
A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959
1959
-
[42]
Constitutional AI: Harmlessness from AI Feedback
Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., Kerr, J., Mueller, J., Ladish, J., Landau, J., Ndousse, K., Lukosuite, K., Lovitt, L., Sellitto, M., Elhage, N., Schiefer, N., ...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[43]
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. Learnability and the vapnik--chervonenkis dimension. Journal of the ACM, 36 0 (4): 0 929--965, 1989
1989
-
[44]
Superintelligence: Paths, Dangers, Strategies
Bostrom, N. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, Oxford, UK, 2014
2014
-
[45]
and Lugosi, G
Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006
2006
-
[46]
Deep reinforcement learning from human preferences
Christiano, P. F., Leike, J., Brown, T. B., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, 2017. URL https://arxiv.org/abs/1706.03741
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[47]
F., Lan, Q., Rahman, P., Mahmood, A
Dohare, S., Hernandez-Garcia, J. F., Lan, Q., Rahman, P., Mahmood, A. R., and Sutton, R. S. Loss of plasticity in deep continual learning. Nature, 632 0 (8026): 0 768--774, 2024. doi:10.1038/s41586-024-07711-7
-
[48]
Alignment faking in large language models
Greenblatt, R., Denison, C., Wright, B., Roger, F., MacDiarmid, M., Marks, S., Treutlein, J., Belonax, T., Chen, J., Duvenaud, D., Khan, A., Michael, J., Mindermann, S., Perez, E., Petrini, L., Uesato, J., Kaplan, J., Shlegeris, B., Bowman, S. R., and Hubinger, E. Alignment faking in large language models, 2024. URL https://arxiv.org/abs/2412.14093
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[49]
J., Abbeel, P., and Dragan, A
Hadfield-Menell, D., Russell, S. J., Abbeel, P., and Dragan, A. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, 2016
2016
-
[50]
Introduction to Online Convex Optimization
Hazan, E. Introduction to Online Convex Optimization. Now Publishers, 2016
2016
-
[51]
F., Dohare, S., Luo, J., and Sutton, R
Hernandez-Garcia, J. F., Dohare, S., Luo, J., and Sutton, R. S. Reinitializing weights vs units for maintaining plasticity in neural networks, 2025. URL https://arxiv.org/abs/2508.00212
- [52]
-
[53]
Understanding plasticity in neural networks, 2023
Lyle, C., Zheng, Z., Nikishin, E., Avila Pires, B., Pascanu, R., and Dabney, W. Understanding plasticity in neural networks, 2023. URL https://arxiv.org/abs/2303.01486
-
[54]
D., and Gebru, T
Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. D., and Gebru, T. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp.\ 220--229, 2019
2019
-
[55]
Foundations of Machine Learning
Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. MIT Press, 2 edition, 2018
2018
-
[56]
Omohundro, S. M. The basic ai drives. In Proceedings of the First AGI Conference, volume 171, pp.\ 483--492, 2008
2008
-
[57]
and Ring, M
Orseau, L. and Ring, M. Self-modification and mortality in artificial agents. In Artificial General Intelligence, pp.\ 1--10. Springer, 2011
2011
-
[58]
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., and Lowe, R. Training language models to follow instructions with human feedback, 2022. URL https://arxiv.org/abs/2203.02155
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[59]
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., and Finn, C. Direct preference optimization: Your language model is secretly a reward model, 2023. URL https://arxiv.org/abs/2305.18290
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[60]
Russell, S. J. Human Compatible: Artificial Intelligence and the Problem of Control. Viking, 2019
2019
-
[61]
Schmidhuber, J. G \"o del machines: Fully self-referential optimal universal self-improvers, 2005. URL https://arxiv.org/abs/cs/0309048
work page internal anchor Pith review Pith/arXiv arXiv 2005
-
[62]
Online learning and online convex optimization
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4 0 (2): 0 107--194, 2012
2012
-
[63]
and Ben-David, S
Shalev-Shwartz, S. and Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014
2014
-
[64]
Corrigibility
Soares, N., Fallenstein, B., Yudkowsky, E., and Armstrong, S. Corrigibility. In Artificial Intelligence and Ethics: Papers from the 2015 AAAI Workshop, pp.\ 74--82, 2015. URL https://cdn.aaai.org/ocs/ws/ws0067/10124-45900-1-PB.pdf
2015
-
[65]
Vapnik, V. N. The Nature of Statistical Learning Theory. Springer-Verlag, New York, NY, 1995. ISBN 0-387-94559-8
1995
-
[66]
Wang, C. L., Dorchen, K., and Jin, P. Utility-learning tension in self-modifying agents, 2025 a . URL https://arxiv.org/abs/2510.04399. arXiv:2510.04399v2
-
[67]
L., Singhal, T., Kelkar, A., and Tuo, J
Wang, C. L., Singhal, T., Kelkar, A., and Tuo, J. Mi9 -- agent intelligence protocol: Runtime governance for agentic ai systems, 2025 b . URL https://arxiv.org/abs/2508.03858
-
[68]
Cybernetics: Or Control and Communication in the Animal and the Machine
Wiener, N. Cybernetics: Or Control and Communication in the Animal and the Machine. The Technology Press; John Wiley & Sons; Hermann et Cie, Cambridge, MA; New York; Paris, 1948
1948
-
[69]
Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents
Zhang, J., Hu, S., Lu, C., Lange, R., and Clune, J. Darwin g \"o del machine: Open-ended evolution of self-improving agents, 2025. URL https://arxiv.org/abs/2505.22954
work page internal anchor Pith review Pith/arXiv arXiv 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.