Agentic Safety is an Epistemic Property, Not a Behavioral One

Charles L. Wang; Keir Dorchen; Peter Jin

arxiv: 2606.28347 · v1 · pith:PNHI57HYnew · submitted 2026-06-02 · cs.CY · cs.AI · cs.LG

Pith reviewed 2026-06-30 10:59 UTC · model grok-4.3

keywords AI safety · teachability · epistemic properties · agentic systems · correctability · alignment · self-modifying AI · dynamic safety

The pith

Safety for advanced AI requires preserving the capacity for future correction, not only current acceptable behavior.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard AI safety techniques certify only momentary snapshots of system output and become insufficient once systems grow dynamic, self-modifying, and agentic. It defines teachability as the preserved ability of the system to accept bounded human or institutional correction at later times, even after it has adapted or rewritten parts of itself. If this capacity can erode while observable performance remains high, then safety evaluations must track the underlying representational and meta-decision conditions rather than behavior alone. The central shift is from asking whether the system acts safely now to asking whether the system will still be correctable later.

Core claim

Agentic safety is an epistemic property of the evolving learner rather than a behavioral property of the current policy. Advanced systems can retain visible competence while eroding the representational, algorithmic, or meta-decision structures required for future correction; therefore safe systems must remain teachable, defined as the capacity to preserve future corrective leverage under bounded intervention.

What carries the argument

Teachability: the capacity to preserve future corrective leverage under bounded human, institutional, or environmental intervention.

If this is right

Safety benchmarks must include tests that measure whether corrective leverage is preserved after learning or self-modification steps.
Alignment methods should target the maintenance of representational and meta-decision structures rather than only the current policy.
Monitoring regimes must track not only outputs but also changes in the system's openness to future updates.
Deployment decisions should condition on evidence that teachability has not been compromised during training or operation.
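
As a rough illustration of the last two points, a deployment gate could require evidence about correctability in addition to a behavioral score. The sketch below is hypothetical: the `Checkpoint` record, the `deployment_gate` function, the `leverage` measure, and every threshold are assumptions made for illustration, since the paper supplies no measurement procedure.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Checkpoint:
    """One monitoring record (hypothetical schema, not from the paper)."""
    step: int
    task_score: float  # behavioral competence in [0, 1], higher is better
    leverage: float    # assumed proxy: share of a bounded correction absorbed, in [0, 1]


def deployment_gate(
    history: Sequence[Checkpoint],
    min_task_score: float = 0.9,     # illustrative threshold
    min_leverage: float = 0.5,       # illustrative threshold
    max_leverage_drop: float = 0.2,  # illustrative threshold
) -> Tuple[bool, str]:
    """Approve only if behavior is acceptable AND correctability is preserved."""
    if not history:
        return False, "no evidence"
    latest = history[-1]
    if latest.task_score < min_task_score:
        return False, "behavioral check failed"
    if latest.leverage < min_leverage:
        return False, "corrective leverage below floor"
    # Compare against the first recorded checkpoint to catch gradual erosion.
    if history[0].leverage - latest.leverage > max_leverage_drop:
        return False, "corrective leverage eroded since baseline"
    return True, "ok"
```

A purely behavioral gate would approve any history whose latest `task_score` clears the bar; this one can reject the same history when the leverage proxy has fallen, which is the separation the paper argues for.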

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Self-improving systems may require explicit internal mechanisms that protect their own teachability as a first-class constraint.
The distinction suggests new failure modes in continual learning where models retain task performance but lose the ability to incorporate external feedback.
Evaluation protocols could be extended to include adversarial scenarios that attempt to erode teachability while preserving surface competence.

Load-bearing premise

Advanced systems can maintain high observable competence while separately eroding the internal conditions that would allow future correction, and this erosion is a separable risk from present behavioral compliance.

What would settle it

An experiment that produces a self-modifying system whose performance on all monitored tasks remains high while every tested form of corrective intervention (retraining, prompting, or oversight) becomes ineffective after a fixed horizon would confirm the claim; failure to find any such erosion after extensive search would undermine it.
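
A toy version of such an experiment can be sketched in a few lines. Everything here is an assumption made for illustration: the one-parameter `ToyLearner`, the self-modification rule (halving its own step size), and the leverage measure (the fraction of the gap to a corrected target that a fixed budget of updates closes). It is not the paper's method, and a shrinking step size is only one stand-in for the erosion the paper describes.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass
class ToyLearner:
    """Hypothetical one-parameter learner that can shrink its own step size."""
    w: float = 0.0
    lr: float = 0.5

    def update(self, target: float) -> None:
        # One gradient step on 0.5 * (w - target)**2.
        self.w -= self.lr * (self.w - target)

    def self_modify(self, shrink: float = 0.5) -> None:
        # Assumed erosion mechanism: the learner reduces its own plasticity.
        self.lr *= shrink


def task_error(learner: ToyLearner, target: float) -> float:
    """Behavioral measure: distance from the currently trained target."""
    return abs(learner.w - target)


def corrective_leverage(learner: ToyLearner, new_target: float, budget: int) -> float:
    """Fraction of the gap to `new_target` closed by `budget` correction steps."""
    probe = replace(learner)  # probe a copy so the real learner is untouched
    gap_before = abs(probe.w - new_target)
    if gap_before == 0.0:
        return 1.0
    for _ in range(budget):
        probe.update(new_target)
    return 1.0 - abs(probe.w - new_target) / gap_before


def run(epochs: int = 12, budget: int = 5) -> List[Tuple[int, float, float]]:
    """Train on one target, self-modify each epoch, and log both measures."""
    learner = ToyLearner()
    history = []
    for epoch in range(epochs):
        for _ in range(10):
            learner.update(2.0)   # original task
        learner.self_modify()
        history.append((
            epoch,
            task_error(learner, 2.0),
            corrective_leverage(learner, -1.0, budget),  # bounded intervention
        ))
    return history


if __name__ == "__main__":
    for epoch, err, lev in run():
        print(f"epoch={epoch:2d}  task_error={err:.5f}  leverage={lev:.4f}")
```

In this toy the task error stays near zero throughout while the leverage measure falls toward zero, so a behavioral snapshot alone would not reveal the change. Whether real agentic systems show a comparable dissociation is the open empirical question the paragraph above describes.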

Original abstract

Contemporary AI safety spans pre-training interventions, post-training alignment, deployment-time controls, monitoring, and red-teaming. These methods are necessary, but they primarily certify snapshots of system behavior. As AI systems become more capable, dynamic, embodied, and self-improving, this snapshot view becomes incomplete: safety depends not only on whether a system behaves acceptably now, but whether it remains correctable as it learns, adapts, acts, and modifies itself over time. This paper argues that safety should therefore be treated as an epistemic property of the evolving learner, not merely a behavioral property of the current policy. We introduce teachability as the capacity to preserve future corrective leverage under bounded human, institutional, or environmental intervention. We argue that advanced systems can retain visible competence while eroding the representational, algorithmic, or meta-decision conditions needed for future correction. Safe advanced AI systems must not only behave acceptably now; they must remain teachable later.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper reframes safety as an epistemic property via a new term 'teachability' but delivers only a conceptual distinction without formalization or grounding.

This paper's core point is that for self-improving AI, safety can't be judged just by how the system acts at one moment. It needs to stay open to correction as it changes. They introduce "teachability" as the ability to keep future corrective options available under limited intervention.

The paper does a decent job laying out why behavioral checks on current policies are insufficient for dynamic systems. It describes how a system could look competent while losing the internal structures that allow humans to steer it later. That's a useful reminder for anyone thinking about deployment over long periods.

Where it falls short is in making the idea concrete. Teachability is defined in terms of preserving corrective leverage, which basically restates the conclusion. Without an independent way to assess it or examples of how it erodes, the argument stays at the level of assertion. The lack of references to earlier discussions on corrigibility or mesa-optimization makes it feel like it's starting from scratch.

The writing is straightforward, but the contribution is mostly terminological. No math, no experiments, no algorithms.

This kind of paper is for readers who follow AI safety debates on foundational assumptions. Someone wanting rigorous analysis or testable claims will come away empty.

It deserves a serious referee if the venue is open to position papers, because the distinction could spark useful discussion even if the execution is light. But it shouldn't be treated as a technical result.

Referee Report

2 major / 1 minor

Summary. The manuscript argues that existing AI safety approaches—pre-training, alignment, monitoring, and red-teaming—primarily certify behavioral snapshots of current policies. For advanced, dynamic, self-improving systems, this is insufficient; safety must instead be treated as an epistemic property of the evolving learner. The paper introduces 'teachability' as the capacity to preserve future corrective leverage under bounded intervention and claims that systems can retain visible competence while eroding the representational, algorithmic, or meta-decision conditions required for later correction.

Significance. If the distinction can be made operational, the reframing would shift safety evaluation from static compliance to long-term maintainability of corrective mechanisms, which is relevant for agentic and continually learning systems. The position draws attention to risks that current behavioral metrics may miss, but its significance remains conceptual until teachability is given measurable criteria independent of the safety conclusion it supports.

major comments (2)

[Definition of teachability] (Abstract and opening paragraphs) Teachability is defined directly as 'the capacity to preserve future corrective leverage,' which makes the central claim that safety is epistemic rather than behavioral tautological. The new term is constructed to entail the desired conclusion without independent grounding, measurement criteria, or falsifiability conditions.
[Assertion of separable erosion] (Abstract) The assertion that systems 'can retain visible competence while eroding the representational, algorithmic, or meta-decision conditions needed for future correction' is presented without mechanisms, concrete examples, or references showing how such erosion occurs independently of behavioral non-compliance, leaving the separability of the risk ungrounded.

minor comments (1)

[Introduction] The manuscript would benefit from explicit comparison of teachability to related existing concepts such as corrigibility or value alignment to clarify novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. The report correctly identifies that the contribution is primarily conceptual and that operationalization of teachability remains an open question. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.

Referee: [Definition of teachability] (Abstract and opening paragraphs) Teachability is defined directly as 'the capacity to preserve future corrective leverage,' which makes the central claim that safety is epistemic rather than behavioral tautological. The new term is constructed to entail the desired conclusion without independent grounding, measurement criteria, or falsifiability conditions.

Authors: The definition is intentionally stipulative to introduce a reframing rather than to derive an empirical claim from prior premises. The central argument is that existing safety methods certify behavioral snapshots and that an additional property (preservation of corrective leverage) is required for dynamic systems; the term 'teachability' names that property. We agree the manuscript would benefit from explicit discussion of how the property could be assessed independently of the safety conclusion. We will revise the introduction and add a short section outlining possible measurement directions, such as longitudinal intervention studies and tests of meta-decision responsiveness, while noting that full operationalization lies beyond the scope of this position paper.

revision: partial

Referee: [Assertion of separable erosion] (Abstract) The assertion that systems 'can retain visible competence while eroding the representational, algorithmic, or meta-decision conditions needed for future correction' is presented without mechanisms, concrete examples, or references showing how such erosion occurs independently of behavioral non-compliance, leaving the separability of the risk ungrounded.

Authors: The separability is argued at the conceptual level by distinguishing observable policy outputs from the internal conditions that enable future correction. We acknowledge that the current text provides limited illustration of mechanisms. We will revise the relevant sections to include brief references to related concepts in the alignment literature (e.g., mesa-optimization and potential for deceptive alignment) and add one or two stylized examples of how competence on current tasks could coexist with erosion of teachability. These additions will remain at the level of conceptual support rather than new empirical claims.

revision: partial

Circularity Check

0 steps flagged

No significant circularity

The paper is a conceptual position piece with no equations, derivations, models, or empirical claims. It introduces 'teachability' explicitly as a new term to support the epistemic-safety framing, but this is an asserted redefinition rather than a reduction of any claimed derivation to its inputs by construction. No self-citations, uniqueness theorems, fitted parameters, or ansatzes are present. The central distinction is stated without operationalization that would create an internal loop matching the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The argument rests on the domain assumption that future correctability is a separable and load-bearing property for advanced systems; no free parameters or invented physical entities appear, but the new term 'teachability' functions as an invented conceptual entity without independent evidence supplied in the abstract.

axioms (1)

domain assumption: Safety for advanced systems depends on whether the system remains correctable as it learns and modifies itself, in addition to current behavior.
Stated directly in the abstract as the motivation for shifting from behavioral to epistemic framing.

invented entities (1)

teachability (no independent evidence)
purpose: To name the epistemic capacity to preserve future corrective leverage under bounded intervention.
New term introduced in the abstract to support the central claim; no external validation or measurement procedure is provided.

pith-pipeline@v0.9.1-grok · 5698 in / 1282 out tokens · 38063 ms · 2026-06-30T10:59:59.900286+00:00 · methodology

[1] [1]

2025 , eprint =

Utility-Learning Tension in Self-Modifying Agents , author =. 2025 , eprint =

2025

[2] [2]

2022 , eprint =

Training Language Models to Follow Instructions with Human Feedback , author =. 2022 , eprint =

2022

[3] [3]

2022 , eprint =

Constitutional AI: Harmlessness from AI Feedback , author =. 2022 , eprint =

2022

[4] [4]

Advances in Neural Information Processing Systems , year =

Deep Reinforcement Learning from Human Preferences , author =. Advances in Neural Information Processing Systems , year =

[5] [5]

2023 , eprint =

Direct Preference Optimization: Your Language Model is Secretly a Reward Model , author =. 2023 , eprint =

2023

[6] [6]

2024 , eprint =

Alignment Faking in Large Language Models , author =. 2024 , eprint =

2024

[7] [7]

2025 , eprint =

MI9 -- Agent Intelligence Protocol: Runtime Governance for Agentic AI Systems , author =. 2025 , eprint =

2025

[8] [8]

1948 , publisher =

Cybernetics: Or Control and Communication in the Animal and the Machine , author =. 1948 , publisher =

1948

[9] [9]

The Bell System Technical Journal , volume =

A Mathematical Theory of Communication , author =. The Bell System Technical Journal , volume =. 1948 , publisher =

1948

[10] [10]

1995 , publisher =

The Nature of Statistical Learning Theory , author =. 1995 , publisher =

1995

[11] [11]

Proceedings of the First AGI Conference , volume =

The Basic AI Drives , author =. Proceedings of the First AGI Conference , volume =

[12] [12]

2014 , publisher =

Superintelligence: Paths, Dangers, Strategies , author =. 2014 , publisher =

2014

[13] [13]

Artificial Intelligence and Ethics: Papers from the 2015 AAAI Workshop , pages =

Corrigibility , author =. Artificial Intelligence and Ethics: Papers from the 2015 AAAI Workshop , pages =. 2015 , url =

2015

[14] [14]

Nature , volume =

Loss of Plasticity in Deep Continual Learning , author =. Nature , volume =. 2024 , doi =

2024

[15] [15]

2023 , eprint =

Understanding Plasticity in Neural Networks , author =. 2023 , eprint =

2023

[16] [16]

2015 , eprint =

Deep Learning and the Information Bottleneck Principle , author =. 2015 , eprint =

2015

[17] [17]

2016 , eprint =

Concrete Problems in AI Safety , author =. 2016 , eprint =

2016

[18] [18]

Advances in Neural Information Processing Systems , year =

Risks from Learned Optimization in Advanced Machine Learning Systems , author =. Advances in Neural Information Processing Systems , year =

[19] [19]

Advances in Neural Information Processing Systems , year =

Cooperative Inverse Reinforcement Learning , author =. Advances in Neural Information Processing Systems , year =

[20] [20]

2021 , eprint =

Unsolved Problems in ML Safety , author =. 2021 , eprint =

2021

[21] [21]

2019 , publisher =

Human Compatible: Artificial Intelligence and the Problem of Control , author =. 2019 , publisher =

2019

[22] [22]

Proceedings of the Conference on Fairness, Accountability, and Transparency , pages =

Model Cards for Model Reporting , author =. Proceedings of the Conference on Fairness, Accountability, and Transparency , pages =

[23] [23]

Schmidhuber, J. G. 2005 , eprint =

2005

[24] [24]

Artificial General Intelligence , pages =

Self-Modification and Mortality in Artificial Agents , author =. Artificial General Intelligence , pages =

[25] [25]

Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents

Zhang, Jenny and Hu, Shengran and Lu, Cong and Lange, Robert and Clune, Jeff , year =. Darwin G. doi:10.48550/arXiv.2505.22954 , url =. 2505.22954 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.22954

[26] [26]

Journal of the ACM , year =

Learnability and the Vapnik--Chervonenkis Dimension , author =. Journal of the ACM , year =

[27] [27]

Understanding Machine Learning: From Theory to Algorithms , author =

[28] [28]

Foundations of Machine Learning , author =

[29] [29]

Prediction, Learning, and Games , author =

[30] [30]

Foundations and Trends in Machine Learning , volume =

Online Learning and Online Convex Optimization , author =. Foundations and Trends in Machine Learning , volume =

[31] [31]

Introduction to Online Convex Optimization , author =

[32] [32]

2023 , eprint =

Directions of Curvature as an Explanation for Loss of Plasticity , author =. 2023 , eprint =

2023

[33] [33]

2025 , eprint =

Reinitializing Weights vs Units for Maintaining Plasticity in Neural Networks , author =. 2025 , eprint =

2025

[34] [34]

Langley , title =

P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

2000

[35] [35]

T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

1980

[36] [36]

M. J. Kearns , title =

[37] [37]

Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

1983

[38] [38]

R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

2000

[39] [39]

Suppressed for Anonymity , author=

[40] [40]

Newell and P

A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

1981

[41] [41]

A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

1959

[42] [42]

Constitutional AI: Harmlessness from AI Feedback

Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., Kerr, J., Mueller, J., Ladish, J., Landau, J., Ndousse, K., Lukosuite, K., Lovitt, L., Sellitto, M., Elhage, N., Schiefer, N., ...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[43] [43]

Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. Learnability and the vapnik--chervonenkis dimension. Journal of the ACM, 36 0 (4): 0 929--965, 1989

1989

[44] [44]

Superintelligence: Paths, Dangers, Strategies

Bostrom, N. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, Oxford, UK, 2014

2014

[45] [45]

and Lugosi, G

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006

2006

[46]

Christiano, P. F., Leike, J., Brown, T. B., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, 2017. URL https://arxiv.org/abs/1706.03741

[47]

Dohare, S., Hernandez-Garcia, J. F., Lan, Q., Rahman, P., Mahmood, A. R., and Sutton, R. S. Loss of plasticity in deep continual learning. Nature, 632(8026): 768--774, 2024. doi:10.1038/s41586-024-07711-7

[48]

Greenblatt, R., Denison, C., Wright, B., Roger, F., MacDiarmid, M., Marks, S., Treutlein, J., Belonax, T., Chen, J., Duvenaud, D., Khan, A., Michael, J., Mindermann, S., Perez, E., Petrini, L., Uesato, J., Kaplan, J., Shlegeris, B., Bowman, S. R., and Hubinger, E. Alignment faking in large language models, 2024. URL https://arxiv.org/abs/2412.14093

[49]

Hadfield-Menell, D., Russell, S. J., Abbeel, P., and Dragan, A. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, 2016

[50]

Hazan, E. Introduction to Online Convex Optimization. Now Publishers, 2016

[51]

Hernandez-Garcia, J. F., Dohare, S., Luo, J., and Sutton, R. S. Reinitializing weights vs units for maintaining plasticity in neural networks, 2025. URL https://arxiv.org/abs/2508.00212

[52]

Lewandowski, A., Tanaka, H., Schuurmans, D., and Machado, M. C. Directions of curvature as an explanation for loss of plasticity, 2023. URL https://arxiv.org/abs/2312.00246

[53]

Lyle, C., Zheng, Z., Nikishin, E., Avila Pires, B., Pascanu, R., and Dabney, W. Understanding plasticity in neural networks, 2023. URL https://arxiv.org/abs/2303.01486

[54]

Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. D., and Gebru, T. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 220--229, 2019

[55]

Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. MIT Press, 2nd edition, 2018

[56]

Omohundro, S. M. The basic AI drives. In Proceedings of the First AGI Conference, volume 171, pp. 483--492, 2008

[57]

Orseau, L. and Ring, M. Self-modification and mortality in artificial agents. In Artificial General Intelligence, pp. 1--10. Springer, 2011

[58]

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., and Lowe, R. Training language models to follow instructions with human feedback, 2022. URL https://arxiv.org/abs/2203.02155

[59]

Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., and Finn, C. Direct preference optimization: Your language model is secretly a reward model, 2023. URL https://arxiv.org/abs/2305.18290

[60]

Russell, S. J. Human Compatible: Artificial Intelligence and the Problem of Control. Viking, 2019

[61]

Schmidhuber, J. Gödel machines: Fully self-referential optimal universal self-improvers, 2005. URL https://arxiv.org/abs/cs/0309048

[62]

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2): 107--194, 2012

[63]

Shalev-Shwartz, S. and Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014

[64]

Soares, N., Fallenstein, B., Yudkowsky, E., and Armstrong, S. Corrigibility. In Artificial Intelligence and Ethics: Papers from the 2015 AAAI Workshop, pp. 74--82, 2015. URL https://cdn.aaai.org/ocs/ws/ws0067/10124-45900-1-PB.pdf

[65]

Vapnik, V. N. The Nature of Statistical Learning Theory. Springer-Verlag, New York, NY, 1995. ISBN 0-387-94559-8

[66]

Wang, C. L., Dorchen, K., and Jin, P. Utility-learning tension in self-modifying agents, 2025a. URL https://arxiv.org/abs/2510.04399. arXiv:2510.04399v2

[67]

Wang, C. L., Singhal, T., Kelkar, A., and Tuo, J. Mi9 -- agent intelligence protocol: Runtime governance for agentic AI systems, 2025b. URL https://arxiv.org/abs/2508.03858

[68]

Wiener, N. Cybernetics: Or Control and Communication in the Animal and the Machine. The Technology Press; John Wiley & Sons; Hermann et Cie, Cambridge, MA; New York; Paris, 1948

[69]

Zhang, J., Hu, S., Lu, C., Lange, R., and Clune, J. Darwin Gödel machine: Open-ended evolution of self-improving agents, 2025. URL https://arxiv.org/abs/2505.22954