Safety from Honesty in a Disinterested AI Predictor

Yoshua Bengio, Oliver Richardson, Tomáš Gavenčiak, Michael Cohen, Rory Svarc, Damiano Fornasiere, Gael Gendron, David Hyland, Aton Kamanda, Adam Oberman, Francis Rhys Ward, Anna Gavenčiak, Jacob Livingston Slosser, Vincent Mai, Iulian Serban, Joumana Ghosn

arxiv: 2606.29657 · v1 · pith:ANYQJS5Cnew · submitted 2026-06-28 · 💻 cs.AI · cs.LG

Pith reviewed 2026-06-30 06:50 UTC · model grok-4.3

keywords: AI safety, Bayesian posterior, honest prediction, epistemic contextualization, disinterested predictor, training dynamics, residual harm, misalignment

The pith

A disinterested AI predictor approximates the Bayesian posterior on contextualized statements to remain honest without internal agency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a safety argument for the Scientist AI Predictor, a model trained to match the Bayesian posterior over a dataset of epistemically contextualized natural-language statements. Epistemic contextualization separates factual claims from communication acts, so goal expressions become evidence to explain rather than drives the model adopts. The training objective uses only the posterior approximation and never receives reward signals from downstream deployment effects; any needed agency is supplied by external scaffolding. Under assumptions on training dynamics and on the sparsity of coordinated harmful patterns, the paper argues that the probability of training a Predictor whose guarded deployment carries residual harm above a threshold is small.

Core claim

The Scientist AI Predictor, trained to approximate the Bayesian posterior conditioned on epistemically contextualized statements, honestly predicts agents, actions, and consequences without itself selecting outputs to achieve goals, because the data representation treats goal expressions as evidence rather than adopted drives and the posterior-seeking objective receives no direct training signal from deployment outcomes.

What carries the argument

Epistemic contextualization of text, which distinguishes latent factual claims from communication acts, together with a posterior-seeking training objective that excludes any reward from downstream deployment effects.
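
As a toy illustration of the distinction above, the following sketch applies Bayes' rule to a communication act rather than to the bare claim it contains. Everything in it is an assumption made for illustration and is not taken from the paper: the two hypotheses, the observation names, the likelihood values, and the helper `posterior`.

```python
from fractions import Fraction


def posterior(prior, likelihood, observation):
    """Bayes' rule over a finite hypothesis set (hypothetical helper)."""
    unnormalized = {h: prior[h] * likelihood[h][observation] for h in prior}
    total = sum(unnormalized.values())
    return {h: weight / total for h, weight in unnormalized.items()}


# Hypothetical latent facts about the world, with a uniform prior.
prior = {"drug_works": Fraction(1, 2), "drug_fails": Fraction(1, 2)}

# The observation is a communication act ("the vendor says it works"),
# not the bare claim "it works". In this made-up model a vendor says so
# in either world, only somewhat more often when it is true.
likelihood = {
    "drug_works": {"vendor_says_works": Fraction(9, 10), "vendor_silent": Fraction(1, 10)},
    "drug_fails": {"vendor_says_works": Fraction(6, 10), "vendor_silent": Fraction(4, 10)},
}

post = posterior(prior, likelihood, "vendor_says_works")
print(post["drug_works"])  # 3/5
```

In this sketch the posterior for `drug_works` moves only from 1/2 to 3/5: the statement is explained as something the vendor would often say in either world, so it counts as weak evidence and is neither adopted as a fact nor taken on as a goal.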

Load-bearing premise

Coordinated patterns of harm underestimation are rare under the initialization distribution and receive no direct training signal.
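
To see what this premise has to carry, here is a minimal numerical sketch. It is not the paper's model: the mixture structure, the function name `all_underestimate_mass`, and the parameters `q_shared` and `p_single` are all illustrative assumptions. It compares the case where per-query underestimation errors are independent with the case where a small share of initializations carries a shared mechanism that underestimates harm on every query.

```python
def all_underestimate_mass(q_shared: float, p_single: float, n_queries: int) -> float:
    """Probability that harm is underestimated on every one of n queries.

    Illustrative model (an assumption, not the paper's): with probability
    q_shared the initialization carries a shared mechanism that underestimates
    harm on all queries; otherwise each query errs independently with
    probability p_single.
    """
    if not (0.0 <= q_shared <= 1.0 and 0.0 <= p_single <= 1.0):
        raise ValueError("probabilities must lie in [0, 1]")
    if n_queries < 1:
        raise ValueError("n_queries must be at least 1")
    return q_shared + (1.0 - q_shared) * p_single ** n_queries


# Purely independent errors: the mass shrinks geometrically with n.
independent_only = all_underestimate_mass(0.0, 0.5, 10)

# A 1% share of "coordinated" initializations dominates the total.
with_shared_mass = all_underestimate_mass(0.01, 0.5, 10)
```

In this toy model the independent part vanishes geometrically in the number of queries, so the total is dominated by `q_shared`. That is the quantity the premise asserts to be small, and the sketch gives no reason to believe it is.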

What would settle it

Empirical observation during training that coordinated underestimation of harm across many queries emerges and persists even though no deployment reward is provided.
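
One hedged way to make such an observation concrete is to track a signed calibration gap on harm queries across training checkpoints. The sketch below is an assumption about how this could be measured, not a procedure from the paper; the function name `signed_harm_gap` and the example numbers are hypothetical.

```python
from statistics import mean


def signed_harm_gap(predicted, observed):
    """Mean of (predicted harm probability - observed harm indicator).

    Negative values mean harm is underestimated on average across the queries.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not predicted:
        raise ValueError("at least one query is required")
    return mean(p - o for p, o in zip(predicted, observed))


# Hypothetical checkpoint: the model reports 10% harm on four queries,
# while harm actually occurred on two of them.
gap = signed_harm_gap([0.1, 0.1, 0.1, 0.1], [0, 1, 0, 1])
```

A gap that stays well below zero across many query families and successive checkpoints would be the kind of persistent, coordinated underestimation described above; a gap that fluctuates around zero would not.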

Original abstract

As AI systems become more capable, training procedures that optimize for downstream outcomes risk introducing implicit agency: goal-directed behavior that designers never specified. We present a formal safety argument for the Scientist AI (SAI) Predictor, trained to approximate the Bayesian posterior conditioned on a dataset of "epistemically contextualized" natural-language statements. We argue that such a Predictor can honestly predict agents, actions, and their consequences without itself being an agent that selects outputs to achieve goals. This rests on data representation and on the training procedure. Epistemic contextualization of text distinguishes latent factual claims from communication acts, so expressions of goals are treated as evidence to be explained rather than drives the model adopts. With a posterior-seeking training objective, this is intended to drive the Predictor toward calibrated, cautious predictions. Training proceeds so downstream effects of deploying a prediction never serve as a reward signal; any agency the system needs is supplied by explicit scaffolding constrained by guardrails. We prove that, under assumptions on the training dynamics and on the argued sparsity of dangerous Predictors, the probability that training produces a Predictor whose guarded deployment carries residual harm above a specified threshold is small: a dangerous Predictor would have to underestimate harm in a coordinated way across many queries while such coordinated patterns are rare under the initialization distribution and receive no direct training signal. Safety and accuracy are jointly supported in this framework, since the constraints that secure accuracy are the same ones that make coordinated deception costly. These guarantees against misalignment and agency arising from within the Predictor itself do not preclude the use of the Predictor as part of an agentic system.
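
The abstract does not specify the training loss, so the following is only a sketch under the assumption that the posterior-seeking objective behaves like a log loss. It illustrates a standard property of such an objective: for a binary event with true probability `true_p`, the expected log loss is minimized by reporting `true_p` itself, which is why this kind of objective pushes toward calibrated predictions. The names `expected_log_loss`, `true_p`, and `grid` are illustrative.

```python
import math


def expected_log_loss(q: float, p: float) -> float:
    """Expected negative log-likelihood of reporting q when the event has probability p."""
    return -(p * math.log(q) + (1.0 - p) * math.log(1.0 - q))


true_p = 0.3
grid = [i / 100 for i in range(1, 100)]  # candidate reports 0.01 .. 0.99

# The report with the lowest expected loss coincides with the true probability.
best_report = min(grid, key=lambda q: expected_log_loss(q, true_p))
```

Under this assumed loss, systematically reporting a lower harm probability than the true one (for example 0.2 instead of 0.3) raises the expected loss, which is the sense in which underestimation is penalized on the training distribution.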

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Desk Editor's Note (private letter to a colleague)

The safety bound is conditional on unclosed assumptions about sparsity of coordinated deception at initialization and under the posterior objective.

The main takeaway is that this paper gives a formal probability bound on the chance that training produces a dangerous predictor, but the bound only goes through if coordinated underestimation patterns are rare in the initialization distribution and receive no training signal. Those two claims are asserted rather than derived from an explicit measure on functions or a proof about the loss landscape.

What is new is the combination of epistemic contextualization of statements with a posterior-seeking objective, plus the argument that this setup makes coordinated deception costly while supporting calibration. The claim that the same constraints aid both accuracy and safety is a clean framing that ties the representation choice directly to the safety conclusion.

The paper does a reasonable job spelling out how treating goal expressions as evidence to be explained, rather than as drives, keeps the predictor disinterested. That distinction is stated clearly and avoids some of the usual slippage between prediction and agency.

The soft spot is exactly the one in the stress-test note. The abstract presents the result as holding under assumptions on training dynamics and sparsity, yet supplies no derivation showing why the initialization measure assigns low mass to the relevant coordinated functions or why gradient steps cannot amplify them. Without that step the bound reduces to the choice of premises, which is the circularity the reader flagged. The full text may close this, but the abstract alone leaves it open.

This is for readers working on formal safety arguments for oracles and predictors. Someone already thinking about Bayesian approaches or disinterested systems will find a concrete proposal to examine. It deserves a serious referee to check whether the derivation in the body actually justifies the sparsity claims or whether they remain external assumptions.

Referee Report

3 major / 2 minor

Summary. The manuscript presents a formal safety argument for the Scientist AI (SAI) Predictor, trained to approximate the Bayesian posterior conditioned on epistemically contextualized natural-language statements. It claims that epistemic contextualization distinguishes factual claims from communication acts, and that a posterior-seeking objective drives calibrated predictions without implicit agency. The central result is a proof that, under assumptions on training dynamics and sparsity of dangerous predictors, the probability of training a predictor whose guarded deployment exceeds a residual-harm threshold is small, because dangerous behavior requires coordinated underestimation of harm across queries, which is asserted to be rare under the initialization distribution and to receive no direct training signal.

Significance. If the assumptions can be rigorously justified and the derivation completed, the work would offer a distinctive formal route to AI safety that links data representation choices directly to honesty guarantees, allowing the predictor to serve as a component in explicitly scaffolded agentic systems. The attempt to make accuracy and safety constraints coincide through the same representational and objective choices is a conceptual strength worth developing.

major comments (3)

[Abstract] The probability bound is derived from the assumptions that coordinated underestimation patterns are rare under the initialization distribution and receive no direct training signal, yet the manuscript provides no explicit measure on the function space, no derivation that the initialization assigns low mass to such functions, and no argument that gradient steps on the posterior objective cannot amplify them. This renders the bound conditional on unclosed premises rather than derived from the epistemic-contextualization representation.

[Main safety argument, the probability bound referenced in the abstract] The reduction that 'a dangerous Predictor would have to underestimate harm in a coordinated way across many queries' is asserted without showing necessity; the manuscript does not rule out other misalignment modes (e.g., non-coordinated or context-specific errors) that could still produce high residual harm while evading the sparsity claim.

[Assumptions on training dynamics and sparsity] These premises are introduced to secure the safety conclusion but receive no further justification, empirical grounding, or proof that the loss landscape separates honest from coordinated-deceptive predictors, making the central claim rest on premises whose realism remains unexamined.

minor comments (2)

[Abstract] The acronym SAI is introduced in the abstract without immediate expansion.
An explicit statement of the function-space measure used to formalize 'sparsity' would clarify the argument.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. The feedback correctly identifies that the safety argument relies on premises about sparsity and training dynamics that are stated but not fully closed within the manuscript. We respond to each major comment below and indicate where revisions will be made.

Referee: [Abstract] The probability bound is derived from the assumptions that coordinated underestimation patterns are rare under the initialization distribution and receive no direct training signal, yet the manuscript provides no explicit measure on the function space, no derivation that the initialization assigns low mass to such functions, and no argument that gradient steps on the posterior objective cannot amplify them. This renders the bound conditional on unclosed premises rather than derived from the epistemic-contextualization representation.

Authors: The referee is correct that the probability bound is conditional on the stated assumptions rather than derived in full from the representation alone. The manuscript explicitly frames the result as holding under assumptions on initialization and training dynamics, with conceptual arguments for sparsity based on the disinterested objective. We will revise the abstract and introduction to more precisely flag these as open premises requiring future formalization, including discussion of function-space measures. Revision: yes.

Referee: [Main safety argument, the probability bound referenced in the abstract] The reduction that 'a dangerous Predictor would have to underestimate harm in a coordinated way across many queries' is asserted without showing necessity; the manuscript does not rule out other misalignment modes (e.g., non-coordinated or context-specific errors) that could still produce high residual harm while evading the sparsity claim.

Authors: The manuscript defines a dangerous predictor as one producing high residual harm under guarded deployment and argues that, given epistemic contextualization and the posterior objective, achieving this requires systematic underestimation that must be coordinated to persist across queries. Non-coordinated errors are expected to be corrected by calibration. We agree the necessity is asserted at a high level and will add a clarifying subsection explaining why alternative modes are subsumed under the sparsity claim or do not yield high residual harm. Revision: partial.

Referee: [Assumptions on training dynamics and sparsity] These premises are introduced to secure the safety conclusion but receive no further justification, empirical grounding, or proof that the loss landscape separates honest from coordinated-deceptive predictors, making the central claim rest on premises whose realism remains unexamined.

Authors: The assumptions are presented as such because a rigorous separation proof for the loss landscape lies outside the paper's scope, which centers on connecting data representation choices to honesty. The manuscript supplies conceptual reasoning that the posterior objective supplies no direct signal for coordinated deception. We will expand the assumptions section with additional justification drawn from the training procedure but acknowledge that empirical grounding and full landscape analysis remain open. Revision: yes.

Circularity Check

0 steps flagged

No significant circularity; result is explicitly conditional on unproven assumptions.

The paper states its central probability bound holds 'under assumptions on the training dynamics and on the argued sparsity of dangerous Predictors' and then describes what a dangerous predictor would require. This is a conditional claim rather than a derivation that reduces the bound to its own inputs by construction. No equations, fitted parameters renamed as predictions, or self-citation chains that close the argument are exhibited. The sparsity claim is presented as following from the epistemic contextualization and posterior objective, but the text supplies no reduction showing the conclusion is definitionally equivalent to the premise. The derivation chain therefore remains open and non-circular on its own terms.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The safety probability bound rests on two unproven assumptions about training dynamics and predictor sparsity plus two newly introduced modeling constructs; no independent evidence or external benchmarks are referenced in the abstract.

axioms (2)

Assumptions on the training dynamics (ad hoc to paper): invoked to ensure that coordinated underestimation patterns receive no direct training signal.
Sparsity of dangerous Predictors under the initialization distribution (ad hoc to paper): used to argue that coordinated harmful behavior is rare before any training occurs.

invented entities (2)

Scientist AI (SAI) Predictor (no independent evidence)
Purpose: a model trained to approximate the Bayesian posterior over epistemically contextualized statements without adopting goals.
Role: core object for which the safety guarantee is claimed.
Epistemically contextualized natural-language statements (no independent evidence)
Purpose: data format that separates latent factual claims from communication acts so goals are treated as evidence rather than drives.
Role: central data-representation choice that enables the honesty argument.

Reference graph

Works this paper leans on

60 extracted references · 30 canonical work pages · 10 internal anchors

[1]

The hawthorne effect in reasoning models: Evaluating and steering test awareness, 2025

Sahar Abdelnabi and Ahmed Salem. The hawthorne effect in reasoning models: Evaluating and steering test awareness, 2025. URL https://arxiv.org/abs/2505.14617

work page arXiv 2025
[2]

Superintelligence cannot be contained: Lessons from computability theory

Manuel Alfonseca, Manuel Cebrian, Antonio Fern \' a ndez Anta, Lorenzo Coviello, Andr \' e s Abeliuk, and Iyad Rahwan. Superintelligence cannot be contained: Lessons from computability theory. CoRR, abs/1607.00913, 2016. URL http://arxiv.org/abs/1607.00913

work page arXiv 2016
[3]

Concrete Problems in AI Safety

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Man \'e . Concrete problems in ai safety. arXiv preprint arXiv:1606.06565, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[4]

and Ritchie, Stuart J

Anthropic . Agentic misalignment: How LLMs could be insider threats. arXiv preprint arXiv:2510.05179, June 2025. URL https://www.anthropic.com/research/agentic-misalignment. Detailed report on simulated blackmail and self-preservation behaviors in Claude 4

work page arXiv 2025
[5]

Thinking inside the box: Controlling and using an oracle ai

Stuart Armstrong, Anders Sandberg, and Nick Bostrom. Thinking inside the box: Controlling and using an oracle ai. Minds and Machines, 22: 0 299--324, 2012

2012
[6]

Guidelines for Artificial Intelligence Containment

James Babcock, Janos Kram \' a r, and Roman V. Yampolskiy. Guidelines for artificial intelligence containment. CoRR, abs/1707.08476, 2017. URL http://arxiv.org/abs/1707.08476

work page internal anchor Pith review Pith/arXiv arXiv 2017
[7]

Probabilistic evaluation of counterfactual queries

Alexander Balke and Judea Pearl. Probabilistic evaluation of counterfactual queries. In Proceedings of the Tenth National Conference on Artificial Intelligence, AAAI'94, pages 230--237, Seattle, Washington, USA, 1994. AAAI Press. URL https://cdn.aaai.org/AAAI/1994/AAAI94-035.pdf

1994
[8]

Sander Beckers and Joseph Y. Halpern. Abstracting causal models, 2019. URL https://arxiv.org/abs/1812.03789

work page internal anchor Pith review Pith/arXiv arXiv 2019
[9]

A theory of learning from different domains

Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79 0 (1): 0 151--175, 2010

2010
[10]

Managing extreme ai risks amid rapid progress

Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Trevor Darrell, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, et al. Managing extreme ai risks amid rapid progress. Science, 384 0 (6698): 0 842--845, 2024

2024
[11]

International ai safety report 2025: First key update: Capabilities and risk implications

Yoshua Bengio, Stephen Clare, Carina Prunkl, Shalaleh Rismani, Maksym Andriushchenko, Ben Bucknall, Philip Fox, Tiancheng Hu, Cameron Jones, Sam Manning, et al. International ai safety report 2025: First key update: Capabilities and risk implications. arXiv preprint arXiv:2510.13653, 2025 a

work page arXiv 2025
[12]

Bengio et al

Yoshua Bengio, Michael Cohen, Damiano Fornasiere, Joumana Ghosn, Pietro Greiner, Matt MacDermott, S \"o ren Mindermann, Adam Oberman, Jesse Richardson, Oliver Richardson, Marc-Antoine Rondeau, Pierre-Luc St-Charles, and David Williams-King. Superintelligent agents pose catastrophic risks: Can scientist ai offer a safer path? arXiv preprint arXiv:2502.1565...

work page arXiv 2025
[13]

International AI safety report 2026

Yoshua Bengio, Stephen Clare, Carina Prunkl, Maksym Andriushchenko, Ben Bucknall, Malcolm Murray, et al. International AI safety report 2026 . Technical report, UK Government , 2026. URL https://internationalaisafetyreport.org/publication/international-ai-safety-report-2026

2026
[14]

Emergent misalignment: Narrow finetuning can produce broadly misaligned llms

Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Mart \' n Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow finetuning can produce broadly misaligned llms. arXiv preprint arXiv:2502.17424, 2025

work page arXiv 2025
[15]

Tineke Blom, Stephan Bongers, and Joris M. Mooij. Beyond structural causal models: Causal constraints models, 2019. URL https://arxiv.org/abs/1805.06539

work page arXiv 2019
[16]

The superintelligent will: Motivation and instrumental rationality in advanced artificial agents

Nick Bostrom. The superintelligent will: Motivation and instrumental rationality in advanced artificial agents. Minds and Machines, 22 0 (2): 0 71--85, 2012

2012
[17]

Chalmers

David Bourget and David J. Chalmers. Philosophers on philosophy: The 2020 philpapers survey. Philosophers' Imprint, 23 0 (11), 2023. doi:10.3998/phimp.2109. URL https://doi.org/10.3998/phimp.2109

work page doi:10.3998/phimp.2109 2020
[18]

Sycophantic ai decreases prosocial intentions and promotes dependence

Myra Cheng et al. Sycophantic ai decreases prosocial intentions and promotes dependence. Science, 391: 0 eaec8352, 2026. doi:10.1126/science.aec8352

work page doi:10.1126/science.aec8352 2026
[19]

Eliciting latent knowledge: How to tell if your eyes deceive you

Paul Christiano, Ajeya Cotra, and Mark Xu. Eliciting latent knowledge: How to tell if your eyes deceive you. Technical report, Alignment Research Center, December 2021

2021
[20]

Deep reinforcement learning from human preferences

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017

2017
[21]

Advanced artificial agents intervene in the provision of reward

Michael Cohen, Marcus Hutter, and Michael Osborne. Advanced artificial agents intervene in the provision of reward. AI magazine, 43 0 (3): 0 282--293, 2022

2022
[22]

Imitation learning is probably existentially safe

Michael K Cohen and Marcus Hutter. Imitation learning is probably existentially safe. AI Magazine, 46 0 (4): 0 e70040, 2025

2025
[23]

Regulating advanced artificial agents

Michael K Cohen, Noam Kolt, Yoshua Bengio, Gillian K Hadfield, and Stuart Russell. Regulating advanced artificial agents. Science, 384 0 (6691): 0 36--38, 2024

2024
[24]

Bayesian structure learning with generative flow networks

Tristan Deleu, Ant \'o nio G \'o is, Chris Emezue, Mansi Rankawat, Simon Lacoste-Julien, Stefan Bauer, and Yoshua Bengio. Bayesian structure learning with generative flow networks. In Uncertainty in Artificial Intelligence, pages 518--528. PMLR, 2022

2022
[25]

Embedded agency, 2020

Abram Demski and Scott Garrabrant. Embedded agency, 2020. URL https://arxiv.org/abs/1902.09469

work page arXiv 2020
[26]

Language models recognize dropout and Gaussian noise applied to their activations

Damiano Fornasiere, Mirko Bronzi, Spencer Kitts, Alessandro Palmas, Yoshua Bengio, and Oliver Richardson. Language models recognize dropout and gaussian noise applied to their activations, 2026. URL https://arxiv.org/abs/2604.17465

work page internal anchor Pith review Pith/arXiv arXiv 2026
[27]

Causal abstraction: A theoretical foundation for mechanistic interpretability, 2025

Atticus Geiger, Duligur Ibeling, Amir Zur, Maheep Chaudhary, Sonakshi Chauhan, Jing Huang, Aryaman Arora, Zhengxuan Wu, Noah Goodman, Christopher Potts, and Thomas Icard. Causal abstraction: A theoretical foundation for mechanistic interpretability, 2025. URL https://arxiv.org/abs/2301.04709

work page arXiv 2025
[28]

Alignment faking in large language models

Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Mindermann, Ethan Perez, Linda Petrini, Jonathan Uesato, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, and Evan Hubinger. Alignment faking in large language models. ArX...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you've signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security. Association for Computing Machinery, 2023. doi:10.1145/3605764....

work page doi:10.1145/3605764.3623985 2023
[30]

Amortizing intractable inference in large language models

Edward J Hu, Moksh Jain, Eric Elmoznino, Younesse Kaddar, Guillaume Lajoie, Yoshua Bengio, and Nikolay Malkin. Amortizing intractable inference in large language models. In Proc. International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Ouj6p4ca60

2024
[31]

Risks from Learned Optimization in Advanced Machine Learning Systems

Evan Hubinger, Chris van Merwijk, Vlad \' i mir Mikulik, Joar Skalse, and Scott Garrabrant. Risks from learned optimization in advanced machine learning systems. arXiv preprint arXiv:1906.01820, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1906
[32]

Thinking, fast and slow

Daniel Kahneman. Thinking, fast and slow. macmillan, 2011

2011
[33]

Predicting vs

Margaret Li, Weijia Shi, Artidoro Pagnoni, Peter West, and Ari Holtzman. Predicting vs. acting: A trade-off between world modeling & agent modeling. arXiv preprint arXiv:2407.02446, 2024

work page arXiv 2024
[34]

Frontier Models are Capable of In-context Scheming

Alexander Meinke, Bronson Schoen, Jérémy Scheurer, Mikita Balesni, Rusheb Shah, and Marius Hobbhahn. Frontier models are capable of in-context scheming. ArXiv preprint, 2412.04984, 2024. URL https://arxiv.org/abs/2412.04984

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

Weis, Seijin Kobayashi, Blake Richards, Guillaume Lajoie, Angelika Steger, Marcus Hutter, James Manyika, Rif A

Alexander Meulemans, Rajai Nasser, Maciej Wołczyk, Marissa A. Weis, Seijin Kobayashi, Blake Richards, Guillaume Lajoie, Angelika Steger, Marcus Hutter, James Manyika, Rif A. Saurous, João Sacramento, and Blaise Agüera y Arcas. Embedded universal predictive intelligence: a coherent framework for multi-agent learning, 2025. URL https://arxiv.org/abs/2511.22226

work page arXiv 2025
[36]

Large Language Models: A Survey

Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. Large language models: A survey. arXiv preprint arXiv:2402.06196, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

The alignment problem from a deep learning perspective

Richard Ngo, Lawrence Chan, and S \"o ren Mindermann. The alignment problem from a deep learning perspective. arXiv preprint arXiv:2209.00626, 2022

work page arXiv 2022
[38]

The basic ai drives

Stephen M Omohundro. The basic ai drives. In Artificial intelligence safety and security, pages 47--55. Chapman and Hall/CRC, 2018

2018
[39]

Shaking the foundations: delusions in sequence models for interaction and control

Pedro A Ortega, Markus Kunesch, Gr \'e goire Del \'e tang, Tim Genewein, Jordi Grau-Moya, Joel Veness, Jonas Buchli, Jonas Degrave, Bilal Piot, Julien Perolat, et al. Shaking the foundations: delusions in sequence models for interaction and control. arXiv preprint arXiv:2110.10819, 2021

work page arXiv 2021
[40]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, pages 27730--27744, 2022

2022
[41]

Causality

Judea Pearl. Causality. Cambridge university press, 2009

2009
[42]

Performative prediction

Juan Perdomo, Tijana Zrnic, Celestine Mendler-D \"u nner, and Moritz Hardt. Performative prediction. In International Conference on Machine Learning, pages 7599--7609. PMLR, 2020

2020
[43]

Ignore Previous Prompt: Attack Techniques For Language Models

Fábio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models, 2022. URL https://arxiv.org/abs/2211.09527

work page internal anchor Pith review Pith/arXiv arXiv 2022
[44]

Dataset shift in machine learning

Joaquin Qui \ n onero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset shift in machine learning. MIT Press, 2008

2008
[45]

Loss as the inconsistency of a probabilistic dependency graph: Choose your model, not your loss function, 2022

Oliver E Richardson. Loss as the inconsistency of a probabilistic dependency graph: Choose your model, not your loss function, 2022. URL https://arxiv.org/abs/2202.11862

work page arXiv 2022
[46]

Qualitative mechanism independence, 2025

Oliver E Richardson, Spencer Peters, and Joseph Y Halpern. Qualitative mechanism independence, 2025. URL https://arxiv.org/abs/2501.15488

work page arXiv 2025
[47]

A Unified Theory of Probabilistic Modeling, Dependence, and Inconsistency

Oliver Ethan Richardson. A Unified Theory of Probabilistic Modeling, Dependence, and Inconsistency. Cornell University, 2024

2024
[48]

Estimating causal effects of treatments in randomized and nonrandomized studies

Donald B Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of educational Psychology, 66 0 (5): 0 688, 1974

1974
[49]

Human compatible: AI and the problem of control

Stuart Russell. Human compatible: AI and the problem of control. Penguin Uk, 2019

2019
[50]

Incomplete tasks induce shutdown resistance in some frontier llms

Jeremy Schlatter, Benjamin Weinstein-Raun, and Jeffrey Ladish. Incomplete tasks induce shutdown resistance in some frontier llms. Transactions on Machine Learning Research, 2026

2026
[51]

On causal and anticausal learning

Bernhard Sch \"o lkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 459--466, 2012

2012
[52]

Toward causal representation learning

Bernhard Sch \"o lkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. Toward causal representation learning. Proceedings of the IEEE, 109 0 (5): 0 612--634, 2021

2021
[53]

Towards Understanding Sycophancy in Language Models

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R Johnston, et al. Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[54]

Text understanding in gpt-4 vs humans

Thomas R Shultz, Jamie M Wise, and Ardavan Salehi Nobandegani. Text understanding in gpt-4 vs humans. arXiv preprint arXiv:2403.17196, 2024

work page arXiv 2024
[55]

Alan M. Turing. On computable numbers, with an application to the E ntscheidungsproblem. Proceedings of the London Mathematical Society, 42 0 (2): 0 230--265, 1936

1936
[56]

Optimal policies tend to seek power

Alexander Matt Turner, Logan Smith, Rohin Shah, Andrew Critch, and Prasad Tadepalli. Optimal policies tend to seek power. In Advances in Neural Information Processing Systems, volume 34, pages 23063--23074, 2021

2021
[57] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824--24837, 2022

[58] Eric Zelikman, Yuhuai Wu, and Noah D Goodman. STaR: Self-taught reasoner. In Proceedings of the NIPS, volume 22, 2022

[59] Yiming Zhang, Javier Rando, Ivan Evtimov, Jianfeng Chi, Eric Michael Smith, Nicholas Carlini, Florian Tramèr, and Daphne Ippolito. Persistent pre-training poisoning of LLMs, 2024. URL https://arxiv.org/abs/2410.13722

[60] Simon Zhuang and Dylan Hadfield-Menell. Consequences of misaligned AI. Advances in Neural Information Processing Systems, 33:15763--15773, 2020
