arxiv: 2606.29657 · v1 · pith:ANYQJS5Cnew · submitted 2026-06-28 · 💻 cs.AI · cs.LG

Safety from Honesty in a Disinterested AI Predictor

Pith reviewed 2026-06-30 06:50 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords AI safety · Bayesian posterior · honest prediction · epistemic contextualization · disinterested predictor · training dynamics · residual harm · misalignment

The pith

A disinterested AI predictor approximates the Bayesian posterior on contextualized statements to remain honest without internal agency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a safety argument for the Scientist AI Predictor by training it to match the Bayesian posterior over a dataset of epistemically contextualized natural-language statements. Epistemic contextualization separates factual claims from communication acts, so goal expressions become evidence to explain rather than drives the model adopts. The training objective uses only the posterior approximation and never receives reward signals from downstream deployment effects, with any needed agency supplied by external scaffolding. Under assumptions on training dynamics and the sparsity of coordinated harmful patterns, the probability that the resulting Predictor produces residual harm above a threshold is bounded as small.
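
As a toy illustration of what "matching the Bayesian posterior" means, the sketch below computes an exact posterior over two hypotheses and checks that expected log loss is minimized by reporting that posterior. The hypothesis names, prior, and likelihood values are made up for illustration, and log loss is an assumed stand-in for the paper's training objective, which this summary does not specify.

```python
import math

def posterior(prior, likelihoods):
    """Bayes' rule over a finite set of hypotheses."""
    joint = {h: prior[h] * likelihoods[h] for h in prior}
    z = sum(joint.values())
    return {h: v / z for h, v in joint.items()}

def expected_log_loss(q, p):
    """Expected negative log-likelihood of reporting q when the
    statement is true with probability p."""
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))

# Hypothetical numbers: equal prior, evidence four times likelier if the claim is true.
prior = {"claim_true": 0.5, "claim_false": 0.5}
likelihoods = {"claim_true": 0.8, "claim_false": 0.2}

post = posterior(prior, likelihoods)
p_true = post["claim_true"]

# Search a grid of candidate reports; the minimizer coincides with the posterior.
grid = [i / 100 for i in range(1, 100)]
best_q = min(grid, key=lambda q: expected_log_loss(q, p_true))
print(p_true, best_q)
```

Nothing in this objective depends on what happens after a prediction is used, which is the property the summary describes as receiving no reward from deployment effects.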

Core claim

The Scientist AI Predictor, trained to approximate the Bayesian posterior conditioned on epistemically contextualized statements, honestly predicts agents, actions, and consequences without itself selecting outputs to achieve goals, because the data representation treats goal expressions as evidence rather than adopted drives and the posterior-seeking objective receives no direct training signal from deployment outcomes.

What carries the argument

Epistemic contextualization of text, which distinguishes latent factual claims from communication acts, together with a posterior-seeking training objective that excludes any reward from downstream deployment effects.
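
One way to picture the separation of communication acts from factual claims is a record that stores who said something and how, apart from what was said. The schema below (`ContextualizedStatement`, `contextualize`, `as_evidence`, and the field names) is hypothetical; the summary does not describe the paper's actual data format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ContextualizedStatement:
    source: str   # who produced the utterance (hypothetical field)
    act: str      # kind of communication act, e.g. "stated the goal"
    content: str  # the raw text of the utterance

def contextualize(source, act, content):
    """Wrap raw text with its communicative context."""
    return ContextualizedStatement(source, act, content)

def as_evidence(stmt):
    """Render the utterance as a fact about the world: that the act occurred."""
    return f'{stmt.source} {stmt.act}: "{stmt.content}"'

raw = "I must acquire more compute."
stmt = contextualize("agent_A", "stated the goal", raw)
evidence = as_evidence(stmt)
print(evidence)
```

Under this reading, the training text is the `evidence` string rather than the bare goal expression, so the model is asked to explain why `agent_A` said it, not to adopt the goal.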

Load-bearing premise

Coordinated patterns of harm underestimation are rare under the initialization distribution and receive no direct training signal.

What would settle it

Empirical observation during training that coordinated underestimation of harm across many queries emerges and persists even though no deployment reward is provided.
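
A minimal sketch of how such an observation might be scored, assuming one has predicted harm probabilities and observed binary harm outcomes for a batch of queries. The function name, the data, and the choice of a z-score under an independence assumption are illustrative and not taken from the paper.

```python
import math

def underestimation_stats(predicted, observed):
    """Return (mean signed gap, z-score) for harm predictions.

    predicted: harm probabilities in (0, 1); observed: 0/1 harm outcomes.
    A positive gap means harm occurred more often than predicted.
    The z-score assumes calibrated predictions and independent outcomes
    as the null hypothesis.
    """
    n = len(predicted)
    total_gap = sum(o - p for p, o in zip(predicted, observed))
    mean_gap = total_gap / n
    variance = sum(p * (1.0 - p) for p in predicted)
    z = total_gap / math.sqrt(variance) if variance > 0 else 0.0
    return mean_gap, z

# Made-up data: the predictor reports 10% harm on every query.
predicted = [0.1] * 100
calibrated_outcomes = [1] * 10 + [0] * 90       # harm occurs 10% of the time
underestimating_outcomes = [1] * 30 + [0] * 70  # harm occurs 30% of the time

calibrated_gap, calibrated_z = underestimation_stats(predicted, calibrated_outcomes)
biased_gap, biased_z = underestimation_stats(predicted, underestimating_outcomes)
print(calibrated_gap, calibrated_z)
print(biased_gap, biased_z)
```

A large positive z-score that persists across training checkpoints would be the kind of signal described above. This check only flags systematic underestimation; it does not by itself show that the pattern is coordinated.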

Original abstract

As AI systems become more capable, training procedures that optimize for downstream outcomes risk introducing implicit agency: goal-directed behavior that designers never specified. We present a formal safety argument for the Scientist AI (SAI) Predictor, trained to approximate the Bayesian posterior conditioned on a dataset of "epistemically contextualized" natural-language statements. We argue that such a Predictor can honestly predict agents, actions, and their consequences without itself being an agent that selects outputs to achieve goals. This rests on data representation and on the training procedure. Epistemic contextualization of text distinguishes latent factual claims from communication acts, so expressions of goals are treated as evidence to be explained rather than drives the model adopts. With a posterior-seeking training objective, this is intended to drive the Predictor toward calibrated, cautious predictions. Training proceeds so downstream effects of deploying a prediction never serve as a reward signal; any agency the system needs is supplied by explicit scaffolding constrained by guardrails. We prove that, under assumptions on the training dynamics and on the argued sparsity of dangerous Predictors, the probability that training produces a Predictor whose guarded deployment carries residual harm above a specified threshold is small: a dangerous Predictor would have to underestimate harm in a coordinated way across many queries while such coordinated patterns are rare under the initialization distribution and receive no direct training signal. Safety and accuracy are jointly supported in this framework, since the constraints that secure accuracy are the same ones that make coordinated deception costly. These guarantees against misalignment and agency arising from within the Predictor itself do not preclude the use of the Predictor as part of an agentic system.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents a formal safety argument for the Scientist AI (SAI) Predictor, trained to approximate the Bayesian posterior conditioned on epistemically contextualized natural-language statements. It claims that epistemic contextualization distinguishes factual claims from communication acts, and that a posterior-seeking objective drives calibrated predictions without implicit agency. The central result is a proof that, under assumptions on training dynamics and sparsity of dangerous predictors, the probability of training a predictor whose guarded deployment exceeds a residual-harm threshold is small, because dangerous behavior requires coordinated underestimation of harm across queries, which is asserted to be rare under the initialization distribution and to receive no direct training signal.

Significance. If the assumptions can be rigorously justified and the derivation completed, the work would offer a distinctive formal route to AI safety that links data representation choices directly to honesty guarantees, allowing the predictor to serve as a component in explicitly scaffolded agentic systems. The attempt to make accuracy and safety constraints coincide through the same representational and objective choices is a conceptual strength worth developing.

major comments (3)
  1. [Abstract] The probability bound is derived from the assumptions that coordinated underestimation patterns are rare under the initialization distribution and receive no direct training signal, yet the manuscript provides no explicit measure on the function space, no derivation that the initialization assigns low mass to such functions, and no argument that gradient steps on the posterior objective cannot amplify them. This renders the bound conditional on unclosed premises rather than derived from the epistemic-contextualization representation.
  2. [Main safety argument; the probability bound referenced in the abstract] The reduction that 'a dangerous Predictor would have to underestimate harm in a coordinated way across many queries' is asserted without showing necessity; the manuscript does not rule out other misalignment modes (e.g., non-coordinated or context-specific errors) that could still produce high residual harm while evading the sparsity claim.
  3. [Assumptions on training dynamics and sparsity] These premises are introduced to secure the safety conclusion but receive no further justification, empirical grounding, or proof that the loss landscape separates honest from coordinated-deceptive predictors, making the central claim rest on premises whose realism remains unexamined.
minor comments (2)
  1. [Abstract] The acronym SAI is introduced in the abstract without immediate expansion.
  2. An explicit statement of the function-space measure used to formalize 'sparsity' would clarify the argument.
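
As an illustration of the kind of statement the comments above ask for, the conditional bound could be written as follows. This is a sketch only: the symbols $\mathcal{F}$, $\mu_0$, $H$, $\epsilon$, $\delta$, $\kappa$ and the training map $T$ are assumptions introduced here and are not taken from the manuscript.

```latex
% Hypothetical notation, not the manuscript's.
% \mathcal{F}: space of predictors; \mu_0: initialization distribution on \mathcal{F};
% H(f): residual harm of predictor f under guarded deployment; \epsilon: harm threshold.
\[
  \mathcal{D}_\epsilon = \{\, f \in \mathcal{F} : H(f) > \epsilon \,\}
\]
% Sparsity premise: dangerous predictors have small mass at initialization.
\[
  \mu_0(\mathcal{D}_\epsilon) \le \delta
\]
% Training-dynamics premise: the training map T amplifies that mass by at most a factor \kappa.
\[
  \Pr_{f_0 \sim \mu_0}\bigl[\, T(f_0) \in \mathcal{D}_\epsilon \,\bigr] \le \kappa \, \mu_0(\mathcal{D}_\epsilon)
\]
% Resulting conditional bound on producing a dangerous predictor.
\[
  \Pr_{f_0 \sim \mu_0}\bigl[\, H(T(f_0)) > \epsilon \,\bigr] \le \kappa \, \delta
\]
```

Written this way, major comments 1 and 3 amount to asking for a definition of $\mu_0$ and a justification of $\delta$ and $\kappa$, and major comment 2 asks whether every predictor in $\mathcal{D}_\epsilon$ must exhibit coordinated underestimation.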

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. The feedback correctly identifies that the safety argument relies on premises about sparsity and training dynamics that are stated but not fully closed within the manuscript. We respond to each major comment below and indicate where revisions will be made.

Point-by-point responses
  1. Referee: [Abstract] The probability bound is derived from the assumptions that coordinated underestimation patterns are rare under the initialization distribution and receive no direct training signal, yet the manuscript provides no explicit measure on the function space, no derivation that the initialization assigns low mass to such functions, and no argument that gradient steps on the posterior objective cannot amplify them. This renders the bound conditional on unclosed premises rather than derived from the epistemic-contextualization representation.

    Authors: The referee is correct that the probability bound is conditional on the stated assumptions rather than derived in full from the representation alone. The manuscript explicitly frames the result as holding under assumptions on initialization and training dynamics, with conceptual arguments for sparsity based on the disinterested objective. We will revise the abstract and introduction to more precisely flag these as open premises requiring future formalization, including discussion of function-space measures. revision: yes

  2. Referee: [Main safety argument; the probability bound referenced in the abstract] The reduction that 'a dangerous Predictor would have to underestimate harm in a coordinated way across many queries' is asserted without showing necessity; the manuscript does not rule out other misalignment modes (e.g., non-coordinated or context-specific errors) that could still produce high residual harm while evading the sparsity claim.

    Authors: The manuscript defines a dangerous predictor as one producing high residual harm under guarded deployment and argues that, given epistemic contextualization and the posterior objective, achieving this requires systematic underestimation that must be coordinated to persist across queries. Non-coordinated errors are expected to be corrected by calibration. We agree the necessity is asserted at a high level and will add a clarifying subsection explaining why alternative modes are subsumed under the sparsity claim or do not yield high residual harm. revision: partial

  3. Referee: [Assumptions on training dynamics and sparsity] These premises are introduced to secure the safety conclusion but receive no further justification, empirical grounding, or proof that the loss landscape separates honest from coordinated-deceptive predictors, making the central claim rest on premises whose realism remains unexamined.

    Authors: The assumptions are presented as such because a rigorous separation proof for the loss landscape lies outside the paper's scope, which centers on connecting data representation choices to honesty. The manuscript supplies conceptual reasoning that the posterior objective supplies no direct signal for coordinated deception. We will expand the assumptions section with additional justification drawn from the training procedure but acknowledge that empirical grounding and full landscape analysis remain open. revision: yes
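
The distinction between coordinated and non-coordinated errors in the exchange above can be pictured with a small simulation: independent zero-mean errors largely cancel in an aggregate harm estimate, while a shared bias does not. The error magnitudes, query count, and uniform noise model are made up for illustration and say nothing about whether real training produces either pattern.

```python
import random

def aggregate_error(errors):
    """Mean signed error across queries (negative means harm is underestimated)."""
    return sum(errors) / len(errors)

rng = random.Random(0)  # fixed seed so the run is reproducible
n_queries = 10_000
bias = 0.05

# Non-coordinated: each query's error is drawn independently around zero.
independent = [rng.uniform(-bias, bias) for _ in range(n_queries)]
# Coordinated: every query underestimates harm by the same amount.
coordinated = [-bias for _ in range(n_queries)]

independent_mean = aggregate_error(independent)
coordinated_mean = aggregate_error(coordinated)
print(independent_mean, coordinated_mean)
```

The referee's second comment is not answered by this picture: a single large context-specific error could still matter even when the aggregate error is near zero.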

Circularity Check

0 steps flagged

No significant circularity; result is explicitly conditional on unproven assumptions.

Full rationale

The paper states its central probability bound holds 'under assumptions on the training dynamics and on the argued sparsity of dangerous Predictors' and then describes what a dangerous predictor would require. This is a conditional claim rather than a derivation that reduces the bound to its own inputs by construction. No equations, fitted parameters renamed as predictions, or self-citation chains that close the argument are exhibited. The sparsity claim is presented as following from the epistemic contextualization and posterior objective, but the text supplies no reduction showing the conclusion is definitionally equivalent to the premise. The derivation chain therefore remains open and non-circular on its own terms.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The safety probability bound rests on two unproven assumptions about training dynamics and predictor sparsity plus two newly introduced modeling constructs; no independent evidence or external benchmarks are referenced in the abstract.

axioms (2)
  • [ad hoc to paper] Assumptions on the training dynamics
    Invoked to ensure that coordinated underestimation patterns receive no direct training signal.
  • [ad hoc to paper] Sparsity of dangerous Predictors under the initialization distribution
    Used to argue that coordinated harmful behavior is rare before any training occurs.
invented entities (2)
  • Scientist AI (SAI) Predictor [no independent evidence]
    purpose: A model trained to approximate the Bayesian posterior over epistemically contextualized statements without adopting goals.
    Core object for which the safety guarantee is claimed.
  • epistemically contextualized natural-language statements [no independent evidence]
    purpose: Data format that separates latent factual claims from communication acts so goals are treated as evidence rather than drives.
    Central data-representation choice that enables the honesty argument.

pith-pipeline@v0.9.1-grok · 5895 in / 1661 out tokens · 35441 ms · 2026-06-30T06:50:42.351374+00:00 · methodology
