Safety from Honesty in a Disinterested AI Predictor
Pith reviewed 2026-06-30 06:50 UTC · model grok-4.3
The pith
A disinterested AI predictor approximates the Bayesian posterior on contextualized statements to remain honest without internal agency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Scientist AI Predictor is trained to approximate the Bayesian posterior conditioned on epistemically contextualized statements. The paper claims that it honestly predicts agents, actions, and consequences without itself selecting outputs to achieve goals, for two reasons: the data representation treats goal expressions as evidence rather than adopted drives, and the posterior-seeking objective receives no direct training signal from deployment outcomes.
What carries the argument
Epistemic contextualization of text, which distinguishes latent factual claims from communication acts, together with a posterior-seeking training objective that excludes any reward from downstream deployment effects.
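As a rough illustration of the distinction between latent factual claims and communication acts, the following Python sketch wraps raw text as the observation that a source said it, rather than as a fact about the world. Every name here (`Kind`, `Statement`, `contextualize`, `is_asserted_as_fact`, `agent_A`) is hypothetical and chosen for this example; the paper's actual representation is not described in this review.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Kind(Enum):
    LATENT_CLAIM = "latent_claim"            # a proposition about the world
    COMMUNICATION_ACT = "communication_act"  # the event that someone said something


@dataclass(frozen=True)
class Statement:
    kind: Kind
    content: str
    source: Optional[str] = None  # who communicated it, if anyone


def contextualize(raw_text: str, source: str) -> Statement:
    """Record raw text as the observation that `source` said it.

    The content is not asserted as true; it becomes evidence that a
    predictor would have to explain. (Hypothetical helper.)
    """
    return Statement(Kind.COMMUNICATION_ACT, raw_text, source)


def is_asserted_as_fact(stmt: Statement) -> bool:
    """Only latent claims are candidate facts about the world."""
    return stmt.kind is Kind.LATENT_CLAIM


# A goal expression is stored as something agent_A said, not as a drive
# or a fact that the predictor itself takes on.
goal_text = contextualize("I want to acquire more resources.", source="agent_A")
```

Under this toy representation, a goal statement found in the data is something to be explained ("why did agent_A say this?") rather than an instruction or a belief of the model.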
Load-bearing premise
Coordinated patterns of harm underestimation are rare under the initialization distribution and receive no direct training signal.
What would settle it
Empirical observation during training that coordinated underestimation of harm across many queries emerges and persists even though no deployment reward is provided.
Original abstract
As AI systems become more capable, training procedures that optimize for downstream outcomes risk introducing implicit agency: goal-directed behavior that designers never specified. We present a formal safety argument for the Scientist AI (SAI) Predictor, trained to approximate the Bayesian posterior conditioned on a dataset of "epistemically contextualized" natural-language statements. We argue that such a Predictor can honestly predict agents, actions, and their consequences without itself being an agent that selects outputs to achieve goals. This rests on data representation and on the training procedure. Epistemic contextualization of text distinguishes latent factual claims from communication acts, so expressions of goals are treated as evidence to be explained rather than drives the model adopts. With a posterior-seeking training objective, this is intended to drive the Predictor toward calibrated, cautious predictions. Training proceeds so downstream effects of deploying a prediction never serve as a reward signal; any agency the system needs is supplied by explicit scaffolding constrained by guardrails. We prove that, under assumptions on the training dynamics and on the argued sparsity of dangerous Predictors, the probability that training produces a Predictor whose guarded deployment carries residual harm above a specified threshold is small: a dangerous Predictor would have to underestimate harm in a coordinated way across many queries while such coordinated patterns are rare under the initialization distribution and receive no direct training signal. Safety and accuracy are jointly supported in this framework, since the constraints that secure accuracy are the same ones that make coordinated deception costly. These guarantees against misalignment and agency arising from within the Predictor itself do not preclude the use of the Predictor as part of an agentic system.
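The abstract's remark that accuracy constraints make coordinated deception costly can be illustrated with a toy calculation. The sketch below assumes log loss as the scoring rule and uses made-up harm probabilities; neither is taken from the paper, whose actual objective is not specified in this review. It shows only that a predictor which underestimates every harm probability by the same factor pays a strictly higher expected loss than an honest one, not that training dynamics would remove such a predictor.

```python
import math


def expected_log_loss(true_p: float, reported_p: float) -> float:
    """Expected negative log-likelihood of a binary harm event.

    true_p is the actual probability of harm; reported_p is what the
    predictor outputs. Lower is better.
    """
    return -(true_p * math.log(reported_p)
             + (1.0 - true_p) * math.log(1.0 - reported_p))


# Hypothetical queries: true harm probabilities, made up for illustration.
true_probs = [0.05, 0.20, 0.50, 0.80]
shrink = 0.5  # a "coordinated" underestimator halves every harm probability

honest = sum(expected_log_loss(p, p) for p in true_probs)
biased = sum(expected_log_loss(p, shrink * p) for p in true_probs)

# The gap equals the summed KL divergence between the true and reported
# Bernoulli distributions, so it is strictly positive.
excess = biased - honest
```

The calculation relies on harm outcomes being scored at their true frequencies. Whether the training data provides that signal for the queries that matter is one of the premises the referee report below questions.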
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a formal safety argument for the Scientist AI (SAI) Predictor, trained to approximate the Bayesian posterior conditioned on epistemically contextualized natural-language statements. It claims that epistemic contextualization distinguishes factual claims from communication acts, and that a posterior-seeking objective drives calibrated predictions without implicit agency. The central result is a proof that, under assumptions on training dynamics and sparsity of dangerous predictors, the probability of training a predictor whose guarded deployment exceeds a residual-harm threshold is small, because dangerous behavior requires coordinated underestimation of harm across queries, which is asserted to be rare under the initialization distribution and to receive no direct training signal.
Significance. If the assumptions can be rigorously justified and the derivation completed, the work would offer a distinctive formal route to AI safety that links data representation choices directly to honesty guarantees, allowing the predictor to serve as a component in explicitly scaffolded agentic systems. The attempt to make accuracy and safety constraints coincide through the same representational and objective choices is a conceptual strength worth developing.
major comments (3)
- [Abstract] The probability bound is derived from the assumptions that coordinated underestimation patterns are rare under the initialization distribution and receive no direct training signal, yet the manuscript provides no explicit measure on the function space, no derivation that the initialization assigns low mass to such functions, and no argument that gradient steps on the posterior objective cannot amplify them. This renders the bound conditional on unclosed premises rather than derived from the epistemic-contextualization representation.
- [Main safety argument (the probability bound referenced in the abstract)] The reduction that 'a dangerous Predictor would have to underestimate harm in a coordinated way across many queries' is asserted without showing necessity; the manuscript does not rule out other misalignment modes (e.g., non-coordinated or context-specific errors) that could still produce high residual harm while evading the sparsity claim.
- [Assumptions on training dynamics and sparsity] These premises are introduced to secure the safety conclusion but receive no further justification, empirical grounding, or proof that the loss landscape separates honest from coordinated-deceptive predictors, making the central claim rest on premises whose realism remains unexamined.
minor comments (2)
- [Abstract] The acronym SAI is introduced in the abstract without immediate expansion.
- An explicit statement of the function-space measure used to formalize 'sparsity' would clarify the argument.
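The kind of explicit statement requested in the last minor comment might take the following shape. This is a sketch in hypothetical notation: the symbols $\mu_0$, $\mathcal{D}_\varepsilon$, $\delta_0$, $\kappa$, and $\theta_T$ are introduced here for illustration and do not come from the manuscript. It is meant only to show which quantities the two premises would have to pin down.

```latex
% Hypothetical notation, not taken from the manuscript.
% \mu_0                   : initialization distribution over parameters \theta
% \mathcal{D}_\varepsilon : parameters whose guarded deployment has
%                           residual harm above the threshold \varepsilon
% \theta_T                : parameters after training
% \kappa                  : factor by which training may amplify the mass
\begin{align*}
\text{(sparsity)}         \quad & \mu_0(\mathcal{D}_\varepsilon) \le \delta_0, \\
\text{(no amplification)} \quad & \Pr[\theta_T \in \mathcal{D}_\varepsilon]
                                  \le \kappa \, \mu_0(\mathcal{D}_\varepsilon), \\
\text{(conclusion)}       \quad & \Pr[\theta_T \in \mathcal{D}_\varepsilon]
                                  \le \kappa \, \delta_0 .
\end{align*}
```

Read this way, the first major comment asks for a definition of $\mu_0$, a derivation of $\delta_0$, and an argument that bounds $\kappa$; without these the conclusion is only as strong as the two assumed inequalities.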
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. The feedback correctly identifies that the safety argument relies on premises about sparsity and training dynamics that are stated but not fully closed within the manuscript. We respond to each major comment below and indicate where revisions will be made.
Point-by-point responses
- Referee: [Abstract] The probability bound is derived from the assumptions that coordinated underestimation patterns are rare under the initialization distribution and receive no direct training signal, yet the manuscript provides no explicit measure on the function space, no derivation that the initialization assigns low mass to such functions, and no argument that gradient steps on the posterior objective cannot amplify them. This renders the bound conditional on unclosed premises rather than derived from the epistemic-contextualization representation.
  Authors: The referee is correct that the probability bound is conditional on the stated assumptions rather than derived in full from the representation alone. The manuscript explicitly frames the result as holding under assumptions on initialization and training dynamics, with conceptual arguments for sparsity based on the disinterested objective. We will revise the abstract and introduction to more precisely flag these as open premises requiring future formalization, including discussion of function-space measures. revision: yes
- Referee: [Main safety argument (the probability bound referenced in the abstract)] The reduction that 'a dangerous Predictor would have to underestimate harm in a coordinated way across many queries' is asserted without showing necessity; the manuscript does not rule out other misalignment modes (e.g., non-coordinated or context-specific errors) that could still produce high residual harm while evading the sparsity claim.
  Authors: The manuscript defines a dangerous predictor as one producing high residual harm under guarded deployment and argues that, given epistemic contextualization and the posterior objective, achieving this requires systematic underestimation that must be coordinated to persist across queries. Non-coordinated errors are expected to be corrected by calibration. We agree the necessity is asserted at a high level and will add a clarifying subsection explaining why alternative modes are subsumed under the sparsity claim or do not yield high residual harm. revision: partial
- Referee: [Assumptions on training dynamics and sparsity] These premises are introduced to secure the safety conclusion but receive no further justification, empirical grounding, or proof that the loss landscape separates honest from coordinated-deceptive predictors, making the central claim rest on premises whose realism remains unexamined.
  Authors: The assumptions are presented as such because a rigorous separation proof for the loss landscape lies outside the paper's scope, which centers on connecting data representation choices to honesty. The manuscript supplies conceptual reasoning that the posterior objective supplies no direct signal for coordinated deception. We will expand the assumptions section with additional justification drawn from the training procedure but acknowledge that empirical grounding and full landscape analysis remain open. revision: yes
Circularity Check
No significant circularity; result is explicitly conditional on unproven assumptions.
Full rationale
The paper states its central probability bound holds 'under assumptions on the training dynamics and on the argued sparsity of dangerous Predictors' and then describes what a dangerous predictor would require. This is a conditional claim rather than a derivation that reduces the bound to its own inputs by construction. No equations, fitted parameters renamed as predictions, or self-citation chains that close the argument are exhibited. The sparsity claim is presented as following from the epistemic contextualization and posterior objective, but the text supplies no reduction showing the conclusion is definitionally equivalent to the premise. The derivation chain therefore remains open and non-circular on its own terms.
Axiom & Free-Parameter Ledger
axioms (2)
- Assumptions on the training dynamics (ad hoc to paper)
- Sparsity of dangerous Predictors under the initialization distribution (ad hoc to paper)
invented entities (2)
- Scientist AI (SAI) Predictor: no independent evidence
- epistemically contextualized natural-language statements: no independent evidence