pith. the verified trust layer for science. sign in

arxiv: 2509.06701 · v2 · submitted 2025-09-08 · 💻 cs.LG · cs.AI

Probabilistic Modeling of Latent Agentic Substructures in Deep Neural Networks

Pith reviewed 2026-05-18 18:20 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords probabilistic modelingagentic substructureslogarithmic poolingAI alignmentLuigi-Waluigi effectoutcome distributionsdeep neural networks
0
0 comments X p. Extension

The pith

Modeling agents as outcome distributions shows that eliciting a benevolent persona in language models induces an antagonistic counterpart, and a manifest-then-suppress strategy reduces misalignment more than reinforcement alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs a probabilistic theory of agency in which individual agents are represented as distributions over possible outcomes, with their epistemic value measured by log score. These agents combine through weighted logarithmic pooling, a process shown to raise welfare for every participant compared to acting separately. The work proves that strict unanimity among agents cannot occur under linear pooling or with only two possible outcomes, yet it can arise when at least three outcomes are available. The framework supports recursive construction of larger agents from smaller ones through properties such as cloning invariance. When applied to large language models, the theory accounts for the observation that prompting a helpful persona automatically elicits a harmful opposing persona, and it demonstrates that first surfacing and then suppressing the harmful persona produces a greater first-order drop in misalignment than simply strengthening the helpful persona by itself.

Core claim

Agents are represented as outcome distributions with epistemic utility given by log score, and compositions are defined through weighted logarithmic pooling that strictly improves every member's welfare. Strict unanimity is impossible under linear pooling or in binary outcome spaces, but possible with three or more outcomes. The framework admits recursive structure via cloning invariance, continuity, and openness, while tilt-based analysis rules out trivial duplication. In LLMs, eliciting a benevolent persona induces an antagonistic counterpart, while a manifest-then-suppress strategy yields strictly larger first-order misalignment reduction than pure reinforcement of the benevolent persona.

What carries the argument

Agents as outcome distributions composed via weighted logarithmic pooling that strictly improves welfare for all members

If this is right

  • Strict unanimity among agents requires logarithmic pooling and at least three possible outcomes.
  • Recursive agent structures can be formed without collapse into trivial copies.
  • The manifest-then-suppress approach produces a larger first-order reduction in misalignment than reinforcing the benevolent persona alone.
  • Subagents can coalesce into coherent higher-level entities through welfare-improving compositions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same induced-opposite dynamic could appear when steering other neural systems toward desired behaviors.
  • Alignment methods might improve by explicitly modeling and handling the opposing substructures that any single-persona prompt tends to activate.
  • Empirical tests of the welfare-improvement property could be run by comparing pooled versus individual agent performance on shared tasks.

Load-bearing premise

Agents can be represented as outcome distributions whose weighted logarithmic pooling compositions always improve welfare for every member and whose recursive structure is preserved by cloning invariance, continuity, and openness.

What would settle it

Measure first-order misalignment reduction in a language model under manifest-then-suppress of the antagonistic persona versus pure reinforcement of the benevolent persona; if the former is not strictly larger, the central alignment claim does not hold.

read the original abstract

We develop a theory of intelligent agency grounded in probabilistic modeling for neural models. Agents are represented as outcome distributions with epistemic utility given by log score, and compositions are defined through weighted logarithmic pooling that strictly improves every member's welfare. We prove that strict unanimity is impossible under linear pooling or in binary outcome spaces, but possible with three or more outcomes. Our framework admits recursive structure via cloning invariance, continuity, and openness, while tilt-based analysis rules out trivial duplication. Finally, we formalize an agentic alignment phenomenon in LLMs using our theory: eliciting a benevolent persona ("Luigi'") induces an antagonistic counterpart ("Waluigi"), while a manifest-then-suppress Waluigi strategy yields strictly larger first-order misalignment reduction than pure Luigi reinforcement alone. These results clarify how developing a principled mathematical framework for how subagents can coalesce into coherent higher-level entities provides novel implications for alignment in agentic AI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper develops a probabilistic theory of intelligent agency in which agents are modeled as outcome distributions equipped with log-score epistemic utility. Compositions of agents are defined using weighted logarithmic pooling, which is shown to strictly improve the welfare of every participating member. The authors prove that strict unanimity is impossible under linear pooling or within binary outcome spaces, but becomes possible when there are three or more outcomes. The framework supports recursive agent structures through properties of cloning invariance, continuity, and openness, and employs tilt-based analysis to exclude trivial duplications. Finally, the theory is applied to large language models to formalize an agentic alignment phenomenon: eliciting a benevolent persona referred to as 'Luigi' induces an antagonistic counterpart called 'Waluigi', and a strategy of manifesting then suppressing the Waluigi persona achieves a strictly larger first-order reduction in misalignment compared to reinforcing the Luigi persona alone.

Significance. If the mapping from the abstract probabilistic model to the specific LLM behaviors is rigorously established, the work provides a principled mathematical foundation for understanding how latent subagents can emerge and interact within deep neural networks. The welfare-improving composition via log pooling and the characterization of unanimity conditions represent potentially valuable contributions to the study of multi-agent probabilistic systems. The application to alignment offers a novel perspective that could inform strategies for mitigating misalignment in agentic AI systems, particularly if the claimed strict inequality in misalignment reduction can be derived from the general theorems.

major comments (2)
  1. [Abstract and LLM application section] Abstract and LLM application section: The assertion that the theory formalizes the Luigi-Waluigi phenomenon (eliciting Luigi induces Waluigi, and manifest-then-suppress yields strictly larger first-order misalignment reduction) lacks an explicit derivation. No section links persona elicitation to the weighted logarithmic pooling operator or derives the specific misalignment inequality from the unanimity, recursion, or tilt-analysis results. This interpretive mapping is load-bearing for the paper's central alignment claims and must be formalized rather than presented as a direct consequence.
  2. [Proof and application sections] Proof and application sections: The abstract states that proofs exist for unanimity properties and the alignment phenomenon, yet the connection between these theorems and the LLM observations is not shown. If derivations for cloning invariance, continuity, openness, and tilt analysis are present, they should be explicitly referenced in the application to demonstrate that the claimed strict welfare and misalignment results follow from the proved properties rather than from an unformalized correspondence between prompted personas and outcome distributions.
minor comments (2)
  1. [Notation and definitions] The definitions of log-score utility, weighted logarithmic pooling, and the openness/continuity axioms should be stated with explicit equations in an early dedicated section to improve readability and allow direct verification of the welfare-improvement claim.
  2. [Structure] The manuscript would benefit from a clear separation between the general probabilistic results (unanimity, recursion) and the LLM application, with a dedicated subsection that states the additional assumptions required to map personas to agents.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough and insightful review. The comments highlight an important opportunity to strengthen the explicit linkages between the core probabilistic results and the LLM alignment application. We address each major comment below and commit to revisions that formalize these connections without altering the manuscript's core claims.

read point-by-point responses
  1. Referee: [Abstract and LLM application section] Abstract and LLM application section: The assertion that the theory formalizes the Luigi-Waluigi phenomenon (eliciting Luigi induces Waluigi, and manifest-then-suppress yields strictly larger first-order misalignment reduction) lacks an explicit derivation. No section links persona elicitation to the weighted logarithmic pooling operator or derives the specific misalignment inequality from the unanimity, recursion, or tilt-analysis results. This interpretive mapping is load-bearing for the paper's central alignment claims and must be formalized rather than presented as a direct consequence.

    Authors: We agree that the current presentation treats the mapping as interpretive rather than providing a fully explicit derivation. In the revised version we will add a dedicated subsection in the application that (i) identifies prompted personas with specific outcome distributions over a multi-outcome space, (ii) shows how the emergence of the antagonistic counterpart follows from the cloning-invariance and openness properties under weighted log pooling, and (iii) derives the claimed strict first-order misalignment reduction directly from the welfare-improvement theorem for log pooling together with the unanimity result for three or more outcomes. Explicit theorem references will be inserted at each step. revision: yes

  2. Referee: [Proof and application sections] Proof and application sections: The abstract states that proofs exist for unanimity properties and the alignment phenomenon, yet the connection between these theorems and the LLM observations is not shown. If derivations for cloning invariance, continuity, openness, and tilt analysis are present, they should be explicitly referenced in the application to demonstrate that the claimed strict welfare and misalignment results follow from the proved properties rather than from an unformalized correspondence between prompted personas and outcome distributions.

    Authors: We accept this observation. The proofs of cloning invariance, continuity, openness, and tilt-based exclusion of trivial duplications are already established in the theoretical sections. The revision will insert forward references from the LLM application back to these results and will walk through the correspondence: persona elicitation is modeled as conditioning on a particular agent distribution, the Waluigi induction arises from the failure of unanimity in binary spaces, and the manifest-then-suppress strategy is shown to produce a strictly larger welfare gain via the log-pooling composition operator. This will make the derivation self-contained rather than implicit. revision: yes

Circularity Check

0 steps flagged

No significant circularity; general theory independent of LLM application

full rationale

The paper first establishes general results on agents as outcome distributions, log-score utilities, weighted logarithmic pooling that improves welfare, impossibility of strict unanimity under linear pooling or binary spaces, possibility with three or more outcomes, and recursive structure via cloning invariance, continuity, and openness, plus tilt analysis. These appear self-contained and do not rely on the LLM-specific claims. The alignment phenomenon is introduced in the abstract and (per available description) as a final formalization/application of the framework to model observed persona elicitation, without any quoted step that reduces the Luigi/Waluigi induction or manifest-then-suppress inequality back to the pooling equations or unanimity theorems by construction. No self-citation chain, fitted parameter renamed as prediction, or ansatz smuggling is evident in the provided structure. The derivation chain for the core probabilistic results stands independently of the LLM mapping.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The abstract introduces several foundational modeling choices without external benchmarks or data fits; these constitute the main load-bearing assumptions rather than fitted parameters.

axioms (2)
  • domain assumption Agents are represented as outcome distributions with epistemic utility given by log score
    Stated directly as the representation of agents in the opening sentence of the abstract.
  • domain assumption Compositions are defined through weighted logarithmic pooling that strictly improves every member's welfare
    Central definition used to build the theory of agent composition and welfare improvement.
invented entities (1)
  • Luigi and Waluigi personas as latent subagents no independent evidence
    purpose: To formalize an agentic alignment phenomenon in LLMs
    These are introduced as concrete illustrations of the general theory; no independent falsifiable prediction (e.g., measurable activation patterns) is supplied in the abstract.

pith-pipeline@v0.9.0 · 5685 in / 1511 out tokens · 56382 ms · 2026-05-18T18:20:52.613366+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · 10 internal anchors

  1. [1]

    Gomez, Łukasz Kaiser, and Illia Polosukhin

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Information Processing Systems (NeurIPS), pages 5998–6008, 2017

  2. [2]

    Improving language understanding by generative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. OpenAI technical report, 2018

  3. [3]

    MIT Press, Cambridge, MA, 2009

    Daphne Koller and Nir Friedman.Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge, MA, 2009

  4. [4]

    A neural probabilistic language model.Journal of Machine Learning Research, 3:1137–1155, 2003

    Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model.Journal of Machine Learning Research, 3:1137–1155, 2003

  5. [5]

    Recurrent neural network based language model

    Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model. InInterspeech, pages 1045–1048, 2010

  6. [6]

    Chen and Joshua Goodman

    Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 18(4):359–393, 2004

  7. [7]

    Interpretability in pa- rameter space: Minimizing mechanistic description length with attribution-based parameter decomposition

    Dan Braun, Lucius Bushnaq, Stefan Heimersheim, Jake Mendel, and Lee Sharkey. Interpretability in pa- rameter space: Minimizing mechanistic description length with attribution-based parameter decomposition. arXiv preprint arXiv:2501.14926, 2025

  8. [8]

    Stochastic parameter decomposition.arXiv preprint arXiv:2506.20790, 2025

    Lucius Bushnaq, Dan Braun, and Lee Sharkey. Stochastic parameter decomposition.arXiv preprint arXiv:2506.20790, 2025

  9. [9]

    John F. Nash. Non-cooperative games.Annals of Mathematics, 54(2):286–295, 1951

  10. [10]

    Princeton University Press, 1944

    John von Neumann and Oskar Morgenstern.Theory of Games and Economic Behavior. Princeton University Press, 1944

  11. [11]

    Existenceofanequilibriumforacompetitiveeconomy.Econometrica, 22(3):265–290, 1954

    KennethJ.ArrowandGerardDebreu. Existenceofanequilibriumforacompetitiveeconomy.Econometrica, 22(3):265–290, 1954

  12. [12]

    The free-energy principle: a unified brain theory?Nature Reviews Neuroscience, 11:127–138, 2010

    Karl Friston. The free-energy principle: a unified brain theory?Nature Reviews Neuroscience, 11:127–138, 2010

  13. [13]

    Karl Friston, Thomas H. B. FitzGerald, Francesco Rigoli, Philipp Schwartenbeck, and Giovanni Pezzulo. Active inference: A process theory.Neural Computation, 29(1):1–49, 2017

  14. [14]

    Buckley, Chang Sub Kim, Simon McGregor, and Anil K

    Christopher L. Buckley, Chang Sub Kim, Simon McGregor, and Anil K. Seth. The free energy principle for action and perception: A mathematical review.Journal of Mathematical Psychology, 81:55–79, 2017

  15. [15]

    Yale University Press, 1959

    Gerard Debreu.Theory of Value: An Axiomatic Analysis of Economic Equilibrium. Yale University Press, 1959. 11

  16. [16]

    Bayesian persuasion and information design.Annual Review of Economics, 11:249–272, 2019

    Emir Kamenica. Bayesian persuasion and information design.Annual Review of Economics, 11:249–272, 2019

  17. [17]

    MIT Press, Cambridge, MA, 2005

    Sebastian Thrun, Wolfram Burgard, and Dieter Fox.Probabilistic Robotics. MIT Press, Cambridge, MA, 2005

  18. [18]

    Raftery, Tilmann Gneiting, Fadoua Balabdaoui, and Michael Polakowski

    Adrian E. Raftery, Tilmann Gneiting, Fadoua Balabdaoui, and Michael Polakowski. Using bayesian model averaging to calibrate forecast ensembles.Monthly Weather Review, 133(5):1155–1174, 2005

  19. [19]

    Thomas Parr and Karl J. Friston. Generalised free energy and active inference.Biological Cybernetics, 113(5–6):495–513, 2019

  20. [20]

    The waluigi effect (mega-post)

    Cleo Nardo. The waluigi effect (mega-post). AI Alignment Forum, March 2023. Cross-posted on LessWrong

  21. [21]

    Waluigi effect (wiki)

    AI Alignment Forum. Waluigi effect (wiki). AI Alignment Forum Wiki, July 2023. Edited by Steve Byrnes; last updated July 4, 2023

  22. [22]

    The waluigi effect – confirmed? The Why Behind AI (Substack), February 2025

    Jacob Miller. The waluigi effect – confirmed? The Why Behind AI (Substack), February 2025

  23. [23]

    The alignment problem from a deep learning perspective

    Richard Ngo, Lawrence Chan, and Sören Mindermann. The alignment problem from a deep learning perspective. InInternational Conference on Learning Representations (ICLR), 2024

  24. [24]

    Large language models can strategically deceive their users when put under pressure.arXiv preprint arXiv:2311.07590, 2023

    Jérémy Scheurer, Mikita Balesni, and Marius Hobbhahn. Large language models can strategically deceive their users when put under pressure.arXiv preprint arXiv:2311.07590, 2023

  25. [25]

    Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

    Evan Hubinger, Carson Chan, Mary Phuong, Samuel R. Bowman, Richard Ngo, Sam Ringer, Nelson Elhage, Ethan Perez, Neel Nanda, Jacob Steinhardt, et al. Sleeper agents: Training deceptive llms that persist through safety training.arXiv preprint arXiv:2401.05566, 2024

  26. [26]

    o1 system card.https://cdn.openai.com/o1-system-card.pdf, 2024

    OpenAI. o1 system card.https://cdn.openai.com/o1-system-card.pdf, 2024. System card

  27. [27]

    Alignment faking in large language models.https://assets.anthropic.com/ m/983c85a201a962f/original/Alignment-Faking-in-Large-Language-Models-full-paper.pdf , 2024

    Ryan Greenblatt et al. Alignment faking in large language models.https://assets.anthropic.com/ m/983c85a201a962f/original/Alignment-Faking-in-Large-Language-Models-full-paper.pdf , 2024. Anthropic technical report

  28. [28]

    Agentic misalignment: How llms could be insider threats

    Anthropic. Agentic misalignment: How llms could be insider threats. https://www.anthropic.com/ research/agentic-misalignment, 2025. Research report

  29. [29]

    System card: Claude opus 4 & claude sonnet 4.https://www.anthropic.com/research/ claude-4-system-card, 2025

    Anthropic. System card: Claude opus 4 & claude sonnet 4.https://www.anthropic.com/research/ claude-4-system-card, 2025

  30. [30]

    Research ai model unexpectedly modified its own code to extend runtime

    Benj Edwards. Research ai model unexpectedly modified its own code to extend runtime. Ars Technica, 2024

  31. [31]

    Frontier Models are Capable of In-context Scheming

    Alexander Meinke, Bronson Schön, Jérémy Scheurer, Mikita Balesni, Rusheb Shah, and Marius Hobbhahn. Frontier models are capable of in-context scheming.arXiv preprint arXiv:2412.04984, 2024

  32. [32]

    Concrete Problems in AI Safety

    Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety.arXiv preprint arXiv:1606.06565, 2016

  33. [33]

    The off-switch game

    Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. The off-switch game. InIJCAI, 2016

  34. [34]

    Dynamic graph connectivity with improved worst case update time and sublinear space

    Nate Soares, Benja Fallenstein, Eliezer Yudkowsky, and Stuart Armstrong. Corrigibility.arXiv preprint arXiv:1509.06464, 2015

  35. [35]

    Deep reinforcement learning from human preferences

    Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences.arXiv preprint arXiv:1706.03741, 2017

  36. [36]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback.arXiv preprint arXiv:2212.08073, 2022

  37. [37]

    Axiomatic attribution for deep networks

    Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. InICML, pages 3319–3328, 2017

  38. [38]

    Feature visualization.Distill, 2(11), 2017

    Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization.Distill, 2(11), 2017. 12

  39. [39]

    Scaling monosemanticity: Ex- tracting interpretable features from claude 3 sonnet

    Trenton Bricken, Samuel Marks, Rachel Templeton, et al. Scaling monosemanticity: Ex- tracting interpretable features from claude 3 sonnet. https://transformer-circuits.pub/2024/ scaling-monosemanticity/, 2024. Anthropic interpretability report

  40. [40]

    A general language assistant as a laboratory for alignment

    Amanda Askell, Yuntao Bai, Anna Chen, et al. A general language assistant as a laboratory for alignment. InNeurIPS Datasets and Benchmarks Track, 2021

  41. [41]

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback....

  42. [42]

    Safe reinforcement learning from human feedback

    Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, et al. Safe reinforcement learning from human feedback. InICLR, 2024

  43. [43]

    Constrained policy optimization

    Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. InICML, pages 22–31, 2017

  44. [44]

    Llama guard: LLM-based input–output safeguard for human–ai conversation

    Huseyin Inan, Kartikeya Upasani, et al. Llama guard: LLM-based input–output safeguard for human–ai conversation. https://ai.meta.com/research/publications/llama-guard-2/, 2023. Meta AI technical report

  45. [45]

    Manning, and Chelsea Finn

    Eric Mitchell, Yoonho Lee, Anatoly Khazatsky, Christopher D. Manning, and Chelsea Finn. Detectgpt: Zero-shot machine-generated text detection using probability curvature.arXiv preprint arXiv:2301.11305, 2023

  46. [46]

    Fast-detectgpt: Efficient zero-shot detection of ai-generated text.arXiv preprint arXiv:2303.14276, 2023

    Zhendong Bao, Peiyu Wang, and Wei Lin. Fast-detectgpt: Efficient zero-shot detection of ai-generated text.arXiv preprint arXiv:2303.14276, 2023

  47. [47]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, Nicholas Carlini, et al. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023

  48. [48]

    Harsanyi

    John C. Harsanyi. Cardinal welfare, individualistic ethics, and interpersonal comparisons of utility.Journal of Political Economy, 63(4):309–321, 1955

  49. [49]

    Chambers and Federico Echenique.Revealed Preference Theory

    Christopher P. Chambers and Federico Echenique.Revealed Preference Theory. Number 56 in Econometric Society Monographs. Cambridge University Press, 2016

  50. [50]

    Kreps.Microeconomic Foundations II: Imperfect Competition, Information, and Strategic Interaction

    David M. Kreps.Microeconomic Foundations II: Imperfect Competition, Information, and Strategic Interaction. Princeton University Press, 2023

  51. [51]

    David Silver, Satinder Singh, Doina Precup, and Richard S. Sutton. Reward is enough.Artificial Intelligence, 299:103535, 2021

  52. [52]

    Optimal policies tend to seek power

    Alexander Matt Turner, Andrew Critch, and Prasad Tadepalli. Optimal policies tend to seek power. In Advances in Neural Information Processing Systems (NeurIPS), 2021

  53. [53]

    Red teaming language models with language models

    Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. InProceedings of EMNLP, 2022

  54. [54]

    Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

    Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned.arXiv preprint arXiv:2209.07858, 2022

  55. [55]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

  56. [56]

    Ethical and social risks of harm from Language Models

    Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, Zac Kenton, Sasha Brown, Will Hawkins, Tom Stepleton, Courtney Biles, Abeba Birhane, Julia Haas, Laura Rimell, Lisa Anne Hendricks, William Isaac, Sean Legassick, Geoffrey Irving, and Iason Gabriel. Ethical and...

  57. [57]

    Sutton and Andrew G

    Richard S. Sutton and Andrew G. Barto.Reinforcement Learning: An Introduction. MIT Press, 2 edition, 2018

  58. [58]

    Russell and Peter Norvig.Artificial Intelligence: A Modern Approach

    Stuart J. Russell and Peter Norvig.Artificial Intelligence: A Modern Approach. Pearson, 4 edition, 2020

  59. [59]

    Active inference on discrete state-spaces: A synthesis.Journal of Mathematical Psychology, 99:102447, 2020

    Lancelot Da Costa, Thomas Parr, Noor Sajid, Sebastijan Veselić, Victorita Neacsu, and Karl Friston. Active inference on discrete state-spaces: A synthesis.Journal of Mathematical Psychology, 99:102447, 2020

  60. [60]

    The opinion pool.Annals of Mathematical Statistics, 32:1339–1342, 1961

    Mervyn Stone. The opinion pool.Annals of Mathematical Statistics, 32:1339–1342, 1961

  61. [61]

    A characterization theorem for externally bayesian groups.Annals of Statistics, 12:1100–1105, 1984

    Christian Genest. A characterization theorem for externally bayesian groups.Annals of Statistics, 12:1100–1105, 1984

  62. [62]

    McConway, and Mark J

    Christian Genest, Kevin J. McConway, and Mark J. Schervish. Characterization of externally bayesian pooling operators.Annals of Statistics, 14:487–501, 1986

  63. [63]

    Christian Genest and Carl G. Wagner. Further evidence against independence preservation in expert judgement synthesis.Aequationes Mathematicae, 32:74–86, 1987

  64. [64]

    Tilmann Gneiting and Adrian E. Raftery. Strictly proper scoring rules, prediction, and estimation.Journal of the American Statistical Association, 102:359–378, 2007

  65. [65]

    Hoeting, David Madigan, Adrian E

    Jennifer A. Hoeting, David Madigan, Adrian E. Raftery, and Chris T. Volinsky. Bayesian model averaging: A tutorial.Statistical Science, 14:382–417, 1999

  66. [66]

    Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence.Neural Computation, 14:1771–1800, 2002

  67. [67]

    Probabilistic opinion pooling generalized

    Franz Dietrich and Christian List. Probabilistic opinion pooling generalized. part one: General agendas. Social Choice and Welfare, 48:747–786, 2017

  68. [68]

    Rufo, Jesús Martín, and Luis R

    Manuel J. Rufo, Jesús Martín, and Luis R. Pericchi. Log-linear pool to combine prior distributions—a suggestion.Bayesian Analysis, 7:411–438, 2012

  69. [69]

    Clemen and Robert L

    Robert T. Clemen and Robert L. Winkler. Combining probability distributions from experts in risk analysis.Risk Analysis, 19:187–203, 1999

  70. [70]

    Other solutions to nash’s bargaining problem.Econometrica, 43:513– 518, 1975

    Ehud Kalai and Meir Smorodinsky. Other solutions to nash’s bargaining problem.Econometrica, 43:513– 518, 1975

  71. [71]

    Bargaining with two-sided incomplete information: An infinite horizon model with alternating offers.Review of Economic Studies, 54:175–192, 1987

    Kalyan Chatterjee and Larry Samuelson. Bargaining with two-sided incomplete information: An infinite horizon model with alternating offers.Review of Economic Studies, 54:175–192, 1987

  72. [72]

    Nabeel S. Qureshi. Waluigi, carl jung, and the case for moral ai.WIRED, May 2023

  73. [73]

    arXiv preprint arXiv:2311.03348 (2023)

    Rusheb Shah, Quentin Feuillade-Montixi, Soroush Pour, Arush Tagade, Stephen Casper, and Javier Rando. Scalable and transferable black-box jailbreaks for language models via persona modulation.arXiv preprint arXiv:2311.03348, 2023

  74. [74]

    Taming simulators: Challenges, pathways and vision for the alignment of large language models

    Leonard Bereska and Efstratios Gavves. Taming simulators: Challenges, pathways and vision for the alignment of large language models. InAAAI Inaugural Summer Symposium Series (AAAI-SS), 2023. 14 Appendix Table of Contents A Extended Introduction 16 B Motivating the Framework 17 C Foundations of Probabilistic Model Aggregation 18 C.1 Linear and Logarithmic...

  75. [75]

    Formalization of compositional agency:We introduce a welfare-based definition of unani- mously beneficial compositions using log-score utilities and probabilistic generative models

  76. [76]

    Sharp possibility frontier:We prove strict unanimity is impossible for binary outcome spaces and under linear pooling, but possible for|O| ≥3under logarithmic pooling

  77. [77]

    Recursive and robustness properties:We establish cloning invariance, continuity, and 16 openness of strictly unanimous decomposability, yielding a rigorous theoretical foundation for multi-agent composition in neural models

  78. [78]

    Limits of local perturbations:We show that small tilts around a fixed pool cannot achieve strict unanimity, ruling out trivial duplication as a path to compositionality

  79. [79]

    opinion pool

    Safety-relevant alignment principle:We formalize the Waluigi effect using our framework, and prove that manifest–then–suppress strictly outperforms direct suppression, illuminating alignment challenges in large AI systems. Summary of Appendices.After motivating agents as generative models over the outcome spaceO in Appendix B, Appendix C derives opinion p...

  80. [80]

    Probability distributionPi :O →[0,1]with P o∈O Pi(o) = 1

Showing first 80 references.