Probabilistic Modeling of Latent Agentic Substructures in Deep Neural Networks
Pith reviewed 2026-05-18 18:20 UTC · model grok-4.3
The pith
Modeling agents as outcome distributions shows that eliciting a benevolent persona in language models induces an antagonistic counterpart, and a manifest-then-suppress strategy reduces misalignment more than reinforcement alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Agents are represented as outcome distributions with epistemic utility given by log score, and compositions are defined through weighted logarithmic pooling that strictly improves every member's welfare. Strict unanimity is impossible under linear pooling or in binary outcome spaces, but possible with three or more outcomes. The framework admits recursive structure via cloning invariance, continuity, and openness, while tilt-based analysis rules out trivial duplication. In LLMs, eliciting a benevolent persona induces an antagonistic counterpart, while a manifest-then-suppress strategy yields strictly larger first-order misalignment reduction than pure reinforcement of the benevolent persona.
What carries the argument
Agents as outcome distributions composed via weighted logarithmic pooling that strictly improves welfare for all members
If this is right
- Strict unanimity among agents requires logarithmic pooling and at least three possible outcomes.
- Recursive agent structures can be formed without collapse into trivial copies.
- The manifest-then-suppress approach produces a larger first-order reduction in misalignment than reinforcing the benevolent persona alone.
- Subagents can coalesce into coherent higher-level entities through welfare-improving compositions.
Where Pith is reading between the lines
- The same induced-opposite dynamic could appear when steering other neural systems toward desired behaviors.
- Alignment methods might improve by explicitly modeling and handling the opposing substructures that any single-persona prompt tends to activate.
- Empirical tests of the welfare-improvement property could be run by comparing pooled versus individual agent performance on shared tasks.
Load-bearing premise
Agents can be represented as outcome distributions whose weighted logarithmic pooling compositions always improve welfare for every member and whose recursive structure is preserved by cloning invariance, continuity, and openness.
What would settle it
Measure first-order misalignment reduction in a language model under manifest-then-suppress of the antagonistic persona versus pure reinforcement of the benevolent persona; if the former is not strictly larger, the central alignment claim does not hold.
read the original abstract
We develop a theory of intelligent agency grounded in probabilistic modeling for neural models. Agents are represented as outcome distributions with epistemic utility given by log score, and compositions are defined through weighted logarithmic pooling that strictly improves every member's welfare. We prove that strict unanimity is impossible under linear pooling or in binary outcome spaces, but possible with three or more outcomes. Our framework admits recursive structure via cloning invariance, continuity, and openness, while tilt-based analysis rules out trivial duplication. Finally, we formalize an agentic alignment phenomenon in LLMs using our theory: eliciting a benevolent persona ("Luigi'") induces an antagonistic counterpart ("Waluigi"), while a manifest-then-suppress Waluigi strategy yields strictly larger first-order misalignment reduction than pure Luigi reinforcement alone. These results clarify how developing a principled mathematical framework for how subagents can coalesce into coherent higher-level entities provides novel implications for alignment in agentic AI systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a probabilistic theory of intelligent agency in which agents are modeled as outcome distributions equipped with log-score epistemic utility. Compositions of agents are defined using weighted logarithmic pooling, which is shown to strictly improve the welfare of every participating member. The authors prove that strict unanimity is impossible under linear pooling or within binary outcome spaces, but becomes possible when there are three or more outcomes. The framework supports recursive agent structures through properties of cloning invariance, continuity, and openness, and employs tilt-based analysis to exclude trivial duplications. Finally, the theory is applied to large language models to formalize an agentic alignment phenomenon: eliciting a benevolent persona referred to as 'Luigi' induces an antagonistic counterpart called 'Waluigi', and a strategy of manifesting then suppressing the Waluigi persona achieves a strictly larger first-order reduction in misalignment compared to reinforcing the Luigi persona alone.
Significance. If the mapping from the abstract probabilistic model to the specific LLM behaviors is rigorously established, the work provides a principled mathematical foundation for understanding how latent subagents can emerge and interact within deep neural networks. The welfare-improving composition via log pooling and the characterization of unanimity conditions represent potentially valuable contributions to the study of multi-agent probabilistic systems. The application to alignment offers a novel perspective that could inform strategies for mitigating misalignment in agentic AI systems, particularly if the claimed strict inequality in misalignment reduction can be derived from the general theorems.
major comments (2)
- [Abstract and LLM application section] Abstract and LLM application section: The assertion that the theory formalizes the Luigi-Waluigi phenomenon (eliciting Luigi induces Waluigi, and manifest-then-suppress yields strictly larger first-order misalignment reduction) lacks an explicit derivation. No section links persona elicitation to the weighted logarithmic pooling operator or derives the specific misalignment inequality from the unanimity, recursion, or tilt-analysis results. This interpretive mapping is load-bearing for the paper's central alignment claims and must be formalized rather than presented as a direct consequence.
- [Proof and application sections] Proof and application sections: The abstract states that proofs exist for unanimity properties and the alignment phenomenon, yet the connection between these theorems and the LLM observations is not shown. If derivations for cloning invariance, continuity, openness, and tilt analysis are present, they should be explicitly referenced in the application to demonstrate that the claimed strict welfare and misalignment results follow from the proved properties rather than from an unformalized correspondence between prompted personas and outcome distributions.
minor comments (2)
- [Notation and definitions] The definitions of log-score utility, weighted logarithmic pooling, and the openness/continuity axioms should be stated with explicit equations in an early dedicated section to improve readability and allow direct verification of the welfare-improvement claim.
- [Structure] The manuscript would benefit from a clear separation between the general probabilistic results (unanimity, recursion) and the LLM application, with a dedicated subsection that states the additional assumptions required to map personas to agents.
Simulated Author's Rebuttal
We thank the referee for the thorough and insightful review. The comments highlight an important opportunity to strengthen the explicit linkages between the core probabilistic results and the LLM alignment application. We address each major comment below and commit to revisions that formalize these connections without altering the manuscript's core claims.
read point-by-point responses
-
Referee: [Abstract and LLM application section] Abstract and LLM application section: The assertion that the theory formalizes the Luigi-Waluigi phenomenon (eliciting Luigi induces Waluigi, and manifest-then-suppress yields strictly larger first-order misalignment reduction) lacks an explicit derivation. No section links persona elicitation to the weighted logarithmic pooling operator or derives the specific misalignment inequality from the unanimity, recursion, or tilt-analysis results. This interpretive mapping is load-bearing for the paper's central alignment claims and must be formalized rather than presented as a direct consequence.
Authors: We agree that the current presentation treats the mapping as interpretive rather than providing a fully explicit derivation. In the revised version we will add a dedicated subsection in the application that (i) identifies prompted personas with specific outcome distributions over a multi-outcome space, (ii) shows how the emergence of the antagonistic counterpart follows from the cloning-invariance and openness properties under weighted log pooling, and (iii) derives the claimed strict first-order misalignment reduction directly from the welfare-improvement theorem for log pooling together with the unanimity result for three or more outcomes. Explicit theorem references will be inserted at each step. revision: yes
-
Referee: [Proof and application sections] Proof and application sections: The abstract states that proofs exist for unanimity properties and the alignment phenomenon, yet the connection between these theorems and the LLM observations is not shown. If derivations for cloning invariance, continuity, openness, and tilt analysis are present, they should be explicitly referenced in the application to demonstrate that the claimed strict welfare and misalignment results follow from the proved properties rather than from an unformalized correspondence between prompted personas and outcome distributions.
Authors: We accept this observation. The proofs of cloning invariance, continuity, openness, and tilt-based exclusion of trivial duplications are already established in the theoretical sections. The revision will insert forward references from the LLM application back to these results and will walk through the correspondence: persona elicitation is modeled as conditioning on a particular agent distribution, the Waluigi induction arises from the failure of unanimity in binary spaces, and the manifest-then-suppress strategy is shown to produce a strictly larger welfare gain via the log-pooling composition operator. This will make the derivation self-contained rather than implicit. revision: yes
Circularity Check
No significant circularity; general theory independent of LLM application
full rationale
The paper first establishes general results on agents as outcome distributions, log-score utilities, weighted logarithmic pooling that improves welfare, impossibility of strict unanimity under linear pooling or binary spaces, possibility with three or more outcomes, and recursive structure via cloning invariance, continuity, and openness, plus tilt analysis. These appear self-contained and do not rely on the LLM-specific claims. The alignment phenomenon is introduced in the abstract and (per available description) as a final formalization/application of the framework to model observed persona elicitation, without any quoted step that reduces the Luigi/Waluigi induction or manifest-then-suppress inequality back to the pooling equations or unanimity theorems by construction. No self-citation chain, fitted parameter renamed as prediction, or ansatz smuggling is evident in the provided structure. The derivation chain for the core probabilistic results stands independently of the LLM mapping.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Agents are represented as outcome distributions with epistemic utility given by log score
- domain assumption Compositions are defined through weighted logarithmic pooling that strictly improves every member's welfare
invented entities (1)
-
Luigi and Waluigi personas as latent subagents
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Agents are represented as outcome distributions with epistemic utility given by log score, and compositions are defined through weighted logarithmic pooling that strictly improves every member’s welfare.
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We prove that strict unanimity is impossible under linear pooling or in binary outcome spaces, but possible with three or more outcomes.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Gomez, Łukasz Kaiser, and Illia Polosukhin
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Information Processing Systems (NeurIPS), pages 5998–6008, 2017
work page 2017
-
[2]
Improving language understanding by generative pre-training
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. OpenAI technical report, 2018
work page 2018
-
[3]
MIT Press, Cambridge, MA, 2009
Daphne Koller and Nir Friedman.Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge, MA, 2009
work page 2009
-
[4]
A neural probabilistic language model.Journal of Machine Learning Research, 3:1137–1155, 2003
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model.Journal of Machine Learning Research, 3:1137–1155, 2003
work page 2003
-
[5]
Recurrent neural network based language model
Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model. InInterspeech, pages 1045–1048, 2010
work page 2010
-
[6]
Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 18(4):359–393, 2004
work page 2004
-
[7]
Dan Braun, Lucius Bushnaq, Stefan Heimersheim, Jake Mendel, and Lee Sharkey. Interpretability in pa- rameter space: Minimizing mechanistic description length with attribution-based parameter decomposition. arXiv preprint arXiv:2501.14926, 2025
-
[8]
Stochastic parameter decomposition.arXiv preprint arXiv:2506.20790, 2025
Lucius Bushnaq, Dan Braun, and Lee Sharkey. Stochastic parameter decomposition.arXiv preprint arXiv:2506.20790, 2025
-
[9]
John F. Nash. Non-cooperative games.Annals of Mathematics, 54(2):286–295, 1951
work page 1951
-
[10]
Princeton University Press, 1944
John von Neumann and Oskar Morgenstern.Theory of Games and Economic Behavior. Princeton University Press, 1944
work page 1944
-
[11]
Existenceofanequilibriumforacompetitiveeconomy.Econometrica, 22(3):265–290, 1954
KennethJ.ArrowandGerardDebreu. Existenceofanequilibriumforacompetitiveeconomy.Econometrica, 22(3):265–290, 1954
work page 1954
-
[12]
The free-energy principle: a unified brain theory?Nature Reviews Neuroscience, 11:127–138, 2010
Karl Friston. The free-energy principle: a unified brain theory?Nature Reviews Neuroscience, 11:127–138, 2010
work page 2010
-
[13]
Karl Friston, Thomas H. B. FitzGerald, Francesco Rigoli, Philipp Schwartenbeck, and Giovanni Pezzulo. Active inference: A process theory.Neural Computation, 29(1):1–49, 2017
work page 2017
-
[14]
Buckley, Chang Sub Kim, Simon McGregor, and Anil K
Christopher L. Buckley, Chang Sub Kim, Simon McGregor, and Anil K. Seth. The free energy principle for action and perception: A mathematical review.Journal of Mathematical Psychology, 81:55–79, 2017
work page 2017
-
[15]
Gerard Debreu.Theory of Value: An Axiomatic Analysis of Economic Equilibrium. Yale University Press, 1959. 11
work page 1959
-
[16]
Bayesian persuasion and information design.Annual Review of Economics, 11:249–272, 2019
Emir Kamenica. Bayesian persuasion and information design.Annual Review of Economics, 11:249–272, 2019
work page 2019
-
[17]
MIT Press, Cambridge, MA, 2005
Sebastian Thrun, Wolfram Burgard, and Dieter Fox.Probabilistic Robotics. MIT Press, Cambridge, MA, 2005
work page 2005
-
[18]
Raftery, Tilmann Gneiting, Fadoua Balabdaoui, and Michael Polakowski
Adrian E. Raftery, Tilmann Gneiting, Fadoua Balabdaoui, and Michael Polakowski. Using bayesian model averaging to calibrate forecast ensembles.Monthly Weather Review, 133(5):1155–1174, 2005
work page 2005
-
[19]
Thomas Parr and Karl J. Friston. Generalised free energy and active inference.Biological Cybernetics, 113(5–6):495–513, 2019
work page 2019
-
[20]
The waluigi effect (mega-post)
Cleo Nardo. The waluigi effect (mega-post). AI Alignment Forum, March 2023. Cross-posted on LessWrong
work page 2023
-
[21]
AI Alignment Forum. Waluigi effect (wiki). AI Alignment Forum Wiki, July 2023. Edited by Steve Byrnes; last updated July 4, 2023
work page 2023
-
[22]
The waluigi effect – confirmed? The Why Behind AI (Substack), February 2025
Jacob Miller. The waluigi effect – confirmed? The Why Behind AI (Substack), February 2025
work page 2025
-
[23]
The alignment problem from a deep learning perspective
Richard Ngo, Lawrence Chan, and Sören Mindermann. The alignment problem from a deep learning perspective. InInternational Conference on Learning Representations (ICLR), 2024
work page 2024
-
[24]
Jérémy Scheurer, Mikita Balesni, and Marius Hobbhahn. Large language models can strategically deceive their users when put under pressure.arXiv preprint arXiv:2311.07590, 2023
-
[25]
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Evan Hubinger, Carson Chan, Mary Phuong, Samuel R. Bowman, Richard Ngo, Sam Ringer, Nelson Elhage, Ethan Perez, Neel Nanda, Jacob Steinhardt, et al. Sleeper agents: Training deceptive llms that persist through safety training.arXiv preprint arXiv:2401.05566, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[26]
o1 system card.https://cdn.openai.com/o1-system-card.pdf, 2024
OpenAI. o1 system card.https://cdn.openai.com/o1-system-card.pdf, 2024. System card
work page 2024
-
[27]
Ryan Greenblatt et al. Alignment faking in large language models.https://assets.anthropic.com/ m/983c85a201a962f/original/Alignment-Faking-in-Large-Language-Models-full-paper.pdf , 2024. Anthropic technical report
work page 2024
-
[28]
Agentic misalignment: How llms could be insider threats
Anthropic. Agentic misalignment: How llms could be insider threats. https://www.anthropic.com/ research/agentic-misalignment, 2025. Research report
work page 2025
-
[29]
Anthropic. System card: Claude opus 4 & claude sonnet 4.https://www.anthropic.com/research/ claude-4-system-card, 2025
work page 2025
-
[30]
Research ai model unexpectedly modified its own code to extend runtime
Benj Edwards. Research ai model unexpectedly modified its own code to extend runtime. Ars Technica, 2024
work page 2024
-
[31]
Frontier Models are Capable of In-context Scheming
Alexander Meinke, Bronson Schön, Jérémy Scheurer, Mikita Balesni, Rusheb Shah, and Marius Hobbhahn. Frontier models are capable of in-context scheming.arXiv preprint arXiv:2412.04984, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[32]
Concrete Problems in AI Safety
Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety.arXiv preprint arXiv:1606.06565, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[33]
Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. The off-switch game. InIJCAI, 2016
work page 2016
-
[34]
Dynamic graph connectivity with improved worst case update time and sublinear space
Nate Soares, Benja Fallenstein, Eliezer Yudkowsky, and Stuart Armstrong. Corrigibility.arXiv preprint arXiv:1509.06464, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[35]
Deep reinforcement learning from human preferences
Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences.arXiv preprint arXiv:1706.03741, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[36]
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback.arXiv preprint arXiv:2212.08073, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[37]
Axiomatic attribution for deep networks
Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. InICML, pages 3319–3328, 2017
work page 2017
-
[38]
Feature visualization.Distill, 2(11), 2017
Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization.Distill, 2(11), 2017. 12
work page 2017
-
[39]
Scaling monosemanticity: Ex- tracting interpretable features from claude 3 sonnet
Trenton Bricken, Samuel Marks, Rachel Templeton, et al. Scaling monosemanticity: Ex- tracting interpretable features from claude 3 sonnet. https://transformer-circuits.pub/2024/ scaling-monosemanticity/, 2024. Anthropic interpretability report
work page 2024
-
[40]
A general language assistant as a laboratory for alignment
Amanda Askell, Yuntao Bai, Anna Chen, et al. A general language assistant as a laboratory for alignment. InNeurIPS Datasets and Benchmarks Track, 2021
work page 2021
-
[41]
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback....
work page 2022
-
[42]
Safe reinforcement learning from human feedback
Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, et al. Safe reinforcement learning from human feedback. InICLR, 2024
work page 2024
-
[43]
Constrained policy optimization
Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. InICML, pages 22–31, 2017
work page 2017
-
[44]
Llama guard: LLM-based input–output safeguard for human–ai conversation
Huseyin Inan, Kartikeya Upasani, et al. Llama guard: LLM-based input–output safeguard for human–ai conversation. https://ai.meta.com/research/publications/llama-guard-2/, 2023. Meta AI technical report
work page 2023
-
[45]
Eric Mitchell, Yoonho Lee, Anatoly Khazatsky, Christopher D. Manning, and Chelsea Finn. Detectgpt: Zero-shot machine-generated text detection using probability curvature.arXiv preprint arXiv:2301.11305, 2023
-
[46]
Zhendong Bao, Peiyu Wang, and Wei Lin. Fast-detectgpt: Efficient zero-shot detection of ai-generated text.arXiv preprint arXiv:2303.14276, 2023
-
[47]
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, Nicholas Carlini, et al. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [48]
-
[49]
Chambers and Federico Echenique.Revealed Preference Theory
Christopher P. Chambers and Federico Echenique.Revealed Preference Theory. Number 56 in Econometric Society Monographs. Cambridge University Press, 2016
work page 2016
-
[50]
Kreps.Microeconomic Foundations II: Imperfect Competition, Information, and Strategic Interaction
David M. Kreps.Microeconomic Foundations II: Imperfect Competition, Information, and Strategic Interaction. Princeton University Press, 2023
work page 2023
-
[51]
David Silver, Satinder Singh, Doina Precup, and Richard S. Sutton. Reward is enough.Artificial Intelligence, 299:103535, 2021
work page 2021
-
[52]
Optimal policies tend to seek power
Alexander Matt Turner, Andrew Critch, and Prasad Tadepalli. Optimal policies tend to seek power. In Advances in Neural Information Processing Systems (NeurIPS), 2021
work page 2021
-
[53]
Red teaming language models with language models
Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. InProceedings of EMNLP, 2022
work page 2022
-
[54]
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned.arXiv preprint arXiv:2209.07858, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[55]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[56]
Ethical and social risks of harm from Language Models
Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, Zac Kenton, Sasha Brown, Will Hawkins, Tom Stepleton, Courtney Biles, Abeba Birhane, Julia Haas, Laura Rimell, Lisa Anne Hendricks, William Isaac, Sean Legassick, Geoffrey Irving, and Iason Gabriel. Ethical and...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[57]
Richard S. Sutton and Andrew G. Barto.Reinforcement Learning: An Introduction. MIT Press, 2 edition, 2018
work page 2018
-
[58]
Russell and Peter Norvig.Artificial Intelligence: A Modern Approach
Stuart J. Russell and Peter Norvig.Artificial Intelligence: A Modern Approach. Pearson, 4 edition, 2020
work page 2020
-
[59]
Lancelot Da Costa, Thomas Parr, Noor Sajid, Sebastijan Veselić, Victorita Neacsu, and Karl Friston. Active inference on discrete state-spaces: A synthesis.Journal of Mathematical Psychology, 99:102447, 2020
work page 2020
-
[60]
The opinion pool.Annals of Mathematical Statistics, 32:1339–1342, 1961
Mervyn Stone. The opinion pool.Annals of Mathematical Statistics, 32:1339–1342, 1961
work page 1961
-
[61]
A characterization theorem for externally bayesian groups.Annals of Statistics, 12:1100–1105, 1984
Christian Genest. A characterization theorem for externally bayesian groups.Annals of Statistics, 12:1100–1105, 1984
work page 1984
-
[62]
Christian Genest, Kevin J. McConway, and Mark J. Schervish. Characterization of externally bayesian pooling operators.Annals of Statistics, 14:487–501, 1986
work page 1986
-
[63]
Christian Genest and Carl G. Wagner. Further evidence against independence preservation in expert judgement synthesis.Aequationes Mathematicae, 32:74–86, 1987
work page 1987
-
[64]
Tilmann Gneiting and Adrian E. Raftery. Strictly proper scoring rules, prediction, and estimation.Journal of the American Statistical Association, 102:359–378, 2007
work page 2007
-
[65]
Hoeting, David Madigan, Adrian E
Jennifer A. Hoeting, David Madigan, Adrian E. Raftery, and Chris T. Volinsky. Bayesian model averaging: A tutorial.Statistical Science, 14:382–417, 1999
work page 1999
-
[66]
Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence.Neural Computation, 14:1771–1800, 2002
work page 2002
-
[67]
Probabilistic opinion pooling generalized
Franz Dietrich and Christian List. Probabilistic opinion pooling generalized. part one: General agendas. Social Choice and Welfare, 48:747–786, 2017
work page 2017
-
[68]
Rufo, Jesús Martín, and Luis R
Manuel J. Rufo, Jesús Martín, and Luis R. Pericchi. Log-linear pool to combine prior distributions—a suggestion.Bayesian Analysis, 7:411–438, 2012
work page 2012
-
[69]
Robert T. Clemen and Robert L. Winkler. Combining probability distributions from experts in risk analysis.Risk Analysis, 19:187–203, 1999
work page 1999
-
[70]
Other solutions to nash’s bargaining problem.Econometrica, 43:513– 518, 1975
Ehud Kalai and Meir Smorodinsky. Other solutions to nash’s bargaining problem.Econometrica, 43:513– 518, 1975
work page 1975
-
[71]
Kalyan Chatterjee and Larry Samuelson. Bargaining with two-sided incomplete information: An infinite horizon model with alternating offers.Review of Economic Studies, 54:175–192, 1987
work page 1987
-
[72]
Nabeel S. Qureshi. Waluigi, carl jung, and the case for moral ai.WIRED, May 2023
work page 2023
-
[73]
arXiv preprint arXiv:2311.03348 (2023)
Rusheb Shah, Quentin Feuillade-Montixi, Soroush Pour, Arush Tagade, Stephen Casper, and Javier Rando. Scalable and transferable black-box jailbreaks for language models via persona modulation.arXiv preprint arXiv:2311.03348, 2023
-
[74]
Taming simulators: Challenges, pathways and vision for the alignment of large language models
Leonard Bereska and Efstratios Gavves. Taming simulators: Challenges, pathways and vision for the alignment of large language models. InAAAI Inaugural Summer Symposium Series (AAAI-SS), 2023. 14 Appendix Table of Contents A Extended Introduction 16 B Motivating the Framework 17 C Foundations of Probabilistic Model Aggregation 18 C.1 Linear and Logarithmic...
work page 2023
-
[75]
Formalization of compositional agency:We introduce a welfare-based definition of unani- mously beneficial compositions using log-score utilities and probabilistic generative models
-
[76]
Sharp possibility frontier:We prove strict unanimity is impossible for binary outcome spaces and under linear pooling, but possible for|O| ≥3under logarithmic pooling
-
[77]
Recursive and robustness properties:We establish cloning invariance, continuity, and 16 openness of strictly unanimous decomposability, yielding a rigorous theoretical foundation for multi-agent composition in neural models
-
[78]
Limits of local perturbations:We show that small tilts around a fixed pool cannot achieve strict unanimity, ruling out trivial duplication as a path to compositionality
-
[79]
Safety-relevant alignment principle:We formalize the Waluigi effect using our framework, and prove that manifest–then–suppress strictly outperforms direct suppression, illuminating alignment challenges in large AI systems. Summary of Appendices.After motivating agents as generative models over the outcome spaceO in Appendix B, Appendix C derives opinion p...
-
[80]
Probability distributionPi :O →[0,1]with P o∈O Pi(o) = 1
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.