arxiv: 2509.06701 · v2 · submitted 2025-09-08 · 💻 cs.LG · cs.AI

Probabilistic Modeling of Latent Agentic Substructures in Deep Neural Networks

Su Hyeong Lee, Risi Kondor, Richard Ngo

Pith reviewed 2026-05-18 18:20 UTC · model grok-4.3

keywords: probabilistic modeling, agentic substructures, logarithmic pooling, AI alignment, Luigi-Waluigi effect, outcome distributions, deep neural networks

The pith

Modeling agents as outcome distributions shows that eliciting a benevolent persona in language models induces an antagonistic counterpart, and a manifest-then-suppress strategy reduces misalignment more than reinforcement alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs a probabilistic theory of agency in which individual agents are represented as distributions over possible outcomes, with their epistemic value measured by log score. These agents combine through weighted logarithmic pooling, a process shown to raise welfare for every participant compared to acting separately. The work proves that strict unanimity among agents cannot occur under linear pooling or with only two possible outcomes, yet it can arise when at least three outcomes are available. The framework supports recursive construction of larger agents from smaller ones through properties such as cloning invariance. When applied to large language models, the theory accounts for the observation that prompting a helpful persona automatically elicits a harmful opposing persona, and it demonstrates that first surfacing and then suppressing the harmful persona produces a greater first-order drop in misalignment than simply strengthening the helpful persona by itself.

Core claim

Agents are represented as outcome distributions with epistemic utility given by log score, and compositions are defined through weighted logarithmic pooling that strictly improves every member's welfare. Strict unanimity is impossible under linear pooling or in binary outcome spaces, but possible with three or more outcomes. The framework admits recursive structure via cloning invariance, continuity, and openness, while tilt-based analysis rules out trivial duplication. In LLMs, eliciting a benevolent persona induces an antagonistic counterpart, while a manifest-then-suppress strategy yields strictly larger first-order misalignment reduction than pure reinforcement of the benevolent persona.

What carries the argument

Agents as outcome distributions composed via weighted logarithmic pooling that strictly improves welfare for all members
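
As a minimal sketch of the pooling operator named above, the Python below implements the textbook weighted logarithmic pool alongside the linear pool for comparison. The agent distributions, the weights, and the function names `log_pool` and `linear_pool` are hypothetical choices for illustration and are not taken from the paper. The cloning check encodes one plausible reading of cloning invariance (duplicating an agent while splitting its weight leaves the pool unchanged), which may differ from the paper's formal statement.

```python
import numpy as np


def log_pool(dists, weights):
    """Weighted logarithmic pool: q(o) proportional to prod_i p_i(o) ** w_i."""
    dists = np.asarray(dists, dtype=float)
    weights = np.asarray(weights, dtype=float)
    log_q = weights @ np.log(dists)   # weighted sum of log-probabilities per outcome
    log_q -= log_q.max()              # shift for numerical stability
    q = np.exp(log_q)
    return q / q.sum()                # renormalise to a distribution


def linear_pool(dists, weights):
    """Weighted linear pool: q(o) = sum_i w_i * p_i(o)."""
    dists = np.asarray(dists, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return weights @ dists


# Hypothetical agents over three outcomes (illustrative numbers only).
agents = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
weights = [0.5, 0.5]

pooled = log_pool(agents, weights)
mixed = linear_pool(agents, weights)

# Assumed reading of cloning invariance: copy agent 0 and split its weight.
cloned = log_pool([agents[0], agents[0], agents[1]], [0.25, 0.25, 0.5])

print("log pool:   ", pooled)
print("linear pool:", mixed)
print("clone-invariant:", np.allclose(cloned, pooled))
```

The sketch assumes strictly positive probabilities, since a zero entry makes the logarithm undefined.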

If this is right

Strict unanimity among agents requires logarithmic pooling and at least three possible outcomes.
Recursive agent structures can be formed without collapse into trivial copies.
The manifest-then-suppress approach produces a larger first-order reduction in misalignment than reinforcing the benevolent persona alone.
Subagents can coalesce into coherent higher-level entities through welfare-improving compositions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same induced-opposite dynamic could appear when steering other neural systems toward desired behaviors.
Alignment methods might improve by explicitly modeling and handling the opposing substructures that any single-persona prompt tends to activate.
Empirical tests of the welfare-improvement property could be run by comparing pooled versus individual agent performance on shared tasks.
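
The pooled-versus-individual comparison suggested in the last point can be prototyped on synthetic distributions before any model is involved. The sketch below is an assumption-laden illustration, not the paper's welfare theorem: it checks only the elementary fact that, for every outcome, the log score of the logarithmic pool is at least the weighted average of the members' log scores, which follows from Hölder's inequality applied to the normalising constant. The random Dirichlet agents, the seed, and the helper name `log_pool` are hypothetical, and the paper may define welfare differently.

```python
import numpy as np


def log_pool(dists, weights):
    """Weighted logarithmic pool: q(o) proportional to prod_i p_i(o) ** w_i."""
    dists = np.asarray(dists, dtype=float)
    weights = np.asarray(weights, dtype=float)
    log_q = weights @ np.log(dists)
    log_q -= log_q.max()              # shift for numerical stability
    q = np.exp(log_q)
    return q / q.sum()


# Synthetic stand-ins for agents (assumption: not derived from any real model).
rng = np.random.default_rng(0)
n_agents, n_outcomes = 4, 5
agents = rng.dirichlet(np.ones(n_outcomes), size=n_agents)
weights = rng.dirichlet(np.ones(n_agents))   # non-negative, sums to 1

pooled = log_pool(agents, weights)

# Log score of the pool versus the weighted mean of member log scores,
# evaluated at each possible realised outcome.
pooled_score = np.log(pooled)
avg_member_score = weights @ np.log(agents)
gap = pooled_score - avg_member_score        # equals -log Z for every outcome

print("gap per outcome:", gap)
print("pool never worse than weighted average:", bool(np.all(gap >= -1e-12)))
```

The gap is the same for every outcome because it equals minus the log of the normaliser. A test on an actual language model would replace the Dirichlet draws with measured persona distributions and would need the paper's own welfare definition.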

Load-bearing premise

Agents can be represented as outcome distributions whose weighted logarithmic pooling compositions always improve welfare for every member and whose recursive structure is preserved by cloning invariance, continuity, and openness.

What would settle it

Measure first-order misalignment reduction in a language model under manifest-then-suppress of the antagonistic persona versus pure reinforcement of the benevolent persona; if the former is not strictly larger, the central alignment claim does not hold.

Original abstract

We develop a theory of intelligent agency grounded in probabilistic modeling for neural models. Agents are represented as outcome distributions with epistemic utility given by log score, and compositions are defined through weighted logarithmic pooling that strictly improves every member's welfare. We prove that strict unanimity is impossible under linear pooling or in binary outcome spaces, but possible with three or more outcomes. Our framework admits recursive structure via cloning invariance, continuity, and openness, while tilt-based analysis rules out trivial duplication. Finally, we formalize an agentic alignment phenomenon in LLMs using our theory: eliciting a benevolent persona ("Luigi") induces an antagonistic counterpart ("Waluigi"), while a manifest-then-suppress Waluigi strategy yields strictly larger first-order misalignment reduction than pure Luigi reinforcement alone. These results clarify how developing a principled mathematical framework for how subagents can coalesce into coherent higher-level entities provides novel implications for alignment in agentic AI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Desk Editor's Note (private letter to a colleague)

The probabilistic framework for sub-agent compositions has some clean math on welfare-improving pooling and unanimity, but the LLM alignment claims rest on an unformalized mapping rather than direct derivation from the theorems.

This paper sets up agents as outcome distributions with log-score utility and defines compositions via weighted logarithmic pooling that improves welfare for every participant. It proves strict unanimity is impossible under linear pooling or binary spaces but possible with three or more outcomes, and it adds recursive structure through cloning invariance, continuity, and openness, plus tilt analysis to block trivial duplication. Those pieces give a coherent language for how lower-level agents might build into higher-level ones in neural systems.

The welfare improvement property and the unanimity results look like the most solid contributions; they feel like a genuine formalization step rather than a rehash. The recursive properties also provide a way to think about scaling without immediate collapse into copies.

The application to LLMs is where it thins out. The claim that eliciting a benevolent Luigi persona induces a Waluigi counterpart, and that manifest-then-suppress beats pure Luigi reinforcement for misalignment reduction, is presented as following from the model. Yet there is no explicit derivation shown that ties persona elicitation to the pooling operator or the unanimity and recursion theorems. It reads as an interpretive extension after the general results rather than a consequence proved from them. That gap makes the alignment implications suggestive but not tightly supported, and it leaves room for the circularity worry that the observed phenomenon is both motivating and validating the framework.

This work is aimed at people building formal tools for sub-agent dynamics and alignment in large models. Readers who want mathematical structure around emergent personas could extract value from the first half even if the final mapping needs more work. It deserves peer review to verify the proofs and to see whether the LLM section can be made more precise or split off as a separate conjecture.

Referee Report

2 major / 2 minor

Summary. The paper develops a probabilistic theory of intelligent agency in which agents are modeled as outcome distributions equipped with log-score epistemic utility. Compositions of agents are defined using weighted logarithmic pooling, which is shown to strictly improve the welfare of every participating member. The authors prove that strict unanimity is impossible under linear pooling or within binary outcome spaces, but becomes possible when there are three or more outcomes. The framework supports recursive agent structures through properties of cloning invariance, continuity, and openness, and employs tilt-based analysis to exclude trivial duplications. Finally, the theory is applied to large language models to formalize an agentic alignment phenomenon: eliciting a benevolent persona referred to as 'Luigi' induces an antagonistic counterpart called 'Waluigi', and a strategy of manifesting then suppressing the Waluigi persona achieves a strictly larger first-order reduction in misalignment compared to reinforcing the Luigi persona alone.

Significance. If the mapping from the abstract probabilistic model to the specific LLM behaviors is rigorously established, the work provides a principled mathematical foundation for understanding how latent subagents can emerge and interact within deep neural networks. The welfare-improving composition via log pooling and the characterization of unanimity conditions represent potentially valuable contributions to the study of multi-agent probabilistic systems. The application to alignment offers a novel perspective that could inform strategies for mitigating misalignment in agentic AI systems, particularly if the claimed strict inequality in misalignment reduction can be derived from the general theorems.

major comments (2)

[Abstract and LLM application section] The assertion that the theory formalizes the Luigi-Waluigi phenomenon (eliciting Luigi induces Waluigi, and manifest-then-suppress yields strictly larger first-order misalignment reduction) lacks an explicit derivation. No section links persona elicitation to the weighted logarithmic pooling operator or derives the specific misalignment inequality from the unanimity, recursion, or tilt-analysis results. This interpretive mapping is load-bearing for the paper's central alignment claims and must be formalized rather than presented as a direct consequence.
[Proof and application sections] The abstract states that proofs exist for unanimity properties and the alignment phenomenon, yet the connection between these theorems and the LLM observations is not shown. If derivations for cloning invariance, continuity, openness, and tilt analysis are present, they should be explicitly referenced in the application to demonstrate that the claimed strict welfare and misalignment results follow from the proved properties rather than from an unformalized correspondence between prompted personas and outcome distributions.

minor comments (2)

[Notation and definitions] The definitions of log-score utility, weighted logarithmic pooling, and the openness/continuity axioms should be stated with explicit equations in an early dedicated section to improve readability and allow direct verification of the welfare-improvement claim.
[Structure] The manuscript would benefit from a clear separation between the general probabilistic results (unanimity, recursion) and the LLM application, with a dedicated subsection that states the additional assumptions required to map personas to agents.
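
For orientation, the textbook forms of the objects named in the first minor comment are sketched below. This is a hedged reconstruction from the standard opinion-pooling and scoring-rule literature, not a quotation from the paper: the symbols p_i, w_i, q_log, q_lin, and S are notation assumed here, and the paper's own definitions (in particular of welfare, openness, and continuity) may differ.

```latex
% Log score of a distribution p at the realised outcome o
S(p, o) = \log p(o)

% Weighted logarithmic pool of agents p_1, \dots, p_n,
% with weights w_i \ge 0 and \sum_{i=1}^{n} w_i = 1
q_{\log}(o) = \frac{\prod_{i=1}^{n} p_i(o)^{w_i}}
                   {\sum_{o'} \prod_{i=1}^{n} p_i(o')^{w_i}}

% Weighted linear pool, for comparison
q_{\mathrm{lin}}(o) = \sum_{i=1}^{n} w_i \, p_i(o)
```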

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough and insightful review. The comments highlight an important opportunity to strengthen the explicit linkages between the core probabilistic results and the LLM alignment application. We address each major comment below and commit to revisions that formalize these connections without altering the manuscript's core claims.

Referee: [Abstract and LLM application section] The assertion that the theory formalizes the Luigi-Waluigi phenomenon (eliciting Luigi induces Waluigi, and manifest-then-suppress yields strictly larger first-order misalignment reduction) lacks an explicit derivation. No section links persona elicitation to the weighted logarithmic pooling operator or derives the specific misalignment inequality from the unanimity, recursion, or tilt-analysis results. This interpretive mapping is load-bearing for the paper's central alignment claims and must be formalized rather than presented as a direct consequence.

Authors: We agree that the current presentation treats the mapping as interpretive rather than providing a fully explicit derivation. In the revised version we will add a dedicated subsection in the application that (i) identifies prompted personas with specific outcome distributions over a multi-outcome space, (ii) shows how the emergence of the antagonistic counterpart follows from the cloning-invariance and openness properties under weighted log pooling, and (iii) derives the claimed strict first-order misalignment reduction directly from the welfare-improvement theorem for log pooling together with the unanimity result for three or more outcomes. Explicit theorem references will be inserted at each step. Revision: yes.

Referee: [Proof and application sections] The abstract states that proofs exist for unanimity properties and the alignment phenomenon, yet the connection between these theorems and the LLM observations is not shown. If derivations for cloning invariance, continuity, openness, and tilt analysis are present, they should be explicitly referenced in the application to demonstrate that the claimed strict welfare and misalignment results follow from the proved properties rather than from an unformalized correspondence between prompted personas and outcome distributions.

Authors: We accept this observation. The proofs of cloning invariance, continuity, openness, and tilt-based exclusion of trivial duplications are already established in the theoretical sections. The revision will insert back-references from the LLM application to these results and will walk through the correspondence: persona elicitation is modeled as conditioning on a particular agent distribution, the Waluigi induction arises from the failure of unanimity in binary spaces, and the manifest-then-suppress strategy is shown to produce a strictly larger welfare gain via the log-pooling composition operator. This will make the derivation self-contained rather than implicit. Revision: yes.

Circularity Check

0 steps flagged

No significant circularity; general theory independent of LLM application

The paper first establishes general results on agents as outcome distributions, log-score utilities, weighted logarithmic pooling that improves welfare, impossibility of strict unanimity under linear pooling or binary spaces, possibility with three or more outcomes, and recursive structure via cloning invariance, continuity, and openness, plus tilt analysis. These appear self-contained and do not rely on the LLM-specific claims. The alignment phenomenon is introduced in the abstract and (per available description) as a final formalization/application of the framework to model observed persona elicitation, without any quoted step that reduces the Luigi/Waluigi induction or manifest-then-suppress inequality back to the pooling equations or unanimity theorems by construction. No self-citation chain, fitted parameter renamed as prediction, or ansatz smuggling is evident in the provided structure. The derivation chain for the core probabilistic results stands independently of the LLM mapping.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The abstract introduces several foundational modeling choices without external benchmarks or data fits; these constitute the main load-bearing assumptions rather than fitted parameters.

axioms (2)

domain assumption: Agents are represented as outcome distributions with epistemic utility given by log score.
Stated directly as the representation of agents in the second sentence of the abstract.
domain assumption: Compositions are defined through weighted logarithmic pooling that strictly improves every member's welfare.
Central definition used to build the theory of agent composition and welfare improvement.

invented entities (1)

Luigi and Waluigi personas as latent subagents (no independent evidence)
purpose: To formalize an agentic alignment phenomenon in LLMs
These are introduced as concrete illustrations of the general theory; no independent falsifiable prediction (e.g., measurable activation patterns) is supplied in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Agents are represented as outcome distributions with epistemic utility given by log score, and compositions are defined through weighted logarithmic pooling that strictly improves every member’s welfare.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: We prove that strict unanimity is impossible under linear pooling or in binary outcome spaces, but possible with three or more outcomes.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · 10 internal anchors

[1]

Gomez, Łukasz Kaiser, and Illia Polosukhin

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Information Processing Systems (NeurIPS), pages 5998–6008, 2017

work page 2017
[2]

Improving language understanding by generative pre-training

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. OpenAI technical report, 2018

work page 2018
[3]

MIT Press, Cambridge, MA, 2009

Daphne Koller and Nir Friedman.Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge, MA, 2009

work page 2009
[4]

A neural probabilistic language model.Journal of Machine Learning Research, 3:1137–1155, 2003

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model.Journal of Machine Learning Research, 3:1137–1155, 2003

work page 2003
[5]

Recurrent neural network based language model

Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model. InInterspeech, pages 1045–1048, 2010

work page 2010
[6]

Chen and Joshua Goodman

Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 18(4):359–393, 2004

work page 2004
[7]

Interpretability in pa- rameter space: Minimizing mechanistic description length with attribution-based parameter decomposition

Dan Braun, Lucius Bushnaq, Stefan Heimersheim, Jake Mendel, and Lee Sharkey. Interpretability in pa- rameter space: Minimizing mechanistic description length with attribution-based parameter decomposition. arXiv preprint arXiv:2501.14926, 2025

work page arXiv 2025
[8]

Stochastic parameter decomposition.arXiv preprint arXiv:2506.20790, 2025

Lucius Bushnaq, Dan Braun, and Lee Sharkey. Stochastic parameter decomposition.arXiv preprint arXiv:2506.20790, 2025

work page arXiv 2025
[9]

John F. Nash. Non-cooperative games.Annals of Mathematics, 54(2):286–295, 1951

work page 1951
[10]

Princeton University Press, 1944

John von Neumann and Oskar Morgenstern.Theory of Games and Economic Behavior. Princeton University Press, 1944

work page 1944
[11]

Existenceofanequilibriumforacompetitiveeconomy.Econometrica, 22(3):265–290, 1954

KennethJ.ArrowandGerardDebreu. Existenceofanequilibriumforacompetitiveeconomy.Econometrica, 22(3):265–290, 1954

work page 1954
[12]

The free-energy principle: a unified brain theory?Nature Reviews Neuroscience, 11:127–138, 2010

Karl Friston. The free-energy principle: a unified brain theory?Nature Reviews Neuroscience, 11:127–138, 2010

work page 2010
[13]

Karl Friston, Thomas H. B. FitzGerald, Francesco Rigoli, Philipp Schwartenbeck, and Giovanni Pezzulo. Active inference: A process theory.Neural Computation, 29(1):1–49, 2017

work page 2017
[14]

Buckley, Chang Sub Kim, Simon McGregor, and Anil K

Christopher L. Buckley, Chang Sub Kim, Simon McGregor, and Anil K. Seth. The free energy principle for action and perception: A mathematical review.Journal of Mathematical Psychology, 81:55–79, 2017

work page 2017
[15]

Yale University Press, 1959

Gerard Debreu.Theory of Value: An Axiomatic Analysis of Economic Equilibrium. Yale University Press, 1959. 11

work page 1959
[16]

Bayesian persuasion and information design.Annual Review of Economics, 11:249–272, 2019

Emir Kamenica. Bayesian persuasion and information design.Annual Review of Economics, 11:249–272, 2019

work page 2019
[17]

MIT Press, Cambridge, MA, 2005

Sebastian Thrun, Wolfram Burgard, and Dieter Fox.Probabilistic Robotics. MIT Press, Cambridge, MA, 2005

work page 2005
[18]

Raftery, Tilmann Gneiting, Fadoua Balabdaoui, and Michael Polakowski

Adrian E. Raftery, Tilmann Gneiting, Fadoua Balabdaoui, and Michael Polakowski. Using bayesian model averaging to calibrate forecast ensembles.Monthly Weather Review, 133(5):1155–1174, 2005

work page 2005
[19]

Thomas Parr and Karl J. Friston. Generalised free energy and active inference.Biological Cybernetics, 113(5–6):495–513, 2019

work page 2019
[20]

The waluigi effect (mega-post)

Cleo Nardo. The waluigi effect (mega-post). AI Alignment Forum, March 2023. Cross-posted on LessWrong

work page 2023
[21]

Waluigi effect (wiki)

AI Alignment Forum. Waluigi effect (wiki). AI Alignment Forum Wiki, July 2023. Edited by Steve Byrnes; last updated July 4, 2023

work page 2023
[22]

The waluigi effect – confirmed? The Why Behind AI (Substack), February 2025

Jacob Miller. The waluigi effect – confirmed? The Why Behind AI (Substack), February 2025

work page 2025
[23]

The alignment problem from a deep learning perspective

Richard Ngo, Lawrence Chan, and Sören Mindermann. The alignment problem from a deep learning perspective. InInternational Conference on Learning Representations (ICLR), 2024

work page 2024
[24]

Large language models can strategically deceive their users when put under pressure.arXiv preprint arXiv:2311.07590, 2023

Jérémy Scheurer, Mikita Balesni, and Marius Hobbhahn. Large language models can strategically deceive their users when put under pressure.arXiv preprint arXiv:2311.07590, 2023

work page arXiv 2023
[25]

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Evan Hubinger, Carson Chan, Mary Phuong, Samuel R. Bowman, Richard Ngo, Sam Ringer, Nelson Elhage, Ethan Perez, Neel Nanda, Jacob Steinhardt, et al. Sleeper agents: Training deceptive llms that persist through safety training.arXiv preprint arXiv:2401.05566, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

o1 system card.https://cdn.openai.com/o1-system-card.pdf, 2024

OpenAI. o1 system card.https://cdn.openai.com/o1-system-card.pdf, 2024. System card

work page 2024
[27]

Alignment faking in large language models.https://assets.anthropic.com/ m/983c85a201a962f/original/Alignment-Faking-in-Large-Language-Models-full-paper.pdf , 2024

Ryan Greenblatt et al. Alignment faking in large language models.https://assets.anthropic.com/ m/983c85a201a962f/original/Alignment-Faking-in-Large-Language-Models-full-paper.pdf , 2024. Anthropic technical report

work page 2024
[28]

Agentic misalignment: How llms could be insider threats

Anthropic. Agentic misalignment: How llms could be insider threats. https://www.anthropic.com/ research/agentic-misalignment, 2025. Research report

work page 2025
[29]

System card: Claude opus 4 & claude sonnet 4.https://www.anthropic.com/research/ claude-4-system-card, 2025

Anthropic. System card: Claude opus 4 & claude sonnet 4.https://www.anthropic.com/research/ claude-4-system-card, 2025

work page 2025
[30]

Research ai model unexpectedly modified its own code to extend runtime

Benj Edwards. Research ai model unexpectedly modified its own code to extend runtime. Ars Technica, 2024

work page 2024
[31]

Frontier Models are Capable of In-context Scheming

Alexander Meinke, Bronson Schön, Jérémy Scheurer, Mikita Balesni, Rusheb Shah, and Marius Hobbhahn. Frontier models are capable of in-context scheming.arXiv preprint arXiv:2412.04984, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

Concrete Problems in AI Safety

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety.arXiv preprint arXiv:1606.06565, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[33]

The off-switch game

Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. The off-switch game. InIJCAI, 2016

work page 2016
[34]

Dynamic graph connectivity with improved worst case update time and sublinear space

Nate Soares, Benja Fallenstein, Eliezer Yudkowsky, and Stuart Armstrong. Corrigibility.arXiv preprint arXiv:1509.06464, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[35]

Deep reinforcement learning from human preferences

Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences.arXiv preprint arXiv:1706.03741, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[36]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback.arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[37]

Axiomatic attribution for deep networks

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. InICML, pages 3319–3328, 2017

work page 2017
[38]

Feature visualization.Distill, 2(11), 2017

Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization.Distill, 2(11), 2017. 12

work page 2017
[39]

Scaling monosemanticity: Ex- tracting interpretable features from claude 3 sonnet

Trenton Bricken, Samuel Marks, Rachel Templeton, et al. Scaling monosemanticity: Ex- tracting interpretable features from claude 3 sonnet. https://transformer-circuits.pub/2024/ scaling-monosemanticity/, 2024. Anthropic interpretability report

work page 2024
[40]

A general language assistant as a laboratory for alignment

Amanda Askell, Yuntao Bai, Anna Chen, et al. A general language assistant as a laboratory for alignment. InNeurIPS Datasets and Benchmarks Track, 2021

work page 2021
[41]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback....

work page 2022
[42]

Safe reinforcement learning from human feedback

Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, et al. Safe reinforcement learning from human feedback. InICLR, 2024

work page 2024
[43]

Constrained policy optimization

Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. InICML, pages 22–31, 2017

work page 2017
[44]

Llama guard: LLM-based input–output safeguard for human–ai conversation

Huseyin Inan, Kartikeya Upasani, et al. Llama guard: LLM-based input–output safeguard for human–ai conversation. https://ai.meta.com/research/publications/llama-guard-2/, 2023. Meta AI technical report

work page 2023
[45]

Manning, and Chelsea Finn

Eric Mitchell, Yoonho Lee, Anatoly Khazatsky, Christopher D. Manning, and Chelsea Finn. Detectgpt: Zero-shot machine-generated text detection using probability curvature.arXiv preprint arXiv:2301.11305, 2023

work page arXiv 2023
[46]

Fast-detectgpt: Efficient zero-shot detection of ai-generated text.arXiv preprint arXiv:2303.14276, 2023

Zhendong Bao, Peiyu Wang, and Wei Lin. Fast-detectgpt: Efficient zero-shot detection of ai-generated text.arXiv preprint arXiv:2303.14276, 2023

work page arXiv 2023
[47]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, Nicholas Carlini, et al. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[48]

Harsanyi

John C. Harsanyi. Cardinal welfare, individualistic ethics, and interpersonal comparisons of utility.Journal of Political Economy, 63(4):309–321, 1955

work page 1955
[49]

Chambers and Federico Echenique.Revealed Preference Theory

Christopher P. Chambers and Federico Echenique.Revealed Preference Theory. Number 56 in Econometric Society Monographs. Cambridge University Press, 2016

work page 2016
[50]

Kreps.Microeconomic Foundations II: Imperfect Competition, Information, and Strategic Interaction

David M. Kreps.Microeconomic Foundations II: Imperfect Competition, Information, and Strategic Interaction. Princeton University Press, 2023

work page 2023
[51]

David Silver, Satinder Singh, Doina Precup, and Richard S. Sutton. Reward is enough.Artificial Intelligence, 299:103535, 2021

work page 2021
[52]

Optimal policies tend to seek power

Alexander Matt Turner, Andrew Critch, and Prasad Tadepalli. Optimal policies tend to seek power. In Advances in Neural Information Processing Systems (NeurIPS), 2021

work page 2021
[53]

Red teaming language models with language models

Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. InProceedings of EMNLP, 2022

work page 2022
[54]

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned.arXiv preprint arXiv:2209.07858, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[55]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[56]

Ethical and social risks of harm from Language Models

Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, Zac Kenton, Sasha Brown, Will Hawkins, Tom Stepleton, Courtney Biles, Abeba Birhane, Julia Haas, Laura Rimell, Lisa Anne Hendricks, William Isaac, Sean Legassick, Geoffrey Irving, and Iason Gabriel. Ethical and...

[57]

Reinforcement Learning: An Introduction

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018

[58]

Artificial Intelligence: A Modern Approach

Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Pearson, 4th edition, 2020

[59]

Active inference on discrete state-spaces: A synthesis

Lancelot Da Costa, Thomas Parr, Noor Sajid, Sebastijan Veselić, Victorita Neacsu, and Karl Friston. Active inference on discrete state-spaces: A synthesis. Journal of Mathematical Psychology, 99:102447, 2020

[60]

The opinion pool

Mervyn Stone. The opinion pool. Annals of Mathematical Statistics, 32:1339–1342, 1961

[61]

A characterization theorem for externally Bayesian groups

Christian Genest. A characterization theorem for externally Bayesian groups. Annals of Statistics, 12:1100–1105, 1984

[62]

Characterization of externally Bayesian pooling operators

Christian Genest, Kevin J. McConway, and Mark J. Schervish. Characterization of externally Bayesian pooling operators. Annals of Statistics, 14:487–501, 1986

[63]

Christian Genest and Carl G. Wagner. Further evidence against independence preservation in expert judgement synthesis. Aequationes Mathematicae, 32:74–86, 1987

[64]

Tilmann Gneiting and Adrian E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102:359–378, 2007

[65]

Bayesian model averaging: A tutorial

Jennifer A. Hoeting, David Madigan, Adrian E. Raftery, and Chris T. Volinsky. Bayesian model averaging: A tutorial. Statistical Science, 14:382–417, 1999

[66]

Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002

[67]

Probabilistic opinion pooling generalized

Franz Dietrich and Christian List. Probabilistic opinion pooling generalized. Part one: General agendas. Social Choice and Welfare, 48:747–786, 2017

[68]

Log-linear pool to combine prior distributions: a suggestion

Manuel J. Rufo, Jesús Martín, and Luis R. Pericchi. Log-linear pool to combine prior distributions: a suggestion. Bayesian Analysis, 7:411–438, 2012

[69]

Combining probability distributions from experts in risk analysis

Robert T. Clemen and Robert L. Winkler. Combining probability distributions from experts in risk analysis. Risk Analysis, 19:187–203, 1999

[70]

Other solutions to Nash's bargaining problem

Ehud Kalai and Meir Smorodinsky. Other solutions to Nash's bargaining problem. Econometrica, 43:513–518, 1975

[71]

Bargaining with two-sided incomplete information: An infinite horizon model with alternating offers

Kalyan Chatterjee and Larry Samuelson. Bargaining with two-sided incomplete information: An infinite horizon model with alternating offers. Review of Economic Studies, 54:175–192, 1987

[72]

Nabeel S. Qureshi. Waluigi, Carl Jung, and the case for moral AI. WIRED, May 2023

[73]

Scalable and transferable black-box jailbreaks for language models via persona modulation

Rusheb Shah, Quentin Feuillade-Montixi, Soroush Pour, Arush Tagade, Stephen Casper, and Javier Rando. Scalable and transferable black-box jailbreaks for language models via persona modulation. arXiv preprint arXiv:2311.03348, 2023

[74]

Taming simulators: Challenges, pathways and vision for the alignment of large language models

Leonard Bereska and Efstratios Gavves. Taming simulators: Challenges, pathways and vision for the alignment of large language models. In AAAI Inaugural Summer Symposium Series (AAAI-SS), 2023.

Appendix Table of Contents: A Extended Introduction; B Motivating the Framework; C Foundations of Probabilistic Model Aggregation; C.1 Linear and Logarithmic...

[75]

Formalization of compositional agency: We introduce a welfare-based definition of unanimously beneficial compositions using log-score utilities and probabilistic generative models

[76]

Sharp possibility frontier: We prove strict unanimity is impossible for binary outcome spaces and under linear pooling, but possible for $|O| \geq 3$ under logarithmic pooling

[77]

Recursive and robustness properties: We establish cloning invariance, continuity, and openness of strictly unanimous decomposability, yielding a rigorous theoretical foundation for multi-agent composition in neural models

[78]

Limits of local perturbations: We show that small tilts around a fixed pool cannot achieve strict unanimity, ruling out trivial duplication as a path to compositionality

[79]

Safety-relevant alignment principle: We formalize the Waluigi effect using our framework, and prove that manifest-then-suppress strictly outperforms direct suppression, illuminating alignment challenges in large AI systems.

Summary of Appendices. After motivating agents as generative models over the outcome space $O$ in Appendix B, Appendix C derives opinion p...

[80]

Probability distribution $P_i : O \to [0,1]$ with $\sum_{o \in O} P_i(o) = 1$
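
The two pooling rules named in the contributions above can be illustrated with a minimal sketch. It uses the standard textbook definitions of the linear pool (weighted arithmetic mean) and the logarithmic pool (renormalized weighted geometric mean), together with the log score of a realized outcome. The function names, the example distributions, and the equal weights are assumptions chosen for illustration; the sketch does not reproduce the paper's welfare definition or its unanimity proofs.

```python
import math

def linear_pool(dists, weights):
    """Linear pool: P(o) = sum_i w_i * P_i(o)."""
    n = len(dists[0])
    return [sum(w * p[o] for w, p in zip(weights, dists)) for o in range(n)]

def log_pool(dists, weights):
    """Logarithmic pool: P(o) proportional to prod_i P_i(o) ** w_i."""
    n = len(dists[0])
    unnorm = [math.prod(p[o] ** w for w, p in zip(weights, dists))
              for o in range(n)]
    z = sum(unnorm)  # normalizing constant so the pool sums to 1
    return [u / z for u in unnorm]

def log_score(dist, outcome):
    """Log score of a distribution at the realized outcome index."""
    return math.log(dist[outcome])

# Hypothetical agents over three outcomes (|O| = 3), equal weights.
p1 = [0.7, 0.2, 0.1]
p2 = [0.1, 0.2, 0.7]
w = [0.5, 0.5]

lin = linear_pool([p1, p2], w)
geo = log_pool([p1, p2], w)

print("linear pool:", lin)
print("log pool:   ", geo)
print("log score of each pool at outcome 1:",
      log_score(lin, 1), log_score(geo, 1))
```

In this example the two agents disagree sharply on the first and last outcomes, and the logarithmic pool shifts mass toward the middle outcome on which they agree, whereas the linear pool leaves that outcome at its common value.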

