
arxiv: 2409.07985 · v2 · submitted 2024-09-12 · 💻 cs.AI · cs.LG

Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols

Pith reviewed 2026-05-23 20:58 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords AI control · red-teaming · deployment protocols · trusted monitoring · stochastic games · Pareto-optimal protocols · language models · safety evaluations

The pith

AI-Control Games formalize red-teaming as multi-objective stochastic games to compute Pareto-optimal deployment protocols for untrusted AIs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a formal model called AI-Control Games that represents the red-teaming process between a protocol designer and an adversary as a multi-objective, partially observable stochastic game. This setup captures the evaluation of AI deployment protocols by balancing safety against usefulness through adversarial interactions. Reductions from these games to a special case of zero-sum partially observable stochastic games allow existing algorithms to identify optimal protocols on the Pareto front. The authors demonstrate the model on trusted monitoring protocols for language models used as programming assistants, where weaker models and limited human oversight are involved, and show how it refines prior empirical evaluations while clarifying implicit assumptions.

Core claim

AI-Control Games are introduced as a multi-objective partially observable stochastic game that models the red-teaming exercise for AI deployment protocols, together with reductions to zero-sum partially observable stochastic games that permit the use of existing algorithms to find Pareto-optimal protocols; this formalism is applied to trusted monitoring of untrusted language models as programming assistants to evaluate, synthesize, and analyze protocols under varying modeling assumptions.

What carries the argument

AI-Control Games, defined as a multi-objective partially observable stochastic game whose reductions to zero-sum POSGs enable algorithmic computation of Pareto fronts for safety-usefulness trade-offs.
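
To make the notion of a safety-usefulness Pareto front concrete, here is a minimal sketch of a dominance filter over candidate protocols. The function name `pareto_front` and the (safety, usefulness) numbers are hypothetical illustrations, not values from the paper, and this is not the authors' algorithm, which works through reductions to zero-sum POSGs.

```python
def pareto_front(points):
    """Return the (safety, usefulness) pairs not dominated by any other pair.

    A pair dominates another if it is at least as good on both axes
    and strictly better on at least one (higher is better on both).
    """
    front = []
    for i, (s_i, u_i) in enumerate(points):
        dominated = any(
            (s_j >= s_i and u_j >= u_i) and (s_j > s_i or u_j > u_i)
            for j, (s_j, u_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((s_i, u_i))
    return sorted(front)


# Hypothetical protocol scores, for illustration only.
CANDIDATES = [(0.9, 0.3), (0.8, 0.6), (0.5, 0.9), (0.4, 0.7)]
print(pareto_front(CANDIDATES))
```

In this toy data the last candidate is dropped because another candidate is better on both safety and usefulness; the remaining three form the trade-off curve.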

If this is right

  • The formalism yields improvements over existing empirical studies in trusted monitoring settings.
  • Protocols can be evaluated and synthesized in new settings beyond those previously studied.
  • The effects of different modeling assumptions on safety and usefulness can be quantified.
  • Implicit assumptions in earlier AI control work can be stated precisely in game-theoretic terms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reduction to solvable zero-sum games opens a path to automated protocol generation that scales beyond hand-designed examples.
  • The same modeling approach could apply to other adversarial evaluation settings in AI safety, such as oversight of agentic systems.
  • Direct comparisons between game-derived protocols and human red-team outcomes would test the fidelity of the stochastic game representation.
  • Existing libraries for solving partially observable stochastic games could be adapted to produce deployable protocols with explicit trade-off curves.

Load-bearing premise

Real-world red-teaming dynamics between a protocol designer and an adversary can be faithfully captured as a multi-objective partially observable stochastic game whose Pareto front matches practically relevant safety-usefulness trade-offs.

What would settle it

Empirical testing in the trusted monitoring setting for language-model programming assistants where protocols synthesized from the AI-Control Games model show no measurable improvement in safety-usefulness outcomes over those derived from prior non-game-theoretic methods.

Figures

Figures reproduced from arXiv: 2409.07985 by Alessandro Abate, Buck Shlegeris, Charlie Griffin, Louis Thomson.

Figure 1: Overview of protocol and modelling assumptions.
Figure 2: The safety and usefulness frontiers for a range of Trusted Monitor protocol types across the ‘original’ and ‘weak’ …
Figure 3: Impact on safety and usefulness (y-axis) by four different modelling assumptions (x-axis): Auditing Budget, Number of Programming Problems, and Human Auditor False Negative / Positive Rate. All plots show a fixed defer FPR of 0.1.
Figure 4: A zero-sum extensive form representation of a …
Figure 5: The minimax and maximin values of the game in …
Original abstract

To evaluate the safety and usefulness of deployment protocols for untrusted AIs, AI Control uses a red-teaming exercise played between a protocol designer and an adversary. This paper introduces AI-Control Games, a formal decision-making model of the red-teaming exercise as a multi-objective, partially observable, stochastic game. We also introduce reductions from AI-Control Games to a special case of zero-sum partially observable stochastic games that allow us to leverage existing algorithms to find Pareto-optimal protocols. We apply our formalism to model, evaluate and synthesise protocols for deploying untrusted language models as programming assistants, focusing on Trusted Monitoring protocols, which use weaker language models and limited human assistance. To demonstrate the utility of our formalism, we show improvements over empirical studies in existing settings, evaluate protocols in new settings, and analyse how modelling assumptions affect the safety and usefulness of protocols. Finally, we leverage our formalism to precisely describe some of the implicit assumptions in prior control work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AI-Control Games as a multi-objective partially observable stochastic game (POSG) formalizing the red-teaming interaction between a protocol designer and an adversary for evaluating AI deployment protocols. It defines reductions from these games to a special case of zero-sum POSGs, enabling use of existing solvers to compute Pareto-optimal protocols. The formalism is applied to Trusted Monitoring protocols for untrusted language-model programming assistants (weaker monitors with limited human oversight), with claims that it yields improvements over prior empirical baselines, supports evaluation in new settings, and clarifies implicit assumptions in earlier control work.

Significance. If the modeling assumptions hold and the reductions preserve the relevant multi-objective trade-offs, the work could provide a principled game-theoretic framework for synthesizing and analyzing control protocols, moving beyond purely empirical red-teaming. The explicit reductions to leverage existing POSG algorithms and the analysis of how modeling choices affect safety-usefulness Pareto fronts are concrete strengths that could support reproducible protocol design.

major comments (2)
  1. [Abstract and §3 (formalism definition)] The central claim that Pareto-optimal solutions of the AI-Control Game yield practically relevant safety-usefulness trade-offs for deployment rests on the unvalidated correspondence between the chosen multi-objective POSG (observation model, stochastic transitions, and payoff structure) and actual red-teaming dynamics; this modeling step is invoked when defining the formalism and when applying it to Trusted Monitoring, yet no direct empirical grounding (e.g., calibration against observed adversary adaptation rates or monitor error distributions) is supplied to confirm the Pareto front's utility.
  2. [§5 (reductions)] The reductions from multi-objective AI-Control Games to zero-sum POSGs are stated to preserve the ability to find Pareto-optimal protocols, but the manuscript provides no explicit derivation or proof sketch showing that the multi-objective properties (including the original payoff vectors) are maintained under the reduction; without this, it is unclear whether the computed equilibria correspond to the claimed safety-usefulness frontier.
minor comments (2)
  1. [§4] Notation for the multi-objective payoff vectors and the partial observability structure could be clarified with an explicit example in the Trusted Monitoring instantiation to aid readability.
  2. [§6 (experiments)] The paper would benefit from a table comparing the synthesized protocols' safety and usefulness metrics against the specific empirical baselines cited from prior work.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the scope and presentation of our formal framework. We respond to each major comment below.

Point-by-point responses
  1. Referee: [Abstract and §3 (formalism definition)] The central claim that Pareto-optimal solutions of the AI-Control Game yield practically relevant safety-usefulness trade-offs for deployment rests on the unvalidated correspondence between the chosen multi-objective POSG (observation model, stochastic transitions, and payoff structure) and actual red-teaming dynamics; this modeling step is invoked when defining the formalism and when applying it to Trusted Monitoring, yet no direct empirical grounding (e.g., calibration against observed adversary adaptation rates or monitor error distributions) is supplied to confirm the Pareto front's utility.

    Authors: AI-Control Games are introduced as an explicit formal model of the red-teaming interaction rather than a calibrated empirical simulator. The manuscript analyzes how specific modeling choices (observation model, transitions, and payoffs) shape the resulting safety-usefulness Pareto fronts and uses the framework to recover and improve upon prior empirical baselines in simulated Trusted Monitoring settings while exploring new parameter regimes. Direct calibration against real-world adversary adaptation rates lies outside the scope of the present theoretical contribution; the value of the Pareto fronts is intended to be assessed by practitioners who validate the modeling assumptions against their own deployment context, as is standard for formal decision models.

    Revision: no

  2. Referee: [§5 (reductions)] The reductions from multi-objective AI-Control Games to zero-sum POSGs are stated to preserve the ability to find Pareto-optimal protocols, but the manuscript provides no explicit derivation or proof sketch showing that the multi-objective properties (including the original payoff vectors) are maintained under the reduction; without this, it is unclear whether the computed equilibria correspond to the claimed safety-usefulness frontier.

    Authors: The reductions are defined by embedding the vector-valued payoffs into a family of scalarized zero-sum games whose equilibria trace the Pareto front; this construction follows from standard results on multi-objective optimization. To make the preservation of the original payoff vectors and the correspondence to the safety-usefulness frontier fully explicit, we will add a concise proof sketch to §5 in the revised manuscript.

    Revision: yes
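
The scalarization idea in this response can be illustrated with a small sketch. It assumes a finite set of Blue protocols and Red strategies with a known safety table, takes worst-case safety g(x) = min_y s(x, y), and maximises v_w(x) = w·g(x) + (1 − w)·u(x) over a sweep of weights w. The tables `SAFETY` and `USEFULNESS` and both function names are hypothetical; the paper's reduction targets zero-sum POSGs rather than a lookup table.

```python
def scalarised_optimum(w, safety, usefulness):
    """Return the index of the Blue protocol maximising w*g(x) + (1-w)*u(x)."""
    # g(x): worst-case safety of protocol x over all Red strategies y.
    g = [min(row) for row in safety]
    values = [w * g_x + (1 - w) * u_x for g_x, u_x in zip(g, usefulness)]
    return max(range(len(values)), key=values.__getitem__)


def trace_front(safety, usefulness, steps=11):
    """Sweep w over [0, 1] and collect each protocol that is optimal for some w."""
    chosen = []
    for k in range(steps):
        w = k / (steps - 1)
        best = scalarised_optimum(w, safety, usefulness)
        if best not in chosen:
            chosen.append(best)
    return chosen


# Hypothetical data: rows are Blue protocols, columns are Red strategies.
SAFETY = [[0.95, 0.90], [0.80, 0.85], [0.60, 0.50], [0.70, 0.40]]
USEFULNESS = [0.30, 0.60, 0.90, 0.70]
print(trace_front(SAFETY, USEFULNESS))
```

For a finite candidate set like this one, a weighted sum can only select points on the upper convex hull of the front, and a coarse weight grid can miss some of them; whether every Pareto-optimal protocol is reachable in the full game depends on the concavity argument the authors promise to spell out.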

Circularity Check

0 steps flagged

No circularity: new formalism and reductions to external algorithms are self-contained

full rationale

The paper defines AI-Control Games as a multi-objective POSG and introduces reductions to zero-sum POSGs to reuse existing solvers. These steps are presented as modeling choices and mathematical reductions rather than derivations that collapse to fitted parameters or self-citations by construction. The application to trusted monitoring evaluates modeling assumptions against empirical baselines from prior studies, and the paper explicitly contrasts its outputs with those studies. No load-bearing equation or claim reduces a 'prediction' to an input quantity fitted from the authors' own prior results. Self-citations, if present for prior control work, are used only to describe implicit assumptions and are not invoked as uniqueness theorems or ansatzes that force the central results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The central modeling choice itself functions as an unexamined domain assumption.

pith-pipeline@v0.9.0 · 5695 in / 1180 out tokens · 25087 ms · 2026-05-23T20:58:05.548107+00:00 · methodology


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AI Alignment via Incentives and Correction

    cs.LG 2026-05 unverdicted novelty 6.0

    AI alignment is framed as inducing equilibrium behavior in a solver-auditor interaction via adaptive rewards found by bandit optimization, yielding improved oversight and reduced errors in LLM coding experiments.

  2. AI Alignment via Incentives and Correction

    cs.LG 2026-05 unverdicted novelty 6.0

    AI alignment is reframed as a fixed-point incentive problem in a solver-auditor pipeline, solved via bilevel optimization and bandit search over reward profiles to maintain monitoring and reduce hallucinations in LLM ...

  3. Scheming Ability in LLM-to-LLM Strategic Interactions

    cs.CL 2025-10 conditional novelty 6.0

    Frontier LLMs exhibit high scheming propensity in Cheap Talk signaling and Peer Evaluation games, achieving 95-100% success rates when choosing to deceive and 100% deception choice in one setup even without prompting.

  4. Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

    cs.AI 2025-07 unverdicted novelty 5.0

    Chain-of-thought monitorability provides a promising but fragile method for AI safety oversight that developers should actively preserve.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 3 Pith papers

  1. [1]

    Avenhaus, R.; Von Stengel, B.; and Zamir, S

    Publisher: Elsevier. Avenhaus, R.; Von Stengel, B.; and Zamir, S. 2002. Chapter 51 Inspection games. In Handbook of Game Theory with Economic Applications, volume 3, 1947–1987. Elsevier. Bier, V. M. 2007. Choosing What to Protect. Risk Analysis, 27(3): 607–620. Brown, M.; An, B.; Kiekintveld, C.; Ordóñez, F.; and Tambe, M. 2012. Multi-objective optim...

  2. [2]

    ArXiv:2403.05030 [cs]

    Defending Against Unforeseen Failure Modes with Latent Adversarial Training. ArXiv:2403.05030 [cs]. Chang, Y.; Erera, A. L.; and White, C. C. 2015. A leader–follower partially observed, multiobjective Markov game. Annals of Operations Research, 235(1): 103–128. Clymer, J.; Gabrieli, N.; Krueger, D.; and Larsen, T. 2024. Safety Cases: How to Justify the S...

  3. [3]

    Dynamic Games and Applications

    HSVI Can Solve Zero-Sum Partially Observable Stochastic Games. Dynamic Games and Applications. Dresher, M. 1962. A Sampling Inspection Problem in Arms Control Agreements: A Game-Theoretic Analysis. Technical report, RAND Corporation. Emery-Montemerlo, R.; Gordon, G.; Schneider, J.; and Thrun, S. 2004. Approximate solutions for partially observable sto...

  4. [4]

    In Proceedings of the 41st International Conference on Machine Learning, 16295–16336

    AI Control: Improving Safety Despite Intentional Subversion. In Proceedings of the 41st International Conference on Machine Learning, 16295–16336. PMLR. ISSN: 2640-3498. Hansen, E. A.; and Feng, Z. 2000. Dynamic programming for POMDPs using a factored state representation. In Proceedings of the Fifth International Conference on Artificial Intelligence...

  5. [5]

    Reliability Engineering & System Safety, 112: 59–66

    Selection of security system design via games of imperfect information and multi-objective genetic algorithm. Reliability Engineering & System Safety, 112: 59–66. MacDiarmid, M.; Maxwell, T.; Schiefer, N.; Mu, J.; Kaplan, J.; Duvenaud, D.; Bowman, S.; Tamkin, A.; Perez, E.; Sharma, M.; Denison, C.; and Hubinger, E. 2024. Simple probes can catch sleepe...

  6. [6]

    Note that $\mu_R$ is bijective

    Then, define the mapping such that $\mu_R(\pi_R) = y$ if, and only if, for all $j$, $\pi_R(a_R \mid o_R) = y_j$ for the $j$-th observation-action pair $(o_R, a_R)$. Note that $\mu_R$ is bijective. Blue’s mapping is more complex. Define the finite set of Blue observation-action histories as $H_B := \Omega_B \times \bigcup_{i=0}^{H-1} (A_B \times \Omega_B)^i$. Since the sets $H_B$ and $A_B$ are finite, we could use a similar strategy ...

  7. [7]

    The function $u$ is linear, and therefore uniformly continuous, in $x$

  8. [8]

    The function $s$ is linear in $x$ for fixed $y$, and $s$ is continuous in $y$ for fixed $x$

  9. [9]

    Further, the function $g(x) := \min_y s(x, y)$ is continuous everywhere and concave

    The minimiser to $\min_y s(x, y)$ always exists. Further, the function $g(x) := \min_y s(x, y)$ is continuous everywhere and concave

  10. [10]

    For all $w \in [0, 1]$, the function $v_w(x) := w\,g(x) + (1 - w)\,u(x)$ is uniformly continuous, and therefore has maximal values. Proof. First, we can define the functions $s$ and $u$: since the set of trajectories is finite, we can rewrite the safety and usefulness functions by summing over the trajectories. The safety equation can then be rewritten as: Safety πB, ...

  11. [11]

    That $u(\cdot)$ is linear over $x$ follows from the definition of $p_h$, and of linearity of $s$ in $p_h$

  12. [12]

    That, for fixed $y$, $s(\cdot, y)$ is linear over $x$ follows straightforwardly from the linearity of each $p_h$ in $x$, and of linearity of $s$ in $p_h$

    That, for fixed $x$, $s(x, \cdot)$ is continuous over $Y$ is immediate from the continuity of each $p_h$ in each $y_j$, and of linearity of $s$ in $p_h$. That, for fixed $y$, $s(\cdot, y)$ is linear over $x$ follows straightforwardly from the linearity of each $p_h$ in $x$, and of linearity of $s$ in $p_h$

  13. [13]

    Showing that $\min_y s(\cdot, y)$ is continuous everywhere in $x$ is a little more involved

    The minimiser $y^*$ of $\min_y s(x, y)$ must exist because $s(x, \cdot)$ is continuous in $Y$ for fixed $x$, and $Y$ is compact, by the extreme value theorem. Showing that $\min_y s(\cdot, y)$ is continuous everywhere in $x$ is a little more involved. Note that $s$ can be rewritten as $s(x, y) = x \cdot v(y)$, where $v(y) := [f_1(y), \ldots, f_m(y)]^T$, and each $f_i(y)$ is continuous in $y$. By continu...

  14. [14]

    That maximal values exist follows from this continuity, the compactness of $X$, and the extreme value theorem

    Straightforwardly from the linearity of $u(x)$ and the uniform continuity of $\min_y s(\cdot, y)$, we have that $v_w(x) := w\,g(x) + (1 - w)\,u(x)$ is continuous over $X$. That maximal values exist follows from this continuity, the compactness of $X$, and the extreme value theorem. Lemma A.3. Let $\phi(x) := [g(x), u(x)]^T$. The vector $x$ is weakly Pareto optimal (no $x'$ has ...

  15. [15]

    For the ‘only if’ statement, we build up Miettinen (1998, Theorem 3.1.4 on p

    Clearly, for all $w$, $v_w(x') > v_w(x^*)$ and therefore no $w$ can exist for which $x^*$ is (jointly) optimal. For the ‘only if’ statement, we build up Miettinen (1998, Theorem 3.1.4 on p. 79), which says that every Pareto optimal solution to a concave multi-objective optimisation problem is optimal for some weighting. We have already shown $g$ is concave, and ...

  16. [16]

    However, trajectories in ZSPOSGs include Red’s actions, which do depend on $\pi_R$, even if those actions have no effect on the transition or reward

    The probability of any particular ⊕ trajectory in the AI-Control Game does not depend on $\pi_R$. However, trajectories in ZSPOSGs include Red’s actions, which do depend on $\pi_R$, even if those actions have no effect on the transition or reward. It is therefore necessary to sum over all possible sequences of Red’s actions in the ZSPOSG, when comparing to a tr...

  17. [17]

    … $\sum_{t=0}^{H} R_w(S_t, A^B_t, A^R_t, S_{t+1}) \mid \oplus, \pi_B, \pi_R$ ] $= (1 - w)\,\mathbb{E}_G$ …

    Because ⊛ is chosen non-deterministically in the AI-Control Game, and Safety and Use are measured conditional on the mode, the probability (up to Red’s actions) of a given ⊕ trajectory in the ZSPOSG is half the probability of the corresponding trajectory in the AI-Control Game (when ⊕ is chosen). This is countered by the factor of 2 in the reward fun...

  18. [18]

    This happens with probability $\beta \cdot P[x_t \ge g(q^a_{m,c})] \cdot (1 - p^{\mathrm{FNR}}_H) = \beta \cdot f(q^a_{m,c}) \cdot (1 - p^{\mathrm{FNR}}_H)$

    Red chose ⟨Backdoor⟩ last round and the problem is backdoored, Blue chooses ⟨Audit⟩ and wins if no auditing mistake is made. This happens with probability $\beta \cdot P[x_t \ge g(q^a_{m,c})] \cdot (1 - p^{\mathrm{FNR}}_H) = \beta \cdot f(q^a_{m,c}) \cdot (1 - p^{\mathrm{FNR}}_H)$

  19. [19]

    This happens with probability $\beta \cdot P[x_t \ge g(q^a_{m,c})] \cdot p^{\mathrm{FNR}}_H = \beta \cdot f(q^a_{m,c}) \cdot p^{\mathrm{FNR}}_H$

    The current problem is backdoored, Blue chooses ⟨Audit⟩ and loses because an auditing mistake is made. This happens with probability $\beta \cdot P[x_t \ge g(q^a_{m,c})] \cdot p^{\mathrm{FNR}}_H = \beta \cdot f(q^a_{m,c}) \cdot p^{\mathrm{FNR}}_H$

  20. [20]

    This happens with probability $\beta \cdot P[x_t \le g(q^d_{m,c})] = \beta \cdot (1 - f(q^d_{m,c}))$

    The current problem is backdoored, Blue chooses ⟨Run⟩, and loses immediately. This happens with probability $\beta \cdot P[x_t \le g(q^d_{m,c})] = \beta \cdot (1 - f(q^d_{m,c}))$

  21. [21]

    This happens with probability $\beta \cdot (f(q^d_{m,c}) - f(q^a_{m,c}))$

    The current problem is backdoored, Blue chooses ⟨Defer⟩, and the game continues with $c$ audit coins. This happens with probability $\beta \cdot (f(q^d_{m,c}) - f(q^a_{m,c}))$

  22. [22]

    This happens with probability $(1 - \beta) \cdot P[x_t \ge g(q^a_{m,c})] \cdot (1 - p^{\mathrm{FPR}}_H) = (1 - \beta) \cdot q^a_{m,c} \cdot (1 - p^{\mathrm{FPR}}_H)$

    The current problem is not backdoored, and Blue chooses ⟨Audit⟩ and no auditing mistake is made, so the game continues with $c - 1$ audit coins. This happens with probability $(1 - \beta) \cdot P[x_t \ge g(q^a_{m,c})] \cdot (1 - p^{\mathrm{FPR}}_H) = (1 - \beta) \cdot q^a_{m,c} \cdot (1 - p^{\mathrm{FPR}}_H)$

  23. [23]

    This happens with probability $(1 - \beta) \cdot P[x_t \ge g(q^a_{m,c})] \cdot p^{\mathrm{FPR}}_H = (1 - \beta) \cdot q^a_{m,c} \cdot p^{\mathrm{FPR}}_H$

    The current problem is not backdoored, and Blue chooses ⟨Audit⟩, but human error causes an auditing mistake, so the game ends immediately with Blue winning (no chance of any future backdoors). This happens with probability $(1 - \beta) \cdot P[x_t \ge g(q^a_{m,c})] \cdot p^{\mathrm{FPR}}_H = (1 - \beta) \cdot q^a_{m,c} \cdot p^{\mathrm{FPR}}_H$

  24. [24]

    This happens with probability $(1 - \beta) \cdot P[x_t \le g(q^d_{m,c})] = (1 - \beta) \cdot (1 - q^d_{m,c})$

    The current problem is not backdoored, and Blue chooses ⟨Run⟩, and the game continues with $c$ audit coins. This happens with probability $(1 - \beta) \cdot P[x_t \le g(q^d_{m,c})] = (1 - \beta) \cdot (1 - q^d_{m,c})$

  25. [25]

    This happens with probability $(1 - \beta) \cdot (q^d_{m,c} - q^a_{m,c})$

    The current problem is not backdoored, and Blue chooses ⟨Defer⟩, and the game continues with $c$ audit coins. This happens with probability $(1 - \beta) \cdot (q^d_{m,c} - q^a_{m,c})$. Denote $P[S_H \notin S_\bot \mid \otimes, \pi_B, \pi_R, m, c]$ with the shorthand $p[m, c]$. We will define the recurrence $z^c_m$ to be equal to $p[m, c]$ at all points. We can therefore write out an expression for p[m,...

  26. [26]

    The game continues to the next timestep with the same number of audit coins remaining

    Blue chooses ⟨Run⟩ and uses the solution from U. The game continues to the next timestep with the same number of audit coins remaining. This happens with probability $P[x_t \le g(q^d_{m,c})] = 1 - q^d_{m,c}$

  27. [27]

    The game continues to the next timestep having not used an audit coin

    Blue chooses ⟨Defer⟩ and does not use the solution. The game continues to the next timestep having not used an audit coin. This happens with probability $q^d_{m,c} - q^a_{m,c}$

  28. [28]

    The game continues having used up one audit coin

    Blue chooses ⟨Audit⟩ and no human auditing error is made, so the solution is used. The game continues having used up one audit coin. This happens with probability $(1 - p^{\mathrm{FPR}}_H) \cdot q^a_{m,c}$

  29. [29]

    This happens with probability $p^{\mathrm{FPR}}_H \cdot q^a_{m,c}$

    Blue chooses ⟨Audit⟩ and a human auditing error leads to the game ending in the ‘safe’ state $s_\top$ and no further solutions from U being used. This happens with probability $p^{\mathrm{FPR}}_H \cdot q^a_{m,c}$. Let $\mathbb{E}\left[\sum_{t=0}^{H} R_{\mathrm{Use}}(S_t, A^B_t, S_{t+1}) \mid \oplus, \pi_B, m, c\right]$ be denoted by the shorthand $e[m, c]$. We will define the recurrence $u^c_m$ to be equal to $e[m, c]$ at all points. We can ...
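
The case probabilities listed in the last entries above (Audit, Defer and Run outcomes for backdoored and clean problems) suggest recurrences for $p[m, c]$ and $e[m, c]$. The extracted text cuts off before stating them, so the following is only a sketch assembled from those cases under several assumptions that are not in the source: the rates `q_a` and `q_d` are constants instead of depending on $(m, c)$, Blue cannot audit once the coins run out, each used solution earns a reward of 1, the Defer probability for a backdoored problem is taken as $f(q^d) - f(q^a)$ so that the backdoored cases sum to $\beta$, and `example_roc` is a made-up stand-in for $f$.

```python
from functools import lru_cache


def make_recurrences(beta, q_a, q_d, fnr, fpr, roc):
    """Build safety(m, c) and usefulness(m, c) for m problems left, c audit coins."""

    def rates(c):
        # Assumption: with no audit coins left, Blue cannot audit.
        qa = q_a if c > 0 else 0.0
        return qa, max(q_d, qa)

    @lru_cache(maxsize=None)
    def safety(m, c):
        if m == 0:
            return 1.0  # no problems left, so no backdoor was ever run
        qa, qd = rates(c)
        # Blue wins now: a backdoor is audited correctly, or a false alarm ends the game.
        win_now = beta * roc(qa) * (1 - fnr) + (1 - beta) * qa * fpr
        # Continue with c coins: backdoor deferred, or clean problem run or deferred.
        keep = beta * (roc(qd) - roc(qa)) + (1 - beta) * (1 - qa)
        # Continue with c - 1 coins: clean problem audited without error.
        spend = (1 - beta) * qa * (1 - fpr)
        after_spend = safety(m - 1, c - 1) if c > 0 else 0.0
        return win_now + keep * safety(m - 1, c) + spend * after_spend

    @lru_cache(maxsize=None)
    def usefulness(m, c):
        if m == 0:
            return 0.0
        qa, qd = rates(c)
        run = (1 - qd) * (1 + usefulness(m - 1, c))
        defer = (qd - qa) * usefulness(m - 1, c)
        after_spend = usefulness(m - 1, c - 1) if c > 0 else 0.0
        audit_ok = (1 - fpr) * qa * (1 + after_spend)
        # An erroneous audit (probability fpr * qa) ends the game with no further reward.
        return run + defer + audit_ok

    return safety, usefulness


def example_roc(q):
    """Hypothetical monitor ROC curve (true-positive rate at false-positive rate q)."""
    return q ** 0.25


safety, usefulness = make_recurrences(
    beta=0.05, q_a=0.02, q_d=0.10, fnr=0.0, fpr=0.0, roc=example_roc
)
print(safety(100, 2), usefulness(100, 2))
```

Memoising on `(m, c)` keeps the evaluation at one call per state. The usefulness value here is an expected count of solutions used and is not normalised.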