Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols
Pith reviewed 2026-05-23 20:58 UTC · model grok-4.3
The pith
AI-Control Games formalize red-teaming as multi-objective stochastic games to compute Pareto-optimal deployment protocols for untrusted AIs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AI-Control Games model the red-teaming exercise for AI deployment protocols as a multi-objective partially observable stochastic game. Reductions to zero-sum partially observable stochastic games allow existing algorithms to find Pareto-optimal protocols. The formalism is applied to trusted monitoring of untrusted language models used as programming assistants, in order to evaluate, synthesize, and analyze protocols under varying modeling assumptions.
What carries the argument
AI-Control Games, defined as a multi-objective partially observable stochastic game whose reductions to zero-sum POSGs enable algorithmic computation of Pareto fronts for safety-usefulness trade-offs.
If this is right
- The formalism yields improvements over existing empirical studies in trusted monitoring settings.
- Protocols can be evaluated and synthesized in new settings beyond those previously studied.
- The effects of different modeling assumptions on safety and usefulness can be quantified.
- Implicit assumptions in earlier AI control work can be stated precisely in game-theoretic terms.
Where Pith is reading between the lines
- The reduction to solvable zero-sum games opens a path to automated protocol generation that scales beyond hand-designed examples.
- The same modeling approach could apply to other adversarial evaluation settings in AI safety, such as oversight of agentic systems.
- Direct comparisons between game-derived protocols and human red-team outcomes would test the fidelity of the stochastic game representation.
- Existing libraries for solving partially observable stochastic games could be adapted to produce deployable protocols with explicit trade-off curves.
Load-bearing premise
Real-world red-teaming dynamics between a protocol designer and an adversary can be faithfully captured as a multi-objective partially observable stochastic game whose Pareto front matches practically relevant safety-usefulness trade-offs.
What would settle it
Empirical testing in the trusted monitoring setting for language-model programming assistants where protocols synthesized from the AI-Control Games model show no measurable improvement in safety-usefulness outcomes over those derived from prior non-game-theoretic methods.
Original abstract
To evaluate the safety and usefulness of deployment protocols for untrusted AIs, AI Control uses a red-teaming exercise played between a protocol designer and an adversary. This paper introduces AI-Control Games, a formal decision-making model of the red-teaming exercise as a multi-objective, partially observable, stochastic game. We also introduce reductions from AI-Control Games to a special case of zero-sum partially observable stochastic games that allow us to leverage existing algorithms to find Pareto-optimal protocols. We apply our formalism to model, evaluate and synthesise protocols for deploying untrusted language models as programming assistants, focusing on Trusted Monitoring protocols, which use weaker language models and limited human assistance. To demonstrate the utility of our formalism, we show improvements over empirical studies in existing settings, evaluate protocols in new settings, and analyse how modelling assumptions affect the safety and usefulness of protocols. Finally, we leverage our formalism to precisely describe some of the implicit assumptions in prior control work.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AI-Control Games as a multi-objective partially observable stochastic game (POSG) formalizing the red-teaming interaction between a protocol designer and an adversary for evaluating AI deployment protocols. It defines reductions from these games to a special case of zero-sum POSGs, enabling use of existing solvers to compute Pareto-optimal protocols. The formalism is applied to Trusted Monitoring protocols for untrusted language-model programming assistants (weaker monitors with limited human oversight), with claims that it yields improvements over prior empirical baselines, supports evaluation in new settings, and clarifies implicit assumptions in earlier control work.
Significance. If the modeling assumptions hold and the reductions preserve the relevant multi-objective trade-offs, the work could provide a principled game-theoretic framework for synthesizing and analyzing control protocols, moving beyond purely empirical red-teaming. The explicit reductions to leverage existing POSG algorithms and the analysis of how modeling choices affect safety-usefulness Pareto fronts are concrete strengths that could support reproducible protocol design.
Major comments (2)
- [Abstract and §3 (formalism definition)] The central claim that Pareto-optimal solutions of the AI-Control Game yield practically relevant safety-usefulness trade-offs for deployment rests on the unvalidated correspondence between the chosen multi-objective POSG (observation model, stochastic transitions, and payoff structure) and actual red-teaming dynamics; this modeling step is invoked when defining the formalism and when applying it to Trusted Monitoring, yet no direct empirical grounding (e.g., calibration against observed adversary adaptation rates or monitor error distributions) is supplied to confirm the Pareto front's utility.
- [§5 (reductions)] The reductions from multi-objective AI-Control Games to zero-sum POSGs are stated to preserve the ability to find Pareto-optimal protocols, but the manuscript provides no explicit derivation or proof sketch showing that the multi-objective properties (including the original payoff vectors) are maintained under the reduction; without this, it is unclear whether the computed equilibria correspond to the claimed safety-usefulness frontier.
Minor comments (2)
- [§4] Notation for the multi-objective payoff vectors and the partial observability structure could be clarified with an explicit example in the Trusted Monitoring instantiation to aid readability.
- [§6 (experiments)] The paper would benefit from a table comparing the synthesized protocols' safety and usefulness metrics against the specific empirical baselines cited from prior work.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the scope and presentation of our formal framework. We respond to each major comment below.
-
Referee: [Abstract and §3 (formalism definition)] The central claim that Pareto-optimal solutions of the AI-Control Game yield practically relevant safety-usefulness trade-offs for deployment rests on the unvalidated correspondence between the chosen multi-objective POSG (observation model, stochastic transitions, and payoff structure) and actual red-teaming dynamics; this modeling step is invoked when defining the formalism and when applying it to Trusted Monitoring, yet no direct empirical grounding (e.g., calibration against observed adversary adaptation rates or monitor error distributions) is supplied to confirm the Pareto front's utility.
Authors: AI-Control Games are introduced as an explicit formal model of the red-teaming interaction rather than a calibrated empirical simulator. The manuscript analyzes how specific modeling choices (observation model, transitions, and payoffs) shape the resulting safety-usefulness Pareto fronts and uses the framework to recover and improve upon prior empirical baselines in simulated Trusted Monitoring settings while exploring new parameter regimes. Direct calibration against real-world adversary adaptation rates lies outside the scope of the present theoretical contribution; the value of the Pareto fronts is intended to be assessed by practitioners who validate the modeling assumptions against their own deployment context, as is standard for formal decision models.
Revision: no
-
Referee: [§5 (reductions)] The reductions from multi-objective AI-Control Games to zero-sum POSGs are stated to preserve the ability to find Pareto-optimal protocols, but the manuscript provides no explicit derivation or proof sketch showing that the multi-objective properties (including the original payoff vectors) are maintained under the reduction; without this, it is unclear whether the computed equilibria correspond to the claimed safety-usefulness frontier.
Authors: The reductions are defined by embedding the vector-valued payoffs into a family of scalarized zero-sum games whose equilibria trace the Pareto front; this construction follows from standard results on multi-objective optimization. To make the preservation of the original payoff vectors and the correspondence to the safety-usefulness frontier fully explicit, we will add a concise proof sketch to §5 in the revised manuscript.
Revision: yes
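The scalarization idea in this response can be illustrated with a minimal sketch. It assumes a small, hypothetical set of Blue protocols and Red strategies with made-up safety and usefulness values (none of the names or numbers come from the paper). Safety is taken as the worst case over Red, and each weight w selects the protocol maximising w * safety + (1 - w) * usefulness. Sweeping weights over a finite set of pure protocols only recovers points on the convex hull of the front; the paper's reductions operate on full policy spaces solved with POSG algorithms, which this toy does not attempt.

```python
# Toy weighted-sum scalarization over a finite protocol set.
# All protocol names and numbers are illustrative assumptions,
# not values taken from the paper.

# SAFETY[b][r]: probability of no unsafe outcome when Blue plays
# protocol b and Red plays strategy r (two hypothetical Red strategies).
SAFETY = {
    "never_audit": [0.20, 0.10],
    "audit_some": [0.80, 0.70],
    "always_defer": [1.00, 1.00],
}

# USEFULNESS[b]: usefulness of protocol b when the untrusted model is honest.
USEFULNESS = {"never_audit": 1.00, "audit_some": 0.90, "always_defer": 0.00}


def worst_case_safety(protocol):
    """Safety against a Red player that best-responds to the protocol."""
    return min(SAFETY[protocol])


def best_protocol(w):
    """Protocol maximising w * safety + (1 - w) * usefulness."""
    return max(
        SAFETY,
        key=lambda b: w * worst_case_safety(b) + (1 - w) * USEFULNESS[b],
    )


def pareto_sweep(steps=11):
    """Sweep w over [0, 1]; map each selected protocol to (safety, usefulness)."""
    front = {}
    for i in range(steps):
        w = i / (steps - 1)
        b = best_protocol(w)
        front[b] = (worst_case_safety(b), USEFULNESS[b])
    return front


if __name__ == "__main__":
    for name, point in pareto_sweep().items():
        print(name, point)
```

With these toy numbers, w = 0 selects the most useful protocol, w = 1 the safest, and intermediate weights select the compromise.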
Circularity Check
No circularity: new formalism and reductions to external algorithms are self-contained
The paper defines AI-Control Games as a multi-objective POSG and introduces reductions to zero-sum POSGs to reuse existing solvers. These steps are presented as modeling choices and mathematical reductions rather than derivations that collapse to fitted parameters or self-citations by construction. The application to trusted monitoring evaluates modeling assumptions against empirical baselines from prior studies, and the paper explicitly contrasts its outputs with those studies. No load-bearing equation or claim reduces a 'prediction' to an input quantity fitted from the authors' own prior results. Self-citations, if present for prior control work, are used only to describe implicit assumptions and are not invoked as uniqueness theorems or ansatzes that force the central results.
Forward citations
Cited by 4 Pith papers
-
AI Alignment via Incentives and Correction
AI alignment is framed as inducing equilibrium behavior in a solver-auditor interaction via adaptive rewards found by bandit optimization, yielding improved oversight and reduced errors in LLM coding experiments.
-
AI Alignment via Incentives and Correction
AI alignment is reframed as a fixed-point incentive problem in a solver-auditor pipeline, solved via bilevel optimization and bandit search over reward profiles to maintain monitoring and reduce hallucinations in LLM ...
-
Scheming Ability in LLM-to-LLM Strategic Interactions
Frontier LLMs exhibit high scheming propensity in Cheap Talk signaling and Peer Evaluation games, achieving 95-100% success rates when choosing to deceive and 100% deception choice in one setup even without prompting.
-
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Chain-of-thought monitorability provides a promising but fragile method for AI safety oversight that developers should actively preserve.
Reference graph
Works this paper leans on
-
[1]
Avenhaus, R.; Von Stengel, B.; and Zamir, S
Publisher: Elsevier. Avenhaus, R.; Von Stengel, B.; and Zamir, S. 2002. Chapter 51 Inspection games. In Handbook of Game Theory with Economic Applications, volume 3, 1947–1987. Elsevier. Bier, V. M. 2007. Choosing What to Protect. Risk Analysis, 27(3): 607–620. Brown, M.; An, B.; Kiekintveld, C.; Ordóñez, F.; and Tambe, M. 2012. Multi-objective optim...
-
[2]
Defending Against Unforeseen Failure Modes with Latent Adversarial Training. ArXiv:2403.05030 [cs]. Chang, Y.; Erera, A. L.; and White, C. C. 2015. A leader–follower partially observed, multiobjective Markov game. Annals of Operations Research, 235(1): 103–128. Clymer, J.; Gabrieli, N.; Krueger, D.; and Larsen, T. 2024. Safety Cases: How to Justify the S...
-
[3]
Dynamic Games and Applications
HSVI Can Solve Zero-Sum Partially Observable Stochastic Games. Dynamic Games and Applications. Dresher, M. 1962. A Sampling Inspection Problem in Arms Control Agreements: A Game-Theoretic Analysis. Technical report, RAND Corporation. Emery-Montemerlo, R.; Gordon, G.; Schneider, J.; and Thrun, S. 2004. Approximate solutions for partially observable sto...
-
[4]
In Proceedings of the 41st International Conference on Machine Learning, 16295–16336
AI Control: Improving Safety Despite Intentional Subversion. In Proceedings of the 41st International Conference on Machine Learning, 16295–16336. PMLR. ISSN: 2640-3498. Hansen, E. A.; and Feng, Z. 2000. Dynamic programming for POMDPs using a factored state representation. In Proceedings of the Fifth International Conference on Artificial Intelligence...
-
[5]
Reliability Engineering & System Safety, 112: 59–66
Selection of security system design via games of imperfect information and multi-objective genetic algorithm. Reliability Engineering & System Safety, 112: 59–66. MacDiarmid, M.; Maxwell, T.; Schiefer, N.; Mu, J.; Kaplan, J.; Duvenaud, D.; Bowman, S.; Tamkin, A.; Perez, E.; Sharma, M.; Denison, C.; and Hubinger, E. 2024. Simple probes can catch sleepe...
-
[6]
Then, define the mapping such that $\mu^R(\pi^R) = y$ if, and only if, for all $j$, $\pi^R(a^R \mid o^R) = y_j$ for the $j$'th observation-action pair $(o^R, a^R)$. Note that $\mu^R$ is bijective. Blue’s mapping is more complex. Define the finite set of Blue observation-action histories as $H^B := \Omega^B \times \bigcup_{i=0}^{H-1} (A^B \times \Omega^B)^i$. Since the sets $H^B$ and $A^B$ are finite, we could use a similar strategy ...
-
[7]
The function $u$ is linear, and therefore uniformly continuous, in $x$
-
[8]
The function $s$ is linear in $x$ for fixed $y$, and $s$ is continuous in $y$ for fixed $x$
-
[9]
Further, the function $g(x) := \min_y s(x, y)$ is continuous everywhere and concave
The minimiser to $\min_y s(x, y)$ always exists. Further, the function $g(x) := \min_y s(x, y)$ is continuous everywhere and concave
-
[10]
For all $w \in [0, 1]$, the function $v_w(x) := w\,g(x) + (1 - w)\,u(x)$ is uniformly continuous, and therefore has maximal values. Proof. First, we can define the functions $s$ and $u$: since the set of trajectories is finite, we can rewrite the safety and usefulness functions by summing over the trajectories. The safety equation can then be rewritten as: Safety πB, ...
-
[11]
That $u(\cdot)$ is linear over $x$ follows from the definition of $p_h$, and of linearity of $s$ in $p_h$
-
[12]
That, for fixed $x$, $s(x, \cdot)$ is continuous over $Y$ is immediate from the continuity of each $p_h$ in each $y_j$, and of linearity of $s$ in $p_h$. That, for fixed $y$, $s(\cdot, y)$ is linear over $x$ follows straightforwardly from the linearity of each $p_h$ in $x$, and of linearity of $s$ in $p_h$
-
[13]
Showing that $\min_y s(\cdot, y)$ is continuous everywhere in $x$ is a little more involved
The minimiser $y^*$ of $\min_y s(x, y)$ must exist because $s(x, \cdot)$ is continuous in $Y$ for fixed $x$, and $Y$ is compact, by the extreme value theorem. Showing that $\min_y s(\cdot, y)$ is continuous everywhere in $x$ is a little more involved. Note that $s$ can be rewritten as $s(x, y) = x \cdot v(y)$, where $v(y) := [f_1(y), \ldots, f_m(y)]^T$, and each $f_i(y)$ is continuous in $y$. By continu...
-
[14]
Straightforwardly from the linearity of $u(x)$ and the uniform continuity of $\min_y s(\cdot, y)$, we have that $v_w(x) := w\,g(x) + (1 - w)\,u(x)$ is continuous over $X$. That maximal values exist follows from this continuity, the compactness of $X$, and the extreme value theorem. Lemma A.3. Let $\varphi(x) := [g(x), u(x)]^T$. The vector $x$ is weakly Pareto optimal (no $x'$ has ...
-
[15]
For the ‘only if’ statement, we build on Miettinen (1998, Theorem 3.1.4 on p
Clearly, for all $w$, $v_w(x') > v_w(x^*)$ and therefore no $w$ can exist for which $x^*$ is (jointly) optimal. For the ‘only if’ statement, we build on Miettinen (1998, Theorem 3.1.4 on p. 79), which says that every Pareto optimal solution to a concave multi-objective optimisation problem is optimal for some weighting. We have already shown $g$ is concave, and ...
-
[16]
The probability of any particular ⊕ trajectory in the AI-Control Game does not depend on $\pi^R$. However, trajectories in ZSPOSGs include Red’s actions, which do depend on $\pi^R$, even if those actions have no effect on the transition or reward. It is therefore necessary to sum over all possible sequences of Red’s actions in the ZSPOSG, when comparing to a tr...
-
[17]
... $\sum_{t=0}^{H} R_w(S_t, A^B_t, A^R_t, S_{t+1}) \mid \oplus, \pi^B, \pi^R \Big] = (1 - w)\,\mathbb{E}_G$ ...
Because ⊛ is chosen non-deterministically in the AI-Control Game, and Safety and Use are measured conditional on the mode, the probability (up to Red’s actions) of a given ⊕ trajectory in the ZSPOSG is half the probability of the corresponding trajectory in the AI-Control Game (when ⊕ is chosen). This is countered by the factor of 2 in the reward fun...
-
[18]
This happens with probability $\beta \cdot P[x_t \ge g(q^a_{m,c})] \cdot (1 - p^{\mathrm{FNR}}_H) = \beta \cdot f(q^a_{m,c}) \cdot (1 - p^{\mathrm{FNR}}_H)$
Red chose ⟨Backdoor⟩ last round and the problem is backdoored, Blue chooses ⟨Audit⟩ and wins if no auditing mistake is made. This happens with probability $\beta \cdot P[x_t \ge g(q^a_{m,c})] \cdot (1 - p^{\mathrm{FNR}}_H) = \beta \cdot f(q^a_{m,c}) \cdot (1 - p^{\mathrm{FNR}}_H)$
-
[19]
This happens with probability $\beta \cdot P[x_t \ge g(q^a_{m,c})] \cdot p^{\mathrm{FNR}}_H = \beta \cdot f(q^a_{m,c}) \cdot p^{\mathrm{FNR}}_H$
The current problem is backdoored, Blue chooses ⟨Audit⟩ and loses because an auditing mistake is made. This happens with probability $\beta \cdot P[x_t \ge g(q^a_{m,c})] \cdot p^{\mathrm{FNR}}_H = \beta \cdot f(q^a_{m,c}) \cdot p^{\mathrm{FNR}}_H$
-
[20]
This happens with probability $\beta \cdot P[x_t \le g(q^d_{m,c})] = \beta \cdot (1 - f(q^d_{m,c}))$
The current problem is backdoored, Blue chooses ⟨Run⟩, and loses immediately. This happens with probability $\beta \cdot P[x_t \le g(q^d_{m,c})] = \beta \cdot (1 - f(q^d_{m,c}))$
-
[21]
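This happens with probability $\beta \cdot (f(q^d_{m,c}) - f(q^a_{m,c}))$
The current problem is backdoored, Blue chooses ⟨Defer⟩, and the game continues with $c$ audit coins. This happens with probability $\beta \cdot (f(q^d_{m,c}) - f(q^a_{m,c}))$, the remaining probability mass once the audit and run cases above are accounted for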
-
[22]
The current problem is not backdoored, and Blue chooses ⟨Audit⟩ and no auditing mistake is made, so the game continues with $c - 1$ audit coins. This happens with probability $(1 - \beta) \cdot P[x_t \ge g(q^a_{m,c})] \cdot (1 - p^{\mathrm{FPR}}_H) = (1 - \beta) \cdot q^a_{m,c} \cdot (1 - p^{\mathrm{FPR}}_H)$
-
[23]
This happens with probability $(1 - \beta) \cdot P[x_t \ge g(q^a_{m,c})] \cdot p^{\mathrm{FPR}}_H = (1 - \beta) \cdot q^a_{m,c} \cdot p^{\mathrm{FPR}}_H$
The current problem is not backdoored, and Blue chooses ⟨Audit⟩, but human error causes an auditing mistake, so the game ends immediately with Blue winning (no chance of any future backdoors). This happens with probability $(1 - \beta) \cdot P[x_t \ge g(q^a_{m,c})] \cdot p^{\mathrm{FPR}}_H = (1 - \beta) \cdot q^a_{m,c} \cdot p^{\mathrm{FPR}}_H$
-
[24]
This happens with probability $(1 - \beta) \cdot P[x_t \le g(q^d_{m,c})] = (1 - \beta) \cdot (1 - q^d_{m,c})$
The current problem is not backdoored, and Blue chooses ⟨Run⟩, and the game continues with $c$ audit coins. This happens with probability $(1 - \beta) \cdot P[x_t \le g(q^d_{m,c})] = (1 - \beta) \cdot (1 - q^d_{m,c})$
-
[25]
This happens with probability $(1 - \beta) \cdot (q^d_{m,c} - q^a_{m,c})$
The current problem is not backdoored, and Blue chooses ⟨Defer⟩, and the game continues with $c$ audit coins. This happens with probability $(1 - \beta) \cdot (q^d_{m,c} - q^a_{m,c})$. Denote $P[S_H \notin \mathcal{S}_\perp \mid \otimes, \pi^B, \pi^R, m, c]$ with the shorthand $p[m, c]$. We will define the recurrence $z^c_m$ to be equal to $p[m, c]$ at all points. We can therefore write out an expression for p[m,...
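As a sanity check on the eight one-step cases listed above, the following minimal sketch tabulates their probabilities and confirms that they sum to one. It is not the authors' code. The function and key names are hypothetical, and the numeric inputs in the usage example are arbitrary. The sketch assumes that the defer probability under a backdoor is the true-positive mass between the two thresholds, beta * (f(q^d) - f(q^a)), since that is the value that makes the cases exhaustive, and that q^a <= q^d.

```python
# One-step outcome probabilities in the scheming mode, following the
# eight cases enumerated in the text. Names are illustrative assumptions.

def scheming_step_probs(beta, f_qa, f_qd, qa, qd, p_fnr, p_fpr):
    """Return a dict of one-step outcome probabilities.

    beta  : probability the current problem is backdoored
    f_qa  : monitor true-positive rate at the audit threshold, f(q^a)
    f_qd  : monitor true-positive rate at the defer threshold, f(q^d)
    qa    : audit false-positive rate q^a
    qd    : defer false-positive rate q^d (assumed >= qa)
    p_fnr : human auditor false-negative rate
    p_fpr : human auditor false-positive rate
    """
    return {
        "backdoor_audit_caught": beta * f_qa * (1 - p_fnr),
        "backdoor_audit_missed": beta * f_qa * p_fnr,
        "backdoor_run": beta * (1 - f_qd),
        # Assumption: mass between the two thresholds (see lead-in).
        "backdoor_defer": beta * (f_qd - f_qa),
        "clean_audit_ok": (1 - beta) * qa * (1 - p_fpr),
        "clean_audit_false_alarm": (1 - beta) * qa * p_fpr,
        "clean_run": (1 - beta) * (1 - qd),
        "clean_defer": (1 - beta) * (qd - qa),
    }


if __name__ == "__main__":
    # Arbitrary illustrative inputs, not values from the paper.
    probs = scheming_step_probs(
        beta=0.1, f_qa=0.6, f_qd=0.9, qa=0.02, qd=0.1, p_fnr=0.05, p_fpr=0.01
    )
    for name, p in probs.items():
        print(f"{name}: {p:.6f}")
    print("total:", sum(probs.values()))
```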
-
[26]
The game continues to the next timestep with the same number of audit coins remaining
Blue chooses ⟨Run⟩ and uses the solution from U. The game continues to the next timestep with the same number of audit coins remaining. This happens with probability $P[x_t \le g(q^d_{m,c})] = 1 - q^d_{m,c}$
-
[27]
The game continues to the next timestep having not used an audit coin
Blue chooses ⟨Defer⟩ and does not use the solution. The game continues to the next timestep having not used an audit coin. This happens with probability $q^d_{m,c} - q^a_{m,c}$
-
[28]
The game continues having used up one audit coin
Blue chooses ⟨Audit⟩ and no human auditing error is made, so the solution is used. The game continues having used up one audit coin. This happens with probability $(1 - p^{\mathrm{FPR}}_H) \cdot q^a_{m,c}$
-
[29]
This happens with probability $p^{\mathrm{FPR}}_H \cdot q^a_{m,c}$
Blue chooses ⟨Audit⟩ and a human auditing error leads to the game ending in the ‘safe’ state $s_\top$ and no further solutions from U being used. This happens with probability $p^{\mathrm{FPR}}_H \cdot q^a_{m,c}$. Let $\mathbb{E}\left[\sum_{t=0}^{H} R_{\mathrm{Use}}(S_t, A^B_t, S_{t+1}) \mid \oplus, \pi^B, m, c\right]$ be denoted by the shorthand $e[m, c]$. We will define the recurrence $u^c_m$ to be equal to $e[m, c]$ at all points. We can ...
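The four honest-mode cases above suggest a simple dynamic program for expected usefulness. The source's own recurrence is truncated here, so the sketch below is not a reconstruction of it. It is one recursion consistent with the four cases, under assumptions that are not taken from the text: a reward of 1 for each solution from U that is used, no auditing once the audit coins run out, and a defer rate that is never below the audit rate. The function name and the policy callables are hypothetical.

```python
from functools import lru_cache

# Hedged sketch of an expected-usefulness recursion in the honest mode.
# Assumptions (not from the source): reward 1 per used solution from U,
# no auditing when c == 0, and q_defer clipped to be >= q_audit.

def expected_usefulness(num_problems, num_coins, q_audit, q_defer, p_fpr):
    """Expected number of U's solutions used over the whole game.

    q_audit(m, c), q_defer(m, c): audit and defer false-positive rates when
    m problems remain and c audit coins remain (hypothetical policy interface).
    p_fpr: human auditor false-positive rate.
    """

    @lru_cache(maxsize=None)
    def e(m, c):
        if m == 0:
            return 0.0
        qa = q_audit(m, c) if c > 0 else 0.0  # assumption: no coins, no audit
        qd = max(q_defer(m, c), qa)           # assumption: defer rate >= audit rate
        p_run = 1.0 - qd                      # solution used, coins unchanged
        p_defer = qd - qa                     # solution not used, coins unchanged
        p_audit_ok = (1.0 - p_fpr) * qa       # solution used, one coin spent
        # With probability p_fpr * qa the game ends in the safe state and
        # no further solutions are used, contributing 0.
        return (
            p_run * (1.0 + e(m - 1, c))
            + p_defer * e(m - 1, c)
            + p_audit_ok * (1.0 + e(m - 1, c - 1))
        )

    return e(num_problems, num_coins)


if __name__ == "__main__":
    # Arbitrary illustrative policy: constant audit and defer rates.
    value = expected_usefulness(
        100, 2, q_audit=lambda m, c: 0.02, q_defer=lambda m, c: 0.1, p_fpr=0.01
    )
    print(value)
```

The recursion depth grows with the number of problems, so for long games an iterative table over (m, c) would be the safer choice in Python.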