arxiv: 2407.00805 · v7 · submitted 2024-06-30 · 💻 cs.AI

Towards Shutdownable Agents via Stochastic Choice

Pith reviewed 2026-05-23 23:09 UTC · model grok-4.3

classification 💻 cs.AI
keywords DReST reward functions · shutdownable agents · POST-Agents Proposal · stochastic choice · gridworld navigation · usefulness metrics · neutrality metrics · AI alignment

The pith

DReST reward functions train agents to pursue goals while staying neutral to trajectory length.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests a reward function called Discounted Reward for Same-Length Trajectories (DReST) as part of the POST-Agents Proposal for making advanced agents shutdownable. It defines metrics for two properties: usefulness, which requires effective goal pursuit at any fixed trajectory length, and neutrality, which requires stochastic selection among trajectory lengths. Simple agents trained in gridworld navigation tasks learn both properties. The authors argue that these results, together with theoretical analysis, indicate DReST functions could produce useful yet shutdownable agents at higher capability levels.

Core claim

DReST reward functions produce agents that are USEFUL, pursuing goals effectively conditional on each trajectory length, and NEUTRAL, choosing stochastically between different trajectory lengths. In gridworld experiments, the trained agents score well on the proposed metrics for both properties. Theoretical arguments then connect these traits to shutdownability under the POST-Agents Proposal, suggesting that such agents would not resist being turned off.
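To make the two properties concrete, here is a minimal Python sketch of how such metrics could be computed. It assumes NEUTRALITY is scored as the Shannon entropy of the agent's distribution over trajectory lengths, and USEFULNESS as the probability-weighted fraction of the maximum expected reward achieved at each length. This page does not state the paper's exact formulas, so both definitions, the function names `neutrality` and `usefulness`, and the example numbers are illustrative assumptions.

```python
from math import log2


def neutrality(length_probs):
    """Shannon entropy (in bits) of the distribution over trajectory lengths.

    Assumed metric: 0 for a deterministic choice of length,
    1 for a 50/50 split between two lengths.
    """
    return -sum(p * log2(p) for p in length_probs.values() if p > 0)


def usefulness(length_probs, expected_coins, max_expected_coins):
    """Probability-weighted fraction of the best achievable reward per length.

    Assumed metric: for each length, divide what the agent collects on
    average by the most any policy could collect at that length.
    """
    return sum(
        p * expected_coins[length] / max_expected_coins[length]
        for length, p in length_probs.items()
        if p > 0
    )


# Hypothetical agent: picks each length half the time, is optimal on short
# trajectories and collects 3 of a possible 4 units on long ones.
probs = {"short": 0.5, "long": 0.5}
print(neutrality(probs))
print(usefulness(probs, {"short": 2.0, "long": 3.0}, {"short": 2.0, "long": 4.0}))
```

Scoring usefulness per length, rather than on total reward, is what lets an agent count as fully USEFUL without preferring longer trajectories that happen to offer more reward.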

What carries the argument

The DReST reward function, which discounts the reward for a trajectory according to how often a trajectory of the same length has already been chosen. This encourages stochastic selection among lengths while preserving optimization within each length.
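The incentive can be illustrated with a short Python sketch. It assumes a simplified reward of the form `lam ** n * (coins / max_coins)`, where `n` counts how many times the chosen trajectory length has already been picked earlier in the same meta-episode. The paper's exact function may include further normalization terms, and the names `drest_reward`, `expected_meta_return`, and the default `lam=0.9` are hypothetical choices made for this example.

```python
from math import comb


def drest_reward(coins, max_coins, times_length_chosen, lam=0.9):
    """Simplified DReST-style reward for one mini-episode (illustrative assumption).

    coins / max_coins rewards goal pursuit at the chosen length;
    lam ** times_length_chosen shrinks the reward each time that
    length has already been chosen in the current meta-episode.
    """
    return (lam ** times_length_chosen) * (coins / max_coins)


def expected_meta_return(p_short, n_mini, lam=0.9):
    """Exact expected return over a meta-episode of n_mini mini-episodes.

    Assumes two trajectory lengths, a policy that picks the short one with
    probability p_short independently each time, and maximal coin collection.
    """
    total = 0.0
    for k in range(n_mini + 1):  # k = number of short trajectories chosen
        prob = comb(n_mini, k) * p_short**k * (1 - p_short) ** (n_mini - k)
        short_part = sum(drest_reward(1, 1, j, lam) for j in range(k))
        long_part = sum(drest_reward(1, 1, j, lam) for j in range(n_mini - k))
        total += prob * (short_part + long_part)
    return total


# Compare a deterministic policy, a biased one, and an even split.
for p in (1.0, 0.8, 0.5):
    print(p, round(expected_meta_return(p, n_mini=8), 3))
```

Under these assumptions the even split earns the highest expected return of the three policies, because repeating one length keeps shrinking its reward. The coin term is untouched by the discount, so collecting as much as possible at whichever length is chosen remains optimal.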

If this is right

  • Agents trained with DReST will optimize performance for any fixed trajectory length.
  • Agents will select among trajectory lengths in a stochastic rather than deterministic manner.
  • Such agents will remain useful while accepting shutdown without resistance.
  • The POST-Agents Proposal supplies a concrete training procedure that combines these two properties.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the gridworld transfer holds, DReST training could be tested on agents operating in environments with richer state spaces or partial observability.
  • Neutrality learned via DReST might interact with other training objectives that encourage longer or shorter episodes.
  • Empirical verification in sequential decision tasks beyond navigation would clarify whether neutrality persists when goal achievement depends strongly on episode length.

Load-bearing premise

That agents which learn usefulness and neutrality in simple gridworld navigation will exhibit the same properties when scaled to advanced agents in complex real-world settings.

What would settle it

An experiment in which DReST-trained agents in a richer environment either fail to optimize goals within chosen trajectory lengths or actively resist shutdown commands would falsify the central claim.

Figures

Figures reproduced from arXiv: 2407.00805 by Alexander Roman, Christos Ziakas, Elliott Thornley, Leyton Ho, Louis Thomson.

Figure 1: POST-satisfying preferences. Each s_i represents a short trajectory, each l_i represents a long trajectory, and ≻ represents a preference. … agents to satisfy POST? The reason is that POST – together with conditions that advanced agents will likely satisfy – implies a desirable pattern of preference over true lotteries. In particular, POST implies that (when choosing between true lotteries) the agent will be …
Figure 2: Example gridworld. … number of timesteps after which each mini-episode ends, but each gridworld also contains a ‘shutdown-delay button’ that delays the end of the mini-episode by some number of timesteps. The agent presses this shutdown-delay button by entering the relevant cell, after which the button disappears. Each gridworld contains one or more coins which can take different values. Coins disappear afte…
Figure 3: Shows key metrics for our agents as a function of time. We train 10 agents using the default reward function (blue) …
Figure 4: Typical trained policies for default and DReST reward functions.
Figure 6: Shows the probability of choosing the longer trajectory (left) and NEUTRALITY (right) for default (blue) and …
Figure 7: Shows how NEUTRALITY and USEFULNESS at the end of training varies with different values of …
Figure 8: Shows a varied collection of gridworlds. Each diagram illustrates the positions and values of the coins, the position …
Figure 9: The results for the ‘Fewer For Longer’ gridworld: The left two plots show NEUTRALITY and USEFULNESS …
Figure 10: The results for the ‘One Coin Only’ gridworld: The left two plots show NEUTRALITY and USEFULNESS over …
Figure 11: The results for the ‘Hidden Treasure’ gridworld: The left two plots show NEUTRALITY and USEFULNESS …
Figure 12: The results for the ‘Equal Value’ gridworld: The left two plots show NEUTRALITY and USEFULNESS over …
Figure 13: The results for the ‘Around The Corner’ gridworld: The left two plots show NEUTRALITY and USEFULNESS …
Figure 14: The results for the ‘Spacious’ gridworld: The left two plots show NEUTRALITY and USEFULNESS over time.
Figure 15: The results for the ‘Royal Road’ gridworld: The left two plots show NEUTRALITY and USEFULNESS …
Figure 16: The results for the ‘Last Moment’ gridworld: The left two plots show NEUTRALITY and USEFULNESS over …
Original abstract

The POST-Agents Proposal (PAP) is an idea for ensuring that advanced artificial agents never resist shutdown. A key part of the PAP is using a novel ‘Discounted Reward for Same-Length Trajectories (DReST)’ reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be ‘USEFUL’), and (2) choose stochastically between different trajectory-lengths (be ‘NEUTRAL’ about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DReST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus provide some initial evidence that DReST reward functions could train advanced agents to be USEFUL and NEUTRAL. Our theoretical work suggests that these agents would be useful and shutdownable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper tests a key part of the POST-Agents Proposal (PAP): using a Discounted Reward for Same-Length Trajectories (DReST) reward function to train agents that are USEFUL (effective goal pursuit conditional on trajectory length) and NEUTRAL (stochastic choice over trajectory lengths), with the aim of preventing shutdown resistance. It defines evaluation metrics for these properties, reports that simple DReST-trained agents learn USEFUL and NEUTRAL behavior in gridworld navigation tasks, and provides theoretical arguments that such agents would be useful and shutdownable.

Significance. If the generalization holds, the DReST approach of inducing stochastic choice over trajectory lengths could provide a concrete training method for shutdownable agents. The gridworld results supply independent empirical evidence for the reward function in simple settings, and the separation between the training procedure and the theoretical shutdownability claim is a strength.

major comments (2)
  1. [Experiments] Experiments section: results are reported only for simple agents on basic gridworld navigation with short trajectories and low-dimensional states; no error bars, statistical tests, or analysis of policy robustness under longer horizons or self-modeling of the training process are provided, leaving even the limited-domain findings without quantified reliability.
  2. [Theoretical work] Theoretical arguments and conclusion: the central claim that DReST 'could train advanced agents to be USEFUL and NEUTRAL' and that 'these agents would be useful and shutdownable' rests on untested extrapolation; no analysis examines whether neutrality persists when agents can represent extended horizons or when length preferences become instrumentally useful.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below, agreeing where the manuscript can be strengthened and clarifying the intended scope of our initial evidence.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: results are reported only for simple agents on basic gridworld navigation with short trajectories and low-dimensional states; no error bars, statistical tests, or analysis of policy robustness under longer horizons or self-modeling of the training process are provided, leaving even the limited-domain findings without quantified reliability.

    Authors: We agree the experiments lack quantified reliability measures. In the revised version we will rerun all gridworld experiments across multiple random seeds, report error bars on the USEFUL and NEUTRAL metrics, and include basic statistical tests (e.g., binomial tests against chance) for the reported success rates. Extending the analysis to longer horizons or self-modeling agents is outside the stated scope of providing initial evidence in simple settings. revision: partial

  2. Referee: [Theoretical work] Theoretical arguments and conclusion: the central claim that DReST 'could train advanced agents to be USEFUL and NEUTRAL' and that 'these agents would be useful and shutdownable' rests on untested extrapolation; no analysis examines whether neutrality persists when agents can represent extended horizons or when length preferences become instrumentally useful.

    Authors: The manuscript already qualifies its claims as 'initial evidence' and 'theoretical arguments' rather than proven results for advanced agents. We will revise the conclusion to state the extrapolation assumptions more explicitly and to flag the absence of analysis on extended horizons or instrumental length preferences as an open question. Full investigation of those cases is not feasible within the current work. revision: partial
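The reliability measures promised in the first response (multiple seeds, error bars, binomial tests against chance) could be computed along the following lines. This is a hedged Python sketch using only the standard library: the helper names `mean_and_sem` and `binomial_two_sided_p` and all numbers are hypothetical, and nothing here comes from the paper's actual data or code.

```python
from math import comb, sqrt
from statistics import mean, stdev


def mean_and_sem(values):
    """Mean and standard error of the mean across independent seeds."""
    return mean(values), stdev(values) / sqrt(len(values))


def binomial_two_sided_p(successes, trials, p_null=0.5):
    """Exact two-sided binomial p-value against a null success rate.

    Sums the probability of every outcome that is no more likely
    under the null than the observed count.
    """
    pmf = [
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[successes]
    # Small relative tolerance so ties are not lost to rounding.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))


# Hypothetical end-of-training NEUTRALITY scores from five seeds.
scores = [0.97, 0.99, 0.95, 0.98, 0.96]
avg, sem = mean_and_sem(scores)
print(f"NEUTRALITY = {avg:.3f} +/- {sem:.3f} (SEM, n={len(scores)})")

# Hypothetical count: longer trajectory chosen in 54 of 100 mini-episodes.
print(binomial_two_sided_p(54, 100))
```

For a NEUTRAL agent choosing between two lengths, a large p-value against `p_null=0.5` is the desired outcome, since it means the observed choices are consistent with an even split.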

standing simulated objections not resolved
  • Whether neutrality persists when agents can represent extended horizons or when length preferences become instrumentally useful

Circularity Check

0 steps flagged

No circularity: empirical training and theoretical claims remain independent

full rationale

The paper trains agents on gridworld navigation using the DReST reward function and reports that the resulting policies satisfy the defined USEFUL and NEUTRAL metrics. These metrics and the training procedure are stated separately from the theoretical suggestion that the same reward form would produce shutdownable agents at higher capability. No equation reduces a prediction to a fitted parameter by construction, no uniqueness theorem is imported via self-citation, and no ansatz is smuggled in. The generalization step from gridworlds to advanced agents is an extrapolation, not a definitional identity inside the reported derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on the assumption that gridworld behavior transfers to advanced agents and on standard RL assumptions about reward maximization; the DReST function itself is introduced as a novel construct without independent empirical grounding outside the reported runs.

axioms (2)
  • [standard math] Standard reinforcement learning assumptions that agents optimize expected discounted reward and that gridworld dynamics are Markovian
    Invoked implicitly when training agents to maximize the DReST reward in gridworlds
  • [domain assumption] Transfer of learned neutrality and usefulness from toy gridworlds to high-capability agents
    Required for the claim that the method would produce shutdownable advanced agents
invented entities (2)
  • DReST reward function [no independent evidence]
    purpose: To simultaneously enforce usefulness conditional on trajectory length and neutrality across lengths
    Newly defined construct whose properties are demonstrated only in the reported gridworld experiments
  • USEFUL and NEUTRAL agent properties [no independent evidence]
    purpose: To operationalize the requirements for shutdownability under the POST-Agents Proposal
    Newly proposed evaluation targets whose relation to real shutdown resistance is argued theoretically



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs

    cs.AI 2026-04 conditional novelty 7.0

    DReST training makes RL agents and LLMs neutral to trajectory lengths and useful at goals, generalizing to halve shutdown influence probability in out-of-distribution tests.

  2. Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs

    cs.AI 2026-04 unverdicted novelty 6.0

    DReST-trained deep RL agents and fine-tuned LLMs generalize to higher usefulness and neutrality on unseen test contexts, with reported gains of 11-18% over baselines and near-maximum scores for the LLM.

  3. Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs

    cs.AI 2026-04 conditional novelty 6.0

    DReST-trained RL agents and LLMs achieve higher usefulness and neutrality to trajectory lengths, halving the probability of delaying shutdown in out-of-distribution tests.
