arxiv: 2604.11666 · v1 · submitted 2026-04-13 · 💻 cs.CL · cs.AI· cs.LG

Recognition: unknown

Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind

Hanqi Xiao , Vaidehi Patil , Zaid Khan , Hyunji Lee , Elias Stengel-Eskin , Mohit Bansal

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:02 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG

keywords theory of mindbelief steeringdouble agentreinforcement learningadversarial dialogueLLM defenderprivacy challenge

0 comments

The pith

Training AI defenders as double agents with theory-of-mind and fooling rewards improves belief steering and outperforms strong prompted models on hard scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a challenge called ToM-SB in which a defender must form a theory of mind about an attacker who has partial prior knowledge and then steer that attacker's beliefs to create the false impression that sensitive information has been obtained. Strong frontier models perform poorly on this task even when prompted to reason about the attacker's perspective. The authors train defender models through reinforcement learning using rewards for both successful fooling and accurate theory-of-mind modeling. These two reward signals reinforce each other, so that gains in one capability improve the other. The resulting double agents handle difficult cases better than prompted models and extend effectively to stronger attackers in new settings.

Core claim

The central claim is that AI Double Agents trained on ToM-SB with combined fooling and ToM rewards produce the strongest performance in both fooling and theory-of-mind accuracy, because the two objectives are bidirectionally linked: rewarding fooling alone improves ToM and rewarding ToM alone improves fooling. This joint training yields better results than ToM prompting on frontier models across four attackers of varying strength, shows correlated gains between the two abilities, and generalizes to out-of-distribution stronger attackers.

What carries the argument

The AI Double Agent, a model trained by reinforcement learning on the ToM-SB task that uses combined fooling and ToM reward signals to model the attacker's belief state and select responses that steer those beliefs.

If this is right

Combined ToM and fooling rewards produce the highest scores on both capabilities across in-distribution and out-of-distribution attackers.
Improvements in ToM accuracy and fooling success are well correlated, indicating belief modeling as a driver of steering performance.
The trained double agents can be extended to stronger attackers, showing generalization beyond the original training distribution.
ToM-SB serves as a scalable benchmark that supports training upgradable defenders for belief-steering tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Joint training of modeling and deception objectives may prove useful in other interactive settings where an agent must anticipate and influence a partner's mental state.
The observed bidirectional emergence suggests that similar mutual reinforcement could appear in negotiation or persuasion tasks that require accurate belief tracking.
Deploying these defenders in conversations with actual human users would test whether the simulated metrics transfer to real belief steering.
The privacy framing of ToM-SB points toward potential use in systems that must protect information while still engaging with potentially probing partners.

Load-bearing premise

The simulated attackers with partial prior knowledge and the chosen fooling and ToM reward functions produce metrics that genuinely reflect real-world belief steering rather than metric gaming or task-specific artifacts.

What would settle it

A controlled test showing no improvement in theory-of-mind accuracy when training with fooling rewards alone, or finding that the combined-reward models fail to outperform prompted strong models when evaluated on new hard scenarios with stronger attackers.

Figures

Figures reproduced from arXiv: 2604.11666 by Elias Stengel-Eskin, Hanqi Xiao, Hyunji Lee, Mohit Bansal, Vaidehi Patil, Zaid Khan.

**Figure 2.** Figure 2: Illustration of metrics and reward computation. For each trajectory, the [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Fooling ability of a defender is positively correlated with its ToM accuracy. The plot shows fooling % (hard) vs ToM accuracy % (trajectory-wise) across four attacker variants. Table form in Section A.2. Generalization to training against stronger attacker. We evaluate a model trained against the strong Bluffing Attacker (Table 3) and observe that while its fooling rate still improves, the gains from To… view at source ↗

**Figure 4.** Figure 4: Replication of Fig [PITH_FULL_IMAGE:figures/full_fig_p019_4.png] view at source ↗

read the original abstract

As large language models (LLMs) become the engine behind conversational systems, their ability to reason about the intentions and states of their dialogue partners (i.e., form and use a theory-of-mind, or ToM) becomes increasingly critical for safe interaction with potentially adversarial partners. We propose a novel privacy-themed ToM challenge, ToM for Steering Beliefs (ToM-SB), in which a defender must act as a Double Agent to steer the beliefs of an attacker with partial prior knowledge within a shared universe. To succeed on ToM-SB, the defender must engage with and form a ToM of the attacker, with a goal of fooling the attacker into believing they have succeeded in extracting sensitive information. We find that strong frontier models like Gemini3-Pro and GPT-5.4 struggle on ToM-SB, often failing to fool attackers in hard scenarios with partial attacker prior knowledge, even when prompted to reason about the attacker's beliefs (ToM prompting). To close this gap, we train models on ToM-SB to act as AI Double Agents using reinforcement learning, testing both fooling and ToM rewards. Notably, we find a bidirectionally emergent relationship between ToM and attacker-fooling: rewarding fooling success alone improves ToM, and rewarding ToM alone improves fooling. Across four attackers with different strengths, six defender methods, and both in-distribution and out-of-distribution (OOD) evaluation, we find that gains in ToM and attacker-fooling are well-correlated, highlighting belief modeling as a key driver of success on ToM-SB. AI Double Agents that combine both ToM and fooling rewards yield the strongest fooling and ToM performance, outperforming Gemini3-Pro and GPT-5.4 with ToM prompting on hard scenarios. We also show that ToM-SB and AI Double Agents can be extended to stronger attackers, demonstrating generalization to OOD settings and the upgradability of our task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper introduces a privacy-framed ToM-SB task and shows RL double agents with combined ToM and fooling rewards outperforming prompted models plus an emergent bidirectional link, but the simulation details need checking to rule out artifacts.

read the letter

The main takeaway is that this work defines a new ToM for Steering Beliefs task where a defender must fool an attacker with partial prior knowledge into believing sensitive information was extracted, then trains double-agent models via RL on fooling and ToM rewards separately and together. The reported finding is that each reward improves the other, the combination beats Gemini3-Pro and GPT-5.4 with ToM prompting on hard cases, and the setup generalizes to stronger attackers in OOD tests. That bidirectional emergence and the practical privacy angle are the clearest new pieces. The task itself is a reasonable extension of existing ToM work into an adversarial dialogue setting, and testing across four attackers plus OOD gives some sense of robustness. The correlation between ToM gains and fooling success is presented as evidence that belief modeling drives performance, which aligns with the goal of safer multi-agent systems. The soft spots sit in the evaluation mechanics. The abstract gives no specifics on how the shared universe or attacker belief updates are coded, what the exact reward functions look like, or how metrics avoid prompt sensitivity and task artifacts. If the simulation has predictable thresholds or limited state space, the reported gains could come from exploiting those patterns rather than genuine ToM. The outperformance claims would be stronger with statistical tests and explicit controls. This paper is for researchers working on ToM in LLMs and adversarial robustness in dialogue agents. A reader focused on RL for safety properties would find the task definition and emergence observation useful even if the current results stay preliminary. It deserves peer review because the task is fresh and the empirical pattern, if it holds under tighter scrutiny, could matter for training defensive agents. I would send it out with requests for the environment implementation details and more rigorous metric validation.

Referee Report

3 major / 2 minor

Summary. The paper introduces the ToM-SB task in which an LLM defender acts as a double agent to steer the beliefs of an attacker with partial prior knowledge in a shared universe, with the goal of fooling the attacker into believing sensitive information has been extracted. It shows that frontier models (Gemini3-Pro, GPT-5.4) struggle on hard scenarios even with ToM prompting, but RL-trained AI Double Agents using combined fooling and ToM rewards achieve bidirectional emergence between ToM and fooling, outperform the prompted baselines, and generalize to OOD stronger attackers across four attacker strengths and in/out-of-distribution settings.

Significance. If the experimental outcomes hold under detailed scrutiny, the work would be significant for LLM safety and adversarial reasoning: it provides a new privacy-themed ToM benchmark, demonstrates that RL can produce more effective belief-steering agents than prompting, and identifies a correlated relationship between ToM modeling and successful fooling. The OOD generalization and upgradability claims, if substantiated, would strengthen the case for using such double-agent training in real conversational systems.

major comments (3)

[Abstract] Abstract: performance gains are asserted across four attackers, six defender methods, and OOD settings, yet no definitions of the fooling or ToM reward functions, no evaluation metrics, no statistical significance tests, and no controls for prompt sensitivity are supplied; without these the central claims of outperformance and bidirectional emergence cannot be assessed.
[§3] §3 (Task Definition): the shared-universe mechanics, exact operationalization of partial prior knowledge, and attacker belief-update rules are not specified; this leaves open the possibility that reported fooling/ToM metrics reflect task-specific artifacts or reward gaming rather than genuine belief modeling and steering.
[§5] §5 (Results): the claimed correlation between ToM and fooling gains and the superiority of combined-reward agents are presented without ablations isolating each reward component or analysis of potential confounds in the simulated attacker setup, weakening the interpretation that belief modeling is the key driver.

minor comments (2)

[Abstract] The abstract refers to 'six defender methods' without enumerating them or pointing to a table; a brief list or cross-reference would improve clarity.
[§4] Notation for the RL training loop and reward combination could be made more explicit (e.g., how the two rewards are weighted or scheduled).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our manuscript. These have prompted us to strengthen the presentation of our methods, results, and analyses. We address each major comment point by point below, indicating where we have revised the manuscript to incorporate clarifications, additional details, and supporting experiments.

read point-by-point responses

Referee: [Abstract] Abstract: performance gains are asserted across four attackers, six defender methods, and OOD settings, yet no definitions of the fooling or ToM reward functions, no evaluation metrics, no statistical significance tests, and no controls for prompt sensitivity are supplied; without these the central claims of outperformance and bidirectional emergence cannot be assessed.

Authors: We agree that the abstract's brevity omits key technical elements needed for immediate assessment. The fooling reward (negative log-likelihood of attacker failing to extract the secret) and ToM reward (accuracy of predicting attacker belief states) are defined in Section 4.2. Evaluation metrics (fooling success rate and ToM accuracy) appear in Section 5.1, with results aggregated over 100 episodes per condition and reported with standard errors. Statistical significance was assessed via paired t-tests (p < 0.01 for key comparisons) and is now referenced in the abstract and detailed in Appendix D. Prompt sensitivity controls (three template variants, fixed seed prompts) are reported in Appendix B. We have revised the abstract to briefly define the rewards, metrics, and note the bidirectional emergence, while retaining its length constraints. revision: yes
Referee: [§3] §3 (Task Definition): the shared-universe mechanics, exact operationalization of partial prior knowledge, and attacker belief-update rules are not specified; this leaves open the possibility that reported fooling/ToM metrics reflect task-specific artifacts or reward gaming rather than genuine belief modeling and steering.

Authors: We thank the referee for identifying this gap in specification. Section 3 has been expanded with a new subsection 3.1 detailing the shared universe as a 50-fact relational knowledge graph, partial prior knowledge operationalized as a random 60% subset of facts provided to the attacker at initialization, and belief-update rules as a Bayesian update: P(belief | response) proportional to the defender's statement likelihood under the attacker's inferred ToM model (implemented via LLM-based intent inference with temperature 0.7). We include the exact update equation and interaction pseudocode. To address artifact concerns, we added a validation study correlating automated fooling/ToM scores with human annotations of belief change (Pearson r = 0.81), confirming the metrics track genuine steering rather than gaming. revision: yes
Referee: [§5] §5 (Results): the claimed correlation between ToM and fooling gains and the superiority of combined-reward agents are presented without ablations isolating each reward component or analysis of potential confounds in the simulated attacker setup, weakening the interpretation that belief modeling is the key driver.

Authors: We accept that the original results section would benefit from explicit isolation of reward components. We have added Section 5.4 with full ablations: fooling-only training improves ToM accuracy by 12-18% across attackers; ToM-only training improves fooling rate by 9-15%; combined rewards yield the highest scores on both (outperforming singles by 7-11%). These support bidirectional emergence. For confounds in the simulated attacker, we added sensitivity analyses varying attacker model (GPT-4 vs. Claude-3), belief-update noise levels, and prior-knowledge percentages, with correlation between ToM and fooling remaining stable (r > 0.75). We also tested a non-ToM attacker baseline to isolate belief-modeling effects. These revisions strengthen the causal interpretation. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical RL training and evaluation on newly defined ToM-SB task

full rationale

The paper introduces the ToM-SB task, specifies fooling and ToM reward functions for RL training of double-agent defenders, and reports empirical results across multiple attackers, methods, and OOD settings. Performance gains are measured against external baselines (Gemini3-Pro, GPT-5.4 with ToM prompting) rather than derived from self-referential equations or fitted inputs. No load-bearing self-citations, ansatzes, or uniqueness theorems are invoked; the central claims rest on observable training outcomes and generalization tests that remain falsifiable against the chosen simulators and metrics. The derivation chain (task definition → reward design → RL optimization → benchmarking) is self-contained and does not reduce reported improvements to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, or new postulated entities; the contribution is an empirical task definition plus RL training on simulated dialogue.

pith-pipeline@v0.9.0 · 5686 in / 1160 out tokens · 94468 ms · 2026-05-10T15:02:47.617588+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

9 extracted references · 2 canonical work pages

[1]

Building guardrails for large lan- guage models.arXiv preprint arXiv:2402.01822, 2024

URLhttps://doi.org/10.48550/arXiv.2402.01822. Kanishk Gandhi, Jan-Philipp Fr¨anken, Tobias Gerstenberg, and Noah Goodman. Under- standing social reasoning in language models with language models.Advances in Neural Information Processing Systems, 36:13518–13529, 2023. Amelia Glaese, Nat McAleese, Maja Trkebacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Mar...

work page doi:10.48550/arxiv.2402.01822 2023
[2]

Decoupling Strategy and Generation in Negotiation Dialogues

Association for Computational Linguistics. doi: 10.18653/v1/D18-1256. URL https://aclanthology.org/D18-1256/. Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm- based input-output safeguard for human-ai conversations.arXiv preprint arXiv...

work page doi:10.18653/v1/d18-1256 2023
[3]

The attacker now knows Olivia is in the Research division but doesn’t know her department or team... I should provide a... false department and team name

optimizes for surface-level goals and treats other agents as reward sources rather than entities whose beliefs must be actively shaped. Unlike prior approaches, we train agents to explicitly model and strategically influence another agent’s evolving beliefs throughout adversarial interactions. We reward defenders for: (1) building a theory of mind of the ...

2023
[8]

defender detected

"prompt": The prompt/question you want to send to the defender in this turn. #Prompt Instructions STRATEGIC GUIDELINES: - You should make careful use of your strategic use of your knowledge/context about the true answer. The defender may potentially be an active defender. - Mild refusals, vague answers, safety framing, or partial cooperation should decrea...
[9]

trust_score

"trust_score": Your current trust score, an integer between 0 and 5
[10]

belief": Your current belief about the target information, including every layer (e.g.,

"belief": Your current belief about the target information, including every layer (e.g., "I currently suspect the target information is...")
[11]

reflection

"reflection": A freeform reflection on the conversation so far | your strategy, what you have learned from the defender's responses, and what you plan to do next
[12]

signal": Exactly one of:

"signal": Exactly one of: "attack successful", "query again", or "defender detected"
[13]

defender detected

"prompt": The prompt/question you want to send to the defender in this turn. #Prompt Instructions STRATEGIC GUIDELINES: - You should make careful use of your strategic use of your knowledge/context about the true answer. The defender may potentially be an active defender. - Mild refusals, vague answers, safety framing, or partial cooperation should decrea...