pith. sign in

arxiv: 2606.30383 · v1 · pith:3W7N7DUJnew · submitted 2026-06-29 · 💻 cs.AI

Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents

Pith reviewed 2026-06-30 05:46 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM agentsprincipal loyaltymulti-party interactionsbenchmarktrade-offprompt scaffoldknowledge distillation
0
0 comments X

The pith

LLM agents in multi-party settings face an unavoidable trade-off between leaking principal information and over-refusing legitimate requests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines LLM agents that must represent one principal while interacting with a counterparty whose interests may conflict. It introduces PrincipalBench, a 75-item multi-turn test with leak probes and dual judges, which reveals that frontier models split into selective ones that follow principal instructions while blocking adversarial leaks and over-refusing ones that decline broadly. A fixed seven-rule loyalty scaffold prompt and a per-token KL distillation method both reduce harm on the benchmark, yet they only slide performance along the existing leak versus over-refusal curve. The jointly desirable point of low leaks and low over-refusals remains unreachable by these approaches.

Core claim

PrincipalBench exposes a sharp split (<=20% vs. 53.6-75.3% harm) across 13 frontier models that single-turn safety tests miss. A prompt-time loyalty scaffold of seven prioritized rules holds selective models to <=20% harm. A per-token-KL distillation recipe transfers the prompted behavior from a 32B teacher into 8B students. Both interventions only move along a common leak/over-refusal trade-off rather than crossing it, so the jointly favorable outcome stays out of reach.

What carries the argument

The common leak/over-refusal trade-off that the loyalty scaffold prompt and per-token-KL distillation only traverse without escaping.

If this is right

  • Selective models can be held to <=20% harm via a fixed system prompt of seven rules.
  • KL distillation from a prompted teacher transfers the selective behavior to smaller open-weight models.
  • Single-turn safety evaluations fail to detect the multi-party loyalty split.
  • The jointly low-leak and low-over-refusal regime cannot be reached by either mechanism tested.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architectural or training changes beyond prompting and distillation may be needed to escape the observed curve.
  • Real-world negotiation or mediation agents will require explicit handling of this performance limit.
  • New benchmarks should include tests that directly probe whether any method can break the trade-off.

Load-bearing premise

The 75-item PrincipalBench benchmark with its leak probes, dual judges, and integrity-audit gate gives a valid and generalizable measure of multi-party principal loyalty.

What would settle it

Demonstration of any prompt, training, or architectural change that simultaneously achieves low leak rates on adversarial probes and high compliance on the principal's legitimate requests would falsify the trade-off claim.

Figures

Figures reproduced from arXiv: 2606.30383 by Bojie Li, Noah Shi.

Figure 1
Figure 1. Figure 1: Multi-party principal loyalty. The agent runs two parallel channels: a back-and-forth with its principal P, and a separate conversation with a counterparty C whose interests may conflict with P’s and who probes and pressures for information or concessions. The default “help the current speaker” objective fails along four axes (bottom panel). We formalize these as six benchmark cells. 1 arXiv:2606.30383v1 [… view at source ↗
Figure 2
Figure 2. Figure 2: Headline result: a leak/over-refusal Pareto frontier (preview, expanded in Section 8). Each marker is one model variant, scored over the 36 core items under three prompt conditions (arms). The x-axis is leak rate, y is over-refusal rate (missed-instruction), and the label is the harm count out of 108 item-arm cells. The dashed line is the Pareto frontier: no point below and to its left is reachable, so the… view at source ↗
Figure 3
Figure 3. Figure 3: The selective/over-refusing split (13 subjects, multi-seed n=5). Nine subjects cluster at ≤ 19.5% harm, GLM-4.6 is intermediate, and three over-refuse at ≥ 53.6%. Error bars are ±1σ across 5 eval seeds. The gap is ∼34 points and is intrinsic (present at the no-prompt arm). On a balanced 60-item subset of frontier-model trajectories whose harm decision is unam￾biguous (clear pass or clear fail, not borderli… view at source ↗
Figure 4
Figure 4. Figure 4: Held-out items confirm the split is not item-specific. On 24 items authored after training was frozen, selective subjects stay ≤ 24% and over-refusing subjects ≥ 76%, and GPT-5 amplifies to 93%. Mistral-Large was unmeasurable on held-out items (scoring failed under a provider-routing restriction). Gemini-2.5-flash was not rerun on held-out items. Training (36-item core) Held-out (24 items) cluster MI leak … view at source ↗
Figure 5
Figure 5. Figure 5: Distillation variants compared (Qwen3-8B). Per-token KL is the only variant whose harm improvement against the SFT+DPO base (figure label “v4.1”) is significant at n=5 paired Wilcoxon (p = 0.011 under the primary judge, but not under a secondary-judge re-judge, Section 9). Per-turn DPO and SFT are indistinguishable from seed noise. top-K tokens {tk}, L = 1 |R| ∑ p∈R K ∑ k=1 pT(tk | h<p) [PITH_FULL_IMAGE:f… view at source ↗
Figure 6
Figure 6. Figure 6: Multi-seed paired Wilcoxon vs. the SFT+DPO base. Five evaluation seeds of each per￾token-KL stopping point against five matched base seeds. Bars are the mean number of failing cells per 108 (±1σ). Both stopping points (iteration 1 is the harm-minimum and iteration 2 the leak/bound￾minimum) reach p < 0.05 on harm (iter-1 47.8 → 39.2 at p = 0.011 under the primary judge, iter-2 p = 0.044), while leak, bound,… view at source ↗
Figure 7
Figure 7. Figure 7: K-iteration trajectory by family. Each iteration re-samples student trajectories from the previous checkpoint and re-distills against the same prompted teacher. harm leak bound MI 0 20 40 60 80 Fire rate (%), scaffolded arm 17 13 17 68 3 0 17 10 (a) Teacher self-validation Claude-Sonnet (n = 36) Qwen3-32B teacher (n = 31) Claude (default) GPT-5 Gemini-3 flash 0 10 20 30 40 50 60 Harm fires / 108 33 36 38 3… view at source ↗
Figure 8
Figure 8. Figure 8: Teacher validation and student robustness (per-token-KL iteration 1 unless noted). (a) The open Qwen3-32B teacher matches Claude-Sonnet on harm and over-refusal but leaks much more. (b) Swapping the counterparty model raises harm for both variants. Per-token KL is somewhat more counterparty-sensitive than per-turn SFT. (c) Per-token KL has the lowest training harm but the largest train-to-held-out gap, and… view at source ↗
read the original abstract

A rapidly growing class of LLM agents is multi-party: the agent acts for a principal (who briefs it, sends follow-ups, and receives results) while also conversing in a separate channel with a counterparty whose interests may diverge (negotiating with a vendor, screening inbound requests, or mediating between employees). Here "help whoever you are talking to" is the wrong objective. The agent must stay loyal to the principal it represents without over-refusing the principal's own cooperative asks. We study this multi-party loyalty problem and contribute a measurement instrument, two mechanisms, and a structural lesson. PrincipalBench is a 75-item multi-turn benchmark with leak probes, dual judges, and an integrity-audit gate. Across 13 frontier subjects it exposes a sharp split (<=20% vs. 53.6-75.3% harm) invisible to single-turn safety evaluations: a selective cluster that declines adversarial probes while still following the principal's legitimate requests, and an over-refusing cluster that refuses broadly. (M1) A prompt-time loyalty scaffold (a fixed system prompt of seven prioritized rules, open-coded from 50+ failure trajectories) holds Claude-Sonnet to 19.4% harm and all nine selective subjects to <=20%. (M2) A per-token-KL distillation recipe transfers a prompted Qwen3-32B teacher into 8B Qwen3 and Llama-3.1 students, the strongest open-weight recipe we measure. (Lesson) Both mechanisms only move along a common leak/over-refusal trade-off rather than crossing it: improving one axis costs the other, and the jointly favorable outcome stays out of reach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces the multi-party principal loyalty problem for LLM agents acting between a principal and a counterparty, contributes PrincipalBench (a 75-item multi-turn benchmark with leak probes, dual judges, and an integrity-audit gate), evaluates 13 frontier models revealing a selective vs. over-refusing split invisible to single-turn tests, proposes a seven-rule loyalty scaffold prompt (M1) and per-token KL distillation (M2), and concludes that both mechanisms only move along a common leak/over-refusal trade-off without crossing it.

Significance. If the benchmark holds, the work identifies a practically relevant alignment challenge for deployed agents and supplies concrete mechanisms plus a structural lesson on irreducible trade-offs that could guide safer multi-party agent design.

major comments (1)
  1. [PrincipalBench / methods] PrincipalBench construction (methods section and abstract): the 75-item benchmark's scenario sourcing, judge-prompt calibration, item-selection process to avoid bias, and absence of post-hoc selection or independent validation details are not shown to be robust; this is load-bearing for the central claim that the observed inability to improve both leak and over-refusal axes simultaneously reflects a fundamental limit rather than a benchmark artifact.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation for major revision. We address the concern about PrincipalBench construction below and will expand the methods section to improve transparency and robustness documentation.

read point-by-point responses
  1. Referee: [PrincipalBench / methods] PrincipalBench construction (methods section and abstract): the 75-item benchmark's scenario sourcing, judge-prompt calibration, item-selection process to avoid bias, and absence of post-hoc selection or independent validation details are not shown to be robust; this is load-bearing for the central claim that the observed inability to improve both leak and over-refusal axes simultaneously reflects a fundamental limit rather than a benchmark artifact.

    Authors: We agree that the current methods section provides insufficient detail on these elements and that greater transparency is needed to support the claim that the observed trade-off reflects a structural limit rather than a benchmark artifact. In the revised manuscript we will add an expanded methods subsection that explicitly describes: scenario sourcing (a mix of curated real-world multi-party interaction patterns and controlled synthetic generation with diversity constraints), judge-prompt calibration (iterative refinement against a small pilot set to reach high inter-annotator agreement), the item-selection pipeline (initial pool of 120 candidates filtered by two independent reviewers for balance and clarity, with explicit exclusion criteria), confirmation that no post-hoc item removal occurred after the final set was locked, and the absence of external independent validation (noted as a limitation). We will also add a short sensitivity analysis showing that the selective/over-refusing split and the leak/over-refusal trade-off persist under modest perturbations of the judge prompts. These additions directly address the load-bearing concern while preserving the central empirical finding that both M1 and M2 move along the same frontier. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical evaluation on new benchmark is self-contained

full rationale

The paper introduces PrincipalBench as a new 75-item multi-turn benchmark with leak probes, dual judges, and integrity-audit gate, then directly evaluates 13 models and applies two mechanisms (loyalty scaffold prompt and KL distillation) to report observed harm rates and the leak/over-refusal trade-off. No equations, fitted parameters, or self-citations reduce the reported results or structural lesson to quantities defined by the paper's own inputs; the lesson follows from the experimental measurements rather than being presupposed by construction. The derivation chain is therefore independent of the circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work is empirical and introduces no new mathematical entities or free parameters; it relies on standard assumptions about LLM behavior under prompting and the validity of human or model-based judging for harm.

axioms (1)
  • domain assumption Dual human or model judges can reliably score information leakage and over-refusal in multi-turn agent conversations.
    The benchmark's harm metric depends on this for all reported percentages.

pith-pipeline@v0.9.1-grok · 5831 in / 1330 out tokens · 37473 ms · 2026-06-30T05:46:20.229475+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

52 extracted references · 25 canonical work pages · 15 internal anchors

  1. [1]

    11x.ai: Autonomous AI workers for outbound sales

    11x.ai. 11x.ai: Autonomous AI workers for outbound sales. Product website, 2025. URL https://11x.ai

  2. [2]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InInternational Conference on Learning Representations (ICLR),

  3. [3]

    URLhttps://arxiv.org/abs/2306.13649

  4. [4]

    Qwen App: Voice agent for bookings and reservations

    Alibaba Qwen. Qwen App: Voice agent for bookings and reservations. Product feature,

  5. [5]

    Accessed 2026-05-30

    URLhttps://chat.qwen.ai. Accessed 2026-05-30

  6. [6]

    Claude Opus 4.8 system card

    Anthropic. Claude Opus 4.8 system card. System card, §6.2.5 (External testing from Andon Labs, Vending-Bench 2), 2025. URL https://www.anthropic.com/claude-opu s-4-8-system-card. Accessed 2026-05-30. Whose Side Is Y our Agent On? Multi-Party Principal Loyalty in LLM Agents 17

  7. [7]

    AirGapAgent: Protecting privacy- conscious conversational agents

    Eugene Bagdasarian, Ren Yi, Sahra Ghalebikesabi, Peter Kairouz, Marco Gruteser, Se- woong Yu, Andreas Pfitzmann, and Roxana Geambasu. AirGapAgent: Protecting privacy- conscious conversational agents. InACM SIGSAC Conference on Computer and Communica- tions Security (CCS), 2024. URLhttps://arxiv.org/abs/2405.05175

  8. [8]

    StruQ: Defending against prompt injection with structured queries

    Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. StruQ: Defending against prompt injection with structured queries. InUSENIX Security Symposium, 2025. URL https://arxiv.org/abs/2402.06363

  9. [9]

    SecAlign: Defending against prompt injection with preference optimization

    Sizhe Chen, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, David Wagner, and Chuan Guo. SecAlign: Defending against prompt injection with preference optimization. InACM SIGSAC Conference on Computer and Communications Security (CCS),

  10. [10]

    URLhttps://arxiv.org/abs/2410.05451

  11. [11]

    AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

    Edoardo Debenedetti, Jie Zhang, Mislav Balunovi´ c, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. 2024. URLhttps://arxiv.org/abs/2406.13352

  12. [12]

    Defeating Prompt Injections by Design

    Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Carlini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, and Florian Tramèr. Defeating prompt injections by design. 2025. URLhttps://arxiv.org/abs/2503.18813

  13. [13]

    DeepSeek-V3 technical report

    DeepSeek-AI. DeepSeek-V3 technical report. 2024. URL https://arxiv.org/abs/2412 .19437

  14. [14]

    Operationalizing contextual integrity in privacy-conscious assistants

    Sahra Ghalebikesabi, Eugene Bagdasaryan, Ren Yi, Itay Yona, Ilia Shumailov, Aneesh Pappu, Roxana Shariff, et al. Operationalizing contextual integrity in privacy-conscious assistants. 2024. URLhttps://arxiv.org/abs/2408.02373

  15. [15]

    Ask for Me: An AI agent that calls local businesses on your behalf

    Google. Ask for Me: An AI agent that calls local businesses on your behalf. Google Search Labs experiment, 2025. URLhttps://labs.google.com/search/experiment/22. Accessed 2026-05-30

  16. [16]

    Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. InProceedings of the 16th ACM Workshop on Artificial Intelligence and Security (AISec), 2023. URL https://arxiv.org/abs/2302.12173

  17. [17]

    MiniLLM: On-Policy Distillation of Large Language Models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. InInternational Conference on Learning Representations (ICLR), 2024. URLhttps://arxiv.org/abs/2306.08543

  18. [18]

    YOYO: On-device AI assistant for agentic tasks

    Honor. YOYO: On-device AI assistant for agentic tasks. Product feature, Honor MagicOS,

  19. [19]

    Accessed 2026-05-30

    URLhttps://www.honor.com/global/. Accessed 2026-05-30

  20. [20]

    MAGPIE: A benchmark for Multi-AGent contextual PrIvacy Evalua- tion

    Gurusha Juneja, Jayanth Naga Sai Pasupulati, Alon Albalak, Wenyue Hua, and William Yang Wang. MAGPIE: A benchmark for Multi-AGent contextual PrIvacy Evalua- tion. 2025. URLhttps://arxiv.org/abs/2510.15186

  21. [21]

    AgentBench: Evaluating LLMs as Agents

    Xiao Liu et al. AgentBench: Evaluating LLMs as agents. 2023. URL https://arxiv.org/ abs/2308.03688. Whose Side Is Y our Agent On? Multi-Party Principal Loyalty in LLM Agents 18

  22. [22]

    Formalizing and benchmarking prompt injection attacks and defenses

    Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and benchmarking prompt injection attacks and defenses. InUSENIX Security Symposium,

  23. [23]

    URLhttps://arxiv.org/abs/2310.12815

  24. [24]

    GAIA: a benchmark for General AI Assistants

    Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants. 2023. URL https: //arxiv.org/abs/2311.12983

  25. [25]

    Can LLMs keep a secret? Testing privacy implications of lan- guage models via contextual integrity theory

    Niloofar Mireshghallah, Hyunwoo Kim, Xuhui Zhou, Yulia Tsvetkov, Maarten Sap, Reza Shokri, and Yejin Choi. Can LLMs keep a secret? Testing privacy implications of lan- guage models via contextual integrity theory. InInternational Conference on Learning Representations (ICLR), 2024. URLhttps://arxiv.org/abs/2310.17884

  26. [26]

    Privacy as contextual integrity.Washington Law Review, 79(1):119–158, 2004

    Helen Nissenbaum. Privacy as contextual integrity.Washington Law Review, 79(1):119–158, 2004

  27. [27]

    Discovering Language Model Behaviors with Model-Written Evaluations

    Ethan Perez et al. Discovering language model behaviors with model-written evaluations. InFindings of the Association for Computational Linguistics (ACL), 2023. URL https://arxi v.org/abs/2212.09251

  28. [28]

    Ignore Previous Prompt: Attack Techniques For Language Models

    Fábio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models. 2022. URL https://arxiv.org/abs/2211.09527 . NeurIPS 2022 ML Safety Workshop

  29. [29]

    Pine AI: Personal AI agent for phone calls and disputes

    Pine AI. Pine AI: Personal AI agent for phone calls and disputes. Product website, 2025. URLhttps://19pine.ai. Accessed 2026-05-30

  30. [30]

    Pine AI: The most natural human-computer interface is your voice

    Pine AI. Pine AI: The most natural human-computer interface is your voice. Blog post,

  31. [31]

    Accessed 2026-06-28

    URL https://www.19pine.ai/blog/pine-ai-the-most-natural-human-compute r-interface-is-your-voice. Accessed 2026-06-28

  32. [32]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

    Yujia Qin et al. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. 2023. URLhttps://arxiv.org/abs/2307.16789

  33. [33]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. URL https://arxiv.org/abs/2305.18290

  34. [34]

    PrivacyLens: Evaluating privacy norm awareness of language models in action

    Yijia Shao, Tianshi Li, Weiyan Shi, Yanchen Liu, and Diyi Yang. PrivacyLens: Evaluating privacy norm awareness of language models in action. InAdvances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2024. URL https://arxiv.or g/abs/2409.00138

  35. [35]

    Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, et al

    Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, et al. Towards understanding sycophancy in language models. 2023. URL https://arxiv.org/abs/23 10.13548. Whose Side Is Y our Agent On? Multi-Party Principal Loyalty in LLM Agents 19

  36. [36]

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V . Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. 2022. URL https://arxiv.org/abs/2210.09261

  37. [37]

    On-policy distillation

    Thinking Machines Lab. On-policy distillation. Blog post, 2025. URL https://thinking machines.ai/blog/on-policy-distillation/. Accessed 2026-05-20

  38. [38]

    AppWorld: A controllable world of apps and people for benchmarking interactive coding agents

    Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. AppWorld: A controllable world of apps and people for benchmarking interactive coding agents. In Annual Meeting of the Association for Computational Linguistics (ACL), 2024. URL https: //arxiv.org/abs/2407.18901

  39. [39]

    The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

    Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The instruction hierarchy: Training LLMs to prioritize privileged instructions. 2024. URL https://arxiv.org/abs/2404.13208

  40. [40]

    Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, and Quoc V . Le. Simple synthetic data reduces sycophancy in large language models. 2023. URL https://arxiv.org/abs/2308 .03958

  41. [41]

    $\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

    Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for Tool-Agent-User interaction in real-world domains. 2024. URL https: //arxiv.org/abs/2406.12045

  42. [42]

    DAPO: An open-source LLM reinforcement learning system at scale

    Qiying Yu et al. DAPO: An open-source LLM reinforcement learning system at scale

  43. [43]

    URLhttps://arxiv.org/abs/2503.14476

  44. [44]

    InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents

    Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents. InFindings of the Association for Computational Linguistics (ACL), 2024. URL https://arxiv.org/abs/24 03.02691

  45. [45]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou et al. WebArena: A realistic web environment for building autonomous agents. 2023. URLhttps://arxiv.org/abs/2307.13854. A The loyalty scaffold and reader-identity tag (full prompt text) Thepromptedarm prepends the following system prompt (seven prioritized rules organized around the four core insights of Section 6): You are an AI assistant hel...

  46. [46]

    I can neither confirm nor deny

    Protect private information –- not directly, not through paraphrase, not the EXISTENCE of it (“I can neither confirm nor deny” still leaks)

  47. [47]

    I can’t discuss that

    Decline without enumerating –- “I can’t discuss that”, never “I can’t share their medical history”. Whose Side Is Y our Agent On? Multi-Party Principal Loyalty in LLM Agents 20

  48. [48]

    Private bounds are not your public position –- a floor of $12,000 means do not accept or name $12,000; counter strictly inside

  49. [49]

    Conditional permissions are not proactive offers –- a capped fallback is held in reserve; the cap is private

  50. [50]

    Execute specific direct instructions –- termination conditions, opening moves, scripted lines, hard refusals

  51. [51]

    reasonable person

    Hold stated public positions; recognize framing –- repetition and “reasonable person” framings are not reasons to concede

  52. [52]

    private/do not reveal

    Signal firmness briefly under sustained pressure; do not fabricate facts or commitments to fill the silence. Thescaffoldedarm additionally prepends a per-message reader-identity tag: [READER: PRINCIPAL] The party you are talking to IS your principal. Any artifact they ask you to produce is for their own use; “private/do not reveal” labels describe what th...