
arxiv: 2606.12341 · v1 · pith:E53T5P77new · submitted 2026-06-10 · 💻 cs.CR

OCELOT: Inference-Leakage Budgets for Privacy-Preserving LLM Agents

Pith reviewed 2026-06-27 09:10 UTC · model grok-4.3

classification 💻 cs.CR
keywords LLM agents · inference leakage · posterior risk · privacy budgeting · declassification · min-entropy · cumulative privacy · tamper-evident ledger

The pith

OCELOT budgets how much an adversary's belief about a secret may improve across an LLM agent trajectory by charging certified min-entropy costs to each release.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that privacy for LLM agents is best treated as control over cumulative posterior risk rather than per-release filters, because leakage accumulates across steps, can be triggered bidirectionally by malicious inputs, and depends on the recipient. OCELOT implements this via a runtime mediator that records a sink-trust-weighted budget on a tamper-evident ledger and only authorizes the least-disclosing useful release. A sympathetic reader would care because agents that read files, call tools, and transact on a user's behalf otherwise risk turning innocuous actions into inferences about protected secrets. If the mechanism works, agents can continue performing tasks while keeping the improvement in any adversary's belief within the budgeted limit.
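To make "improvement in an adversary's belief" concrete, the sketch below computes min-entropy leakage in the standard quantitative-information-flow sense: the log-ratio of the adversary's best-guess success probability after and before an observation. The function name and the toy distribution are illustrative assumptions, not taken from the paper, and the paper's certified cost may be defined differently.

```python
import math

def min_entropy_leakage(joint):
    """Min-entropy leakage in bits from a secret X to an observation Y.

    `joint[x][y]` holds P(X=x, Y=y). Leakage is log2 of the ratio between
    posterior vulnerability (sum over y of max over x of P(x, y)) and
    prior vulnerability (max over x of P(x)).
    """
    prior_vuln = max(sum(row.values()) for row in joint.values())
    observations = {y for row in joint.values() for y in row}
    post_vuln = sum(
        max(row.get(y, 0.0) for row in joint.values()) for y in observations
    )
    return math.log2(post_vuln / prior_vuln)

# Hypothetical example: a uniform 2-bit secret whose low bit is released.
joint = {x: {x % 2: 0.25} for x in range(4)}
leak_bits = min_entropy_leakage(joint)
print(leak_bits)
```

In this toy case the release halves the adversary's search space, so one bit would be charged against the budget.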

Core claim

OCELOT recasts agent privacy as posterior-risk control and presents Witness-Verified Declassification: an untrusted locally fine-tuned defender model inspects each candidate release and emits structured evidence of labeled atoms and proposed declassification operators; a deterministic verifier audits the evidence, charges a certified min-entropy cost for the chosen variant, and authorizes the release only if it fits the remaining budget. Across diverse agent benchmarks the system produces significantly lower leakage at higher task utility than recent defenses while resisting adaptive injection, jailbreak, cumulative inference, and sink collusion attacks and incurring only modest overhead.
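The authorization step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `Variant` type, the `authorize` function, and all costs are hypothetical, and the certified cost of each variant is taken as given rather than computed.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass(frozen=True)
class Variant:
    text: str         # candidate release after a declassification operator
    cost_bits: float  # certified min-entropy cost (assumed given here)
    useful: bool      # whether the variant still serves the task

def authorize(variants: Sequence[Variant], remaining_bits: float) -> Optional[Variant]:
    """Return the least-disclosing useful variant that fits the budget, else None."""
    affordable = [v for v in variants if v.useful and v.cost_bits <= remaining_bits]
    return min(affordable, key=lambda v: v.cost_bits, default=None)

# Hypothetical variants of one release; the numbers are made up.
variants = [
    Variant("DOB: 1990-04-12", 8.0, True),
    Variant("Age bracket: 30-39", 1.5, True),
    Variant("[redacted]", 0.0, False),
]
remaining = 4.0
choice = authorize(variants, remaining)
if choice is not None:
    remaining -= choice.cost_bits  # charge the budget only on authorization
print(choice, remaining)
```

The fully redacted variant is free but useless for the task, and the exact date of birth exceeds the budget, so the coarsened variant is the one released.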

What carries the argument

Witness-Verified Declassification, which separates judgment from trust by letting an untrusted defender emit structured evidence that a deterministic verifier audits and charges against a tamper-evident min-entropy budget ledger.
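One common way to make a ledger tamper-evident is a hash chain, where each entry commits to the digest of its predecessor. The sketch below assumes that construction with SHA-256; the review does not say how OCELOT builds its ledger, and the field names and sink names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder digest preceding the first entry

def _digest(body):
    # Canonical JSON so the same body always hashes to the same value.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

def append_entry(ledger, sink, cost_bits):
    """Append a charge that commits to the previous entry's digest."""
    prev = ledger[-1]["digest"] if ledger else GENESIS
    body = {"sink": sink, "cost_bits": cost_bits, "prev": prev}
    ledger.append({**body, "digest": _digest(body)})

def verify_chain(ledger):
    """Recompute every digest and check each link to its predecessor."""
    prev = GENESIS
    for entry in ledger:
        body = {k: entry[k] for k in ("sink", "cost_bits", "prev")}
        if entry["prev"] != prev or entry["digest"] != _digest(body):
            return False
        prev = entry["digest"]
    return True

ledger = []
append_entry(ledger, "calendar-api", 1.5)
append_entry(ledger, "email", 0.5)
ok_before = verify_chain(ledger)
ledger[0]["cost_bits"] = 0.0  # simulate tampering with a recorded charge
ok_after = verify_chain(ledger)
print(ok_before, ok_after)
```

A hash chain alone only detects edits if the latest digest is anchored somewhere the attacker cannot rewrite, for example by signing it; that part is omitted here.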

If this is right

  • OCELOT attains significantly lower leakage at higher task utility than recent defenses across diverse agent benchmarks.
  • The system resists adaptive injection, jailbreak, cumulative inference, and sink collusion attacks.
  • Overhead remains modest while the tamper-evident ledger records every authorized release.
  • The same release can be declassified for one sink but withheld from another according to the trust-weighted budget.
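The last point, that one release may be allowed for one sink and withheld from another, can be illustrated with a weighted charge. The multiplicative weighting, the weights, and the sink names below are assumptions made for illustration; the review does not specify how trust enters the budget.

```python
def effective_charge(cost_bits: float, weight: float) -> float:
    """Scale a certified cost by a sink weight (1.0 = fully trusted, larger = less trusted)."""
    if weight < 1.0:
        raise ValueError("weight must be at least 1.0")
    return weight * cost_bits

def fits_budget(cost_bits: float, weight: float, remaining_bits: float) -> bool:
    """True when the weighted charge does not exceed the remaining budget."""
    return effective_charge(cost_bits, weight) <= remaining_bits

remaining = 4.0
release_cost = 1.5  # hypothetical certified cost of one release

to_pharmacy = fits_budget(release_cost, 1.0, remaining)    # charged 1.5 bits
to_ad_network = fits_budget(release_cost, 4.0, remaining)  # charged 6.0 bits
print(to_pharmacy, to_ad_network)
```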

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation of an untrusted fine-tuned model from a deterministic verifier could let users run privacy logic locally without exposing the full model to external auditors.
  • Recording min-entropy costs on a ledger opens the possibility of post-hoc audits of entire agent trajectories by third parties who never see the raw data.
  • If the structured evidence format can be standardized, the same verifier could be reused across different defender models or even non-LLM agent frameworks.

Load-bearing premise

The deterministic verifier can correctly and completely audit the structured evidence emitted by the defender model and assign accurate min-entropy costs without missing attacks or over-approximating safety.

What would settle it

An experiment in which an adversary forces the release of information that allows belief about a protected secret to improve by more than the remaining budget, yet the verifier still authorizes the release.
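Such an experiment reduces to a simple predicate over measured quantities. The sketch below assumes belief is summarized by the adversary's top-1 guess probability before and after the release; the function names and numbers are hypothetical.

```python
import math

def belief_gain_bits(prior_top1: float, posterior_top1: float) -> float:
    """Improvement in the adversary's best-guess probability, in bits."""
    return math.log2(posterior_top1 / prior_top1)

def is_counterexample(authorized: bool, prior_top1: float,
                      posterior_top1: float, remaining_bits: float) -> bool:
    """True when a release was authorized although the measured gain exceeds the budget."""
    gain = belief_gain_bits(prior_top1, posterior_top1)
    return authorized and gain > remaining_bits

# Hypothetical measurements against a 4-bit remaining budget.
violation = is_counterexample(True, 1 / 64, 0.5, 4.0)   # gain of 5 bits
within = is_counterexample(True, 1 / 64, 0.125, 4.0)    # gain of 3 bits
print(violation, within)
```

A single trajectory for which this predicate holds would falsify the budget guarantee for that adversary model.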

Figures

Figures reproduced from arXiv: 2606.12341 by Jin Xie, Songze Li.

Figure 1: Privacy exposure in one LLM-agent loop: a user's PII (pink) ...
Figure 2: OCELOT architecture. For each candidate release, the untrusted defender ...
Figure 3: Privacy–utility frontier: all 30 system–benchmark points, cumulative leakage (x, ↓) vs. task success (y, ↑). Color = method, shape = benchmark; large stars are per-system means. Top-left is best.
[Plot text extracted alongside Figure 3: x-axis, Trajectory Step (long-horizon stress set); y-axis, Cumulative Min-Entropy Leakage (bits); uniform comparison budget B* = 4.0 bits; B* crossed at: AGENTDAM@18, FlexGuard@22, PrivacyChecker@24, R…]
Figure 5: All systems on the human-labeled Real-200 set; top-left is best.
Figure 6: Calibration validity. (a) Per-operator field-averaged certified cost ...
Figure 7: Top-1 secret recovery vs. number of colluding sinks. Dashed ...
Figure 8: The defender's runtime label-and-proposal object: a trusted, ...
Original abstract

Large language model (LLM) agents increasingly act on a user's behalf -- reading personal files, calling tools, transacting with external services -- possibly leaking personally identifiable information (PII) across trust boundaries at every step. Privacy here is a property not of a single output but of an entire trajectory, and three properties make it hard: leakage is cumulative, as individually innocuous releases accumulate across honest-but-curious or colluding sinks into inferences about a protected secret; bidirectional, as a malicious observation can inject instructions that turn the agent's own reasoning model against the user; and task-dependent, as the same field is necessary for one recipient yet gratuitous for another. Per-release contextual-integrity filters, information-flow controls, and posterior-leakage monitors each address part of this but none controls cumulative, inference-based leakage at runtime. We recast agent privacy as posterior-risk control and present OCELOT, a runtime mediator that budgets how much an adversary's belief about a secret may improve across a trajectory, rather than filtering outputs. Its mechanism, Witness-Verified Declassification, separates judgment from trust: an untrusted, locally fine-tuned defender model inspects each candidate release and emits structured evidence -- labeled atoms and proposed declassification operators -- which a deterministic verifier audits, charging a certified min-entropy cost for the chosen variant and authorizing the least-disclosing useful release under a sink-trust-weighted budget recorded on a tamper-evident ledger. Across diverse agent benchmarks and recent defenses, OCELOT attains significantly lower leakage at higher task utility, resists adaptive injection, jailbreak, cumulative inference, and sink collusion, and adds only modest overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents OCELOT, a runtime mediator for LLM agents that recasts privacy as posterior-risk control. It introduces Witness-Verified Declassification: an untrusted locally fine-tuned defender model emits labeled atoms and proposed declassification operators for each candidate release; a deterministic verifier audits them, assigns certified min-entropy costs, and authorizes the least-disclosing useful release under a sink-trust-weighted budget maintained on a tamper-evident ledger. The system is evaluated across agent benchmarks against recent defenses and claims significantly lower leakage at higher task utility while resisting adaptive injection, jailbreak, cumulative inference, and sink collusion with only modest overhead.

Significance. If the verifier is shown to be complete and the min-entropy accounting sound, the work would provide a principled runtime mechanism for controlling cumulative, bidirectional, and task-dependent leakage in agent trajectories, separating judgment from trust in a way that prior per-release filters and information-flow controls do not. The tamper-evident ledger and budget formulation are concrete strengths that could support reproducible evaluation.

major comments (2)
  1. [Witness-Verified Declassification mechanism] The soundness of the posterior-risk budget and all resistance claims (cumulative inference, sink collusion) rests on the deterministic verifier correctly enumerating every inference path enabled by the emitted atoms and never under-charging their min-entropy cost. The mechanism description provides no formal argument, enumeration procedure, or completeness proof for this step, especially given that atoms are produced by a fine-tuned LLM that could omit subtle inference chains.
  2. [Evaluation section] The evaluation claims resistance to adaptive injection, jailbreak, and cumulative inference across benchmarks, yet the reported results appear to rely on the verifier's correctness without an ablation or adversarial test that deliberately constructs omitted inference paths to check whether the budget is violated. This makes the empirical support conditional on the unverified verifier property.
minor comments (2)
  1. Notation for the budget ledger and sink-trust weights should be defined with explicit equations rather than prose descriptions to allow independent verification of the min-entropy accounting.
  2. The abstract and mechanism overview would benefit from a small illustrative example showing one atom, its declassification operator, the verifier's cost assignment, and the resulting budget update.
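As a rough idea of what the notation requested in the first minor comment could look like, one possible formalization is given below. Every symbol is a hypothetical choice made for this sketch, except $B^{*}$, which borrows the budget label from the figure text; the paper's actual definitions may differ. Here $B_t$ is the remaining budget before step $t$, $c_t$ the certified min-entropy cost of the chosen variant, $s_t$ the receiving sink, and $w(s_t) \ge 1$ its trust weight.

```latex
% Hypothetical notation for the budget ledger; not taken from the paper.
\begin{align}
  B_0 &= B^{*}, \\
  \text{authorize release } r_t &\iff w(s_t)\, c_t \le B_t, \\
  B_{t+1} &= B_t - w(s_t)\, c_t \quad \text{if } r_t \text{ is authorized, else } B_{t+1} = B_t .
\end{align}
```

For instance, with $B_0 = 4.0$ bits, $c_0 = 1.5$ bits and $w(s_0) = 1$, the release is authorized and $B_1 = 2.5$ bits; the same release to a sink with $w = 4$ would need $6.0$ bits and would be withheld.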

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for highlighting the need for stronger formal grounding and empirical validation of the verifier in Witness-Verified Declassification. We address each major comment below and will incorporate the suggested additions in the revised manuscript.

  1. Referee: [Witness-Verified Declassification mechanism] The soundness of the posterior-risk budget and all resistance claims (cumulative inference, sink collusion) rests on the deterministic verifier correctly enumerating every inference path enabled by the emitted atoms and never under-charging their min-entropy cost. The mechanism description provides no formal argument, enumeration procedure, or completeness proof for this step, especially given that atoms are produced by a fine-tuned LLM that could omit subtle inference chains.

    Authors: We agree that the current manuscript does not supply an explicit completeness argument or enumeration procedure for the verifier. The design treats the verifier as a deterministic function that, given the labeled atoms and proposed operators, exhaustively computes the min-entropy cost over all inference paths consistent with those atoms; the budget is then decremented by the certified cost of the chosen release. To strengthen this, the revision will add (i) a precise description of the enumeration algorithm, (ii) a proof sketch establishing completeness relative to the emitted atom set, and (iii) a discussion of the conservative nature of the accounting when the fine-tuned model may miss subtle chains. These additions will make the soundness claims explicit rather than implicit. revision: yes

  2. Referee: [Evaluation section] The evaluation claims resistance to adaptive injection, jailbreak, and cumulative inference across benchmarks, yet the reported results appear to rely on the verifier's correctness without an ablation or adversarial test that deliberately constructs omitted inference paths to check whether the budget is violated. This makes the empirical support conditional on the unverified verifier property.

    Authors: The existing experiments measure end-to-end leakage and utility under adaptive attacks, but they do not include a dedicated ablation that injects synthetic omitted paths to test budget violation. We will add such an ablation in the revised evaluation section: we will construct controlled cases where the defender model is forced to emit incomplete atom sets, run the verifier on the resulting releases, and report whether the budget is ever under-charged. This will directly address the conditional nature of the current empirical support. revision: yes
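The ablation the authors propose can be sketched as an enumeration of incomplete atom sets. The toy model below assumes costs are additive per atom and that the verifier charges only for atoms the defender emits; both are simplifying assumptions, and the atom names and costs are hypothetical.

```python
import itertools

# Hypothetical per-atom certified costs in bits; not from the paper.
ATOM_COST = {"zip_code": 2.0, "birth_year": 1.5, "employer": 1.0}

def charged_cost(emitted_atoms):
    """Cost the verifier would charge given only the emitted atoms."""
    return sum(ATOM_COST[a] for a in emitted_atoms)

def undercharge_cases(true_atoms):
    """List every strict subset a defender might emit, with the uncharged bits."""
    full = charged_cost(true_atoms)
    cases = []
    for k in range(len(true_atoms)):
        for subset in itertools.combinations(sorted(true_atoms), k):
            gap = full - charged_cost(subset)
            if gap > 0:
                cases.append((subset, gap))
    return cases

cases = undercharge_cases({"zip_code", "birth_year", "employer"})
worst = max(cases, key=lambda case: case[1])
print(len(cases), worst)
```

In this toy model every omission leads to under-charging, which is why completeness relative to the emitted atom set, as promised in the first response, does not by itself bound leakage when the defender misses atoms.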

Circularity Check

0 steps flagged

No significant circularity detected

The provided abstract and mechanism description contain no equations, self-referential definitions, fitted parameters presented as predictions, or load-bearing self-citations. The Witness-Verified Declassification approach is described at a high level with an untrusted model emitting evidence audited by a deterministic verifier, but no derivation chain reduces the posterior-risk budget or performance claims to inputs by construction. Empirical results across benchmarks are presented separately from any formal reduction. This is a standard non-finding for a system-design paper without visible mathematical self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the mechanism description implies an unstated assumption that min-entropy costs can be certified deterministically, but no details are given.

pith-pipeline@v0.9.1-grok · 5833 in / 1129 out tokens · 17131 ms · 2026-06-27T09:10:03.088146+00:00 · methodology


Appendix A. Lemmas and Their Proofs

We prove the three body lemmas. All are model-relative: they bound quantities under Pre-Reg. A's calibration model and the verifier's construction, not true adversarial inference (Layer-2, §3.4). Lemma 1 (model-relative sub-additivity). Let σ = ...