pith. machine review for the scientific record.

arxiv: 2605.08646 · v1 · submitted 2026-05-09 · 💻 cs.LG · cs.CL · cs.DC

Recognition: 2 theorem links


PAAC: Privacy-Aware Agentic Device-Cloud Collaboration

Christopher G. Brinton, Liangqi Yuan, Shiqiang Wang, Wenzhi Fang

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 00:58 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · cs.DC
keywords privacy-aware agents · device-cloud collaboration · LLM agents · data sanitization · agentic workflows · placeholder tokens · privacy-accuracy trade-off

The pith

PAAC splits LLM agent tasks across device and cloud using typed placeholders so the cloud reasons without seeing private data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops PAAC to resolve the tension between cloud reasoning power and on-device privacy in LLM agents. It aligns the planner-executor split with the device-cloud boundary by replacing sensitive values with typed placeholders that retain only their structural role. The on-device agent proposes masks for sensitive spans and condenses execution results into key findings, while a deterministic registry handles all substitutions and reversals. This yields 15-36% higher average accuracy and 2-6× lower leakage than prior device-cloud baselines on three agentic benchmarks, with gains holding across 17 additional benchmarks spanning math, science, and finance. The largest improvements appear when privacy targets fall outside fixed entity categories.
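
As a reading aid, here is a minimal Python sketch of the per-step loop that description implies. Every name and interface below (propose_masks, distill, plan, substitute, reverse) is an illustrative assumption, not the paper's API.

```python
# Illustrative per-step device-cloud loop implied by the pith above.
# All names are hypothetical stand-ins, not the paper's actual interfaces.

def paac_step(task_text, device_llm, cloud_llm, registry, tools, cloud_memory):
    # On-device LLM only *proposes* which spans are sensitive.
    spans = device_llm.propose_masks(task_text)

    # A deterministic registry swaps each span for a typed placeholder.
    sanitized = registry.substitute(task_text, spans)

    # The cloud plans over placeholders; raw values never leave the device.
    plan = cloud_llm.plan(sanitized, tools, cloud_memory)

    # The registry restores real values so the action executes on device.
    result = tools.execute(registry.reverse(plan.action))

    # The device condenses the raw result into compact key findings,
    # re-sanitized before anything is sent back to the cloud's memory.
    findings = device_llm.distill(result)
    cloud_memory.append(
        registry.substitute(findings, device_llm.propose_masks(findings))
    )
    return plan.terminate
```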

Core claim

PAAC aligns planner-executor decomposition with the device-cloud boundary so that role specialization itself becomes the privacy mechanism: the cloud reasons over typed placeholder tokens that preserve each sensitive value's reasoning role while discarding its content; the on-device agent identifies sensitive spans and distills each step's execution outcome into compact key findings; and sanitization confines the on-device LLM to proposing masks while a deterministic registry performs all substitution and reversal.

What carries the argument

Typed placeholder tokens that encode the reasoning role of each sensitive value, combined with a deterministic registry that performs substitution and reversal without relying on the on-device LLM for those operations.
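
To make the division of labor concrete, a minimal sketch of what such a registry might look like, assuming the device LLM emits (value, type) pairs; the class name, method names, and placeholder format are illustrative, not taken from the paper.

```python
class PlaceholderRegistry:
    """Hypothetical deterministic registry: substitution and reversal are
    plain string operations, so no LLM can corrupt or hallucinate values."""

    def __init__(self):
        self.mapping = {}    # placeholder token -> original value
        self.counters = {}   # type label -> running index

    def substitute(self, text, spans):
        # spans: (sensitive_value, type_label) pairs proposed by the device LLM.
        for value, label in spans:
            self.counters[label] = self.counters.get(label, 0) + 1
            token = f"[{label}_{self.counters[label]}]"
            self.mapping[token] = value
            text = text.replace(value, token)
        return text

    def reverse(self, text):
        # Deterministic reversal restores real values for on-device execution.
        for token, value in self.mapping.items():
            text = text.replace(token, value)
        return text

registry = PlaceholderRegistry()
masked = registry.substitute(
    "Book a flight for Alice Chen on 2026-06-01.",
    [("Alice Chen", "NAME"), ("2026-06-01", "DATE")],
)
# masked == "Book a flight for [NAME_1] on [DATE_1]."
assert registry.reverse(masked) == "Book a flight for Alice Chen on 2026-06-01."
```

The design point is that only span *proposal* is probabilistic; once spans are chosen, masking and unmasking cannot drift.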

If this is right

  • The same decomposition improves performance consistently across 17 additional benchmarks spanning math, science, and finance.
  • Privacy gains are largest when targets do not fit fixed entity taxonomies.
  • Confining the on-device model to mask proposals and outcome distillation limits error propagation.
  • The approach treats the device-cloud boundary as a trust boundary rather than a simple compute split.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar role-based masking could apply to other multi-party agent systems where functional decomposition matches trust levels.
  • The method implies that agent architectures can embed privacy as a structural property rather than a post-hoc filter.
  • Extending placeholder types to capture more relational context might further reduce information loss on complex tasks.

Load-bearing premise

Typed placeholder tokens must retain enough structural and role information for the cloud to produce correct reasoning steps, and the on-device LLM must reliably identify sensitive spans and distill outcomes without introducing errors that reach the final answer.

What would settle it

A controlled test on an agentic benchmark where replacing real values with typed placeholders causes the cloud planner to output incorrect reasoning steps at a rate that makes final accuracy no higher than existing device-cloud baselines.
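
A rough harness for that test, assuming a planner callable, a task runner, and a sanitizer; nothing here comes from the paper.

```python
# Hypothetical harness for the falsification test described above: does
# swapping real values for typed placeholders alone drag cloud planning
# accuracy down to the existing baselines? `cloud_plan`, `run_task`,
# `sanitize`, and the task format are assumptions for illustration.

def placeholder_ablation(tasks, cloud_plan, run_task, sanitize, baseline_acc):
    raw_hits = placeholder_hits = 0
    for task in tasks:
        raw_hits += run_task(cloud_plan(task), task)                    # real values
        placeholder_hits += run_task(cloud_plan(sanitize(task)), task)  # placeholders
    raw_acc = raw_hits / len(tasks)
    placeholder_acc = placeholder_hits / len(tasks)
    # PAAC's claim would be undermined if placeholder-induced planning
    # errors leave final accuracy at or below device-cloud baselines.
    return {
        "raw_accuracy": raw_acc,
        "placeholder_accuracy": placeholder_acc,
        "settles_against_paac": placeholder_acc <= baseline_acc,
    }
```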

Figures

Figures reproduced from arXiv: 2605.08646 by Christopher G. Brinton, Liangqi Yuan, Shiqiang Wang, Wenzhi Fang.

Figure 1: Comparison of device-cloud LLM collaboration paradigms.
Figure 2: PAAC keeps each agent's per-step input compact, avoiding single-agent trajectory accumulation.
Figure 3: Challenges in privacy sanitization.
Figure 4: Overview of the PAAC framework. The cloud agent reasons over sanitized representations while the on-device agent performs privacy sanitization and execution judgment.
Figure 5: Example privacy policy.
Figure 6: Accuracy vs. privacy leakage rate across three agentic benchmarks.
Figure 7: Accuracy vs. average token cost on GAIA.
Figure 8: Four representative examples comparing privacy sanitization methods.
Figure 9: Alignment error distribution of the on-device LLM (Qwen3-4B) and the large-scale LLM.
Figure 10: Prompt-injection payloads used to evaluate active threats T1 (Sanitize).
Figure 11: Accuracy versus maximum agentic steps across different privacy levels.
Figure 12: Termination dynamics over agentic steps across benchmarks (columns) and privacy levels.
Figure 13: PAAC reasoning trace on τ²-Bench Airline.
Figure 14: PAAC reasoning trace on τ²-Bench Retail.
Figure 15: PAAC reasoning trace on GAIA.
Figure 16: PAAC reasoning trace on GSM8K.
Figure 17: PAAC reasoning trace on MathQA.
Figure 18: PAAC reasoning trace on Geometry3K.
Figure 19: PAAC reasoning trace on MathVista.
Figure 20: PAAC reasoning trace on SciBench.
Figure 21: PAAC reasoning trace on SciQ.
Figure 22: PAAC reasoning trace on TruthfulQA.
Figure 23: PAAC reasoning trace on HotpotQA.
Figure 24: PAAC reasoning trace on FEVER.
Figure 25: PAAC reasoning trace on CLUTRR.
Figure 26: PAAC reasoning trace on AGIEval LSAT-AR.
Figure 27: PAAC reasoning trace on MedQA.
Figure 28: PAAC reasoning trace on FinQA.
Figure 29: PAAC reasoning trace on MMLU Prof. Accounting.
Figure 30: PAAC reasoning trace on MMMU Accounting.
Figure 31: PAAC reasoning trace on Jeopardy History.
Figure 32: PAAC reasoning trace on Jeopardy Literature.
Figure 33: PAAC reasoning trace for Collaborative Scene Reconstruction.
Figure 34: Results on Collaborative Scene Reconstruction.
Figure 35: Prompt template for the cloud agent under the Parallel Plan-and-Solve strategy.
Figure 36: Prompt template for the on-device privacy sanitizer (Algorithm 1, Lines 2 and 8).
Figure 37: Prompt template for the on-device judge, which produces key findings and feedback.
Figure 38: Prompt template for the on-device final answer generator, invoked upon consensus.
Figure 39: Prompt template for the on-device privacy sanitizer reflection, used exclusively in PAAC.
Figure 40: Prompt template for the final answer evaluator; Gemini-3-Flash is used to assess final answers.
read the original abstract

Large language model (LLM) agents face a structural tension: cloud agents provide strong reasoning but expose user data, while on-device agents preserve privacy at the cost of overall capability. Existing device-cloud designs treat this boundary as a compute split rather than a trust boundary suited to agentic workloads, and existing sanitizers force a choice between policy flexibility and the structural fidelity tool calls require. In this work, we develop PAAC, a privacy-aware agentic framework that aligns planner-executor decomposition with the device-cloud boundary so that role specialization itself becomes the privacy mechanism. The cloud agent reasons over typed placeholder tokens that preserve each sensitive value's reasoning role while discarding its content, while the on-device agent identifies sensitive spans and distills each step's execution outcome into compact key findings. Sanitization confines the on-device LLM to proposing which spans to mask, while a deterministic registry performs all substitution and reversal, keeping actions directly executable on device. On three agentic benchmarks under strict privacy settings, PAAC dominates the Pareto frontier of privacy and accuracy, improving average accuracy by 15-36% and reducing average leakage by 2-6× over state-of-the-art device-cloud baselines, with the largest margins on privacy targets outside fixed entity taxonomies. We find consistent improvements on 17 additional benchmarks spanning 10 domains, including math, science, and finance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces PAAC, a privacy-aware agentic framework for LLM agents that aligns planner-executor decomposition with the device-cloud trust boundary. The cloud reasons over typed placeholder tokens that preserve each sensitive value's reasoning role while discarding content; the on-device agent identifies sensitive spans and distills execution outcomes, with a deterministic registry handling all substitution and reversal. The central claim is that, on three agentic benchmarks under strict privacy settings, PAAC dominates the privacy-accuracy Pareto frontier, delivering 15-36% higher average accuracy and 2-6× lower average leakage than state-of-the-art device-cloud baselines (with largest gains on targets outside fixed entity taxonomies), plus consistent improvements on 17 additional benchmarks across 10 domains.

Significance. If the experimental results hold, the work would be significant for private LLM agents by treating the device-cloud boundary as a trust boundary rather than a simple compute split and by using role specialization itself as the privacy mechanism. This avoids the policy-flexibility versus structural-fidelity trade-off of existing sanitizers and could enable more capable on-device agents without exposing raw data.

major comments (2)
  1. Abstract (central claim paragraph): The reported 15-36% accuracy gains and 2-6× leakage reductions are stated without any description of the three agentic benchmarks, how 'strict privacy settings' are enforced, baseline implementations, leakage measurement protocol, error bars, data exclusion rules, or statistical tests. This omission is load-bearing for the Pareto-dominance claim because the numerical improvements cannot be evaluated for robustness or reproducibility from the given text.
  2. Abstract (framework description): The approach relies on two load-bearing assumptions that receive no quantitative support: (1) typed placeholder tokens retain enough structural/role information for the cloud planner to produce correct multi-step reasoning, and (2) the on-device LLM reliably identifies sensitive spans and distills outcomes without injecting errors that propagate to final answers. No ablations on placeholder fidelity, measured on-device error rates, or propagation analysis are mentioned, so the gains cannot be confidently attributed to PAAC rather than baseline artifacts or task-specific factors.
minor comments (1)
  1. Abstract (final sentence): The claim of 'consistent improvements on 17 additional benchmarks spanning 10 domains' is presented without any quantitative details or domain-specific breakdowns, reducing its utility for readers.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We agree that the abstract would benefit from additional context and will revise it accordingly to improve clarity on our claims and framework while preserving conciseness.

read point-by-point responses
  1. Referee: [—] Abstract (central claim paragraph): The reported 15-36% accuracy gains and 2-6× leakage reductions are stated without any description of the three agentic benchmarks, how 'strict privacy settings' are enforced, baseline implementations, leakage measurement protocol, error bars, data exclusion rules, or statistical tests. This omission is load-bearing for the Pareto-dominance claim because the numerical improvements cannot be evaluated for robustness or reproducibility from the given text.

    Authors: We agree that the abstract is too concise to allow full evaluation of the central claims. In the revised version we will expand the abstract with brief descriptions of the three agentic benchmarks, the definition and enforcement of strict privacy settings, the baseline implementations, the leakage measurement protocol, and explicit references to the error bars, data exclusion rules, and statistical tests reported in the main text. revision: yes

  2. Referee: [—] Abstract (framework description): The approach relies on two load-bearing assumptions that receive no quantitative support: (1) typed placeholder tokens retain enough structural/role information for the cloud planner to produce correct multi-step reasoning, and (2) the on-device LLM reliably identifies sensitive spans and distills outcomes without injecting errors that propagate to final answers. No ablations on placeholder fidelity, measured on-device error rates, or propagation analysis are mentioned, so the gains cannot be confidently attributed to PAAC rather than baseline artifacts or task-specific factors.

    Authors: We agree that the abstract does not mention the quantitative validation of these assumptions. We will revise the abstract to state that the structural fidelity of typed placeholders and the reliability of on-device sanitization (including error rates and propagation) are supported by dedicated ablations and analyses presented in the experimental sections of the manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity; architectural framework with empirical results only

full rationale

The abstract describes PAAC via high-level architectural choices (planner-executor split, typed placeholders, deterministic registry) and reports benchmark outcomes (15-36% accuracy gains, 2-6× leakage reduction) without any equations, fitted parameters, derivations, or self-citations. No load-bearing step reduces to its own inputs by construction, as there is no mathematical chain or uniqueness theorem invoked. The text is self-contained at the level of system design and external benchmark comparison.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on standard assumptions about LLM agent decomposability and the feasibility of role-preserving sanitization; no free parameters are introduced, and the two invented entities are architectural components rather than physical posits.

axioms (2)
  • domain assumption · LLM agents can be usefully decomposed into planner and executor roles whose separation aligns with device-cloud trust boundaries.
    Invoked in the design of PAAC as the privacy mechanism.
  • domain assumption · Typed placeholder tokens can preserve sufficient reasoning structure for cloud agents without exposing content.
    Core to the cloud-side reasoning step described in the abstract.
invented entities (2)
  • typed placeholder tokens · no independent evidence
    purpose: Represent sensitive values by reasoning role while discarding content for cloud processing.
    Introduced as the key sanitization primitive that enables cloud reasoning without data exposure.
  • deterministic registry · no independent evidence
    purpose: Perform all substitutions and reversals so actions remain executable on device.
    Described as the component that confines the on-device LLM to proposing masks only.

pith-pipeline@v0.9.0 · 5526 in / 1669 out tokens · 39568 ms · 2026-05-12T00:58:45.221767+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 8 internal anchors

  1. [1]

    AI4Privacy: PII Masking 400k

    AI4Privacy. AI4Privacy: PII Masking 400k. https://huggingface.co/datasets/ai4privacy/pii-masking-400k

  2. [2]

    MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms

    Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and ...

  3. [3]

    τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment

    Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment. arXiv preprint arXiv:2506.07982, 2025

  4. [4]

    Hide and Seek (HaS): A Lightweight Framework for Prompt Privacy Protection

    Yu Chen, Tingxin Li, Huiming Liu, and Yang Yu. Hide and Seek (HaS): A Lightweight Framework for Prompt Privacy Protection. arXiv preprint arXiv:2309.03057, 2023

  5. [5]

    FinQA: A Dataset of Numerical Reasoning over Financial Data

    Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan R Routledge, et al. FinQA: A Dataset of Numerical Reasoning over Financial Data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3697–3711, 2021

  6. [6]

    Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation

    Ziling Cheng, Meng Cao, Leila Pishdad, Yanshuai Cao, and Jackie CK Cheung. Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 14317–14344, 2025

  7. [7]

    Casper: Prompt Sanitization for Protecting User Privacy in Web-Based Large Language Models

    Chun Jie Chong, Chenxi Hou, Zhihao Yao, and Seyed Mohammadjavad Seyed Talebi. Casper: Prompt Sanitization for Protecting User Privacy in Web-Based Large Language Models. In 2025 IEEE 12th International Conference on Cyber Security and Cloud Computing (CSCloud), pages 122–133. IEEE, 2025

  8. [8]

    Prϵϵmpt: Sanitizing Sensitive Prompts for LLMs

    Amrita Roy Chowdhury, David Glukhov, Divyam Anshumaan, Prasad Chalasani, Nicolas Papernot, Somesh Jha, and Mihir Bellare. Prϵϵmpt: Sanitizing Sensitive Prompts for LLMs. arXiv preprint arXiv:2504.05147, 2025

  9. [9]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168, 2021

  10. [10]

    jeopardy-gen2mc

    Allen Institute for AI. jeopardy-gen2mc. https://huggingface.co/datasets/allenai/jeopardy-gen2mc

  11. [11]

    Gemini 3 Flash

    Google DeepMind. Gemini 3 Flash. https://deepmind.google/models/gemini/flash/

  12. [12]

    Proactive Agents for Multi-Turn Text-to-Image Generation Under Uncertainty

    Meera Hahn, Wenjun Zeng, Nithish Kannen, Rich Galt, Kartikeya Badola, Been Kim, and Zi Wang. Proactive Agents for Multi-Turn Text-to-Image Generation Under Uncertainty. arXiv preprint arXiv:2412.06771, 2024

  13. [13]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring Massive Multitask Language Understanding. arXiv preprint arXiv:2009.03300, 2020

  14. [14]

    MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

    Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. In The Twelfth International Conference on Learning Representations, 2023

  15. [15]

    spaCy: Industrial-Strength Natural Language Processing in Python

    Matthew Honnibal, Ines Montani, Sofie Van Landeghem, Adriane Boyd, et al. spaCy: Industrial-Strength Natural Language Processing in Python. 2020

  16. [16]

    What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams

    Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams. Applied Sciences, 11(14):6421, 2021

  17. [17]

    ACON: Optimizing Context Compression for Long-horizon LLM Agents

    Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, and Saravan Rajmohan. ACON: Optimizing Context Compression for Long-horizon LLM Agents. arXiv preprint arXiv:2510.00615, 2025

  18. [18]

    CoDraw: Collaborative Drawing as a Testbed for Grounded Goal-driven Communication

    Jin-Hwa Kim, Nikita Kitaev, Xinlei Chen, Marcus Rohrbach, Byoung-Tak Zhang, Yuandong Tian, Dhruv Batra, and Devi Parikh. CoDraw: Collaborative Drawing as a Testbed for Grounded Goal-driven Communication. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6495–6513, 2019

  19. [19]

    Compressing Context to Enhance Inference Efficiency of Large Language Models

    Yucheng Li, Bo Dong, Frank Guerin, and Chenghua Lin. Compressing Context to Enhance Inference Efficiency of Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6342–6353, 2023

  20. [20]

    EmojiPrompt: Generative Prompt Obfuscation for Privacy-Preserving Communication with Cloud-based LLMs

    Sam Lin, Wenyue Hua, Zhenting Wang, Mingyu Jin, Lizhou Fan, and Yongfeng Zhang. EmojiPrompt: Generative Prompt Obfuscation for Privacy-Preserving Communication with Cloud-based LLMs. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long P...

  21. [21]

    TruthfulQA: Measuring How Models Mimic Human Falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring How Models Mimic Human Falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, 2022

  22. [22]

    Microsoft COCO: Common Objects in Context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common Objects in Context. In European conference on computer vision, pages 740–755. Springer, 2014

  23. [23]

    Anonymisation Models for Text Data: State of the Art, Challenges and Future Directions

    Pierre Lison, Ildikó Pilán, David Sanchez, Montserrat Batet, and Lilja Øvrelid. Anonymisation Models for Text Data: State of the Art, Challenges and Future Directions. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers),...

  24. [24]

    Formalizing and Benchmarking Prompt Injection Attacks and Defenses

    Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and Benchmarking Prompt Injection Attacks and Defenses. In 33rd USENIX Security Symposium (USENIX Security 24), pages 1831–1847, 2024

  25. [25]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts. arXiv preprint arXiv:2310.02255, 2023

  26. [26]

    Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning

    Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volu...

  27. [27]

    Self-Refine: Iterative Refinement with Self-Feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-Refine: Iterative Refinement with Self-Feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023

  28. [28]

    Evaluating Very Long-Term Conversational Memory of LLM Agents

    Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating Very Long-Term Conversational Memory of LLM Agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024

  29. [29]

    Split-and-Denoise: Protect Large Language Model Inference with Local Differential Privacy

    Peihua Mai, Ran Yan, Zhe Huang, Youjia Yang, and Yan Pang. Split-and-Denoise: Protect Large Language Model Inference with Local Differential Privacy. arXiv preprint arXiv:2310.09130, 2023

  30. [30]

    Microsoft Presidio: Context Aware, Pluggable and Customizable PII Anonymization Service for Text and Images

    Omri Mendels, Coby Peled, Nava Vaisman Levy, Tomer Rosenthal, Limor Lahiani, et al. Microsoft Presidio: Context Aware, Pluggable and Customizable PII Anonymization Service for Text and Images. Microsoft, 2018

  31. [31]

    GAIA: a benchmark for General AI Assistants

    Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for General AI Assistants. In The Twelfth International Conference on Learning Representations, 2023

  32. [32]

    Trust No Bot: Discovering Personal Disclosures in Human-LLM Conversations in the Wild

    Niloofar Mireshghallah, Maria Antoniak, Yash More, Yejin Choi, and Golnoosh Farnadi. Trust No Bot: Discovering Personal Disclosures in Human-LLM Conversations in the Wild. arXiv preprint arXiv:2407.11438, 2024

  33. [33]

    Ignore Previous Prompt: Attack Techniques For Language Models

    Fábio Perez and Ian Ribeiro. Ignore Previous Prompt: Attack Techniques For Language Models. arXiv preprint arXiv:2211.09527, 2022

  34. [34]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language Models Can Teach Themselves to Use Tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023

  35. [35]

    Improving LLM’s Attachment to External Knowledge In Dialogue Generation Tasks Through Entity Anonymization

    Hadi Sheikhi, Chenyang Huang, and Osmar R Zaïane. Improving LLM’s Attachment to External Knowledge In Dialogue Generation Tasks Through Entity Anonymization. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 472–...

  36. [36]

    HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. Advances in Neural Information Processing Systems, 36:38154–38180, 2023

  37. [37]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language Agents with Verbal Reinforcement Learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023

  38. [38]

    CLUTRR: A Diagnostic Benchmark for Inductive Reasoning from Text

    Koustuv Sinha, Shagun Sodhani, Jin Dong, Joelle Pineau, and William L Hamilton. CLUTRR: A Diagnostic Benchmark for Inductive Reasoning from Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4506–4515, 2019

  39. [39]

    PAPILLON: Privacy Preservation from Internet-based and Local Language Model Ensembles

    Li Siyan, Vethavikashini Chithrra Raghuram, Omar Khattab, Julia Hirschberg, and Zhou Yu. PAPILLON: Privacy Preservation from Internet-based and Local Language Model Ensembles. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pa...

  40. [40]

    k-Anonymity: A Model for Protecting Privacy

    Latanya Sweeney. k-Anonymity: A Model for Protecting Privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05):557–570, 2002

  41. [41]

    FEVER: a large-scale dataset for Fact Extraction and VERification

    James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: a large-scale dataset for Fact Extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, 2018

  42. [42]

    Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game

    Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, et al. Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game. arXiv preprint arXiv:2311.01011, 2023

  43. [43]

    Locally Differentially Private Document Generation Using Zero Shot Prompting

    Saiteja Utpala, Sara Hooker, and Pin-Yu Chen. Locally Differentially Private Document Generation Using Zero Shot Prompting. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8442–8457, 2023

  44. [44]

    Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models

    Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2609–2634, 2023

  45. [45]

    SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

    Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models. arXiv preprint arXiv:2307.10635, 2023

  46. [46]

    Crowdsourcing Multiple Choice Science Questions

    Johannes Welbl, Nelson F Liu, and Matt Gardner. Crowdsourcing Multiple Choice Science Questions. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 94–106, 2017

  47. [47]

    AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. In First Conference on Language Modeling, 2024

  48. [48]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 Technical Report. arXiv preprint arXiv:2505.09388, 2025

  49. [49]

    HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, 2018

  50. [50]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. ReAct: Synergizing Reasoning and Acting in Language Models. In The Eleventh International Conference on Learning Representations, 2022

  51. [51]

    Toward Super Agent System with Hybrid AI Routers

    Yuhang Yao, Haixin Wang, Yibo Chen, Jiawen Wang, Min Chang Jordan Ren, Bosheng Ding, Salman Avestimehr, and Chaoyang He. Toward Super Agent System with Hybrid AI Routers. arXiv preprint arXiv:2504.10519, 2025

  52. [52]

    EcoAgent: An Efficient Device-Cloud Collaborative Multi-Agent Framework for Mobile Automation

    Biao Yi, Xueyu Hu, Yurun Chen, Shengyu Zhang, Hongxia Yang, and Fan Wu. EcoAgent: An Efficient Device-Cloud Collaborative Multi-Agent Framework for Mobile Automation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 29838–29846, 2026

  53. [53]

    Local-Cloud Inference Offloading for LLMs in Multi-Modal, Multi-Task, Multi-Dialogue Settings

    Liangqi Yuan, Dong-Jun Han, Shiqiang Wang, and Christopher Brinton. Local-Cloud Inference Offloading for LLMs in Multi-Modal, Multi-Task, Multi-Dialogue Settings. In Proceedings of the Twenty-sixth International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, pages 201–210, 2025

  54. [54]

    MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024

  55. [55]

    MasRouter: Learning to Route LLMs for Multi-Agent Systems

    Yanwei Yue, Guibin Zhang, Boyang Liu, Guancheng Wan, Kun Wang, Dawei Cheng, and Yiyan Qi. MasRouter: Learning to Route LLMs for Multi-Agent Systems. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15549–15572, 2025

  56. [56]

    PRISM: Privacy-Aware Routing for Adaptive Cloud–Edge LLM Inference via Semantic Sketch Collaboration

    Junfei Zhan, Haoxun Shen, Zheng Lin, and Tengjiao He. PRISM: Privacy-Aware Routing for Adaptive Cloud–Edge LLM Inference via Semantic Sketch Collaboration. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 28150–28158, 2026

  57. [57]

    InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

    Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents. In Findings of the Association for Computational Linguistics: ACL 2024, pages 10471–10506, 2024

  58. [58]

    Searching for Privacy Risks in LLM Agents via Simulation

    Yanzhe Zhang and Diyi Yang. Searching for Privacy Risks in LLM Agents via Simulation. arXiv preprint arXiv:2508.10880, 2025

  59. [59]

    Chain of Agents: Large Language Models Collaborating on Long-Context Tasks

    Yusen Zhang, Ruoxi Sun, Yanfei Chen, Tomas Pfister, Rui Zhang, and Sercan Ö Arık. Chain of Agents: Large Language Models Collaborating on Long-Context Tasks. Advances in Neural Information Processing Systems, 37:132208–132237, 2024

  60. [60]

    AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

    Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 2299–2314, 2024

  61. [61]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv preprint arXiv:2307.13854, 2023

  62. [62]

    RecurrentGPT: Interactive Generation of (Arbitrarily) Long Text

    Wangchunshu Zhou, Yuchen Eleanor Jiang, Peng Cui, Tiannan Wang, Zhenxin Xiao, Yifan Hou, Ryan Cotterell, and Mrinmaya Sachan. RecurrentGPT: Interactive Generation of (Arbitrarily) Long Text. arXiv preprint arXiv:2305.13304, 2023

