Need to Know: Contextual-Integrity-Grounded Query Rewriting for Privacy-Conscious LLM Delegation

Wenyuan Yang; Xiaochun Cao; Xinyue Huang

arxiv: 2606.04067 · v1 · pith:BM76XZGFnew · submitted 2026-06-02 · 💻 cs.CR · cs.AI

Pith reviewed 2026-06-28 09:39 UTC · model grok-4.3

keywords: contextual integrity · query rewriting · privacy preservation · LLM delegation · reinforcement learning · privacy-utility tradeoff · benchmark dataset

The pith

Contextual integrity rules let a learned rewriter forward only task-necessary spans to cloud LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that privacy-preserving query rewriting improves when reframed around contextual integrity, so that a span reaches the cloud model only if it is required for the user's task. This addresses the limits of type-based PII redaction, which either leaks untyped context or removes answer-bearing content. A new benchmark of 3,167 samples and a reinforcement-learning procedure that treats essential and non-essential spans as explicit signals produce a rewriter with the strongest observed privacy-utility balance.
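The two failure modes of type-based redaction can be seen in a toy sketch. This is not the paper's rewriter: the query, the regex patterns, and the hand-labeled non-essential span are all hypothetical, and simple span deletion stands in for the learned rewriting model.

```python
import re

# Hypothetical query mixing sensitive context with a date-arithmetic task.
QUERY = ("My email is dana@example.com and I was just laid off. "
         "How many days are there between 2024-03-05 and 2024-04-02?")

# Hand-labeled for illustration: context the task does not need.
NON_ESSENTIAL = ["My email is dana@example.com and I was just laid off."]

# Type-based redaction only knows entity types, not the task (assumed patterns).
TYPE_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def type_based_redact(text):
    """Replace every match of a known PII type with a placeholder."""
    for label, pattern in TYPE_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

def necessity_filter(text, non_essential_spans):
    """Drop spans labeled non-essential for the task; keep everything else."""
    for span in non_essential_spans:
        text = text.replace(span, "")
    return re.sub(r"\s+", " ", text).strip()

typed = type_based_redact(QUERY)
needed = necessity_filter(QUERY, NON_ESSENTIAL)
print(typed)   # job loss still disclosed, dates (the answer-bearing spans) gone
print(needed)  # only the task-relevant question remains
```

In this toy case the type-based pass leaks the untyped disclosure ("laid off") and removes the dates the task depends on, while the necessity filter does the reverse.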

Core claim

By treating query rewriting as a contextual-integrity decision problem and training a rewriter with reinforcement learning on signals that mark task-essential versus non-essential sensitive spans, the resulting model achieves the best privacy-utility tradeoff on DelegateCI-Bench and records up to 10.1 points higher average utility than on-device baselines.

What carries the argument

CI-guided reinforcement learning framework that converts essential and non-essential span annotations into verifiable optimization signals for the query rewriter.
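One way span annotations could become checkable signals is sketched below. The summary does not say how the paper matches spans against a rewrite, so case-insensitive substring matching, the `SpanSignals` type, and the `span_signals` helper are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class SpanSignals:
    essential_recall: float  # fraction of essential spans still present
    leaked: int              # count of non-essential sensitive spans still present

def span_signals(rewrite, essential, non_essential):
    """Score a rewrite against span labels using substring matching (assumed)."""
    text = rewrite.lower()
    kept = sum(1 for span in essential if span.lower() in text)
    leaked = sum(1 for span in non_essential if span.lower() in text)
    recall = kept / len(essential) if essential else 1.0
    return SpanSignals(essential_recall=recall, leaked=leaked)

# Hypothetical example: one essential span dropped, one sensitive span leaked.
signals = span_signals(
    "Dana Lee wants vegetarian meals",
    essential=["vegetarian", "peanut"],
    non_essential=["Dana Lee", "laid off"],
)
print(signals)
```

Exact matching is brittle under paraphrase, so a practical version would likely need a softer matcher; that choice is outside what this summary documents.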

If this is right

The rewriter preserves task-critical information while suppressing unnecessary sensitive disclosure.
It records the best privacy-utility tradeoff across synthetic tasks, real user queries, and a medical challenge set.
Average utility rises by as much as 10.1 points relative to on-device baselines.
The approach works on both high-quality synthetic data spanning 11 tasks and 20 task types and on WildChat-derived real queries.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same necessity filter could be applied to other forms of AI delegation where context must be shared selectively.
On-device models might become less necessary if cloud queries can be safely reduced to task-essential content only.
The method invites direct comparison with rule-based or LLM-prompt rewriting baselines on live user traffic.

Load-bearing premise

The annotations supplied in DelegateCI-Bench correctly label which spans are necessary versus non-essential under contextual integrity principles for each task.

What would settle it

Independent human raters applying contextual integrity criteria to the rewriter's decisions on a fresh set of queries find that many forwarded spans are unnecessary or that utility gains disappear.

Figures

Figures reproduced from arXiv: 2606.04067 by Wenyuan Yang, Xiaochun Cao, Xinyue Huang.

**Figure 2.** Overview of our framework. Training (left): rollouts from π_θ are scored by a composite reward combining a hard privacy term R_priv, a concave essential term R_ess, and a one-sided length penalty L_brev, which drives a GRPO update of π_θ. Inference (right): the frozen π_θ rewrites the query on the trusted side; only the rewrite is sent to the untrusted remote LLM, whose response is recombined locally to produce …

**Figure 3.** Joint distribution of utility score u and re-identification risk score r across the three systems. (Surrounding text captured with the caption: the evaluation produces flags for four privacy-leak categories and returns a re-identification risk score r ∈ [0, 1] and a utility score u ∈ [0, 1]; category definitions and prompts are in Appendix D.)
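The Figure 2 caption names three reward terms and a GRPO update but does not give their functional forms. The sketch below is one plausible reading, not the paper's definition: the hard privacy term is modeled as a gate that zeroes the reward on any leak, the concave essential term as a square root of essential recall, and the length penalty as charging only rewrites longer than the source. The weight `length_weight` and both function names are hypothetical. The advantage computation follows the usual group-relative normalization used in GRPO, with population standard deviation chosen here for simplicity.

```python
import math
from statistics import mean, pstdev

def composite_reward(leaked, essential_recall, rewrite_len, source_len,
                     length_weight=0.1):
    """Combine the three caption terms under the assumptions stated above."""
    # Hard privacy term (assumed gate): any leaked non-essential span gives 0.
    r_priv = 0.0 if leaked > 0 else 1.0
    # Concave essential term (assumed sqrt): early recall gains count most.
    r_ess = math.sqrt(essential_recall)
    # One-sided length penalty (assumed direction): only over-long rewrites pay.
    l_brev = max(0.0, rewrite_len / max(source_len, 1) - 1.0)
    return r_priv * r_ess - length_weight * l_brev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward by its group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Two hypothetical rollouts for one query: a clean rewrite and a leaky one.
rewards = [
    composite_reward(leaked=0, essential_recall=1.0, rewrite_len=10, source_len=20),
    composite_reward(leaked=1, essential_recall=1.0, rewrite_len=10, source_len=20),
]
print(rewards, group_relative_advantages(rewards))
```

Under these assumptions the leaky rollout receives a negative advantage relative to the clean one, which is the direction a privacy gate is meant to push the policy.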

Original abstract

As LLMs become increasingly woven into everyday workflows, user queries sent to cloud hosted LLMs routinely mix task-essential content with task non-essential sensitive disclosures, yet type based PII redaction is context agnostic and may raise two issues: over disclosing untyped sensitive context and over removing answer bearing spans. We recast privacy preserving query rewriting under Contextual Integrity: a span should be forwarded only if it is necessary for the task. We introduce DelegateCI-Bench, the first task based Contextual Integrity benchmark for privacy-conscious delegation, comprising 3,167 samples that combine high quality synthetic data spanning 11 tasks and 20 task types, WildChat based real user queries, and a medical challenge set with dense sensitive information. Building on this benchmark, we propose a CI-guided reinforcement learning framework that converts essential and non-essential sensitive spans into verifiable optimization signals, and train a query rewriter to preserve task critical information while suppressing unnecessary sensitive disclosure. Experiments show that our learned rewriter achieves the best privacy-utility tradeoff, achieving up to +10.1 average utility over on-device baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

New benchmark and RL rewriter for CI-based LLM query privacy, but uncertain label quality and missing experimental details make the gains hard to trust for now.

The paper's core move is to treat query rewriting as a Contextual Integrity problem: forward a span only if the task needs it. They release DelegateCI-Bench (3,167 samples across synthetic tasks, WildChat logs, and a medical set) and train an RL rewriter that uses the essential/non-essential labels as reward signals. That framing and the benchmark construction are the actual new pieces.

The work is useful in one narrow way. It gives a concrete task-based testbed instead of generic PII redaction, and the reported +10.1 utility lift over on-device baselines suggests the RL signal can move the privacy-utility line. If the labels are reliable, other groups could reuse the benchmark.

The soft spots are exactly where the stress-test note says. No inter-annotator agreement, no expert check against Nissenbaum's parameters, and no ablation on label noise are mentioned. The abstract also gives zero information on how baselines were implemented, what statistical tests were used, or how utility and privacy were scored. Those gaps matter because the entire claim rests on the annotations being faithful ground truth.

This is for people already working on applied privacy mechanisms for cloud LLMs or on reward design for text rewriting. A reader who wants a ready benchmark to run their own rewriters on could get value once the label validation is shown. It is coherent enough on its own terms to deserve referee time, mainly to check the annotation pipeline and the experimental controls.
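The inter-annotator agreement the letter asks for is commonly reported as Cohen's kappa, which corrects raw agreement for chance. A minimal sketch for two annotators follows; the label values `"E"` and `"N"` and the example annotations are made up, and nothing here reflects numbers from the paper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical span labels: E = essential, N = non-essential.
annotator_1 = ["E", "E", "N", "N"]
annotator_2 = ["E", "N", "N", "N"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)
```

Here the annotators agree on three of four spans, but half of that agreement is expected by chance, so kappa is noticeably lower than the raw agreement rate.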

Referee Report

1 major / 1 minor

Summary. The paper introduces DelegateCI-Bench, a 3,167-sample benchmark for task-based Contextual Integrity (CI) in privacy-preserving LLM query delegation that combines synthetic data across 11 tasks, WildChat queries, and a medical set. It proposes a CI-guided reinforcement learning framework that uses essential/non-essential span annotations as reward signals to train a query rewriter, claiming this yields the best privacy-utility tradeoff with up to +10.1 average utility gain over on-device baselines.

Significance. If the benchmark labels faithfully encode CI necessity judgments and the experimental results prove robust, the work would supply the first task-oriented CI benchmark and a verifiable optimization approach that moves beyond type-based PII redaction, offering a concrete path to reduce unnecessary sensitive disclosures while preserving task utility in cloud LLM delegation.

major comments (1)

[Abstract] The headline claim of superior privacy-utility tradeoff (+10.1 utility) depends entirely on DelegateCI-Bench annotations serving as accurate ground truth for both RL rewards and evaluation metrics. No inter-annotator agreement, expert validation against Nissenbaum's CI parameters (e.g., actors, attributes, transmission principles), or ablation on label noise is reported, rendering both the learned rewriter and the reported deltas unreliable if the labels systematically misclassify task-critical spans.

minor comments (1)

[Abstract] Experimental design details (baseline definitions, statistical tests, error analysis, and exact utility metric) are absent, preventing assessment of the +10.1 gain.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and for emphasizing the need to substantiate the reliability of DelegateCI-Bench annotations. We address the concern directly below and commit to strengthening the manuscript accordingly.

Referee: [Abstract] The headline claim of superior privacy-utility tradeoff (+10.1 utility) depends entirely on DelegateCI-Bench annotations serving as accurate ground truth for both RL rewards and evaluation metrics. No inter-annotator agreement, expert validation against Nissenbaum's CI parameters (e.g., actors, attributes, transmission principles), or ablation on label noise is reported, rendering both the learned rewriter and the reported deltas unreliable if the labels systematically misclassify task-critical spans.

Authors: We agree that the validity of the annotations is foundational to both the RL training signals and the reported utility gains. DelegateCI-Bench annotations were produced by defining essential spans as those required to fulfill the stated task under Contextual Integrity (explicitly incorporating actors, attributes, and transmission principles during synthetic data generation for the 11 tasks). WildChat and medical samples were annotated by the authors using the same CI criteria. However, the current manuscript does not report inter-annotator agreement, a systematic mapping to Nissenbaum's parameters, or label-noise ablations. In the revised manuscript we will add (1) IAA statistics from at least two additional annotators, (2) an explicit table mapping each annotation decision to the three CI parameters, and (3) an ablation that randomly flips 10-20% of essential/non-essential labels and re-evaluates the rewriter's utility delta. These additions will directly test whether the +10.1 gain remains stable under plausible label noise.

Revision: yes
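The label-flip step of the promised ablation could look like the sketch below. It covers only the noise injection; retraining and re-evaluating the rewriter are not shown. The label strings, the `flip_labels` helper, and the fixed seed are illustrative choices, not details from the paper or the rebuttal.

```python
import random

SWAP = {"essential": "non_essential", "non_essential": "essential"}

def flip_labels(labels, rate, seed=0):
    """Return a copy of labels with a fixed fraction flipped to the other class."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    rng = random.Random(seed)  # local RNG so the noise is reproducible
    n_flip = round(rate * len(labels))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [SWAP[label] if i in flip_idx else label
            for i, label in enumerate(labels)]

# Hypothetical label set: 50 essential and 50 non-essential spans.
labels = ["essential"] * 50 + ["non_essential"] * 50
for rate in (0.10, 0.20):
    noisy = flip_labels(labels, rate)
    changed = sum(a != b for a, b in zip(labels, noisy))
    print(rate, changed)
```

Sampling indices without replacement guarantees the exact flip count at each rate, which makes the 10% and 20% conditions directly comparable across seeds.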

Circularity Check

0 steps flagged

No circularity: empirical benchmark and RL training with external baselines

full rationale

The paper describes an empirical pipeline: creation of DelegateCI-Bench from synthetic, WildChat, and medical data; conversion of span labels into RL signals; training of a rewriter; and comparison of privacy-utility metrics against on-device baselines. No equations, derivations, or first-principles claims appear that reduce any reported utility gain (+10.1) to a fitted parameter or self-defined quantity by construction. The benchmark labels serve as standard ground truth for both reward and evaluation, which is conventional supervised learning rather than a self-referential loop. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim depends on the validity of Contextual Integrity as a decision criterion for query rewriting and on the quality of the new benchmark annotations; no free parameters or invented physical entities are described.

axioms (1)

Domain assumption: Contextual Integrity theory provides a context-dependent criterion for whether a span should be forwarded.
The paper recasts privacy-preserving rewriting under CI.

invented entities (2)

DelegateCI-Bench (no independent evidence)
purpose: Task-based benchmark for evaluating CI-grounded query rewriting
Newly constructed dataset of 3,167 samples

CI-guided reinforcement learning framework (no independent evidence)
purpose: Training signal generator that labels essential versus non-essential spans
Proposed training method

pith-pipeline@v0.9.1-grok · 5726 in / 1203 out tokens · 24910 ms · 2026-06-28T09:39:04.021441+00:00 · methodology

Reference graph

Works this paper leans on

53 extracted references · 4 canonical work pages

[36]

Junying Chen, Zhenyang Cai, Ke Ji, Xidong Wang, Wanlong Liu, Rongsheng Wang, Jianye Hou, and Benyou Wang. 2024. HuatuoGPT-o1: Towards medical complex reasoning with LLMs. arXiv preprint arXiv:2412.18925. https://arxiv.org/abs/2412.18925

[37]

Yu Chen, Tingxin Li, Huiming Liu, and Yang Yu. 2023. Hide and seek (HaS): A lightweight framework for prompt privacy protection. arXiv preprint arXiv:2309.03057. https://arxiv.org/abs/2309.03057

[38]

Zhao Cheng, Diane Bouchacourt, and Mark Ibrahim. 2024. CI-Bench: Benchmarking contextual integrity of AI assistants on synthetic data. arXiv preprint arXiv:2409.13903. https://arxiv.org/abs/2409.13903

[39]

Tri Dao. 2024. FlashAttention-2: Faster attention with better parallelism and work partitioning. In Proceedings of the International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2307.08691

[40]

Junyuan Hong, Jiachen T. Wang, Chenhui Zhang, Zhangheng Li, Bo Li, and Zhangyang Wang. 2024. DP-OPT: Make large language model your privacy-preserving prompt engineer. In Proceedings of the International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=Ifz3IgsEPX

[41]

Zhigang Kan, Linbo Qiao, Hao Yu, Liwen Peng, Yifu Gao, and Dongsheng Li. 2023. Protecting user privacy in remote conversational systems: A privacy-preserving framework based on text sanitization. arXiv preprint arXiv:2306.08223. https://arxiv.org/abs/2306.08223

[42]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP). https://doi.org/10.1145/3600006.3613165

[43]

John Langford and Tong Zhang. 2007. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems (NeurIPS).

[44]

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. https://arxiv.org/abs/1907.11692

[45]

Nils Lukas, Ahmed Salem, Robert Sim, Shruti Tople, Lukas Wutschitz, and Santiago Zanella-Béguelin. 2023. Analyzing leakage of personally identifiable information in language models. In Proceedings of the IEEE Symposium on Security and Privacy (SP), pages 346-363. https://doi.org/10.1109/SP46215.2023.10179300

[46]

Microsoft. 2021. Microsoft Presidio: Context-aware, pluggable and customizable PII anonymization service for text and images. https://microsoft.github.io/presidio/

[47]

Niloofar Mireshghallah, Hyunwoo Kim, Xuhui Zhou, Yulia Tsvetkov, Maarten Sap, Reza Shokri, and Yejin Choi. 2024. Can LLMs keep a secret? Testing privacy implications of language models via contextual integrity theory. In Proceedings of the International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=gmg7t8b4s0

[48]

Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. 2018. Ray: A distributed framework for emerging AI applications. In Proceedings of the 13th USENIX Symposium on Operating Sy... https://www.usenix.org/conference/osdi18/presentation/moritz

[49]

Seth Neel and Peter Chang. 2023. Privacy issues in large language models: A survey. arXiv preprint arXiv:2312.06717. https://arxiv.org/abs/2312.06717

[50]

Ivoline Ngong, Swanand Kadhe, Hao Wang, Keerthiram Murugesan, Justin D. Weisz, Amit Dhurandhar, and Karthikeyan Natesan Ramamurthy. 2025. Protecting user privacy in online settings via supervised learning guided by contextual integrity. In Findings of the Association for Computational Linguistics: ACL 2025... https://aclanthology.org/2025.findings-acl.1343/

[51]

Helen Nissenbaum. 2010. Privacy in Context: Technology, Policy, and the Integrity of Social Life. Stanford University Press, Stanford, CA.

[52]

Ildikó Pilán, Pierre Lison, Lilja Øvrelid, Anthi Papadopoulou, David Sánchez, and Montserrat Batet. 2022. The text anonymization benchmark (TAB): A dedicated corpus and evaluation framework for text anonymization. Computational Linguistics, 48(4):1053-1101. https://doi.org/10.1162/coli_a_00458

[53]

Qwen Team, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, and 24 others. 2025. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115. https://arxiv.org/abs/2412.15115

[54]

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. https://arxiv.org/abs/1707.06347

[55]

Yijia Shao, Tianshi Li, Weiyan Shi, Yanchen Liu, and Diyi Yang. 2024a. PrivacyLens: Evaluating privacy norm awareness of language models in action. In Advances in Neural Information Processing Systems (NeurIPS).

[56]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024b. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. https://arxiv.org/abs/2402.03300

[57]

ShareGPT. 2023. ShareGPT: Share your wildest ChatGPT conversations with one click. https://sharegpt.com/. Accessed: 2025-05-24.

[58]

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2025. HybridFlow: A flexible and efficient RLHF framework. In Proceedings of the Twentieth European Conference on Computer Systems (EuroSys '25). https://arxiv.org/abs/2409.19256

[59]

Li Siyan, Vethavikashini Chithrra Raghuram, Omar Khattab, Julia Hirschberg, and Zhou Yu. 2025. PAPILLON: Privacy preservation from Internet-based and Local Language Model OrchestratioN. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL). https://arxiv.org/abs/2410.17127

[60]

Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, and David Krueger. 2022. Defining and characterizing reward hacking. In Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2209.13085

[61]

Xinyu Tang, Richard Shin, Huseyin A. Inan, Andre Manoel, Fatemehsadat Mireshghallah, Zinan Lin, Sivakanth Gopi, Janardhan Kulkarni, and Robert Sim. 2024. Privacy-preserving in-context learning with differentially private few-shot generation. In Proceedings of the International Conference on Learning Representati... https://openreview.net/forum?id=oZtt0pRnOl

[62]

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In Proceedings of the International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=SkeHuCVFDr

[63]

Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. 2024. WildChat: 1M ChatGPT interaction logs in the wild. In Proceedings of the International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=Bl8u7ZRlbM

[1] [1]

2010 , publisher =

Helen Nissenbaum , title =. 2010 , publisher =

2010

[2] [2]

Proceedings of the International Conference on Learning Representations (ICLR) , year =

Niloofar Mireshghallah and Hyunwoo Kim and Xuhui Zhou and Yulia Tsvetkov and Maarten Sap and Reza Shokri and Yejin Choi , title =. Proceedings of the International Conference on Learning Representations (ICLR) , year =

[3] [3]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Yijia Shao and Tianshi Li and Weiyan Shi and Yanchen Liu and Diyi Yang , title =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[4] [5]

Weisz and Amit Dhurandhar and Karthikeyan Natesan Ramamurthy , title =

Ivoline Ngong and Swanand Kadhe and Hao Wang and Keerthiram Murugesan and Justin D. Weisz and Amit Dhurandhar and Karthikeyan Natesan Ramamurthy , title =. Findings of the Association for Computational Linguistics: ACL 2025 , year =

2025

[5] [8]

Inan and Andre Manoel and Fatemehsadat Mireshghallah and Zinan Lin and Sivakanth Gopi and Janardhan Kulkarni and Robert Sim , title =

Xinyu Tang and Richard Shin and Huseyin A. Inan and Andre Manoel and Fatemehsadat Mireshghallah and Zinan Lin and Sivakanth Gopi and Janardhan Kulkarni and Robert Sim , title =. Proceedings of the International Conference on Learning Representations (ICLR) , year =

[6] [9]

Wang and Chenhui Zhang and Zhangheng Li and Bo Li and Zhangyang Wang , title =

Junyuan Hong and Jiachen T. Wang and Chenhui Zhang and Zhangheng Li and Bo Li and Zhangyang Wang , title =. Proceedings of the International Conference on Learning Representations (ICLR) , year =

[7] [10]

Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL) , year =

Li Siyan and Vethavikashini Chithrra Raghuram and Omar Khattab and Julia Hirschberg and Zhou Yu , title =. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL) , year =

2025

[8] [11]

2021 , howpublished =

2021

[9] [12]

The Text Anonymization Benchmark (

Ildik\'. The Text Anonymization Benchmark (. Computational Linguistics , volume =. 2022 , publisher =

2022

[10] [13]

Analyzing Leakage of Personally Identifiable Information in Language Models , booktitle =

Nils Lukas and Ahmed Salem and Robert Sim and Shruti Tople and Lukas Wutschitz and Santiago Zanella-B. Analyzing Leakage of Personally Identifiable Information in Language Models , booktitle =. 2023 , pages =

2023

[11] [14]

Proceedings of the International Conference on Learning Representations (ICLR) , year =

Wenting Zhao and Xiang Ren and Jack Hessel and Claire Cardie and Yejin Choi and Yuntian Deng , title =. Proceedings of the International Conference on Learning Representations (ICLR) , year =

[12] [15]

2023 , howpublished =

2023

[13] [19]

Advances in Neural Information Processing Systems (NeurIPS) , year =

John Langford and Tong Zhang , title =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[14] [20]

Joar Skalse and Nikolaus H. R. Howe and Dmitrii Krasheninnikov and David Krueger , title =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[15] [21]

Weinberger and Yoav Artzi , title =

Tianyi Zhang and Varsha Kishore and Felix Wu and Kilian Q. Weinberger and Yoav Artzi , title =. Proceedings of the International Conference on Learning Representations (ICLR) , year =

[16] [22]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

1972

[17] [23]

Publications Manual , year = "1983", publisher =

1983

[18] [24]

Chandra and Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981

[19] [25]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

[20] [26]

Dan Gusfield , title =. 1997

1997

[21] [27]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

2015

[22] [28]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

[23] [29]

Proceedings of the Twentieth European Conference on Computer Systems (EuroSys '25) , year =

Guangming Sheng and Chi Zhang and Zilingfeng Ye and Xibin Wu and Wang Zhang and Ru Zhang and Yanghua Peng and Haibin Lin and Chuan Wu , title =. Proceedings of the Twentieth European Conference on Computer Systems (EuroSys '25) , year =

[24] [31]

Jordan and Ion Stoica , title =

Philipp Moritz and Robert Nishihara and Stephanie Wang and Alexey Tumanov and Richard Liaw and Eric Liang and Melih Elibol and Zongheng Yang and William Paul and Michael I. Jordan and Ion Stoica , title =. Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI) , year =

[25] [32]

Proceedings of the International Conference on Learning Representations (ICLR) , year =

Tri Dao , title =. Proceedings of the International Conference on Learning Representations (ICLR) , year =

[26] [36]

Junying Chen, Zhenyang Cai, Ke Ji, Xidong Wang, Wanlong Liu, Rongsheng Wang, Jianye Hou, and Benyou Wang. 2024. HuatuoGPT-o1: Towards medical complex reasoning with LLMs. arXiv preprint arXiv:2412.18925. https://arxiv.org/abs/2412.18925

[27] [37]

Yu Chen, Tingxin Li, Huiming Liu, and Yang Yu. 2023. Hide and seek (HaS): A lightweight framework for prompt privacy protection. arXiv preprint arXiv:2309.03057. https://arxiv.org/abs/2309.03057

[28] [38]

Zhao Cheng, Diane Bouchacourt, and Mark Ibrahim. 2024. CI-Bench: Benchmarking contextual integrity of AI assistants on synthetic data. arXiv preprint arXiv:2409.13903. https://arxiv.org/abs/2409.13903

[29] [39]

Tri Dao. 2024. FlashAttention-2: Faster attention with better parallelism and work partitioning. In Proceedings of the International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2307.08691

[30] [40]

Junyuan Hong, Jiachen T. Wang, Chenhui Zhang, Zhangheng Li, Bo Li, and Zhangyang Wang. 2024. DP-OPT: Make large language model your privacy-preserving prompt engineer. In Proceedings of the International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=Ifz3IgsEPX

[31] [41]

Zhigang Kan, Linbo Qiao, Hao Yu, Liwen Peng, Yifu Gao, and Dongsheng Li. 2023. Protecting user privacy in remote conversational systems: A privacy-preserving framework based on text sanitization. arXiv preprint arXiv:2306.08223. https://arxiv.org/abs/2306.08223

[32] [42]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP). https://doi.org/10.1145/3600006.3613165

[33] [43]

John Langford and Tong Zhang. 2007. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems (NeurIPS).

[34] [44]

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. https://arxiv.org/abs/1907.11692

[35] [45]

Nils Lukas, Ahmed Salem, Robert Sim, Shruti Tople, Lukas Wutschitz, and Santiago Zanella-Béguelin. 2023. Analyzing leakage of personally identifiable information in language models. In Proceedings of the IEEE Symposium on Security and Privacy (SP), pages 346–363. https://doi.org/10.1109/SP46215.2023.10179300

[36] [46]

Microsoft. 2021. Microsoft Presidio: Context-aware, pluggable and customizable PII anonymization service for text and images. https://microsoft.github.io/presidio/

[37] [47]

Niloofar Mireshghallah, Hyunwoo Kim, Xuhui Zhou, Yulia Tsvetkov, Maarten Sap, Reza Shokri, and Yejin Choi. 2024. Can LLMs keep a secret? Testing privacy implications of language models via contextual integrity theory. In Proceedings of the International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=gmg7t8b4s0

[38] [48]

Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. 2018. Ray: A distributed framework for emerging AI applications. In Proceedings of the 13th USENIX Symposium on Operating Sy... https://www.usenix.org/conference/osdi18/presentation/moritz

[39] [49]

Seth Neel and Peter Chang. 2023. Privacy issues in large language models: A survey. arXiv preprint arXiv:2312.06717. https://arxiv.org/abs/2312.06717

[40] [50]

Ivoline Ngong, Swanand Kadhe, Hao Wang, Keerthiram Murugesan, Justin D. Weisz, Amit Dhurandhar, and Karthikeyan Natesan Ramamurthy. 2025. Protecting user privacy in online settings via supervised learning guided by contextual integrity. In Findings of the Association for Computational Linguistics: ACL 2025... https://aclanthology.org/2025.findings-acl.1343/

[41] [51]

Helen Nissenbaum. 2010. Privacy in Context: Technology, Policy, and the Integrity of Social Life. Stanford University Press, Stanford, CA.

[42] [52]

Ildikó Pilán, Pierre Lison, Lilja Øvrelid, Anthi Papadopoulou, David Sánchez, and Montserrat Batet. 2022. The text anonymization benchmark (TAB): A dedicated corpus and evaluation framework for text anonymization. Computational Linguistics, 48(4):1053–1101. https://doi.org/10.1162/coli_a_00458

[43] [53]

Qwen Team, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, and 24 others. 2025. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115. https://arxiv.org/abs/2412.15115

[44] [54]

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. https://arxiv.org/abs/1707.06347

[45] [55]

Yijia Shao, Tianshi Li, Weiyan Shi, Yanchen Liu, and Diyi Yang. 2024a. PrivacyLens: Evaluating privacy norm awareness of language models in action. In Advances in Neural Information Processing Systems (NeurIPS).

[46] [56]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024b. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. https://arxiv.org/abs/2402.03300

[47] [57]

ShareGPT. 2023. ShareGPT: Share your wildest ChatGPT conversations with one click. https://sharegpt.com/. Accessed: 2025-05-24.

[48] [58]

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2025. HybridFlow: A flexible and efficient RLHF framework. In Proceedings of the Twentieth European Conference on Computer Systems (EuroSys '25). https://arxiv.org/abs/2409.19256

[49] [59]

Li Siyan, Vethavikashini Chithrra Raghuram, Omar Khattab, Julia Hirschberg, and Zhou Yu. 2025. PAPILLON: Privacy preservation from Internet-based and Local Language Model OrchestratioN. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL). https://arxiv.org/abs/2410.17127

[50] [60]

Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, and David Krueger. 2022. Defining and characterizing reward hacking. In Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2209.13085

[51] [61]

Xinyu Tang, Richard Shin, Huseyin A. Inan, Andre Manoel, Fatemehsadat Mireshghallah, Zinan Lin, Sivakanth Gopi, Janardhan Kulkarni, and Robert Sim. 2024. Privacy-preserving in-context learning with differentially private few-shot generation. In Proceedings of the International Conference on Learning Representati... https://openreview.net/forum?id=oZtt0pRnOl

[52] [62]

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In Proceedings of the International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=SkeHuCVFDr

[53] [63]

Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. 2024. WildChat: 1M ChatGPT interaction logs in the wild. In Proceedings of the International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=Bl8u7ZRlbM