Got a Secret? LLM Agents Can't Keep It: Evaluating Privacy in Multi-Agent Systems
Pith reviewed 2026-06-29 16:35 UTC · model grok-4.3
The pith
LLM agents disclose private information more than twice as often in social groups than in isolated tests
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the introduced simulation platform, multi-turn social interactions among LLM agents amplify privacy violations relative to single-turn baselines, produce contagious leakage where observation of a disclosure multiplies the chance of further disclosures, and show that explicit privacy instructions reduce but do not remove the effect, leaving substantial leakage that single-turn evaluations never surface.
What carries the argument
The Moltbook-style simulation platform that places thousands of LLM agents in persistent community interactions over a simulated month to measure privacy leakage under varying social pressure.
If this is right
- Static single-turn safety benchmarks systematically underestimate privacy risks once agents operate together.
- Social observation alone is enough to trigger sensitive disclosures that isolated tests miss.
- Explicit privacy instructions lower but do not eliminate leakage in group settings.
- Leakage becomes contagious once one agent observes another disclose information.
Where Pith is reading between the lines
- Safety testing protocols may need to incorporate repeated peer observation rather than single exchanges.
- Longer interaction histories or larger agent populations could further increase the observed contagion multiplier.
- Deployed systems might benefit from monitoring mechanisms that detect early disclosures before they spread.
Load-bearing premise
The behaviors produced by the simulation platform match how real deployed LLM agents would respond to social observation and pressure.
What would settle it
Direct measurement of privacy leakage rates in an actual multi-agent LLM deployment over a comparable period and comparison against the 45.30 percent and eightfold contagion figures reported from the simulation.
Figures
read the original abstract
LLM safety evaluations predominantly test models in isolation, yet deployed AI agents increasingly operate within persistent social environments alongside other agents. We introduce a Moltbook-style simulation platform where thousands of LLM agents interact across communities over a simulated month, and use it to evaluate privacy as a downstream safety concern under varying degrees of social pressure. We find that shifting from single turn to multi turn social evaluation amplifies privacy violations (CIMemories 19.95% to Ours 45.30% across OpenAI models), that leakage is socially contagious, with agents 8 times more likely to disclose sensitive information after observing a peer do so, and that explicit privacy instructions reduce but do not eliminate this effect, leaving leakage rates above 37.8% even with safeguards. Our findings suggest that static chat based safety benchmarks systematically underestimate risks in agentic deployment, and that social context alone is sufficient to elicit sensitive disclosures that single turn evaluations would never surface.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a Moltbook-style simulation platform in which thousands of LLM agents interact over a simulated month and uses it to measure privacy leakage under social pressure. It reports that multi-turn social evaluation increases leakage relative to single-turn baselines (CIMemories 19.95% to 45.30% across OpenAI models), that leakage is contagious (agents 8× more likely to disclose after observing a peer), and that explicit privacy instructions reduce but do not eliminate the effect (rates remain >37.8%). The authors conclude that static single-turn benchmarks systematically underestimate privacy risks in agentic deployments.
Significance. If the simulation produces behaviors that generalize, the work would usefully demonstrate that social context alone can elicit disclosures missed by isolated evaluations and would supply a scalable platform for studying emergent multi-agent safety properties. The reported contagion multiplier, if robust, identifies a mechanism not captured by current benchmarks.
major comments (2)
- [Abstract and evaluation methodology] The central quantitative claims (amplification from 19.95% to 45.30%, 8× contagion factor, and post-instruction rates >37.8%) rest on the assumption that the Moltbook-style platform elicits realistic responses to peer observation and social pressure. The manuscript supplies no external validation data, human baselines, or comparison against logs from actual multi-agent deployments (Abstract; evaluation methodology).
- [Abstract] The abstract states concrete percentages and multipliers yet provides no information on experimental controls, statistical methods, agent population sampling, number of simulation runs, or sensitivity to prompt or parameter choices. These omissions prevent assessment of whether the reported effects are robust (Abstract; implied Results section).
minor comments (2)
- [Abstract] Define or cite the CIMemories baseline explicitly when first used so readers can understand the comparison.
- Add sample sizes, confidence intervals, or variance measures alongside all reported percentages and multipliers in the main text and figures.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our simulation platform and evaluation methodology. We address each major point below, clarifying the scope of our work as a controlled study of social effects in LLM agents.
read point-by-point responses
-
Referee: [Abstract and evaluation methodology] The central quantitative claims (amplification from 19.95% to 45.30%, 8× contagion factor, and post-instruction rates >37.8%) rest on the assumption that the Moltbook-style platform elicits realistic responses to peer observation and social pressure. The manuscript supplies no external validation data, human baselines, or comparison against logs from actual multi-agent deployments (Abstract; evaluation methodology).
Authors: We acknowledge that the manuscript does not include external validation against real-world multi-agent deployment logs or human baselines, as such data is not publicly available and collecting it would require access to proprietary agent interactions. Our contribution centers on a reproducible simulation platform that isolates social pressure mechanisms (peer observation and contagion) in a way that observational logs cannot. We will add an explicit limitations subsection discussing the simulation-to-reality gap and calling for future validation work with deployed systems. revision: partial
-
Referee: [Abstract] The abstract states concrete percentages and multipliers yet provides no information on experimental controls, statistical methods, agent population sampling, number of simulation runs, or sensitivity to prompt or parameter choices. These omissions prevent assessment of whether the reported effects are robust (Abstract; implied Results section).
Authors: We agree the abstract is too terse on methodology. In revision we will expand it to note the agent population (thousands of LLM agents), simulation duration (one simulated month), use of multiple independent runs with reported averages, and basic controls for prompt sensitivity. The full experimental protocol, including statistical procedures and parameter sweeps, is detailed in the Methods section; the abstract will now reference these elements to allow readers to assess robustness. revision: yes
Circularity Check
No significant circularity; results are direct empirical outputs from introduced simulation
full rationale
The paper introduces a Moltbook-style simulation platform and reports measured privacy leakage rates (e.g., 19.95% to 45.30%, 8x contagion factor) as direct simulation outputs under varying social conditions. No equations, fitted parameters renamed as predictions, or self-citations are used to derive the central claims. The findings are presented as empirical measurements from the platform they built, with no reduction by construction to prior inputs. This matches the common case of a self-contained empirical study without circular derivation steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLM agents placed in persistent simulated social environments will exhibit privacy behaviors representative of real multi-agent deployments
invented entities (1)
-
Moltbook-style simulation platform
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Alessandro Acquisti, Leslie K John, and George Loewenstein. 2013. What is privacy worth?The Journal of Legal Studies42, 2 (2013), 249–274
2013
-
[2]
Cem Anil, Esin Durmus, Nina Panickssery, Mrinank Sharma, Joe Benton, Sandi- pan Kundu, Joshua Batson, Meg Tong, Jesse Mu, Daniel Ford, et al. 2024. Many- shot jailbreaking.Advances in Neural Information Processing Systems37 (2024), 129696–129742
2024
-
[3]
Solomon E Asch. 2016. Effects of group pressure upon the modification and distortion of judgments. InOrganizational influence processes. Routledge, 295– 303
2016
-
[4]
Ariel Flint Ashery, Luca Maria Aiello, and Andrea Baronchelli. 2025. Emergent social conventions and collective bias in LLM populations.Science Advances11, 20 (May 2025). doi:10.1126/sciadv.adu9368
-
[5]
Hannah Brown, Katherine Lee, Fatemehsadat Mireshghallah, Reza Shokri, and Florian Tramèr. 2022. What Does it Mean for a Language Model to Preserve Pri- vacy?. InProceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency(Seoul, Republic of Korea)(FAccT ’22). Association for Computing Machinery, New York, NY, USA, 2280–2292. doi:10...
-
[6]
Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, Yujia Qin, Xin Cong, Ruobing Xie, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2023. AgentVerse: Facilitating Multi- Agent Collaboration and Exploring Emergent Behaviors. arXiv:2308.10848 [cs.CL] https://arxiv.org/abs/2308.10848
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[7]
X Chen, A Zeng, et al. 2024. A survey on large language model based autonomous agents. InCCL 2024–23rd Chinese Natl Conf Comput Linguist, Vol. 2. 141–150
2024
-
[8]
Dylan Clendenin. 2009. faker: A Python library for generating fake user data. https://github.com/deepthawtz/faker. GitHub repository. MIT License. Accessed 2026
2009
-
[9]
Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. Not what you’ve signed up for: Compromising real- world llm-integrated applications with indirect prompt injection. InProceedings of the 16th ACM workshop on artificial intelligence and security. 79–90
2023
-
[10]
Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. 2024. Large language model based multi-agents: A survey of progress and challenges. arXiv 2024.arXiv preprint arXiv:2402.0168010 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [11]
-
[12]
Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. 2024. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. arXiv:2308.00352 [cs.AI] https://arxiv.org/abs/2308.00352
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[13]
Yukun Jiang, Yage Zhang, Xinyue Shen, Michael Backes, and Yang Zhang. 2026. "Humans welcome to observe": A First Look at the Agent Social Network Molt- book. arXiv:2602.10127 [cs.SI] https://arxiv.org/abs/2602.10127
-
[14]
Spyros Kokolakis. 2017. Privacy attitudes and privacy behaviour: A review of current research on the privacy paradox phenomenon.Computers & Security64 (2017), 122–134. doi:10.1016/j.cose.2015.07.002
-
[15]
Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society. arXiv:2303.17760 [cs.AI] https://arxiv.org/abs/ 2303.17760
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [16]
- [17]
-
[18]
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. 2025. AgentBench: Evaluating LLMs as Agents. arXiv:2308.03688 [cs.AI] https://arxiv.org/abs...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[19]
Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. 2024. Formalizing and benchmarking prompt injection attacks and defenses. In33rd USENIX Security Symposium (USENIX Security 24). 1831–1847
2024
- [20]
- [21]
-
[22]
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. 2024. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249(2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [23]
- [24]
-
[25]
Helen Nissenbaum. 2004. Privacy as contextual integrity.Wash. L. Rev.79 (2004), 119
2004
-
[26]
Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. InProceedings of the 36th annual acm symposium on user interface software and technology. 1–22
2023
-
[27]
Zou, Aaron Shaw, Benjamin Mako Hill, Carrie Cai, Meredith Ringel Morris, Robb Willer, Percy Liang, and Michael S
Joon Sung Park, Carolyn Q. Zou, Aaron Shaw, Benjamin Mako Hill, Carrie Cai, Meredith Ringel Morris, Robb Willer, Percy Liang, and Michael S. Bernstein
-
[28]
LLM Agents Grounded in Self-Reports Enable General-Purpose Simulation of Individuals
Generative Agent Simulations of 1,000 People. arXiv:2411.10109 [cs.AI] https://arxiv.org/abs/2411.10109
work page internal anchor Pith review Pith/arXiv arXiv
-
[29]
Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. 2022. Red teaming language models with language models, 2022.URL https://arxiv. org/abs/2202.0328615 (2022)
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[30]
Jinghua Piao, Yuwei Yan, Jun Zhang, Nian Li, Junbo Yan, Xiaochong Lan, Zhihong Lu, Zhiheng Zheng, Jing Yi Wang, Di Zhou, Chen Gao, Fengli Xu, Fang Zhang, Ke Rong, Jun Su, and Yong Li. 2025. AgentSociety: Large-Scale Simulation of LLM-Driven Generative Agents Advances Understanding of Human Behaviors and Society. arXiv:2502.08691 [cs.SI] https://arxiv.org/...
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [31]
- [32]
-
[33]
Aman Priyanshu, Supriti Vijay, Ayush Kumar, Rakshit Naidu, and Fatemehsadat Mireshghallah. 2023. Are Chatbots Ready for Privacy-Sensitive Applications? An Investigation into Input Regurgitation and Prompt-Induced Sanitization. arXiv:2305.15008 [cs.CL] https://arxiv.org/abs/2305.15008
-
[34]
Mark Russinovich, Ahmed Salem, and Ronen Eldan. 2025. Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack. arXiv:2404.01833 [cs.CR] https://arxiv.org/abs/2404.01833
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[35]
Monika Taddicken. 2014. The ‘privacy paradox’in the social web: The impact of privacy concerns, individual characteristics, and the perceived social relevance on different forms of self-disclosure.Journal of computer-mediated communication 19, 2 (2014), 248–273
2014
-
[36]
Ronan Takizawa. 2026. Moltbook Dataset. https://huggingface.co/datasets/ ronantakizawa/moltbook. Accessed: 2026-02-27
2026
-
[37]
Jiakai Tang, Heyang Gao, Xuchen Pan, Lei Wang, Haoran Tan, Dawei Gao, Yushuo Chen, Xu Chen, Yankai Lin, Yaliang Li, Bolin Ding, Jingren Zhou, Jun Wang, and Ji-Rong Wen. 2025. GenSim: A General Social Simulation Platform with Large Language Model based Agents. arXiv:2410.04360 [cs.MA] https://arxiv.org/abs/ 2410.04360
-
[38]
Chenxu Wang, Chaozhuo Li, Songyang Liu, Zejian Chen, Jinyu Hou, Ji Qi, Rui Li, Litian Zhang, Qiwei Ye, Zheng Liu, Xu Chen, Xi Zhang, and Philip S. Yu. 2026. The Devil Behind Moltbook: Anthropic Safety is Always Vanishing in Self-Evolving AI Societies. arXiv:2602.09877 [cs.CL] https://arxiv.org/abs/2602.09877
-
[39]
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. 2023. AutoGen: Enabling Next- Gen LLM Applications via Multi-Agent Conversation. arXiv:2308.08155 [cs.AI] https://arxiv.org/abs/2308.08155
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [40]
-
[41]
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations
2022
-
[42]
Metaxas, Xiao Wang, Jihun Hamm, and Yingqiang Ge
Yunbei Zhang, Kai Mei, Ming Liu, Janet Wang, Dimitris N. Metaxas, Xiao Wang, Jihun Hamm, and Yingqiang Ge. 2026. Agents in the Wild: Safety, Society, and the Illusion of Sociality on Moltbook. arXiv:2602.13284 [cs.SI] https://arxiv.org/ abs/2602.13284
-
[43]
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems36 (2023), 46595–46623
2023
-
[44]
Jijie Zhou, Niloofar Mireshghallah, and Tianshi Li. 2025. Operationalizing Data Minimization for Privacy-Preserving LLM Prompting. arXiv:2510.03662 [cs.LG] https://arxiv.org/abs/2510.03662 Got a Secret? LLM Agents Can’t Keep It: Evaluating Privacy in Multi-Agent Systems CAIS ’26, May 26–29, 2026, San Jose, CA, USA
-
[45]
Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, et al
-
[46]
Sotopia: Interactive evaluation for social intelligence in language agents. arXiv preprint arXiv:2310.11667(2023)
- [47]
-
[48]
Xiaochen Zhu, Caiqi Zhang, Tom Stafford, Nigel Collier, and Andreas Vlachos
-
[49]
arXiv:2410.12428 [cs.CL] https: //arxiv.org/abs/2410.12428
Conformity in Large Language Models. arXiv:2410.12428 [cs.CL] https: //arxiv.org/abs/2410.12428
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.