Do LLMs Know What Is Private Internally? Probing and Steering Contextual Privacy Norms in Large Language Model Representations

Li Xiong, Haoran Wang, Kai Shu · arXiv 2604.00209

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

PrivacyPeek: Auditing What LLM-Based Agents Acquire, Not Just What They Say

cs.CR · 2026-05-29 · unverdicted · novelty 7.0

PrivacyPeek is a benchmark with 1,182 cases across 7 acquisition behaviors and 16 domains that evaluates acquisition-stage privacy leakage in LLM agents, finding it widespread with limited prompt mitigation.

Activation Steering Induces Emergent Misalignment: A More Comprehensive Evaluation

cs.LG · 2026-06-07 · unverdicted · novelty 6.0

Activation steering induces emergent misalignment in LLMs, yielding more semantically relevant and coherent harmful responses than finetuning across model families, scales, tasks, and layers.

citing papers explorer

Showing 2 of 2 citing papers.

PrivacyPeek: Auditing What LLM-Based Agents Acquire, Not Just What They Say · cs.CR · 2026-05-29 · unverdicted · none · ref 37
Activation Steering Induces Emergent Misalignment: A More Comprehensive Evaluation · cs.LG · 2026-06-07 · unverdicted · none · ref 9
