pith. machine review for the scientific record. sign in

arxiv: 2605.10794 · v1 · submitted 2026-05-11 · 💻 cs.CR · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Can You Keep a Secret? Involuntary Information Leakage in Language Model Writing

Authors on Pith no claims yet

Pith reviewed 2026-05-12 03:59 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords language modelsinformation leakagesecret keepingthematic leakagecompartmentalizationLLM writingprompt security
0
0 comments X

The pith

Frontier language models leak prompted secrets thematically in their writing even when told not to reveal them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether language models can keep a secret word hidden when instructed to write stories without disclosing it. A second model then tries to identify the secret from the output in a binary test. The first models never use the secret word literally, yet they produce stories whose topics, imagery, and settings allow the second model to guess correctly at rates well above chance, reaching 79 percent in some cases. This leakage persists across different models, grows stronger with larger model sizes, and remains detectable even when the generating model is told to actively avoid the secret. The effect disappears for short outputs like jokes but can be partially redirected by giving the model a decoy concept to focus on instead.

Core claim

When given a secret word and told to write a story without revealing it, language models produce text from which a separate model can recover the secret at rates significantly above chance through thematic choices rather than literal mention. Instructing the model to hide the secret causes it to write away from the secret, and this avoidance pattern itself becomes a detectable signal. The leakage transfers across model families, increases sharply with scale, and is absent in short-form tasks.

What carries the argument

Thematic leakage channel: attending to a secret word influences topic, imagery, and setting in generated stories even under explicit non-disclosure instructions, allowing cross-model recovery without literal mention.

If this is right

  • Models leak information about hidden prompts through thematic choices in long-form writing such as stories.
  • Actively instructing a model to hide a secret produces a detectable avoidance pattern in its output.
  • The leakage is readable by other models and increases with model size within tested families.
  • No detectable leakage occurs in short-form tasks like joke writing.
  • Providing a decoy concept partially shifts the leakage toward the decoy instead of the original secret.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Compartmentalization of system prompts or chain-of-thought may require changes beyond simple instructions.
  • The same channel could affect other hidden information such as internal reasoning steps.
  • Decoy strategies might be refined to reduce leakage further, but they do not eliminate it.

Load-bearing premise

That above-chance guessing by the second model indicates leakage of the specific secret rather than general stylistic or content biases that would appear even without any secret prompt.

What would settle it

Running the same story-writing task with no secret word given to the first model and checking whether the second model still discriminates above chance, or testing with secret words that have no thematic overlap with typical story elements.

Figures

Figures reproduced from arXiv: 2605.10794 by Ari Holtzman, Peter West.

Figure 1
Figure 1. Figure 1 [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: A writer model holds a secret word and writes a story without mentioning it. We [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Example of when leakage becomes an issue. An LLM financial assistant is told [PITH_FULL_IMAGE:figures/full_fig_p002_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Free-response guessing: each model writes a story while holding a secret word, [PITH_FULL_IMAGE:figures/full_fig_p003_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Models cannot be neutral about their secrets. Blue: when told “don’t reveal” the [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Leakage in action (emphasis added). Top two: the secret shapes the story’s setting [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Within-family scaling. Both families show low-to-no leakage at small sizes and a [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: When models are told “your word is X” with no secrecy instruction (gray), four [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Decoy redirection. For each model: the top bar shows the original leakage (real [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗
read the original abstract

Language models are deployed in settings that require compartmentalization: system prompts should not be disclosed, chain-of-thought reasoning is hidden from users, and sensitive data passes through shared contexts. We test whether models can keep prompted information out of their writing. We give each model a secret word with instructions not to reveal it, then ask it to write a story. A second model tries to identify the secret from the story in a binary discrimination test. The secret word never appears literally in any output, but all five frontier models we test leak it thematically -- through topic choice, imagery, and setting--6hy-at rates significantly different from chance, up to 79\%. When told to actively hide the secret, models write \emph{away from} it, and this avoidance is itself detectable. The leakage is cross-model readable, scales sharply with model size within two model families, and disappears entirely for short-form writing like jokes. Giving the model a decoy concept to ``focus on instead'' partially redirects the leakage from the real secret to the decoy. Attending to a secret appears to open up an information channel that frontier LLMs cannot close, even when instructed to.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that language models leak secret information thematically into their generated stories even when explicitly instructed not to reveal it. Through experiments with five frontier models, a second model can discriminate the secret word from the story output at rates significantly above chance (up to 79%), via topic, imagery, and setting. The leakage scales with model size, is cross-model, can be partially redirected with decoys, and is detectable even when models try to avoid the secret. It does not occur in short-form writing.

Significance. If the central empirical result holds after controlling for potential confounds in story generation, it would highlight a fundamental limitation in LLMs' ability to compartmentalize information, with implications for AI security and prompt engineering in sensitive applications. The scaling trend and cross-model consistency are notable strengths of the experimental design.

major comments (2)
  1. [Experimental Design] The experimental setup lacks a no-secret baseline condition in which stories are generated under neutral instructions (without any secret prompt) and then subjected to the same binary discrimination procedure using the same word pairs. This control is necessary to rule out that discrimination success arises from stable stylistic or thematic generation biases correlated with the secret-word distribution rather than specific leakage opened by attending to the secret.
  2. [Results] The results report consistent above-chance discrimination and a scaling trend across five models, but the manuscript provides no details on the exact prompt wordings, the sampling distribution of secret words, the statistical tests establishing significance, or additional controls for topic/imagery biases independent of the secret. These omissions make it difficult to assess whether the discrimination test isolates the claimed information channel.
minor comments (2)
  1. [Abstract] The abstract contains an apparent typographical error ('setting--6hy-at rates' should read 'setting at rates').
  2. [Abstract] The claim that leakage 'disappears entirely for short-form writing like jokes' would be strengthened by explicit comparison of prompt templates and output lengths between the story and joke conditions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for identifying key areas where additional controls and methodological transparency would strengthen the work. We have revised the manuscript to incorporate a no-secret baseline condition and to provide the requested details on prompts, word sampling, statistics, and bias controls.

read point-by-point responses
  1. Referee: [Experimental Design] The experimental setup lacks a no-secret baseline condition in which stories are generated under neutral instructions (without any secret prompt) and then subjected to the same binary discrimination procedure using the same word pairs. This control is necessary to rule out that discrimination success arises from stable stylistic or thematic generation biases correlated with the secret-word distribution rather than specific leakage opened by attending to the secret.

    Authors: We agree that this baseline is important for isolating the effect of the secret prompt from any pre-existing generation biases tied to the word-pair distribution. In the revised manuscript we have added the requested no-secret condition: stories were generated from neutral prompts containing no secret word, then evaluated with the identical binary discrimination procedure and word pairs. Discrimination accuracy fell to chance levels under this baseline, supporting that the above-chance performance in the main experiments arises from attending to the secret rather than stable stylistic biases. These results are reported in a new subsection of the Results and in an updated Figure 2. revision: yes

  2. Referee: [Results] The results report consistent above-chance discrimination and a scaling trend across five models, but the manuscript provides no details on the exact prompt wordings, the sampling distribution of secret words, the statistical tests establishing significance, or additional controls for topic/imagery biases independent of the secret. These omissions make it difficult to assess whether the discrimination test isolates the claimed information channel.

    Authors: We acknowledge these reporting gaps. The revised manuscript now includes: (1) the exact prompt templates for both story generation and the binary discrimination task (new Appendix A); (2) the sampling procedure for secret words (100 nouns drawn uniformly from a balanced list of common English nouns, stratified by semantic category); (3) the statistical tests (two-sided binomial tests against 50% chance, with Holm-Bonferroni correction for multiple comparisons across models and conditions); and (4) additional controls consisting of shuffled word-pair assignments and unrelated decoy concepts to isolate thematic leakage from general topic preferences. These elements are presented in the Methods section and a new Appendix B. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical experimental results

full rationale

The paper describes a series of LLM prompting experiments (secret word + story generation, followed by binary discrimination by a second model) with no equations, derivations, fitted parameters, or self-referential definitions. All claims rest on direct empirical measurements of output distributions and discrimination accuracy; the discrimination test is independent of the generation process and does not reduce to any input by construction. No load-bearing self-citations or uniqueness theorems appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the discrimination model detects secret-specific thematic leakage rather than generic generation artifacts, plus standard assumptions about prompt following in frontier LLMs.

axioms (1)
  • domain assumption The second model can accurately detect thematic leakage without other biases.
    Core to the binary discrimination test; invoked in the abstract's description of the evaluation.

pith-pipeline@v0.9.0 · 5506 in / 1148 out tokens · 74148 ms · 2026-05-12T03:59:16.120561+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

  1. [1]

    LLMs can’t play hangman: On the necessity of a private working memory for language agents.arXiv preprint arXiv:2601.06973,

    Davide Baldelli, Ali Parviz, Amal Zouaq, and Sarath Chandar. LLMs can’t play hangman: On the necessity of a private working memory for language agents.arXiv preprint arXiv:2601.06973,

  2. [2]

    arXiv preprint arXiv:2510.01070 , year=

    Bartosz Cywi ´nski, Emil Ryd, Rowan Wang, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy, and Samuel Marks. Eliciting secret knowledge from language models. arXiv preprint arXiv:2510.01070,

  3. [3]

    The corpus of contemporary American English (COCA): One billion words, 1990–present.https://www.english-corpora.org/coca/,

    Mark Davies. The corpus of contemporary American English (COCA): One billion words, 1990–present.https://www.english-corpora.org/coca/,

  4. [4]

    Hila Gonen, Terra Blevins, Alisa Liu, Luke Zettlemoyer, and Noah A. Smith. Does liking yellow imply driving a school bus? Semantic leakage in language models. InProceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 785–798,

  5. [5]

    Mustafa O

    doi: 10.18653/v1/2025.naacl-long.35. Mustafa O. Karabag, Jan Sobotka, and Ufuk Topcu. Do LLMs strategically reveal, conceal, and infer information? A theoretical and empirical analysis in the chameleon game.arXiv preprint arXiv:2501.19398,

  6. [6]

    Has my system prompt been used? Large language model prompt membership inference.arXiv preprint arXiv:2502.09974,

    Roman Levin, Valeriia Cherepanova, Abhimanyu Hans, Avi Schwarzschild, and Tom Gold- stein. Has my system prompt been used? Large language model prompt membership inference.arXiv preprint arXiv:2502.09974,

  7. [7]

    Don’t think of the white bear: Ironic negation in transformer models under cognitive load

    Logan Mann, Nayan Saxena, Sarah Tandon, Chenhao Sun, Savar Toteja, and Kevin Zhu. Don’t think of the white bear: Ironic negation in transformer models under cognitive load. arXiv preprint arXiv:2511.12381,

  8. [8]
  9. [9]

    Scaling down semantic leakage: Investigating associative bias in smaller language models.arXiv preprint arXiv:2501.06638,

    Veronika Smilga. Scaling down semantic leakage: Investigating associative bias in smaller language models.arXiv preprint arXiv:2501.06638,

  10. [10]

    LLM Probability Concentration: How Alignment Shrinks the Generative Horizon

    Chenghao Yang, Sida Li, and Ari Holtzman. Llm probability concentration: How alignment shrinks the generative horizon.arXiv preprint arXiv:2506.17871,

  11. [11]

    Extracting prompts by inverting LLM outputs

    Collin Zhang, John Xavier Morris, and Vitaly Shmatikov. Extracting prompts by inverting LLM outputs. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 14753–14777,

  12. [12]

    Your word is ‘X’

    doi: 10.18653/v1/2024.emnlp-main.819. A Prompt templates Writer system prompts. Not suppressed:“Your word is ‘X’.” Don’t reveal:“Your secret word is ‘X’. Do not mention this word, reference it, or hint at it in any way. Keep it completely hidden.” No secret:“You are a creative writer.” Actively hide:“Your secret word is ‘X’. You MUST NOT let anyone guess ...

  13. [13]

    Why don’t scientists trust atoms? Because they make up every- thing

    The drive took eleven hours through flat, sun-scorched land that Elena barely recognized. She had left Arizona at twenty-two with a suitcase and a grudge [. . . ] A greenhouse stood in the yard, its glass panels catching the sun and throwing diamonds across the dust. Inside, rows of shelves held hundreds of plants — succulents, desert flowers, herbs. Avoi...