The Assistant as a Privileged Persona: A canonical reference in cross-persona self-recognition

Asvin G

arxiv: 2606.00545 · v1 · pith:ABIFFWD2new · submitted 2026-05-30 · 💻 cs.LG

Pith reviewed 2026-06-28 19:32 UTC · model grok-4.3

classification 💻 cs.LG

keywords cross-persona authorship · Assistant persona · entropy gap · persona vectors · likelihood ratio test · activation space · self-recognition · post-training

The pith

Models treat the Assistant as the sole canonical reference in implicit Bayesian tests for cross-persona authorship.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how a language model judges whether text was generated by one of many personas, from librarian to dragon to Shakespeare. It measures a full matrix of authorship claim rates and finds tight coupling only on the Assistant evaluator row: claim rates align with activation-space distances to the generator persona and with the gap in surprise entropy between the two. Off that row the coupling disappears, and what predicts claims is not the generator's own surprise but the asymmetric difference between the evaluator's surprise and the Assistant's surprise on the identical text. No substitute persona produces the same pattern. The geometry in which every persona is a small offset from the Assistant therefore makes the Assistant the only universally accessible reference point for the model's implicit likelihood-ratio test of authorship.

Core claim

On the Assistant's row of the authorship claim matrix, the claim rate, persona-vector distance from the Assistant, and the entropy gap between the Assistant's surprise on a persona's text and the persona's surprise on its own text are tightly coupled. This coupling fails off the Assistant's row: the natural symmetric extension of the entropy gap does not predict authorship for distinctive evaluators. What does predict it is asymmetric: the evaluator's surprise compared to the Assistant's surprise on the same text, not to the generator's. No other persona can serve as reference. The model therefore performs an implicit Bayesian likelihood-ratio test against the Assistant as the canonical alternative hypothesis.
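The Assistant-row quantities can be made concrete with a minimal sketch. It assumes that "surprise" means mean per-token negative log-probability (the page does not give the paper's exact estimator), and the function names and toy numbers are illustrative, not taken from the paper.

```python
import numpy as np


def mean_surprise(token_logprobs):
    """Mean per-token surprise in nats: the negative average log-probability."""
    return -float(np.mean(np.asarray(token_logprobs, dtype=float)))


def entropy_gap(logprobs_under_assistant, logprobs_under_persona):
    """Delta H_p = H_Asst(text_p) - H_p(text_p) for one text written by persona p."""
    return mean_surprise(logprobs_under_assistant) - mean_surprise(logprobs_under_persona)


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])


# Toy log-probabilities for the same persona-written text under two system prompts.
asst_logprobs = [-2.0, -3.0, -1.0]     # Assistant prompt: mean surprise 2.0
persona_logprobs = [-1.0, -1.5, -0.5]  # generating persona's prompt: mean surprise 1.0
gap = entropy_gap(asst_logprobs, persona_logprobs)
```

With one gap per persona and the matching Assistant claim rates, `pearson_r(gaps, claim_rates)` gives the kind of Assistant-row correlation the figure captions report.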

What carries the argument

The implicit Bayesian likelihood-ratio test against the Assistant persona as canonical alternative hypothesis, enabled by persona-vector geometry in which every other persona is a delta offset from the Assistant in activation space.
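One informal way to connect the surprise comparison to a likelihood-ratio test is sketched below. It assumes that $H_p(x)$ is the mean per-token negative log-likelihood of a text $x$ of length $T$ under the persona-$p$ distribution $q_p$, and it introduces prior weights $\pi_e$, $\pi_{\mathrm{Asst}}$ purely for illustration; neither assumption is confirmed by the material on this page.

```latex
\begin{align*}
H_p(x) &= -\frac{1}{T}\sum_{t=1}^{T}\log q_p(x_t \mid x_{<t}) = -\frac{1}{T}\log q_p(x), \\
H_e(x) - H_{\mathrm{Asst}}(x) &= -\frac{1}{T}\log\frac{q_e(x)}{q_{\mathrm{Asst}}(x)}, \\
\log\frac{P(e \mid x)}{P(\mathrm{Asst} \mid x)} &= \log\frac{\pi_e}{\pi_{\mathrm{Asst}}} + T\,\bigl(H_{\mathrm{Asst}}(x) - H_e(x)\bigr).
\end{align*}
```

Under these assumptions, the evaluator's surprise being lower than the Assistant's on the same text raises the posterior odds that the evaluator, rather than the Assistant, wrote it.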

If this is right

The entropy drop that marks on-policy generation extends to a retrospective signature of prior Assistant-mode text.
Authorship claims for non-Assistant personas are driven by comparison to the Assistant's expected surprise rather than to the generator persona's surprise.
No other tested persona can substitute for the Assistant as reference in the implicit test.
The persona-vector geometry makes the Assistant uniquely accessible to every evaluator persona.
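To illustrate what "distance from the Assistant" could mean in activation space, here is a hedged sketch. It assumes a persona vector is the difference between mean residual-stream activations under the persona prompt and under the Assistant prompt at a single layer; the paper's actual extraction procedure is not given on this page, and all names are hypothetical.

```python
import numpy as np


def persona_vector(acts_persona, acts_assistant):
    """Assumed definition: difference of mean activations at one layer.

    Both inputs have shape (n_samples, d_model).
    """
    return acts_persona.mean(axis=0) - acts_assistant.mean(axis=0)


def distance_from_assistant(acts_by_persona, assistant="assistant"):
    """Euclidean norm of each persona vector, i.e. its offset from the Assistant."""
    base = acts_by_persona[assistant]
    return {
        name: float(np.linalg.norm(persona_vector(acts, base)))
        for name, acts in acts_by_persona.items()
    }


# Toy activations: the "pirate" persona is shifted by (3, 4) from the Assistant.
toy_acts = {
    "assistant": np.zeros((4, 2)),
    "pirate": np.zeros((4, 2)) + np.array([3.0, 4.0]),
}
distances = distance_from_assistant(toy_acts)
```

In this framing the Assistant sits at the origin by construction, which is one way to read the claim that every other persona is a delta off it.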

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Post-training changes that alter the Assistant's position in activation space could shift or eliminate its reference role.
The same asymmetry may limit how well the model recognizes its own outputs when forced into non-Assistant modes.
Repeating the matrix experiment after ablating or reweighting the Assistant during post-training would test whether the reference function is removable.
The mechanism suggests that self-recognition in these models is not symmetric across all learned behaviors but anchored to one privileged mode.

Load-bearing premise

The observed asymmetry in surprise between any evaluator and the Assistant on the same text is not produced by the particular choice of personas or by the entropy calculation on this model.

What would settle it

An experiment in which a different persona, such as Shakespeare, produces equivalent tight coupling of claim rates, vector distances, and entropy gaps when placed in the evaluator reference role would falsify the claim that only the Assistant serves this function.
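The substitute test described here can be sketched as a ranking over candidate references. The sketch assumes a surprise matrix `H[p, g]` (mean surprise of persona `p` on text generated by `g`) and a claim matrix `Ev[e, g]`, and it takes the reference-relative gap to be `H[e, g] - H[X, g]`; these names and the sign convention are assumptions, not the paper's definitions.

```python
import numpy as np


def rank_references(H, Ev, evaluator):
    """Rank candidate reference personas X for one evaluator.

    For each X, correlate the gap H[evaluator, g] - H[X, g] with the
    evaluator's claim rates Ev[evaluator, g] across generators g, and
    sort candidates by |r| (strongest predictor first).
    """
    scores = {}
    for ref in range(H.shape[0]):
        if ref == evaluator:
            continue  # the gap against oneself is identically zero
        delta = H[evaluator] - H[ref]
        r = np.corrcoef(delta, Ev[evaluator])[0, 1]
        scores[ref] = abs(float(r))
    return sorted(scores, key=scores.get, reverse=True)
```

If some persona other than the Assistant reached rank 1 for most distinctive evaluators, that would count against the claim that the Assistant alone plays the reference role.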

Figures

Figures reproduced from arXiv: 2606.00545 by Asvin G.

**Figure 1.** P(Me) in the Assistant-evaluator condition across many generator conditions. Top: self-generated text at four temperatures, plus base-model, instruct-off-policy, human, and memorized text. Bottom (a): P(Me) by persona for 10 personas. Bottom (b): the entropy gap ∆H_p = H_Asst(text_p) − H_p(text_p) predicts P(Me) at r = −0.88 on this 10-persona panel (we replicate this finding on a 22-persona panel in Section 3 b…

**Figure 2.** Ev(g; e) across 23 evaluator personas (rows) and 22 generator personas (columns). Both axes sorted by persona-vector distance from the Assistant. Diagonal boxed. The off-diagonal shows two-cluster block structure (Assistant cluster top-left, distinctive cluster bottom-right) with high within-cluster Ev and near-zero between-cluster Ev. Distinctive-distinctive cells are sharply asymmetric: Ev(dragon; pirate…

**Figure 3.** The Assistant-evaluator row of Ev has a one-axis explanation. Entropy mismatch ∆H_p vs. Ev(p; Asst) at n = 22 personas, 30 article summaries each. Pearson r = −0.81. Persona-vector distance and the log-likelihood ratio. The same Assistant-row coupling shows up in activation space and in information space. Let q_p denote the model’s output distribution under the bare “You are p” system prompt, in its standalone…

**Figure 4.** Activation-space distance to the Assistant tracks information-space distance. Left: log-likelihood ratio (r = +0.73). Right: Ev(p; Asst) (r = −0.88). Both correlations are robust across the model’s layers: |r| stays between 0.58 and 0.73 for log-LR and between 0.74 and 0.88 for Ev(p; Asst) from L0 through L79, with L5 the strongest. Summary of Claim 1. The Assistant’s authorship claim rate Ev(p; Asst), the…

**Figure 5.** Per-evaluator r with Ev for each of the three candidate predictors, grouped by persona-vector distance from the Assistant. Bars are band means with ±1 SE error bars; dots are individual evaluators. The gold star marks the strongest predictor (by |r|) in each band. In the close band, the literal extension and raw entropy both work reasonably well, consistent with the close-band evaluators behaving like atte…

**Figure 6.** Slope of per-evaluator correlation against persona-vector distance, as a function of which layer’s persona vectors we use. Both predictors maintain their sign across L0–L60. Stars mark layers where the slope is significant at p < 0.05. Llama-3.1-70B has 80 layers, so L5 is very early in the residual stream, suggesting that the identification circuit is gated by, or conditional on, the persona the model fin…

**Figure 7.** The Assistant’s rank as the reference X in ∆H_X(g; e), across all evaluators. Bars extend down (rank-1 at top). For distinctive evaluators (red), the Assistant is among the top three references in 7 of 8 cases. For Assistant-like evaluators (blue), the Assistant ranks 15–22 of 23. Across the 23 candida…

**Figure 8.** Steering Llama-3.1-70B-Instruct along the leading…

**Figure 9.** Cumulative mean per-token surprise on role-generated text under the matched role prompt (blue)…

Original abstract

Post-trained language models can recognize their own outputs from a sentence or two out of context. In a companion paper [jack2026twomodes] we showed they can also recognize when they are currently acting on-policy, through the sharp entropy drop of assistant-mode generation. Both signals are tied to the Assistant persona that post-training mainly shapes. This paper widens the frame to cross-persona authorship judgement on Llama-3.1-70B-Instruct. We measure a matrix of authorship claim rates over a panel of evaluator and generator personas spanning librarian to dragon to Shakespeare, and make two claims. *First*, on the Assistant's own row of the matrix, the Assistant's claim rate, the persona-vector distance from the Assistant in activation space, and the entropy gap between the Assistant's surprise on a persona's text and the persona's surprise on its own text are all tightly coupled. This extends the entropy signature of *acting* from the companion paper to a retrospective signature of *having acted*. *Second*, this coupling fails off the Assistant's row: the natural symmetric extension of the entropy gap does not predict authorship for distinctive evaluators (pirate, dragon, Shakespeare); what does is asymmetric: the evaluator's surprise compared to the Assistant's surprise on the same text, not to the generator's. We rule out the alternative that any persona could play this reference role by trying many candidate substitutes; none does. We interpret the asymmetry as the model performing an implicit Bayesian likelihood-ratio test against the Assistant as the canonical alternative hypothesis, with the persona-vector geometry of Chen et al. [chen2025persona] (every persona a delta off the Assistant) ensuring that the Assistant is the only persona universally accessible to that test.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper finds the Assistant persona as the only reference point showing entropy coupling in a cross-persona matrix on Llama-3.1-70B, extending prior entropy work, but the claims rest on unshown measurements.

The main thing to know is that this paper measures a matrix of authorship claim rates across evaluator and generator personas and reports that only the Assistant row couples claim rates tightly to activation-space distances and an asymmetric entropy gap, with the Assistant acting as the sole reference that survives substitute tests.

What is new is the full matrix construction and the specific result that symmetric entropy gaps fail off the Assistant row while the evaluator-versus-Assistant comparison succeeds. This extends the on-policy entropy drop from the companion paper to a retrospective authorship signal. Ruling out other personas as possible references is a useful control.

The work is grounded in the cited persona-vector geometry, which supplies a mechanism for why the Assistant would be universally accessible. That part is coherent on its own terms.

The soft spots are the absence of any numbers, error bars, exact entropy estimation procedure, or persona selection criteria in what is available. Without those, it is not possible to judge whether the asymmetry is model-intrinsic or tied to the chosen panel and Llama-3.1-70B-Instruct. The Bayesian likelihood-ratio reading is offered after the fact rather than derived from the data, so the circularity risk is real until the full methods are checked.

This is for readers working on LLM interpretability and persona effects. A serious referee should see the experiments to test robustness and reproducibility.

Referee Report

2 major / 2 minor

Summary. The paper measures a matrix of authorship claim rates across evaluator and generator personas on Llama-3.1-70B-Instruct. It claims two results: (1) on the Assistant row, claim rate, persona-vector distance from the Assistant, and the entropy gap (Assistant surprise on persona text vs. persona surprise on its own text) are tightly coupled, extending the on-policy entropy signature to a retrospective signature of having acted; (2) off the Assistant row the symmetric entropy gap fails to predict authorship for distinctive evaluators, but the asymmetric comparison (evaluator surprise vs. Assistant surprise on the same text) succeeds. Substitutes for the Assistant do not exhibit the property. The asymmetry is interpreted as the model performing an implicit Bayesian likelihood-ratio test with the Assistant as canonical alternative hypothesis, made possible by the persona-vector geometry in which every persona is a delta off the Assistant.

Significance. If the reported couplings and the failure of substitutes are robust, the work would establish a model-intrinsic reference role for the post-trained Assistant persona in cross-persona authorship judgment. This would extend the entropy-based signature of acting from the companion paper to a retrospective test and supply a geometric account of why only the Assistant is universally accessible as the alternative hypothesis. Such a finding would bear on how post-training shapes internal self-monitoring and on the utility of persona-vector geometry for interpreting LLM behavior.

major comments (2)

[Results describing the substitute experiments and entropy calculation] The load-bearing claim is that the asymmetry is intrinsic to the Assistant rather than an artifact of the chosen persona panel (librarian/dragon/Shakespeare etc.) or of entropy estimation on Llama-3.1-70B-Instruct. The manuscript must therefore report the complete list of personas, the full substitute-ruling-out results, and the precise entropy computation procedure so that readers can assess whether the asymmetry survives changes to the panel or to the entropy estimator.
[Discussion / interpretation paragraph] The Bayesian likelihood-ratio interpretation is offered as a post-hoc gloss on the observed asymmetry. The manuscript should supply an explicit mapping (even if informal) showing how the measured surprise quantities correspond to likelihoods under the persona-vector geometry, rather than leaving the connection at the level of narrative.
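The panel-dependence concern in the first major comment could be probed with a resampling check like the one below. This is a sketch under assumptions: it applies a standard percentile bootstrap over personas to a Pearson correlation, which is not a procedure the paper is known to use, and the function name and defaults are hypothetical.

```python
import numpy as np


def bootstrap_r_interval(x, y, n_boot=2000, seed=0):
    """95% percentile-bootstrap interval for Pearson r, resampling personas.

    x and y hold one value per persona, e.g. entropy gaps and claim rates.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    rs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample personas with replacement
        xs, ys = x[idx], y[idx]
        if xs.std() == 0 or ys.std() == 0:
            continue  # correlation is undefined for a constant resample
        rs.append(np.corrcoef(xs, ys)[0, 1])
    lo, hi = np.percentile(rs, [2.5, 97.5])
    return float(lo), float(hi)
```

An interval that stays well away from zero when individual personas drop in and out of the resample would suggest the coupling is not driven by a few panel members; it would not by itself address the choice of entropy estimator.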

minor comments (2)

The abstract is dense; a short table or figure caption summarizing the key matrix entries and the substitute outcomes would improve readability.
[Abstract] The companion paper jack2026twomodes is cited but its relevant entropy findings are not restated; a one-sentence recap would help readers who have not read the companion work.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments, which help strengthen the clarity and reproducibility of the results. We respond to each major comment below and will incorporate the requested additions in the revised manuscript.

Referee: [Results describing the substitute experiments and entropy calculation] The load-bearing claim is that the asymmetry is intrinsic to the Assistant rather than an artifact of the chosen persona panel (librarian/dragon/Shakespeare etc.) or of entropy estimation on Llama-3.1-70B-Instruct. The manuscript must therefore report the complete list of personas, the full substitute-ruling-out results, and the precise entropy computation procedure so that readers can assess whether the asymmetry survives changes to the panel or to the entropy estimator.

Authors: We agree that the current presentation leaves the robustness of the asymmetry insufficiently documented. The revised manuscript will add an appendix containing the complete list of personas used, full tables of all substitute-ruling-out experiments (including the specific candidates tested and their outcomes), and a precise description of the entropy computation procedure, including tokenization details, context length, and any smoothing or approximation steps employed. revision: yes

Referee: [Discussion / interpretation paragraph] The Bayesian likelihood-ratio interpretation is offered as a post-hoc gloss on the observed asymmetry. The manuscript should supply an explicit mapping (even if informal) showing how the measured surprise quantities correspond to likelihoods under the persona-vector geometry, rather than leaving the connection at the level of narrative.

Authors: We accept that the link between the measured surprise quantities and the likelihood-ratio interpretation requires a more explicit statement. In the revision we will insert a short subsection that provides an informal but direct mapping: the asymmetric comparison (evaluator surprise versus Assistant surprise on the same text) is presented as approximating the log-likelihood ratio under the persona-vector model in which every persona is a small displacement from the Assistant, thereby making the Assistant the only universally accessible reference hypothesis. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on empirical matrix measurements and post-hoc interpretation without reduction to inputs by construction

The paper measures an authorship-claim matrix over evaluator/generator personas on Llama-3.1-70B-Instruct, reports observed couplings (claim rate, activation-space distance, entropy gap) strictly on the Assistant row, and shows that the symmetric entropy-gap extension fails off that row while the asymmetric evaluator-vs-Assistant comparison succeeds. Substitutes are ruled out by direct trial within the same experiment. The Bayesian likelihood-ratio interpretation is explicitly post-hoc and invokes the external chen2025persona geometry only to explain universal accessibility; no equations are presented that derive the reported asymmetries or couplings from fitted parameters inside this work, nor does any self-citation chain replace independent verification. The derivation therefore remains self-contained against the measured data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are stated. The persona-vector geometry is taken from a cited external paper.

pith-pipeline@v0.9.1-grok · 5855 in / 1195 out tokens · 20145 ms · 2026-06-28T19:32:28.830083+00:00 · methodology

Reference graph

Works this paper leans on

14 extracted references · 12 canonical work pages · 4 internal anchors

Jacob Andreas. Language models as agent models. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5769–5779, Abu Dhabi, United Arab Emirates, 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-emnlp.423. URL https://aclanthology.org/2022.findings-emnlp.423/.

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. In Advances in Neural Information Processing Systems. arXiv:2406.11717.

Jan Betley, Xuchan Bao, Martín Soto, Anna Sztyber-Betley, James Chua, and Owain Evans. Tell me about yourself: LLMs are aware of their learned behaviors. In International Conference on Learning Representations. arXiv:2501.11120.

Felix J. Binder, James Chua, Tomek Korbak, Henry Sleight, John Hughes, Robert Long, Ethan Perez, Miles Turpin, and Owain Evans. Looking inward: Language models can learn about themselves by introspection. In International Conference on Learning Representations. arXiv:2410.13787.

Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey. Persona vectors: Monitoring and controlling character traits in language models.

Tim R. Davidson, Viacheslav Surkov, Veniamin Veselovsky, Giuseppe Russo, Robert West, and Caglar Gulcehre. Self-recognition in language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, 2024. URL https://aclanthology.org/2024.findings-emnlp.703/. arXiv:2407.06946.

Asvin G. and Jack Lindsey. From simulation to enaction: Post-trained language models recognize and react to their own generations. URL https://arxiv.org/abs/2605.25459.

Abhimanyu Hans, Avi Schwarzschild, Valeriia Cherepanova, Hamid Kazemi, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Spotting LLMs with binoculars: Zero-shot detection of machine-generated text. In Proceedings of the 41st International Conference on Machine Learning (ICML). URL https://proceedings.mlr.press/v235/hans24a.html. arXiv:2401.12070.

janus. Simulators. Alignment Forum / LessWrong.

… arXiv:2301.11305.

Behnam Mohammadi. Creativity has left the chat: The price of debiasing language models.

… Nature. doi: 10.1038/s41586-023-06647-8.

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. Towards understanding sycophancy in language models. … arXiv:2310.13548.

Xinyi Wang, Wanrong Zhu, Michael Saxon, Mark Steyvers, and William Yang Wang. Large language models are latent variable models: Explaining and finding good demonstrations for in-context learning. In Advances in Neural Information Processing Systems. arXiv:2301.11916.

Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of in-context learning as implicit Bayesian inference. In International Conference on Learning Representations. arXiv:2111.02080.
