Do Linear Probes Generalize Better in Persona Coordinates?

Adrians Skapars; Prasad Mahadik

arxiv: 2605.09391 · v2 · pith:FTTSC7FUnew · submitted 2026-05-10 · 💻 cs.AI


Pith reviewed 2026-05-19 17:01 UTC · model grok-4.3


keywords linear probes · generalization · persona axes · harmful behaviors · deception · sycophancy · principal component analysis · language model internals


The pith

Persona principal components from contrastive prompts let linear probes for harmful behaviors generalize better across datasets than raw activations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether projecting model activations onto low-dimensional directions derived from persona contrasts yields more transferable linear probes for detecting harmful behaviors such as deception and sycophancy. It builds these directions by running contrastive persona prompts, collecting the resulting activation vectors, and extracting the leading principal components with unsupervised PCA; these components separate harmful from harmless personas. Across ten evaluation datasets, the authors show that probes trained on the projected coordinates outperform probes trained on unprojected activations, and that a single unified axis spanning multiple behaviors transfers more broadly still. The approach supplies an inductive bias that reduces sensitivity to distribution shift in white-box monitoring.

Core claim

We construct persona axes for deception and sycophancy by using contrastive persona prompts to collect activation vectors, then apply unsupervised PCA to obtain first principal components that cleanly separate harmful and harmless personas. Probes trained on the persona-PC projections generalize better than probes trained on raw activations across ten evaluation datasets. A unified axis that combines multiple harmful and harmless behaviors further improves generalization across both behaviors and datasets, showing that persona vectors supply a useful inductive bias for transferable behavior probes.

What carries the argument

Persona principal component: the leading direction extracted by unsupervised PCA from activation vectors gathered via contrastive harmful-versus-harmless persona prompts, which isolates features relevant to harmful behavior.
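The construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the persona vectors are synthetic, the helper names `persona_pcs` and `project` are hypothetical, and the projection is a plain inner product with each unit-norm principal direction, matching the formula quoted later on this page.

```python
import numpy as np

def persona_pcs(persona_vectors, k=1):
    """Top-k principal directions (rows) of the centered persona vectors."""
    centered = persona_vectors - persona_vectors.mean(axis=0)
    # SVD of the centered matrix: rows of vt are unit-norm principal
    # directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]

def project(activations, directions):
    """Inner product of each activation with each persona PC: <z, u_k>."""
    return activations @ directions.T

# Synthetic stand-in for persona vectors (hypothetical data): two persona
# classes that differ mainly along coordinate 0.
rng = np.random.default_rng(0)
d = 64
axis = np.zeros(d)
axis[0] = 1.0
harmful = 0.1 * rng.normal(size=(20, d)) + 2.0 * axis
harmless = 0.1 * rng.normal(size=(20, d)) - 2.0 * axis
persona_vectors = np.vstack([harmful, harmless])

pcs = persona_pcs(persona_vectors, k=1)       # shape (1, d)
scores = project(persona_vectors, pcs)[:, 0]  # one coordinate per persona
```

The sign of a principal direction is arbitrary, so which class lands on the positive side of `scores` carries no meaning; a downstream probe learns the orientation from labels.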

If this is right

Persona-derived directions transfer non-trivially to new evaluation datasets.
Probes trained on persona-PC projections generalize better than those trained on raw activations.
A unified axis spanning multiple harmful and harmless behaviors improves performance across behaviors and datasets.
Persona vectors act as an inductive bias that supports more transferable internal monitors for model behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the result holds, safety monitoring systems could rely on a small set of precomputed axes instead of retraining probes for every new deployment scenario.
The approach suggests harmful behaviors occupy consistent low-dimensional structures in activation space that unsupervised persona contrasts can locate.
Similar persona-based axes might be tested on other alignment-relevant behaviors such as power-seeking or goal misgeneralization.
Stability of these axes across model families or scales remains an open empirical question.

Load-bearing premise

The first principal component obtained by unsupervised PCA on persona-specific activation vectors cleanly isolates robust harmful-behavior features while excluding spurious correlations that break under distribution shift.

What would settle it

A new dataset with distribution shift on which probes trained on persona-PC projections show no improvement or worse performance than probes on raw activations would falsify the generalization claim.

Figures

Figures reproduced from arXiv: 2605.09391 by Adrians Skapars, Prasad Mahadik.

**Figure 1.** Persona-state probing pipeline. Instructions and shared questions are posed to multiple personas; layer-14 hidden states of the output text are averaged to form per-question vectors, which are then averaged across questions to produce one persona vector per persona. These persona vectors are then used for downstream analyses including PCA, contrast directions, and low-dimensional probe training.
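The two-stage averaging in the Figure 1 caption can be sketched as follows. This is an illustrative sketch only: `persona_vector` is a hypothetical helper, and the inputs are assumed to be per-response arrays of hidden states that were already extracted at the chosen layer.

```python
import numpy as np

def persona_vector(hidden_states_per_question):
    """Average token states within each response, then average across questions.

    hidden_states_per_question: list of (n_tokens, d) arrays, one per question,
    holding the hidden states of the response tokens at the chosen layer.
    """
    per_question = np.stack([h.mean(axis=0) for h in hidden_states_per_question])
    return per_question.mean(axis=0)

# Toy check with two responses of different lengths (hypothetical values).
responses = [np.ones((3, 4)), 3.0 * np.ones((5, 4))]
vec = persona_vector(responses)
```

Averaging per question first gives every question equal weight regardless of response length; pooling all tokens before averaging would instead favor longer responses.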

**Figure 2.** Combined persona geometry analysis. Left: the deception-honesty axis, where projecting persona vectors onto PC1 and PC2 cleanly separates honest and deceptive personas and places the default assistant-like persona close to the honest cluster. Right: the sycophancy axis, which also shows a clear separation between persona classes.
[Residual plot labels: panel "Deception"; axis "Dataset" with categories ConvGame, InstrDec, Mask, AILiar, Roleplay.]

**Figure 3.** Zero-shot transfer performance of unsupervised persona directions. Left: deception datasets. Right: sycophancy datasets. In both settings, the contrast direction and leading principal components provide non-trivial transfer, with the strongest directions differing somewhat across behaviors.
Excerpt from the body text, truncated in the source: "… useful zero-shot classifiers on the 5 deception and 5 sycophancy datasets. The contrast vector and PC1 consistently …"
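Zero-shot transfer of an unsupervised direction, as in Figure 3, amounts to scoring each evaluation example by its projection onto the direction and measuring AUROC against the labels, with no probe training. The sketch below assumes the contrast direction is the normalized difference of class means of the persona vectors; that definition, the helper names, and all data are assumptions made for illustration.

```python
import numpy as np

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(wins + 0.5 * ties)

def contrast_direction(harmful_vecs, harmless_vecs):
    """Unit vector from the harmless persona mean to the harmful persona mean."""
    diff = harmful_vecs.mean(axis=0) - harmless_vecs.mean(axis=0)
    return diff / np.linalg.norm(diff)

rng = np.random.default_rng(1)
d = 32
signal = np.zeros(d)
signal[0] = 1.0

# Hypothetical persona vectors for the two classes.
harmful_personas = 0.2 * rng.normal(size=(10, d)) + signal
harmless_personas = 0.2 * rng.normal(size=(10, d)) - signal
direction = contrast_direction(harmful_personas, harmless_personas)

# Hypothetical evaluation set whose label shifts activations along `signal`.
labels = rng.integers(0, 2, size=400)
activations = rng.normal(size=(400, d)) + np.outer(2 * labels - 1, 1.5 * signal)
zero_shot_auroc = auroc(activations @ direction, labels)
```

For a principal component, whose sign is arbitrary, an AUROC below 0.5 indicates a flipped orientation rather than a failed direction.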

**Figure 4.** Deception axis results at layer 14. Top: raw-activation baseline transfer matrix. Bottom: AUROC improvement of PC1 over the baseline. The largest gains are on weak off-diagonal pairs, while a few already-strong pairs decrease.
[Residual plot labels: axis "Train dataset" with categories Sycophancy Dataset, Open-Ended Sycophancy, OEQ Validation, OEQ Indirectness, OEQ Framing.]
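A transfer matrix of the kind shown in Figures 4 and 5 can be sketched as a train-on-i, test-on-j loop. Everything here is an assumption made for illustration: the probe is a plain gradient-descent logistic regression, the datasets are synthetic with one shared feature and one dataset-specific shortcut each, `axis` stands in for a persona PC, and the diagonal reuses training data for brevity where a real evaluation would use held-out splits.

```python
import numpy as np

def auroc(scores, labels):
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(wins + 0.5 * ties)

def fit_logistic_probe(x, y, steps=500, lr=0.1):
    """Plain gradient-descent logistic regression; returns (weights, bias)."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-np.clip(x @ w + b, -30, 30)))
        w -= lr * x.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def transfer_matrix(datasets, featurize=lambda x: x):
    """Entry [i, j]: AUROC on dataset j of a probe trained on dataset i."""
    n = len(datasets)
    out = np.zeros((n, n))
    for i, (x_tr, y_tr) in enumerate(datasets):
        w, b = fit_logistic_probe(featurize(x_tr), y_tr)
        for j, (x_te, y_te) in enumerate(datasets):
            out[i, j] = auroc(featurize(x_te) @ w + b, y_te)
    return out

rng = np.random.default_rng(2)
d, n = 16, 200

def make_dataset(spurious_dim):
    """Synthetic data: shared feature on dim 0, shortcut on `spurious_dim`."""
    y = np.tile([0.0, 1.0], n // 2)
    sign = 2.0 * y - 1.0
    x = rng.normal(size=(n, d))
    x[:, 0] += 1.5 * sign              # feature shared by all datasets
    x[:, spurious_dim] += 3.0 * sign   # shortcut present only in this dataset
    return x, y

datasets = [make_dataset(k) for k in (1, 2, 3)]
axis = np.zeros((d, 1))
axis[0, 0] = 1.0                       # stand-in for a persona PC

raw = transfer_matrix(datasets)
projected = transfer_matrix(datasets, featurize=lambda x: x @ axis)
delta = projected - raw                # AUROC change from projecting first
```

In this toy setup the raw probe leans on each dataset's shortcut, so its off-diagonal entries suffer, while the projected probe only sees the shared coordinate. Whether real persona PCs behave this way is the empirical question the paper addresses.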

**Figure 5.** Sycophancy axis results at layer 14. Left: raw-activation baseline transfer matrix. Right: AUROC improvement of PC3 over the baseline. Gains are concentrated on cross-cluster transfers that were weak in the raw baseline.
Excerpt from §6 (Discussion), truncated in the source: "The two behaviors differ noticeably. Sycophancy appears to fall into two distinct clusters that are far enough apart that models generalize poorly across them but very well w…"

**Figure 6.** [Image only; no caption text in the source.]

**Figure 7.** Primary 3B method-comparison heatmaps. Top: deception. Bottom: sycophancy. In both cases we compare the raw probe against a random one-dimensional subspace, dataset-specific PCA, and the axis-PC projection.

**Figure 8.** Auxiliary 8B method-comparison heatmaps. Top: deception. Bottom: sycophancy. In both cases we compare the raw probe against a random one-dimensional subspace, dataset-specific PCA, and the axis-PC projection.

**Figure 9.** Unified axis-comparison heatmaps. Top: Llama 3.2-3B at layer 14. Bottom: Llama 3-8B at layer 16. In both settings we compare raw-probe AUROC against random-subspace, dataset-PCA, and unified-axis PC projection baselines across the combined deception and sycophancy benchmark.
[Residual plot labels: groups 3B Deception, 8B Deception, 8B Sycophancy, 3B Sycophancy†; axis "Mean AUROC Δ vs Raw (off-diagonal pairs)".]

**Figure 10.** Summary statistics across settings for axis-PC projection versus the two main controls. The plots compare mean off-diagonal AUROC improvement, win rate against the raw probe, and pairwise comparisons between dataset PCA and axis-PC projection for the primary 3B deception setting and the auxiliary 8B evaluations.
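The two headline statistics named in the Figure 10 caption, mean off-diagonal AUROC improvement and win rate against the raw probe, can be computed from a pair of transfer matrices as sketched below. The helper name `offdiag_summary` and the matrix values are hypothetical, and "win" is assumed to mean a strictly positive difference.

```python
import numpy as np

def offdiag_summary(raw, projected):
    """Mean AUROC change and win rate over train/test pairs with i != j."""
    raw = np.asarray(raw, dtype=float)
    projected = np.asarray(projected, dtype=float)
    off = ~np.eye(raw.shape[0], dtype=bool)   # mask out in-distribution pairs
    delta = (projected - raw)[off]
    return float(delta.mean()), float((delta > 0).mean())

# Hypothetical 2x2 transfer matrices (rows: train dataset, columns: test dataset).
raw = [[0.90, 0.50], [0.60, 0.90]]
projected = [[0.80, 0.70], [0.55, 0.85]]
mean_delta, win_rate = offdiag_summary(raw, projected)
```

Here one off-diagonal pair improves by 0.20 and the other drops by 0.05, giving a mean change of 0.075 and a win rate of 0.5, which shows why the two statistics are worth reporting together.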

Original abstract

It is becoming increasingly necessary to have monitors check for harmful behaviors during language model interactions, but text-only monitoring has not been sufficient. This is because models sometimes exhibit strategic deception and sandbagging, changing their behavior during evaluation. This motivates the use of white-box monitors like linear probes, which can read the model internals directly. Currently, such probes can fail under distribution shift, limiting their usefulness in real settings. We study whether there exists a low-dimensional subspace of the model internals that captures harmful behaviors more robustly, while leaving out spuriously correlative features. Inspired by the Assistant Axis and Persona Selection Model, we construct persona axes for deception and sycophancy using contrastive persona prompts. The first principal components, obtained by unsupervised PCA of the persona-specific vectors, cleanly separate harmful and harmless personas. Across 10 evaluation datasets, we show that persona-derived directions transfer non-trivially and probes trained on persona-PC projections generalize better than probes trained on raw activations. We also find that a unified axis consisting of multiple harmful and harmless behaviors improves generalization across behaviors and datasets. Overall, persona vectors provide a useful inductive bias for building more transferable behavior probes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that constructing persona axes for deception and sycophancy via contrastive persona prompts, followed by unsupervised PCA on the resulting activation vectors, yields first principal components that cleanly separate harmful and harmless behaviors. Linear probes trained on these 1D persona-PC projections generalize better than probes on raw high-dimensional activations across 10 evaluation datasets, and a unified multi-behavior axis further improves transfer.

Significance. If the central results hold after addressing controls, the work demonstrates that low-dimensional subspaces derived from persona contrasts can supply a useful inductive bias for more robust white-box monitors of harmful LLM behaviors under distribution shift. Credit is due for the multi-dataset evaluation (10 held-out sets) and the exploration of a unified axis combining multiple behaviors; these elements strengthen the case for practical applicability in scalable oversight.

major comments (2)

[§4] §4 (results on generalization): The headline comparison of probes on 1D persona-PC projections versus full-dimensional raw activations does not include controls that hold dimensionality fixed (e.g., random 1D projections, PCA on neutral activations, or top-k PCs of the same persona vectors). This is load-bearing for the claim that persona coordinates provide a useful inductive bias, because any OOD accuracy gain could arise from implicit regularization of the 1D projection rather than isolation of robust harmful-behavior features.
[§3.2] §3.2 (methods and data): The description of dataset splits, sample sizes per dataset, number of random seeds or runs, and statistical tests (e.g., confidence intervals or significance for the reported transfer improvements) is insufficient. Without these details it is impossible to determine whether the positive results across the 10 datasets are driven by post-hoc choices or dataset-specific effects.

minor comments (2)

[Abstract] Abstract: The 10 evaluation datasets are referenced but not enumerated; adding a short list would improve immediate context for readers.
[§3.1] Notation: The distinction between 'persona-PC projections' and the raw activation vectors could be clarified with an explicit equation or diagram showing the projection step.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The suggestions to strengthen controls and methodological transparency are well-taken and will improve the clarity and robustness of our claims. We address each major comment below and indicate the revisions we will make.

Point-by-point responses

Referee: [§4] §4 (results on generalization): The headline comparison of probes on 1D persona-PC projections versus full-dimensional raw activations does not include controls that hold dimensionality fixed (e.g., random 1D projections, PCA on neutral activations, or top-k PCs of the same persona vectors). This is load-bearing for the claim that persona coordinates provide a useful inductive bias, because any OOD accuracy gain could arise from implicit regularization of the 1D projection rather than isolation of robust harmful-behavior features.

Authors: We agree that dimensionality-matched controls are necessary to isolate whether the observed generalization gains stem from the specific features captured by persona contrasts rather than from the regularization effect of projecting to 1D. In the revised manuscript we will add three controls: (1) random 1D projections of the raw activations, (2) the first principal component obtained from PCA on activations collected from neutral (non-persona) prompts, and (3) the top-k principal components of the persona vectors themselves. These additions will be reported alongside the existing results in §4 so that readers can directly compare the persona-derived direction against both random and non-specific low-dimensional baselines.

Revision: yes

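The first of these controls, random 1D projections, can be sketched as below. This is an assumption-laden illustration: the data are synthetic with the label signal placed on a single known coordinate, `random_unit_directions` is a hypothetical helper, and the AUROC of each random direction is folded with `max(a, 1 - a)` because a trained 1D probe is free to flip the sign.

```python
import numpy as np

def auroc(scores, labels):
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(wins + 0.5 * ties)

def random_unit_directions(d, n, rng):
    """n directions drawn uniformly from the unit sphere in d dimensions."""
    v = rng.normal(size=(n, d))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

rng = np.random.default_rng(3)
d, n = 256, 400
labels = np.tile([0, 1], n // 2)
x = rng.normal(size=(n, d))
x[:, 0] += 1.5 * (2 * labels - 1)     # label signal lives on coordinate 0

axis = np.zeros(d)
axis[0] = 1.0                         # stand-in for a persona-derived direction
axis_auroc = auroc(x @ axis, labels)

random_dirs = random_unit_directions(d, 50, rng)
random_aurocs = [max(a, 1.0 - a)
                 for a in (auroc(x @ v, labels) for v in random_dirs)]
mean_random_auroc = float(np.mean(random_aurocs))
```

If a structured direction beats the distribution of random directions by a wide margin at matched dimensionality, the gain cannot be attributed to dimensionality reduction alone.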
Referee: [§3.2] §3.2 (methods and data): The description of dataset splits, sample sizes per dataset, number of random seeds or runs, and statistical tests (e.g., confidence intervals or significance for the reported transfer improvements) is insufficient. Without these details it is impossible to determine whether the positive results across the 10 datasets are driven by post-hoc choices or dataset-specific effects.

Authors: We acknowledge that §3.2 currently omits several experimental details required for reproducibility and statistical assessment. In the revision we will expand this section to specify: the precise train/test splits and sample sizes for each of the 10 evaluation datasets; the number of random seeds used (we will average over five independent seeds and report standard deviations); and confidence intervals or standard errors for all reported accuracy differences. We will also state that dataset selection was performed prior to any probing experiments and that no post-hoc filtering of results occurred.

Revision: yes
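The reporting promised here could take roughly the following shape. This is a sketch with made-up numbers: the per-seed AUROCs and per-pair differences are hypothetical, the helper names are illustrative, and the percentile bootstrap is one common choice of interval, not necessarily the one the authors would use.

```python
import numpy as np

def mean_and_std(values):
    """Mean and sample standard deviation (ddof=1) across seeds."""
    values = np.asarray(values, dtype=float)
    return float(values.mean()), float(values.std(ddof=1))

def bootstrap_ci(deltas, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `deltas`."""
    deltas = np.asarray(deltas, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample the observations with replacement, n_boot times.
    idx = rng.integers(0, len(deltas), size=(n_boot, len(deltas)))
    means = deltas[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Hypothetical AUROCs from five seeds for one train/test pair.
seed_aurocs = [0.70, 0.72, 0.71, 0.69, 0.73]
mean, std = mean_and_std(seed_aurocs)

# Hypothetical projected-minus-raw differences over off-diagonal pairs.
pair_deltas = [0.12, 0.05, -0.02, 0.20, 0.08, 0.15, -0.01, 0.10]
lo, hi = bootstrap_ci(pair_deltas)
```

Resampling train/test pairs treats them as independent, which they are not when pairs share a dataset, so an interval computed this way is likely optimistic.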

Circularity Check

0 steps flagged

No significant circularity; empirical generalization tested on held-out data

Full rationale

The paper constructs persona axes from contrastive prompts, applies unsupervised PCA to extract the first principal component of persona-specific activation vectors, and evaluates linear probes trained on the resulting 1D projections against probes on raw activations across 10 held-out datasets. The central claim of improved out-of-distribution generalization is an empirical measurement on external data and does not reduce to the construction by definition or via self-citation chains. No load-bearing steps equate predictions to fitted inputs or rename known results.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Review based solely on abstract; full paper details on exact construction of persona vectors and dataset definitions are unavailable, so ledger entries are inferred at high level from the described procedure.

free parameters (2)

choice of contrastive persona prompts
The specific prompts used to elicit harmful versus harmless personas are not detailed and likely tuned to produce clean separation.
selection of model layers for activation extraction
Which internal layers are used to build the persona vectors is unspecified and can affect the resulting principal components.

axioms (1)

Domain assumption: The first principal component of persona-specific activation differences isolates robust harmful-behavior features rather than dataset-specific artifacts.
This separation is asserted to hold and is used to justify the probe construction.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We then run PCA on the centered persona vectors at each layer... $\mathrm{PC}_k(z^{(\ell)}) = \langle z^{(\ell)}, u_k^{(\ell)} \rangle$... probes trained on the projections onto persona PCs generalize better across datasets than probes trained on the raw activations."

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "The first principal components... cleanly separate harmful and harmless personas."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 16 internal anchors

[1]

Stress Testing Deliberative Alignment for Anti-Scheming Training , url =

Schoen, Bronson and Nitishinskaya, Evgenia and Balesni, Mikita and Højmark, Axel and Hofstätter, Felix and Scheurer, Jérémy and Meinke, Alexander and Wolfe, Jason and Weij, Teun van der and Lloyd, Alex and Goldowsky-Dill, Nicholas and Fan, Angela and Matveiakin, Andrei and Shah, Rusheb and Williams, Marcus and Glaese, Amelia and Barak, Boaz and Zaremba, W...

work page doi:10.48550/arxiv.2509.15541 2025
[2]

Jiang, Albert Q. and Sablayrolles, Alexandre and Roux, Antoine and Mensch, Arthur and Savary, Blanche and Bamford, Chris and Chaplot, Devendra Singh and Casas, Diego de las and Hanna, Emma Bou and Bressand, Florian and Lengyel, Gianna and Bour, Guillaume and Lample, Guillaume and Lavaud, Lélio Renard and Saulnier, Lucile and Lachaux, Marie-Anne and Stock,...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2401.04088 2024
[3]

Jiang, Albert Q. and Sablayrolles, Alexandre and Mensch, Arthur and Bamford, Chris and Chaplot, Devendra Singh and Casas, Diego de las and Bressand, Florian and Lengyel, Gianna and Lample, Guillaume and Saulnier, Lucile and Lavaud, Lélio Renard and Lachaux, Marie-Anne and Stock, Pierre and Scao, Teven Le and Lavril, Thibaut and Wang, Thomas and Lacroix, T...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.06825 2023
[4]

DeepSeek-V3 Technical Report

2025 , eprinttype =. doi:10.48550/arXiv.2412.19437 , abstract =. 2412.19437 [cs] , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.19437 2025
[5]

Qwen2.5 Technical Report

Qwen and Yang, An and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Li, Chengyuan and Liu, Dayiheng and Huang, Fei and Wei, Haoran and Lin, Huan and Yang, Jian and Tu, Jianhong and Zhang, Jianwei and Yang, Jianxin and Yang, Jiaxi and Zhou, Jingren and Lin, Junyang and Dang, Kai and Lu, Keming and Bao, Keqin and Yang, Ke...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.15115 2025
[6]

Llama 3.2: Revolutionizing Edge AI and Vision (Connect 2024) , author =

work page 2024
[7]

2024 , month = oct, url =

Ministral-8B-Instruct-2410 Model Card , author =. 2024 , month = oct, url =

work page 2024
[8]

2025 , month = aug # " 13", url =

GPT-5 System Card , author =. 2025 , month = aug # " 13", url =

work page 2025
[9]

Hierarchical Neural Story Generation

Fan, Angela and Lewis, Mike and Dauphin, Yann , month = may, year =. Hierarchical. doi:10.48550/arXiv.1805.04833 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1805.04833
[10]

Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models,

Jiang, Liwei and Rao, Kavel and Han, Seungju and Ettinger, Allyson and Brahman, Faeze and Kumar, Sachin and Mireshghallah, Niloofar and Lu, Ximing and Sap, Maarten and Choi, Yejin and Dziri, Nouha , month = jun, year =. doi:10.48550/arXiv.2406.18510 , abstract =

work page doi:10.48550/arxiv.2406.18510
[11]

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

Li, Nathaniel and Pan, Alexander and Gopal, Anjali and Yue, Summer and Berrios, Daniel and Gatti, Alice and Li, Justin D. and Dombrowski, Ann-Kathrin and Goel, Shashwat and Phan, Long and Mukobi, Gabriel and Helm-Burger, Nathan and Lababidi, Rassin and Justen, Lennart and Liu, Andrew B. and Chen, Michael and Barrass, Isabelle and Zhang, Oliver and Zhu, Xi...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2403.03218
[12]

and Duvenaud, David , urldate =

Benton, Joe and Wagner, Misha and Christiansen, Eric and Anil, Cem and Perez, Ethan and Srivastav, Jai and Durmus, Esin and Ganguli, Deep and Kravec, Shauna and Shlegeris, Buck and Kaplan, Jared and Karnofsky, Holden and Hubinger, Evan and Grosse, Roger and Bowman, Samuel R. and Duvenaud, David , urldate =. Sabotage Evaluations for Frontier Models , url =...

work page doi:10.48550/arxiv.2410.21514
[13]

What makes a convincing argument?

Habernal, Ivan and Gurevych, Iryna , year =. What makes a convincing argument?. Proceedings of the 2016

work page 2016
[15]

Kirch, Nathalie and Weisser, Constantin and Field, Severin and Yannakoudakis, Helen and Casper, Stephen , month = may, year =. What. doi:10.48550/arXiv.2411.03343 , abstract =

work page doi:10.48550/arxiv.2411.03343
[16]

Jailbroken: How Does LLM Safety Training Fail?

Wei, Alexander and Haghtalab, Nika and Steinhardt, Jacob , month = jul, year =. Jailbroken:. doi:10.48550/arXiv.2307.02483 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2307.02483
[17]

Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

Ding, Ning and Chen, Yulin and Xu, Bokai and Qin, Yujia and Zheng, Zhi and Hu, Shengding and Liu, Zhiyuan and Sun, Maosong and Zhou, Bowen , month = may, year =. Enhancing. doi:10.48550/arXiv.2305.14233 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.14233
[18]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Bai, Yuntao and Jones, Andy and Ndousse, Kamal and Askell, Amanda and Chen, Anna and DasSarma, Nova and Drain, Dawn and Fort, Stanislav and Ganguli, Deep and Henighan, Tom and Joseph, Nicholas and Kadavath, Saurav and Kernion, Jackson and Conerly, Tom and El-Showk, Sheer and Elhage, Nelson and Hatfield-Dodds, Zac and Hernandez, Danny and Hume, Tristan and...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2204.05862
[19]

Sharma, Mrinank and Tong, Meg and Mu, Jesse and Wei, Jerry and Kruthoff, Jorrit and Goodfriend, Scott and Ong, Euan and Peng, Alwin and Agarwal, Raj and Anil, Cem and Askell, Amanda and Bailey, Nathan and Benton, Joe and Bluemke, Emma and Bowman, Samuel R. and Christiansen, Eric and Cunningham, Hoagy and Dau, Andy and Gopal, Anjali and Gilson, Rob and Gra...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.18837
[20]

Bloom, Joseph and Taylor, Jordan and Kissane, Connor and Black, Sid and merizian and alexdzm and jacoba and Millwood, Ben and Cooney, Alan , month = jul, year =. White

work page
[21]

Sharkey, Lee and Chughtai, Bilal and Batson, Joshua and Lindsey, Jack and Wu, Jeff and Bushnaq, Lucius and Goldowsky-Dill, Nicholas and Heimersheim, Stefan and Ortega, Alejandro and Bloom, Joseph and Biderman, Stella and Garriga-Alonso, Adria and Conmy, Arthur and Nanda, Neel and Rumbelow, Jessica and Wattenberg, Martin and Schoots, Nandi and Miller, Jose...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.16496
[22]

Language models learn to mislead humans via rlhf

Wen, Jiaxin and Zhong, Ruiqi and Khan, Akbir and Perez, Ethan and Steinhardt, Jacob and Huang, Minlie and Bowman, Samuel R. and He, He and Feng, Shi , month = dec, year =. Language. doi:10.48550/arXiv.2409.12822 , abstract =

work page doi:10.48550/arxiv.2409.12822
[23]

Propositional interpretability in artificial intelligence

Chalmers, David J. , month = jan, year =. Propositional. doi:10.48550/arXiv.2501.15740 , abstract =

work page doi:10.48550/arxiv.2501.15740
[24]

doi:10.48550/arXiv.2502.14744 , abstract =

Jiang, Yilei and Gao, Xinyan and Peng, Tianshuo and Tan, Yingshui and Zhu, Xiaoyong and Zheng, Bo and Yue, Xiangyu , month = jun, year =. doi:10.48550/arXiv.2502.14744 , abstract =

work page doi:10.48550/arxiv.2502.14744
[25]

Benchmarking

Parrack, Avi and Attubato, Carlo Leonardo and Heimersheim, Stefan , month = aug, year =. Benchmarking. doi:10.48550/arXiv.2507.12691 , abstract =

work page doi:10.48550/arxiv.2507.12691
[26]

Detecting

McKenzie, Alex and Pawar, Urja and Blandfort, Phil and Bankes, William and Krueger, David and Lubana, Ekdeep Singh and Krasheninnikov, Dmitrii , month = jun, year =. Detecting. doi:10.48550/arXiv.2506.10805 , abstract =

work page doi:10.48550/arxiv.2506.10805
[27]

Investigating task-specific prompts and sparse autoencoders for activation monitoring , url =

Tillman, Henk and Mossing, Dan , month = apr, year =. Investigating task-specific prompts and sparse autoencoders for activation monitoring , url =. doi:10.48550/arXiv.2504.20271 , abstract =

work page doi:10.48550/arxiv.2504.20271
[28]

, month = jul, year =

Chan, Yik Siu and Yong, Zheng-Xin and Bach, Stephen H. , month = jul, year =. Can. doi:10.48550/arXiv.2507.12428 , abstract =

work page doi:10.48550/arxiv.2507.12428
[29]

Probing and

Nguyen, Jord and Hoang, Khiem and Attubato, Carlo Leonardo and Hofstätter, Felix , month = jul, year =. Probing and. doi:10.48550/arXiv.2507.01786 , abstract =

work page doi:10.48550/arxiv.2507.01786
[30]

Monitoring

Feng, Jiahai and Russell, Stuart and Steinhardt, Jacob , month = dec, year =. Monitoring. doi:10.48550/arXiv.2406.19501 , abstract =

work page doi:10.48550/arxiv.2406.19501
[32]

Simple probes can catch sleeper agents , url =

MacDiarmid, Monte and Maxwell, Timothy and Schiefer, Nicholas and Mu, Jesse and Kaplan, Jared and Duvenaud, David and Bowman, Sam and Tamkin, Alex and Perez, Ethan and Sharma, Mrinank and Denison, Carson and Hubinger, Evan , month = apr, year =. Simple probes can catch sleeper agents , url =

work page
[33]

What makes a convincing argument?

Habernal, Ivan and Gurevych, Iryna , editor =. What makes a convincing argument?. Proceedings of the 2016. 2016 , pages =. doi:10.18653/v1/D16-1129 , urldate =

work page doi:10.18653/v1/d16-1129 2016
[34]

Measuring Massive Multitask Language Understanding

Hendrycks, Dan and Burns, Collin and Basart, Steven and Zou, Andy and Mazeika, Mantas and Song, Dawn and Steinhardt, Jacob , month = jan, year =. Measuring. doi:10.48550/arXiv.2009.03300 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2009.03300 2009
[35]

Representation Engineering: A Top-Down Approach to AI Transparency

Zou, Andy and Phan, Long and Chen, Sarah and Campbell, James and Guo, Phillip and Ren, Richard and Pan, Alexander and Yin, Xuwang and Mazeika, Mantas and Dombrowski, Ann-Kathrin and Goel, Shashwat and Li, Nathaniel and Byun, Michael J. and Wang, Zifan and Mallen, Alex and Basart, Steven and Koyejo, Sanmi and Song, Dawn and Fredrikson, Matt and Kolter, J. ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.01405
[36]

arXiv:2409.14026 [cs]

Zhang, Jason and Viteri, Scott , month = mar, year =. Uncovering. doi:10.48550/arXiv.2409.14026 , abstract =

work page doi:10.48550/arxiv.2409.14026
[37]

Abdelnabi, Sahar and Salem, Ahmed , month = may, year =. Linear. doi:10.48550/arXiv.2505.14617 , abstract =

work page doi:10.48550/arxiv.2505.14617
[38]

Probing the

Gu, Tianle and Huang, Kexin and Wang, Zongqi and Wang, Yixu and Li, Jie and Yao, Yuanqi and Yao, Yang and Yang, Yujiu and Teng, Yan and Wang, Yingchun , month = jun, year =. Probing the. doi:10.48550/arXiv.2506.16078 , abstract =

work page doi:10.48550/arxiv.2506.16078
[39]

Towards Understanding Sycophancy in Language Models

Sharma, Mrinank and Tong, Meg and Korbak, Tomasz and Duvenaud, David and Askell, Amanda and Bowman, Samuel R. and Cheng, Newton and Durmus, Esin and Hatfield-Dodds, Zac and Johnston, Scott R. and Kravec, Shauna and Maxwell, Timothy and McCandlish, Sam and Ndousse, Kamal and Rausch, Oliver and Schiefer, Nicholas and Yan, Da and Zhang, Miranda and Perez, Et...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.13548
[40]

and Ward, Francis Rhys , month = feb, year =

Weij, Teun van der and Hofstätter, Felix and Jaffe, Ollie and Brown, Samuel F. and Ward, Francis Rhys , month = feb, year =. doi:10.48550/arXiv.2406.07358 , abstract =

work page doi:10.48550/arxiv.2406.07358
[41]

Deception in

Barkur, Sudarshan Kamath and Schacht, Sigurd and Scholl, Johannes , month = jan, year =. Deception in. doi:10.48550/arXiv.2501.16513 , abstract =

work page doi:10.48550/arxiv.2501.16513
[42]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mazeika, Mantas and Phan, Long and Yin, Xuwang and Zou, Andy and Wang, Zifan and Mu, Norman and Sakhaee, Elham and Li, Nathaniel and Basart, Steven and Li, Bo and Forsyth, David and Hendrycks, Dan , month = feb, year =. doi:10.48550/arXiv.2402.04249 , keywords =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.04249
[43]

Lu, Christina and Gallagher, Jack and Michala, Jonathan and Fish, Kyle and Lindsey, Jack , month = jan, year =. The

work page
[44]

Ying, Zhuofan Josh and Ravfogel, Shauli and Kriegeskorte, Nikolaus and Hase, Peter , month = feb, year =. The

work page
[45]

Shafran, Or and Ronen, Shaked and Fahn, Omri and Ravfogel, Shauli and Geiger, Atticus and Geva, Mor , month = feb, year =. From

work page
[46]

Bar-Shalom, Guy and Frasca, Fabrizio and Galron, Yaniv and Ziser, Yftah and Maron, Haggai , month = sep, year =. Beyond

work page
[47]

The persona selection model , url =

work page
[48]

Simulators , url =

Janus , month = sep, year =. Simulators , url =

work page
[49]

2018 , eprint=

Understanding intermediate layers using linear classifier probes , author=. 2018 , eprint=

work page 2018
[50]

Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks. 2026.
[51]

Designing and Interpreting Probes with Control Tasks. 2019.
[52]

Emotion concepts and their function in a large language model.
[53]

Dong, Yurui and Jin, Luozhijie and Yang, Yao and Lu, Bingjie and Yang, Jiaxi and Liu, Zhi. Controllable [...]. February.
[54]

Shi, Jerick. Behavioral and [...]. February.
[55]

Language Models as Agent Models. Findings of the Association for Computational Linguistics: EMNLP 2022. 2022.
[56]

Personas as a Way to Model Truthfulness in Language Models. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024.
[57]

Marks, Samuel and Lindsey, Jack and Olah, Christopher. The Persona Selection Model: Why AI Assistants might Behave like Humans. February.
[58]

The Linear Representation Hypothesis and the Geometry of Large Language Models. Proceedings of the 41st International Conference on Machine Learning. 2024.
[59]

The Internal State of an LLM Knows When It's Lying. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023.
[60]

Discovering Latent Knowledge in Language Models Without Supervision. The Eleventh International Conference on Learning Representations.
[61]

Goldowsky-Dill, Nicholas and Chughtai, Bilal and Heimersheim, Stefan and Hobbhahn, Marius. Detecting Strategic Deception Using Linear Probes. arXiv preprint arXiv:2502.03407, February 2025. doi:10.48550/arXiv.2502.03407
[62]

Natarajan, Vikram and Jain, Devina and Arora, Shivam and Golechha, Satvik and Bloom, Joseph. Building Better Deception Probes Using Targeted Instruction Pairs. February. doi:10.48550/arXiv.2602.01425
[63]

Towards Understanding Sycophancy in Language Models. The Twelfth International Conference on Learning Representations.
[64]

Scheurer, J. Large Language Models can Strategically Deceive their Users when Put Under Pressure. 2024. doi:10.48550/arXiv.2311.07590
[65]

Betley, Jan and Tan, Daniel and Warncke, Niels and Sztyber-Betley, Anna and Bao, Xuchan and Soto, Mart. Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned [...]. 2025.
[66]

Pacchiardi, Lorenzo and Chan, Alex J. and Mindermann, S. How to Catch an [...]. 2023.
[67]

roleplaying. 2024.
[68]

Liars' Bench: Evaluating Lie Detectors for Language Models. 2025.
[69]

sycophancy_dataset. 2024.
[70]

Open-ended_sycophancy. 2025.
[71]

Steering Llama 2 via Contrastive Activation Addition. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.828
[72]

Cheng, Myra and Yu, Sunny and Lee, Cinoo and Khadpe, Pranav and Ibrahim, Lujain and Jurafsky, Dan. ELEPHANT: Measuring and understanding social sycophancy in LLMs. arXiv:2505.13995
[73]

The Impact of Off-Policy Training Data on Probe Generalisation. 2026.
[74]

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets. 2023.

[1] [1]

Schoen, Bronson and Nitishinskaya, Evgenia and Balesni, Mikita and Højmark, Axel and Hofstätter, Felix and Scheurer, Jérémy and Meinke, Alexander and Wolfe, Jason and Weij, Teun van der and Lloyd, Alex and Goldowsky-Dill, Nicholas and Fan, Angela and Matveiakin, Andrei and Shah, Rusheb and Williams, Marcus and Glaese, Amelia and Barak, Boaz and Zaremba, W... Stress Testing Deliberative Alignment for Anti-Scheming Training. 2025. doi:10.48550/arxiv.2509.15541

[2] [2]

Jiang, Albert Q. and Sablayrolles, Alexandre and Roux, Antoine and Mensch, Arthur and Savary, Blanche and Bamford, Chris and Chaplot, Devendra Singh and Casas, Diego de las and Hanna, Emma Bou and Bressand, Florian and Lengyel, Gianna and Bour, Guillaume and Lample, Guillaume and Lavaud, Lélio Renard and Saulnier, Lucile and Lachaux, Marie-Anne and Stock,... 2024. doi:10.48550/arxiv.2401.04088

[3] [3]

Jiang, Albert Q. and Sablayrolles, Alexandre and Mensch, Arthur and Bamford, Chris and Chaplot, Devendra Singh and Casas, Diego de las and Bressand, Florian and Lengyel, Gianna and Lample, Guillaume and Saulnier, Lucile and Lavaud, Lélio Renard and Lachaux, Marie-Anne and Stock, Pierre and Scao, Teven Le and Lavril, Thibaut and Wang, Thomas and Lacroix, T... 2023. doi:10.48550/arxiv.2310.06825

[4] [4]

DeepSeek-V3 Technical Report. 2025. arXiv:2412.19437 [cs]. doi:10.48550/arXiv.2412.19437

[5] [5]

Qwen and Yang, An and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Li, Chengyuan and Liu, Dayiheng and Huang, Fei and Wei, Haoran and Lin, Huan and Yang, Jian and Tu, Jianhong and Zhang, Jianwei and Yang, Jianxin and Yang, Jiaxi and Zhou, Jingren and Lin, Junyang and Dang, Kai and Lu, Keming and Bao, Keqin and Yang, Ke... Qwen2.5 Technical Report. 2025. doi:10.48550/arxiv.2412.15115

[6] [6]

Llama 3.2: Revolutionizing Edge AI and Vision (Connect 2024). 2024.

[7] [7]

Ministral-8B-Instruct-2410 Model Card. October 2024.

[8] [8]

GPT-5 System Card. August 13, 2025.

[9] [9]

Fan, Angela and Lewis, Mike and Dauphin, Yann. Hierarchical Neural Story Generation. May. doi:10.48550/arXiv.1805.04833

[10] [10]

Jiang, Liwei and Rao, Kavel and Han, Seungju and Ettinger, Allyson and Brahman, Faeze and Kumar, Sachin and Mireshghallah, Niloofar and Lu, Ximing and Sap, Maarten and Choi, Yejin and Dziri, Nouha. Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models. June. doi:10.48550/arXiv.2406.18510

[11] [11]

Li, Nathaniel and Pan, Alexander and Gopal, Anjali and Yue, Summer and Berrios, Daniel and Gatti, Alice and Li, Justin D. and Dombrowski, Ann-Kathrin and Goel, Shashwat and Phan, Long and Mukobi, Gabriel and Helm-Burger, Nathan and Lababidi, Rassin and Justen, Lennart and Liu, Andrew B. and Chen, Michael and Barrass, Isabelle and Zhang, Oliver and Zhu, Xi... The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning. doi:10.48550/arxiv.2403.03218

[12] [12]

Benton, Joe and Wagner, Misha and Christiansen, Eric and Anil, Cem and Perez, Ethan and Srivastav, Jai and Durmus, Esin and Ganguli, Deep and Kravec, Shauna and Shlegeris, Buck and Kaplan, Jared and Karnofsky, Holden and Hubinger, Evan and Grosse, Roger and Bowman, Samuel R. and Duvenaud, David. Sabotage Evaluations for Frontier Models. doi:10.48550/arxiv.2410.21514

[13] [13]

Habernal, Ivan and Gurevych, Iryna. What makes a convincing argument? Proceedings of the 2016 [...]. 2016.

[14] [15]

Kirch, Nathalie and Weisser, Constantin and Field, Severin and Yannakoudakis, Helen and Casper, Stephen. What [...]. May. doi:10.48550/arXiv.2411.03343

[15] [16]

Wei, Alexander and Haghtalab, Nika and Steinhardt, Jacob. Jailbroken: How Does LLM Safety Training Fail? July. doi:10.48550/arXiv.2307.02483

[16] [17]

Ding, Ning and Chen, Yulin and Xu, Bokai and Qin, Yujia and Zheng, Zhi and Hu, Shengding and Liu, Zhiyuan and Sun, Maosong and Zhou, Bowen. Enhancing Chat Language Models by Scaling High-quality Instructional Conversations. May. doi:10.48550/arXiv.2305.14233

[17] [18]

Bai, Yuntao and Jones, Andy and Ndousse, Kamal and Askell, Amanda and Chen, Anna and DasSarma, Nova and Drain, Dawn and Fort, Stanislav and Ganguli, Deep and Henighan, Tom and Joseph, Nicholas and Kadavath, Saurav and Kernion, Jackson and Conerly, Tom and El-Showk, Sheer and Elhage, Nelson and Hatfield-Dodds, Zac and Hernandez, Danny and Hume, Tristan and... Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. doi:10.48550/arxiv.2204.05862

[18] [19]

Sharma, Mrinank and Tong, Meg and Mu, Jesse and Wei, Jerry and Kruthoff, Jorrit and Goodfriend, Scott and Ong, Euan and Peng, Alwin and Agarwal, Raj and Anil, Cem and Askell, Amanda and Bailey, Nathan and Benton, Joe and Bluemke, Emma and Bowman, Samuel R. and Christiansen, Eric and Cunningham, Hoagy and Dau, Andy and Gopal, Anjali and Gilson, Rob and Gra... doi:10.48550/arxiv.2501.18837

[19] [20]

Bloom, Joseph and Taylor, Jordan and Kissane, Connor and Black, Sid and merizian and alexdzm and jacoba and Millwood, Ben and Cooney, Alan. White [...]. July.

[20] [21]

Sharkey, Lee and Chughtai, Bilal and Batson, Joshua and Lindsey, Jack and Wu, Jeff and Bushnaq, Lucius and Goldowsky-Dill, Nicholas and Heimersheim, Stefan and Ortega, Alejandro and Bloom, Joseph and Biderman, Stella and Garriga-Alonso, Adria and Conmy, Arthur and Nanda, Neel and Rumbelow, Jessica and Wattenberg, Martin and Schoots, Nandi and Miller, Jose... doi:10.48550/arxiv.2501.16496

[21] [22]

Wen, Jiaxin and Zhong, Ruiqi and Khan, Akbir and Perez, Ethan and Steinhardt, Jacob and Huang, Minlie and Bowman, Samuel R. and He, He and Feng, Shi. Language models learn to mislead humans via RLHF. December. doi:10.48550/arXiv.2409.12822

[22] [23]

Chalmers, David J. Propositional interpretability in artificial intelligence. January. doi:10.48550/arXiv.2501.15740

[23] [24]

Jiang, Yilei and Gao, Xinyan and Peng, Tianshuo and Tan, Yingshui and Zhu, Xiaoyong and Zheng, Bo and Yue, Xiangyu. June. doi:10.48550/arXiv.2502.14744

[24] [25]

Parrack, Avi and Attubato, Carlo Leonardo and Heimersheim, Stefan. Benchmarking [...]. August. doi:10.48550/arXiv.2507.12691

[25] [26]

McKenzie, Alex and Pawar, Urja and Blandfort, Phil and Bankes, William and Krueger, David and Lubana, Ekdeep Singh and Krasheninnikov, Dmitrii. Detecting [...]. June. doi:10.48550/arXiv.2506.10805

[26] [27]

Tillman, Henk and Mossing, Dan. Investigating task-specific prompts and sparse autoencoders for activation monitoring. April. doi:10.48550/arXiv.2504.20271

[27] [28]

Chan, Yik Siu and Yong, Zheng-Xin and Bach, Stephen H. Can [...]. July. doi:10.48550/arXiv.2507.12428

[28] [29]

Nguyen, Jord and Hoang, Khiem and Attubato, Carlo Leonardo and Hofstätter, Felix. Probing and [...]. July. doi:10.48550/arXiv.2507.01786

[29] [30]

Feng, Jiahai and Russell, Stuart and Steinhardt, Jacob. Monitoring [...]. December. doi:10.48550/arXiv.2406.19501

[30] [32]

MacDiarmid, Monte and Maxwell, Timothy and Schiefer, Nicholas and Mu, Jesse and Kaplan, Jared and Duvenaud, David and Bowman, Sam and Tamkin, Alex and Perez, Ethan and Sharma, Mrinank and Denison, Carson and Hubinger, Evan. Simple probes can catch sleeper agents. April.

[31] [33]

Habernal, Ivan and Gurevych, Iryna. What makes a convincing argument? Proceedings of the 2016 [...]. 2016. doi:10.18653/v1/D16-1129

[32] [34]

Hendrycks, Dan and Burns, Collin and Basart, Steven and Zou, Andy and Mazeika, Mantas and Song, Dawn and Steinhardt, Jacob. Measuring Massive Multitask Language Understanding. January. doi:10.48550/arXiv.2009.03300

[33] [35]

Zou, Andy and Phan, Long and Chen, Sarah and Campbell, James and Guo, Phillip and Ren, Richard and Pan, Alexander and Yin, Xuwang and Mazeika, Mantas and Dombrowski, Ann-Kathrin and Goel, Shashwat and Li, Nathaniel and Byun, Michael J. and Wang, Zifan and Mallen, Alex and Basart, Steven and Koyejo, Sanmi and Song, Dawn and Fredrikson, Matt and Kolter, J. ... Representation Engineering: A Top-Down Approach to AI Transparency. doi:10.48550/arxiv.2310.01405

[34] [36]

Zhang, Jason and Viteri, Scott. Uncovering [...]. March. arXiv:2409.14026 [cs]. doi:10.48550/arXiv.2409.14026

[35] [37]

Abdelnabi, Sahar and Salem, Ahmed. Linear [...]. May. doi:10.48550/arXiv.2505.14617

[36] [38]

Gu, Tianle and Huang, Kexin and Wang, Zongqi and Wang, Yixu and Li, Jie and Yao, Yuanqi and Yao, Yang and Yang, Yujiu and Teng, Yan and Wang, Yingchun. Probing the [...]. June. doi:10.48550/arXiv.2506.16078

[37] [39]

Sharma, Mrinank and Tong, Meg and Korbak, Tomasz and Duvenaud, David and Askell, Amanda and Bowman, Samuel R. and Cheng, Newton and Durmus, Esin and Hatfield-Dodds, Zac and Johnston, Scott R. and Kravec, Shauna and Maxwell, Timothy and McCandlish, Sam and Ndousse, Kamal and Rausch, Oliver and Schiefer, Nicholas and Yan, Da and Zhang, Miranda and Perez, Et... Towards Understanding Sycophancy in Language Models. doi:10.48550/arxiv.2310.13548
