Recognition: no theorem link
Deep Minds and Shallow Probes
Pith reviewed 2026-05-13 02:15 UTC · model grok-4.3
The pith
Affine symmetries from equivalent realizations select a unique hierarchy of shallow probes, with linear probes as the base case.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Equivalent realizations induce affine changes of hidden coordinates. Requiring a probe family to be stable under this group action singles out a unique hierarchy of shallow coordinate-stable probes, with linear probes as its degree-1 member. A natural object for cross-model probe transfer is then the shared probe-visible quotient—the representation modulo directions invisible to the probe family—rather than the full hidden state.
What carries the argument
The group action of affine reparameterizations on hidden coordinates at the readout layer, which enforces coordinate-stability and selects the probe hierarchy.
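The group action can be checked in a minimal numpy sketch (the toy data, dimensions, and variable names here are my own illustration, not the paper's setup): an invertible affine reparameterization h ↦ Ah + b of hidden coordinates can be absorbed into the linear probe itself, which is what makes the degree-1 family coordinate-stable.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
h = rng.normal(size=(100, d))            # toy hidden states, one per row
w, c = rng.normal(size=d), 0.3           # a linear (degree-1) probe in these coordinates

# An equivalent realization induces an invertible affine change h' = A h + b.
A = rng.normal(size=(d, d)) + 4 * np.eye(d)   # diagonally shifted so A is well-conditioned
b = rng.normal(size=d)
h2 = h @ A.T + b

# Transporting the probe along the reparameterization leaves its readout unchanged:
# w' = A^{-T} w and c' = c - w . A^{-1} b give w'.h' + c' = w.h + c for every sample.
A_inv = np.linalg.inv(A)
w2, c2 = A_inv.T @ w, c - w @ (A_inv @ b)

assert np.allclose(h @ w + c, h2 @ w2 + c2)   # degree-1 probes are coordinate-stable
```

The same substitution argument is what fails for probe families that are not closed under affine maps, which is where the stability requirement gets its selective power.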
If this is right
- Linear probes form the lowest level of a larger family of stable shallow probes.
- Degree-2 members of the hierarchy capture additional structure beyond what linear probes detect.
- Probe transfer should operate on the quotient modulo invisible directions to achieve coverage-aware portability.
- The same stability requirement yields monitors that transfer across different model families.
Where Pith is reading between the lines
- The symmetry analysis could be extended to intermediate layers if analogous group actions can be identified there.
- Quotient-based transfer may improve robustness when applying monitors trained on one architecture to another.
- The framework suggests that many existing probing techniques can be re-derived as special cases of symmetry-stable families.
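Quotient-based transfer can be sketched under assumed conditions (the synthetic generative model and the least-squares probe family below are mine, not the paper's): project each model's hidden states through its own probe family, then align the low-dimensional quotients with a small linear map instead of aligning full hidden states.

```python
import numpy as np

rng = np.random.default_rng(2)
k, d, n = 3, 32, 400
z = rng.normal(size=(n, k))                      # concept coordinates shared by both models
M1, M2 = rng.normal(size=(k, d)), rng.normal(size=(k, d))
h1 = z @ M1 + 0.5 * rng.normal(size=(n, d))      # model-1 hidden states
h2 = z @ M2 + 0.5 * rng.normal(size=(n, d))      # model-2 hidden states

# A k-member linear probe family per model, fit by least squares.
W1, *_ = np.linalg.lstsq(h1, z, rcond=None)      # (d, k) probe weights
W2, *_ = np.linalg.lstsq(h2, z, rcond=None)

# Probe-visible quotient coordinates: hidden states seen through the probes.
q1, q2 = h1 @ W1, h2 @ W2

# Cross-model transfer then needs only a small (k, k) alignment map on the
# quotients, not a (d, d) map on the full hidden states.
T, *_ = np.linalg.lstsq(q1, q2, rcond=None)
err = np.linalg.norm(q1 @ T - q2) / np.linalg.norm(q2)
assert err < 0.3                                 # quotients carry the shared concept content
```

Directions invisible to the probes never enter q1 or q2, so the alignment problem shrinks from the full hidden dimension to the concept dimension, which is the coverage-aware portability the review describes.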
Load-bearing premise
That affine coordinate changes from equivalent realizations are the only relevant symmetries and that probes intended to reveal existing structure must be invariant to them.
What would settle it
An experiment in which a probe family extracts reliable structure yet fails to be stable under affine reparameterizations, or in which full hidden-state transfer outperforms quotient-based transfer across models.
Original abstract
Neural representations are not unique objects. Even when two systems realize the same downstream computation, their hidden coordinates may differ by reparameterization. A probe family intended to reveal structure already present in a representation should therefore be stable under the relevant representation symmetries rather than be tied to a particular basis. We study this group action in the tractable exact setting of the final readout layer, where equivalent realizations induce affine changes of hidden coordinates. The resulting symmetry principle singles out a unique hierarchy of shallow coordinate-stable probes, with linear probes as its degree-1 member. We also show that a natural object for cross-model probe transfer is a shared probe-visible quotient, the representation modulo directions invisible to the probe family, rather than the full hidden state. Experiments on synthetic and real-world tasks support both predictions, showing where degree-2 probes help beyond linear ones and how quotient-based transfer enables coverage-aware monitor portability across model families. These results point toward a broader geometric representation theory of neural probing, with coverage-aware monitor transfer as a concrete operational consequence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that neural probes should be invariant under affine reparameterizations of hidden states that arise from equivalent realizations of the final readout layer. It claims this symmetry principle uniquely determines a hierarchy of shallow coordinate-stable probes (linear probes as the degree-1 member) and that a probe-visible quotient (representation modulo directions invisible to the probe family) is the appropriate object for cross-model probe transfer. Experiments on synthetic and real-world tasks are said to illustrate when degree-2 probes add value and how quotient-based transfer improves monitor portability.
Significance. If the uniqueness derivation holds without hidden restrictions on probe functional form, the work supplies a geometric rationale for the prevalence of linear probes and a concrete mechanism for coverage-aware transfer across model families. This could shift probing from empirical heuristics toward symmetry-based design, with the quotient construction offering a practical advance for interpretability and monitoring. The experiments provide initial support for both the hierarchy and the transfer claim.
major comments (2)
- [Abstract / §3 (Symmetry Principle)] Abstract and theoretical core: the claim that the symmetry principle 'singles out a unique hierarchy' requires an explicit statement of the probe function class (e.g., polynomials of bounded degree). Without a proof that no other families (non-polynomial or unbounded) satisfy the stability condition under the affine group action, uniqueness does not follow from the group action alone; the skeptic concern on functional-form restriction is load-bearing for the central claim.
- [§4 (Quotient and Transfer)] Probe-visible quotient construction: because the quotient is defined relative to the chosen probe family, the transfer claim inherits the same dependence on the hierarchy derivation. If the hierarchy is not uniquely fixed by symmetry, the quotient is likewise not canonical; this affects the cross-model portability result.
minor comments (2)
- [§3] Notation for the group action and stability condition should be introduced with a single running example (e.g., a two-layer readout) before the general case to improve readability.
- [§5] Experimental section should report the precise synthetic data-generating process and any controls for probe capacity or regularization that could confound the degree-1 vs. degree-2 comparison.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The comments highlight an important point about the scope of the uniqueness claim, which we address by clarifying the probe function class in the revision. We respond point by point below.
Point-by-point responses
-
Referee: [Abstract / §3 (Symmetry Principle)] Abstract and theoretical core: the claim that the symmetry principle 'singles out a unique hierarchy' requires an explicit statement of the probe function class (e.g., polynomials of bounded degree). Without a proof that no other families (non-polynomial or unbounded) satisfy the stability condition under the affine group action, uniqueness does not follow from the group action alone; the skeptic concern on functional-form restriction is load-bearing for the central claim.
Authors: We agree that an explicit statement of the function class is needed for the uniqueness claim to be precise. In the manuscript, shallow probes are implicitly taken to be polynomial functions of bounded degree, since this is the natural class that is closed under affine reparameterizations and admits a grading by total degree (with linear probes as the degree-1 member). We will revise the abstract and §3 to state explicitly that the symmetry principle is applied to the vector space of polynomial probes of degree at most d, and to briefly justify why this class is appropriate: affine transformations preserve polynomial degree, yielding a finite-dimensional representation in which the hierarchy of invariant subspaces is uniquely determined by the representation theory of the affine group. Within this class the hierarchy is canonical; we do not claim uniqueness over all possible function families, as non-polynomial probes fall outside the shallow-probe setting studied here. revision: yes
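The closure claim in this response can be checked numerically; the sketch below is my own toy setup, not an artifact of the paper. Composing a fixed degree-2 polynomial with an invertible affine change of coordinates yields a function still exactly representable with degree-2 monomials of the original coordinates, but not with degree-1 features alone.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=(200, 2))

def monomials(x, degree):
    # all monomials of two variables up to the given total degree
    cols = [np.ones(len(x))]
    for i in range(1, degree + 1):
        for j in range(i + 1):
            cols.append(x[:, 0] ** (i - j) * x[:, 1] ** j)
    return np.stack(cols, axis=1)

def fit_residual(Phi, vals):
    coef, *_ = np.linalg.lstsq(Phi, vals, rcond=None)
    return np.linalg.norm(Phi @ coef - vals)

p = lambda u: 2 * u[:, 0] ** 2 - u[:, 0] * u[:, 1] + 3 * u[:, 1] + 1  # a degree-2 probe

A = rng.normal(size=(2, 2)) + 2 * np.eye(2)      # invertible affine reparameterization
b = rng.normal(size=2)
vals = p(x @ A.T + b)                            # the probe read in new coordinates

res2 = fit_residual(monomials(x, 2), vals)       # ~0: still exactly degree <= 2
res1 = fit_residual(monomials(x, 1), vals)       # large: the degree does not drop
assert res2 < 1e-6 < res1
```

This is the finite-dimensionality the authors invoke: each bounded-degree space is carried to itself by the affine action, so the hierarchy can be analyzed degree by degree.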
-
Referee: [§4 (Quotient and Transfer)] Probe-visible quotient construction: because the quotient is defined relative to the chosen probe family, the transfer claim inherits the same dependence on the hierarchy derivation. If the hierarchy is not uniquely fixed by symmetry, the quotient is likewise not canonical; this affects the cross-model portability result.
Authors: We concur that the quotient construction is relative to the probe family. With the clarification in §3 that the family is the symmetry-selected hierarchy of polynomial probes of bounded degree, the quotient becomes the canonical object for that family. We will revise §4 to make this dependence explicit, stating that cross-model transfer is performed with respect to the same polynomial probe class on both models, and that the resulting quotient captures precisely the directions visible to the chosen probes. The experimental results on synthetic and real-world portability continue to demonstrate the practical benefit of this coverage-aware transfer within the stated class. revision: yes
Circularity Check
The symmetry-principle derivation is self-contained; by construction it does not define the hierarchy in terms of itself.
Full rationale
The paper starts from the group action of affine reparameterizations induced by equivalent readout realizations and derives a stability condition for probe families. This is used to identify a hierarchy whose degree-1 case is the linear probe and to motivate the probe-visible quotient. No equation or claim in the abstract or described chain defines the hierarchy in terms of itself, renames a fitted quantity as a prediction, or relies on a self-citation whose content is unverified. The uniqueness statement is presented as following from the symmetry principle applied to shallow probes; experiments are described as supporting rather than constituting the derivation. The central claims therefore remain independent of the paper's own fitted values or prior self-references.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Neural representations are not unique objects; equivalent downstream computations may differ by reparameterization of hidden coordinates.
- domain assumption A probe family intended to reveal structure already present should be stable under the relevant representation symmetries.