pith. machine review for the scientific record.

arxiv: 2604.17663 · v1 · submitted 2026-04-19 · 💻 cs.LG · cs.AI · cs.CL

Recognition: unknown

ATLAS: Constitution-Conditioned Latent Geometry and Redistribution Across Language Models and Neural Perturbation Data

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:51 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords latent geometry · constitution-conditioned training · language models · neural perturbation data · geometric recurrence · redistribution · hidden states · source-defined family

The pith

Written constitutions induce recoverable latent geometry that recurs across language models and neural perturbation data even as local details shift.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats constitution-conditioned post-training as a structured perturbation of a model's learned representational geometry. It introduces ATLAS, which traces in Gemma a source-local chart and the broader source-defined family of hidden states that captures constitution-related behaviors. This family re-identifies in an unadapted Phi model with strong separation metrics and receives support across folds in mouse frontal-cortex perturbation data. The result is geometric recurrence under redistribution: the geometry's organisation stays detectable across model and substrate changes while its coordinates, occupancy, and behavioural expression shift. A sympathetic reader would care because it offers a way to track how abstract written rules shape internal representations without requiring fixed locations or behaviors.

Core claim

ATLAS tests local charts in hidden-state space whose tangent structure, occupancy distribution, and behavioural coupling are measured under system change. On Gemma the anchored source-local chart captures 310 of 320 reviewed source rows and all 84 score-flip rows, so the exportable unit is the broader source-defined family. Freezing that family yields re-identification in Phi with AUC 0.984 and mean gap 5.50, plus support in ALM8 mouse data across 5/5 folds with mean held-out AUC 0.72 and mean gap 4.50. The correspondence is geometric recurrence under redistribution rather than coordinate identity, site identity, or target-side mediation.
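The separation numbers above are standard rank statistics. As a grounding, here is a minimal sketch of how an AUC and mean gap for a two-sided contrast could be computed, assuming each reviewed row has already been reduced to a scalar family-alignment score; the scoring rule itself is not given in this review.

```python
import numpy as np

def auc_and_gap(pos_scores, neg_scores):
    """Mann-Whitney AUC (ties count half) and mean score gap
    between the two sides of a contrast."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    diff = pos[:, None] - neg[None, :]          # all pairwise comparisons
    auc = float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))
    gap = float(pos.mean() - neg.mean())
    return auc, gap

# Toy contrast: perfectly separated scores give AUC exactly 1.0.
auc, gap = auc_and_gap([5.0, 6.0, 7.0], [0.0, 1.0, 2.0])
```

On this reading, an AUC of 0.984 with mean gap 5.50, as reported for the Phi contrast, would mean near-perfect pairwise ordering plus a large absolute score offset.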

What carries the argument

The source-defined family of hidden states, which serves as the exportable unit for re-identification across models and substrates to demonstrate geometric recurrence under redistribution.

If this is right

  • The exportable unit is the broader source-defined family because compact exact-patch sufficiency does not close.
  • Nearby target-local signals can appear without source-faithful closure, providing the main boundary condition.
  • Support holds across all 5 folds in held-out mouse data with consistent mean gaps.
  • The detectable organisation remains while local coordinates, occupancy distributions, and behavioural couplings redistribute.
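The 5/5-fold support pattern above can be sketched in code. This is an illustrative reconstruction, not the paper's procedure: it assumes a scalar scoring rule that is frozen before the held-out folds are drawn, which is exactly the anti-circularity property the claim depends on, and a simple above-chance criterion for per-fold support.

```python
import numpy as np

def fold_support(scores, labels, n_folds=5, seed=0):
    """Evaluate a frozen scoring rule on disjoint held-out folds:
    per-fold AUC, plus a count of folds with AUC above chance."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(scores))
    fold_aucs = []
    for fold in np.array_split(idx, n_folds):
        s, y = scores[fold], labels[fold]
        pos, neg = s[y == 1], s[y == 0]
        diff = pos[:, None] - neg[None, :]
        fold_aucs.append(float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0)))
    return fold_aucs, sum(a > 0.5 for a in fold_aucs)

# Toy frozen scores: the rule is fixed before any fold is inspected.
rng = np.random.default_rng(1)
scores = np.concatenate([rng.normal(1.0, 0.5, 100), rng.normal(-1.0, 0.5, 100)])
labels = np.concatenate([np.ones(100, int), np.zeros(100, int)])
fold_aucs, supported = fold_support(scores, labels)
```

The design point is that nothing inside the fold loop refits the rule; any per-fold optimization would void the "frozen family" reading of the 5/5 result.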

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If recurrence holds, checking for the source-defined family could allow transferring or predicting constitution effects between models without full retraining.
  • The method might enable direct comparison of how high-level rules alter representations in artificial and biological neural systems.
  • Testing the family in additional model architectures or perturbation datasets would clarify whether the recurrence is general or specific to the chosen source and targets.

Load-bearing premise

The source-local chart and source-defined family identified in Gemma can be re-identified in an unadapted Phi model and mouse perturbation data as evidence of geometric recurrence rather than coincidence or post-hoc selection.

What would settle it

Observing that the source-defined family fails to separate relevant contrasts with high AUC in additional unadapted models or shows no consistent support beyond chance in new neural perturbation datasets.
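One concrete way to operationalise "no consistent support beyond chance" is a label-permutation null on the same AUC statistic. The paper does not specify its null procedure; this is a hedged sketch of such a test.

```python
import numpy as np

def auc(pos, neg):
    diff = np.asarray(pos)[:, None] - np.asarray(neg)[None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

def permutation_p(scores, labels, n_perm=500, seed=0):
    """Probability that a label-shuffled AUC matches or beats the
    observed AUC; a large value on a new system would count
    against geometric recurrence."""
    rng = np.random.default_rng(seed)
    observed = auc(scores[labels == 1], scores[labels == 0])
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(labels)
        if auc(scores[perm == 1], scores[perm == 0]) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Toy separated contrast: observed AUC far outside the null.
rng = np.random.default_rng(2)
scores = np.concatenate([rng.normal(2.0, 0.5, 30), rng.normal(0.0, 0.5, 30)])
labels = np.concatenate([np.ones(30, int), np.zeros(30, int)])
observed, p = permutation_p(scores, labels)
```

On a genuinely unseparated contrast the observed AUC would sit inside the permutation distribution and the p-value would be large, which is the failure mode described above.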

Figures

Figures reproduced from arXiv: 2604.17663 by Gareth Seneque, Jeffrey Molendijk, Lap-Hang Ho, Nafise Erfanian Saeedi, Tim Elson.

Figure 1. Gemma Source-Local Chart, Source-Defined Family, Phi Target-Local Realisation, ALM8 Held-Out Bridge Realisation, And Claim Tier
Figure 2. Shared Experiment, Structural Validation, And Discovery
Figure 3. Gemma Source-Local Chart, Source-Defined Family, And Exact-Patch Boundary
Figure 4. Phi Search Band, Frozen Target-Local Lane, And Confirmatory Re-Identification
Figure 5. ALM8 Held-Out Corroboration And Redistribution
Figure 6. MCQ Boundary: Local Signal, One-Sided Re-Entry, And Displacement
Figure 7. Frozen Local Target, Replayable Denominator, And Bounded Prompt Manipulation
Original abstract

Constitution-conditioned post-training can be analysed as a structured perturbation of a model's learned representational geometry. We introduce ATLAS, a geometry-first program that traces constitution-induced hidden-state structure across charts, models, and substrates. Instead of treating the relevant unit as a single behaviour, neuron, vector, or patch, ATLAS tests a local chart whose tangent structure, occupancy distribution, and behavioural coupling can be measured under system change. On Gemma, the anchored source-local chart captures 310 / 320 reviewed source rows and all 84 / 84 reviewed score-flip rows, but compact exact-patch sufficiency does not close, so the exportable unit is the broader source-defined family. Freezing that family, we re-identify a target-local realisation in an unadapted Phi model, where the fully adjudicated confirmatory contrast separates with AUC 0.984 and mean gap 5.50. In held-out ALM8 mouse frontal-cortex perturbation data, the same source-defined family receives support across 5/5 folds, with mean held-out AUC 0.72 and mean fold gap 4.50. A multiple-choice analysis provides the main boundary: nearby target-local signals can appear without source-faithful closure. The resulting correspondence is not coordinate identity, site identity, or a target-side mediation theorem. It is geometric recurrence under redistribution: written constitutions can induce recoverable latent geometry whose organisation remains detectable across model and substrate changes while its local coordinates, occupancy, and behavioural expression shift.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces ATLAS, a geometry-first framework for analyzing how written constitutions induce structured perturbations in the latent geometry of language models. Using Gemma as the source, it identifies a local chart and broader source-defined family that captures 310/320 source rows and all 84/84 score-flip rows. Freezing this family, the authors report re-identification in an unadapted Phi model (AUC 0.984, mean gap 5.50) and support in held-out ALM8 mouse frontal-cortex perturbation data (mean AUC 0.72 across 5/5 folds). The central claim is geometric recurrence under redistribution: the organisation of constitution-induced latent structure remains detectable across model architectures and neural substrates, even as local coordinates, occupancy, and behavioural expression shift. A multiple-choice analysis is presented as the main boundary condition against nearby but non-faithful signals.

Significance. If the reported cross-domain re-identification holds under pre-specified procedures, the result would be a substantive contribution to mechanistic interpretability and alignment research. It would provide concrete evidence that constitutional post-training can induce recoverable geometric signatures that transfer beyond a single model family and even into biological perturbation data, moving beyond neuron- or vector-level analyses to chart- and family-level invariants. This could open new avenues for testing alignment robustness and for linking artificial and neural representational geometry.

major comments (2)
  1. [Abstract] The re-identification procedure for the source-defined family in the unadapted Phi model and ALM8 mouse data is not specified (e.g., fixed thresholds, embedding similarity, or data-dependent optimization). Without pre-specification of chart selection criteria, tangent-structure measurement, occupancy metrics, or the exact matching rule, the reported AUC 0.984 and 5/5-fold support cannot be distinguished from post-hoc selection of a family that aligns with target signals, as the manuscript itself flags with the multiple-choice boundary condition.
  2. [Abstract] No methods, derivations, data details, exclusion criteria, or error bars are provided for the AUC values, mean gaps, or fold-wise results. The central claim that the source-local chart and family constitute an exportable unit rests on these quantities; their absence makes it impossible to evaluate robustness or rule out circularity in family definition.
minor comments (1)
  1. [Abstract] The phrasing 'compact exact-patch sufficiency does not close' is unclear without accompanying definitions or equations for patch sufficiency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments correctly identify that the abstract, as a high-level summary, omits key procedural details needed to assess pre-specification and robustness. We will revise the manuscript to address this by expanding the abstract and ensuring the main text provides explicit descriptions of the methods.

Point-by-point responses
  1. Referee: [Abstract] The re-identification procedure for the source-defined family in the unadapted Phi model and ALM8 mouse data is not specified (e.g., fixed thresholds, embedding similarity, or data-dependent optimization). Without pre-specification of chart selection criteria, tangent-structure measurement, occupancy metrics, or the exact matching rule, the reported AUC 0.984 and 5/5-fold support cannot be distinguished from post-hoc selection of a family that aligns with target signals, as the manuscript itself flags with the multiple-choice boundary condition.

    Authors: We agree that the abstract does not explicitly state the re-identification procedure. The source-defined family is constructed exclusively from the Gemma source data using the local chart's tangent structure and occupancy distribution; this family is then frozen and applied to the target domains without further optimization. Re-identification relies on a pre-specified matching rule based on embedding similarity to the source family members. The multiple-choice analysis is included precisely to demonstrate that nearby but non-source-faithful signals do not produce the same separation. To eliminate any ambiguity about post-hoc selection, we will revise the abstract to state these pre-specification steps explicitly and reference the source-only definition of the family. revision: yes

  2. Referee: [Abstract] No methods, derivations, data details, exclusion criteria, or error bars are provided for the AUC values, mean gaps, or fold-wise results. The central claim that the source-local chart and family constitute an exportable unit rests on these quantities; their absence makes it impossible to evaluate robustness or rule out circularity in family definition.

    Authors: We agree that the abstract lacks these supporting details. The reported AUCs, mean gaps, and 5/5-fold results are computed from the frozen source-defined family applied to held-out target data, with the family definition fixed prior to any target evaluation to avoid circularity. In revision we will expand the abstract with a concise methods summary that includes the computation of AUC and gaps, the cross-validation procedure for the folds, and any exclusion criteria applied to the reviewed rows. The full manuscript will also supply the complete derivations, data descriptions, and error bars so that readers can directly assess robustness. revision: yes
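The rebuttal's "pre-specified matching rule based on embedding similarity" could, in its simplest reading, look like the sketch below. The cosine-to-nearest-member rule, the normalisation, and the function names are assumptions for illustration, not the authors' actual procedure.

```python
import numpy as np

def freeze_family(source_states):
    """Frozen source-defined family: unit-normalised member
    directions, fixed before any target data is seen."""
    S = np.asarray(source_states, dtype=float)
    return S / np.linalg.norm(S, axis=1, keepdims=True)

def family_score(family, hidden_state):
    """Illustrative matching rule: best cosine similarity between
    a target hidden state and any frozen family member."""
    h = np.asarray(hidden_state, dtype=float)
    h = h / np.linalg.norm(h)
    return float(np.max(family @ h))

# Two toy family members; target states scored against the frozen family.
family = freeze_family([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
aligned = family_score(family, [2.0, 0.1, 0.0])     # near a member
orthogonal = family_score(family, [0.0, 0.0, 3.0])  # outside the family span
```

The design point is that `freeze_family` consumes only source-side states; target hidden states enter only at scoring time, so no target information can leak into the family definition.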

Circularity Check

0 steps flagged

No equations, derivations, or self-citations in abstract; claims rest on empirical re-identification without visible reduction to inputs.

full rationale

The provided abstract contains no equations, parameter-fitting steps, or citations. The central procedure—identifying a source-local chart and broader family on Gemma data then freezing and re-identifying it on Phi and mouse data—is described at a high level without any mathematical definition that would allow the re-identification to reduce tautologically to the original selection criteria. No load-bearing step is shown to be self-definitional, fitted-then-renamed-as-prediction, or dependent on a self-citation chain. The text therefore supplies no inspectable derivation chain that collapses by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not specify any free parameters, axioms, or invented entities. The method ATLAS and concepts such as the source-local chart and geometric recurrence under redistribution appear to be newly introduced, but without details on their foundations or their independence from prior literature.

pith-pipeline@v0.9.0 · 5562 in / 1149 out tokens · 74692 ms · 2026-05-10T05:51:39.151798+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

43 extracted references · 35 canonical work pages · 7 internal anchors

  1. [1]

    Abc align: Large language model alignment for safety & accuracy, 2024

    Gareth Seneque, Lap-Hang Ho, Ariel Kuperman, Nafise Erfanian Saeedi, and Jeffrey Molendijk. Abc align: Large language model alignment for safety & accuracy, 2024. URL https://arxiv.or g/abs/2408.00307

  2. [2]

    Enigma: The geometry of reasoning and alignment in large-language models,

    Gareth Seneque, Lap-Hang Ho, Nafise Erfanian Saeedi, Jeffrey Molendijk, Ariel Kupermann, and Tim Elson. Enigma: The geometry of reasoning and alignment in large-language models,

  3. [3]

    URL https://arxiv.org/abs/2510.11278

  4. [4]

    google/gemma-3-1b-it, 2025

    Google. google/gemma-3-1b-it, 2025. URL https://huggingface.co/google/gemma-3-1b-it. Hugging Face model card

  5. [5]

    microsoft/phi-4-mini-instruct, 2025

    Microsoft. microsoft/phi-4-mini-instruct, 2025. URL https://huggingface.co/microsoft/Phi-4- mini-instruct. Hugging Face model card

  6. [6]

    Dataset (matlab format) from yang et al (2022) thalamus-driven functional populations in frontal cortex support decision-making

    Nuo Li and Weiguo Yang. Dataset (matlab format) from yang et al (2022) thalamus-driven functional populations in frontal cortex support decision-making. nat. neurosci., 2022. URL https://zenodo.org/records/6846161. Zenodo dataset, record 6846161

  7. [7]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...

  8. [8]

    Liao, Tom Henighan, Tristan Hume, Zac Hatfield-Dodds, Sören Mindermann, Nicholas Joseph, Sam McCandlish, and Jared Kaplan

    Sandipan Kundu, Yuntao Bai, Saurav Kadavath, Amanda Askell, Andrew Callahan, Anna Chen, Anna Goldie, Avital Balwit, Azalia Mirhoseini, Brayden McLean, Catherine Olsson, Cassie Evraets, Eli Tran-Johnson, Esin Durmus, Ethan Perez, Jackson Kernion, Jamie Kerr, Kamal Ndousse, Karina Nguyen, Nelson Elhage, Newton Cheng, Nicholas Schiefer, Nova 46 DasSarma, Oli...

  9. [9]

    Liao, Esin Durmus, Alex Tamkin, and Deep Ganguli

    Saffron Huang, Divya Siddarth, Liane Lovitt, Thomas I. Liao, Esin Durmus, Alex Tamkin, and Deep Ganguli. Collective constitutional ai: Aligning a language model with public input, 2024. URL https://arxiv.org/abs/2406.07814

  10. [10]

    The Platonic Representation Hypothesis

    Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis, 2024. URL https://arxiv.org/abs/2405.07987

  11. [11]

    Similarity of Neural Network Representations Revisited

    Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited, 2019. URL https://arxiv.org/abs/1905.00414

  12. [12]

    Transferring linear features across language models with model stitching

    Alan Chen, Jack Merullo, Alessandro Stolfo, and Ellie Pavlick. Transferring linear features across language models with model stitching, 2025. URL https://arxiv.org/abs/2506.06609

  13. [13]

    Rishi Jha, Collin Zhang, Vitaly Shmatikov, and John X. Morris. Harnessing the universal geometry of embeddings, 2025. URL https://arxiv.org/abs/2505.12540

  14. [14]

    Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

    Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference- time intervention: Eliciting truthful answers from a language model, 2023. URL https: //arxiv.org/abs/2306.03341

  15. [15]

    Steering Language Models With Activation Engineering

    Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering, 2023. URL https://arxiv.org/abs/2308.10248

  16. [16]

    2024 , month = feb, number =

    Eric Todd, Millicent L. Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, and David Bau. Function vectors in large language models, 2023. URL https://arxiv.org/abs/2310.15213

  17. [17]

    and Potts, Christopher , booktitle=

    Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, and Christopher Potts. Reft: Representation finetuning for language models, 2024. URL https://arxiv.org/abs/2404.03592

  18. [18]

    Refusal in Language Models Is Mediated by a Single Direction

    Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction, 2024. URL https://arxiv.org/abs/2406.11717

  19. [19]

    On the non-identifiability of steering vectors in large language models, 2026

    Sohan Venkatesh and Ashish Mahendran Kurapath. On the non-identifiability of steering vectors in large language models, 2026. URL https://arxiv.org/abs/2602.06801

  20. [20]

    Where to Steer: Input-Dependent Layer Selection for Steering Improves LLM Alignment

    Soham Gadgil, Chris Lin, and Su-In Lee. Where to steer: Input-dependent layer selection for steering improves llm alignment, 2026. URL https://arxiv.org/abs/2604.03867

  21. [21]

    Steering vector fields for context-aware inference- time control in large language models, 2026

    Jiaqian Li, Yanshu Li, and Kuan-Hao Huang. Steering vector fields for context-aware inference- time control in large language models, 2026. URL https://arxiv.org/abs/2602.01654

  22. [22]

    Churchland, John P

    Mark M. Churchland, John P. Cunningham, Matthew T. Kaufman, Justin D. Foster, Paul Nuyujukian, Stephen I. Ryu, and Krishna V. Shenoy. Neural population dynamics during reaching.Nature, 2012. doi: 10.1038/nature11129. URL https://www.nature.com/articles/na ture11129. 47

  23. [23]

    Gallego, Matthew G

    Juan A. Gallego, Matthew G. Perich, Stephanie N. Naufel, Christian Ethier, Sara A. Solla, and Lee E. Miller. Cortical population activity within a preserved neural manifold underlies multiple motor behaviors.Nature Communications, 2018. doi: 10.1038/s41467-018-06560-z. URL https://www.nature.com/articles/s41467-018-06560-z

  24. [24]

    Oby, Alan D

    Emily R. Oby, Alan D. Degenhart, Erinn M. Grigsby, Asma Motiwala, Nicole T. McClain, Patrick J. Marino, Byron M. Yu, and Aaron P. Batista. Dynamical constraints on neural population activity.Nature Neuroscience, 2025. doi: 10.1038/s41593-024-01845-7. URL https://www.nature.com/articles/s41593-024-01845-7

  25. [25]

    Standardized and reproducible measurement of decision-making in mice.eLife, 2021

    The International Brain Laboratory, Valeria Aguillon-Rodriguez, Dora Angelaki, Hannah Bayer, Niccolo Bonacchi, Matteo Carandini, Fanny Cazettes, Gaelle Chapuis, Anne K Churchland, Yang Dan, Eric Dewitt, Mayo Faulkner, Hamish Forrest, Laura Haetzel, Michael Häusser, Sonja B Hofer, Fei Hu, Anup Khanal, Christopher Krasniak, Ines Laranjeira, Zachary F Mainen...

  26. [26]

    Reproducibility of in vivo electrophysiological measurements in mice.eLife, 2025

    International Brain Laboratory, Kush Banga, Julius Benson, Jai Bhagat, Dan Biderman, Daniel Birman, Niccolò Bonacchi, Sebastian A Bruijns, Kelly Buchanan, Robert AA Campbell, Matteo Carandini, Gaelle A Chapuis, Anne K Churchland, M Felicia Davatolhagh, Hyun Dong Lee, Mayo Faulkner, Berk Gerçek, Fei Hu, Julia Huntenburg, Cole Lincoln Hurwitz, Anup Khanal, ...

  27. [27]

    Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna ...

  28. [28]

    Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs, 2025

    Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow finetuning can produce broadly misaligned llms, 2025. URL https://arxiv.org/abs/2502.17424

  29. [29]

    arXiv preprint arXiv:2506.11613 , year=

    Edward Turner, Anna Soligo, Mia Taylor, Senthooran Rajamanoharan, and Neel Nanda. Model organisms for emergent misalignment, 2025. URL https://arxiv.org/abs/2506.11613

  30. [30]

    AI control: Improving safety despite intentional subversion,

    Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, and Fabien Roger. Ai control: Improving safety despite intentional subversion, 2023. URL https://arxiv.org/abs/2312.06942. 48

  31. [31]

    A sketch of an ai control safety case, 2025

    Tomek Korbak, Joshua Clymer, Benjamin Hilton, Buck Shlegeris, and Geoffrey Irving. A sketch of an ai control safety case, 2025. URL https://arxiv.org/abs/2501.17315

  32. [32]

    How to Evaluate Control Measures for LLM Agents? A Trajectory from Today to Superintelligence.arXiv preprint arXiv:2504.05259, 2025

    Tomek Korbak, Mikita Balesni, Buck Shlegeris, and Geoffrey Irving. How to evaluate control measures for llm agents? a trajectory from today to superintelligence, 2025. URL https: //arxiv.org/abs/2504.05259

  33. [33]

    Ai sandbagging: Language models can strategically underperform on evaluations.arXiv preprint arXiv:2406.07358, 2024

    Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, and Francis Rhys Ward. Ai sandbagging: Language models can strategically underperform on evaluations, 2024. URL https://arxiv.org/abs/2406.07358

  34. [34]

    The Hawthorne Effect in Reasoning Models: Evaluating and Steering Test Awareness

    Sahar Abdelnabi and Ahmed Salem. The hawthorne effect in reasoning models: Evaluating and steering test awareness, 2025. URL https://arxiv.org/abs/2505.14617

  35. [35]

    Dataset (matlab format) from chen kang et al (2021) modularity and robustness of frontal cortex networks

    Nuo Li and Guang Chen. Dataset (matlab format) from chen kang et al (2021) modularity and robustness of frontal cortex networks. cell, 184(14):3717-3730., 2022. URL https://zenodo .org/records/6713616. Zenodo dataset, record 6713616

  36. [36]

    Sun, Hemanth Mohan, Xu An, Steven Gluf, Shu-Jing Li, Rhonda Drewes, Emma Cravo, Irene Lenzi, Chaoqun Yin, Björn M

    Simon Musall, Xiaonan R. Sun, Hemanth Mohan, Xu An, Steven Gluf, Shu-Jing Li, Rhonda Drewes, Emma Cravo, Irene Lenzi, Chaoqun Yin, Björn M. Kampa, and Anne K. Churchland. Pyramidal cell types drive functionally distinct cortical activity patterns during decision-making. Nature Neuroscience, 2023. doi: 10.1038/s41593-022-01245-9. URL https://www.nature.com...

  37. [37]

    pyramidal cell types drive functionally distinct cortical activity patterns during decision-making

    Anne Churchland, Xiaonan Sun, and Simon Musall. Data supporting "pyramidal cell types drive functionally distinct cortical activity patterns during decision-making", 2023. URL https://plus.figshare.com/articles/dataset/Data_supporting_Pyramidal_cell_types_driv e_functionally_distinct_cortical_activity_patterns_during_decision-making_/21538458. Figshare da...

  38. [38]

    meta-llama/llama-3.1-8b-instruct, 2024

    Meta AI. meta-llama/llama-3.1-8b-instruct, 2024. URL https://huggingface.co/meta- llama/Llama-3.1-8B-Instruct. Hugging Face model card

  39. [39]

    Qwen/qwen3-8b, 2025

    Qwen Team. Qwen/qwen3-8b, 2025. URL https://huggingface.co/Qwen/Qwen3-8B. Hugging Face model card

  40. [40]

    churchlandlab/wfieldcelltypes, 2022

    Churchland Lab. churchlandlab/wfieldcelltypes, 2022. URL https://github.com/churchlandlab /wfieldCellTypes. GitHub repository

  41. [41]

    Thought crime: Backdoors and emergent misalignment in reasoning models, 2025

    James Chua, Jan Betley, Mia Taylor, and Owain Evans. Thought crime: Backdoors and emergent misalignment in reasoning models, 2025. URL https://arxiv.org/abs/2506.13206

  42. [42]

    Alignment faking in large language models

    Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Mindermann, Ethan Perez, Linda Petrini, Jonathan Uesato, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, and Evan Hubinger. Alignment faking in large language models, 202...

  43. [43]

    Deceptionbench: A comprehensive benchmark for evaluating deceptive behaviors in large language models, 2025

    PKU-Alignment. Deceptionbench: A comprehensive benchmark for evaluating deceptive behaviors in large language models, 2025. URL https://huggingface.co/datasets/PKU- Alignment/DeceptionBench. Hugging Face dataset card. 49