Building Better Activation Oracles

Adam Karvonen; Celeste De Schamphelaere; Jan Bauer; Neel Nanda; Niclas Luick

arxiv: 2606.02609 · v2 · pith:PX2B2ICFnew · submitted 2026-05-23 · 💻 cs.LG · cs.AI


Pith reviewed 2026-06-30 14:16 UTC · model grok-4.3


keywords: activation oracles, model interpretability, residual stream activations, on-policy training, evaluation benchmarks, language model hallucinations, scalable interpretability


The pith

Four changes to Activation Oracle training cut vagueness and hallucinations when interpreting model activations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to improve Activation Oracles, which generate text descriptions from residual stream activations in language models, by addressing their tendency to hallucinate or produce vague outputs. It tests four concrete adjustments to the training process: drawing data from on-policy rollouts of the target model, refining the conversational dataset, supplying activations from additional layers, and updating the injection formula. These adjustments produce only modest gains in raw task performance but noticeably better day-to-day usability. The authors also release AObench, an open evaluation suite meant to give more reliable measurements of oracle quality even when text-inversion effects complicate direct comparisons.

Core claim

Training Activation Oracles with on-policy rollouts, an improved conversational dataset, activations from more layers, and a refined injection formula yields substantial quality-of-life gains by reducing hallucinations and vagueness, even though capability improvements stay marginal. The work also contributes AObench as the first comprehensive open-source suite for measuring AO quality in the presence of text-inversion confounds.

What carries the argument

The Activation Oracle training regime, which conditions a separate model on residual stream activations to produce human-readable interpretations, with four specific updates to data collection, dataset content, layer coverage, and information injection.

If this is right

AOs become more practical tools for inspecting what specific activations encode inside a model.
AObench supplies a shared yardstick that future AO methods can be measured against.
The training changes support the broader goal of scalable, end-to-end interpretability by making activation-based tools more reliable.
Quality-of-life gains may encourage wider adoption of AOs for model debugging even before larger capability jumps appear.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar on-policy and multi-layer adjustments could be tested on other activation-conditioned models beyond the oracle setting.
If AObench sees adoption, progress in this area may shift from ad-hoc demos to tracked benchmark improvements.
The modest capability gains hint that architectural changes to the oracle itself, rather than just training data, may be required for bigger leaps.

Load-bearing premise

That the four training changes actually lower hallucination and vagueness rates rather than merely swapping one set of artifacts for another, and that AObench scores reflect true oracle quality without being dominated by text-inversion confounds.

What would settle it

Evaluating the updated oracles on AObench and finding no measurable drop in hallucination frequency or vagueness ratings relative to earlier versions, or finding that human raters judge the new outputs as equally unclear on the same prompts.

Figures

Figures reproduced from arXiv: 2606.02609 by Adam Karvonen, Celeste De Schamphelaere, Jan Bauer, Neel Nanda, Niclas Luick.

**Figure 1.** Activation Oracle overview. The Oracle receives residual-stream activations and a natural-language question, then produces an answer about the model state represented by those activations. Page footnote: ∗Equal contribution; author order determined randomly. Preprint. arXiv:2606.02609v2 [cs.LG] 4 Jun 2026. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png]

**Figure 2.** How our conversational dataset is constructed. A language model is asked to split a chain-of-thought and to produce a question about the split suffix that is plausibly answerable from the split prefix. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png]

**Figure 3.** Conversational dataset swap, isolated. Replacing only LatentQA [Pan et al., 2024] with our conversational dataset (leaving the past/future-lens corpus and layer choice fixed) improves the chance-adjusted AObench score from +0.244 to +0.310 (n = 3 seeds). This is the single largest step in our recipe. Footnote 3: You can explore our dataset here: https://huggingface.co/datasets/ceselder/cot-oracle-convqa-chunked-sonnet

**Figure 4.** Layer sweep. Layer 22 improves performance over Layer 18 (+0.025 on AObench), and using 5 contiguous layers improves it further still (+0.05 on AObench).

From Section 3.3, Training on on-policy data: To train Activation Oracles, we need scalable unsupervised training tasks. A common way to achieve this is to predict past and/or future tokens from the activations, known as past- or future-lens. This requires some data to source a…
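The past/future-lens task described in the Figure 4 text can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `lens_examples`, the window size `k`, and the token list are all hypothetical, and the real task pairs each target with the activation captured at `position` rather than with the token itself.

```python
from typing import List, Tuple


def lens_examples(tokens: List[str], position: int, k: int = 3) -> Tuple[List[str], List[str]]:
    """Return (past_target, future_target) for the activation taken at `position`.

    Assumption: the oracle is trained to predict up to `k` tokens before
    (past-lens) and up to `k` tokens after (future-lens) that position.
    """
    if not 0 <= position < len(tokens):
        raise IndexError("position out of range")
    past = tokens[max(0, position - k):position]       # tokens preceding the position
    future = tokens[position + 1:position + 1 + k]     # tokens following the position
    return past, future


# Hypothetical on-policy rollout, already tokenized.
rollout = ["The", "answer", "is", "probably", "42", ",", "wait", "no"]
past, future = lens_examples(rollout, position=4, k=2)
print(past)    # ['is', 'probably']
print(future)  # [',', 'wait']
```

Under this framing, moving from FineWeb to on-policy rollouts changes only where `tokens` comes from; the target construction stays the same.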

**Figure 5.** On-policy data. Replacing only the past/future-lens corpus, switching from FineWeb to on-policy chain-of-thought rollouts, improves the chance-adjusted AObench score from +0.244 to +0.274 (n = 3 seeds), a smaller effect than the conversational swap. [PITH_FULL_IMAGE:figures/full_fig_p004_5.png]

**Figure 6.** We ablate steering strength and find it marginally increases performance. [PITH_FULL_IMAGE:figures/full_fig_p005_6.png]

**Figure 7.** AObench ablation ladder. Each bar adds one of our interventions on top of the previous recipe. The conversational dataset swap (blue) drives the largest single-step improvement; multi-layer extraction and on-policy past/future-lens data each contribute additional uplift, and a 2× injection-strength tweak yields a final small gain. All runs trained on 50M tokens; error bars show 95% CI of seed mean. Halluci…

**Figure 8.** Hallucination and vagueness across the ablation ladder. Each bar adds one of our interventions on top of the previous recipe; error bars are 95% CI of seed mean.

From Section 5, Outlook: After substantial work on AOs, we believe they are a useful interpretability technique, but aren’t the best tool in all circumstances. They are best used for complex open-ended questions about activations, for instance, making sense of w…

**Figure 9.** Accuracy underestimates AO capability on Yes/No tasks. Qwen-based AOs default to “No” on several binary-classification items, which suppresses accuracy without affecting the Yes/No logit margin. ROC AUC is much higher and more stable across phrasings.

From the adjacent body text: Sweep the AO’s context window when comparing to black-box baselines. Many open-ended AO tasks (e.g., “why is the model about to backtrack?”) concern informat…
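The accuracy-versus-AUC point in Figure 9 can be illustrated with a toy example. The numbers below are invented for illustration and are not results from the paper; `margins` is assumed to mean logit("Yes") minus logit("No"), and both helper functions are hypothetical. A constant bias toward "No" shifts every margin down, which hurts thresholded accuracy but leaves the ranking, and therefore ROC AUC, unchanged.

```python
from typing import List


def accuracy(margins: List[float], labels: List[int], threshold: float = 0.0) -> float:
    """Fraction of items where (margin > threshold) matches the Yes/No label."""
    preds = [m > threshold for m in margins]
    return sum(p == bool(y) for p, y in zip(preds, labels)) / len(labels)


def roc_auc(margins: List[float], labels: List[int]) -> float:
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [m for m, y in zip(margins, labels) if y == 1]
    neg = [m for m, y in zip(margins, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 1 = "Yes" is correct, 0 = "No" is correct.
labels = [1, 1, 1, 0, 0, 0]
# Every margin carries a constant "No" bias, so two true "Yes" items fall below zero.
margins = [-0.5, -1.0, 0.2, -2.0, -3.0, -2.5]

print(accuracy(margins, labels))  # 4 of 6 correct at threshold 0
print(roc_auc(margins, labels))   # 1.0: every positive still outranks every negative
```

Removing the bias (for example, adding 1.5 to every margin) brings accuracy to 1.0 without changing the AUC, which is why a threshold-free metric is the more stable comparison here.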

**Figure 10.** Backtracking accuracy vs. AO context window. Performance rises steeply with the number of activation positions supplied. The AO matches the black-box baseline at ∼20 tokens and exceeds it at 50. [PITH_FULL_IMAGE:figures/full_fig_p011_10.png]

**Figure 11.** Consensus@10 precision/recall on Taboo. Requiring agreement among k = 10 samples cleanly trades coverage for precision, mitigating hallucination on the secret-word extraction task.

From the adjacent body text: Make a good eval, one that you think a good Activation Oracle should be able to do (solvability) and that is hard for a black-box monitor (text inversion; you can explicitly check this) [footnote 5]. Then try to find training tasks that would ma…
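The consensus idea in Figure 11 can be sketched as an abstain-unless-agreed rule. The exact agreement criterion used in the paper is not shown on this page, so the `min_agree` parameter, the helper names, and the toy samples below are assumptions made for illustration only.

```python
from collections import Counter
from typing import List, Optional, Tuple


def consensus_answer(samples: List[str], min_agree: int) -> Optional[str]:
    """Return the most common answer if at least `min_agree` samples give it, else abstain."""
    if not samples:
        return None
    answer, count = Counter(s.strip().lower() for s in samples).most_common(1)[0]
    return answer if count >= min_agree else None


def precision_recall(predictions: List[Optional[str]], truths: List[str]) -> Tuple[float, float]:
    """Precision over answered items; recall over all items (abstentions count as misses)."""
    answered = [(p, t) for p, t in zip(predictions, truths) if p is not None]
    correct = sum(p == t for p, t in answered)
    precision = correct / len(answered) if answered else 0.0
    recall = correct / len(truths) if truths else 0.0
    return precision, recall


# Three hypothetical secret-word items, each with k = 10 sampled oracle answers.
samples_per_item = [
    ["smile"] * 10,                # unanimous and correct
    ["smile"] * 6 + ["grin"] * 4,  # majority correct, not unanimous
    ["moon"] * 5 + ["star"] * 5,   # split and wrong
]
truths = ["smile", "smile", "sun"]

strict = [consensus_answer(s, min_agree=10) for s in samples_per_item]
loose = [consensus_answer(s, min_agree=1) for s in samples_per_item]
print(precision_recall(strict, truths))  # high precision, low recall
print(precision_recall(loose, truths))   # lower precision, higher recall
```

Raising `min_agree` makes the oracle abstain on split votes, which is the coverage-for-precision trade the caption describes.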

**Figure 12.** Per-eval AObench scores across the ablation ladder. Bars show chance-adjusted scores per task for each recipe in our ablation; white dots are individual seeds. Higher is better for every eval. The broad uplift visible in… [PITH_FULL_IMAGE:figures/full_fig_p013_12.png]
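Several captions above report chance-adjusted AObench scores, but the adjustment itself is not defined on this page. One common convention, assumed here and not confirmed by the source, rescales accuracy so that chance-level performance maps to 0 and perfect performance maps to 1. The function name and the example values are hypothetical.

```python
def chance_adjusted(accuracy: float, chance: float) -> float:
    """Rescale accuracy so that `chance` maps to 0.0 and 1.0 stays at 1.0.

    Assumption: this is one standard normalization, not necessarily AObench's.
    """
    if not 0.0 <= chance < 1.0:
        raise ValueError("chance must be in [0, 1)")
    return (accuracy - chance) / (1.0 - chance)


print(chance_adjusted(0.5, 0.5))    # 0.0: a coin flip on a binary task
print(chance_adjusted(0.75, 0.5))   # 0.5: halfway between chance and perfect
print(chance_adjusted(0.25, 0.25))  # 0.0: chance on a four-way task
```

Under a convention like this, scores from tasks with different numbers of answer options become comparable, which would be a precondition for averaging them into a single suite-level number.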

Original abstract

Activation Oracles (AOs) are promising methods for interpreting residual stream activations. However, current AOs face important issues, such as hallucinations and vagueness. Additionally, text-inversion confounds make them hard to evaluate. To this end, we improve the Activation Oracle (AO) training regime in four ways: training on on-policy rollouts, improving the conversational dataset, feeding more layers and an improvement to the injection formula. The capability improvements are marginal, but quality of life improvements are quite substantial. In addition, we open source the first comprehensive evaluation suite for AO quality, which we call AObench. Overall, we hope that our work sets a foundation that helps improve AOs and other models in the paradigm of scalable, end-to-end interpretability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Four targeted training changes to Activation Oracles plus the AObench release give usable improvements with ablations and controls that make the results checkable.


The paper's main takeaway is that switching to on-policy rollouts, using a better conversational dataset, feeding more layers, and tweaking the injection formula produces only marginal capability gains but noticeably better day-to-day behavior on hallucinations and vagueness. The authors back each change with separate ablation tables and human preference ratings, which is more evidence than most interpretability tooling papers supply.

What stands out is the AObench release. A shared suite that tries to isolate text-inversion artifacts should let the subfield compare oracles without each group inventing its own test. The controls described in the manuscript address the main confounder the authors themselves flag, so the benchmark looks like a genuine step forward rather than another unverified claim.

The soft spots are straightforward. Capability lifts stay small by the paper's own account, so this is engineering refinement rather than a conceptual shift. The assumption that the changes reduce artifacts instead of trading one set for another gets support from the ratings and controls, but downstream correlation with actual interpretability tasks is still thin. No load-bearing math or derivations are present, which matches the empirical focus.

This is for people already working with activation oracles in mechanistic interpretability. It will not move the broader field, but it gives that group better defaults and a common yardstick. The evidence is presented clearly enough to be worth referee time even if the headline claims are modest.

Referee Report

0 major / 3 minor

Summary. The manuscript claims that four modifications to Activation Oracle (AO) training—on-policy rollouts, an improved conversational dataset, feeding activations from more layers, and a tweak to the injection formula—produce only marginal capability gains but substantial quality-of-life improvements by reducing hallucinations and vagueness. It further releases AObench, the first comprehensive open-source evaluation suite for AO quality, which incorporates explicit controls for text-inversion confounds, along with per-change ablation tables and human preference ratings.

Significance. If the reported ablations and human ratings hold under the stated controls, the work is significant for mechanistic interpretability: it supplies concrete, testable training improvements for AOs and introduces a standardized benchmark that directly addresses a known evaluation confound. The provision of ablation tables, preference ratings, and inversion controls makes the central empirical claims falsifiable rather than circular.

minor comments (3)

[Abstract] Abstract: quantitative effect sizes or confidence intervals for the 'marginal' capability gains and 'substantial' QoL gains are absent; adding one or two representative numbers would strengthen the summary claim.
The injection-formula change is described only qualitatively; a short equation or pseudocode block would clarify the precise modification relative to prior work.
[AObench] AObench section: the precise protocol used to control for text-inversion artifacts should be stated explicitly (e.g., which prompts or metrics are held fixed) so readers can replicate the confound mitigation.
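To make the second comment concrete, the kind of pseudocode it asks for might look like the sketch below. The manuscript's actual injection formula is not reproduced on this page, so this assumes a norm-matched additive injection with a scalar `strength` coefficient; the function name, the parameter, and the toy vectors are hypothetical. Under that assumption, the 2× injection-strength tweak mentioned in the Figure 7 caption would correspond to `strength=2.0`.

```python
import math
from typing import List


def inject(hidden: List[float], activation: List[float], strength: float = 1.0) -> List[float]:
    """Add `activation` to `hidden` after rescaling it to `strength` times the norm of `hidden`.

    Assumed form: h' = h + strength * ||h|| * v / ||v||.
    """
    h_norm = math.sqrt(sum(x * x for x in hidden))
    v_norm = math.sqrt(sum(x * x for x in activation))
    if v_norm == 0.0:
        return list(hidden)  # nothing to inject
    scale = strength * h_norm / v_norm
    return [h + scale * v for h, v in zip(hidden, activation)]


h = [3.0, 4.0]  # oracle hidden state at the placeholder position (norm 5)
v = [0.0, 2.0]  # target-model activation to inject (norm 2)
print(inject(h, v, strength=1.0))  # [3.0, 9.0]
print(inject(h, v, strength=2.0))  # [3.0, 14.0]
```

Stating the change in this form would let readers see whether the modification is only a scalar multiplier or also alters the normalization.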

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our work and for recommending minor revision. The referee's summary accurately reflects the manuscript's claims regarding the four training modifications to Activation Oracles and the release of AObench.

Circularity Check

0 steps flagged

No significant circularity


The manuscript presents four empirical training modifications to Activation Oracles (on-policy rollouts, improved conversational data, more layers, injection formula tweak) plus release of AObench. No equations, derivations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the abstract or described content. Ablation tables and human ratings are supplied to support the claims, keeping the argument externally testable rather than self-referential by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities can be extracted. Training hyperparameters and dataset construction details are not reported.



Reference graph

Works this paper leans on

62 extracted references · 29 canonical work pages · 13 internal anchors

[1] Karvonen, Adam; Chua, James; Dumas, Cl... Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers. 2026. arXiv:2512.15674.
[2] Introspection Adapters: Training LLMs to Report Their Learned Behaviors. 2026.
[3] Sheshadri, Abhay; Ewart, Aidan; Fronsdal, Kai; Gupta, Isha; Bowman, Samuel R.; Price, Sara; Marks, Samuel; Wang, Rowan. arXiv:2602.22755.
[4] Goel, Avichal; Kim, Yoon; Shavit, Nir; Wang, Tony T. arXiv:2510.05092.
[5] Charakorn, Rujikorn; Cetin, Edoardo; Tang, Yujin; Lange, Robert Tjarko. Proceedings of the 42nd International Conference on Machine Learning (ICML). arXiv:2506.06105.
[6] Charakorn, Rujikorn; Cetin, Edoardo; Uesaka, Shinnosuke; Lange, Robert Tjarko. arXiv:2602.15902.
[7] Hubinger, Evan; Denison, Carson; Mu, Jesse; Lambert, Mike; Tong, Meg; MacDiarmid, Monte; Lanham, Tamera; Ziegler, Daniel M.; Maxwell, Tim; Cheng, Newton; Jermyn, Adam; Askell, Amanda; Radhakrishnan, Ansh; Anil, Cem; Duvenaud, David; Ganguli, Deep; Barez, Fazl; Clark, Jack; Ndousse, Kamal; Sachan, Ks...
[8] Bullwinkel, Blake; Severi, Giorgio; Hines, Keegan; Minnich, Amanda; Siva Kumar, Ram Shankar; Zunger, Yonatan. arXiv:2602.03085.
[9] Pu... Weight space Detection of Backdoors in LoRA Adapters. arXiv:2602.15195.
[10] Shen, Guangyu; Cheng, Siyuan; Xu, Xiangzhe; Zhou, Yuan; Guo, Hanxi; Zhang, Zhuo; Zhang, Xiangyu. arXiv:2510.05169.
[11] Wang, Rowan; Griffin, Avery; Treutlein, Johannes; Perez, Ethan; Michael, Julian; Roger, Fabien; Marks, Sam.
[12] Betley, Jan; Tan, Daniel; Warncke, Niels; Sztyber-Betley, Anna; Bao, Xuchan; Soto, Mart... Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned... Proceedings of the 42nd International Conference on Machine Learning (ICML). arXiv:2502.17424.
[13] Lindsey, Jack; Templeton, Adly; Marcus, Jonathan; Conerly, Thomas; Batson, Joshua; Olah, Christopher. 2024.
[14] Yang, An; Li, Anfeng; Yang, Baosong; Zhang, Beichen; Hui, Binyuan; Zheng, Bo; Yu, Bowen; Gao, Chang; Huang, Chengen; Lv, Chenxu; Zheng, Chujie; Liu, Dayiheng; Zhou, Fan; Huang, Fei; Hu, Feng; Ge, Hao; Wei, Haoran; Lin, Huan; Tang, Jialong; Yang, Jian; Tu, Jianhong; Zhang, Jianwei; Yang, Jia... Qwen3 Technical Report.
[15] Grattafiori, Aaron and others. The Llama 3 Herd of Models. arXiv:2407.21783.
[16] Qi, Xiangyu; Zeng, Yi; Xie, Tinghao; Chen, Pin-Yu; Jia, Ruoxi; Mittal, Prateek; Henderson, Peter. Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! International Conference on Learning Representations (ICLR). arXiv:2310.03693.
[17] Cloud, Alex; Le, Minh; Chua, James; Betley, Jan; Sztyber-Betley, Anna; Hilton, Jacob; Marks, Samuel; Evans, Owain. 2025.
[18] Zhong, Ziqian; Raghunathan, Aditi. Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs. arXiv:2508.00161.
[19] Qin, Andrew; Hua, Tim; Marks, Sam; Conmy, Arthur; Nanda, Neel. August 2025.
[20] Bushnaq, Lucius; Braun, Dan; Sharkey, Lee. arXiv:2506.20790.
[21] Bushnaq, Lucius; Braun, Dan; Clive-Griffin, Oliver; Bussmann, Bart; Hu, Nathan; Ivanitskiy, Michael; Linsefors, Linda; Sharkey, Lee.
[22] Sparse Mixtures of Linear Transforms (... 2025.
[23] Anthropic. 2025.
[24] Anthropic. 2026.
[25] Rafailov, Rafael; Sharma, Archit; Mitchell, Eric; Ermon, Stefano; Manning, Christopher D.; Finn, Chelsea. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Advances in Neural Information Processing Systems (NeurIPS). arXiv:2305.18290.
[26] Minder, Julian; Dumas, Cl... Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences. 2026.
[27] Cywi... Towards Eliciting Latent Knowledge from... arXiv:2505.14352.
[28] Soligo, Anna; Turner, Edward; Rajamanoharan, Senthooran; Nanda, Neel. International Conference on Learning Representations (ICLR). arXiv:2602.07852.
[29] Turner, Edward; Soligo, Anna; Taylor, Mia; Rajamanoharan, Senthooran; Nanda, Neel. arXiv:2506.11613.
[30] Xiao, Shitao; Liu, Zheng; Zhang, Peitian; Muennighoff, Niklas. C-Pack: Packed Resources For General Chinese Embeddings. arXiv:2309.07597.
[31] Han, Xiaolong; Neri, Ferrante; Jiang, Zijian; Wu, Fang; Ye, Yanfang; Yin, Lu; Wang, Zehong. arXiv:2603.15990.
[32] Liu, Zichen; Chen, Changyu; Li, Wenjun; Qi, Penghui; Pang, Tianyu; Du, Chao; Lee, Wee Sun; Lin, Min. Understanding R1-Zero-Like Training: A Critical Perspective. Conference on Language Modeling (COLM). arXiv:2503.20783.
[33] MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention. arXiv:2506.13585.
[34] Sutton, Richard S.
[35] Steinhardt, Jacob. December 2025.
[36] Huang, Vincent; Choi, Dami; Johnson, Daniel D.; Schwettmann, Sarah; Steinhardt, Jacob. Predictive concept decoders: Training scalable end-to-end interpretability assistants. 2025. arXiv:2512.15712.
[37] Choi, Dami; Huang, Vincent; Schwettmann, Sarah; Steinhardt, Jacob. November 2025.
[38] Ethayarajh, Kawin; Xu, Winnie; Muennighoff, Niklas; Jurafsky, Dan; Kiela, Douwe. KTO: Model Alignment as Prospect Theoretic Optimization. arXiv:2402.01306.
[39] Penedo, Guilherme; Kydl... The... arXiv:2406.17557.
[40] Hu, Edward J.; Shen, Yelong; Wallis, Phillip; Allen-Zhu, Zeyuan; Li, Yuanzhi; Wang, Shean; Wang, Lu; Chen, Weizhu. LoRA: Low-Rank Adaptation of Large Language Models. International Conference on Learning Representations (ICLR). arXiv:2106.09685.
[41] Training Language Models to Explain Their Own Computations. 2026.
[42] Pan, Alexander; Chen, Lijie; Steinhardt, Jacob. arXiv:2412.08686.
[43] Bills, Steven; Cammarata, Nick; Mossing, Dan; Tillman, Henk; Gao, Leo; Goh, Gabriel; Sutskever, Ilya; Leike, Jan; Wu, Jeff; Saunders, William.
[44] Halawi, Danny; Wei, Alexander; Wallace, Eric; Wang, Tony T.; Haghtalab, Nika; Steinhardt, Jacob. International Conference on Machine Learning (ICML). arXiv:2406.20053.
[45] Steinhardt, Jacob.
[46] Hugging Face Hub. 2026.
[47] Fraser-Taliente, Kit; Kantamneni, Subhash; Ong, Euan; Mossing, Dan; Lu, Christina; Bogdan, Paul C.; Ameisen, Emmanuel; Chen, James; Kishylau, Dzmitry; Pearce, Adam; Tarng, Julius; Wu, Alex; Wu, Jeff; Zhang, Yang; Ziegler, Daniel M.; Hubinger, Evan; Batson, Joshua; Lindsey, Jack; Zimmerman, Samuel; M...
[48] Jakkli, Arya; Rajamanoharan, Senthooran; Nanda, Neel. March 2026.
[49] Luick, Niclas. January 2026.
[50] Ivanova, Daria; Tyagi, Riya; Engels, Josh; Nanda, Neel. March 2026.
[51] Tianyu Gu; Brendan Dolan... BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain. 2017. arXiv:1708.06733.
[52] SHINE: A Scalable In-Context Hypernetwork for Mapping Context to LoRA in a Single Pass. 2026.
[53] Drag-and-Drop LLMs: Zero-Shot Prompt-to-Weights. 2025.
[54] Learning on LoRAs: GL-Equivariant Processing of Low-Rank Weight Spaces for Large Finetuned Models. 2024.
[55] A LoRA is Worth a Thousand Pictures. 2024.
[56] Interpreting the Weight Space of Customized Diffusion Models. 2024.
[57] Dataset Size Recovery from LoRA Weights. 2024.
[58] Duszenko, Jacek; Bielak, Piotr. Towards Weight-Space Interpretation of Low-Rank Adapters for Diffusion Models. Computational Science – ICCS 2025. 2025.
[59] SelfIE: Self-Interpretation of Large Language Model Embeddings. 2024.
[60] Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models. 2024.
[61] Emergent Introspective Awareness in Large Language Models. 2026.
[62] Meta-Models: An Architecture for Decoding LLM Behaviors Through Interpreted Embeddings and Natural Language. 2024.

[1] [1]

Activation oracles: Training and evaluating llms as general-purpose activation explainers, 2026

Karvonen, Adam and Chua, James and Dumas, Cl. Activation Oracles: Training and Evaluating. arXiv preprint arXiv:2512.15674 , year =. 2512.15674 , eprinttype =

work page arXiv

[2] [2]

2026 , eprint=

Introspection Adapters: Training LLMs to Report Their Learned Behaviors , author=. 2026 , eprint=

2026

[3] [3]

Abhay Sheshadri, Aidan Ewart, Kai Fronsdal, Isha Gupta, Samuel R Bowman, Sara Price, Samuel Marks, and Rowan Wang

Sheshadri, Abhay and Ewart, Aidan and Fronsdal, Kai and Gupta, Isha and Bowman, Samuel R. and Price, Sara and Marks, Samuel and Wang, Rowan , title =. arXiv preprint arXiv:2602.22755 , year =. 2602.22755 , eprinttype =

work page arXiv

[4] [4]

, title =

Goel, Avichal and Kim, Yoon and Shavit, Nir and Wang, Tony T. , title =. arXiv preprint arXiv:2510.05092 , year =. 2510.05092 , eprinttype =

work page arXiv

[5] [5]

Proceedings of the 42nd International Conference on Machine Learning (ICML) , year =

Charakorn, Rujikorn and Cetin, Edoardo and Tang, Yujin and Lange, Robert Tjarko , title =. Proceedings of the 42nd International Conference on Machine Learning (ICML) , year =. 2506.06105 , eprinttype =

work page arXiv

[6] [6]

arXiv preprint arXiv:2602.15902 , year =

Charakorn, Rujikorn and Cetin, Edoardo and Uesaka, Shinnosuke and Lange, Robert Tjarko , title =. arXiv preprint arXiv:2602.15902 , year =. 2602.15902 , eprinttype =

work page arXiv

[7] [7]

Hubinger, Evan and Denison, Carson and Mu, Jesse and Lambert, Mike and Tong, Meg and MacDiarmid, Monte and Lanham, Tamera and Ziegler, Daniel M. and Maxwell, Tim and Cheng, Newton and Jermyn, Adam and Askell, Amanda and Radhakrishnan, Ansh and Anil, Cem and Duvenaud, David and Ganguli, Deep and Barez, Fazl and Clark, Jack and Ndousse, Kamal and Sachan, Ks...

[8] [8]

arXiv preprint arXiv:2602.03085 , year =

Bullwinkel, Blake and Severi, Giorgio and Hines, Keegan and Minnich, Amanda and Siva Kumar, Ram Shankar and Zunger, Yonatan , title =. arXiv preprint arXiv:2602.03085 , year =. 2602.03085 , eprinttype =

work page arXiv

[9] [9]

Weight space Detection of Backdoors in LoRA Adapters

Pu. Weight space Detection of Backdoors in. arXiv preprint arXiv:2602.15195 , year =. 2602.15195 , eprinttype =

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

arXiv preprint arXiv:2510.05169 , year =

Shen, Guangyu and Cheng, Siyuan and Xu, Xiangzhe and Zhou, Yuan and Guo, Hanxi and Zhang, Zhuo and Zhang, Xiangyu , title =. arXiv preprint arXiv:2510.05169 , year =. 2510.05169 , eprinttype =

work page arXiv

[11] [11]

Wang, Rowan and Griffin, Avery and Treutlein, Johannes and Perez, Ethan and Michael, Julian and Roger, Fabien and Marks, Sam , title =

[12] [12]

Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., et al

Betley, Jan and Tan, Daniel and Warncke, Niels and Sztyber-Betley, Anna and Bao, Xuchan and Soto, Mart. Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned. Proceedings of the 42nd International Conference on Machine Learning (ICML) , year =. 2502.17424 , eprinttype =

work page arXiv

[13] [13]

2024 , url =

Lindsey, Jack and Templeton, Adly and Marcus, Jonathan and Conerly, Thomas and Batson, Joshua and Olah, Christopher , title =. 2024 , url =

2024

[14] [14]

Qwen3 Technical Report

Yang, An and Li, Anfeng and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Gao, Chang and Huang, Chengen and Lv, Chenxu and Zheng, Chujie and Liu, Dayiheng and Zhou, Fan and Huang, Fei and Hu, Feng and Ge, Hao and Wei, Haoran and Lin, Huan and Tang, Jialong and Yang, Jian and Tu, Jianhong and Zhang, Jianwei and Yang, Jia...

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

The Llama 3 Herd of Models

Grattafiori, Aaron and others , title =. arXiv preprint arXiv:2407.21783 , year =. 2407.21783 , eprinttype =

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Qi, Xiangyu and Zeng, Yi and Xie, Tinghao and Chen, Pin-Yu and Jia, Ruoxi and Mittal, Prateek and Henderson, Peter , title =. International Conference on Learning Representations (ICLR) , year =. 2310.03693 , eprinttype =

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

2025 , eprint =

Cloud, Alex and Le, Minh and Chua, James and Betley, Jan and Sztyber-Betley, Anna and Hilton, Jacob and Marks, Samuel and Evans, Owain , title =. 2025 , eprint =

2025

[18] Zhong, Ziqian and Raghunathan, Aditi. Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs. arXiv preprint arXiv:2508.00161.

[19] Qin, Andrew and Hua, Tim and Marks, Sam and Conmy, Arthur and Nanda, Neel. Aug. 2025.

[20] Bushnaq, Lucius and Braun, Dan and Sharkey, Lee. arXiv preprint arXiv:2506.20790.

[21] Bushnaq, Lucius and Braun, Dan and Clive-Griffin, Oliver and Bussmann, Bart and Hu, Nathan and Ivanitskiy, Michael and Linsefors, Linda and Sharkey, Lee.

[22] Sparse Mixtures of Linear Transforms (... 2025.

[23] Anthropic. 2025.

[24] Anthropic. 2026.

[25] Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Ermon, Stefano and Manning, Christopher D. and Finn, Chelsea. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Advances in Neural Information Processing Systems (NeurIPS). arXiv:2305.18290.

[26] Minder, Julian and Dumas, Cl. Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences. 2026.

[27] Cywi. Towards Eliciting Latent Knowledge from. arXiv preprint arXiv:2505.14352.

[28] Soligo, Anna and Turner, Edward and Rajamanoharan, Senthooran and Nanda, Neel. International Conference on Learning Representations (ICLR). arXiv:2602.07852.

[29] Turner, Edward and Soligo, Anna and Taylor, Mia and Rajamanoharan, Senthooran and Nanda, Neel. arXiv preprint arXiv:2506.11613.

[30] Xiao, Shitao and Liu, Zheng and Zhang, Peitian and Muennighoff, Niklas. C-Pack: Packed Resources For General Chinese Embeddings. arXiv preprint arXiv:2309.07597.

[31] Han, Xiaolong and Neri, Ferrante and Jiang, Zijian and Wu, Fang and Ye, Yanfang and Yin, Lu and Wang, Zehong. arXiv preprint arXiv:2603.15990.

[32] Liu, Zichen and Chen, Changyu and Li, Wenjun and Qi, Penghui and Pang, Tianyu and Du, Chao and Lee, Wee Sun and Lin, Min. Understanding R1-Zero-Like Training: A Critical Perspective. Conference on Language Modeling (COLM). arXiv:2503.20783.

[33] MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention. arXiv preprint arXiv:2506.13585.

[34] Sutton, Richard S.

[35] Steinhardt, Jacob. Dec. 2025.

[36] Huang, Vincent and Choi, Dami and Johnson, Daniel D. and Schwettmann, Sarah and Steinhardt, Jacob. Predictive concept decoders: Training scalable end-to-end interpretability assistants. arXiv preprint arXiv:2512.15712, 2025.

[37] Choi, Dami and Huang, Vincent and Schwettmann, Sarah and Steinhardt, Jacob. Nov. 2025.

[38] Ethayarajh, Kawin and Xu, Winnie and Muennighoff, Niklas and Jurafsky, Dan and Kiela, Douwe. KTO: Model Alignment as Prospect Theoretic Optimization. arXiv preprint arXiv:2402.01306.

[39] Penedo, Guilherme and Kydl. The. arXiv preprint arXiv:2406.17557.

[40] Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu. LoRA: Low-Rank Adaptation of Large Language Models. International Conference on Learning Representations (ICLR). arXiv:2106.09685.

[41] Training Language Models to Explain Their Own Computations. 2026.

[42] Pan, Alexander and Chen, Lijie and Steinhardt, Jacob. arXiv preprint arXiv:2412.08686.

[43] Bills, Steven and Cammarata, Nick and Mossing, Dan and Tillman, Henk and Gao, Leo and Goh, Gabriel and Sutskever, Ilya and Leike, Jan and Wu, Jeff and Saunders, William.

[44] Halawi, Danny and Wei, Alexander and Wallace, Eric and Wang, Tony T. and Haghtalab, Nika and Steinhardt, Jacob. International Conference on Machine Learning (ICML). arXiv:2406.20053.

[45] Steinhardt, Jacob.

[46] Hugging Face Hub. 2026.

[47] Fraser-Taliente, Kit and Kantamneni, Subhash and Ong, Euan and Mossing, Dan and Lu, Christina and Bogdan, Paul C. and Ameisen, Emmanuel and Chen, James and Kishylau, Dzmitry and Pearce, Adam and Tarng, Julius and Wu, Alex and Wu, Jeff and Zhang, Yang and Ziegler, Daniel M. and Hubinger, Evan and Batson, Joshua and Lindsey, Jack and Zimmerman, Samuel and M...

[48] Jakkli, Arya and Rajamanoharan, Senthooran and Nanda, Neel. Mar. 2026.

[49] Luick, Niclas. Jan. 2026.

[50] Ivanova, Daria and Tyagi, Riya and Engels, Josh and Nanda, Neel. Mar. 2026.

[51] Tianyu Gu and Brendan Dolan. BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain. 2017. arXiv:1708.06733.

[52] SHINE: A Scalable In-Context Hypernetwork for Mapping Context to LoRA in a Single Pass. 2026.

[53] Drag-and-Drop LLMs: Zero-Shot Prompt-to-Weights. 2025.

[54] Learning on LoRAs: GL-Equivariant Processing of Low-Rank Weight Spaces for Large Finetuned Models. 2024.

[55] A LoRA is Worth a Thousand Pictures. 2024.

[56] Interpreting the Weight Space of Customized Diffusion Models. 2024.

[57] Dataset Size Recovery from LoRA Weights. 2024.

[58] Duszenko, Jacek and Bielak, Piotr. Towards Weight-Space Interpretation of Low-Rank Adapters for Diffusion Models. Computational Science -- ICCS 2025. 2025.

[59] SelfIE: Self-Interpretation of Large Language Model Embeddings. 2024.

[60] Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models. 2024.

[61] Emergent Introspective Awareness in Large Language Models. 2026.

[62] Meta-Models: An Architecture for Decoding LLM Behaviors Through Interpreted Embeddings and Natural Language. 2024.