Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal
Pith reviewed 2026-06-27 10:31 UTC · model grok-4.3
The pith
Interpretability protocols let practitioners inspect preference data at the concept level and explicitly decide which behaviors a model should learn during post-training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a data-centric post-training pipeline that uses interpretability protocols to develop statistical hypotheses for the latent concepts separating preferred from dispreferred generations, making them explicit for fine-grained user feedback. Building on this view, we unify several interpretability-based training protocols as ways of shaping rewards via feature or data interventions. Empirically, the pipeline diagnoses undesirable signals in existing preference data, mitigates off-target learning, and helps amplify desired properties such as safeguards and model personality.
What carries the argument
A data-centric post-training pipeline that applies interpretability protocols to preference datasets to extract statistical hypotheses about latent concepts distinguishing preferred from dispreferred generations.
If this is right
- Existing preference datasets can be audited for undesirable signals before any optimization occurs.
- Off-target learning during fine-tuning can be reduced by intervening on identified concepts.
- Desired model properties such as safeguards or specific personality traits can be amplified through targeted data or feature changes.
- Post-training moves from scalar reward optimization to direct sculpting of the learning signal.
Where Pith is reading between the lines
- The same concept-level inspection could be applied to earlier training stages to catch issues before post-training.
- Iterative rounds of concept feedback might allow smaller preference datasets to achieve comparable alignment quality.
- Teams without deep interpretability expertise could still use the pipeline if the hypotheses are presented as simple editable lists of behaviors.
Load-bearing premise
Interpretability methods can produce reliable, actionable statistical hypotheses about which latent concepts separate preferred from dispreferred model outputs.
What would settle it
A controlled experiment in which the pipeline is applied to a preference dataset yet the resulting model still exhibits the same rate of off-target behaviors, such as over-stylization, as a model trained on the unmodified data.
Figures
read the original abstract
Language-model post-training is the main stage at which model behavior is shaped, yet it still largely involves optimization of scalar rewards that summarize diverse desiderata. This abstraction gives practitioners little visibility into what their data actually teaches models, allowing spurious correlations to be learned by a model and inducing undesirable behaviors such as over-stylization and sycophancy. To address this problem, we ask: can we inspect a preference dataset before optimization and decide, at the level of concepts, which behaviors a model should be allowed to learn? Motivated by this, we introduce a data-centric post-training pipeline that uses interpretability protocols to develop statistical hypotheses for the latent concepts separating preferred from dispreferred generations, making them explicit for fine-grained user feedback. Building on this view, we unify several interpretability-based training protocols as ways of shaping rewards via feature or data interventions. Empirically, we show that our pipeline diagnoses undesirable signals in existing preference data, mitigates off-target learning, and can also help amplify or shape desired properties such as safeguards and model personality. More broadly, our results suggest that interpretability can turn post-training from optimizing opaque proxy rewards into a process of auditing and sculpting the learning signal itself.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a data-centric post-training pipeline for language models that uses interpretability protocols to develop statistical hypotheses for latent concepts separating preferred from dispreferred generations in preference data. This enables fine-grained user feedback, unifies several interpretability-based training protocols as ways of shaping rewards via feature or data interventions, and empirically demonstrates diagnosis of undesirable signals, mitigation of off-target learning, and amplification of desired properties such as safeguards and model personality.
Significance. If the empirical results hold with appropriate controls and statistical rigor, the work could shift post-training from opaque scalar reward optimization to a more transparent process of auditing and sculpting the learning signal at the concept level. The conceptual unification of interpretability protocols and the emphasis on data characterization are strengths that address real issues like spurious correlations leading to sycophancy or over-stylization.
major comments (1)
- [Abstract] Abstract: the claim of empirical success in diagnosing undesirable signals, mitigating off-target learning, and shaping properties rests on interpretability-derived hypotheses being reliable and actionable, but the provided text supplies no details on the specific protocols, datasets, controls, or statistical tests used to support these results.
Simulated Author's Rebuttal
We thank the referee for their review and for highlighting the need for greater clarity on the empirical support for our claims. We address the single major comment below and are prepared to revise the abstract accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim of empirical success in diagnosing undesirable signals, mitigating off-target learning, and shaping properties rests on interpretability-derived hypotheses being reliable and actionable, but the provided text supplies no details on the specific protocols, datasets, controls, or statistical tests used to support these results.
Authors: We agree that the abstract is concise and does not enumerate the experimental details. The full manuscript supplies these in dedicated sections: interpretability protocols and hypothesis generation are formalized in Section 3, the preference datasets and concept annotations are described in Section 4, and the controls, ablation studies, and statistical tests (including significance thresholds and multiple-comparison corrections) appear in Section 5. Because abstracts are length-constrained, we propose a modest revision that adds one sentence summarizing the evaluation protocol and points readers to the relevant sections. This change would make the abstract's claims more self-contained without altering its high-level character. revision: partial
Circularity Check
No significant circularity; empirical pipeline is self-contained
full rationale
The paper describes an empirical, data-centric pipeline that applies interpretability protocols to preference datasets for diagnosing signals and shaping rewards via interventions. No equations, derivations, or fitted parameters are presented that reduce any claimed prediction or result to inputs defined by the authors' own prior work. Central claims rest on experimental outcomes rather than self-referential definitions or load-bearing self-citations, satisfying the criteria for a non-circular finding.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Interpretability protocols applied to preference data can produce statistical hypotheses about latent concepts that separate preferred from dispreferred generations
Reference graph
Works this paper leans on
-
[1]
On-policy distillation of language models: Learning from self-generated mistakes
Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The twelfth international conference on learning representations, 2024
2024
-
[2]
OLMES , 2025
Allen AI . OLMES , 2025. https://github.com/allenai/olmes
2025
-
[3]
System Card: Claude Mythos Preview , 2026
Anthropic . System Card: Claude Mythos Preview , 2026. https://www-cdn.anthropic.com/08ab9158070959f88f296514c21b7facce6f52bc.pdf
2026
-
[4]
Saes are good for steering--if you select the right features
Dana Arad, Aaron Mueller, and Yonatan Belinkov. Saes are good for steering--if you select the right features. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp.\ 10252--10270, 2025
2025
-
[6]
A general theoretical paradigm to understand learning from human preferences
Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. A general theoretical paradigm to understand learning from human preferences. In International Conference on Artificial Intelligence and Statistics, pp.\ 4447--4455. PMLR, 2024
2024
-
[9]
Probing classifiers: Promises, shortcomings, and advances
Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48 0 (1): 0 207--219, 2022
2022
-
[10]
Emergent misalignment: Narrow finetuning can produce broadly misaligned llms
Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Mart \' n Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow finetuning can produce broadly misaligned llms. arXiv preprint arXiv:2502.17424, 2025
arXiv 2025
-
[11]
Do sparse autoencoders capture concept manifolds? arXiv preprint arXiv:2604.28119, 2026
Usha Bhalla, Thomas Fel, Can Rager, Sheridan Feucht, Tal Haklay, Daniel Wurgaft, Siddharth Boppana, Matthew Kowal, Vasudev Shyam, Jack Merullo, et al. Do sparse autoencoders capture concept manifolds? arXiv preprint arXiv:2604.28119, 2026
Pith/arXiv arXiv 2026
-
[13]
Uncovering conceptual blindspots in generative image models using sparse autoencoders
Matyas Bohacek, Thomas Fel, Maneesh Agrawala, and Ekdeep Singh Lubana. Uncovering conceptual blindspots in generative image models using sparse autoencoders. A r X iv e-print , 2025
2025
-
[14]
Towards monosemanticity: Decomposing language models with dictionary learning
Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Ch...
2023
-
[15]
Using dictionary learning features as classifiers
Trenton Bricken, Jonathan Marcus, Siddharth Mishra-Sharma, Meg Tong, Ethan Perez, Mrinank Sharma, Kelley Rivoire, and Thomas Henighan. Using dictionary learning features as classifiers. Anthropic, 2024. https://transformer-circuits.pub/2024/features-as-classifiers/index.html
2024
-
[16]
Batchtopk sparse autoencoders
Bart Bussmann, Patrick Leask, and Neel Nanda. Batchtopk sparse autoencoders. A r X iv e-print , 2024
2024
-
[18]
A is for absorption: Studying feature splitting and absorption in sparse autoencoders, 2025
David Chanin, James Wilken-Smith, Tomas Dulka, Hardik Bhatnagar, Satvik Golechha, and Joseph Bloom. A is for absorption: Studying feature splitting and absorption in sparse autoencoders, 2025. URL https://arxiv.org/abs/2409.14507
arXiv 2025
-
[21]
Sycophantic ai decreases prosocial intentions and promotes dependence
Myra Cheng, Sunny Yu, Cinoo Lee, Pranav Khadpe, Lujain Ibrahim, and Dan Jurafsky. Sycophantic ai decreases prosocial intentions and promotes dependence. Science, 2025. URL https://api.semanticscholar.org/CorpusID:278768575
2025
-
[23]
Deep reinforcement learning from human preferences
Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017
2017
-
[24]
Gradient routing: Masking gradients to localize computation in neural networks
Alex Cloud, Jacob Goldman-Wetzler, Evzen Wybitul, Joseph Miller, and Alexander Matt Turner. Gradient routing: Masking gradients to localize computation in neural networks. arXiv preprint arXiv:2410.04332, 2024
arXiv 2024
-
[25]
Subliminal learning: Language models transmit behavioral traits via hidden signals in data
Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, and Owain Evans. Subliminal learning: Language models transmit behavioral traits via hidden signals in data. arXiv preprint arXiv:2507.14805, 2025
arXiv 2025
-
[26]
From flat to hierarchical: Extracting sparse representations with matching pursuit
Val \'e rie Costa, Thomas Fel, Ekdeep Singh Lubana, Bahareh Tolooshams, and Demba Ba. From flat to hierarchical: Extracting sparse representations with matching pursuit. arXiv preprint arXiv:2506.03093, 2025
arXiv 2025
-
[27]
Sparse autoencoders find highly interpretable features in language models
Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023
Pith/arXiv arXiv 2023
-
[28]
Plug and play language models: A simple approach to controlled text generation
Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. Plug and play language models: A simple approach to controlled text generation. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=H1edEyBKDS
2020
-
[29]
How abilities in large language models are affected by supervised fine-tuning data composition
Guanting Dong, Hongyi Yuan, Keming Lu, Chengpeng Li, Mingfeng Xue, Dayiheng Liu, Wei Wang, Zheng Yuan, Chang Zhou, and Jingren Zhou. How abilities in large language models are affected by supervised fine-tuning data composition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 177--198, 2024
2024
-
[30]
Kto: Model alignment as prospect theoretic optimization
Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024
Pith/arXiv arXiv 2024
-
[31]
Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, et al. Olmo 3. arXiv preprint arXiv:2512.13961, 2025
Pith/arXiv arXiv 2025
-
[32]
Archetypal sae: Adaptive and stable dictionary learning for concept extraction in large vision models
Thomas Fel, Ekdeep Singh Lubana, Jacob S Prince, Matthew Kowal, Victor Boutin, Isabel Papadimitriou, Binxu Wang, Martin Wattenberg, Demba Ba, and Talia Konkle. Archetypal sae: Adaptive and stable dictionary learning for concept extraction in large vision models. Proceedings of the International Conference on Machine Learning (ICML), 2025 a
2025
-
[33]
Into the rabbit hull: From task-relevant concepts in dino to minkowski geometry
Thomas Fel, Binxu Wang, Michael A Lepori, Matthew Kowal, Andrew Lee, Randall Balestriero, Sonia Joseph, Ekdeep S Lubana, Talia Konkle, Demba Ba, et al. Into the rabbit hull: From task-relevant concepts in dino to minkowski geometry. arXiv preprint arXiv:2510.08638, 2025 b
Pith/arXiv arXiv 2025
-
[34]
Scaling laws for reward model overoptimization
Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pp.\ 10835--10866. PMLR, 2023
2023
-
[35]
Scaling and evaluating sparse autoencoders
Leo Gao, Tom Dupr \'e la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093, 2024
Pith/arXiv arXiv 2024
-
[36]
Compositional preference models for aligning lms
Dongyoung Go, Tomasz Korbak, Germ \'a n Kruszewski, Jos Rozen, and Marc Dymetman. Compositional preference models for aligning lms. arXiv preprint arXiv:2310.13011, 2023 a
arXiv 2023
-
[37]
Aligning language models with preferences through f-divergence minimization
Dongyoung Go, Tomasz Korbak, Germ \'a n Kruszewski, Jos Rozen, Nahyeon Ryu, and Marc Dymetman. Aligning language models with preferences through f-divergence minimization. arXiv preprint arXiv:2302.08215, 2023 b
arXiv 2023
-
[38]
Rubrics as rewards: Reinforcement learning beyond verifiable domains
Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains. arXiv preprint arXiv:2507.17746, 2025
Pith/arXiv arXiv 2025
-
[39]
Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025
Pith/arXiv arXiv 2025
-
[40]
Projecting assumptions: The duality between sparse autoencoders and concept geometry
Sai Sumedh R Hindupur, Ekdeep Singh Lubana, Thomas Fel, and Demba Ba. Projecting assumptions: The duality between sparse autoencoders and concept geometry. arXiv preprint arXiv:2503.01822, 2025
arXiv 2025
-
[41]
Training compute-optimal large language models
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022
Pith/arXiv arXiv 2022
-
[42]
Large language models can self-improve
Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. In Proceedings of the 2023 conference on empirical methods in natural language processing, pp.\ 1051--1068, 2023
2023
-
[43]
Jonas H \"u botter, Frederike L \"u beck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802, 2026
Pith/arXiv arXiv 2026
-
[44]
Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks
Samyak Jain, Robert Kirk, Ekdeep Singh Lubana, Robert P Dick, Hidenori Tanaka, Edward Grefenstette, Tim Rockt \"a schel, and David Scott Krueger. Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks. arXiv preprint arXiv:2311.12786, 2023
arXiv 2023
-
[45]
Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models, 2024
Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, and Nouha Dziri. Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models, 2024. URL https://arxiv.org/abs/2406.18510
arXiv 2024
-
[46]
Interpretable embeddings with sparse autoencoders: A data analysis toolkit
Nick Jiang, Xiaoqing Sun, Lisa Dunlap, Lewis Smith, and Neel Nanda. Interpretable embeddings with sparse autoencoders: A data analysis toolkit. arXiv preprint arXiv:2512.10092, 2025
arXiv 2025
-
[47]
Scaling laws for neural language models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020
Pith/arXiv arXiv 2001
-
[48]
Reasoning with sampling: Your base model is smarter than you think
Aayush Karan and Yilun Du. Reasoning with sampling: Your base model is smarter than you think. arXiv preprint arXiv:2510.14901, 2025
Pith/arXiv arXiv 2025
-
[49]
Saebench: A comprehensive benchmark for sparse autoencoders in language model interpretability
Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, et al. Saebench: A comprehensive benchmark for sparse autoencoders in language model interpretability. Proceedings of the International Conference on Machine Learning (ICML), 2025
2025
-
[50]
Goodhart's law in reinforcement learning
Jacek Karwowski, Oliver Hayman, Xingjian Bai, Klaus Kiendlhofer, Charlie Griffin, and Joar Max Viktor Skalse. Goodhart's law in reinforcement learning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=5o9G4XF1LI
2024
-
[51]
Rethinking the role of proxy rewards in language model alignment
Sungdong Kim and Minjoon Seo. Rethinking the role of proxy rewards in language model alignment. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp.\ 20656--20674, 2024
2024
-
[52]
Rl with kl penalties is better viewed as bayesian inference
Tomasz Korbak, Ethan Perez, and Christopher Buckley. Rl with kl penalties is better viewed as bayesian inference. In Findings of the Association for Computational Linguistics: EMNLP 2022, pp.\ 1083--1091, 2022
2022
-
[53]
Matthew Kowal, Goncalo Paulo, Louis Jaburi, Tom Tseng, Lev E McKinney, Stefan Heimersheim, Aaron David Tucker, Adam Gleave, and Kellin Pelrine. Concept influence: Leveraging interpretability to improve performance and efficiency in training data attribution. arXiv preprint arXiv:2602.14869, 2026
arXiv 2026
-
[54]
Likelihood-based reward designs for general llm reasoning
Ariel Kwiatkowski, Natasha Butt, Ismail Labiad, Julia Kempe, and Yann Ollivier. Likelihood-based reward designs for general llm reasoning. arXiv preprint arXiv:2602.03979, 2026
arXiv 2026
-
[55]
A mechanistic understanding of alignment algorithms: A case study on dpo and toxicity
Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K Kummerfeld, and Rada Mihalcea. A mechanistic understanding of alignment algorithms: A case study on dpo and toxicity. arXiv preprint arXiv:2401.01967, 2024 a
arXiv 2024
-
[56]
Rlaif vs
Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Ren Lu, Thomas Mesnard, Johan Ferret, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, and Sushant Prakash. Rlaif vs. rlhf: Scaling reinforcement learning from human feedback with ai feedback, 2024 b
2024
-
[57]
Explanations from large language models make small reasoners better
Shiyang Li, Jianshu Chen, Yelong Shen, Zhiyu Chen, Xinlu Zhang, Zekun Li, Hong Wang, Jing Qian, Baolin Peng, Yi Mao, et al. Explanations from large language models make small reasoners better. arXiv preprint arXiv:2210.06726, 2022
arXiv 2022
-
[58]
Hashimoto
Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 5 2023
2023
-
[59]
Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivo...
2025
-
[60]
Openrubrics: Towards scalable synthetic rubric generation for reward modeling and llm alignment
Tianci Liu, Ran Xu, Tony Yu, Ilgee Hong, Carl Yang, Tuo Zhao, and Haoyu Wang. Openrubrics: Towards scalable synthetic rubric generation for reward modeling and llm alignment. arXiv preprint arXiv:2510.07743, 2025
arXiv 2025
-
[61]
The assistant axis: Situating and stabilizing the default persona of language models
Christina Lu, Jack Gallagher, Jonathan Michala, Kyle Fish, and Jack Lindsey. The assistant axis: Situating and stabilizing the default persona of language models. arXiv preprint arXiv:2601.10387, 2026. URL https://arxiv.org/abs/2601.10387
arXiv 2026
-
[62]
Priors in time: Missing inductive biases for language model interpretability
Ekdeep Singh Lubana, Can Rager, Sai Sumedh R Hindupur, Valerie Costa, Greta Tuckute, Oam Patel, Sonia Krishna Murthy, Thomas Fel, Daniel Wurgaft, Eric J Bigelow, et al. Priors in time: Missing inductive biases for language model interpretability. arXiv preprint arXiv:2511.01836, 2025
arXiv 2025
-
[63]
Teaching small language models to reason
Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adamek, Eric Malmi, and Aliaksei Severyn. Teaching small language models to reason. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp.\ 1773--1781, 2023
2023
-
[64]
Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824, 2023
Pith/arXiv arXiv 2023
-
[65]
Harmbench: A standardized evaluation framework for automated red teaming and robust refusal
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024
Pith/arXiv arXiv 2024
-
[66]
Detecting high-stakes interactions with activation probes
Alex McKenzie, Urja Pawar, Phil Blandfort, William Bankes, David Krueger, Ekdeep Singh Lubana, and Dmitrii Krasheninnikov. Detecting high-stakes interactions with activation probes. arXiv preprint arXiv:2506.10805, 2025
arXiv 2025
-
[67]
What's in my human feedback? learning interpretable descriptions of preference data
Rajiv Movva, Smitha Milli, Sewon Min, and Emma Pierson. What's in my human feedback? learning interpretable descriptions of preference data. arXiv preprint arXiv:2510.26202, 2025
Pith/arXiv arXiv 2025
-
[68]
Chunky post-training: Data driven failures of generalization
Seoirse Murray, Allison Qi, Timothy Qian, John Schulman, Collin Burns, and Sara Price. Chunky post-training: Data driven failures of generalization. arXiv preprint arXiv:2602.05910, 2026
arXiv 2026
-
[69]
Deploying interpretability to production with rakuten: Sae probes for pii detection
Nam Nguyen, Myra Deng, Dhruvil Gala, Kenta Naruse, Felix Giovanni Virgo, Michael Byun, Dron Hazra, Liv Gorton, Daniel Balsam, Thomas McGrath, Mio Takei, and Yusuke Kaji. Deploying interpretability to production with rakuten: Sae probes for pii detection. Goodfire, 2025. https://www.goodfire.ai/research/rakuten-sae-probes-for-pii-detection
2025
-
[70]
Excessive Use of Bold Formatting , 2025 a
OpenAI . Excessive Use of Bold Formatting , 2025 a . https://community.openai.com/t/excessive-used-of-bold-formatting/1110099
arXiv 2025
-
[71]
Expanding on What we Missed with Sycophancy , 2025 b
OpenAI . Expanding on What we Missed with Sycophancy , 2025 b . https://openai.com/index/expanding-on-sycophancy/
2025
-
[72]
Where the Goblins Came From , 2026
OpenAI . Where the Goblins Came From , 2026. https://openai.com/index/where-the-goblins-came-from/
2026
-
[73]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35: 0 27730--27744, 2022
2022
-
[74]
Samuel J. Paech. EQ-Bench creative writing benchmark v3. https://github.com/EQ-bench/creative-writing-bench, 2025
2025
-
[75]
Disentangling length from quality in direct preference optimization
Ryan Park, Rafael Rafailov, Stefano Ermon, and Chelsea Finn. Disentangling length from quality in direct preference optimization. In Findings of the Association for Computational Linguistics: ACL 2024, pp.\ 4998--5017, 2024
2024
-
[76]
Automatically interpreting millions of features in large language models
Gon c alo Paulo, Alex Mallen, Caden Juang, and Nora Belrose. Automatically interpreting millions of features in large language models. arXiv preprint arXiv:2410.13928, 2024
arXiv 2024
-
[77]
Pedregosa, G
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in P ython. Journal of Machine Learning Research, 12: 0 2825--2830, 2011
2011
-
[78]
The fineweb datasets: Decanting the web for the finest text data at scale
Guilherme Penedo, Hynek Kydlicek, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, Thomas Wolf, et al. The fineweb datasets: Decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems, 37: 0 30811--30849, 2024
2024
-
[79]
Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Benjamin Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson Ke...
-
[80]
Features as rewards: Scalable supervision for open-ended tasks via interpretability
Aaditya Vikram Prasad, Connor Watts, Jack Merullo, Dhruvil Gala, Owen Lewis, Thomas McGrath, and Ekdeep Singh Lubana. Features as rewards: Scalable supervision for open-ended tasks via interpretability. arXiv:2602.10067, 2026
arXiv 2026
-
[81]
Direct preference optimization: Your language model is secretly a reward model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36: 0 53728--53741, 2023
2023
-
[82]
Shaping capabilities with token-level data filtering
Neil Rathi and Alec Radford. Shaping capabilities with token-level data filtering. arXiv preprint arXiv:2601.21571, 2026
arXiv 2026
-
[83]
Rate: Causal explainability of reward models with imperfect counterfactuals
David Reber, Sean Richardson, Todd Nief, Cristina Garbacea, and Victor Veitch. Rate: Causal explainability of reward models with imperfect counterfactuals. arXiv preprint arXiv:2410.11348, 2024
arXiv 2024
-
[84]
Xstest: A test suite for identifying exaggerated safety behaviours in large language models
Paul R \"o ttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Pap...
2024
-
[85]
Deepseekmath: Pushing the limits of mathematical reasoning in open language models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024
Pith/arXiv arXiv 2024
-
[86]
Deepseekmath-v2: Towards self-verifiable mathematical reasoning
Zhihong Shao, Yuxiang Luo, Chengda Lu, ZZ Ren, Jiewen Hu, Tian Ye, Zhibin Gou, Shirong Ma, and Xiaokang Zhang. Deepseekmath-v2: Towards self-verifiable mathematical reasoning. arXiv preprint arXiv:2511.22570, 2025
arXiv 2025
-
[87]
Open problems in mechanistic interpretability
Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, et al. Open problems in mechanistic interpretability. arXiv preprint arXiv:2501.16496, 2025
Pith/arXiv arXiv 2025
-
[88]
Towards understanding sycophancy in language models
Mrinank Sharma, Meg Tong, Tomek Korbak, David Duvenaud, Amanda Askell, Sam Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. Towards understanding sycophancy in language models. In International Conf...
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.