Model Internal Sleuthing: Finding Lexical Identity and Inflectional Features in Modern Language Models
Large transformer-based language models dominate modern NLP, yet our understanding of how they encode linguistic information relies primarily on studies of early models like BERT and GPT-2. We systematically probe 25 models, from BERT Base to Qwen2.5-7B, focusing on two linguistic properties, lexical identity and inflectional features, across six diverse languages. We find a consistent pattern: inflectional features are linearly decodable throughout the model, while lexical identity is prominent in early layers but weakens with depth. Further analysis of the representation geometry reveals that models with aggressive mid-layer dimensionality compression show reduced steering effectiveness in those layers, even though probe accuracy remains high. Pretraining analysis shows that inflectional structure stabilizes early, while lexical identity representations continue to evolve. Taken together, our findings suggest that transformers maintain inflectional features across layers while trading off lexical identity for compact, predictive representations. Our code is available at https://github.com/ml5885/model_internal_sleuthing
Forward citations
Cited by 3 Pith papers
-
Inference-Time Machine Unlearning via Gated Activation Redirection
GUARD-IT performs machine unlearning in LLMs via inference-time gated activation redirection, matching or exceeding gradient-based baselines on TOFU and MUSE while preserving utility and working under quantization.
-
Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models
SemGrad is a gradient-based uncertainty quantification technique for free-form LLM generation that operates in semantic space using a Semantic Preservation Score to select stable embeddings.
-
Defragmenting Language Models: An Interpretability-based Approach for Vocabulary Expansion
Interpretability-based selection of vocabulary items plus FragMend initialization reduces token over-fragmentation and improves performance for non-Latin script languages by roughly 20 points over baselines.
Reference graph
Works this paper leans on
-
[1]
Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. https://openreview.net/forum?id=BJh6Ztuxl Fine-grained analysis of sentence embeddings using auxiliary prediction tasks . In 5th International Conference on Learning Representations (Conference Track)
-
[2]
Guillaume Alain and Yoshua Bengio. 2017. https://openreview.net/forum?id=ryF7rTqgl Understanding intermediate layers using linear classifier probes . In 5th International Conference on Learning Representations (Workshop Track)
-
[3]
Yonatan Belinkov and James Glass. 2019. https://doi.org/10.1162/tacl_a_00254 Analysis methods in neural language processing: A survey . Transactions of the Association for Computational Linguistics, 7:49--72
-
[4]
Leo Breiman. 2001. https://doi.org/10.1023/A:1010933404324 Random forests . Machine Learning, 45(1):5--32
-
[5]
Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, and 6 others. 2023. Towards monosemanticity: Decomposing language models with dictionary learning . Transformer Circuits Thread
-
[6]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, and 12 others. 2020. https://arxiv.org/abs/2005.14165 Language models are few-shot learners . Preprint, arXiv:2005.14165
-
[7]
Tyler A. Chang, Catherine Arnett, Zhuowen Tu, and Benjamin K. Bergen. 2024. https://arxiv.org/abs/2408.10441 Goldfish: Monolingual language models for 350 languages . Preprint, arXiv:2408.10441
-
[8]
Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. 2023. https://arxiv.org/abs/2309.08600 Sparse autoencoders find highly interpretable features in language models . Preprint, arXiv:2309.08600
-
[9]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. https://doi.org/10.18653/v1/N19-1423 BERT: Pre-training of deep bidirectional transformers for language understanding . In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171--4186, Minneapolis, Minnesota. Association for Computational Linguistics
-
[10]
Yanai Elazar, Shauli Ravfogel, Alon Jacovi, and Yoav Goldberg. 2021. https://doi.org/10.1162/tacl_a_00359 Amnesic probing: Behavioral explanation with amnesic counterfactuals . Transactions of the Association for Computational Linguistics, 9:160--175
-
[11]
Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, and 6 others. 2021. A mathematical framework for transformer circuits. Transformer Circuits Thread
-
[12]
Kawin Ethayarajh. 2019. https://doi.org/10.18653/v1/D19-1006 How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings . In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 55--65, Hong Kong, China. Association for Computational Linguistics
-
[13]
Atticus Geiger, Hanson Lu, Thomas F Icard, and Christopher Potts. 2021. https://openreview.net/forum?id=RmuXDtjDhG Causal abstractions of neural networks . In Advances in Neural Information Processing Systems
-
[14]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others. 2024. https://arxiv.org/abs/2407.21783 The Llama 3 herd of models . Preprint, arXiv:2407.21783
-
[15]
Dirk Groeneveld, Iz Beltagy, Evan Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, and 24 others. 2024. https://doi.org/10.18653/v1/2024.acl-long.841 OLMo: Accelerating the science of language models . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand. Association for Computational Linguistics
-
[16]
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 2009. https://web.stanford.edu/~hastie/ElemStatLearn/ The Elements of Statistical Learning: Data Mining, Inference, and Prediction , 2nd edition. Springer, New York, NY, USA
-
[17]
John Hewitt and Percy Liang. 2019. https://doi.org/10.18653/v1/D19-1275 Designing and interpreting probes with control tasks . In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733--2743, Hong Kong, China. Association for Computational Linguistics
-
[18]
Kurt Hornik, Maxwell Stinchcombe, and Halbert White. 1989. https://doi.org/10.1016/0893-6080(89)90020-8 Multilayer feedforward networks are universal approximators . Neural Networks, 2(5):359--366
-
[19]
Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, and 1 other. 2024. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186
-
[20]
Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. 2023. https://openreview.net/forum?id=6t0Kwf8-jrj Editing models with task arithmetic . In The Eleventh International Conference on Learning Representations
-
[21]
Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. https://doi.org/10.18653/v1/P19-1356 What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651--3657, Florence, Italy. Association for Computational Linguistics
-
[22]
Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980
-
[23]
Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, and 4 others. 2025. https://arxiv.org/abs/2411.15124 Tulu 3: Pushing frontiers in open language model post-training . Preprint, arXiv:2411.15124
-
[24]
Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2023. https://openreview.net/forum?id=aLLuYpn83y Inference-time intervention: Eliciting truthful answers from a language model . In Thirty-seventh Conference on Neural Information Processing Systems
-
[25]
Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019. https://doi.org/10.18653/v1/N19-1112 Linguistic knowledge and transferability of contextual representations . In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1073--1094, Minneapolis, Minnesota. Association for Computational Linguistics
-
[26]
Kevin Meng, David Bau, Alex J Andonian, and Yonatan Belinkov. 2022. https://openreview.net/forum?id=-h6WAS6eE4 Locating and editing factual associations in GPT . In Advances in Neural Information Processing Systems
-
[27]
nostalgebraist. 2020. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens Interpreting GPT: the logit lens
-
[28]
Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. 2024. https://arxiv.org/abs/2312.06681 Steering llama 2 via contrastive activation addition . Preprint, arXiv:2312.06681
-
[29]
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825--2830
-
[30]
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9
-
[31]
Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. https://doi.org/10.1162/tacl_a_00349 A primer in BERTology: What we know about how BERT works . Transactions of the Association for Computational Linguistics, 8:842--866
-
[32]
Frank Rosenblatt. 1958. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386
-
[33]
Nishant Subramani, Jason Eisner, Justin Svegliato, Benjamin Van Durme, Yu Su, and Sam Thomson. 2025. https://aclanthology.org/2025.naacl-long.615/ MICE for CATs: Model-internal confidence estimation for calibrating agents with tools . In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics
-
[34]
Nishant Subramani, Nivedita Suresh, and Matthew Peters. 2022. https://doi.org/10.18653/v1/2022.findings-acl.48 Extracting latent steering vectors from pretrained language models . In Findings of the Association for Computational Linguistics: ACL 2022, pages 566--581, Dublin, Ireland. Association for Computational Linguistics
-
[35]
Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. https://doi.org/10.18653/v1/P19-1452 BERT rediscovers the classical NLP pipeline . In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593--4601, Florence, Italy. Association for Computational Linguistics
-
[36]
Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. 2020. https://proceedings.neurips.cc/paper_files/paper/2020/file/92650b2e92217715fe312e6fa7b90d82-Paper.pdf Investigating gender bias in language models using causal mediation analysis . In Advances in Neural Information Processing Systems, volume 33
-
[37]
Elena Voita and Ivan Titov. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.14 Information-theoretic probing with minimum description length . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 183--196, Online. Association for Computational Linguistics
-
[38]
Ivan Vulić, Edoardo Maria Ponti, Robert Litschko, Goran Glavaš, and Anna Korhonen. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.586 Probing pretrained language models for lexical semantics . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7222--7240, Online. Association for Computational Linguistics
-
[39]
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, and 3 others. 2020. https://doi.org/10.18653/v1/2020.emnlp-demos.6 Transformers: State-of-the-art natural language processing . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38--45, Online. Association for Computational Linguistics
-
[40]
BigScience Workshop. 2023. https://arxiv.org/abs/2211.05100 BLOOM: A 176B-parameter open-access multilingual language model . Preprint, arXiv:2211.05100
-
[41]
Amir Zeldes. 2017. https://doi.org/10.1007/s10579-016-9343-x The GUM corpus: Creating multilayer resources in the classroom . Language Resources and Evaluation, 51(3):581--612