Mechanistic Interpretability Needs Philosophy

Anders Søgaard; Constanza Fierro; Filippos Stamatiou; Iwan Williams; Joshua Hatherley; Nina Rajcic; Ninell Oldenburg; Ruchira Dhar; Sandrine R. Schiller

arxiv: 2506.18852 · v2 · pith:NUPL4CUDnew · submitted 2025-06-23 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-21 23:44 UTC · model grok-4.3

keywords: mechanistic interpretability, philosophy, AI explanation, interdisciplinary research, neural networks, epistemic issues, AI ethics

The pith

Mechanistic interpretability requires ongoing partnership with philosophy to clarify its concepts and methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that researchers who study the internal operations of neural networks should treat philosophy as a continuing collaborator rather than an optional add-on. This collaboration would sharpen definitions of key terms, improve how explanations are constructed, and confront questions about knowledge and responsibility that arise when interpreting AI behavior. The authors demonstrate the point by applying philosophical perspectives to three specific open problems in the existing interpretability literature. If the claim is correct, interpretability work would combine technical investigation with conceptual analysis, producing accounts of model behavior that are both more precise and more aware of their own limits. A reader focused on AI transparency would see the proposal as a way to reduce the risk that technical progress outruns clear understanding of what has actually been discovered.

Core claim

Mechanistic interpretability needs philosophy as an ongoing partner in clarifying its concepts, refining its methods, and navigating the epistemic and ethical complexities of interpreting AI systems. There is significant unrealised potential for progress in MI to be gained through deeper engagement with philosophers and philosophical frameworks. Taking three open problems from the MI literature as examples, this paper illustrates the value philosophy can add to MI research, and outlines a path toward deeper interdisciplinary dialogue.

What carries the argument

Application of philosophical frameworks to three open problems drawn from the mechanistic interpretability literature.

If this is right

Clarified concepts would allow more consistent descriptions of what counts as a mechanism inside a neural network.
Refined methods would produce explanations whose success can be evaluated against clearer standards.
Better handling of epistemic issues would improve assessments of how much understanding an interpretation actually delivers.
Attention to ethical complexities would guide more responsible choices about when and how to deploy interpretability tools.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Interpretability teams might begin including philosophers as regular collaborators on specific projects rather than as occasional reviewers.
Training programs for new researchers in the field could incorporate short modules on relevant philosophical distinctions.
Similar partnerships could be tested in adjacent areas such as AI alignment or robustness research.
Published MI papers might start containing explicit sections that state and examine their philosophical assumptions.

Load-bearing premise

The conceptual and methodological gaps in current mechanistic interpretability work are best addressed by engagement with philosophical frameworks rather than through further empirical or engineering advances alone.

What would settle it

A demonstration that new technical tools alone can resolve all major conceptual ambiguities and ethical questions in mechanistic interpretability without any philosophical contribution would show the proposed partnership is unnecessary.

Figures

Figures reproduced from arXiv: 2506.18852 by Anders Søgaard, Constanza Fierro, Filippos Stamatiou, Iwan Williams, Joshua Hatherley, Nina Rajcic, Ninell Oldenburg, Ruchira Dhar, Sandrine R. Schiller.

Original abstract

Mechanistic interpretability (MI) aims to explain how neural networks work by uncovering their underlying mechanisms. As the field grows in influence, it is increasingly important to examine not just models themselves, but the assumptions, concepts and explanatory strategies implicit in MI research. We argue that mechanistic interpretability needs philosophy as an ongoing partner in clarifying its concepts, refining its methods, and navigating the epistemic and ethical complexities of interpreting AI systems. There is significant unrealised potential for progress in MI to be gained through deeper engagement with philosophers and philosophical frameworks. Taking three open problems from the MI literature as examples, this paper illustrates the value philosophy can add to MI research, and outlines a path toward deeper interdisciplinary dialogue.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Position paper argues mechanistic interpretability needs philosophy for conceptual clarity but offers illustrations rather than worked solutions.

The one thing to know is that this is a position paper making the case for philosophy as a partner in mechanistic interpretability, illustrated through three open problems in the field. It does a solid job of selecting concrete examples from MI research and showing where philosophical concepts could help clarify things like the nature of mechanisms or the goals of explanation. The writing is direct and the structure makes it easy to see the connections to existing work in both areas. That targeted approach is better than a vague interdisciplinary pitch.

The softer part is the lack of a detailed demonstration. The paper points out the problems and suggests philosophy has tools for them, but it does not apply those tools in depth to produce a revised understanding or improved method. This leaves the argument at the level of a call for collaboration rather than evidence that it would deliver the promised progress. The assumption that philosophy is the best route over further technical work is reasonable but not strongly tested here.

This paper is for researchers in mechanistic interpretability who are interested in the conceptual side of their work and for philosophers looking at AI. It could be useful for anyone thinking about how to make interpretability more robust. It deserves a serious referee because it raises legitimate questions about the foundations of the field in a structured way. I would recommend sending it out for peer review to get input from experts in both disciplines.

Referee Report

2 major / 2 minor

Summary. The manuscript argues that mechanistic interpretability (MI) requires ongoing collaboration with philosophy to clarify concepts, refine methods, and address epistemic and ethical issues in interpreting AI systems. It uses three open problems from the MI literature as examples to demonstrate the potential value of philosophical engagement and proposes a path for deeper interdisciplinary dialogue.

Significance. If adopted, the recommendation could foster greater conceptual rigor in MI by drawing on philosophical tools for abstraction, explanation, and normativity, potentially improving the field's handling of interpretability limits and ethical implications. The paper is a clear position statement that draws on existing literature from both fields without circularity or ad-hoc parameters and explicitly credits the illustrative value of its examples rather than claiming resolution of the problems.

major comments (2)

[Sections on the three open problems] The paper illustrates how philosophical frameworks might address each issue but provides no worked example or detailed application of a specific philosophical method (e.g., conceptual analysis or epistemology of explanation) that produces a measurable advance in an MI technique; this leaves the central claim that philosophy yields progress on open problems as a normative assertion rather than a demonstrated outcome.
[Introduction] The argument that conceptual and methodological gaps are best addressed through philosophical partnership rather than further empirical or engineering advances is stated but not supported by a direct comparison showing why intra-MI technical progress has been or will be insufficient; this is load-bearing for the recommendation of ongoing partnership.

minor comments (2)

[Abstract] The abstract and conclusion could more explicitly frame the contribution as a call for dialogue rather than a proof of necessity, to align reader expectations with the position-paper format.
A short table or bullet list summarizing the three open problems and the corresponding philosophical angles would improve readability and make the illustrative structure easier to follow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments and positive evaluation of the manuscript's significance. We address each major comment in turn below.

Referee: Sections on the three open problems: the paper illustrates how philosophical frameworks might address each issue but provides no worked example or detailed application of a specific philosophical method (e.g., conceptual analysis or epistemology of explanation) that produces a measurable advance in an MI technique; this leaves the central claim that philosophy yields progress on open problems as a normative assertion rather than a demonstrated outcome.

Authors: The manuscript is explicitly framed as a position paper whose aim is to illustrate the potential value of philosophical engagement through conceptual analysis of open problems, rather than to deliver a worked example that demonstrates measurable technical progress. As noted in the abstract, we 'illustrate the value philosophy can add' and 'outline a path toward deeper interdisciplinary dialogue,' without claiming to have resolved the problems. Providing a full application with measurable advances would constitute a separate research contribution and falls outside the scope of this work. We maintain that the illustrative approach is sufficient to support the call for partnership and do not intend to expand the examples into full demonstrations in this revision.
Revision: no
Referee: Introduction: the argument that conceptual and methodological gaps are best addressed through philosophical partnership rather than further empirical or engineering advances is stated but not supported by a direct comparison showing why intra-MI technical progress has been or will be insufficient; this is load-bearing for the recommendation of ongoing partnership.

Authors: We agree that a more explicit comparison would strengthen the introduction. While the paper discusses specific open problems that involve conceptual clarification beyond pure engineering (such as the nature of explanations and normative assumptions), we can revise the introduction to briefly contrast cases where technical advances in MI have not resolved underlying philosophical issues. This will better justify why ongoing partnership is recommended. We will incorporate a short paragraph or sentences to this effect.
Revision: partial

Circularity Check

0 steps flagged

No significant circularity

The paper is a position paper whose central claim is a normative recommendation for ongoing interdisciplinary partnership between mechanistic interpretability and philosophy. It illustrates the claim with three open problems drawn from the existing MI literature but does not derive any technical prediction, parameter fit, or formal result from its own inputs. No self-definitional loops, fitted-input predictions, or load-bearing self-citation chains appear; the argument remains self-contained by appealing to external philosophical frameworks and MI examples without reducing the recommendation to a renaming or ansatz imported from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper rests on domain assumptions about the unique value of philosophical analysis for conceptual clarification in technical fields and the existence of specific open problems in MI that philosophy can address.

axioms (2)

domain assumption: Philosophy supplies conceptual clarification tools that are not already adequately provided by current MI research practices.
Invoked in the abstract when stating that MI needs philosophy to clarify concepts and refine methods.
ad hoc to paper: Deeper engagement with philosophers will yield progress on open problems in MI.
Central to the claim of unrealised potential and the call for interdisciplinary dialogue.

pith-pipeline@v0.9.0 · 5669 in / 1184 out tokens · 32514 ms · 2026-05-21T23:44:57.479395+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space
cs.CL 2026-04 unverdicted novelty 6.0

PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 1 Pith paper · 9 internal anchors

[1] Amos Azaria and Tom Mitchell. The internal state of an LLM knows when it's lying. arXiv preprint arXiv:2304.13734.
[2] Esma Balkir, Isar Nejadgholi, Kathleen Fraser, and Svetlana Kiritchenko. Necessity and sufficiency for explaining text classifiers: A case study in hate speech detection. In Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz, editors, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computatio... Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.192. URL https://aclanthology.org/2022.naacl-main.192/.
[3] Anne Barnhill. How philosophy might contribute to the practical ethics of online manipulation. In The Philosophy of Online Manipulation, pages 49–71. Routledge.
[4] Leonard Bereska and Efstratios Gavves. Mechanistic interpretability for AI safety -- a review. arXiv preprint arXiv:2404.14082.
[5] Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827.
[6] Nitay Calderon and Roi Reichart. On behalf of the stakeholders: Trends in NLP model interpretability in the era of LLMs. arXiv preprint arXiv:2407.19200.
[7] David J Chalmers. Propositional interpretability in artificial intelligence. arXiv preprint arXiv:2501.15740.
[8] Carl Craver, James Tabery, and Phyllis Illari. Mechanisms in Science. In Edward N. Zalta and Uri Nodelman, editors, The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, Fall 2024 edition.
[9] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652.
[10] Constanza Fierro, Ruchira Dhar, Filippos Stamatiou, Nicolas Garneau, and Anders Søgaard. Defining knowledge: Bridging epistemology and large language models. arXiv preprint arXiv:2410.02499.
[11] Dan Hendrycks and Laura Hiscott. The misguided quest for mechanistic AI interpretability. AI Frontiers, May.
[12] Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Risks from learned optimization in advanced machine learning systems. arXiv preprint arXiv:1906.01820.
[13] Neema Kotonya and Francesca Toni. Explainable automated fact-checking: A survey. arXiv preprint arXiv:2011.03870.
[14] Himabindu Lakkaraju, Ece Kamar, Rich Caruana, and Jure Leskovec. Faithful and customizable explanations of black box models. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pages 131–138.
[15] Martha Lewis and Melanie Mitchell. Evaluating the robustness of analogical reasoning in large language models. arXiv preprint arXiv:2411.14215.
[16] James Edwin Mahon. The Definition of Lying and Deception. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, Winter 2016 edition.
[17] Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824.
[18] Aaron Mueller, Jannik Brinkmann, Millicent Li, Samuel Marks, Koyena Pal, Nikhil Prakash, Can Rager, Aruna Sankaranarayanan, Arnab Sen Sharma, Jiuding Sun, et al. The quest for the right mediator: A history, survey, and theoretical grounding of causal interpretability. arXiv preprint arXiv:2408.01416.
[19] Marianna Nezhurina, Lucia Cipolina-Kun, Mehdi Cherti, and Jenia Jitsev. Alice in wonderland: Simple tasks showing complete reasoning breakdown in state-of-the-art large language models. arXiv preprint arXiv:2406.02061.
[20] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895.
[21] Lorenzo Pacchiardi, Alex J Chan, Sören Mindermann, Ilan Moscovitz, Alexa Y Pan, Yarin Gal, Owain Evans, and Jan Brauner. How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions. arXiv preprint arXiv:2309.15840.
[22] Marcin Rabiza. A Mechanistic Explanatory Strategy for XAI. In V. C. Müller, L. Dung, G. Löhr, and A. Rumana, editors, Philosophy of Artificial Intelligence: The State of the Art. Synthese Library, Springer Nature, forthcoming. URL http://arxiv.org/abs/2411.01332. Tilman Räuker, Anson Ho, Stephen Casper, and Dylan Hadfield-Menell. Toward transparent AI: A...
[23] Naomi Saphra and Sarah Wiegreffe. Mechanistic? arXiv preprint arXiv:2410.09087.
[24] Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, et al. Open problems in mechanistic interpretability. arXiv preprint arXiv:2501.16496.
[25] James WA Strachan, Dalila Albergo, Giulia Borghini, Oriana Pansardi, Eugenio Scaliti, Saurabh Gupta, Krati Saxena, Alessandro Rufo, Stefano Panzeri, Guido Manzi, et al. Testing theory of mind in large language models and humans. Nature Human Behaviour, 8(7):1285–1295.
[26] Eric Todd, Millicent L. Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, and David Bau. Function vectors in large language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net.
[27] Tomer Ullman. Large language models fail on trivial alterations to theory-of-mind tasks. arXiv preprint arXiv:2302.08399.
[28] Feiyu Xu, Hans Uszkoreit, Yangzhou Du, Wei Fan, Dongyan Zhao, and Jun Zhu. Explainable AI: A brief survey on history, research areas, approaches and challenges. In Natural Language Processing and Chinese Computing: 8th CCF International Conference, NLPCC 2019, Dunhuang, China, October 9–14, 2019, Proceedings, Part II 8, pages 563–574. Springer.
[29] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13, pages 818–833. Springer.
