{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:AMZB2PK2OIIHM4LVTRFSKH5RSI","short_pith_number":"pith:AMZB2PK2","schema_version":"1.0","canonical_sha256":"03321d3d5a72107671759c4b251fb1922e88b10314cc0fd577a0fc72e6fa437b","source":{"kind":"arxiv","id":"2501.16496","version":1},"attestation_state":"computed","paper":{"title":"Open Problems in Mechanistic Interpretability","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Mechanistic interpretability must solve open problems in methods, applications, and socio-technical challenges to achieve its goals of AI assurance and scientific insight.","cross_cats":[],"primary_cat":"cs.LG","authors_text":"Adria Garriga-Alonso, Alejandro Ortega, Arthur Conmy, Atticus Geiger, Bilal Chughtai, Daniel Murfet, David Bau, Eric J. Michaud, Eric Todd, Jack Lindsey, Jeff Wu, Jesse Hoogland, Jessica Rumbelow, Joseph Bloom, Joseph Miller, Joshua Batson, Lee Sharkey, Lucius Bushnaq, Martin Wattenberg, Max Tegmark, Mor Geva, Nandi Schoots, Neel Nanda, Nicholas Goldowsky-Dill, Stefan Heimersheim, Stella Biderman, Stephen Casper, Tom McGrath, William Saunders","submitted_at":"2025-01-27T20:57:18Z","abstract_excerpt":"Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater assurance over AI system behavior and shed light on exciting scientific questions about the nature of intelligence. Despite recent progress toward these goals, there are many open problems in the field that require solutions before many scientific and practical benefits can be realized: Our methods require both conceptual and practical improvements to reveal"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2501.16496","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.LG","submitted_at":"2025-01-27T20:57:18Z","cross_cats_sorted":[],"title_canon_sha256":"b9c6e9cb0c692d000881d31e45732bf54e9bca012ee1962914207934fab15ba2","abstract_canon_sha256":"12331aa25501eb215b232c8242d7d5d268c34924e775e461378931445601599e"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:39:22.195798Z","signature_b64":"coSWZXoJUDi9/bK3ewN8uUyIBemlgl/cTNgZfN/16M0Eu27jya2/LqaQucYE5ALIfGLwBKmMXIAxzAsT3GQwBA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"03321d3d5a72107671759c4b251fb1922e88b10314cc0fd577a0fc72e6fa437b","last_reissued_at":"2026-05-17T23:39:22.194980Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:39:22.194980Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Open Problems in Mechanistic Interpretability","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Mechanistic interpretability must solve open problems in methods, applications, and socio-technical challenges to achieve its goals of AI assurance and scientific insight.","cross_cats":[],"primary_cat":"cs.LG","authors_text":"Adria Garriga-Alonso, Alejandro Ortega, Arthur Conmy, Atticus Geiger, Bilal Chughtai, Daniel Murfet, David Bau, Eric J. Michaud, Eric Todd, Jack Lindsey, Jeff Wu, Jesse Hoogland, Jessica Rumbelow, Joseph Bloom, Joseph Miller, Joshua Batson, Lee Sharkey, Lucius Bushnaq, Martin Wattenberg, Max Tegmark, Mor Geva, Nandi Schoots, Neel Nanda, Nicholas Goldowsky-Dill, Stefan Heimersheim, Stella Biderman, Stephen Casper, Tom McGrath, William Saunders","submitted_at":"2025-01-27T20:57:18Z","abstract_excerpt":"Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater assurance over AI system behavior and shed light on exciting scientific questions about the nature of intelligence. Despite recent progress toward these goals, there are many open problems in the field that require solutions before many scientific and practical benefits can be realized: Our methods require both conceptual and practical improvements to reveal"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Progress in mechanistic interpretability promises greater assurance over AI system behavior and shed light on exciting scientific questions about the nature of intelligence, but many open problems require solutions before these benefits can be realized.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That solving the identified open problems in methods, applications, and socio-technical challenges will directly produce the promised scientific and engineering benefits.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"A review paper that organizes conceptual, practical, and socio-technical open problems in mechanistic interpretability.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Mechanistic interpretability must solve open problems in methods, applications, and socio-technical challenges to achieve its goals of AI assurance and scientific insight.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"7d4dfb5b0b1d119ebde1dcc51640c47d8a89f05bef124f224d3053073a780db3"},"source":{"id":"2501.16496","kind":"arxiv","version":1},"verdict":{"id":"0af7e63b-80cf-417c-acdb-5819e542a978","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T18:25:38.231306Z","strongest_claim":"Progress in mechanistic interpretability promises greater assurance over AI system behavior and shed light on exciting scientific questions about the nature of intelligence, but many open problems require solutions before these benefits can be realized.","one_line_summary":"A review paper that organizes conceptual, practical, and socio-technical open problems in mechanistic interpretability.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That solving the identified open problems in methods, applications, and socio-technical challenges will directly produce the promised scientific and engineering benefits.","pith_extraction_headline":"Mechanistic interpretability must solve open problems in methods, applications, and socio-technical challenges to achieve its goals of AI assurance and scientific insight."},"references":{"count":77,"sample":[{"doi":"10.1073/pnas.1907375117","year":2024,"title":"Understanding the role of individual units in a deep neural network","work_id":"6b96f855-8b1d-4fc1-8171-8e1a6a16fea9","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.23915/distill.00015","year":2019,"title":"https://distill.pub/2019/activation-atlas","work_id":"aa26f856-6ec2-4377-9256-be457f8d0629","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.1162/tacl_a_00359","year":2023,"title":"Amnesic Probing: Behavioral Explanation with Amnesic Counterfactuals","work_id":"c60c9b79-a372-4ddc-a28f-4c6fa9d62204","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.18653/v1/w16-2524","year":2009,"title":"Probing for semantic evidence of composition by means of simple classification tasks","work_id":"48f79331-577a-482b-b14d-fcad29a5a94c","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.1145/3531146.3533074","year":2024,"title":"ISBN 9781450393522","work_id":"1bb029ce-48e8-4920-9a76-a9d1d929a6b7","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":77,"snapshot_sha256":"2d3b5cdfb4ce25045e3a3ba4b74499d79e9c838326bb6ce0bdfa813f196ef95e","internal_anchors":3},"formal_canon":{"evidence_count":2,"snapshot_sha256":"efb565fc2dbc6f1aa1e1681ca94a0bd28653d2101d76a266eac508169ae5c88f"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2501.16496","created_at":"2026-05-17T23:39:22.195117+00:00"},{"alias_kind":"arxiv_version","alias_value":"2501.16496v1","created_at":"2026-05-17T23:39:22.195117+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2501.16496","created_at":"2026-05-17T23:39:22.195117+00:00"},{"alias_kind":"pith_short_12","alias_value":"AMZB2PK2OIIH","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"AMZB2PK2OIIHM4LV","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"AMZB2PK2","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":40,"internal_anchor_count":40,"sample":[{"citing_arxiv_id":"2510.03271","citing_title":"Decision Potential Surface: A Theoretical and Practical Approximation of Large Language Model Decision Boundary","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2510.03271","citing_title":"Decision Potential Surface: A Theoretical and Practical Approximation of Large Language Model Decision Boundary","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21849","citing_title":"Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22531","citing_title":"Disentanglement Beyond Generative Models with Riemannian ICA","ref_index":76,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22532","citing_title":"Relational Linear Properties in Language Models: An Empirical Investigation","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2506.18852","citing_title":"Mechanistic Interpretability Needs Philosophy","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2512.05742","citing_title":"Internal Deployment in the AI Act","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20607","citing_title":"Mechanistic Interpretability for Learning Assurance of a Vision-Based Landing System","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12770","citing_title":"WriteSAE: Sparse Autoencoders for Recurrent State","ref_index":68,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16041","citing_title":"Explainable AI Isn't Enough! Rethinking Algorithmic Contestability","ref_index":46,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09391","citing_title":"Do Linear Probes Generalize Better in Persona Coordinates?","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2508.05463","citing_title":"Task complexity shapes internal representations and robustness in neural networks","ref_index":57,"is_internal_anchor":true},{"citing_arxiv_id":"2509.13316","citing_title":"Do Activation Verbalization Methods Convey Privileged Information?","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2509.25843","citing_title":"ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2510.00468","citing_title":"Feature Identification via the Empirical NTK","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2510.01025","citing_title":"Hypothesis-Driven Feature Manifold Analysis in LLMs via Supervised Multi-Dimensional Scaling","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12770","citing_title":"WriteSAE: Sparse Autoencoders for Recurrent State","ref_index":68,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12770","citing_title":"WriteSAE: Sparse Autoencoders for Recurrent State","ref_index":68,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13329","citing_title":"Tracing Persona Vectors Through LLM Pretraining","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2604.16426","citing_title":"Functional Similarity Metric for Neural Networks: Overcoming Parametric Ambiguity via Activation Region Analysis","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12128","citing_title":"Metaphor Is Not All Attention Needs","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11887","citing_title":"Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models","ref_index":63,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12412","citing_title":"Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space","ref_index":128,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09438","citing_title":"fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09391","citing_title":"Do Linear Probes Generalize Better in Persona Coordinates?","ref_index":21,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/AMZB2PK2OIIHM4LVTRFSKH5RSI","json":"https://pith.science/pith/AMZB2PK2OIIHM4LVTRFSKH5RSI.json","graph_json":"https://pith.science/api/pith-number/AMZB2PK2OIIHM4LVTRFSKH5RSI/graph.json","events_json":"https://pith.science/api/pith-number/AMZB2PK2OIIHM4LVTRFSKH5RSI/events.json","paper":"https://pith.science/paper/AMZB2PK2"},"agent_actions":{"view_html":"https://pith.science/pith/AMZB2PK2OIIHM4LVTRFSKH5RSI","download_json":"https://pith.science/pith/AMZB2PK2OIIHM4LVTRFSKH5RSI.json","view_paper":"https://pith.science/paper/AMZB2PK2","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2501.16496&json=true","fetch_graph":"https://pith.science/api/pith-number/AMZB2PK2OIIHM4LVTRFSKH5RSI/graph.json","fetch_events":"https://pith.science/api/pith-number/AMZB2PK2OIIHM4LVTRFSKH5RSI/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/AMZB2PK2OIIHM4LVTRFSKH5RSI/action/timestamp_anchor","attest_storage":"https://pith.science/pith/AMZB2PK2OIIHM4LVTRFSKH5RSI/action/storage_attestation","attest_author":"https://pith.science/pith/AMZB2PK2OIIHM4LVTRFSKH5RSI/action/author_attestation","sign_citation":"https://pith.science/pith/AMZB2PK2OIIHM4LVTRFSKH5RSI/action/citation_signature","submit_replication":"https://pith.science/pith/AMZB2PK2OIIHM4LVTRFSKH5RSI/action/replication_record"}},"created_at":"2026-05-17T23:39:22.195117+00:00","updated_at":"2026-05-17T23:39:22.195117+00:00"}