From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts?

Aaron Mueller; Andrew Lee; Dhanya Sridhar; Ekdeep Singh Lubana; Patrik Reizinger; Shruti Joshi

arxiv: 2512.15134 · v2 · pith:ASOSH5O4new · submitted 2025-12-17 · 💻 cs.LG · cs.AI· cs.CL

From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts?

Aaron Mueller , Andrew Lee , Shruti Joshi , Ekdeep Singh Lubana , Dhanya Sridhar , Patrik Reizinger This is my paper

classification 💻 cs.LG cs.AIcs.CL

keywords featuresconceptconceptsinterpretabilitydisentangledisentangledinsufficientisolation

0 comments

read the original abstract

A goal of interpretability is to recover disentangled representations of latent concepts (features) from the activations of neural networks. The quality of features is typically evaluated in isolation, and under implicit independence assumptions that may not hold in practice. Thus, it is unclear to what extent common featurization methods such as sparse autoencoders (SAEs) and probes disentangle one concept from another. We propose a multi-concept evaluation setting using concepts including sentiment, domain, voice, and tense. We evaluate how well featurizers produce disentangled representations of each concept, observing that features are typically sensitive to only one concept, but also that concepts are distributed across many features. Then, we steer these features, measuring whether each concept is independently manipulable, and whether features interact. Even in idealized settings, steering a feature often affects many concepts, despite a near absence of interaction effects. These results suggest that correlational metrics are insufficient to establish steering selectivity, and that demonstrating that two features operate in separate spaces is insufficient to claim that they will be selective for one concept. These results underscore the importance of multi-concept evaluations in interpretability research.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Disentanglement Beyond Generative Models with Riemannian ICA
cs.LG 2026-05 unverdicted novelty 8.0

RICA replaces ICA's global generative model with local Riemannian geometry, introducing a disentanglement tensor based on the Hessian of the log-likelihood and Ricci curvature to measure pointwise disentanglement, whi...
Tensor Product Representation Probes Reveal Shared Structure Across Linear Directions
cs.LG 2026-05 unverdicted novelty 7.0

Linear probes for Othello board states factor into tensor-product structure with square and color embeddings composed by a binding matrix, from which the linear probes can be directly recovered.
Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space
cs.CL 2026-04 unverdicted novelty 6.0

PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.