Enhancing automated interpretability with output-centric feature descriptions
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.LG (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Citing papers
- Query Lens: Interpreting Sparse Key-Value Features with Indirect Effects
  Query Lens extends Logit Lens to interpret sparse features via key-value analysis and indirect effects, yielding coherent token signatures where Logit Lens fails, and proposes the Subspace Channel Hypothesis.
- Prototype Language Models
  PRISM forms predictions as sparse mixtures of learned prototypes trained with clustering objectives, matching dense model accuracy while enabling ~500x faster data attribution and behavior editing without finetuning.