Enhancing automated interpretability with output-centric feature descriptions

Yoav Gur-Arieh, Roy Mayan, Chen Agassy, Atticus Geiger, Mor Geva · 2025 · DOI 10.18653/v1/2025.acl-long.288

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Query Lens: Interpreting Sparse Key-Value Features with Indirect Effects

cs.LG · 2026-05-30 · unverdicted · novelty 7.0

Query Lens extends Logit Lens to interpret sparse features via key-value analysis and indirect effects, yielding coherent token signatures where Logit Lens fails, and proposes the Subspace Channel Hypothesis.

Prototype Language Models

cs.LG · 2026-07-01 · unverdicted · novelty 6.0

PRISM forms predictions as sparse mixtures of learned prototypes trained with clustering objectives, matching dense model accuracy while enabling ~500x faster data attribution and behavior editing without finetuning.

citing papers explorer

Query Lens: Interpreting Sparse Key-Value Features with Indirect Effects · cs.LG · 2026-05-30 · unverdicted · none · ref 17
Prototype Language Models · cs.LG · 2026-07-01 · unverdicted · none · ref 153
