One of the Q, K or V weights in transformer self-attention is redundant and replaceable by the identity matrix under mild assumptions, reducing parameters by 25 percent with no loss in small-model performance.
You can remove gpt2’s layernorm by fine-tuning.arXiv preprint arXiv:2409.13710
2 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.LG 2verdicts
UNVERDICTED 2representative citing papers
Derives an interaction measure between crosscoder features from reconstruction error in compact proofs and applies it to produce computationally sparse crosscoders retaining 60% MLP performance with single-feature selection versus 10% for standard crosscoders.
citing papers explorer
-
Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Self-Attention Transformers
One of the Q, K or V weights in transformer self-attention is redundant and replaceable by the identity matrix under mild assumptions, reducing parameters by 25 percent with no loss in small-model performance.
-
Interactions Between Crosscoder Features: A Compact Proofs Perspective
Derives an interaction measure between crosscoder features from reconstruction error in compact proofs and applies it to produce computationally sparse crosscoders retaining 60% MLP performance with single-feature selection versus 10% for standard crosscoders.