You can remove GPT2’s LayerNorm by fine-tuning. arXiv preprint arXiv:2409.13710

Stefan Heimersheim · 2024 · arXiv 2409.13710

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Self-Attention Transformers

cs.LG · 2025-10-27 · unverdicted · novelty 7.0

One of the Q, K or V weights in transformer self-attention is redundant and replaceable by the identity matrix under mild assumptions, reducing parameters by 25 percent with no loss in small-model performance.

Interactions Between Crosscoder Features: A Compact Proofs Perspective

cs.LG · 2026-06-08 · unverdicted · novelty 6.0

Derives an interaction measure between crosscoder features from reconstruction error in compact proofs and applies it to produce computationally sparse crosscoders retaining 60% MLP performance with single-feature selection versus 10% for standard crosscoders.

citing papers explorer

Showing 2 of 2 citing papers.

Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Self-Attention Transformers

cs.LG · 2025-10-27 · unverdicted · none · ref 8

Interactions Between Crosscoder Features: A Compact Proofs Perspective

cs.LG · 2026-06-08 · unverdicted · none · ref 20
