Transformers develop four algorithmic phases of in-context learning on Markov chains via two distinct multi-layer subcircuit mechanisms, with phase boundaries set by data diversity K.
One way to accomplish this is by reducing the rate at which the first attention layer learns to attend to the previous token
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Distinct mechanisms underlying in-context learning in transformers
Transformers develop four algorithmic phases of in-context learning on Markov chains via two distinct multi-layer subcircuit mechanisms, with phase boundaries set by data diversity K.