Consequently, for the model to infer nearest-neighbor 2-point correlations (i.e., bigrams) in a sequence, at least one attention layer must attend to the previous state.

Mechanistic Readouts

Only the attention layers of the transformer architecture can mix information along the sequence dimension.
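The point about sequence mixing can be illustrated with a toy computation. The sketch below is not taken from the cited work: the function names (`previous_token_attention`, `attend`, `bigram_counts`) and the hard-coded 0/1 attention weights are assumptions made for illustration, and a trained model would learn soft weights instead. It shows that once an attention pattern copies the token at position t-1 into position t, each position holds both tokens of a bigram and a purely position-wise operation can read the pair off.

```python
# Illustrative sketch only: hand-written attention weights, not a trained model.

def one_hot(token, vocab_size):
    """Encode a token id as a one-hot vector."""
    return [1.0 if i == token else 0.0 for i in range(vocab_size)]

def previous_token_attention(seq_len):
    """Causal attention pattern in which position t puts all weight on t-1.
    Position 0 has no predecessor, so it attends to itself."""
    weights = [[0.0] * seq_len for _ in range(seq_len)]
    weights[0][0] = 1.0
    for t in range(1, seq_len):
        weights[t][t - 1] = 1.0
    return weights

def attend(weights, values):
    """Mix value vectors along the sequence: out[t] = sum_s weights[t][s] * values[s]."""
    dim = len(values[0])
    return [
        [sum(weights[t][s] * values[s][d] for s in range(len(values)))
         for d in range(dim)]
        for t in range(len(values))
    ]

def bigram_counts(tokens, vocab_size):
    """Count (previous, current) token pairs using the attended previous token."""
    values = [one_hot(tok, vocab_size) for tok in tokens]
    prev = attend(previous_token_attention(len(tokens)), values)
    counts = [[0.0] * vocab_size for _ in range(vocab_size)]
    # Each position t >= 1 now holds its own token and its predecessor.
    # The accumulation over t is a plain Python loop for readability;
    # it is not a claim about how a transformer aggregates the counts.
    for t in range(1, len(tokens)):
        for a in range(vocab_size):
            for b in range(vocab_size):
                counts[a][b] += prev[t][a] * values[t][b]
    return counts

tokens = [0, 1, 1, 0, 1]
counts = bigram_counts(tokens, vocab_size=2)
print(counts)  # [[0.0, 2.0], [1.0, 1.0]]
```

Here `counts[a][b]` is the number of times token `b` directly follows token `a`. Without the `attend` step, position t would only see its own one-hot vector, so no position-wise operation could recover the pair.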

1 Pith paper cites this work.

representative citing papers

Distinct mechanisms underlying in-context learning in transformers

cs.LG · 2026-04-14 · unverdicted · novelty 6.0

Transformers develop four algorithmic phases of in-context learning on Markov chains via two distinct multi-layer subcircuit mechanisms, with phase boundaries set by data diversity K.
