Unveiling Transformers with LEGO: a synthetic reasoning task

Yi Zhang, Arturs Backurs, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Tal Wagner · 2022 · arXiv 2206.04301

5 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

cs.LG · 2022-11-01 · conditional · novelty 8.0

GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.

Massive Activations in Large Language Models

cs.CL · 2024-02-27 · unverdicted · novelty 7.0

Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.

Assign and Add: A Mechanistic Study of Compositional Arithmetic

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

Transformers reuse the same modular addition MLP for direct and variable-assigned inputs, with learning progressing through three phases that enable compositional generalization to unseen combinations.

Shortcut Solutions Learned by Transformers Impair Continual Compositional Reasoning

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

BERT learns shortcut solutions that impair generalization and forward transfer in continual LEGO, while ALBERT learns loop-like solutions that perform better. Both fail at cross-experience composition, though mixed-data training rescues ALBERT.

Depth-Staggered Fibonacci Spacing for Sparse Attention: Static Schedules Beat Learned Dilation and Extrapolate Where Dense Attention Fails

cs.CL · 2026-06-26 · unverdicted · novelty 5.0

Static depth-staggered Fibonacci sparse attention improves perplexity over fixed/learned variants and extrapolates to 4x context while dense attention fails.

citing papers explorer

Showing 2 of 2 citing papers after filters.

Massive Activations in Large Language Models

cs.CL · 2024-02-27 · unverdicted · none · ref 101

Depth-Staggered Fibonacci Spacing for Sparse Attention: Static Schedules Beat Learned Dilation and Extrapolate Where Dense Attention Fails

cs.CL · 2026-06-26 · unverdicted · none · ref 15
