Length generalization in arithmetic transformers

URL https: //arxiv · 2023 · arXiv 2306.15400

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

representative citing papers

On the Mirage of Long-Range Dependency, with an Application to Integer Multiplication

cs.LG · 2026-03-30 · unverdicted · novelty 8.0

Long-range dependency in integer multiplication is a mirage from 1D representation; a 2D grid reduces it to local 3x3 operations, letting a 321-parameter neural cellular automaton generalize perfectly to inputs 683 times longer than training while Transformers fail.

The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior

cs.LG · 2026-03-30 · unverdicted · novelty 7.0

The grokking delay in encoder-decoder models on one-step Collatz prediction stems from decoder inability to use early-learned encoder representations of parity and residue structure, with numeral base acting as a strong inductive bias that can raise accuracy from failure to 99.8%.

Globally Optimal Training of Spiking Neural Networks via Parameter Reconstruction

cs.NE · 2026-05-08 · unverdicted · novelty 6.0

A new parameter reconstruction method achieves globally optimal training for spiking neural networks by convexifying parallel recurrent threshold networks that include SNNs as a special case.

Understanding In-Context Learning for Nonlinear Regression with Transformers: Attention as Featurizer

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

Transformers can be built to act as nonlinear featurizers via attention, supporting in-context regression with proven generalization bounds on synthetic tasks.

Generalization in LLM Problem Solving: The Case of the Shortest Path

cs.AI · 2026-04-16 · unverdicted · novelty 6.0

LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail length scaling due to recursive instability, with data coverage setting hard limits.

citing papers explorer

Showing 5 of 5 citing papers.

On the Mirage of Long-Range Dependency, with an Application to Integer Multiplication cs.LG · 2026-03-30 · unverdicted · none · ref 30
Long-range dependency in integer multiplication is a mirage from 1D representation; a 2D grid reduces it to local 3x3 operations, letting a 321-parameter neural cellular automaton generalize perfectly to inputs 683 times longer than training while Transformers fail.
The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior cs.LG · 2026-03-30 · unverdicted · none · ref 9
The grokking delay in encoder-decoder models on one-step Collatz prediction stems from decoder inability to use early-learned encoder representations of parity and residue structure, with numeral base acting as a strong inductive bias that can raise accuracy from failure to 99.8%.
Globally Optimal Training of Spiking Neural Networks via Parameter Reconstruction cs.NE · 2026-05-08 · unverdicted · none · ref 19
A new parameter reconstruction method achieves globally optimal training for spiking neural networks by convexifying parallel recurrent threshold networks that include SNNs as a special case.
Understanding In-Context Learning for Nonlinear Regression with Transformers: Attention as Featurizer cs.LG · 2026-05-06 · unverdicted · none · ref 18
Transformers can be built to act as nonlinear featurizers via attention, supporting in-context regression with proven generalization bounds on synthetic tasks.
Generalization in LLM Problem Solving: The Case of the Shortest Path cs.AI · 2026-04-16 · unverdicted · none · ref 24
LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail length scaling due to recursive instability, with data coverage setting hard limits.

Length generalization in arithmetic transformers

fields

years

verdicts

representative citing papers

citing papers explorer