Long-range dependency in integer multiplication is a mirage from 1D representation; a 2D grid reduces it to local 3x3 operations, letting a 321-parameter neural cellular automaton generalize perfectly to inputs 683 times longer than training while Transformers fail.
Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian
6 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (6)
roles: background (1)
polarities: background (1)
representative citing papers
Transformers can be built to act as nonlinear featurizers via attention, supporting in-context regression with proven generalization bounds on synthetic tasks.
LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail length scaling due to recursive instability, with data coverage setting hard limits.
Parameter reconstruction algorithm for SNN training obtained by extending convexification of parallel feedforward threshold networks to the recurrent case that subsumes SNNs.
Supervised fine-tuning on gate-by-gate quantum simulation traces allows LLMs to achieve near-perfect accuracy in predicting quantum measurement outcomes, with added GRPO improving generalization to larger qubit counts.
citing papers explorer
- On the Mirage of Long-Range Dependency, with an Application to Integer Multiplication
  Long-range dependency in integer multiplication is a mirage from 1D representation; a 2D grid reduces it to local 3x3 operations, letting a 321-parameter neural cellular automaton generalize perfectly to inputs 683 times longer than training while Transformers fail.
- Understanding In-Context Learning for Nonlinear Regression with Transformers: Attention as Featurizer
  Transformers can be built to act as nonlinear featurizers via attention, supporting in-context regression with proven generalization bounds on synthetic tasks.
- Generalization in LLM Problem Solving: The Case of the Shortest Path
  LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail length scaling due to recursive instability, with data coverage setting hard limits.
- Globally Optimal Training of Spiking Neural Networks via Parameter Reconstruction
  Parameter reconstruction algorithm for SNN training obtained by extending convexification of parallel feedforward threshold networks to the recurrent case that subsumes SNNs.
- Fine-Tuning Large Language Models for Quantum Reasoning
  Supervised fine-tuning on gate-by-gate quantum simulation traces allows LLMs to achieve near-perfect accuracy in predicting quantum measurement outcomes, with added GRPO improving generalization to larger qubit counts.
- The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior