Depth-Width tradeoffs in Algorithmic Reasoning of Graph Tasks with Transformers
read the original abstract
Transformers have revolutionized the field of machine learning. In particular, they can be used to solve complex algorithmic problems, including graph-based tasks. In such algorithmic tasks a key question is what is the minimal size of a transformer that can implement the task. Recent work has begun to explore this problem for graph-based tasks, showing that for sub-linear embedding dimension (i.e., model width) logarithmic depth suffices. However, an open question, which we address here, is what happens if width is allowed to grow linearly, while depth is kept fixed. Here we analyze this setting, and provide the surprising result that with linear width, constant depth suffices for solving a host of graph-based problems. This suggests that a moderate increase in width can allow much shallower models, which are advantageous in terms of inference and train time. For other problems, we show that quadratic width is required. Our results demonstrate the complex and intriguing landscape of transformer implementations of graph-based algorithms. We empirically investigate these trade-offs between the relative powers of depth and width and find tasks where wider models have the same accuracy as deep models, while having much faster train and inference time due to parallelizable hardware.
This paper has not been read by Pith yet.
Forward citations
Cited by 3 Pith papers
-
Lost in Tokenization: Fundamental Trade-offs in Graph Tokenization for Transformers
Graph tokenizations for Transformers induce distinct depth regimes with proven separations and impossibility results for converting between them at limited depth.
-
Learning to Approximate Uniform Facility Location via Graph Neural Networks
A differentiable MPNN approximates uniform facility location with provable guarantees and outperforms standard approximation algorithms while closing the gap to exact ILP solutions.
-
Deep sequence models tend to memorize geometrically; it is unclear why
Deep sequence models develop geometric memory in embeddings that encodes novel global relationships, transforming l-fold composition tasks into 1-step navigation via a natural spectral bias connected to Node2Vec.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.