4 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- Transformers Provably Learn Sparse XOR with Polylogarithmic Parameters
  Single-layer two-head Transformers learn sparse XOR with O(polylog(d)) parameters in one gradient step, breaking the Omega(d) parameter bottleneck of FFNNs.
- Adapt or Forget: Provable Tradeoffs Between Adam and SGD in Nonstationary Optimization
  Adam's adaptive preconditioning and first-moment averaging improve high-probability tracking error in noise-dominated nonstationary regimes but can increase it under strong drift, where SGD achieves a smaller floor, with explicit beta-dependent bounds.
- On the Connectedness of Sublevel Sets in Invex Optimization
  Sublevel sets of invex functions are connected under mild assumptions, with the result extended to solution sets in invex-incave minimax problems and incave games.
- Metriplectic relaxation to equilibria
  Metriplectic systems converge to entropy extrema at fixed Hamiltonian under stated conditions; a Landau-inspired class reduces the check to two simpler conditions for use in equilibrium relaxation schemes.