SISA adds an SSM importance term inside the attention score and runs the full operation as one SDPA call on augmented Q/K vectors, reporting better LAMBADA and perfect NIAH at small scale.
Dape: Data-adaptive positional encoding for length extrapolation
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2verdicts
UNVERDICTED 2representative citing papers
Applies optimal transport to bound OOD generalization error in Transformers via Lipschitz continuity and TC^0 circuit depth lower bounds for Dyck-k backtracking, supported by evaluations on 54 configurations.
citing papers explorer
-
Forget Attention: Importance-Aware Attention Is All You Need
SISA adds an SSM importance term inside the attention score and runs the full operation as one SDPA call on augmented Q/K vectors, reporting better LAMBADA and perfect NIAH at small scale.
-
A Measure-Theoretic Analysis of Reasoning: Structural Generalization and Approximation Limits
Applies optimal transport to bound OOD generalization error in Transformers via Lipschitz continuity and TC^0 circuit depth lower bounds for Dyck-k backtracking, supported by evaluations on 54 configurations.