Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and showing strong masked language modeling results with or without positional embeddings.
arXiv preprint arXiv:2305.19466 (2023)
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 5roles
method 1polarities
use method 1representative citing papers
PJ-RoPE organizes relative-position mechanisms as a learnable Fourier-Jet-Affine space derived from lag-shift dynamics, extending RoPE and ALiBi with explicit jets and sector selection.
YaRN extends the context window of RoPE-based LLMs like LLaMA more efficiently than prior methods, using 10x fewer tokens and 2.5x fewer steps while surpassing state-of-the-art performance and enabling extrapolation beyond fine-tuning lengths.
Applies optimal transport to bound OOD generalization error in Transformers via Lipschitz continuity and TC^0 circuit depth lower bounds for Dyck-k backtracking, supported by evaluations on 54 configurations.
A survey of positional encoding methods in transformer-based time series models that evaluates fixed, learnable, relative, and hybrid approaches on classification tasks and links effectiveness to data characteristics.
citing papers explorer
No citing papers match the current filters.