GRAPE unifies RoPE and ALiBi as special cases of group actions on positions, providing a principled design space for positional encodings via SO(d) rotations and GL unipotent transformations.
Self-Attention with Relative Position Representations
Mixed citation behavior. Most common role is background (60%).
abstract
Relying entirely on an attention mechanism, the Transformer introduced by Vaswani et al. (2017) achieves state-of-the-art results for machine translation. In contrast to recurrent and convolutional neural networks, it does not explicitly model relative or absolute position information in its structure. Instead, it requires adding representations of absolute positions to its inputs. In this work we present an alternative approach, extending the self-attention mechanism to efficiently consider representations of the relative positions, or distances between sequence elements. On the WMT 2014 English-to-German and English-to-French translation tasks, this approach yields improvements of 1.3 BLEU and 0.3 BLEU over absolute position representations, respectively. Notably, we observe that combining relative and absolute position representations yields no further improvement in translation quality. We describe an efficient implementation of our method and cast it as an instance of relation-aware self-attention mechanisms that can generalize to arbitrary graph-labeled inputs.
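The mechanism the abstract describes can be sketched as follows. This is a minimal, unoptimized illustration, assuming the formulation in which a learned embedding of the clipped offset j - i is added to each key and each value before the usual scaled dot-product attention. The function names (clipped_relative_index, relative_attention), the argument names, and the single-head, already-projected inputs are assumptions made for this example. It is a plain loop, not the efficient implementation the abstract refers to.

```python
import math
import random


def clipped_relative_index(i, j, max_dist):
    """Clip the offset j - i to [-max_dist, max_dist] and shift it to [0, 2*max_dist]."""
    return max(-max_dist, min(max_dist, j - i)) + max_dist


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def relative_attention(q, k_vecs, v_vecs, rel_k, rel_v, max_dist):
    """Single-head attention with relative position embeddings on keys and values.

    q, k_vecs, v_vecs: lists of n vectors of dimension d (assumed already projected).
    rel_k, rel_v: lists of 2*max_dist + 1 vectors of dimension d, one per clipped offset.
    """
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        scores = []
        for j in range(n):
            r = clipped_relative_index(i, j, max_dist)
            # Key for position j as seen from position i: content plus relative embedding.
            key = [a + b for a, b in zip(k_vecs[j], rel_k[r])]
            scores.append(sum(x * y for x, y in zip(q[i], key)) / math.sqrt(d))
        weights = softmax(scores)
        z = [0.0] * d
        for j in range(n):
            r = clipped_relative_index(i, j, max_dist)
            for t in range(d):
                # Value for position j also receives a relative embedding.
                z[t] += weights[j] * (v_vecs[j][t] + rel_v[r][t])
        out.append(z)
    return out


# Small demo with random inputs (sizes are arbitrary illustration values).
random.seed(0)
n, d, max_dist = 5, 4, 2
rand = lambda rows: [[random.gauss(0, 1) for _ in range(d)] for _ in range(rows)]
result = relative_attention(rand(n), rand(n), rand(n),
                            rand(2 * max_dist + 1), rand(2 * max_dist + 1), max_dist)
print(len(result), len(result[0]))
```

Because offsets are clipped, only 2*max_dist + 1 embeddings are needed for keys and the same number for values, regardless of sequence length, and no absolute position signal enters the computation.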
representative citing papers
Selective RoPE adds input-dependent rotations to generalize RoPE, showing implicit positional structure in softmax attention and improving performance on language modeling, copying, state tracking, and retrieval when added to gated transformers.
Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and showing strong masked language modeling results with or without positional embeddings.
A diffusion model for video generation extends image architectures with joint image-video training and improved conditional sampling, delivering first large-scale text-to-video results and state-of-the-art performance on video prediction and unconditional generation benchmarks.
T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colossal Clean Crawled Corpus.
DyWPE generates positional embeddings for time series transformers from the input signal via Discrete Wavelet Transform and outperforms standard positional encodings on ten datasets, especially longer sequences and biomedical signals.
QARMA applies transformer-augmented reinforcement learning to qubit allocation and reuse in modular quantum systems, reporting up to 86% average reduction in inter-core communications versus optimized Qiskit baselines.
ViTaPEs uses two-stage positional encodings in a multimodal transformer to learn task-agnostic visuotactile representations that outperform baselines on recognition tasks, show zero-shot generalization, and improve robotic grasp success prediction.
LOOPE learns a patch ordering for positional embeddings in ViTs and introduces the Three Cell Experiment benchmark that shows 30-35% gaps in positional retention versus the usual 4-6%.
An attention-based DRL agent with Transformer encoder and GNN learns heuristics for qubit-to-core allocation in multi-core quantum systems to minimize state transfers and online compilation time.
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
Autoregressive language models trained on data with middle spans relocated to the end learn infilling without degrading left-to-right perplexity or sampling quality.
HEXST applies a hexagonal shifted-window Transformer with rotary positional encodings, contrast-sensitive training objectives, and single-cell priors to predict gene expression from histology slides, outperforming prior models on seven datasets while preserving spatial heterogeneity.
DW-Net improves the accuracy versus computational cost Pareto front over standard U-Nets for 2D and 3D multi-scale flow benchmarks by stacking multiple waves while keeping training settings identical.
RIHA proposes a hierarchical alignment transformer that uses multi-scale visual and textual feature pyramids plus optimal transport to generate more accurate radiology reports from medical images.
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
Reward-Forcing guides autoregressive video generation with reward feedback to achieve performance comparable to teacher-dependent methods on benchmarks like VBench without relying on distillation.
Fixed-width and decay-based attention mechanisms inspired by working memory improve Transformer grammatical accuracy and human alignment under limited training data.
The paper introduces Musical Attention, an attention variant that incorporates eight musical features including metadata to generate more coherent and varied music than standard or strided attention baselines.
LSTAN-GERPE uses spatio-temporal attention, graph embedding, and grid-searched rotational position encoding to reach high accuracy on the PeMS04 and PeMS08 traffic forecasting datasets without heavy feature engineering.
A survey of positional encoding methods in transformer-based time series models that evaluates fixed, learnable, relative, and hybrid approaches on classification tasks and links effectiveness to data characteristics.
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
citing papers explorer
The Falcon Series of Open Language Models
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.