hub
Neural Machine Translation by Jointly Learning to Align and Translate
51 Pith papers cite this work.
abstract
Neural machine translation is a recently proposed approach to machine translation. Unlike traditional statistical machine translation, neural machine translation aims at building a single neural network that can be jointly tuned to maximize translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consist of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.
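The (soft-)search the abstract describes is an additive alignment model: at each decoder step, every source annotation is scored against the previous decoder state, and the softmax-weighted average of the annotations becomes the context vector for the next prediction. Below is a minimal NumPy sketch of one such step; the parameter names (`W_a`, `U_a`, `v_a`) follow the paper's alignment-model notation, but the standalone function and shape conventions are illustrative assumptions, not the authors' code.

```python
import numpy as np

def additive_attention(s_prev, H, W_a, U_a, v_a):
    """One step of the (soft-)search described in the abstract: score every
    source annotation against the previous decoder state, normalize the
    scores with a softmax, and return the expected (context) vector.

    s_prev : (n,)    previous decoder hidden state s_{i-1}
    H      : (T, m)  encoder annotations h_1 .. h_T, one row per source word
    W_a    : (d, n), U_a : (d, m), v_a : (d,)  alignment-model parameters
    """
    # e_j = v_a^T tanh(W_a s_{i-1} + U_a h_j) -- alignment score per source word
    e = np.tanh(s_prev @ W_a.T + H @ U_a.T) @ v_a   # (T,)
    # alpha = softmax(e) -- soft weights over source positions
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # c_i = sum_j alpha_j h_j -- context vector fed to the decoder
    return alpha @ H, alpha
```

Each target word gets its own weight vector `alpha` over the source positions, which is exactly what the paper's qualitative alignment analysis visualizes.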
hub tools
claims ledger
co-cited works
representative citing papers
A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation (a sketch of the gating rule follows this list).
ACT lets RNNs dynamically adapt computation depth per input via a differentiable halting unit, yielding large gains on synthetic tasks and structural insights on language data.
Neural Turing Machines augment neural networks with differentiable external memory to learn algorithmic tasks such as copying, sorting, and associative recall from examples.
GravityGraphSAGE adapts GraphSAGE with a gravity-inspired decoder to outperform prior graph deep learning methods on directed link prediction across citation networks and 16 real-world graphs.
PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token count by 55% on TIMIT.
ARCH is a hierarchical flow-based generative model that enables tractable conditional intensity computation and arbitrary conditioning for spatiotemporal event distributions.
Transformers converge pathwise to a stochastic particle system and SPDE in the scaling limit, exhibiting synchronization by noise and exponential energy dissipation when common noise is coercive relative to self-attention drift.
A pair selection strategy based on negative similarity dynamics strengthens contrastive supervision in gloss-free sign language translation by reducing noisy negatives.
AlphaEvolve is an LLM-orchestrated evolutionary coding agent that discovered a 4x4 complex matrix multiplication algorithm using 48 scalar multiplications, the first improvement over Strassen's algorithm in 56 years, plus optimizations for Google data centers and hardware.
Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning in transformers.
CodeBLEU improves correlation with human programmer scores on code synthesis tasks by adding syntactic AST matching and semantic data-flow matching to the standard BLEU n-gram approach.
SentencePiece trains subword models directly from raw text to enable language-independent neural text processing.
MS MARCO is a new large-scale machine reading comprehension dataset built from real Bing search queries, human-generated answers, and web passages, supporting three tasks including answer synthesis and passage ranking.
Pre-trained encoder-decoder transformers fine-tuned for sequence-to-sequence constituent parsing outperform prior seq2seq models and compete with specialized parsers on continuous treebanks.
Cubit replaces Transformer attention with Kernel Ridge Regression token mixing and shows potential gains on longer sequences.
BERT learns shortcut solutions that impair generalization and forward transfer in continual LEGO, while ALBERT learns loop-like solutions for better performance, yet both fail at cross-experience composition, with ALBERT rescued by mixed-data training.
mBERT with LoRA achieves the best weighted F1 of 0.62 for Tajik POS tagging on context-free dictionary entries, but macro F1 is only 0.11, with all models failing on rare function words.
VIDA provides 2,500 visually-dependent ambiguous MT instances and LLM-judge metrics; chain-of-thought SFT improves disambiguation accuracy over standard SFT, especially out-of-distribution.
Sequential machine learning on jet declustering history trees outperforms static models at identifying jet quenching in heavy-ion collision simulations.
S4D state space models correspond exactly to wave propagation and nonlinear wave interactions in a one-dimensional ring oscillator network, with a closed-form operator describing the complete input-output map.
PATH gene embeddings in a graph transformer achieve 0.8766 F1 on pancancer metastasis prediction (8.8% over SOTA) and identify disease-state pathway rewiring.
New seq2seq architectures for permutation indexing outperform baselines on synthetic reference-resolution tasks and reduce real decompilation error rates by 42%.
Transformer models applied to simulated water-Cherenkov array data improve gamma-hadron separation and reconstruction of direction, core position, and energy compared to established techniques.
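The mixture-of-experts entry at the top of this list describes a noisy top-k gate: gate logits are perturbed with learned, input-dependent noise, only the k largest survive, and the softmax is taken over those alone, so at most k experts run per input. A minimal NumPy sketch under assumptions of my own (a single input vector `x`, dense gate weights `W_g` and `W_noise`; the layer's load-balancing losses are omitted):

```python
import numpy as np

def noisy_top_k_gating(x, W_g, W_noise, k, rng=None):
    """Noisy top-k gating sketch: only the k experts with the largest
    (noised) logits receive nonzero mixture weight, which is what makes
    the compute sub-linear in the number of experts."""
    if rng is None:
        rng = np.random.default_rng(0)
    clean = x @ W_g                                # raw gate logits, one per expert
    noise_scale = np.log1p(np.exp(x @ W_noise))    # softplus keeps the noise scale positive
    noisy = clean + rng.standard_normal(clean.shape) * noise_scale
    top_k = np.argsort(noisy)[-k:]                 # indices of the k largest logits
    gates = np.zeros_like(noisy)
    w = np.exp(noisy[top_k] - noisy[top_k].max())  # numerically stable softmax over survivors
    gates[top_k] = w / w.sum()
    return gates                                   # zero weight means the expert is skipped
```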
citing papers explorer
- In-context Learning and Induction Heads
Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning in transformers.
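The claim above attributes in-context learning to pattern completion of the form [A][B] ... [A] -> [B]. The toy function below illustrates that completion rule as behavior, not the attention arithmetic that implements it:

```python
def induction_completion(tokens):
    """Toy illustration of induction-head pattern completion: to predict
    the next token, find the most recent earlier occurrence of the current
    token and copy whatever followed it ([A][B] ... [A] -> [B])."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan the history right-to-left
        if tokens[i] == current:
            return tokens[i + 1]               # copy the earlier continuation
    return None                                # no earlier match to complete

# e.g. induction_completion(["the", "cat", "sat", "the"]) -> "cat"
```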