pith. machine review for the scientific record.


Neural Machine Translation by Jointly Learning to Align and Translate

hub · 10+ Pith inbound or 1,000+ external citations · 49 Pith inbound

Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio · 2014 · cs.CL · arXiv 1409.0473

Neural machine translation is a recently proposed approach to machine translation. Unlike traditional statistical machine translation, neural machine translation aims at building a single neural network that can be jointly tuned to maximize translation performance. The models recently proposed for neural machine translation often belong to a family of encoder-decoders, consisting of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.
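The (soft-)search described in the abstract is additive attention: each encoder annotation is scored against the previous decoder state, the scores are softmax-normalized into alignment probabilities, and the context vector is the resulting weighted sum of annotations. A minimal NumPy sketch of that computation — all dimensions, weight names, and the random inputs are illustrative assumptions, not taken from the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def additive_attention(s_prev, H, W_a, U_a, v):
    """Bahdanau-style soft-search: score each source annotation h_j
    against the previous decoder state s_prev, softmax the scores into
    alignment probabilities, and return the expected (context) vector."""
    scores = np.tanh(s_prev @ W_a.T + H @ U_a.T) @ v   # e_ij, shape (T_src,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                           # alignment probabilities
    context = weights @ H                              # weighted sum of annotations
    return context, weights

T_src, d_h, d_s, d_a = 5, 8, 8, 16             # hypothetical sizes
H = rng.standard_normal((T_src, d_h))          # encoder annotations h_1..h_T
s_prev = rng.standard_normal(d_s)              # previous decoder state
W_a = rng.standard_normal((d_a, d_s))
U_a = rng.standard_normal((d_a, d_h))
v = rng.standard_normal(d_a)

context, weights = additive_attention(s_prev, H, W_a, U_a, v)
```

Because the weights form a probability distribution over source positions, no hard segmentation is needed: every annotation contributes, in proportion to its learned relevance.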

49 Pith papers citing it
50 reference links
cs.LG · top field · 17 papers
UNVERDICTED · top verdict bucket · 41 papers

open full Pith review

why this work matters in Pith

Pith tracks Neural Machine Translation by Jointly Learning to Align and Translate because it crossed a citation-hub threshold.

hub tools

claims ledger


co-cited works


representative citing papers

Characterizing the Expressivity of Local Attention in Transformers

cs.CL · 2026-05-01 · unverdicted · novelty 8.0

Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by adding a second past operator in linear temporal logic, with global and local attention being expressively complementary.

Adaptive Computation Time for Recurrent Neural Networks

cs.NE · 2016-03-29 · accept · novelty 8.0

ACT lets RNNs dynamically adapt computation depth per input via a differentiable halting unit, yielding large gains on synthetic tasks and structural insights on language data.
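The halting mechanism summarized above can be sketched as follows: the RNN steps repeatedly on the same input, a sigmoidal halting unit emits a probability per step, and stepping stops once the accumulated probability reaches 1 − ε, with the final step receiving the remainder so the weights sum to one. The step and halting functions below are toy stand-ins, not the paper's architecture:

```python
import numpy as np

def act_pondering(step_fn, halt_fn, state, x, eps=0.01, max_steps=10):
    """Adaptive Computation Time sketch: accumulate per-step halting
    probabilities until they reach 1 - eps; the last step gets the
    remainder, and the output is the halting-weighted mean of states."""
    total, outputs, weights = 0.0, [], []
    for n in range(max_steps):
        state = step_fn(state, x)
        p = halt_fn(state)
        if total + p >= 1.0 - eps or n == max_steps - 1:
            weights.append(1.0 - total)   # remainder R
            outputs.append(state)
            break
        total += p
        weights.append(p)
        outputs.append(state)
    return sum(w * s for w, s in zip(weights, outputs)), len(weights)

# toy step and halting functions (hypothetical, for illustration only)
rng = np.random.default_rng(1)
W = rng.standard_normal((4, 4)) * 0.5
step = lambda s, x: np.tanh(W @ s + x)
halt = lambda s: 1.0 / (1.0 + np.exp(-s.sum()))   # sigmoid halting unit

out, n_steps = act_pondering(step, halt, np.zeros(4), rng.standard_normal(4))
```

Since the halting weights are continuous, the whole pondering loop remains differentiable and the expected number of steps can be penalized directly in the loss.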

Neural Turing Machines

cs.NE · 2014-10-20 · unverdicted · novelty 8.0

Neural Turing Machines augment neural networks with differentiable external memory to learn algorithmic tasks such as copying, sorting, and associative recall from examples.
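The differentiable external memory mentioned above can be illustrated with a content-based read: cosine similarity between a read key and each memory row, sharpened and softmax-normalized into a fully differentiable attention over slots. A minimal sketch under assumed sizes (memory layout and the β sharpening parameter here are illustrative):

```python
import numpy as np

def content_read(M, key, beta=1.0):
    """NTM-style content addressing: cosine similarity between the key
    and each memory row, sharpened by beta, softmax-normalized, then
    used as differentiable weights for the read vector."""
    sims = (M @ key) / (np.linalg.norm(M, axis=1) * np.linalg.norm(key) + 1e-8)
    w = np.exp(beta * sims)
    w /= w.sum()
    return w @ M, w   # read vector and attention weights

rng = np.random.default_rng(2)
M = rng.standard_normal((8, 16))           # 8 memory slots of width 16
r, w = content_read(M, M[3], beta=10.0)    # keying with a stored row
```

Because reads (and, analogously, writes) are soft weightings rather than hard indexing, gradients flow through the memory, letting the controller learn algorithmic behaviors end-to-end.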

AlphaEvolve: A coding agent for scientific and algorithmic discovery

cs.AI · 2025-06-16 · unverdicted · novelty 7.0

AlphaEvolve is an LLM-orchestrated evolutionary coding agent that discovered a 4x4 complex matrix multiplication algorithm using 48 scalar multiplications, the first improvement over Strassen's algorithm in 56 years, plus optimizations for Google data centers and hardware.

In-context Learning and Induction Heads

cs.LG · 2022-09-24 · unverdicted · novelty 7.0

Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning in transformers.

MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

cs.CL · 2016-11-28 · accept · novelty 7.0

MS MARCO is a new large-scale machine reading comprehension dataset built from real Bing search queries, human-generated answers, and web passages, supporting three tasks including answer synthesis and passage ranking.
