citation dossier
Neural Machine Translation by Jointly Learning to Align and Translate
Neural machine translation is a recently proposed approach to machine translation. Unlike traditional statistical machine translation, neural machine translation aims at building a single neural network that can be jointly tuned to maximize translation performance. The models recently proposed for neural machine translation often belong to a family of encoder-decoders and consist of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend it by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.
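The (soft-)search described above is the paper's additive attention: a small feed-forward network scores how well each encoder annotation matches the current decoder state, the scores are softmax-normalized into alignment weights, and the context vector is the weighted sum of the annotations. A minimal NumPy sketch of that scoring step follows; the weight names (W_s, W_h, v), the shapes, and the random parameters are illustrative choices for this example, not values taken from the paper.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(s_prev, H, W_s, W_h, v):
    """Bahdanau-style soft alignment (sketch).

    s_prev : (d_s,)      previous decoder hidden state
    H      : (T, d_h)    encoder annotations h_1..h_T
    W_s    : (d_a, d_s)  projection of the decoder state
    W_h    : (d_a, d_h)  projection of the annotations
    v      : (d_a,)      scoring vector

    Returns the alignment weights alpha (T,) and the context vector (d_h,).
    """
    # e_j = v^T tanh(W_s s_{i-1} + W_h h_j): one score per source position
    scores = np.tanh(H @ W_h.T + W_s @ s_prev) @ v
    alpha = softmax(scores)   # soft alignment over source positions
    context = alpha @ H       # weighted sum of annotations
    return alpha, context

# Tiny usage example with random parameters
rng = np.random.default_rng(0)
T, d_h, d_s, d_a = 5, 4, 3, 6
alpha, ctx = additive_attention(
    rng.normal(size=d_s), rng.normal(size=(T, d_h)),
    rng.normal(size=(d_a, d_s)), rng.normal(size=(d_a, d_h)),
    rng.normal(size=d_a),
)
assert np.isclose(alpha.sum(), 1.0) and ctx.shape == (d_h,)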
why this work matters in Pith
Pith has found this work cited in 49 reviewed papers. Its strongest current cluster is cs.LG (17 papers). The largest review-status bucket among citing papers is UNVERDICTED (41 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.
representative citing papers
A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation (a minimal gating sketch follows this list).
ACT lets RNNs dynamically adapt computation depth per input via a differentiable halting unit, yielding large gains on synthetic tasks and structural insights on language data.
Neural Turing Machines augment neural networks with differentiable external memory to learn algorithmic tasks such as copying, sorting, and associative recall from examples.
GravityGraphSAGE adapts GraphSAGE with a gravity-inspired decoder to outperform prior graph deep learning methods on directed link prediction across citation networks and 16 real-world graphs.
PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token count by 55% on TIMIT.
ARCH is a hierarchical flow-based generative model that enables tractable conditional intensity computation and arbitrary conditioning for spatiotemporal event distributions.
Transformers converge pathwise to a stochastic particle system and SPDE in the scaling limit, exhibiting synchronization by noise and exponential energy dissipation when common noise is coercive relative to self-attention drift.
A pair selection strategy based on negative similarity dynamics strengthens contrastive supervision in gloss-free sign language translation by reducing noisy negatives.
AlphaEvolve is an LLM-orchestrated evolutionary coding agent that discovered a 4x4 complex matrix multiplication algorithm using 48 scalar multiplications, the first improvement over Strassen's algorithm in 56 years, plus optimizations for Google data centers and hardware.
Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning in transformers.
SentencePiece trains subword models directly from raw text to enable language-independent neural text processing.
MS MARCO is a new large-scale machine reading comprehension dataset built from real Bing search queries, human-generated answers, and web passages, supporting three tasks including answer synthesis and passage ranking.
Cubit replaces Transformer attention with Kernel Ridge Regression token mixing and shows potential gains on longer sequences.
BERT learns shortcut solutions that impair generalization and forward transfer in continual LEGO, while ALBERT learns loop-like solutions for better performance, yet both fail at cross-experience composition, with ALBERT rescued by mixed-data training.
mBERT with LoRA achieves the best weighted F1 of 0.62 for Tajik POS tagging on context-free dictionary entries, but macro F1 is only 0.11, with all models failing on rare function words.
VIDA provides 2,500 visually-dependent ambiguous MT instances and LLM-judge metrics; chain-of-thought SFT improves disambiguation accuracy over standard SFT, especially out-of-distribution.
Sequential machine learning on jet declustering history trees outperforms static models at identifying jet quenching in heavy-ion collision simulations.
S4D state space models correspond exactly to wave propagation and nonlinear wave interactions in a one-dimensional ring oscillator network, with a closed-form operator describing the complete input-output map.
PATH gene embeddings in a graph transformer achieve 0.8766 F1 on pancancer metastasis prediction (8.8% over SOTA) and identify disease-state pathway rewiring.
New seq2seq architectures for permutation indexing outperform baselines on synthetic reference-resolution tasks and reduce real decompilation error rates by 42%.
Transformer models applied to simulated water-Cherenkov array data improve gamma-hadron separation and reconstruction of direction, core position, and energy compared to established techniques.
PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.
V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
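For the sparsely-gated mixture-of-experts entry above, the noisy top-k gate adds input-dependent Gaussian noise to the gating logits, keeps only the k largest, and softmax-normalizes them so that just k experts run per token. A rough NumPy sketch under those assumptions; the weight names (W_g, W_noise) and the toy experts are illustrative placeholders, not the cited implementation.

import numpy as np

def noisy_top_k_gating(x, W_g, W_noise, k, rng):
    """Noisy top-k gating for a sparsely-gated MoE layer (sketch).

    x       : (d,)   input token representation
    W_g     : (d, n) clean gating weights for n experts
    W_noise : (d, n) weights setting the per-expert noise scale
    k       : number of experts to keep active
    """
    clean = x @ W_g
    noise_scale = np.logaddexp(0.0, x @ W_noise)           # softplus
    noisy = clean + rng.standard_normal(clean.shape) * noise_scale
    # keep the top-k logits, push the rest to -inf before the softmax
    topk = np.argsort(noisy)[-k:]
    masked = np.full_like(noisy, -np.inf)
    masked[topk] = noisy[topk]
    gates = np.exp(masked - masked[topk].max())
    gates /= gates.sum()                                    # nonzero only on top-k
    return gates, topk

def moe_layer(x, experts, W_g, W_noise, k, rng):
    """Combine only the k selected experts, weighted by their gates."""
    gates, topk = noisy_top_k_gating(x, W_g, W_noise, k, rng)
    return sum(gates[i] * experts[i](x) for i in topk)

# Tiny usage example with toy experts
rng = np.random.default_rng(0)
d, n, k = 8, 4, 2
experts = [lambda x, i=i: np.tanh(x) * (i + 1) for i in range(n)]
y = moe_layer(rng.normal(size=d), experts,
              rng.normal(size=(d, n)), rng.normal(size=(d, n)), k, rng)
assert y.shape == (d,)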
citing papers explorer
-
Characterizing the Expressivity of Local Attention in Transformers
Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by adding a second past operator in linear temporal logic, with global and local attention being expressively complementary.
-
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.
-
Adaptive Computation Time for Recurrent Neural Networks
ACT lets RNNs dynamically adapt computation depth per input via a differentiable halting unit, yielding large gains on synthetic tasks and structural insights on language data.
-
Neural Turing Machines
Neural Turing Machines augment neural networks with differentiable external memory to learn algorithmic tasks such as copying, sorting, and associative recall from examples.
-
GravityGraphSAGE: Link Prediction in Directed Attributed Graphs
GravityGraphSAGE adapts GraphSAGE with a gravity-inspired decoder to outperform prior graph deep learning methods on directed link prediction across citation networks and 16 real-world graphs.
-
PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token count by 55% on TIMIT.
-
Arbitrarily Conditioned Hierarchical Flows for Spatiotemporal Events
ARCH is a hierarchical flow-based generative model that enables tractable conditional intensity computation and arbitrary conditioning for spatiotemporal event distributions.
-
Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models
Transformers converge pathwise to a stochastic particle system and SPDE in the scaling limit, exhibiting synchronization by noise and exponential energy dissipation when common noise is coercive relative to self-attention drift.
-
Selective Contrastive Learning For Gloss Free Sign Language Translation
A pair selection strategy based on negative similarity dynamics strengthens contrastive supervision in gloss-free sign language translation by reducing noisy negatives.
-
AlphaEvolve: A coding agent for scientific and algorithmic discovery
AlphaEvolve is an LLM-orchestrated evolutionary coding agent that discovered a 4x4 complex matrix multiplication algorithm using 48 scalar multiplications, the first improvement over Strassen's algorithm in 56 years, plus optimizations for Google data centers and hardware.
-
In-context Learning and Induction Heads
Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning in transformers.
-
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
SentencePiece trains subword models directly from raw text to enable language-independent neural text processing.
-
MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
MS MARCO is a new large-scale machine reading comprehension dataset built from real Bing search queries, human-generated answers, and web passages, supporting three tasks including answer synthesis and passage ranking.
-
Cubit: Token Mixer with Kernel Ridge Regression
Cubit replaces Transformer attention with Kernel Ridge Regression token mixing and shows potential gains on longer sequences.
-
Shortcut Solutions Learned by Transformers Impair Continual Compositional Reasoning
BERT learns shortcut solutions that impair generalization and forward transfer in continual LEGO, while ALBERT learns loop-like solutions for better performance, yet both fail at cross-experience composition, with ALBERT rescued by mixed-data training.
-
Benchmarking POS Tagging for the Tajik Language: A Comparative Study of Neural Architectures on the TajPersParallel Corpus
mBERT with LoRA achieves the best weighted F1 of 0.62 for Tajik POS tagging on context-free dictionary entries, but macro F1 is only 0.11, with all models failing on rare function words.
-
A Multimodal Dataset for Visually Grounded Ambiguity in Machine Translation
VIDA provides 2,500 visually-dependent ambiguous MT instances and LLM-judge metrics; chain-of-thought SFT improves disambiguation accuracy over standard SFT, especially out-of-distribution.
-
Jet Quenching Identification via Supervised Learning in Simulated Heavy-Ion Collisions
Sequential machine learning on jet declustering history trees outperforms static models at identifying jet quenching in heavy-ion collision simulations.
-
An explicit operator explains end-to-end computation in the modern neural networks used for sequence and language modeling
S4D state space models correspond exactly to wave propagation and nonlinear wave interactions in a one-dimensional ring oscillator network, with a closed-form operator describing the complete input-output map.
-
Graph Transformer-Based Pathway Embedding for Cancer Prognosis
PATH gene embeddings in a graph transformer achieve 0.8766 F1 on pancancer metastasis prediction (8.8% over SOTA) and identify disease-state pathway rewiring.
-
Neural architectures for resolving references in program code
New seq2seq architectures for permutation indexing outperform baselines on synthetic reference-resolution tasks and reduce real decompilation error rates by 42%.
-
Enhancing event reconstruction for γ-ray particle detector arrays using transformers
Transformer models applied to simulated water-Cherenkov array data improve gamma-hadron separation and reconstruction of direction, core position, and energy compared to established techniques.
-
Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space
PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.
-
Revisiting Feature Prediction for Learning Visual Representations from Video
V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
-
Vision Transformers Need Registers
Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
-
On the Opportunities and Risks of Foundation Models
Foundation models are large adaptable AI systems with emergent capabilities that offer broad opportunities but carry risks from homogenization, opacity, and inherited defects across downstream applications.
-
Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.
-
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
GShard supplies automatic sharding and conditional computation support that enabled training a 600-billion-parameter multilingual translation model on thousands of TPUs with superior quality.
-
Universal Transformers
Universal Transformers combine Transformer parallelism with recurrent updates and dynamic halting to achieve Turing-completeness under assumptions and outperform standard Transformers on algorithmic and language tasks.
-
Attention U-Net: Learning Where to Look for the Pancreas
Attention gates added to U-Net automatically focus on target organs in CT images and improve segmentation performance on abdominal datasets.
-
Kaczmarz Linear Attention
Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack performance.
-
Boosting Automatic Java-to-Cangjie Translation with Multi-Stage LLM Training and Error Repair
Multi-stage LLM training plus compiler-guided error repair boosts functional equivalence in Java-to-Cangjie translation by 6.06% over prior methods despite scarce parallel data.
-
Adaptive Memory Decay for Log-Linear Attention
Making memory decay input-dependent via a lightweight MLP improves log-linear attention performance on associative recall, selective copying, and language modeling, especially for long sequences.
-
Neural Equalisers for Highly Compressed Faster-than-Nyquist Signalling: Design, Performance, Complexity and Robustness
Deep learning receivers enable reliable FTN signaling with up to 75% spectral compression via sliding-window detection while maintaining low latency and robustness to channel variations.
-
Beyond the Final Label: Exploiting the Untapped Potential of Classification Histories in Astronomical Light Curve Analysis
An RNN-plus-attention model that ingests classification histories outperforms standard final-label classifiers on ELAsTiCC synthetic data and is accompanied by new Wasserstein-based metrics for temporal stability and early performance.
-
Topological Dualities for Modal Algebras
A family of dualities links modal frames to relational spaces, with simplifications for semicontinuous relations that match modal axioms to relational properties.
-
An Explainable Approach to Document-level Translation Evaluation with Topic Modeling
A topic-modeling framework measures document-level thematic consistency in translations by aligning key tokens across languages with a bilingual dictionary and scoring via cosine similarity, providing explainable insights beyond sentence-level metrics.
-
MambaSL: Exploring Single-Layer Mamba for Time Series Classification
A single-layer Mamba variant with targeted redesigns sets new state-of-the-art average performance on all 30 UEA time series classification datasets under a unified reproducible protocol.
-
MyoVision: A Mobile Research Tool and NEATBoost-Attention Ensemble Framework for Real Time Chicken Breast Myopathy Detection
Smartphone transillumination imaging paired with a neuroevolution-tuned ensemble model classifies chicken breast myopathies at 82.4% accuracy on 336 fillets, matching costly hyperspectral systems.
-
Towards Automated Pentesting with Large Language Models
RedShell fine-tunes LLMs on enhanced malicious PowerShell data to produce syntactically valid offensive code for pentesting, reporting over 90% validity, strong semantic match to references, and better edit-distance similarity than prior methods plus functional execution success.
-
Attention Is All You Need
The Transformer dispenses with recurrence and convolutions in favor of self-attention alone, and sets new state-of-the-art BLEU scores on WMT 2014 English-to-German and English-to-French translation.
-
Group Cognition Learning: Making Everything Better Through Governed Two-Stage Agents Collaboration
Group Cognition Learning uses governed two-stage agents after separate modality encoding to mitigate dominance and spurious coupling, reporting state-of-the-art results on CMU-MOSI, CMU-MOSEI, and MIntRec for regression and classification.
-
Text Style Transfer with Machine Translation for Graphic Designs
Custom tag methods with NMT and LLMs for word alignment in text style transfer perform no better than standard attention-based alignment from NMT models.
-
JSSFF: A Joint Structural-Semantic Fusion Framework for Remote Sensing Image Captioning
JSSFF improves remote sensing image captioning by fusing structural edge details with semantic features in an encoder-decoder model and using fairness-based beam search, outperforming baselines on quantitative and qualitative measures.
-
Sinkhorn doubly stochastic attention rank decay analysis
Sinkhorn-normalized doubly stochastic attention preserves rank more effectively than Softmax row-stochastic attention, with both showing doubly exponential rank decay to one with network depth.
-
Video-guided Machine Translation with Global Video Context
A globally video-guided multimodal translation framework retrieves semantically related video segments with a vector database and applies attention mechanisms to improve subtitle translation accuracy in long videos.
-
LLMs Struggle with Abstract Meaning Comprehension More Than Expected
LLMs struggle with abstract meaning comprehension on SemEval-2021 Task 4 more than fine-tuned models, and a new bidirectional attention classifier yields small accuracy gains of 3-4%.
-
Adaptive Dual Residual U-Net with Attention Gate and Multiscale Spatial Attention Mechanisms (ADRUwAMS)
ADRUwAMS reports Dice scores of 0.9229 (whole tumor), 0.8432 (tumor core), and 0.8004 (enhancing tumor) on BraTS 2020 after training on BraTS 2019/2020 datasets.
-
Lecture Notes on Statistical Physics and Neural Networks
Lecture notes that treat statistical physics as probability theory and connect Ising models, spin glasses, and renormalization group ideas to Hopfield networks, restricted Boltzmann machines, and large language models.