Gaussian distributions are invariant under the mean-field Transformer flow, reducing infinite-dimensional dynamics to a bilinear control system on mean and covariance with explicit reachability and stability results.
super hub Canonical reference
Neural Machine Translation by Jointly Learning to Align and Translate
Canonical reference. 74% of citing Pith papers cite this work as background.
abstract
Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consists of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consists of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in
authors
co-cited works
representative citing papers
Self-supervised ViTs show emergent semantic segmentation and 78.3% k-NN accuracy on ImageNet; DINO reaches 80.1% linear evaluation with ViT-Base.
GPT-f, a transformer-based prover for Metamath, generated new short proofs that were accepted into the main library—the first such contribution from a deep-learning system.
REALM augments language-model pre-training with an unsupervised retriever over Wikipedia documents and reports 4-16% absolute gains on open-domain QA benchmarks over prior implicit and explicit knowledge methods.
A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.
ACT lets RNNs dynamically adapt computation depth per input via a differentiable halting unit, yielding large gains on synthetic tasks and structural insights on language data.
Neural Turing Machines augment neural networks with differentiable external memory to learn algorithmic tasks such as copying, sorting, and associative recall from examples.
EVE is a neural quantum state that enforces exact momentum eigenstates by construction, allowing VMC to variationally solve quasiparticle states across multiple phases in 2D interacting bosons.
GravityGraphSAGE adapts GraphSAGE with a gravity-inspired decoder to outperform prior graph deep learning methods on directed link prediction across citation networks and 16 real-world graphs.
ARCH is a hierarchical flow-based generative model that enables tractable conditional intensity computation and arbitrary conditioning for spatiotemporal event distributions.
Transformers converge pathwise to a stochastic particle system and SPDE in the scaling limit, exhibiting synchronization by noise and exponential energy dissipation when common noise is coercive relative to self-attention drift.
A pair selection strategy based on negative similarity dynamics strengthens contrastive supervision in gloss-free sign language translation by reducing noisy negatives.
RACE Attention is a strictly linear-time attention mechanism that approximates softmax attention outputs using Gaussian projections and soft LSH to enable training on contexts up to 12 million tokens.
AlphaEvolve is an LLM-orchestrated evolutionary coding agent that discovered a 4x4 complex matrix multiplication algorithm using 48 scalar multiplications, the first improvement over Strassen's algorithm in 56 years, plus optimizations for Google data centers and hardware.
Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.
Griffin hybrid model matches Llama-2 performance while trained on over 6 times fewer tokens and offers lower inference latency with higher throughput.
Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning in transformers.
GAT uses static attention where neighbor rankings ignore the query node and thus cannot express some graph problems; GATv2 enables dynamic attention and outperforms GAT on 11 OGB and other benchmarks with equal parameters.
CodeBLEU improves correlation with human programmer scores on code synthesis tasks by adding syntactic AST matching and semantic data-flow matching to the standard BLEU n-gram approach.
RUSLAN is a 31-hour single-speaker Russian speech corpus for TTS containing 22200 annotated samples, with a baseline end-to-end model scoring 4.05 naturalness and 3.78 intelligibility on MOS tests.
MCAN stacks modular co-attention layers to reach 70.63% accuracy on VQA-v2 test-dev, outperforming prior state-of-the-art models.
Introduces an eight-class taxonomy for semantic image-text relations based on three metrics and a multimodal embedding model for predicting the classes from collected data.
SentencePiece trains subword models directly from raw text to enable language-independent neural text processing.
MS MARCO is a new large-scale machine reading comprehension dataset built from real Bing search queries, human-generated answers, and web passages, supporting three tasks including answer synthesis and passage ranking.
citing papers explorer
-
Reachability and asymptotics of Gaussian Transformer dynamics
Gaussian distributions are invariant under the mean-field Transformer flow, reducing infinite-dimensional dynamics to a bilinear control system on mean and covariance with explicit reachability and stability results.
-
Emerging Properties in Self-Supervised Vision Transformers
Self-supervised ViTs show emergent semantic segmentation and 78.3% k-NN accuracy on ImageNet; DINO reaches 80.1% linear evaluation with ViT-Base.
-
Generative Language Modeling for Automated Theorem Proving
GPT-f, a transformer-based prover for Metamath, generated new short proofs that were accepted into the main library—the first such contribution from a deep-learning system.
-
REALM: Retrieval-Augmented Language Model Pre-Training
REALM augments language-model pre-training with an unsupervised retriever over Wikipedia documents and reports 4-16% absolute gains on open-domain QA benchmarks over prior implicit and explicit knowledge methods.
-
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.
-
Adaptive Computation Time for Recurrent Neural Networks
ACT lets RNNs dynamically adapt computation depth per input via a differentiable halting unit, yielding large gains on synthetic tasks and structural insights on language data.
-
Neural Turing Machines
Neural Turing Machines augment neural networks with differentiable external memory to learn algorithmic tasks such as copying, sorting, and associative recall from examples.
-
Continuum Neural Momentum Eigenstate for Variationally Solving Quasiparticles
EVE is a neural quantum state that enforces exact momentum eigenstates by construction, allowing VMC to variationally solve quasiparticle states across multiple phases in 2D interacting bosons.
-
GravityGraphSAGE: Link Prediction in Directed Attributed Graphs
GravityGraphSAGE adapts GraphSAGE with a gravity-inspired decoder to outperform prior graph deep learning methods on directed link prediction across citation networks and 16 real-world graphs.
-
Arbitrarily Conditioned Hierarchical Flows for Spatiotemporal Events
ARCH is a hierarchical flow-based generative model that enables tractable conditional intensity computation and arbitrary conditioning for spatiotemporal event distributions.
-
Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models
Transformers converge pathwise to a stochastic particle system and SPDE in the scaling limit, exhibiting synchronization by noise and exponential energy dissipation when common noise is coercive relative to self-attention drift.
-
Selective Contrastive Learning For Gloss Free Sign Language Translation
A pair selection strategy based on negative similarity dynamics strengthens contrastive supervision in gloss-free sign language translation by reducing noisy negatives.
-
RACE Attention: A Strictly Linear-Time Attention Layer for Training on Outrageously Large Contexts
RACE Attention is a strictly linear-time attention mechanism that approximates softmax attention outputs using Gaussian projections and soft LSH to enable training on contexts up to 12 million tokens.
-
AlphaEvolve: A coding agent for scientific and algorithmic discovery
AlphaEvolve is an LLM-orchestrated evolutionary coding agent that discovered a 4x4 complex matrix multiplication algorithm using 48 scalar multiplications, the first improvement over Strassen's algorithm in 56 years, plus optimizations for Google data centers and hardware.
-
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.
-
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
Griffin hybrid model matches Llama-2 performance while trained on over 6 times fewer tokens and offers lower inference latency with higher throughput.
-
In-context Learning and Induction Heads
Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning in transformers.
-
How Attentive are Graph Attention Networks?
GAT uses static attention where neighbor rankings ignore the query node and thus cannot express some graph problems; GATv2 enables dynamic attention and outperforms GAT on 11 OGB and other benchmarks with equal parameters.
-
CodeBLEU: a Method for Automatic Evaluation of Code Synthesis
CodeBLEU improves correlation with human programmer scores on code synthesis tasks by adding syntactic AST matching and semantic data-flow matching to the standard BLEU n-gram approach.
-
RUSLAN: Russian Spoken Language Corpus for Speech Synthesis
RUSLAN is a 31-hour single-speaker Russian speech corpus for TTS containing 22200 annotated samples, with a baseline end-to-end model scoring 4.05 naturalness and 3.78 intelligibility on MOS tests.
-
Deep Modular Co-Attention Networks for Visual Question Answering
MCAN stacks modular co-attention layers to reach 70.63% accuracy on VQA-v2 test-dev, outperforming prior state-of-the-art models.
-
Understanding, Categorizing and Predicting Semantic Image-Text Relations
Introduces an eight-class taxonomy for semantic image-text relations based on three metrics and a multimodal embedding model for predicting the classes from collected data.
-
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
SentencePiece trains subword models directly from raw text to enable language-independent neural text processing.
-
MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
MS MARCO is a new large-scale machine reading comprehension dataset built from real Bing search queries, human-generated answers, and web passages, supporting three tasks including answer synthesis and passage ranking.
-
Revisiting Neural Processes via Fourier Transform and Volterra Series
Introduces SFConvCNPs and SFVConvCNPs using set Fourier convolutions and Volterra expansions for translation-equivariant neural processes on irregular data with global receptive fields and linear scaling.
-
General-Purpose Photonic Computing Primitive for Contemporary Artificial Intelligence
DUET is a photonic tensor core paradigm that uses structural symmetry in VODICs to support arbitrary signed operands directly, experimentally tested on image classification, segmentation, and Transformer tasks.
-
AutoMCU: Feasibility-First MCU Neural Network Customization via LLM-based Multi-Agent Systems
AutoMCU uses feasibility-first LLM multi-agent coordination to automate MCU-constrained neural network design, delivering competitive accuracy on CIFAR-10/100 in 1-2 hours versus hundreds of GPU hours for prior HW-NAS methods.
-
Structured Neural Marked Point Processes for Interpretable Event Interaction Modeling
SNMPP builds a product-form neural influence kernel from a signed interaction network over event classes and a delay-aware monotonic temporal network to enable explicit discovery of inter-event relationships alongside strong prediction.
-
Exploiting Pre-trained Encoder-Decoder Transformers for Sequence-to-Sequence Constituent Parsing
Pre-trained encoder-decoder transformers fine-tuned for sequence-to-sequence constituent parsing outperform prior seq2seq models and compete with specialized parsers on continuous treebanks.
-
Shortcut Solutions Learned by Transformers Impair Continual Compositional Reasoning
BERT learns shortcut solutions that impair generalization and forward transfer in continual LEGO, while ALBERT learns loop-like solutions for better performance, yet both fail at cross-experience composition, with ALBERT rescued by mixed-data training.
-
Benchmarking POS Tagging for the Tajik Language: A Comparative Study of Neural Architectures on the TajPersParallel Corpus
mBERT with LoRA achieves the best weighted F1 of 0.62 for Tajik POS tagging on context-free dictionary entries, but macro F1 is only 0.11, with all models failing on rare function words.
-
An explicit operator explains end-to-end computation in the modern neural networks used for sequence and language modeling
S4D state space models correspond exactly to wave propagation and nonlinear wave interactions in a one-dimensional ring oscillator network, with a closed-form operator describing the complete input-output map.
-
Graph Transformer-Based Pathway Embedding for Cancer Prognosis
PATH gene embeddings in a graph transformer achieve 0.8766 F1 on pancancer metastasis prediction (8.8% over SOTA) and identify disease-state pathway rewiring.
-
Neural architectures for resolving references in program code
New seq2seq architectures for permutation indexing outperform baselines on synthetic reference-resolution tasks and reduce real decompilation error rates by 42%.
-
Enhancing event reconstruction for $\gamma$-ray particle detector arrays using transformers
Transformer models applied to simulated water-Cherenkov array data improve gamma-hadron separation and reconstruction of direction, core position, and energy compared to established techniques.
-
Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space
PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.
-
TiledAttention: a CUDA Tile SDPA Kernel for PyTorch
TiledAttention is a cuTile-based SDPA kernel that balances performance with Python-level customizability for attention research in PyTorch.
-
Interpretable Alzheimer's Diagnosis via Multimodal Fusion of Regional Brain Experts
MREF-AD applies a mixture-of-experts architecture to regional brain imaging data from multiple modalities to achieve competitive Alzheimer's diagnosis accuracy on ADNI while supplying region- and modality-level interpretability.
-
DyWPE: Signal-Aware Dynamic Wavelet Positional Encoding for Time Series Transformers
DyWPE generates positional embeddings for time series transformers from the input signal via Discrete Wavelet Transform and outperforms standard positional encodings on ten datasets, especially longer sequences and biomedical signals.
-
Category-based Galaxy Image Generation via Diffusion Models
GalCatDiff applies category embeddings and a novel Astro-RAB block inside diffusion models to produce galaxy images whose color and size distributions match observations more closely than prior generative approaches.
-
A PennyLane-Centric Dataset to Enhance LLM-based Quantum Code Generation using RAG
PennyLang dataset of 3,347 PennyLane samples boosts LLM code generation success via RAG from 8.7% to 41.7% for Qwen 7B and 78.8% to 84.8% for LLaMa 4.
-
Titans: Learning to Memorize at Test Time
Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.
-
An Empirical Study of Mamba-based Language Models
An 8B Mamba-2-Hybrid with 43% Mamba-2, 7% attention, and 50% MLP layers exceeds an 8B Transformer by 2.65 points on average across 12 tasks and matches it on 23 long-context tasks while enabling up to 8x faster inference.
-
animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics
Introduces animal2vec, a self-supervised transformer for sparse bioacoustic audio, and the MeerKAT meerkat vocalization dataset, claiming outperformance over baselines including in few-shot settings.
-
Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive
DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods and outperforms standard DPO on diverse datasets, MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.
-
Revisiting Feature Prediction for Learning Visual Representations from Video
V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
-
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
SPIN lets weak LLMs become strong by self-generating training data from previous model versions and training to prefer human-annotated responses over its own outputs, outperforming DPO even with extra GPT-4 data on benchmarks.
-
The Falcon Series of Open Language Models
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
-
Vision Transformers Need Registers
Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
-
DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory
DragNUWA integrates text, image, and trajectory controls into a diffusion video model using a Trajectory Sampler, Multiscale Fusion, and Adaptive Training to enable fine-grained open-domain video generation.