citation dossier

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk · 2014 · cs.CL · arXiv 1406.1078

In this paper, we propose a novel neural network model called RNN Encoder-Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder-Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.

27Pith papers citing it

28reference links

cs.LGtop field · 10 papers

UNVERDICTEDtop verdict bucket · 21 papers

open full Pith review

why this work matters in Pith

Pith has found this work in 27 reviewed papers. Its strongest current cluster is cs.LG (10 papers). The largest review-status bucket among citing papers is UNVERDICTED (21 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.

representative citing papers

Zero-shot Imitation Learning by Latent Topology Mapping

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

ZALT learns latent hub states and hub-to-hub dynamics from demonstrations to plan zero-shot solutions for unseen start-goal tasks, achieving 55% success in a 3D maze versus 6% for baselines.

Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement

cs.CV · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

NOVA represents world states as INR weights for decoder-free rendering, compactness, and unsupervised disentanglement of background, foreground, and motion in video world models.

How Long Does Infinite Width Last? Signal Propagation in Long-Range Linear Recurrences

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

In linear recurrent models, infinite-width signal propagation remains accurate only for depths t much smaller than sqrt(width n), with a critical regime at t ~ c sqrt(n) where finite-width effects emerge and dominate for larger t.

FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connections for better information flow.

Geometry-Induced Long-Range Correlations in Recurrent Neural Network Quantum States

quant-ph · 2026-04-09 · conditional · novelty 7.0

Dilated RNN wave functions induce power-law correlations for the critical 1D transverse-field Ising model and the Cluster state, unlike the exponential decay of conventional RNN ansatze.

A Clinical Point Cloud Paradigm for In-Hospital Mortality Prediction from Multi-Level Incomplete Multimodal EHRs

cs.LG · 2026-04-06 · unverdicted · novelty 7.0

HealthPoint represents clinical events as points in a 4D space (content, time, modality, case) and applies low-rank relational attention to achieve state-of-the-art mortality prediction from multi-level incomplete multimodal EHRs.

Mastering Diverse Domains through World Models

cs.AI · 2023-01-10 · unverdicted · novelty 7.0

DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.

Graph Attention Networks

stat.ML · 2017-10-30 · accept · novelty 7.0

Graph Attention Networks compute learnable attention coefficients over node neighborhoods to produce weighted feature aggregations, achieving state-of-the-art results on citation networks and inductive protein-protein interaction graphs.

Mixed Precision Training

cs.AI · 2017-10-10 · accept · novelty 7.0

Mixed precision training uses FP16 for most computations, FP32 master weights for accumulation, and loss scaling to enable accurate training of large DNNs with halved memory usage.

3DGS$^3$: Joint Super Sampling and Frame Interpolation for Real-Time Large-Scale 3DGS Rendering

cs.GR · 2026-05-12 · unverdicted · novelty 6.0

3DGS³ adds gradient-guided super-sampling and lightweight temporal interpolation to low-resolution 3DGS renders to produce high-resolution, high-frame-rate output without retraining the underlying scene representation.

Graph Federated Unlearning for Privacy Preservation

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

Orthogonal unlearning updates plus server-side virtual clients enable effective user data removal in graph federated learning without major performance loss.

Deep Kernel Learning for Stratifying Glaucoma Trajectories

cs.LG · 2026-05-01 · unverdicted · novelty 6.0

A deep kernel learning architecture with transformer feature extraction on clinical-BERT embeddings and Gaussian process backend identifies three glaucoma subgroups by decoupling progression trajectories from current visual acuity in multimodal EHR data.

IDOBE: Infectious Disease Outbreak forecasting Benchmark Ecosystem

cs.LG · 2026-04-20 · conditional · novelty 6.0

IDOBE compiles over 10,000 epidemiological outbreaks into a public benchmark and shows that MLP-based models deliver the most robust short-term forecasts while statistical methods hold a slight edge pre-peak.

MATRIX: Multi-Layer Code Watermarking via Dual-Channel Constrained Parity-Check Encoding

cs.CR · 2026-04-17 · unverdicted · novelty 6.0

MATRIX embeds multi-layer watermarks in LLM-generated code via dual-channel constrained parity-check encoding, achieving 99.2% detection accuracy with 0-0.14% functionality loss and 7.7-26.67% better attack robustness than prior methods.

Early-Warning Learner Satisfaction Forecasting in MOOCs via Temporal Event Transformers and LLM Text Embeddings

cs.CE · 2026-04-14 · unverdicted · novelty 6.0

TET-LLM predicts MOOC satisfaction early via temporal event transformers on behavior, LLM embeddings on text, and topic distributions, beating baselines at RMSE 0.82 and AUC 0.77 for 7-day forecasts.

Habitat-GS: A High-Fidelity Navigation Simulator with Dynamic Gaussian Splatting

cs.RO · 2026-04-14 · unverdicted · novelty 6.0

Habitat-GS integrates 3D Gaussian Splatting scene rendering and Gaussian avatars into Habitat-Sim, yielding agents with stronger cross-domain generalization and effective human-aware navigation.

SAM 2: Segment Anything in Images and Videos

cs.CV · 2024-08-01 · conditional · novelty 6.0

SAM 2 delivers more accurate video segmentation with 3x fewer user interactions and 6x faster image segmentation than the original SAM by training a streaming-memory transformer on the largest video segmentation dataset collected to date.

Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges

cs.LG · 2021-04-27 · accept · novelty 6.0

Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.

CodeBERT: A Pre-Trained Model for Programming and Natural Languages

cs.CL · 2020-02-19 · unverdicted · novelty 6.0

CodeBERT pre-trains a bimodal model on code and text pairs plus unimodal data to achieve state-of-the-art results on natural language code search and code documentation generation.

Universal Transformers

cs.CL · 2018-07-10 · unverdicted · novelty 6.0

Universal Transformers combine Transformer parallelism with recurrent updates and dynamic halting to achieve Turing-completeness under assumptions and outperform standard Transformers on algorithmic and language tasks.

Rethinking Convolutional Networks for Attribute-Aware Sequential Recommendation

cs.IR · 2026-05-06 · unverdicted · novelty 5.0

ConvRec applies hierarchical convolutional layers to generate compact sequence representations for attribute-aware sequential recommendation, achieving linear complexity and outperforming attention-based state-of-the-art models on four real-world datasets.

Learning Fingerprints for Medical Time Series with Redundancy-Constrained Information Maximization

cs.LG · 2026-04-30 · unverdicted · novelty 5.0

A self-supervised method learns a fixed set of disentangled fingerprint tokens from medical time series by combining reconstruction loss with a total coding rate diversity penalty, framed as a disentangled rate-distortion problem.

Attention Is All You Need

cs.CL · 2017-06-12 · unverdicted · novelty 5.0

Pith review generated a malformed one-line summary.

Impact of leaky dynamics on predictive path integration accuracy in recurrent neural networks

cs.NE · 2026-04-17 · unverdicted · novelty 4.0

Leaky RNNs improve grid-cell-like representations and path-integration accuracy by acting as a low-pass filter that stabilizes dynamics against noise.

citing papers explorer

Showing 27 of 27 citing papers.

Zero-shot Imitation Learning by Latent Topology Mapping cs.LG · 2026-05-08 · unverdicted · none · ref 9 · internal anchor
ZALT learns latent hub states and hub-to-hub dynamics from demonstrations to plan zero-shot solutions for unseen start-goal tasks, achieving 55% success in a 3D maze versus 6% for baselines.
Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement cs.CV · 2026-05-07 · unverdicted · none · ref 3 · 2 links · internal anchor
NOVA represents world states as INR weights for decoder-free rendering, compactness, and unsupervised disentanglement of background, foreground, and motion in video world models.
How Long Does Infinite Width Last? Signal Propagation in Long-Range Linear Recurrences cs.LG · 2026-05-06 · unverdicted · none · ref 7 · internal anchor
In linear recurrent models, infinite-width signal propagation remains accurate only for depths t much smaller than sqrt(width n), with a critical regime at t ~ c sqrt(n) where finite-width effects emerge and dominate for larger t.
FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning cs.LG · 2026-05-06 · unverdicted · none · ref 8 · internal anchor
FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connections for better information flow.
Geometry-Induced Long-Range Correlations in Recurrent Neural Network Quantum States quant-ph · 2026-04-09 · conditional · none · ref 33 · internal anchor
Dilated RNN wave functions induce power-law correlations for the critical 1D transverse-field Ising model and the Cluster state, unlike the exponential decay of conventional RNN ansatze.
A Clinical Point Cloud Paradigm for In-Hospital Mortality Prediction from Multi-Level Incomplete Multimodal EHRs cs.LG · 2026-04-06 · unverdicted · none · ref 28 · internal anchor
HealthPoint represents clinical events as points in a 4D space (content, time, modality, case) and applies low-rank relational attention to achieve state-of-the-art mortality prediction from multi-level incomplete multimodal EHRs.
Mastering Diverse Domains through World Models cs.AI · 2023-01-10 · unverdicted · none · ref 63 · internal anchor
DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.
Graph Attention Networks stat.ML · 2017-10-30 · accept · none · ref 3 · internal anchor
Graph Attention Networks compute learnable attention coefficients over node neighborhoods to produce weighted feature aggregations, achieving state-of-the-art results on citation networks and inductive protein-protein interaction graphs.
Mixed Precision Training cs.AI · 2017-10-10 · accept · none · ref 2 · internal anchor
Mixed precision training uses FP16 for most computations, FP32 master weights for accumulation, and loss scaling to enable accurate training of large DNNs with halved memory usage.
3DGS$^3$: Joint Super Sampling and Frame Interpolation for Real-Time Large-Scale 3DGS Rendering cs.GR · 2026-05-12 · unverdicted · none · ref 9 · internal anchor
3DGS³ adds gradient-guided super-sampling and lightweight temporal interpolation to low-resolution 3DGS renders to produce high-resolution, high-frame-rate output without retraining the underlying scene representation.
Graph Federated Unlearning for Privacy Preservation cs.LG · 2026-05-04 · unverdicted · none · ref 15 · internal anchor
Orthogonal unlearning updates plus server-side virtual clients enable effective user data removal in graph federated learning without major performance loss.
Deep Kernel Learning for Stratifying Glaucoma Trajectories cs.LG · 2026-05-01 · unverdicted · none · ref 19 · internal anchor
A deep kernel learning architecture with transformer feature extraction on clinical-BERT embeddings and Gaussian process backend identifies three glaucoma subgroups by decoupling progression trajectories from current visual acuity in multimodal EHR data.
IDOBE: Infectious Disease Outbreak forecasting Benchmark Ecosystem cs.LG · 2026-04-20 · conditional · none · ref 12 · internal anchor
IDOBE compiles over 10,000 epidemiological outbreaks into a public benchmark and shows that MLP-based models deliver the most robust short-term forecasts while statistical methods hold a slight edge pre-peak.
MATRIX: Multi-Layer Code Watermarking via Dual-Channel Constrained Parity-Check Encoding cs.CR · 2026-04-17 · unverdicted · none · ref 65 · internal anchor
MATRIX embeds multi-layer watermarks in LLM-generated code via dual-channel constrained parity-check encoding, achieving 99.2% detection accuracy with 0-0.14% functionality loss and 7.7-26.67% better attack robustness than prior methods.
Early-Warning Learner Satisfaction Forecasting in MOOCs via Temporal Event Transformers and LLM Text Embeddings cs.CE · 2026-04-14 · unverdicted · none · ref 14 · internal anchor
TET-LLM predicts MOOC satisfaction early via temporal event transformers on behavior, LLM embeddings on text, and topic distributions, beating baselines at RMSE 0.82 and AUC 0.77 for 7-day forecasts.
Habitat-GS: A High-Fidelity Navigation Simulator with Dynamic Gaussian Splatting cs.RO · 2026-04-14 · unverdicted · none · ref 4 · internal anchor
Habitat-GS integrates 3D Gaussian Splatting scene rendering and Gaussian avatars into Habitat-Sim, yielding agents with stronger cross-domain generalization and effective human-aware navigation.
SAM 2: Segment Anything in Images and Videos cs.CV · 2024-08-01 · conditional · none · ref 8 · internal anchor
SAM 2 delivers more accurate video segmentation with 3x fewer user interactions and 6x faster image segmentation than the original SAM by training a streaming-memory transformer on the largest video segmentation dataset collected to date.
Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges cs.LG · 2021-04-27 · accept · none · ref 16 · internal anchor
Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.
CodeBERT: A Pre-Trained Model for Programming and Natural Languages cs.CL · 2020-02-19 · unverdicted · none · ref 35 · internal anchor
CodeBERT pre-trains a bimodal model on code and text pairs plus unimodal data to achieve state-of-the-art results on natural language code search and code documentation generation.
Universal Transformers cs.CL · 2018-07-10 · unverdicted · none · ref 5 · internal anchor
Universal Transformers combine Transformer parallelism with recurrent updates and dynamic halting to achieve Turing-completeness under assumptions and outperform standard Transformers on algorithmic and language tasks.
Rethinking Convolutional Networks for Attribute-Aware Sequential Recommendation cs.IR · 2026-05-06 · unverdicted · none · ref 1 · internal anchor
ConvRec applies hierarchical convolutional layers to generate compact sequence representations for attribute-aware sequential recommendation, achieving linear complexity and outperforming attention-based state-of-the-art models on four real-world datasets.
Learning Fingerprints for Medical Time Series with Redundancy-Constrained Information Maximization cs.LG · 2026-04-30 · unverdicted · none · ref 11 · internal anchor
A self-supervised method learns a fixed set of disentangled fingerprint tokens from medical time series by combining reconstruction loss with a total coding rate diversity penalty, framed as a disentangled rate-distortion problem.
Attention Is All You Need cs.CL · 2017-06-12 · unverdicted · none · ref 5 · internal anchor
Pith review generated a malformed one-line summary.
Impact of leaky dynamics on predictive path integration accuracy in recurrent neural networks cs.NE · 2026-04-17 · unverdicted · none · ref 47 · internal anchor
Leaky RNNs improve grid-cell-like representations and path-integration accuracy by acting as a low-pass filter that stabilizes dynamics against noise.
Beyond Isolated Clients: Integrating Graph-Based Embeddings into Event Sequence Models cs.LG · 2026-04-10 · unverdicted · none · ref 6 · internal anchor
Three strategies for adding graph embeddings to event sequence SSL models improve AUC by up to 2.3% on four financial and e-commerce datasets, with graph density determining the best integration approach.
Learning-Based Spectrum Cartography in Low Earth Orbit Satellite Networks: An Overview cs.NI · 2026-05-11 · unverdicted · none · ref 30 · internal anchor
The paper overviews attention-based learning methods for spectrum cartography in LEO satellite networks to enable adaptive fusion of heterogeneous measurements for inference and resource allocation.
Benchmarking PyCaret AutoML Against BiLSTM for Fine-Grained Emotion Classification: A Comparative Study on 20-Class Emotion Detection cs.CL · 2026-04-29 · unverdicted · none · ref 2 · internal anchor
BiLSTM achieves 89% accuracy and 0.89 weighted F1 on 20-class emotion detection, marginally outperforming SVM at 88.11% on a 79,595-sentence dataset.

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

why this work matters in Pith

fields

years

verdicts

representative citing papers

citing papers explorer