Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

arXiv: 1609.08144 · v2 · submitted 2016-09-26 · 💻 cs.CL · cs.AI · cs.LG

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, Jeffrey Dean

Pith reviewed 2026-05-12 15:16 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords neural machine translation · GNMT · deep LSTM · wordpieces · attention mechanism · coverage penalty · machine translation · rare words

The pith

GNMT, a deep LSTM neural machine translation system with wordpieces and a coverage penalty, reduces translation errors by an average of 60% relative to Google's phrase-based production system in a human side-by-side evaluation of isolated simple sentences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GNMT as an end-to-end neural approach to machine translation that addresses the computational expense and rare-word problems of earlier NMT systems. It details a model built from deep LSTMs with eight layers in the encoder and eight in the decoder, residual connections, an attention mechanism wired to allow more parallelism, wordpiece sub-word units for the vocabulary, low-precision arithmetic for faster inference, and beam search augmented with length normalization and a coverage penalty. Human side-by-side evaluation on isolated simple sentences shows a 60% average reduction in translation errors relative to the prior phrase-based production system, and results on the WMT'14 benchmarks are competitive with the state of the art. This matters because practical deployment requires both high accuracy and fast operation, which previous neural systems struggled to deliver. If the claim holds, it indicates that neural methods can substantially narrow the quality gap to human translators for at least straightforward text.

Core claim

Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections. The attention mechanism connects the bottom layer of the decoder to the top layer of the encoder to improve parallelism. Wordpieces are used for both input and output to handle rare words. Low-precision arithmetic accelerates inference. Beam search incorporates length-normalization and a coverage penalty to encourage complete translations. On WMT'14 benchmarks GNMT achieves competitive results, and human side-by-side evaluation shows it reduces translation errors by an average of 60% compared to Google's phrase-based production system.
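The beam-search scoring mentioned above can be sketched as follows. This is a minimal illustration, assuming the commonly cited form of the paper's scoring function (log-probability divided by a length penalty, plus a coverage penalty over attention mass). The function names, the default `alpha` and `beta` values, and the `eps` floor that avoids `log(0)` are assumptions made for this sketch, not details confirmed by this page.

```python
import math

def length_penalty(length, alpha):
    # Assumed form: lp(Y) = ((5 + |Y|) / (5 + 1)) ** alpha
    return ((5.0 + length) / 6.0) ** alpha

def coverage_penalty(attention, beta, eps=1e-9):
    # attention[j][i] is the attention weight that target step j
    # places on source token i. A source token counts as covered once
    # its accumulated attention reaches 1.0.
    num_source = len(attention[0])
    total = 0.0
    for i in range(num_source):
        covered = sum(step[i] for step in attention)
        # eps is a hypothetical floor so an unattended token does not
        # produce log(0); it is not part of the assumed formula.
        total += math.log(max(min(covered, 1.0), eps))
    return beta * total

def beam_score(log_prob, attention, alpha=0.2, beta=0.2):
    # Higher is better: normalized log-probability plus coverage term.
    target_length = len(attention)
    return log_prob / length_penalty(target_length, alpha) + coverage_penalty(attention, beta)

# Hypothetical two-token source, two-step hypotheses with equal log-probability.
full = [[1.0, 0.0], [0.0, 1.0]]     # each source token fully attended
partial = [[0.9, 0.1], [0.9, 0.1]]  # second source token mostly ignored
print(beam_score(-2.0, full) > beam_score(-2.0, partial))  # True
```

With equal model log-probability, the hypothesis that leaves a source token mostly unattended receives a lower score, which is the behavior the coverage penalty is meant to encourage.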

What carries the argument

GNMT: deep 8-layer LSTM encoder-decoder with attention from decoder bottom to encoder top, wordpiece tokenization, low-precision inference, and coverage-penalized beam search.
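The role of the residual connections in a deep stack can be illustrated with a toy sketch. The layers here are hypothetical stand-in functions rather than LSTM cells, and `residual_stack` is a name introduced only for illustration; the point is the wiring, in which each layer's input is added to its output before being passed upward.

```python
def residual_stack(x, layers):
    # x is a list of floats; each layer maps a vector to a vector of
    # the same length. The skip connection adds the layer input to the
    # layer output, giving gradients a direct path through the stack.
    for layer in layers:
        y = layer(x)
        x = [yi + xi for yi, xi in zip(y, x)]
    return x

def halve(v):
    # Hypothetical stand-in for a recurrent layer.
    return [0.5 * t for t in v]

# Eight stacked layers, mirroring the depth described above.
out = residual_stack([1.0, -2.0], [halve] * 8)
print(out)
```

Each toy layer scales its input by 0.5, so with the skip connection the signal grows by a factor of 1.5 per layer instead of shrinking by half, which hints at why residual wiring helps very deep stacks train.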

If this is right

Improves handling of rare words by breaking them into common sub-word units.
Accelerates training through better parallelism in the attention mechanism.
Speeds up translation inference using low-precision arithmetic.
Encourages more complete output sentences via the coverage penalty in beam search.
Achieves competitive performance on English-to-French and English-to-German WMT benchmarks.
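To make the rare-word point concrete, here is a minimal sketch of splitting a word into sub-word units from a fixed vocabulary. It uses greedy longest-match-first segmentation, which is an assumption for illustration and not necessarily the segmentation procedure GNMT uses. The toy vocabulary, the `_` word-start marker, and the `<unk>` fallback are likewise hypothetical.

```python
def wordpiece_segment(word, vocab, marker="_", unknown="<unk>"):
    # Prefix the word with a word-start marker, then repeatedly take
    # the longest vocabulary entry that matches at the current position.
    text = marker + word
    pieces = []
    start = 0
    while start < len(text):
        end = len(text)
        while end > start and text[start:end] not in vocab:
            end -= 1
        if end == start:
            # No vocabulary entry matches here: give up on this word.
            return [unknown]
        pieces.append(text[start:end])
        start = end
    return pieces

# Hypothetical toy vocabulary of sub-word units.
vocab = {"_J", "et", "_makers", "_fe", "ud"}
print(wordpiece_segment("Jet", vocab))     # ['_J', 'et']
print(wordpiece_segment("makers", vocab))  # ['_makers']
print(wordpiece_segment("feud", vocab))    # ['_fe', 'ud']
```

A frequent word stays whole while a rarer one is assembled from smaller pieces, so the model never needs a dedicated vocabulary entry for every surface form.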

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar architectural choices could be applied to other sequence generation tasks beyond translation.
The gains might increase further with larger training corpora, but the paper does not test this directly.
Results on isolated simple sentences may not fully predict performance on long, context-dependent or technical texts.
Adoption in production could shift the default from phrase-based to neural systems if the error reduction holds across domains.

Load-bearing premise

The performance improvements are due to the specific model architecture and training choices rather than differences in the amount or quality of training data or available compute resources.

What would settle it

Re-training the phrase-based system on the same data volume and hardware as GNMT and re-evaluating both on a diverse set of complex sentences would show whether the 60% error reduction is architecture-specific.
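A figure such as "60% error reduction" can be read as the fraction of the quality gap between a baseline and human translation that a new system closes. The sketch below shows that arithmetic under this assumed reading; the scores are invented for illustration and are not taken from the paper, and `gap_closed` is a hypothetical helper name.

```python
def gap_closed(baseline, system, human):
    # Fraction of the baseline-to-human gap that the system closes,
    # where all three arguments are mean ratings on the same scale.
    if human <= baseline:
        raise ValueError("human score must exceed the baseline score")
    return (system - baseline) / (human - baseline)

# Hypothetical mean side-by-side ratings (not from the paper).
reduction = gap_closed(baseline=4.0, system=5.2, human=6.0)
print(round(reduction, 2))  # 0.6
```

Under this reading, the result depends as much on the baseline and the human reference scores as on the new system, which is why the choice of baseline and test sentences matters for interpreting the headline number.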

Original abstract

Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NMT's use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google's Neural Machine Translation system, which attempts to address many of these issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections. To improve parallelism and therefore decrease training time, our attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units ("wordpieces") for both input and output. This method provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system. Our beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence. On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves competitive results to state-of-the-art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google's phrase-based production system.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

GNMT reports a big human-eval win on simple sentences over their phrase-based system, but the gains are not isolated from possible data or compute differences.

The main thing to know is that this paper describes Google's production neural MT setup and claims it cuts translation errors by 60% versus their existing phrase-based system on a human side-by-side test of simple sentences. The architecture uses 8-layer residual LSTMs, attention wired from the bottom decoder layer to the top encoder layer, wordpiece subword units, low-precision inference, and a coverage penalty plus length normalization in beam search.

Referee Report

2 major / 3 minor

Summary. The paper presents GNMT, an 8-layer residual LSTM encoder-decoder NMT system with bottom-decoder-to-top-encoder attention, wordpiece subword segmentation, low-precision inference, and coverage-penalized length-normalized beam search. It reports competitive BLEU scores on the WMT'14 English-to-French and English-to-German benchmarks and claims a 60% average reduction in translation errors versus Google's production phrase-based MT system, measured by human side-by-side ratings on a set of isolated simple sentences.

Significance. If the human-evaluation result holds under controlled conditions, the work is significant for showing that deep NMT can be deployed at production scale and can measurably outperform a mature phrase-based system on the metric that matters most to users. Credit is due for the practical engineering contributions (wordpieces for rare-word handling, residual connections and attention placement for training speed, low-precision arithmetic for inference latency) and for providing a direct, falsifiable human comparison rather than relying solely on automatic metrics.

major comments (2)

[§5] §5 (human side-by-side evaluation): the central claim of a 60% average error reduction is supported only by ratings on 'isolated simple sentences'; the manuscript provides no count of sentences, no description of sentence selection or domain, no inter-rater agreement statistics, and no p-value or confidence interval, making it impossible to judge whether the reported gap is robust or generalizes beyond the evaluated regime.
[§5] §5 (comparison to production PBMT baseline): the 60% error-reduction figure is presented without any statement that training-data volume, parallel-corpus composition, or total optimization effort were held constant between GNMT and the phrase-based production system. Because the production baseline may differ in data scale or tuning regime, the result does not isolate the contribution of the architectural choices (residual LSTMs, attention placement, wordpieces, coverage penalty) that the paper highlights.

minor comments (3)

[Abstract] The abstract states that GNMT 'achieves competitive results' on WMT'14 but omits the actual BLEU numbers; adding the precise scores (and the corresponding state-of-the-art references) would make the summary self-contained.
[Model Architecture] The description of the attention mechanism (bottom decoder layer attending to top encoder layer) would be clearer if accompanied by a small equation or diagram in the model-architecture section.
[§5] Table captions for the WMT results should explicitly note the training data size and whether any external monolingual data were used, to allow direct comparison with contemporaneous systems.

Simulated Author's Rebuttal

2 responses · 1 unresolved

Thank you for the detailed review of our paper on Google's Neural Machine Translation system. We appreciate the positive assessment of the work's significance and will address the concerns raised regarding the human evaluation section to improve the manuscript.

Referee: §5 (human side-by-side evaluation): the central claim of a 60% average error reduction is supported only by ratings on 'isolated simple sentences'; the manuscript provides no count of sentences, no description of sentence selection or domain, no inter-rater agreement statistics, and no p-value or confidence interval, making it impossible to judge whether the reported gap is robust or generalizes beyond the evaluated regime.

Authors: We acknowledge that §5 provides limited details on the human evaluation protocol. In the revised manuscript, we will add a description of the sentence selection process (isolated simple sentences sampled from production traffic and test sets) and the approximate number of sentences rated. We will also clarify that the evaluation involved multiple professional raters performing side-by-side comparisons. However, inter-rater agreement statistics and formal statistical significance tests were not part of the original evaluation design; we will note this as a limitation of the reported result rather than claiming robustness beyond what the data supports.
revision: partial

Referee: §5 (comparison to production PBMT baseline): the 60% error-reduction figure is presented without any statement that training-data volume, parallel-corpus composition, or total optimization effort were held constant between GNMT and the phrase-based production system. Because the production baseline may differ in data scale or tuning regime, the result does not isolate the contribution of the architectural choices (residual LSTMs, attention placement, wordpieces, coverage penalty) that the paper highlights.

Authors: The comparison is intentionally to the deployed production phrase-based system, which represents the state-of-the-art performance achievable with that paradigm at Google, including all available data and tuning efforts. The goal is to show the practical improvement offered by GNMT over the existing production baseline. We agree that this does not constitute a controlled experiment isolating individual model components. In the revision, we will explicitly state in §5 that the PBMT baseline is the fully optimized production system and that the reported improvement reflects the end-to-end difference rather than the effect of any single architectural decision.
revision: yes

standing simulated objections not resolved

The lack of inter-rater agreement and p-value statistics for the human evaluation remains unresolved, as these were not computed in the original study and the raw data may not be available for re-analysis.
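If per-sentence ratings were available, one simple way to attach an uncertainty estimate to the human-evaluation gap would be a percentile bootstrap over the per-sentence score differences. The sketch below is illustrative only: the rating differences are invented, and `bootstrap_ci` and its parameters are hypothetical names rather than anything the paper or the review specifies.

```python
import random

def bootstrap_ci(diffs, n_resamples=2000, level=0.95, seed=0):
    # Percentile bootstrap confidence interval for the mean of
    # per-sentence rating differences (new system minus baseline).
    rng = random.Random(seed)
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((1.0 - level) / 2.0 * n_resamples)
    hi_idx = int((1.0 + level) / 2.0 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

# Hypothetical per-sentence differences on a rating scale.
diffs = [1.0, 0.0, 2.0, 1.0, 1.0, 0.0, 1.0, 2.0, -1.0, 1.0]
low, high = bootstrap_ci(diffs)
print(low, high)
```

An interval that excludes zero would support the claim that the improvement is not an artifact of the particular sentences sampled; it would not address inter-rater agreement, which needs the individual raters' scores.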

Circularity Check

0 steps flagged

No circularity: empirical claims rest on independent benchmarks and human evaluation

The paper describes GNMT architecture (8-layer residual LSTMs, bottom-decoder attention, wordpieces, coverage penalty) and reports results on WMT'14 benchmarks plus a 60% error reduction from human side-by-side evaluation on held-out simple sentences versus the production PBMT system. These metrics derive from external test sets and separate raters rather than any equation or fit that defines success in terms of the model's own parameters. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the derivation of the central performance claims; the evaluation is presented as an independent measurement of the described system.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim depends on empirical training of a large neural network whose capacity and efficiency choices (layer depth, precision, vocabulary construction) are selected by hand rather than derived from first principles.

free parameters (2)

encoder and decoder depth = 8
Set to 8 layers each to increase capacity while preserving training parallelism.
wordpiece vocabulary size
Chosen as a limited set of common sub-word units to balance coverage and efficiency.

axioms (2)

domain assumption: LSTM layers with residual connections can model long-range dependencies in translation sequences.
Foundation for the 8-layer encoder-decoder stack.
domain assumption: Low-precision arithmetic during inference maintains acceptable translation quality.
Justification for using reduced precision to speed up decoding.
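The low-precision assumption can be illustrated with a generic symmetric 8-bit quantization of a dot product. This is a sketch of the general technique, not the quantization scheme the paper uses; the function names and the choice of a per-vector scale are assumptions made for illustration.

```python
def quantize(values, num_bits=8):
    # Symmetric linear quantization: map floats to signed integers in
    # [-qmax, qmax] using one scale factor per vector.
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale

def quantized_dot(a, b):
    # Multiply and accumulate in integers, then rescale once at the end.
    qa, scale_a = quantize(a)
    qb, scale_b = quantize(b)
    accumulator = sum(x * y for x, y in zip(qa, qb))
    return accumulator * scale_a * scale_b

a = [0.5, -1.0, 0.25]
b = [1.0, 0.5, -2.0]
exact = sum(x * y for x, y in zip(a, b))
approx = quantized_dot(a, b)
print(exact, approx)
```

The integer result stays close to the floating-point one, which is the property the axiom relies on: the rounding error introduced per weight is small relative to the values being accumulated.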

invented entities (1)

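wordpieces · no independent evidence
purpose: Sub-word units that let the model translate rare words with a fixed, limited vocabulary.
Tokenization scheme adopted to address the rare-word problem; the wordpiece model itself predates this paper, having been developed earlier for Japanese and Korean speech recognition.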

pith-pipeline@v0.9.0 · 5744 in / 1573 out tokens · 69065 ms · 2026-05-12T15:16:34.439988+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Autoregressive Synthesis of Sparse and Semi-Structured Mixed-Type Data
cs.LG 2026-03 conditional novelty 8.0

ORiGAMi synthesizes sparse semi-structured mixed-type JSON data using path-encoded autoregressive tokenization and schema constraints, outperforming flattened tabular baselines on 17 of 18 fidelity, detection, and uti...
Generative Language Modeling for Automated Theorem Proving
cs.LG 2020-09 unverdicted novelty 8.0

GPT-f, a transformer-based prover for Metamath, generated new short proofs that were accepted into the main library—the first such contribution from a deep-learning system.
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
cs.LG 2017-01 accept novelty 8.0

A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.
Tokenization with Split Trees
cs.CL 2026-05 unverdicted novelty 7.0

ToaST uses split trees and integer programming to cut token counts by over 11% versus BPE on English text at 40k+ vocab sizes, yielding higher CORE scores in 1.5B-parameter language model training.
PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media
cs.CL 2026-05 unverdicted novelty 7.0

PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic onlin...
From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models
cs.LG 2026-05 unverdicted novelty 7.0

Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.
Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models
cs.CL 2026-05 conditional novelty 7.0

Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.
ReTokSync: Self-Synchronizing Tokenization Disambiguation for Generative Linguistic Steganography
cs.CR 2026-04 unverdicted novelty 7.0

ReTokSync resolves tokenization ambiguity in generative linguistic steganography via targeted self-synchronizing resets, achieving over 99.7% extraction accuracy and 100% recovery with an auxiliary channel while match...
ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation
cs.CL 2026-04 unverdicted novelty 7.0

ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.
From Where Words Come: Efficient Regularization of Code Tokenizers Through Source Attribution
cs.CL 2026-04 unverdicted novelty 7.0

SA-BPE regularizes standard BPE training for code by incorporating source diversity to skip problematic merges, substantially cutting unused tokens without altering inference.
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
cs.LG 2024-02 unverdicted novelty 7.0

Griffin hybrid model matches Llama-2 performance while trained on over 6 times fewer tokens and offers lower inference latency with higher throughput.
Massive Activations in Large Language Models
cs.CL 2024-02 unverdicted novelty 7.0

Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
cs.CV 2023-10 unverdicted novelty 7.0

A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
Efficient Memory Management for Large Language Model Serving with PagedAttention
cs.LG 2023-09 conditional novelty 7.0

PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.
iBOT: Image BERT Pre-Training with Online Tokenizer
cs.CV 2021-11 unverdicted novelty 7.0

iBOT achieves 82.3% linear probing accuracy and 87.8% fine-tuning accuracy on ImageNet-1K using masked image modeling with a jointly trained online tokenizer.
Learning to summarize from human feedback
cs.CL 2020-09 conditional novelty 7.0

Reinforcement learning on a reward model trained from human summary comparisons produces summaries humans prefer over supervised fine-tuning or human references on TL;DR and transfers to CNN/DM.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
cs.LG 2019-10 unverdicted novelty 7.0

T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...
Fine-Tuning Language Models from Human Preferences
cs.CL 2019-09 unverdicted novelty 7.0

Language models fine-tuned via RL on 5k-60k human preference comparisons produce stylistically better text continuations and human-preferred summaries that sometimes copy input sentences.
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
cs.CL 2018-08 accept novelty 7.0

SentencePiece trains subword models directly from raw text to enable language-independent neural text processing.
Mixed Precision Training
cs.AI 2017-10 accept novelty 7.0

Mixed precision training uses FP16 for most computations, FP32 master weights for accumulation, and loss scaling to enable accurate training of large DNNs with halved memory usage.
Physics-in-the-Loop: A Hybrid Agentic Architecture for Validated CAD Engineering Design
cs.CV 2026-05 unverdicted novelty 6.0

A hybrid agentic architecture integrates knowledge-based physical verification tools into LLM-driven CAD design loops, producing more complex and functionally valid designs than prior agentic baselines.
Multi-Headed Transformer Architectures as Time-dependent Wasserstein Gradient Flows
cs.LG 2026-05 unverdicted novelty 6.0

Models multi-head transformer data flow as time-dependent Wasserstein gradient flows of an attention-capturing interaction energy, with proofs on omega-limit stationary points and stability under weight and input pert...
Decaf: Improving Neural Decompilation with Automatic Feedback and Search
cs.SE 2026-05 unverdicted novelty 6.0

Decaf uses compiler feedback and search to improve neural decompilation, boosting semantic success rate from 26.0% to 83.9% on ExeBench Real -O2 split.
SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation
cs.CL 2026-05 unverdicted novelty 6.0

SimCT enlarges the supervision space in cross-tokenizer on-policy distillation using short jointly tokenizable multi-token continuations, producing consistent gains over shared-token baselines on math and code benchmarks.
SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation
cs.CL 2026-05 unverdicted novelty 6.0

SimCT recovers discarded teacher signal in cross-tokenizer on-policy distillation by enlarging supervision to jointly realizable multi-token continuations, yielding consistent gains on math reasoning and code generati...
Context-Aware Wireless Token Communication via Joint Token Masking and Detection
eess.SP 2026-05 unverdicted novelty 6.0

A joint token masking and detection scheme with masked language models improves token reconstruction over noisy wireless channels by up to 1.77x on Europarl and 1.63x on WikiText-103 compared to conventional methods.
Neural Grammatical Error Correction for Romanian
cs.CL 2026-04 unverdicted novelty 6.0

A new Romanian GEC corpus of 10k pairs plus pretraining a Transformer on artificial errors generated via POS tagger yields F0.5 of 53.76, beating the 44.38 baseline from training only on the corpus.
PARM: Pipeline-Adapted Reward Model
cs.AI 2026-04 unverdicted novelty 6.0

PARM adapts reward models to multi-stage LLM pipelines via pipeline data and direct preference optimization, improving execution rate and solving accuracy on optimization benchmarks and showing transfer to GSM8K.
A Comparative Study of Semantic Log Representations for Software Log-based Anomaly Detection
cs.SE 2026-04 unverdicted novelty 6.0

QTyBERT matches or exceeds BERT-based log anomaly detection effectiveness while reducing embedding generation time to near static word embedding levels.
Step-Audio 2 Technical Report
cs.CL 2025-07 unverdicted novelty 6.0

Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and c...
Hallucinations are inevitable but can be made statistically negligible
cs.CL 2025-02 unverdicted novelty 6.0

Hallucinations are inevitable on an infinite set of inputs but can be made statistically negligible with sufficient training data quality and quantity.
Training Language Models to Self-Correct via Reinforcement Learning
cs.LG 2024-09 unverdicted novelty 6.0

SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
The Falcon Series of Open Language Models
cs.CL 2023-11 conditional novelty 6.0

Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
cs.CL 2023-09 conditional novelty 6.0

RLAIF matches RLHF on summarization and dialogue tasks, with a direct-RLAIF variant achieving superior results by using LLM rewards directly during training.
The False Promise of Imitating Proprietary LLMs
cs.CL 2023-05 conditional novelty 6.0

Finetuning open LMs on ChatGPT outputs creates models that mimic style and fool human raters but fail to close the performance gap to proprietary systems on tasks not well-represented in the imitation data.
BloombergGPT: A Large Language Model for Finance
cs.LG 2023-03 conditional novelty 6.0

BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
Atlas: Few-shot Learning with Retrieval Augmented Language Models
cs.CL 2022-08 unverdicted novelty 6.0

Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.
No Language Left Behind: Scaling Human-Centered Machine Translation
cs.CL 2022-07 unverdicted novelty 6.0

A sparsely gated mixture-of-experts model trained on newly mined low-resource data achieves 44% relative BLEU improvement across 200 languages while adding human safety evaluation.
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
cs.CV 2022-06 unverdicted novelty 6.0

Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.
ST-MoE: Designing Stable and Transferable Sparse Expert Models
cs.CL 2022-02 unverdicted novelty 6.0

ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...
Unsupervised Dense Information Retrieval with Contrastive Learning
cs.IR 2021-12 unverdicted novelty 6.0

Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
cs.CL 2020-06 unverdicted novelty 6.0

GShard supplies automatic sharding and conditional computation support that enabled training a 600-billion-parameter multilingual translation model on thousands of TPUs with superior quality.
CodeBERT: A Pre-Trained Model for Programming and Natural Languages
cs.CL 2020-02 unverdicted novelty 6.0

CodeBERT pre-trains a bimodal model on code and text pairs plus unimodal data to achieve state-of-the-art results on natural language code search and code documentation generation.
Compressive Transformers for Long-Range Sequence Modelling
cs.LG 2019-11 unverdicted novelty 6.0

Compressive Transformer sets new records on WikiText-103 (17.1 ppl) and Enwik8 (0.97 bpc) via memory compression and introduces the PG-19 long-range language benchmark.
CTRL: A Conditional Transformer Language Model for Controllable Generation
cs.CL 2019-09 unverdicted novelty 6.0

CTRL is a large conditional transformer language model that uses naturally occurring control codes to steer text generation style and content.
VisualBERT: A Simple and Performant Baseline for Vision and Language
cs.CV 2019-08 conditional novelty 6.0

VisualBERT is a Transformer model that implicitly aligns text and image regions through self-attention and achieves competitive or superior results on VQA, VCR, NLVR2, and Flickr30K after pre-training on captions.
BERT-DST: Scalable End-to-End Dialogue State Tracking with Bidirectional Encoder Representations from Transformer
cs.CL 2019-07 unverdicted novelty 6.0

BERT-DST applies a BERT encoder with cross-slot parameter sharing to directly extract slot values from dialogue context, outperforming priors on scalable DST benchmarks Sim-M and Sim-R while remaining competitive on D...
Retrieving Sequential Information for Non-Autoregressive Neural Machine Translation
cs.CL 2019-06 unverdicted novelty 6.0

Reinforce-NAT and FS-decoder retrieve target sequential information for non-autoregressive translation, yielding higher BLEU than baseline NAT while preserving fast decoding and approaching autoregressive quality.
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
cs.CV 2017-06 accept novelty 6.0

Linear learning-rate scaling plus warmup lets minibatch size 8192 train ResNet-50 on ImageNet in one hour at full small-batch accuracy.
Metaphors in Literary Post-Editing: Opening Pandora's Box?
cs.CL 2026-05 unverdicted novelty 5.0

Post-editors changed one in three metaphors in NMT and LLM outputs for literary texts, rated quality poor, and found post-editing more laborious than original translation.
Reference-Free Reinforcement Learning Fine-Tuning for MT: A Seq2Seq Perspective
cs.CL 2026-05 unverdicted novelty 5.0

GRPO with reference-free rewards improves NLLB-200 translation quality on 13 languages up to +5.03 chrF++, competing with supervised fine-tuning on complex languages without target data.
Neural Code Translation of Legacy Code: APL to C#
cs.SE 2026-05 unverdicted novelty 5.0

Guided LLM strategies with custom datasets and execution-based verification enable functional APL-to-C# translation across a range of program complexities.
Stochasticity in Tokenisation Improves Robustness
cs.CL 2026-04 unverdicted novelty 5.0

Stochastic tokenisation during pre-training and fine-tuning improves LLM robustness to perturbations while preserving accuracy.
Space Filling Curves is All You Need: Communication-Avoiding Matrix Multiplication Made Simple
cs.DC 2026-01 unverdicted novelty 5.0

Space-filling curves enable platform- and shape-oblivious communication-avoiding matrix multiplication that outperforms vendor libraries by up to 5.5x on CPUs while also accelerating LLM prefill and distributed workloads.
EPNAS: Efficient Progressive Neural Architecture Search
cs.LG 2019-07 unverdicted novelty 5.0

EPNAS uses a progressive search policy with REINFORCE performance prediction to search neural architectures in parallel, supporting multiple resource constraints and outperforming ENAS and PNAS on CIFAR-10 and ImageNe...
Learning to Reformulate the Queries on the WEB
cs.IR 2019-07 unverdicted novelty 5.0

An unsupervised character-level CNN encoder with attention-based RNN decoder, trained on Clueweb09 anchor phrases, generates query reformulations that improve retrieval on TREC collections.
Do Transformer Attention Heads Provide Transparency in Abstractive Summarization?
cs.CL 2019-07 unverdicted novelty 5.0

Analysis of transformer attention heads in abstractive summarization shows specialization in some heads and proposes a method to measure model reliance on learned attention distributions.
The Impact of Preprocessing on Arabic-English Statistical and Neural Machine Translation
cs.CL 2019-06 unverdicted novelty 5.0

Tokenization scheme performance in Arabic-English MT depends on whether statistical or neural models are used and on data size, with hybrid system selection providing gains.
Conversational Response Re-ranking Based on Event Causality and Role Factored Tensor Event Embedding
cs.CL 2019-06 unverdicted novelty 5.0

Re-ranking conversational responses with event causality and role-factored tensor embeddings improves coherency and dialogue continuity.
Attention Is All You Need
cs.CL 2017-06 unverdicted novelty 5.0


Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · cited by 77 Pith papers · 2 internal anchors

[1]

TensorFlow: A system for large-scale machine learning

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: A system for large-scale machine learning. Tech. rep., Google Brain, 2016. arXiv preprint

work page 2016
[2]

Neural machine translation by jointly learning to align and translate

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (2015)

work page 2015
[3]

A statistical approach to language translation

Brown, P., Cocke, J., Pietra, S. D., Pietra, V. D., Jelinek, F., Mercer, R., and Roossin, P. A statistical approach to language translation. In Proceedings of the 12th Conference on Computational Linguistics - Volume 1 (Stroudsburg, PA, USA, 1988), COLING ’88, Association for Computational Linguistics, pp. 71–76

work page 1988
[4]

A statistical approach to machine translation

Brown, P. F., Cocke, J., Pietra, S. A. D., Pietra, V. J. D., Jelinek, F., Lafferty, J. D., Mercer, R. L., and Roossin, P. S. A statistical approach to machine translation. Computational Linguistics 16, 2 (1990), 79–85

work page 1990
[5]

The mathematics of statistical machine translation: Parameter estimation

Brown, P. F., Pietra, V. J. D., Pietra, S. A. D., and Mercer, R. L. The mathematics of statistical machine translation: Parameter estimation. Comput. Linguist. 19, 2 (June 1993), 263–311

work page 1993
[6]

N-gram counts and language models from the common crawl

Buck, C., Heafield, K., and Van Ooyen, B. N-gram counts and language models from the common crawl. In LREC (2014), vol. 2, Citeseer, p. 4

work page 2014
[7]

Learning phrase representations using RNN encoder-decoder for statistical machine translation

Cho, K., van Merrienboer, B., Gülçehre, Ç., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing (2014)

work page 2014
[8]

Learning recursive distributed representations for holistic computation

Chrisman, L. Learning recursive distributed representations for holistic computation. Connection Science 3, 4 (1991), 345–366

work page 1991
[10]

A Character-Level Decoder without Explicit Segmentation for Neural Machine Translation

Chung, J., Cho, K., and Bengio, Y. A character-level decoder without explicit segmentation for neural machine translation. CoRR abs/1603.06147 (2016)

work page Pith review arXiv 2016
[11]

Character-based Neural Machine Translation

Costa-Jussà, M. R., and Fonollosa, J. A. R. Character-based neural machine translation. CoRR abs/1603.00810 (2016)

work page Pith review arXiv 2016
[12]

Large scale distributed deep networks

Dean, J., Corrado, G. S., Monga, R., Chen, K., Devin, M., Le, Q. V., Mao, M. Z., Ranzato, M., Senior, A., Tucker, P., Yang, K., and Ng, A. Y. Large scale distributed deep networks. In NIPS (2012)

work page 2012
[13]

Fast and robust neural network joint models for statistical machine translation

Devlin, J., Zbib, R., Huang, Z., Lamar, T., Schwartz, R. M., and Makhoul, J. Fast and robust neural network joint models for statistical machine translation. In ACL (1) (2014), Citeseer, pp. 1370–1380

work page 2014
[14]

Multi-task learning for multiple language translation

Dong, D., Wu, H., He, W., Yu, D., and Wang, H. Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (2015), pp. 1723–1732

work page 2015
[15]

Edinburgh’s phrase-based machine translation systems for WMT-14

Durrani, N., Haddow, B., Koehn, P., and Heafield, K. Edinburgh’s phrase-based machine translation systems for WMT-14. In Proceedings of the Ninth Workshop on Statistical Machine Translation (2014), Association for Computational Linguistics, Baltimore, MD, USA, pp. 97–104

work page 2014
[16]

The cascade-correlation learning architecture

Fahlman, S. E., and Lebiere, C. The cascade-correlation learning architecture. In Advances in Neural Information Processing Systems 2 (1990), Morgan Kaufmann, pp. 524–532

work page 1990
[17]

Learning to forget: Continual prediction with LSTM

Gers, F. A., Schmidhuber, J., and Cummins, F. Learning to forget: Continual prediction with LSTM. Neural computation 12, 10 (2000), 2451–2471

work page 2000
[18]

Pointing the Unknown Words

Gülçehre, Ç., Ahn, S., Nallapati, R., Zhou, B., and Bengio, Y. Pointing the unknown words. CoRR abs/1603.08148 (2016)

work page Pith review arXiv 2016
[19]

Deep Learning with Limited Numerical Precision

Gupta, S., Agrawal, A., Gopalakrishnan, K., and Narayanan, P. Deep learning with limited numerical precision. CoRR abs/1502.02551 (2015)

work page Pith review arXiv 2015
[20]

Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding. CoRR abs/1510.00149 (2015)

work page internal anchor Pith review arXiv 2015
[21]

Deep residual learning for image recognition

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (2015)

work page 2015
[22]

Gradient flow in recurrent nets: the difficulty of learning long-term dependencies

Hochreiter, S., Bengio, Y., Frasconi, P., and Schmidhuber, J. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001

work page 2001
[23]

Long short-term memory

Hochreiter, S., and Schmidhuber, J. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780

work page 1997
[24]

Recurrent continuous translation models

Kalchbrenner, N., and Blunsom, P. Recurrent continuous translation models. In Conference on Empirical Methods in Natural Language Processing (2013)

work page 2013
[25]

Adam: A Method for Stochastic Optimization

Kingma, D. P., and Ba, J. Adam: A method for stochastic optimization. CoRR abs/1412.6980 (2014)

work page internal anchor Pith review Pith/arXiv arXiv 2014
[26]

Statistical phrase-based translation

Koehn, P., Och, F. J., and Marcu, D. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics (2003)

work page 2003
[27]

Ternary weight networks

Li, F., and Liu, B. Ternary weight networks. CoRR abs/1605.04711 (2016)

work page arXiv 2016
[28]

Luong, M., and Manning, C. D. Achieving open vocabulary neural machine translation with hybrid word-character models. CoRR abs/1604.00788 (2016)

work page Pith review arXiv 2016
[29]

Multi-task sequence to sequence learning

Luong, M.-T., Le, Q. V., Sutskever, I., Vinyals, O., and Kaiser, L. Multi-task sequence to sequence learning. In International Conference on Learning Representations (2015)

work page 2015
[30]

Luong, M.-T., Pham, H., and Manning, C. D. Effective approaches to attention-based neural machine translation. In Conference on Empirical Methods in Natural Language Processing (2015)

work page 2015
[31]

Addressing the rare word problem in neural machine translation

Luong, M.-T., Sutskever, I., Le, Q. V., Vinyals, O., and Zaremba, W. Addressing the rare word problem in neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (2015)

work page 2015
[32]

Reward augmented maximum likelihood for neural structured prediction

Norouzi, M., Bengio, S., Chen, Z., Jaitly, N., Schuster, M., Wu, Y., and Schuurmans, D. Reward augmented maximum likelihood for neural structured prediction. In Neural Information Processing Systems (2016)

work page 2016
[33]

On the difficulty of training Recurrent Neural Networks

Pascanu, R., Mikolov, T., and Bengio, Y. Understanding the exploding gradient problem. CoRR abs/1211.5063 (2012)

work page Pith review arXiv 2012
[34]

Sequence level training with recurrent neural networks

Ranzato, M., Chopra, S., Auli, M., and Zaremba, W. Sequence level training with recurrent neural networks. In International Conference on Learning Representations (2015)

work page 2015
[35]

Japanese and Korean voice search

Schuster, M., and Nakajima, K. Japanese and Korean voice search. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (2012)

work page 2012
[36]

Bidirectional recurrent neural networks

Schuster, M., and Paliwal, K. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45, 11 (Nov. 1997), 2673–2681

work page 1997
[37]

On using very large target vocabulary for neural machine translation

Jean, S., Cho, K., Memisevic, R., and Bengio, Y. On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (2015)

work page 2015
[38]

Neural machine translation of rare words with subword units

Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (2016)

work page 2016
[39]

Minimum risk training for neural machine translation

Shen, S., Cheng, Y., He, Z., He, W., Wu, H., Sun, M., and Liu, Y. Minimum risk training for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (2016)

work page 2016
[40]

Highway Networks

Srivastava, R. K., Greff, K., and Schmidhuber, J. Highway networks. CoRR abs/1505.00387 (2015)

work page Pith review arXiv 2015
[41]

Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (2014), pp. 3104–3112

work page 2014
[42]

Coverage-based neural machine translation

Tu, Z., Lu, Z., Liu, Y., Liu, X., and Li, H. Coverage-based neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (2016)

work page 2016
[43]

Quantized convolutional neural networks for mobile devices

Wu, J., Leng, C., Wang, Y., Hu, Q., and Cheng, J. Quantized convolutional neural networks for mobile devices. CoRR abs/1512.06473 (2015)

work page arXiv 2015
[44]

Recurrent neural network regularization

Zaremba, W., Sutskever, I., and Vinyals, O. Recurrent neural network regularization, 2014

work page 2014
[45]

Deep recurrent models with fast-forward connections for neural machine translation

Zhou, J., Cao, Y., Wang, X., Li, P., and Xu, W. Deep recurrent models with fast-forward connections for neural machine translation. CoRR abs/1606.04199 (2016)

Table 11: Some example translations from PBMT [15], our GNMT system (the "NMT before RL", Table 9), and Human. Source and target sentences (human translations) are from the public benchmark WMT ...

work page arXiv 2016
