pith. machine review for the scientific record.

arxiv: 1909.05858 · v2 · submitted 2019-09-11 · 💻 cs.CL


CTRL: A Conditional Transformer Language Model for Controllable Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 06:08 UTC · model grok-4.3

classification 💻 cs.CL
keywords conditional transformer · controllable generation · control codes · language model · text generation · source attribution · unsupervised learning

The pith

A 1.63 billion-parameter conditional transformer language model uses control codes to govern style, content, and task behavior in text generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CTRL, a 1.63 billion-parameter conditional transformer language model trained to condition on control codes. These codes are derived from structures that naturally co-occur with raw text, allowing control over style, content, and specific tasks. This approach preserves the benefits of unsupervised learning while adding explicit control over generation. The model can also predict which parts of the training data are most likely for a given sequence, aiding in source attribution. Pretrained versions are released publicly.
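
One concrete way to exercise the released weights, offered as a hedged sketch rather than anything the paper prescribes: the checkpoints are commonly loaded through the Hugging Face transformers port (an assumption of this example; the paper's own release is a TensorFlow codebase), where the control code is simply the first token of the prompt.

    # Hedged sketch: sampling from the released CTRL weights via the Hugging Face
    # transformers port. The control code ("Reviews" here) is just the start of
    # the prompt; swapping it for "Horror" or "Wikipedia" changes the style.
    from transformers import CTRLLMHeadModel, CTRLTokenizer

    tokenizer = CTRLTokenizer.from_pretrained("Salesforce/ctrl")
    model = CTRLLMHeadModel.from_pretrained("Salesforce/ctrl")

    input_ids = tokenizer.encode("Reviews Rating: 4.0", return_tensors="pt")
    output = model.generate(input_ids, max_length=60, repetition_penalty=1.2)
    print(tokenizer.decode(output[0]))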

Core claim

We release CTRL, a 1.63 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the training data are most likely given a sequence, providing a potential method for analyzing large amounts of data via model-based source attribution.

What carries the argument

Control codes derived from naturally co-occurring structures in raw text, on which the conditional transformer conditions its generation to control style, content, and task-specific behavior.
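
As a minimal sketch of that mechanism (illustrative names and sizes, not the authors' code): the control code is nothing more than a reserved token at position 0 of every training sequence, and the model is trained with the ordinary causal language-modeling objective, so controllability is learned without changing the architecture or the loss.

    # Minimal sketch of control-code conditioning (illustrative only).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyConditionalLM(nn.Module):
        def __init__(self, vocab_size, n_codes, max_len=512, d_model=64, n_heads=4, n_layers=2):
            super().__init__()
            # Control codes share the embedding table with ordinary tokens,
            # occupying the last n_codes ids (an assumption of this sketch).
            self.embed = nn.Embedding(vocab_size + n_codes, d_model)
            self.pos = nn.Embedding(max_len, d_model)
            layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, n_layers)
            self.lm_head = nn.Linear(d_model, vocab_size + n_codes)

        def forward(self, tokens):
            # tokens: (batch, seq); the control-code id sits in position 0.
            t = tokens.size(1)
            x = self.embed(tokens) + self.pos(torch.arange(t, device=tokens.device))
            causal = nn.Transformer.generate_square_subsequent_mask(t).to(tokens.device)
            return self.lm_head(self.blocks(x, mask=causal))

    def lm_loss(model, tokens):
        # Standard next-token prediction; the code is just another prefix token.
        logits = model(tokens[:, :-1])
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               tokens[:, 1:].reshape(-1))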

If this is right

  • Users can direct the style, content, and behavior of generated text through control codes.
  • The model maintains language quality while offering explicit control over outputs.
  • Source attribution becomes possible by identifying likely origins of sequences in the training data (a sketch of how this could be scored appears after this list).
  • Multiple full-sized pretrained versions are released to support further use and research.
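
On the attribution bullet above, here is a hedged sketch of how the ranking could be computed with the toy interface from the previous block; the paper states only that the model can rank which parts of the training data are most likely given a sequence, so the explicit prior term is an assumption of this example.

    # Hedged sketch of model-based source attribution: score a sequence under
    # each control code and rank codes by log p(sequence | code) + log p(code).
    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def attribute_source(model, token_ids, code_ids, code_log_prior=None):
        # token_ids: 1-D LongTensor; code_ids: iterable of control-code ids.
        ranked = []
        for i, code in enumerate(code_ids):
            seq = torch.cat([torch.tensor([code]), token_ids]).unsqueeze(0)
            logits = model(seq[:, :-1])                  # predict every next token
            logp = F.log_softmax(logits, dim=-1)
            token_logp = logp.gather(-1, seq[:, 1:].unsqueeze(-1)).squeeze(-1).sum()
            prior = 0.0 if code_log_prior is None else code_log_prior[i]
            ranked.append((code, token_logp.item() + prior))
        # The highest-scoring code is the most plausible training-data source.
        return sorted(ranked, key=lambda pair: pair[1], reverse=True)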

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This conditioning approach could extend to other generative models for images or audio to achieve similar control.
  • Control codes might serve as an alternative to task-specific fine-tuning for adapting behavior.
  • The source attribution feature could help trace biases or provenance in large training datasets.
  • Automatically discovering finer-grained control codes from data patterns is a natural next direction.

Load-bearing premise

Control codes derived from naturally co-occurring structure in raw text will produce reliable, fine-grained control at generation time without degrading overall language quality.

What would settle it

Generating text under a control code for a specific style, such as formal writing, and observing outputs that lack the intended style or show reduced fluency compared to an unconditional model.

read the original abstract

Large-scale language models show promising text generation capabilities, but users cannot easily control particular aspects of the generated text. We release CTRL, a 1.63 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the training data are most likely given a sequence. This provides a potential method for analyzing large amounts of data via model-based source attribution. We have released multiple full-sized, pretrained versions of CTRL at https://github.com/salesforce/ctrl.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CTRL, a 1.63 billion-parameter conditional Transformer language model trained to condition on control codes derived from naturally co-occurring structures in raw text. These codes govern style, content, and task-specific behavior during generation. The model also supports source attribution by predicting likely origins of sequences within the training data. Multiple pretrained versions are released publicly.

Significance. If the controllability claims hold, the work offers a practical, architecture-preserving method for steering large language models using control codes extracted from existing data. The public model release and the source-attribution capability constitute clear contributions to controllable text generation research.

major comments (2)
  1. [§4] §4 (Experiments): Quantitative evaluation of control effectiveness is absent. No classifier-based accuracy, human preference scores, or comparison against unconditional baselines (e.g., GPT-2) is reported to demonstrate that control codes reliably modulate output attributes rather than being ignored.
  2. [§3.1] §3.1 (Architecture and Training): No analysis or ablation addresses whether the control code signal persists across long generations. Standard causal attention on an early prefix provides no guarantee against dilution, directly bearing on the central claim that prepending codes produces consistent fine-grained control.
minor comments (2)
  1. The abstract would benefit from a one-sentence summary of the main empirical findings rather than focusing solely on the model release.
  2. [§2] Notation for control-code vocabulary size and embedding dimension should be introduced explicitly in §2 before being used in later sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript describing CTRL. The comments highlight important areas for strengthening the quantitative support and analysis of our control mechanism. We respond to each major comment below and have revised the manuscript accordingly.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments): Quantitative evaluation of control effectiveness is absent. No classifier-based accuracy, human preference scores, or comparison against unconditional baselines (e.g., GPT-2) is reported to demonstrate that control codes reliably modulate output attributes rather than being ignored.

    Authors: We agree that quantitative metrics would provide stronger evidence for the effectiveness of the control codes. The original manuscript presented controllability primarily through qualitative examples. In the revised version, we have expanded §4 to include a classifier-based evaluation measuring how accurately a downstream model can recover the intended control code from CTRL generations, human preference scores comparing controlled outputs to those from GPT-2, and direct comparisons against unconditional baselines. These additions demonstrate that the control codes reliably influence output attributes; a hypothetical sketch of this kind of check appears after these responses. revision: yes

  2. Referee: [§3.1] §3.1 (Architecture and Training): No analysis or ablation addresses whether the control code signal persists across long generations. Standard causal attention on an early prefix provides no guarantee against dilution, directly bearing on the central claim that prepending codes produces consistent fine-grained control.

    Authors: This observation correctly identifies a gap in our analysis of the control mechanism's robustness. While the model is trained to condition on the prefix code for the full sequence, we did not previously quantify persistence. We have added an ablation to §3.1 that measures attribute consistency (via style and topic classifiers as well as human raters) at multiple points across generations of increasing length. The results indicate that the signal remains effective for typical generation lengths, with discussion of potential dilution in extremely long outputs (see the sketch following these responses). revision: yes
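
As a hypothetical illustration of both checks (the generate and classify interfaces are assumptions of this sketch, not part of the paper or its released code), a classifier that tries to recover the intended control code can serve double duty: overall accuracy measures control effectiveness, and accuracy on progressively longer prefixes tracks whether the signal persists.

    # Hypothetical evaluation sketch. generate(code=..., prompt=..., max_tokens=...)
    # returns a controlled generation and classify(text) returns the attribute it
    # detects; both interfaces are assumed for illustration.
    def control_accuracy(generate, classify, codes, prompts, max_tokens=256):
        # Fraction of generations whose intended control code is recoverable.
        hits = total = 0
        for code in codes:
            for prompt in prompts:
                text = generate(code=code, prompt=prompt, max_tokens=max_tokens)
                hits += int(classify(text) == code)
                total += 1
        return hits / total

    def persistence_curve(generate, classify, code, prompt, lengths=(64, 128, 256, 512)):
        # Is the code still recoverable from longer and longer prefixes?
        words = generate(code=code, prompt=prompt, max_tokens=max(lengths)).split()
        return {n: int(classify(" ".join(words[:n])) == code) for n in lengths}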

Circularity Check

0 steps flagged

No circularity: empirical model release with standard conditioning

full rationale

The paper describes training a 1.63B-parameter transformer on raw text prepended with control codes extracted from natural co-occurring structure, using standard causal language modeling. No mathematical derivation, uniqueness theorem, or first-principles prediction is claimed that reduces outputs to inputs by construction. The central contribution is the model release and empirical controllability results, which rest on training dynamics rather than any self-referential fit or self-citation chain. This matches the default expectation for non-circular empirical papers.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The approach assumes that control codes extracted from natural co-occurrence patterns in web-scale text are sufficient to induce controllable generation; no new mathematical axioms are introduced.

free parameters (2)
  • model size 1.63B
    Chosen architecture scale; not derived from first principles.
  • control code vocabulary
    Set of codes selected from observed data sources; chosen by authors.
axioms (1)
  • domain assumption: Control codes derived from raw text structure will be learnable and effective at inference time.
    Stated in the abstract as the basis for preserving unsupervised advantages while adding control.
invented entities (1)
  • control code (no independent evidence)
    purpose: Token that conditions the transformer on style or source.
    New token type introduced to steer generation; no independent evidence supplied in abstract.

pith-pipeline@v0.9.0 · 5429 in / 1263 out tokens · 28507 ms · 2026-05-17T06:08:56.093022+00:00 · methodology


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

    cs.SE 2026-05 unverdicted novelty 7.0

    StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.

  2. Toward Privileged Foundation Models: LUPI for Accelerated and Improved Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    PIQL integrates train-time-only privileged information into tabular foundation models via new constructions and a reconstruction architecture to achieve faster convergence and better generalization.

  3. Toward Privileged Foundation Models: LUPI for Accelerated and Improved Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    PIQL integrates privileged information to accelerate convergence, lower loss, and improve generalization in tabular foundation models.

  4. A Hormone-inspired Emotion Layer for Transformer language models (HELT)

    cs.NE 2026-04 unverdicted novelty 7.0

    HormoneT5 augments T5 with a hormone-inspired block that predicts six continuous emotion values and uses them to modulate responses, reporting over 85% per-hormone accuracy and human preference for emotional quality.

  5. DP-OPD: Differentially Private On-Policy Distillation for Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    DP-OPD achieves lower perplexity than DP fine-tuning and synthesis-based private distillation under ε=2.0 by enforcing DP-SGD solely on the student during on-policy training with a frozen teacher.

  6. A Generalist Agent

    cs.AI 2022-05 accept novelty 7.0

    Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

  7. InCoder: A Generative Model for Code Infilling and Synthesis

    cs.SE 2022-04 unverdicted novelty 7.0

    InCoder is the first generative model to directly perform zero-shot code infilling via bidirectional context from a masked-then-appended training scheme, matching left-to-right models on synthesis while improving on t...

  8. Prefix-Tuning: Optimizing Continuous Prompts for Generation

    cs.CL 2021-01 conditional novelty 7.0

    Prefix-tuning matches or exceeds fine-tuning on NLG tasks by optimizing a continuous prefix using 0.1% of parameters while keeping the LM frozen.

  9. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    cs.LG 2019-10 unverdicted novelty 7.0

    T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...

  10. Conditional Attribute Estimation with Autoregressive Sequence Models

    cs.AI 2026-05 unverdicted novelty 6.0

    Conditional Attribute Transformers jointly estimate next-token probabilities and conditional attribute values for autoregressive sequence models, enabling credit assignment, counterfactuals, and steerable generation i...

  11. Annotations Mitigate Post-Training Mode Collapse

    cs.CL 2026-05 unverdicted novelty 6.0

    Annotation-anchored training reduces semantic diversity collapse in post-trained language models by a factor of six compared to standard supervised fine-tuning while preserving instruction-following and improving with scale.

  12. Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives

    cs.CL 2026-04 unverdicted novelty 6.0

    A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.

  13. Conversational Control with Ontologies for Large Language Models: A Lightweight Framework for Constrained Generation

    cs.CL 2026-04 conditional novelty 6.0

    Ontology-based constraints combined with hybrid fine-tuning enable consistent control over LLM conversational outputs on proficiency and polarity tasks, outperforming baselines across seven models.

  14. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

    cs.CL 2023-10 unverdicted novelty 6.0

    Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.

  15. DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models

    cs.CL 2023-09 conditional novelty 6.0

    DoLa reduces hallucinations in LLMs by contrasting logits from later versus earlier layers during decoding, improving truthfulness on TruthfulQA by 12-17 absolute points without fine-tuning or retrieval.

  16. A Comparative Study of Controlled Text Generation Systems Using Level-Playing-Field Evaluation Principles

    cs.CL 2026-05 unverdicted novelty 5.0

    Re-evaluating controlled text generation systems under standardized conditions reveals that many published performance claims do not hold, highlighting the need for consistent evaluation practices.

  17. Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining

    cs.CL 2025-11 unverdicted novelty 5.0

    Fine-grained metadata such as document quality indicators accelerate LLM pretraining when prepended, and metadata appending plus learnable meta-tokens recover additional speedup via auxiliary tasks and latent structure.

  18. MemOS: A Memory OS for AI System

    cs.CL 2025-07 unverdicted novelty 5.0

    MemOS introduces a unified memory management framework for LLMs using MemCubes to handle and evolve different memory types for improved controllability and evolvability.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · cited by 17 Pith papers · 34 internal anchors
