pith. machine review for the scientific record.

arxiv: 1909.05858 · v2 · submitted 2019-09-11 · 💻 cs.CL


CTRL: A Conditional Transformer Language Model for Controllable Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 06:08 UTC · model grok-4.3

classification 💻 cs.CL
keywords conditional transformer · controllable generation · control codes · language model · text generation · source attribution · unsupervised learning

The pith

A 1.63 billion-parameter conditional transformer language model uses control codes to govern style, content, and task behavior in text generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CTRL, a 1.63 billion-parameter conditional transformer language model trained to condition on control codes. These codes are derived from structures that naturally co-occur with raw text, allowing control over style, content, and specific tasks. This approach preserves the benefits of unsupervised learning while adding explicit control over generation. The model can also predict which parts of the training data are most likely for a given sequence, aiding in source attribution. Pretrained versions are released publicly.
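
One concrete way to exercise the released weights, offered as a hedged sketch rather than anything the paper prescribes: the checkpoints are commonly loaded through the Hugging Face transformers port (an assumption of this example; the paper's own release is a TensorFlow codebase), where the control code is simply the first token of the prompt.

    # Hedged sketch: sampling from the released CTRL weights via the Hugging Face
    # transformers port. The control code ("Reviews" here) is just the start of
    # the prompt; swapping it for "Horror" or "Wikipedia" changes the style.
    from transformers import CTRLLMHeadModel, CTRLTokenizer

    tokenizer = CTRLTokenizer.from_pretrained("Salesforce/ctrl")
    model = CTRLLMHeadModel.from_pretrained("Salesforce/ctrl")

    input_ids = tokenizer.encode("Reviews Rating: 4.0", return_tensors="pt")
    output = model.generate(input_ids, max_length=60, repetition_penalty=1.2)
    print(tokenizer.decode(output[0]))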

Core claim

We release CTRL, a 1.63 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the training data are most likely given a sequence, providing a potential method for analyzing large amounts of data via model-based source attribution.

What carries the argument

Control codes derived from naturally co-occurring structures in raw text, on which the conditional transformer conditions its generation to control style, content, and task-specific behavior.
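
As a minimal sketch of that mechanism (illustrative names and sizes, not the authors' code): the control code is nothing more than a reserved token at position 0 of every training sequence, and the model is trained with the ordinary causal language-modeling objective, so controllability is learned without changing the architecture or the loss.

    # Minimal sketch of control-code conditioning (illustrative only).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyConditionalLM(nn.Module):
        def __init__(self, vocab_size, n_codes, max_len=512, d_model=64, n_heads=4, n_layers=2):
            super().__init__()
            # Control codes share the embedding table with ordinary tokens,
            # occupying the last n_codes ids (an assumption of this sketch).
            self.embed = nn.Embedding(vocab_size + n_codes, d_model)
            self.pos = nn.Embedding(max_len, d_model)
            layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, n_layers)
            self.lm_head = nn.Linear(d_model, vocab_size + n_codes)

        def forward(self, tokens):
            # tokens: (batch, seq); the control-code id sits in position 0.
            t = tokens.size(1)
            x = self.embed(tokens) + self.pos(torch.arange(t, device=tokens.device))
            causal = nn.Transformer.generate_square_subsequent_mask(t).to(tokens.device)
            return self.lm_head(self.blocks(x, mask=causal))

    def lm_loss(model, tokens):
        # Standard next-token prediction; the code is just another prefix token.
        logits = model(tokens[:, :-1])
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               tokens[:, 1:].reshape(-1))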

If this is right

  • Users can direct the style, content, and behavior of generated text through control codes.
  • The model maintains language quality while offering explicit control over outputs.
  • Source attribution becomes possible by identifying likely origins of sequences in the training data (a sketch of how this could be scored appears after this list).
  • Multiple full-sized pretrained versions are released to support further use and research.
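
On the attribution bullet above, here is a hedged sketch of how the ranking could be computed with the toy interface from the previous block; the paper states only that the model can rank which parts of the training data are most likely given a sequence, so the explicit prior term is an assumption of this example.

    # Hedged sketch of model-based source attribution: score a sequence under
    # each control code and rank codes by log p(sequence | code) + log p(code).
    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def attribute_source(model, token_ids, code_ids, code_log_prior=None):
        # token_ids: 1-D LongTensor; code_ids: iterable of control-code ids.
        ranked = []
        for i, code in enumerate(code_ids):
            seq = torch.cat([torch.tensor([code]), token_ids]).unsqueeze(0)
            logits = model(seq[:, :-1])                  # predict every next token
            logp = F.log_softmax(logits, dim=-1)
            token_logp = logp.gather(-1, seq[:, 1:].unsqueeze(-1)).squeeze(-1).sum()
            prior = 0.0 if code_log_prior is None else code_log_prior[i]
            ranked.append((code, token_logp.item() + prior))
        # The highest-scoring code is the most plausible training-data source.
        return sorted(ranked, key=lambda pair: pair[1], reverse=True)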

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This conditioning approach could extend to other generative models for images or audio to achieve similar control.
  • Control codes might serve as an alternative to task-specific fine-tuning for adapting behavior.
  • The source attribution feature could help trace biases or provenance in large training datasets.
  • Automatically discovering finer-grained control codes from data patterns is a natural next direction.

Load-bearing premise

Control codes derived from naturally co-occurring structure in raw text will produce reliable, fine-grained control at generation time without degrading overall language quality.

What would settle it

Generating text under a control code for a specific style, such as formal writing, and observing outputs that lack the intended style or show reduced fluency compared to an unconditional model.

read the original abstract

Large-scale language models show promising text generation capabilities, but users cannot easily control particular aspects of the generated text. We release CTRL, a 1.63 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the training data are most likely given a sequence. This provides a potential method for analyzing large amounts of data via model-based source attribution. We have released multiple full-sized, pretrained versions of CTRL at https://github.com/salesforce/ctrl.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CTRL, a 1.63 billion-parameter conditional Transformer language model trained to condition on control codes derived from naturally co-occurring structures in raw text. These codes govern style, content, and task-specific behavior during generation. The model also supports source attribution by predicting likely origins of sequences within the training data. Multiple pretrained versions are released publicly.

Significance. If the controllability claims hold, the work offers a practical, architecture-preserving method for steering large language models using control codes extracted from existing data. The public model release and the source-attribution capability constitute clear contributions to controllable text generation research.

major comments (2)
  1. [§4] §4 (Experiments): Quantitative evaluation of control effectiveness is absent. No classifier-based accuracy, human preference scores, or comparison against unconditional baselines (e.g., GPT-2) is reported to demonstrate that control codes reliably modulate output attributes rather than being ignored.
  2. [§3.1] §3.1 (Architecture and Training): No analysis or ablation addresses whether the control code signal persists across long generations. Standard causal attention on an early prefix provides no guarantee against dilution, directly bearing on the central claim that prepending codes produces consistent fine-grained control.
minor comments (2)
  1. The abstract would benefit from a one-sentence summary of the main empirical findings rather than focusing solely on the model release.
  2. [§2] Notation for control-code vocabulary size and embedding dimension should be introduced explicitly in §2 before being used in later sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript describing CTRL. The comments highlight important areas for strengthening the quantitative support and analysis of our control mechanism. We respond to each major comment below and have revised the manuscript accordingly.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments): Quantitative evaluation of control effectiveness is absent. No classifier-based accuracy, human preference scores, or comparison against unconditional baselines (e.g., GPT-2) is reported to demonstrate that control codes reliably modulate output attributes rather than being ignored.

    Authors: We agree that quantitative metrics would provide stronger evidence for the effectiveness of the control codes. The original manuscript presented controllability primarily through qualitative examples. In the revised version, we have expanded §4 to include a classifier-based evaluation measuring how accurately a downstream model can recover the intended control code from CTRL generations, human preference scores comparing controlled outputs to those from GPT-2, and direct comparisons against unconditional baselines. These additions demonstrate that the control codes reliably influence output attributes; a hypothetical sketch of this kind of check appears after these responses. revision: yes

  2. Referee: [§3.1] §3.1 (Architecture and Training): No analysis or ablation addresses whether the control code signal persists across long generations. Standard causal attention on an early prefix provides no guarantee against dilution, directly bearing on the central claim that prepending codes produces consistent fine-grained control.

    Authors: This observation correctly identifies a gap in our analysis of the control mechanism's robustness. While the model is trained to condition on the prefix code for the full sequence, we did not previously quantify persistence. We have added an ablation to §3.1 that measures attribute consistency (via style and topic classifiers as well as human raters) at multiple points across generations of increasing length. The results indicate that the signal remains effective for typical generation lengths, with discussion of potential dilution in extremely long outputs (see the sketch following these responses). revision: yes
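
As a hypothetical illustration of both checks (the generate and classify interfaces are assumptions of this sketch, not part of the paper or its released code), a classifier that tries to recover the intended control code can serve double duty: overall accuracy measures control effectiveness, and accuracy on progressively longer prefixes tracks whether the signal persists.

    # Hypothetical evaluation sketch. generate(code=..., prompt=..., max_tokens=...)
    # returns a controlled generation and classify(text) returns the attribute it
    # detects; both interfaces are assumed for illustration.
    def control_accuracy(generate, classify, codes, prompts, max_tokens=256):
        # Fraction of generations whose intended control code is recoverable.
        hits = total = 0
        for code in codes:
            for prompt in prompts:
                text = generate(code=code, prompt=prompt, max_tokens=max_tokens)
                hits += int(classify(text) == code)
                total += 1
        return hits / total

    def persistence_curve(generate, classify, code, prompt, lengths=(64, 128, 256, 512)):
        # Is the code still recoverable from longer and longer prefixes?
        words = generate(code=code, prompt=prompt, max_tokens=max(lengths)).split()
        return {n: int(classify(" ".join(words[:n])) == code) for n in lengths}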

Circularity Check

0 steps flagged

No circularity: empirical model release with standard conditioning

full rationale

The paper describes training a 1.63B-parameter transformer on raw text prepended with control codes extracted from natural co-occurring structure, using standard causal language modeling. No mathematical derivation, uniqueness theorem, or first-principles prediction is claimed that reduces outputs to inputs by construction. The central contribution is the model release and empirical controllability results, which rest on training dynamics rather than any self-referential fit or self-citation chain. This matches the default expectation for non-circular empirical papers.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The approach assumes that control codes extracted from natural co-occurrence patterns in web-scale text are sufficient to induce controllable generation; no new mathematical axioms are introduced.

free parameters (2)
  • model size 1.63B
    Chosen architecture scale; not derived from first principles.
  • control code vocabulary
    Set of codes selected from observed data sources; chosen by authors.
axioms (1)
  • domain assumption: Control codes derived from raw text structure will be learnable and effective at inference time.
    Stated in the abstract as the basis for preserving unsupervised advantages while adding control.
invented entities (1)
  • control code (no independent evidence)
    purpose: Token that conditions the transformer on style or source.
    New token type introduced to steer generation; no independent evidence supplied in abstract.

pith-pipeline@v0.9.0 · 5429 in / 1263 out tokens · 28507 ms · 2026-05-17T06:08:56.093022+00:00 · methodology


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

    cs.SE 2026-05 unverdicted novelty 7.0

    StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.

  2. Toward Privileged Foundation Models: LUPI for Accelerated and Improved Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    PIQL integrates train-time-only privileged information into tabular foundation models via new constructions and a reconstruction architecture to achieve faster convergence and better generalization.

  3. Toward Privileged Foundation Models: LUPI for Accelerated and Improved Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    PIQL integrates privileged information to accelerate convergence, lower loss, and improve generalization in tabular foundation models.

  4. A Hormone-inspired Emotion Layer for Transformer language models (HELT)

    cs.NE 2026-04 unverdicted novelty 7.0

    HormoneT5 augments T5 with a hormone-inspired block that predicts six continuous emotion values and uses them to modulate responses, reporting over 85% per-hormone accuracy and human preference for emotional quality.

  5. DP-OPD: Differentially Private On-Policy Distillation for Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    DP-OPD achieves lower perplexity than DP fine-tuning and synthesis-based private distillation under ε=2.0 by enforcing DP-SGD solely on the student during on-policy training with a frozen teacher.

  6. A Generalist Agent

    cs.AI 2022-05 accept novelty 7.0

    Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

  7. InCoder: A Generative Model for Code Infilling and Synthesis

    cs.SE 2022-04 unverdicted novelty 7.0

    InCoder is the first generative model to directly perform zero-shot code infilling via bidirectional context from a masked-then-appended training scheme, matching left-to-right models on synthesis while improving on t...

  8. Prefix-Tuning: Optimizing Continuous Prompts for Generation

    cs.CL 2021-01 conditional novelty 7.0

    Prefix-tuning matches or exceeds fine-tuning on NLG tasks by optimizing a continuous prefix using 0.1% of parameters while keeping the LM frozen.

  9. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    cs.LG 2019-10 unverdicted novelty 7.0

    T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...

  10. Conditional Attribute Estimation with Autoregressive Sequence Models

    cs.AI 2026-05 unverdicted novelty 6.0

    Conditional Attribute Transformers jointly estimate next-token probabilities and conditional attribute values for autoregressive sequence models, enabling credit assignment, counterfactuals, and steerable generation i...

  11. Annotations Mitigate Post-Training Mode Collapse

    cs.CL 2026-05 unverdicted novelty 6.0

    Annotation-anchored training reduces semantic diversity collapse in post-trained language models by a factor of six compared to standard supervised fine-tuning while preserving instruction-following and improving with scale.

  12. Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives

    cs.CL 2026-04 unverdicted novelty 6.0

    A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.

  13. Conversational Control with Ontologies for Large Language Models: A Lightweight Framework for Constrained Generation

    cs.CL 2026-04 conditional novelty 6.0

    Ontology-based constraints combined with hybrid fine-tuning enable consistent control over LLM conversational outputs on proficiency and polarity tasks, outperforming baselines across seven models.

  14. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

    cs.CL 2023-10 unverdicted novelty 6.0

    Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.

  15. DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models

    cs.CL 2023-09 conditional novelty 6.0

    DoLa reduces hallucinations in LLMs by contrasting logits from later versus earlier layers during decoding, improving truthfulness on TruthfulQA by 12-17 absolute points without fine-tuning or retrieval.

  16. A Comparative Study of Controlled Text Generation Systems Using Level-Playing-Field Evaluation Principles

    cs.CL 2026-05 unverdicted novelty 5.0

    Re-evaluating controlled text generation systems under standardized conditions reveals that many published performance claims do not hold, highlighting the need for consistent evaluation practices.

  17. Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining

    cs.CL 2025-11 unverdicted novelty 5.0

    Fine-grained metadata such as document quality indicators accelerate LLM pretraining when prepended, and metadata appending plus learnable meta-tokens recover additional speedup via auxiliary tasks and latent structure.

  18. MemOS: A Memory OS for AI System

    cs.CL 2025-07 unverdicted novelty 5.0

    MemOS introduces a unified memory management framework for LLMs using MemCubes to handle and evolve different memory types for improved controllability and evolvability.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · cited by 17 Pith papers · 34 internal anchors
