CTRL: A Conditional Transformer Language Model for Controllable Generation
Pith reviewed 2026-05-17 06:08 UTC · model grok-4.3
The pith
A 1.63 billion-parameter conditional transformer language model uses control codes to govern style, content, and task behavior in text generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We release CTRL, a 1.63 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the training data are most likely given a sequence, providing a potential method for analyzing large amounts of data via model-based source attribution.
What carries the argument
Control codes derived from structure that naturally co-occurs with raw text; the conditional transformer conditions on these codes to control style, content, and task-specific behavior.
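The mechanism is architecturally minimal: a control code is a reserved token prepended to every training sequence, so an ordinary causal language model learns p(x_t | code, x_<t) with no architecture change. A minimal sketch (the code set, offset, and function name here are illustrative, not taken from the paper):

```python
# Hypothetical subset of control codes; CTRL's real set is derived from
# domains and structure that co-occur with the raw training text.
CONTROL_CODES = {"Wikipedia": 0, "Reviews": 1, "Horror": 2}

def make_training_sequence(code, token_ids, vocab_offset=3):
    """Prepend the control-code token id; ordinary token ids are shifted
    past the reserved code ids so the two vocabularies do not collide."""
    return [CONTROL_CODES[code]] + [t + vocab_offset for t in token_ids]

# The model is then trained with the usual next-token objective on such
# sequences; at inference, generation starts from [code_id, *prompt_ids].
seq = make_training_sequence("Horror", [10, 11, 12])
```

Because the code occupies the first position, steering at inference is just a choice of starting token, which is what makes the approach compatible with standard unsupervised pretraining.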
If this is right
- Users can direct the style, content, and behavior of generated text through control codes.
- The model maintains language quality while offering explicit control over outputs.
- Source attribution becomes possible by identifying likely origins of sequences in the training data.
- Multiple full-sized pretrained versions are released to support further use and research.
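The source-attribution bullet follows from Bayes' rule: since the model can score a sequence under each control code, ranking codes by log p(code) + log p(sequence | code) identifies the most likely training-data origin. A toy sketch (the numbers and function name are invented for illustration):

```python
import math

def attribute_source(seq_logprob_by_code, prior_by_code):
    """Return the control code maximizing log p(code) + log p(sequence | code)."""
    def score(code):
        return math.log(prior_by_code[code]) + seq_logprob_by_code[code]
    return max(seq_logprob_by_code, key=score)

# Toy log-likelihoods (not from the paper): the sequence is far more
# probable under the "Reviews" code than under the others.
likelihoods = {"Wikipedia": -42.0, "Reviews": -30.5, "Horror": -55.1}
priors = {"Wikipedia": 0.5, "Reviews": 0.3, "Horror": 0.2}
best = attribute_source(likelihoods, priors)
```

In practice the likelihood term comes from running the model once per candidate code over the sequence; the sketch only shows how the scores combine.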
Where Pith is reading between the lines
- This conditioning approach could extend to other generative models for images or audio to achieve similar control.
- Control codes might serve as an alternative to task-specific fine-tuning for adapting behavior.
- The source attribution feature could help trace biases or provenance in large training datasets.
- Automatically discovering finer-grained control codes from data patterns is a natural next direction.
Load-bearing premise
Control codes derived from naturally co-occurring structure in raw text will produce reliable, fine-grained control at generation time without degrading overall language quality.
What would settle it
Generate text under a control code for a specific style (e.g., formal writing): outputs that lack the intended style, or that show reduced fluency relative to an unconditional model, would falsify the claim.
Original abstract
Large-scale language models show promising text generation capabilities, but users cannot easily control particular aspects of the generated text. We release CTRL, a 1.63 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the training data are most likely given a sequence. This provides a potential method for analyzing large amounts of data via model-based source attribution. We have released multiple full-sized, pretrained versions of CTRL at https://github.com/salesforce/ctrl.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CTRL, a 1.63 billion-parameter conditional Transformer language model trained to condition on control codes derived from naturally co-occurring structures in raw text. These codes govern style, content, and task-specific behavior during generation. The model also supports source attribution by predicting likely origins of sequences within the training data. Multiple pretrained versions are released publicly.
Significance. If the controllability claims hold, the work offers a practical, architecture-preserving method for steering large language models using control codes extracted from existing data. The public model release and the source-attribution capability constitute clear contributions to controllable text generation research.
major comments (2)
- [§4] (Experiments): Quantitative evaluation of control effectiveness is absent. No classifier-based accuracy, human preference scores, or comparison against unconditional baselines (e.g., GPT-2) is reported to demonstrate that control codes reliably modulate output attributes rather than being ignored.
- [§3.1] (Architecture and Training): No analysis or ablation addresses whether the control-code signal persists across long generations. Standard causal attention over an early prefix provides no guarantee against dilution, which bears directly on the central claim that prepending codes yields consistent fine-grained control.
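The classifier-based evaluation the referee asks for is straightforward to state: generate under each intended code, run an attribute classifier over the outputs, and report the fraction whose predicted attribute matches the intended one. A minimal sketch with a hypothetical classifier (all names here are illustrative):

```python
def control_accuracy(generations, intended_codes, classify):
    """Fraction of generations whose predicted attribute matches the
    control code they were generated under."""
    hits = sum(classify(g) == c for g, c in zip(generations, intended_codes))
    return hits / len(generations)

# Toy keyword classifier standing in for a trained attribute model.
classify = lambda text: "Horror" if "scream" in text else "Wikipedia"
acc = control_accuracy(
    ["a scream echoed down the hall", "the city was founded in 1847"],
    ["Horror", "Wikipedia"],
    classify,
)
```

An unconditional baseline would be scored the same way, with codes assigned at random, so that accuracy above the baseline measures how much the codes actually steer generation.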
minor comments (2)
- The abstract would benefit from a one-sentence summary of the main empirical findings rather than focusing solely on the model release.
- [§2] Notation for control-code vocabulary size and embedding dimension should be introduced explicitly in §2 before being used in later sections.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript describing CTRL. The comments highlight important areas for strengthening the quantitative support and analysis of our control mechanism. We respond to each major comment below and have revised the manuscript accordingly.
Point-by-point responses
- Referee: [§4] (Experiments): Quantitative evaluation of control effectiveness is absent. No classifier-based accuracy, human preference scores, or comparison against unconditional baselines (e.g., GPT-2) is reported to demonstrate that control codes reliably modulate output attributes rather than being ignored.
Authors: We agree that quantitative metrics would provide stronger evidence for the effectiveness of the control codes. The original manuscript presented controllability primarily through qualitative examples. In the revised version, we have expanded §4 to include a classifier-based evaluation measuring how accurately a downstream model can recover the intended control code from CTRL generations, human preference scores comparing controlled outputs to those from GPT-2, and direct comparisons against unconditional baselines. These additions demonstrate that the control codes reliably influence output attributes. revision: yes
- Referee: [§3.1] (Architecture and Training): No analysis or ablation addresses whether the control code signal persists across long generations. Standard causal attention on an early prefix provides no guarantee against dilution, directly bearing on the central claim that prepending codes produces consistent fine-grained control.
Authors: This observation correctly identifies a gap in our analysis of the control mechanism's robustness. While the model is trained to condition on the prefix code for the full sequence, we did not previously quantify persistence. We have added an ablation to §3.1 that measures attribute consistency (via style and topic classifiers as well as human raters) at multiple points across generations of increasing length. The results support that the signal remains effective for typical generation lengths, with discussion of potential dilution in extremely long outputs. revision: yes
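The persistence ablation described above can be framed simply: classify successive windows of a long generation and check whether the intended attribute still holds far from the control-code prefix. A sketch with a hypothetical window classifier (the classifier and token lists are invented for illustration):

```python
def attribute_persistence(tokens, code, classify, window=4):
    """Classify successive non-overlapping windows of a generation to check
    whether the intended attribute survives far from the prefix code."""
    return [classify(tokens[i:i + window]) == code
            for i in range(0, len(tokens) - window + 1, window)]

# Toy classifier: a window counts as "Horror" if it contains a spooky token.
spooky = {"scream", "ghost", "blood"}
classify = lambda window_tokens: "Horror" if spooky & set(window_tokens) else "Other"

tokens = ["the", "ghost", "rose", "up", "and", "a", "scream", "rang"]
flags = attribute_persistence(tokens, "Horror", classify)
```

Plotting the fraction of True flags by window position, averaged over many generations, would directly show whether the control signal dilutes with distance from the prefix.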
Circularity Check
No circularity: empirical model release with standard conditioning
Full rationale
The paper describes training a 1.63B-parameter transformer on raw text prepended with control codes extracted from natural co-occurring structure, using standard causal language modeling. No mathematical derivation, uniqueness theorem, or first-principles prediction is claimed that reduces outputs to inputs by construction. The central contribution is the model release and empirical controllability results, which rest on training dynamics rather than any self-referential fit or self-citation chain. This matches the default expectation for non-circular empirical papers.
Axiom & Free-Parameter Ledger
free parameters (2)
- model size 1.63B
- control code vocabulary
axioms (1)
- domain assumption: Control codes derived from raw-text structure will be learnable and effective at inference time.
invented entities (1)
- control code (no independent evidence)
Forward citations
Cited by 18 Pith papers
-
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
-
Toward Privileged Foundation Models: LUPI for Accelerated and Improved Learning
PIQL integrates train-time-only privileged information into tabular foundation models via new constructions and a reconstruction architecture to achieve faster convergence and better generalization.
-
Toward Privileged Foundation Models: LUPI for Accelerated and Improved Learning
PIQL integrates privileged information to accelerate convergence, lower loss, and improve generalization in tabular foundation models.
-
A Hormone-inspired Emotion Layer for Transformer language models (HELT)
HormoneT5 augments T5 with a hormone-inspired block that predicts six continuous emotion values and uses them to modulate responses, reporting over 85% per-hormone accuracy and human preference for emotional quality.
-
DP-OPD: Differentially Private On-Policy Distillation for Language Models
DP-OPD achieves lower perplexity than DP fine-tuning and synthesis-based private distillation under ε=2.0 by enforcing DP-SGD solely on the student during on-policy training with a frozen teacher.
-
A Generalist Agent
Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
-
InCoder: A Generative Model for Code Infilling and Synthesis
InCoder is the first generative model to directly perform zero-shot code infilling via bidirectional context from a masked-then-appended training scheme, matching left-to-right models on synthesis while improving on t...
-
Prefix-Tuning: Optimizing Continuous Prompts for Generation
Prefix-tuning matches or exceeds fine-tuning on NLG tasks by optimizing a continuous prefix using 0.1% of parameters while keeping the LM frozen.
-
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...
-
Conditional Attribute Estimation with Autoregressive Sequence Models
Conditional Attribute Transformers jointly estimate next-token probabilities and conditional attribute values for autoregressive sequence models, enabling credit assignment, counterfactuals, and steerable generation i...
-
Annotations Mitigate Post-Training Mode Collapse
Annotation-anchored training reduces semantic diversity collapse in post-trained language models by a factor of six compared to standard supervised fine-tuning while preserving instruction-following and improving with scale.
-
Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives
A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.
-
Conversational Control with Ontologies for Large Language Models: A Lightweight Framework for Constrained Generation
Ontology-based constraints combined with hybrid fine-tuning enable consistent control over LLM conversational outputs on proficiency and polarity tasks, outperforming baselines across seven models.
-
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.
-
DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models
DoLa reduces hallucinations in LLMs by contrasting logits from later versus earlier layers during decoding, improving truthfulness on TruthfulQA by 12-17 absolute points without fine-tuning or retrieval.
-
A Comparative Study of Controlled Text Generation Systems Using Level-Playing-Field Evaluation Principles
Re-evaluating controlled text generation systems under standardized conditions reveals that many published performance claims do not hold, highlighting the need for consistent evaluation practices.
-
Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining
Fine-grained metadata such as document quality indicators accelerate LLM pretraining when prepended, and metadata appending plus learnable meta-tokens recover additional speedup via auxiliary tasks and latent structure.
-
MemOS: A Memory OS for AI System
MemOS introduces a unified memory management framework for LLMs using MemCubes to handle and evolve different memory types for improved controllability and evolvability.
Reference graph
Works this paper leans on
-
[1]
Memory-efficient adaptive optimization for large-scale learning
Rohan Anil, Vineet Gupta, Tomer Koren, and Yoram Singer. Memory-efficient adaptive optimization for large-scale learning. arXiv preprint arXiv:1901.11150.
-
[2]
Unsupervised Neural Machine Translation
Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041.
-
[3]
Layer Normalization
Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. CoRR, abs/1607.06450.
-
[4]
Findings of the 2019 conference on machine translation (wmt19)
Loïc Barrault, Ondřej Bojar, Marta R Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, et al. Findings of the 2019 conference on machine translation (wmt19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pp. 1–61, 2019.
-
[5]
Large language models in machine translation
Thorsten Brants, Ashok C Popat, Peng Xu, Franz J Och, and Jeffrey Dean. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 858–867, 2007.
-
[6]
Tagged Back-Translation
Isaac Caswell, Ciprian Chelba, and David Grangier. Tagged back-translation. arXiv preprint arXiv:1906.06442.
-
[7]
Generating Long Sequences with Sparse Transformers
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509,
-
[8]
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860,
-
[9]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,
-
[10]
SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine
Matthew Dunn, Levent Sagun, Mike Higgins, V Ugur Guney, Volkan Cirik, and Kyunghyun Cho. Searchqa: A new q&a dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179.
-
[11]
Hierarchical Neural Story Generation
Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation. arXiv preprint arXiv:1805.04833,
-
[12]
ELI5: Long Form Question Answering
Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. Eli5: Long form question answering. arXiv preprint arXiv:1907.09190,
-
[13]
Stochastic gradient methods with layer-wise adaptive moments for training of deep networks
Boris Ginsburg, Patrice Castonguay, Oleksii Hrinchuk, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, Huyen Nguyen, and Jonathan M Cohen. Stochastic gradient methods with layer-wise adaptive moments for training of deep networks. arXiv preprint arXiv:1905.11286.
-
[14]
Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies
Max Grusky, Mor Naaman, and Yoav Artzi. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 708–719, New Orleans, Louisiana, June 2018.
-
[15]
A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks
Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. A joint many-task model: Growing a neural network for multiple nlp tasks. arXiv preprint arXiv:1611.01587.
-
[16]
The Curious Case of Neural Text Degeneration
Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.
-
[17]
Universal Language Model Fine-tuning for Text Classification
Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146,
-
[18]
Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling
Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462,
-
[19]
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.
-
[20]
One Model To Learn Them All
Lukasz Kaiser, Aidan N Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and Jakob Uszkoreit. One model to learn them all. arXiv preprint arXiv:1706.05137.
-
[21]
Fast Decoding in Sequence Models using Discrete Latent Variables
Łukasz Kaiser, Aurko Roy, Ashish Vaswani, Niki Parmar, Samy Bengio, Jakob Uszkoreit, and Noam Shazeer. Fast decoding in sequence models using discrete latent variables. arXiv preprint arXiv:1803.03382,
-
[22]
Unifying question answering and text classification via span extraction
Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. Unifying question answering and text classification via span extraction. arXiv preprint arXiv:1904.09286,
-
[23]
Adam: A Method for Stochastic Optimization
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,
-
[24]
Auto-Encoding Variational Bayes
Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114,
-
[25]
Domain Control for Neural Machine Translation
Catherine Kobus, Josep Crego, and Jean Senellart. Domain control for neural machine translation. arXiv preprint arXiv:1612.06140,
-
[26]
Neural text summarization: A critical evaluation
Wojciech Kryściński, Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. Neural text summarization: A critical evaluation. arXiv preprint arXiv:1908.08960.
-
[27]
Cross-lingual Language Model Pretraining
Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291,
-
[28]
Large memory layers with product keys
Guillaume Lample, Alexandre Sablayrolles, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. Large memory layers with product keys. arXiv preprint arXiv:1907.05242.
-
[29]
Unsupervised question answering by cloze translation
Patrick Lewis, Ludovic Denoyer, and Sebastian Riedel. Unsupervised question answering by cloze translation. arXiv preprint arXiv:1906.04980,
-
[30]
Multi-task Sequence to Sequence Learning
Minh-Thang Luong, Quoc V Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114,
-
[31]
The Natural Language Decathlon: Multitask Learning as Question Answering
Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730,
-
[32]
Regularizing and Optimizing LSTM Language Models
Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing lstm language models. arXiv preprint arXiv:1708.02182.
-
[33]
Filling Gender & Number Gaps in Neural Machine Translation with Black-box Context Injection
Amit Moryossef, Roee Aharoni, and Yoav Goldberg. Filling gender & number gaps in neural machine translation with black-box context injection. arXiv preprint arXiv:1903.03467.
-
[34]
Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond
Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. Abstractive text summarization using sequence-to-sequence rnns and beyond. arXiv preprint arXiv:1602.06023,
-
[35]
Deep contextualized word representations
Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365,
-
[36]
Using the Output Embedding to Improve Language Models
Ofir Press and Lior Wolf. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859,
-
[37]
Explain Yourself! Leveraging Language Models for Commonsense Reasoning
Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. Explain yourself! leveraging language models for commonsense reasoning. arXiv preprint arXiv:1906.02361.
-
[38]
SQuAD: 100,000+ Questions for Machine Comprehension of Text
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250,
-
[39]
A Neural Attention Model for Abstractive Sentence Summarization
Alexander M Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685,
-
[40]
Answers unite! unsupervised metrics for reinforced summarization models
Thomas Scialom, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano. Answers unite! unsupervised metrics for reinforced summarization models. arXiv preprint arXiv:1909.01610.
-
[41]
Neural Machine Translation of Rare Words with Subword Units
Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909,
-
[42]
Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. arXiv preprint arXiv:1804.04235,
-
[43]
Sequence to Sequence Learning with Neural Networks
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112.
-
[44]
A simple method for commonsense reasoning
Trieu H Trinh and Quoc V Le. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847,
-
[45]
NewsQA: A Machine Comprehension Dataset
Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. Newsqa: A machine comprehension dataset. arXiv preprint arXiv:1611.09830,
-
[46]
Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 5998–6008. Cur...
-
[47]
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
-
[48]
Neural text generation with unlikelihood training
Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319,
-
[49]
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
-
[50]
Sumqe: a bert-based summary quality estimation model
Stratos Xenouleas, Prodromos Malakasiotis, Marianna Apidianaki, and Ion Androutsopoulos. Sumqe: a bert-based summary quality estimation model. arXiv preprint arXiv:1909.00578,
-
[51]
HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600,
-
[52]
Defending against neural fake news
Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. Defending against neural fake news. arXiv preprint arXiv:1905.12616,
-
[53]
(2016), New York Times and Newsroom (Grusky et al.,
News News articles from CNN/DailyMail Nallapati et al. (2016), New York Times and Newsroom (Grusky et al.,
work page 2016