
arXiv: 1910.13461 · v1 · submitted 2019-10-29 · cs.CL · cs.LG · stat.ML

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

Pith reviewed 2026-05-13 00:10 UTC · model grok-4.3

classification: cs.CL · cs.LG · stat.ML
keywords: BART · denoising autoencoder · sequence-to-sequence pretraining · text generation · abstractive summarization · question answering · machine translation

The pith

BART pretrains sequence-to-sequence models by corrupting text with noise and reconstructing the original, reaching new state-of-the-art results on generation and comprehension benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BART as a denoising autoencoder that pretrains Transformer-based sequence-to-sequence models. It applies arbitrary text corruptions such as sentence shuffling and span infilling, then trains the model to recover the clean original text. This single framework combines the bidirectional encoding of BERT with the left-to-right decoding of GPT. When fine-tuned, the resulting models set new records on abstractive summarization, dialogue generation, and question answering while matching strong baselines on classification and extraction tasks.

Core claim

BART is trained by corrupting text with an arbitrary noising function and learning a model to reconstruct the original text. It uses a standard Transformer encoder-decoder architecture that generalizes prior pretraining schemes, with the best results obtained by combining sentence shuffling and a novel span-infilling scheme.
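The corrupt-then-reconstruct setup described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `<mask>` symbol, the function names, the whitespace tokenization, and the fixed `span_len` are all assumptions made for the example.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol used only in this sketch


def permute_sentences(sentences, rng):
    """Sentence shuffling: return the sentences in a random order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def infill_spans(tokens, span_starts, span_len):
    """Span infilling: replace each span of `span_len` tokens with ONE mask token."""
    if span_len < 1:
        raise ValueError("this sketch only handles spans of length >= 1")
    starts = set(span_starts)
    out, i = [], 0
    while i < len(tokens):
        if i in starts:
            out.append(MASK)   # a whole span collapses to a single mask
            i += span_len
        else:
            out.append(tokens[i])
            i += 1
    return out


def corrupt(document, rng, span_len=2, n_spans=1):
    """Noising function: shuffle sentences, then infill randomly placed spans."""
    sentences = permute_sentences(document, rng)
    tokens = [tok for sent in sentences for tok in sent.split()]
    max_start = max(len(tokens) - span_len, 0)
    span_starts = rng.sample(range(max_start + 1), k=min(n_spans, max_start + 1))
    return infill_spans(tokens, span_starts, span_len)


rng = random.Random(0)
doc = ["the cat sat .", "it was warm .", "then it slept ."]
source = corrupt(doc, rng)                               # encoder input (corrupted)
target = [tok for sent in doc for tok in sent.split()]   # decoder target (original)
```

Because a span collapses to a single mask token, `source` is shorter than `target`, so the model has to infer how many tokens are missing as well as which ones.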

What carries the argument

Denoising autoencoder that corrupts input with sentence permutation plus span masking and reconstructs the original via a bidirectional encoder plus autoregressive decoder.
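The reconstruction objective behind this description can be written compactly. The notation is introduced here for illustration and does not come from the page: $x$ is a clean document, $g$ is the noising function, $\tilde{x}$ is the corrupted input, and $\theta$ are the model parameters. As a sketch, assuming a standard token-level cross-entropy between the decoder output and the original text:

```latex
\tilde{x} \sim g(\cdot \mid x), \qquad
\mathcal{L}(\theta) = -\,\mathbb{E}_{x}\,\mathbb{E}_{\tilde{x} \sim g(\cdot \mid x)}
  \sum_{t=1}^{|x|} \log p_{\theta}\!\left(x_t \mid x_{<t},\, \tilde{x}\right)
```

The bidirectional encoder reads all of $\tilde{x}$ at once, while the autoregressive decoder factorizes the probability of $x$ left to right.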

Load-bearing premise

The specific choice of sentence shuffling combined with span infilling yields representations that transfer better to downstream generation and comprehension than earlier noising schemes.

What would settle it

A replication experiment that keeps the architecture and pretraining data identical but replaces the shuffling-plus-infilling corruption with a single prior scheme, such as token masking alone, then fine-tunes on GLUE, SQuAD, and CNN/DM and measures whether the ROUGE and accuracy gaps disappear.
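For reference, the token-masking baseline named in such an experiment might look like the sketch below. The function name, the `<mask>` symbol, and the masking probability are illustrative assumptions, not values taken from the paper or this page.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol used only in this sketch


def mask_tokens(tokens, mask_prob, rng):
    """BERT-style token masking: each token is independently replaced by MASK.

    The output always has the same length as the input, so the model never
    has to infer how many tokens are missing.
    """
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


rng = random.Random(0)
tokens = "the cat sat on the mat .".split()
masked = mask_tokens(tokens, mask_prob=0.3, rng=rng)
```

The design difference that the experiment would isolate is visible in the signature: token masking preserves length and sentence order, whereas shuffling plus infilling changes both.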

Original abstract

We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token. BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also report ablation experiments that replicate other pretraining schemes within the BART framework, to better measure which factors most influence end-task performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript presents BART, a denoising autoencoder for pre-training sequence-to-sequence models. The approach corrupts input text using an arbitrary noising function and trains a standard Transformer encoder-decoder to reconstruct the original text. The authors evaluate multiple noising strategies, finding that sentence permutation combined with span infilling performs best, and show that the resulting model transfers effectively to both generation and comprehension tasks, matching RoBERTa on GLUE and SQuAD while achieving new state-of-the-art results on abstractive summarization, dialogue, and QA benchmarks (gains up to 6 ROUGE) as well as a 1.1 BLEU improvement on machine translation with target-only pre-training. Ablation experiments replicate prior pre-training schemes inside the BART framework to isolate contributing factors.

Significance. If the empirical results hold, the work supplies a simple, unified pre-training recipe that generalizes bidirectional (BERT-style) and autoregressive (GPT-style) objectives through denoising. The direct ablations comparing noising variants on identical architecture and the consistent downstream gains on independent held-out benchmarks (GLUE, SQuAD, ROUGE on summarization corpora) constitute a clear strength, offering both practical utility for generation tasks and diagnostic insight into which pre-training components matter most.

minor comments (2)
  1. [Abstract] The statement that BART 'achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE' would be more precise if it named the specific datasets and the strongest prior baselines for each task.
  2. [Section 3 (Noising Strategies)] The description of the span-infilling noising scheme (replacing spans with a single mask token) would benefit from an explicit statement of the length distribution used during pre-training, as this hyper-parameter is listed among the free choices in the method.
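To make the second comment concrete, the sketch below samples span lengths from a Poisson distribution. The published BART paper reports a Poisson distribution with λ = 3 for text infilling. That value is not stated on this page, so it is treated here as an assumption, and the function names are hypothetical.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_span_lengths(n_spans, lam, rng):
    """Draw `n_spans` span lengths; a length of 0 is possible."""
    return [sample_poisson(lam, rng) for _ in range(n_spans)]


rng = random.Random(0)
lengths = sample_span_lengths(10_000, lam=3.0, rng=rng)  # lam=3.0 is an assumption
mean_length = sum(lengths) / len(lengths)
```

With this distribution the sample mean should sit close to λ, and zero-length draws do occur; in the paper's scheme a zero-length span corresponds to inserting a mask token without removing any text.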

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our work and their recommendation to accept the manuscript. We are pleased that the referee recognizes the value of BART as a unified denoising pre-training approach that generalizes bidirectional and autoregressive objectives, along with the strength of our ablation studies and consistent empirical gains across benchmarks.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The BART paper describes an empirical pretraining procedure (corrupt text via noising functions then reconstruct) and evaluates it through ablations and fine-tuning on independent downstream benchmarks (GLUE, SQuAD, summarization ROUGE, dialogue, MT BLEU). No equations, predictions, or first-principles claims reduce by construction to the inputs; the reported gains are measured on held-out task data separate from the pretraining corpus and noising choices. Self-citations are limited to prior work on related models and do not serve as load-bearing uniqueness theorems. The argument is explicitly experimental and self-contained against external metrics.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

This is an empirical pretraining paper. The central claims rest on experimental outcomes rather than closed-form derivations. Free parameters are the concrete noising probabilities and span lengths selected via ablation; the transformer architecture and optimization are taken from prior literature.

free parameters (2)
  • span infilling length distribution
    Lengths and replacement probabilities chosen through ablation to maximize downstream task performance.
  • sentence permutation rate
    Rate selected as part of the best-performing noising scheme identified in experiments.
axioms (1)
  • domain assumption: A standard transformer encoder-decoder can serve as a denoising autoencoder for arbitrary text corruptions.
    Invoked when the paper states that the NMT architecture generalizes BERT and GPT.

pith-pipeline@v0.9.0 · 5582 in / 1433 out tokens · 63631 ms · 2026-05-13T00:10:32.124054+00:00 · methodology


Forward citations

Cited by 50 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Language Models are Few-Shot Learners

    cs.CL 2020-05 accept novelty 8.0

    GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

  2. REALM: Retrieval-Augmented Language Model Pre-Training

    cs.CL 2020-02 accept novelty 8.0

    REALM augments language-model pre-training with an unsupervised retriever over Wikipedia documents and reports 4-16% absolute gains on open-domain QA benchmarks over prior implicit and explicit knowledge methods.

  3. PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts

    cs.CR 2026-05 unverdicted novelty 7.0

    PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.

  4. Evaluating Non-English Developer Support in Machine Learning for Software Engineering

    cs.SE 2026-05 unverdicted novelty 7.0

    Code LLMs generate substantially worse comments outside English, and no tested automatic metric or LLM judge reliably matches human assessment of those outputs.

  5. TCRTransBench: A Comprehensive Benchmark for Bidirectional TCR-Peptide Sequence Generation

    q-bio.CB 2026-05 unverdicted novelty 7.0

    TCRTransBench provides a new benchmark with bidirectional TCR-peptide generation tasks, a large validated dataset, and metrics to evaluate neural models for immunological sequence modeling.

  6. Deep Graph-Language Fusion for Structure-Aware Code Generation

    cs.SE 2026-05 unverdicted novelty 7.0

    CGFuse enables deep token-level fusion of graph-derived structural features into language models, yielding 10-16% BLEU and 6-11% CodeBLEU gains on code generation tasks.

  7. DocQAC: Adaptive Trie-Guided Decoding for Effective In-Document Query Auto-Completion

    cs.IR 2026-04 conditional novelty 7.0

    Adaptive trie-guided decoding with document context and tunable penalties improves in-document query auto-completion, outperforming baselines and larger models like LLaMA-3 on seen queries.

  8. VidTAG: Temporally Aligned Video to GPS Geolocalization with Denoising Sequence Prediction at a Global Scale

    cs.CV 2026-04 unverdicted novelty 7.0

    VidTAG achieves fine-grained global video-to-GPS geolocalization via temporal frame alignment and denoising sequence refinement, reporting 20% gains at 1 km over GeoCLIP and 25% on CityGuessr68k.

  9. LoRM: Learning the Language of Rotating Machinery for Self-Supervised Condition Monitoring

    cs.CL 2026-04 unverdicted novelty 7.0

    LoRM is a self-supervised framework that models multi-modal rotating machinery signals as token sequences for prediction with fine-tuned language models, using prediction errors to monitor machine health in real time.

  10. Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

    cs.CL 2024-04 conditional novelty 7.0

    Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.

  11. Chronos: Learning the Language of Time Series

    cs.LG 2024-03 conditional novelty 7.0

    Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.

  12. CodeT: Code Generation with Generated Tests

    cs.CL 2022-07 conditional novelty 7.0

    CodeT improves code generation accuracy by using the same model to create test cases and then selecting solutions via output agreement on those tests, raising HumanEval pass@1 from 47% to 65.8%.

  13. InCoder: A Generative Model for Code Infilling and Synthesis

    cs.SE 2022-04 unverdicted novelty 7.0

    InCoder is the first generative model to directly perform zero-shot code infilling via bidirectional context from a masked-then-appended training scheme, matching left-to-right models on synthesis while improving on t...

  14. Learning to summarize from human feedback

    cs.CL 2020-09 conditional novelty 7.0

    Reinforcement learning on a reward model trained from human summary comparisons produces summaries humans prefer over supervised fine-tuning or human references on TL;DR and transfers to CNN/DM.

  15. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    cs.CL 2020-05 accept novelty 7.0

    RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.

  16. FastTab: A Fast Table Recognizer with a Tiny Recursive Module and 1D Transformers

    cs.CV 2026-05 unverdicted novelty 6.0

    FastTab combines a Tiny Recursive Module and axial 1D Transformer encoders to predict table grids, headers, and cell spans directly, achieving competitive accuracy on four benchmarks with low-latency inference.

  17. PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts

    cs.CL 2026-05 unverdicted novelty 6.0

    PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.

  18. PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts

    cs.CR 2026-05 unverdicted novelty 6.0

    PragLocker generates function-preserving but non-portable prompts for LLM agents via code-symbol semantic anchoring followed by target-model feedback noise injection.

  19. A Unified Benchmark for Evaluating Knowledge Graph Construction Methods and Graph Neural Networks

    cs.LG 2026-05 unverdicted novelty 6.0

    A dual-purpose benchmark supplies two text-derived knowledge graphs and one expert reference graph on the same biomedical corpus to jointly measure construction method quality and GNN robustness via semi-supervised no...

  20. A Sentence Relation-Based Approach to Sanitizing Malicious Instructions

    cs.CR 2026-05 unverdicted novelty 6.0

    SONAR constructs a relational graph from entailment and contradiction scores to prune injected malicious sentences from LLM prompts while preserving context, achieving near-zero attack success rates.

  21. From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction

    cs.AI 2026-04 unverdicted novelty 6.0

    Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.

  22. Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies

    cs.AI 2026-04 unverdicted novelty 6.0

    A separable expert architecture uses base models, LoRA adapters, and deletable per-user proxies to enable privacy-preserving personalization and deterministic unlearning in LLMs.

  23. Assessing Capabilities of Large Language Models in Social Media Analytics: A Multi-task Quest

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs show mixed results on authorship verification, post generation, and attribute inference from Twitter data, with new frameworks and user studies establishing benchmarks for these analytics tasks.

  24. TableSeq: Unified Generation of Structure, Content, and Layout

    cs.CV 2026-04 unverdicted novelty 6.0

    TableSeq unifies table structure recognition, content extraction, and cell localization by generating an interleaved autoregressive sequence of HTML tags, cell text, and discretized coordinate tokens from an input image.

  25. Generative Retrieval Overcomes Limitations of Dense Retrieval but Struggles with Identifier Ambiguity

    cs.IR 2026-04 conditional novelty 6.0

    Generative retrieval beats dense retrieval and BM25 on the LIMIT dataset but degrades with hard negatives due to identifier ambiguity during decoding.

  26. Progress Ratio Embeddings: An Impatience Signal for Robust Length Control in Neural Text Generation

    cs.CL 2025-12 unverdicted novelty 6.0

    Progress Ratio Embeddings use a trigonometric progress-ratio signal to deliver stable length control in transformers that generalizes to unseen target lengths.

  27. SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning

    cs.CV 2025-10 unverdicted novelty 6.0

    SSL4RL reformulates self-supervised learning objectives into dense, verifiable reward signals for RL-based fine-tuning of vision-language models, yielding performance gains on reasoning benchmarks.

  28. LLMs Get Lost In Multi-Turn Conversation

    cs.CL 2025-05 unverdicted novelty 6.0

    LLMs drop 39% in performance during multi-turn conversations due to premature assumptions and inability to recover from early errors.

  29. Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models

    cs.AI 2024-08 conditional novelty 6.0

    Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.

  30. Nougat: Neural Optical Understanding for Academic Documents

    cs.LG 2023-08 conditional novelty 6.0

    Nougat applies a visual transformer to convert academic PDFs into markup language while accurately handling mathematical content on a new scientific document dataset.

  31. CodeBERT: A Pre-Trained Model for Programming and Natural Languages

    cs.CL 2020-02 unverdicted novelty 6.0

    CodeBERT pre-trains a bimodal model on code and text pairs plus unimodal data to achieve state-of-the-art results on natural language code search and code documentation generation.

  32. HuggingFace's Transformers: State-of-the-art Natural Language Processing

    cs.CL 2019-10 accept novelty 6.0

    Hugging Face releases an open-source Python library that supplies a unified API and pretrained weights for major Transformer architectures used in natural language processing.

  33. From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning

    cs.AI 2026-05 unverdicted novelty 5.0

    Temporal conditioning in three LLM-based planner architectures for AV scene-to-plan reasoning yields no statistically significant gains on NLP correctness metrics but enables predictive hazard reasoning and stable cor...

  34. TabEmb: Joint Semantic-Structure Embedding for Table Annotation

    cs.LG 2026-04 unverdicted novelty 5.0

    TabEmb decouples LLM-based semantic column embeddings from graph-based structural modeling to produce joint representations that improve table annotation tasks.

  35. Calibrating Model-Based Evaluation Metrics for Summarization

    cs.CL 2026-04 unverdicted novelty 5.0

    A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.

  36. CIMple: Standard-cell SRAM-based CIM with LUT-based split softmax for attention acceleration

    cs.AR 2026-04 unverdicted novelty 5.0

    CIMple delivers a 32 kb digital SRAM-based compute-in-memory accelerator for transformer self-attention that reaches 26.1 TOPS/W at 0.85 V in 28 nm with INT8 precision using dual-banked architecture and LUT-based spli...

  37. Confident in a Confidence Score: Investigating the Sensitivity of Confidence Scores to Supervised Fine-Tuning

    cs.CL 2026-04 unverdicted novelty 5.0

    Supervised fine-tuning degrades the correlation between confidence scores and output quality in language models, driven by factors like training distribution similarity rather than true quality.

  38. Enriching and Controlling Global Semantics for Text Summarization

    cs.CL 2021-09 unverdicted novelty 5.0

    A normalizing-flow neural topic model plus control mechanism are added to Transformer summarizers to supply and regulate global semantics, with reported gains over prior models on five benchmarks.

  39. ClinQueryAgent: A Conversational Agent for Population Health Management

    cs.IR 2026-04 unverdicted novelty 4.0

    The paper introduces ClinQueryAgent, a conversational agent that converts natural language queries into database queries for population health management while keeping patient data secure, and reports its use by 128 s...

  40. OneSearch-V2: The Latent Reasoning Enhanced Self-distillation Generative Search Framework

    cs.IR 2026-03 unverdicted novelty 4.0

    OneSearch-V2 improves generative retrieval via latent reasoning and self-distillation, achieving +3.98% item CTR, +2.07% buyer volume, and +2.11% order volume in online A/B tests.

  41. REFLEX: Reference-Free Evaluation of Log Summarization via Large Language Model Judgment

    cs.CL 2025-11 unverdicted novelty 4.0

    REFLEX is a reference-free LLM-based evaluation metric for log summarization that assesses quality on relevance, informativeness, and coherence without gold references or human annotations.

  42. When TableQA Meets Noise: A Dual Denoising Framework for Complex Questions and Large-scale Tables

    cs.CL 2025-09 unverdicted novelty 4.0

    EnoTab is a dual denoising framework for TableQA that performs evidence-based question denoising via semantic unit decomposition and evidence tree-guided table pruning with post-order rollback to improve performance o...

  43. Leveraging Large Language Models for Sarcastic Speech Annotation in Sarcasm Detection

    cs.CL 2025-06 unverdicted novelty 4.0

    An LLM-assisted annotation pipeline creates the PodSarc sarcastic speech dataset from podcasts and validates it via a collaborative gating detection model reaching 73.63% F1.

  44. Large Language Model-Brained GUI Agents: A Survey

    cs.AI 2024-11 unverdicted novelty 4.0

    A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.

  45. A Survey on Foundation Models for Personalized Federated Intelligence

    cs.AI 2025-05 unverdicted novelty 3.0

    The survey introduces personalized federated intelligence (PFI) as a framework integrating federated learning and foundation models to support privacy-aware personalization of AI models.

  46. Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

    cs.AI 2025-01 unverdicted novelty 3.0

    The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.

  47. Large Language Models: A Survey

    cs.CL 2024-02 accept novelty 3.0

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

  48. Findings of the Counter Turing Test: AI-Generated Text Detection

    cs.CL 2026-05 unverdicted novelty 2.0

    Shared task findings show F1=1.0000 for binary AI text detection and 0.9531 for model attribution using fine-tuned DeBERTa and BART transformers with ensembles.

  49. From Knowledge to Action: Outcomes of the 2025 Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry

    cond-mat.mtrl-sci 2026-05 unverdicted novelty 2.0

    Hackathon submissions indicate LLMs are moving from general assistants toward composable multi-agent systems for structuring scientific knowledge and automating tasks in materials science and chemistry.

  50. A Comprehensive Overview of Large Language Models

    cs.CL 2023-07 unverdicted novelty 2.0

    A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.
