arxiv: 1909.11942 · v6 · submitted 2019-09-26 · 💻 cs.CL · cs.AI

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Kevin Gimpel, Mingda Chen, Piyush Sharma, Radu Soricut, Sebastian Goodman, Zhenzhong Lan

Pith reviewed 2026-05-13 12:22 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords ALBERT, BERT, parameter reduction, sentence order prediction, language model pretraining, GLUE benchmark, SQuAD, model efficiency

The pith

ALBERT uses parameter-reduction techniques and a new inter-sentence coherence loss to reach state-of-the-art results on GLUE, RACE, and SQuAD while using fewer parameters than BERT-large.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces two techniques that cut the number of parameters in BERT-style models: one splits the embedding matrix into smaller factors, and the other shares weights across layers instead of duplicating them. These changes reduce memory use and speed up training, allowing larger effective models to fit on the same hardware. The authors also add a self-supervised objective that trains the model to predict whether two sentences appear in the correct order, which improves performance on tasks that require understanding relations between sentences. When these pieces are combined, the resulting models set new records on the GLUE, RACE, and SQuAD benchmarks while remaining smaller than the original BERT-large. The work shows that careful redesign of the architecture and training signal can deliver better scaling behavior than simply increasing model size.
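
As a rough illustration of the sentence-order objective described above, the following Python sketch builds training pairs from consecutive text segments: a pair kept in its original order is a positive example, and the same pair with the two segments swapped is a negative example. The function name `make_sop_examples`, the 50/50 sampling, and the toy segments are assumptions made for this sketch, not details taken from the paper or its released code.

```python
import random


def make_sop_examples(segments, rng):
    """Build (first, second, label) triples from consecutive segments.

    label 1: the two segments appear in their original order.
    label 0: the same two segments with their order swapped.
    The 50/50 choice between the two cases is an assumption of this sketch.
    """
    examples = []
    for first, second in zip(segments, segments[1:]):
        if rng.random() < 0.5:
            examples.append((first, second, 1))
        else:
            examples.append((second, first, 0))
    return examples


# Hypothetical toy document split into three segments.
segments = [
    "The model was pretrained on unlabeled text.",
    "It was then fine-tuned on a labeled task.",
    "Finally it was evaluated on held-out data.",
]
examples = make_sop_examples(segments, random.Random(0))
for first, second, label in examples:
    print(label, "|", first, "->", second)
```

Because both classes use the same two segments, a classifier cannot separate them by topic alone and has to rely on ordering cues, which is the inter-sentence coherence signal the summary refers to.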

Core claim

By applying factorized embedding parameterization and cross-layer parameter sharing, together with a sentence-order prediction loss that replaces the next-sentence prediction objective, ALBERT produces language representations that achieve higher scores on GLUE, RACE, and SQuAD than BERT-large while using substantially fewer parameters.

What carries the argument

Factorized embedding parameterization (splitting the large vocabulary embedding matrix into two smaller matrices) combined with cross-layer parameter sharing (reusing the same weights across all transformer layers) and the sentence-order prediction loss (a binary classification task on sentence ordering).
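
To make the embedding factorization concrete, the sketch below compares the parameter count of a single vocabulary-by-hidden embedding matrix with that of two smaller matrices (vocabulary-by-embedding and embedding-by-hidden). The sizes used here (a 30,000-token vocabulary, hidden size 1024, embedding size 128) are illustrative assumptions, and `embedding_params` is a hypothetical helper, not a function from the ALBERT code base.

```python
def embedding_params(vocab_size, hidden_size, embed_size=None):
    """Count embedding parameters.

    Without embed_size: one vocab_size x hidden_size matrix.
    With embed_size: a vocab_size x embed_size lookup table followed by
    an embed_size x hidden_size projection.
    """
    if embed_size is None:
        return vocab_size * hidden_size
    return vocab_size * embed_size + embed_size * hidden_size


# Illustrative sizes (assumptions for this sketch).
V, H, E = 30_000, 1024, 128

full = embedding_params(V, H)         # 30,720,000
factored = embedding_params(V, H, E)  # 3,840,000 + 131,072 = 3,971,072

print(f"unfactored: {full:,}")
print(f"factored:   {factored:,}")
```

The saving grows as the hidden size increases while the embedding size stays fixed, since only the small projection term depends on the hidden size.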

If this is right

Models of this architecture can be trained on the same hardware with larger batch sizes or longer sequences.
The inter-sentence coherence objective improves accuracy on multi-sentence reasoning tasks without adding parameters.
Further scaling becomes feasible because, with cross-layer sharing, encoder parameter count no longer grows with depth, and factorized embeddings decouple the embedding size from the hidden width.
Pretrained checkpoints can be released that are both smaller to store and faster to fine-tune than prior large models.
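
The effect of cross-layer sharing on parameter count can be sketched with a back-of-the-envelope calculation. The snippet assumes the common approximation of about 12 x hidden_size^2 weights per transformer layer (four attention projections plus a feed-forward block with a 4x inner width, ignoring biases and layer normalization); `encoder_params` is a hypothetical helper written for this illustration.

```python
def encoder_params(hidden_size, num_layers, share_across_layers):
    """Approximate encoder weight count.

    Per layer: 4 * H^2 for the attention projections plus
    2 * H * (4 * H) for the feed-forward block, i.e. about 12 * H^2.
    Biases and layer-norm parameters are ignored (an approximation).
    """
    per_layer = 12 * hidden_size ** 2
    distinct_layers = 1 if share_across_layers else num_layers
    return per_layer * distinct_layers


H = 1024
for depth in (12, 24, 48):
    unshared = encoder_params(H, depth, share_across_layers=False)
    shared = encoder_params(H, depth, share_across_layers=True)
    print(f"depth={depth:2d}  unshared={unshared:,}  shared={shared:,}")
```

Under sharing the stored weights stay constant as depth increases, but every layer is still executed, so the compute per forward pass is not reduced by sharing alone.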

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same factorization and sharing ideas could be applied to other transformer families to reduce their training cost.
Efficiency gains of this kind may allow pretraining at scales that would otherwise be blocked by memory limits on current accelerators.
Downstream applications that run on edge devices could adopt these lighter models with little loss in accuracy.
The results suggest that many existing large models contain redundant parameters that can be removed through structured sharing rather than through pruning after training.

Load-bearing premise

That the chosen reductions in embedding size and layer duplication keep enough model capacity to represent the same linguistic patterns that the full BERT-large model captures.

What would settle it

A direct comparison in which an ALBERT model with the reported parameter count scores lower than BERT-large on the GLUE average or on SQuAD F1 when both are trained under identical data and optimization settings.

Original abstract

Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large. The code and the pretrained models are available at https://github.com/google-research/ALBERT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript presents ALBERT, a parameter-efficient variant of BERT for self-supervised language representation learning. It introduces factorized embedding parameterization (§3.1) and cross-layer parameter sharing (§3.2) to reduce memory consumption and training time, along with a sentence-order prediction (SOP) objective (§3.3) to model inter-sentence coherence. Comprehensive experiments demonstrate improved scaling over BERT, with the largest ALBERT model achieving new state-of-the-art results on GLUE, RACE, and SQuAD while using fewer parameters than BERT-large; code and pretrained models are released.

Significance. If the empirical results hold under matched controls, the work is significant for enabling more efficient scaling of pretrained language models. The parameter-reduction techniques address practical GPU/TPU constraints, and the SOP loss offers a targeted improvement for multi-sentence tasks. Releasing code and models supports reproducibility and further research on efficient pretraining.

major comments (2)

[§4 and Table 2] The SOTA claims on GLUE/RACE/SQuAD rest on the assumption that factorized embeddings and cross-layer sharing preserve effective capacity relative to BERT-large; the manuscript lacks a direct capacity analysis (e.g., probing tasks or parameter-efficiency metrics) to confirm this, which is load-bearing for attributing gains to the proposed architecture rather than training dynamics.
[§3.3 and §4.3] The claim that SOP supplies additive benefit beyond masked LM requires training-step-matched ablations against BERT baselines; without these controls, it remains possible that efficiency-driven longer schedules (rather than the architectural changes) drive the reported gains on multi-sentence benchmarks.

minor comments (3)

[Abstract] The phrase 'comprehensive empirical evidence' could be sharpened by briefly naming the key metrics (e.g., average GLUE score) for immediate clarity.
[Figure 2] The scaling curves would be easier to interpret with explicit annotation of the training steps or FLOPs at each point.
[§6] The limitations paragraph could explicitly note whether SOP provides benefit on single-sentence tasks or only multi-sentence ones.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and the recommendation of minor revision. We address each major comment below with clarifications drawn directly from the manuscript's experiments and indicate where we will strengthen the presentation in the revised version.

Referee: [§4 and Table 2] The SOTA claims on GLUE/RACE/SQuAD rest on the assumption that factorized embeddings and cross-layer sharing preserve effective capacity relative to BERT-large; the manuscript lacks a direct capacity analysis (e.g., probing tasks or parameter-efficiency metrics) to confirm this, which is load-bearing for attributing gains to the proposed architecture rather than training dynamics.

Authors: We agree that explicit capacity probes would further isolate architectural effects. The manuscript quantifies parameter reduction (ALBERT-large uses 18M parameters versus BERT-large's 340M) and demonstrates improved scaling behavior in Figure 1 and Table 2, where ALBERT models continue to benefit from increased depth and hidden size under the sharing and factorization constraints while outperforming BERT-large on the target benchmarks. These results attribute gains to the architecture because the same pretraining objective and data are used, with only the parameterization changed. We will add a short paragraph in §4 referencing the effective parameter counts and scaling curves to make this attribution more explicit.
revision: partial

Referee: [§3.3 and §4.3] The claim that SOP supplies additive benefit beyond masked LM requires training-step-matched ablations against BERT baselines; without these controls, it remains possible that efficiency-driven longer schedules (rather than the architectural changes) drive the reported gains on multi-sentence benchmarks.

Authors: We acknowledge the value of step-matched controls. Within ALBERT, Table 4 reports SOP versus NSP ablations under identical training steps and data, showing consistent gains on multi-sentence tasks (MNLI, QNLI, RACE). The manuscript notes that ALBERT's efficiency permits 1M training steps, matching the step count reported for the original BERT-large. We will revise §4.3 to explicitly state the shared step count and add a sentence clarifying that the SOP improvement is measured under these fixed-step ablations rather than extended schedules alone.
revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical SOTA claims rest on held-out benchmarks

The paper proposes two parameter-reduction techniques (factorized embeddings in §3.1, cross-layer sharing in §3.2) and replaces NSP with SOP loss (§3.3), then evaluates the resulting models on GLUE, RACE, and SQuAD. No derivation chain exists that reduces any claimed result to its own inputs by construction, fitted parameters renamed as predictions, or load-bearing self-citations. Performance numbers are external benchmark scores, not internal tautologies. This is the normal non-circular case for an empirical architecture paper.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work is empirical; it inherits standard transformer assumptions and tunes hyperparameters on validation data, with no new invented entities or unstated axioms required for the central claim.

free parameters (1)

model hyperparameters (layers, hidden size, etc.)
Standard tuning knobs in transformer pretraining; values chosen to balance performance and efficiency.

axioms (1)

domain assumption Transformer layers can share parameters across depth without catastrophic loss of capacity
Invoked by the cross-layer parameter sharing technique (§3.2).

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced unclear
we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT... our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large.
IndisputableMonolith.Foundation.PhiForcing phi_equation unclear
we also use a self-supervised loss that focuses on modeling inter-sentence coherence

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Measuring Massive Multitask Language Understanding
cs.CY 2020-09 accept novelty 8.0

Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.
Language Models are Few-Shot Learners
cs.CL 2020-05 accept novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
SMolLM: Small Language Models Learn Small Molecular Grammar
cs.LG 2026-05 unverdicted novelty 7.0

A 53K-parameter model generates 95% valid SMILES on ZINC-250K, outperforming larger models, by resolving chemical constraints in fixed order: brackets first, rings second, valence last.
Learning Posterior Predictive Distributions for Node Classification from Synthetic Graph Priors
cs.LG 2026-04 unverdicted novelty 7.0

NodePFN pre-trains on synthetic graphs with controllable homophily and causal feature-label models to achieve 71.27 average accuracy on 23 node classification benchmarks without graph-specific training.
GuardPhish: Securing Open-Source LLMs from Phishing Abuse
cs.CR 2026-04 unverdicted novelty 7.0

Open-source LLMs detect phishing intent at high rates but still generate actionable phishing content, and GuardPhish supplies a dataset plus modular classifiers to close the gap.
SecureRouter: Encrypted Routing for Efficient Secure Inference
cs.CR 2026-04 unverdicted novelty 7.0

SecureRouter accelerates secure transformer inference by 1.95x via an encrypted router that selects input-adaptive models from an MPC-optimized pool with negligible accuracy loss.
LA-Sign: Looped Transformers with Geometry-aware Alignment for Skeleton-based Sign Language Recognition
cs.CV 2026-03 unverdicted novelty 7.0

LA-Sign achieves state-of-the-art skeleton-based sign language recognition on WLASL and MSASL by using recurrent looped transformers with adaptive hyperbolic geometry alignment.
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
cs.LG 2025-02 unverdicted novelty 7.0

A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
cs.CL 2019-10 accept novelty 7.0

BART introduces a denoising pretraining method for seq2seq models that matches RoBERTa on GLUE and SQuAD while setting new state-of-the-art results on abstractive summarization, dialogue, and QA with up to 6 ROUGE gains.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
cs.LG 2019-10 unverdicted novelty 7.0

T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
cs.CL 2019-09 unverdicted novelty 7.0

Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.
BoolXLLM: LLM-Assisted Explainability for Boolean Models
cs.AI 2026-05 unverdicted novelty 6.0

BoolXLLM augments an existing Boolean rule learner with LLMs for feature selection, discretization thresholds, and natural-language rule translation to improve interpretability while preserving accuracy.
Shortcut Solutions Learned by Transformers Impair Continual Compositional Reasoning
cs.LG 2026-05 unverdicted novelty 6.0

BERT learns shortcut solutions that impair generalization and forward transfer in continual LEGO, while ALBERT learns loop-like solutions for better performance, yet both fail at cross-experience composition, with ALB...
Fight Poison with Poison: Enhancing Robustness in Few-shot Machine-Generated Text Detection with Adversarial Training
cs.CR 2026-05 unverdicted novelty 6.0

REACT uses a RAG-powered attacker to generate challenging adversarial examples and trains a detector with contrastive learning in an alternating loop, raising average F1 by 4.95 points and lowering attack success rate...
ADE: Adaptive Dictionary Embeddings -- Scaling Multi-Anchor Representations to Large Language Models
cs.CL 2026-04 unverdicted novelty 6.0

ADE scales multi-anchor word representations to transformers via Vocabulary Projection, Grouped Positional Encoding, and context-aware reweighting, achieving 98.7% fewer trainable parameters than DeBERTa-v3-base while...
Language Models (Mostly) Know What They Know
cs.CL 2022-07 unverdicted novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
Unsupervised Dense Information Retrieval with Contrastive Learning
cs.IR 2021-12 unverdicted novelty 6.0

Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.
A General Language Assistant as a Laboratory for Alignment
cs.CL 2021-12 conditional novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
HuggingFace's Transformers: State-of-the-art Natural Language Processing
cs.CL 2019-10 accept novelty 6.0

Hugging Face releases an open-source Python library that supplies a unified API and pretrained weights for major Transformer architectures used in natural language processing.
Natural Language Processing: A Comprehensive Practical Guide from Tokenisation to RLHF
cs.CL 2026-05 unverdicted novelty 5.0

A structured practicum guides readers through the complete modern NLP pipeline with reproducible sessions and new linguistic resources for Tajik and Tatar.
Hyperloop Transformers
cs.LG 2026-04 unverdicted novelty 5.0

Hyperloop Transformers outperform standard and mHC Transformers with roughly 50% fewer parameters by looping a middle block of layers and applying hyper-connections only after each loop.
Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities
cs.DC 2026-04 unverdicted novelty 3.0

A survey synthesizing challenges, system architectures, model optimizations, deployment methods, and resource management techniques for large language model inference at the network edge.
Large Language Models: A Survey
cs.CL 2024-02 accept novelty 3.0

The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
Natural Language Processing: A Comprehensive Practical Guide from Tokenisation to RLHF
cs.CL 2026-05 unverdicted novelty 2.0

The work provides a reproducible, session-based guide to the NLP pipeline with original adaptations and resources for morphologically rich low-resource languages.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · cited by 23 Pith papers · 7 internal anchors

[1]

arXiv preprint arXiv:1809.10853 , year=

Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. arXiv preprint arXiv:1809.10853,

work page arXiv
[2]

SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation

Daniel Cer, Mona Diab, Eneko Agirre, I˜nigo Lopez-Gazpio, and Lucia Specia. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. InProceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 1–14, Vancouver, Canada, August

work page 2017
[3]

Training Deep Nets with Sublinear Memory Cost

Association for Computational Linguistics. doi: 10.18653/v1/S17-2001. URL https://www.aclweb.org/anthology/S17-2001. Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/s17-2001 2001
[4]

Generating Long Sequences with Sparse Transformers

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509,

work page internal anchor Pith review Pith/arXiv arXiv 1904
[5]

Bam! born-again multi-task networks for natural language understanding

Kevin Clark, Minh-Thang Luong, Urvashi Khandelwal, Christopher D Manning, and Quoc V Le. Bam! born-again multi-task networks for natural language understanding. arXiv preprint arXiv:1907.04829,

work page arXiv 1907
[6]

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a ﬁxed-length context. arXiv preprint arXiv:1901.02860,

work page Pith review arXiv 1901
[7]

Universal Transformers

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

BERT: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, ...

work page 2019
[9]

BERT: Pre-training of deep bidi- rectional transformers for language understanding

Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https: //www.aclweb.org/anthology/N19-1423. William B. Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005) ,

work page doi:10.18653/v1/n19-1423
[10]

Zhe Gan, Yunchen Pu, Ricardo Henao, Chunyuan Li, Xiaodong He, and Lawrence Carin

URL https://www.aclweb.org/anthology/I05-5002. Zhe Gan, Yunchen Pu, Ricardo Henao, Chunyuan Li, Xiaodong He, and Lawrence Carin. Learn- ing generic sentence representations using convolutional neural networks. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing , pp. 2390–2400, Copenhagen, Denmark, September

work page 2017
[11]

doi: 10.18653/v1/D17-1254

Association for Computational Linguistics. doi: 10.18653/v1/D17-1254. URL https://www.aclweb.org/anthology/D17-1254. 11 Published as a conference paper at ICLR 2020 Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entail- ment an...

work page doi:10.18653/v1/d17-1254 2020
[12]

URL https: //www.aclweb.org/anthology/J95-2003. M.A.K. Halliday and Ruqaiya Hasan. Cohesion in English. Routledge,

work page 2003
[13]

Modeling recurrence for transformer

Jie Hao, Xing Wang, Baosong Yang, Longyue Wang, Jinfeng Zhang, and Zhaopeng Tu. Modeling recurrence for transformer. Proceedings of the 2019 Conference of the North ,

work page 2019
[14]

Gaussian Error Linear Units (GELUs)

18653/v1/n19-1122. URL http://dx.doi.org/10.18653/v1/n19-1122. Dan Hendrycks and Kevin Gimpel. Gaussian Error Linear Units (GELUs). arXiv preprint arXiv:1606.08415,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/n19-1122
[15]

Learning distributed representations of sentences from unlabelled data

Felix Hill, Kyunghyun Cho, and Anna Korhonen. Learning distributed representations of sentences from unlabelled data. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pp. 1367–1377. Association for Computational Linguistics,

work page 2016
[16]

URL http: //aclweb.org/anthology/N16-1162

doi: 10.18653/v1/N16-1162. URL http: //aclweb.org/anthology/N16-1162. Jerry R. Hobbs. Coherence and coreference. Cognitive Science, 3(1):67–90,

work page doi:10.18653/v1/n16-1162
[17]

Universal language model fine-tuning for text classification

Jeremy Howard and Sebastian Ruder. Universal language model ﬁne-tuning for text classiﬁcation. arXiv preprint arXiv:1801.06146,

work page arXiv
[18]

Yacine Jernite, Samuel R Bowman, and David Sontag

URL https://www.quora.com/q/quoradata/ First-Quora-Dataset-Release-Question-Pairs . Yacine Jernite, Samuel R Bowman, and David Sontag. Discourse-based objectives for fast unsuper- vised sentence representation learning. arXiv preprint arXiv:1705.00557,

work page arXiv
[19]

SpanBERT: Improving pre-training by representing and predicting spans

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S Weld, Luke Zettlemoyer, and Omer Levy. SpanBERT: Improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529,

work page arXiv 1907
[20]

URL http://dl.acm.org/citation.cfm?id= 2969442.2969607

MIT Press. URL http://dl.acm.org/citation.cfm?id= 2969442.2969607. Taku Kudo and John Richardson. SentencePiece: A simple and language independent sub- word tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Con- ference on Empirical Methods in Natural Language Processing: System Demonstrations , pp. 66–71, Brussels, Belgium,...

work page arXiv 2018
[21]

S entence P iece: A simple and language independent subword tokenizer and detokenizer for neural text processing

Association for Computational Linguistics. doi: 10.18653/v1/D18-2012. URL https://www.aclweb.org/anthology/D18-2012. 12 Published as a conference paper at ICLR 2020 Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods ...

work page internal anchor Pith review doi:10.18653/v1/d18-2012 2012
[22]

RACE : Large-scale R e A ding comprehension dataset from examinations

Association for Computational Linguistics. doi: 10.18653/v1/D17-1082. URL https://www. aclweb.org/anthology/D17-1082. Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In Proceed- ings of the 31st ICML, Beijing, China,

work page doi:10.18653/v1/d17-1082
[23]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pre- training approach. arXiv preprint arXiv:1907.11692,

work page internal anchor Pith review Pith/arXiv arXiv 1907
[24]

doi: 10.18653/v1/P19-1442

Association for Computational Linguistics. doi: 10.18653/v1/P19-1442. URL https://www.aclweb.org/anthology/ P19-1442. Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word rep- resentation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar, October

work page doi:10.18653/v1/p19-1442 2014
[25]

G lo V e: Global Vectors for Word Representation

Association for Computational Linguistics. doi: 10.3115/v1/D14-1162. URL https://www.aclweb.org/anthology/ D14-1162. Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Con- ference of the North American Chapter of the Association...

work page doi:10.3115/v1/d14-1162 2018
[26]

Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer

Association for Computational Linguistics. doi: 10.18653/v1/N18-1202. URL https://www.aclweb.org/anthology/N18-1202. Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. https://s3-us-west-2.amazonaws.com/ openai-assets/research-covers/language-unsupervised/language_ understanding...

work page doi:10.18653/v1/n18-1202
[27]

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a uniﬁed text-to-text transformer. arXiv preprint arXiv:1910.10683,

work page internal anchor Pith review Pith/arXiv arXiv 1910
[28]

SQuAD: 100,000+ questions for machine comprehension of text

13 Published as a conference paper at ICLR 2020 Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. InProceedings of the 2016 Conference on Empirical Methods in Natural Language Processing , pp. 2383–2392, Austin, Texas, November

work page 2020
[29]

doi: 10.18653/v1/D16-1264

Association for Computational Linguistics. doi: 10.18653/v1/D16-1264. URL https://www.aclweb. org/anthology/D16-1264. Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 784–789, Mel...

work page doi:10.18653/v1/d16-1264
[30]

Know What You Don 't Know : Unanswerable Questions for SQuAD

Association for Computational Linguistics. doi: 10.18653/v1/P18-2124. URL https://www.aclweb. org/anthology/P18-2124. Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, et al. Mesh-tensorﬂow: Deep learning for supercomputers. In Advances in Neural Informa...

work page doi:10.18653/v1/p18-2124
[31]

Bi-directional block self- attention for fast and memory-efﬁcient sequence modeling

Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, and Chengqi Zhang. Bi-directional block self- attention for fast and memory-efﬁcient sequence modeling. arXiv preprint arXiv:1804.00857 ,

work page arXiv
[32]

Manning, Andrew Ng, and Christopher Potts

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642, Seattle, Washington, USA, October

work page 2013
[33]

URL https://www.aclweb.org/anthology/D13-1170

Association for Computa- tional Linguistics. URL https://www.aclweb.org/anthology/D13-1170. Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. Patient knowledge distillation for BERT model compression. arXiv preprint arXiv:1908.09355,

work page arXiv 1908
[34]

Well-read students learn better: The impact of student initialization on knowledge distillation

Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Well-read students learn better: The impact of student initialization on knowledge distillation. arXiv preprint arXiv:1908.08962,

work page arXiv 1908
[35]

GLUE: A multi-task benchmark and analysis platform for natural language understanding

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceed- ings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353–355, Brussels, Belgium, November

work page 2018
[36]

StructBERT: Incorporating language structures into pre-training for deep language understanding

Association for Computational Linguistics. doi: 10.18653/v1/W18-5446. URL https://www.aclweb.org/anthology/W18-5446. Wei Wang, Bin Bi, Ming Yan, Chen Wu, Zuyi Bao, Liwei Peng, and Luo Si. StructBERT: Incorporating language structures into pre-training for deep language understanding. arXiv preprint arXiv:1908.04577,

work page doi:10.18653/v1/w18-5446 1908
[37]

Neural network acceptability judgments

Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471,

work page arXiv
[38]

A broad-coverage challenge corpus for sentence understanding through inference

Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. ...

work page 2018
[39]

XLNet: Generalized autoregressive pretraining for language understanding

Association for Computational Linguistics. doi: 10.18653/v1/N18-1101. URL https://www.aclweb.org/anthology/N18-1101. Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237,

work page doi:10.18653/v1/n18-1101 1906
[40]

In: International Conference on Learning Representations (ICLR) 2020 (2020). https://arxiv.org/abs/1904.00962

Yang You, Jing Li, Jonathan Hseu, Xiaodan Song, James Demmel, and Cho-Jui Hsieh. Reducing BERT pre-training time from 3 days to 76 minutes. arXiv preprint arXiv:1904.00962,

work page arXiv 1904
[41]

DCMN+: Dual co-matching network for multi-choice reading comprehension

Shuailiang Zhang, Hai Zhao, Yuwei Wu, Zhuosheng Zhang, Xi Zhou, and Xiang Zhou. DCMN+: Dual co-matching network for multi-choice reading comprehension. arXiv preprint arXiv:1908.11511,

work page arXiv 1908
[42]

using different numbers of layers. Networks with 3 or more layers are trained by fine-tuning using the parameters from the depth before (e.g., the 12-layer network parameters are fine-tuned from the checkpoint of the 6-layer network parameters). A similar technique has been used in Gong et al. (2019). If we compare a 3-layer ALBERT model with a 1-layer ALBE...
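The warm-start scheme described above (a deeper network initialized from the checkpoint of the previous depth) can be illustrated with a small toy sketch. Everything here is an assumption for illustration: the scalar `shared_layer`, the helper names `encode` and `warm_start`, and the parameter values are hypothetical and are not the authors' implementation. The sketch also assumes full cross-layer sharing, in which case the deeper model has exactly the same parameter set as the shallower one and the checkpoint can be copied over unchanged before fine-tuning continues.

```python
import math


def shared_layer(x, params):
    """Toy stand-in for one transformer block: residual plus a tanh projection."""
    return [xi + math.tanh(params["w"] * xi + params["b"]) for xi in x]


def encode(x, params, num_layers):
    """Apply the same (shared) layer num_layers times."""
    for _ in range(num_layers):
        x = shared_layer(x, params)
    return x


def warm_start(checkpoint):
    """Initialize a deeper model by copying the shallower model's shared parameters."""
    return dict(checkpoint)


# Hypothetical values standing in for a trained 6-layer checkpoint.
ckpt_6 = {"w": 0.5, "b": -0.1}

# Starting point for fine-tuning at depth 12; no training loop is shown here.
params_12 = warm_start(ckpt_6)

x = [0.2, -1.0, 3.0]
h6 = encode(x, ckpt_6, num_layers=6)
h12 = encode(x, params_12, num_layers=12)
```

Because the layer is shared, the 12-layer forward pass at initialization is just the 6-layer pass applied twice with the same parameters; subsequent fine-tuning would then update those parameters at the new depth.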

work page 2019
[43]

We conclude that, when sharing all cross-layer parameters (ALBERT-style), there is no need for models deeper than a 12-layer configuration

The difference between 12-layer and 24-layer ALBERT-xxlarge configurations in terms of downstream accuracy is negligible, with the Avg score being the same. We conclude that, when sharing all cross-layer parameters (ALBERT-style), there is no need for models deeper than a 12-layer configuration.

A.3 DOWNSTREAM EVALUATION TASKS

GLUE. GLUE is comprised of 9 t...

work page 2018
[44]

It focuses on evaluating model capabilities for natural language understanding

and Winograd NLI (WNLI; Levesque et al., 2012). It focuses on evaluating model capabilities for natural language understanding. When reporting MNLI results, we only report the “match” condition (MNLI-m). We follow the finetuning procedures from prior work (Devlin et al., 2019; Liu et al., 2019; Yang et al.,

work page 2012
[45]

For test set submissions, we perform task-specific modifications for WNLI and QNLI as described by Liu et al

and report the held-out test set performance obtained from GLUE submissions. For test set submissions, we perform task-specific modifications for WNLI and QNLI as described by Liu et al. (2019) and Yang et al. (2019).

SQuAD. SQuAD is an extractive question answering dataset built from Wikipedia. The answers are segments from the context paragraphs and the ta...

work page 2019
[46]

(2019), Devlin et al

We adapt these hyperparameters from Liu et al. (2019), Devlin et al. (2019), and Yang et al. (2019).

Task    LR        BSZ   ALBERT DR   Classifier DR   TS      WS     MSL
CoLA    1.00E-05  16    0           0.1             5336    320    512
STS     2.00E-05  16    0           0.1             3598    214    512
SST-2   1.00E-05  32    0           0.1             20935   1256   512
MNLI    3.00E-05  128   0           0.1             10000   1000   512
QNLI    1.00E-05  32    0           0.1             33112   1986   512
QQP     5.00E-05  128   0.1         0.1             1400...
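As a minimal sketch, the fully visible hyperparameter rows above can be captured in a small configuration structure. The class name `FinetuneConfig`, its field names, and the `warmup_fraction` helper are hypothetical and introduced only for illustration. The expansions of the column abbreviations (LR as learning rate, BSZ as batch size, DR as dropout rate, TS as training steps, WS as warm-up steps, MSL as maximum sequence length) are my reading of the headers. The QQP row is truncated in the text, so it is left out rather than guessed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneConfig:
    learning_rate: float       # LR
    batch_size: int            # BSZ
    albert_dropout: float      # ALBERT DR
    classifier_dropout: float  # Classifier DR
    train_steps: int           # TS
    warmup_steps: int          # WS
    max_seq_length: int        # MSL

    @property
    def warmup_fraction(self) -> float:
        """Share of training steps spent in warm-up (derived, not in the table)."""
        return self.warmup_steps / self.train_steps


# Only rows that appear in full in the text; QQP is truncated and omitted.
CONFIGS = {
    "CoLA": FinetuneConfig(1e-5, 16, 0.0, 0.1, 5336, 320, 512),
    "STS": FinetuneConfig(2e-5, 16, 0.0, 0.1, 3598, 214, 512),
    "SST-2": FinetuneConfig(1e-5, 32, 0.0, 0.1, 20935, 1256, 512),
    "MNLI": FinetuneConfig(3e-5, 128, 0.0, 0.1, 10000, 1000, 512),
    "QNLI": FinetuneConfig(1e-5, 32, 0.0, 0.1, 33112, 1986, 512),
}

for task, cfg in CONFIGS.items():
    print(f"{task}: lr={cfg.learning_rate:g}, warm-up fraction={cfg.warmup_fraction:.3f}")
```

Computing the derived fraction makes one pattern in the listed rows easy to see: warm-up covers roughly 6% of the training steps for CoLA, STS, SST-2, and QNLI, and 10% for MNLI.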

work page 2019