ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Pith reviewed 2026-05-13 12:22 UTC · model grok-4.3
The pith
ALBERT uses parameter-reduction techniques and a new inter-sentence coherence loss to reach state-of-the-art results on GLUE, RACE, and SQuAD while using fewer parameters than BERT-large.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By applying factorized embedding parameterization and cross-layer parameter sharing, together with a sentence-order prediction loss that replaces the next-sentence prediction objective, ALBERT produces language representations that achieve higher scores on GLUE, RACE, and SQuAD than BERT-large while using substantially fewer parameters.
What carries the argument
Factorized embedding parameterization (splitting the large vocabulary embedding matrix into two smaller matrices) combined with cross-layer parameter sharing (reusing the same weights across all transformer layers) and the sentence-order prediction loss (a binary classification task on sentence ordering).
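As a rough illustration of the factorization described above, the Python sketch below counts embedding parameters with and without the split into two smaller matrices. The vocabulary size V = 30,000, embedding size E = 128, hidden size H = 1,024, and the function names are all illustrative assumptions; none of them come from the text above.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters in a single V x H embedding matrix."""
    return vocab_size * hidden_size


def factorized_embedding_params(vocab_size: int, embed_size: int, hidden_size: int) -> int:
    """Parameters in a V x E lookup table plus an E x H projection."""
    return vocab_size * embed_size + embed_size * hidden_size


# Illustrative sizes (assumptions chosen for this sketch).
V, E, H = 30_000, 128, 1_024

full = embedding_params(V, H)
factored = factorized_embedding_params(V, E, H)

print(f"unfactorized: {full:,}")
print(f"factorized:   {factored:,}")
print(f"reduction:    {full / factored:.1f}x")
```

Under these assumed sizes the saving comes almost entirely from the V x E table being much narrower than V x H; the extra E x H projection is small by comparison.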
If this is right
- Models of this architecture can be trained on the same hardware with larger batch sizes or longer sequences.
- The inter-sentence coherence objective improves accuracy on multi-sentence reasoning tasks without adding parameters.
- Further scaling becomes feasible because encoder parameter count no longer grows with depth, and embedding parameters no longer scale with the product of vocabulary size and hidden size.
- Pretrained checkpoints can be released that are both smaller to store and faster to fine-tune than prior large models.
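The inter-sentence coherence objective in the list above can be sketched as a data-construction step. The Python below is a minimal illustration, not the authors' pipeline: the function names, the 50/50 sampling, and the toy sentences are assumptions. It contrasts a sentence-order prediction (SOP) negative, which swaps two consecutive segments, with a next-sentence prediction (NSP) negative, which pulls the second segment from a different document.

```python
import random


def make_sop_example(first, second, rng):
    """Return (segment_a, segment_b, label) for sentence-order prediction.

    `first` and `second` are consecutive segments from the same document.
    Label 1 means original order; label 0 means the segments were swapped.
    """
    if rng.random() < 0.5:
        return first, second, 1
    return second, first, 0


def make_nsp_example(first, second, other_doc_segment, rng):
    """Return (segment_a, segment_b, label) for next-sentence prediction.

    The negative replaces the second segment with text from another
    document, so a topic mismatch alone can give the label away.
    """
    if rng.random() < 0.5:
        return first, second, 1
    return first, other_doc_segment, 0


# Toy data (hypothetical sentences, for illustration only).
rng = random.Random(0)
doc = ["The model shares weights across layers.",
       "This keeps the parameter count small."]
unrelated = "The recipe calls for two eggs."

sop = make_sop_example(doc[0], doc[1], rng)
nsp = make_nsp_example(doc[0], doc[1], unrelated, rng)
print("SOP:", sop)
print("NSP:", nsp)
```

Because both SOP segments always come from the same document, the classifier cannot rely on topic cues and has to model ordering, which is one plausible reading of why the objective targets multi-sentence tasks.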
Where Pith is reading between the lines
- The same factorization and sharing ideas could be applied to other transformer families to reduce their training cost.
- Efficiency gains of this kind may allow pretraining at scales that would otherwise be blocked by memory limits on current accelerators.
- Downstream applications that run on edge devices could adopt these lighter models with little loss in accuracy.
- The results suggest that many existing large models contain redundant parameters that can be removed through structured sharing rather than through pruning after training.
Load-bearing premise
That the chosen reductions in embedding size and layer duplication keep enough model capacity to represent the same linguistic patterns that the full BERT-large model captures.
What would settle it
A direct comparison in which an ALBERT model with the reported parameter count scores lower than BERT-large on the GLUE average or on SQuAD F1 when both are trained under identical data and optimization settings.
Original abstract
Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large. The code and the pretrained models are available at https://github.com/google-research/ALBERT.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents ALBERT, a parameter-efficient variant of BERT for self-supervised language representation learning. It introduces factorized embedding parameterization (§3.1) and cross-layer parameter sharing (§3.2) to reduce memory consumption and training time, along with a sentence-order prediction (SOP) objective (§3.3) to model inter-sentence coherence. Comprehensive experiments demonstrate improved scaling over BERT, with the largest ALBERT model achieving new state-of-the-art results on GLUE, RACE, and SQuAD while using fewer parameters than BERT-large; code and pretrained models are released.
Significance. If the empirical results hold under matched controls, the work is significant for enabling more efficient scaling of pretrained language models. The parameter-reduction techniques address practical GPU/TPU constraints, and the SOP loss offers a targeted improvement for multi-sentence tasks. Releasing code and models supports reproducibility and further research on efficient pretraining.
major comments (2)
- [§4 and Table 2] §4 and Table 2: The SOTA claims on GLUE/RACE/SQuAD rest on the assumption that factorized embeddings and cross-layer sharing preserve effective capacity relative to BERT-large; the manuscript lacks a direct capacity analysis (e.g., probing tasks or parameter-efficiency metrics) to confirm this, which is load-bearing for attributing gains to the proposed architecture rather than training dynamics.
- [§3.3 and §4.3] §3.3 and §4.3: The claim that SOP supplies additive benefit beyond masked LM requires training-step-matched ablations against BERT baselines; without these controls, it remains possible that efficiency-driven longer schedules (rather than the architectural changes) drive the reported gains on multi-sentence benchmarks.
minor comments (3)
- [Abstract] Abstract: The phrase 'comprehensive empirical evidence' could be sharpened by briefly naming the key metrics (e.g., average GLUE score) for immediate clarity.
- [Figure 2] Figure 2: The scaling curves would be easier to interpret with explicit annotation of the training steps or FLOPs at each point.
- [§6] §6: The limitations paragraph could explicitly note whether SOP provides benefit on single-sentence tasks or only multi-sentence ones.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and the recommendation of minor revision. We address each major comment below with clarifications drawn directly from the manuscript's experiments and indicate where we will strengthen the presentation in the revised version.
Point-by-point responses
-
Referee: [§4 and Table 2] §4 and Table 2: The SOTA claims on GLUE/RACE/SQuAD rest on the assumption that factorized embeddings and cross-layer sharing preserve effective capacity relative to BERT-large; the manuscript lacks a direct capacity analysis (e.g., probing tasks or parameter-efficiency metrics) to confirm this, which is load-bearing for attributing gains to the proposed architecture rather than training dynamics.
Authors: We agree that explicit capacity probes would further isolate architectural effects. The manuscript quantifies parameter reduction (ALBERT-large uses 18M parameters versus BERT-large's 340M) and demonstrates improved scaling behavior in Figure 1 and Table 2, where ALBERT models continue to benefit from increased depth and hidden size under the sharing and factorization constraints while outperforming BERT-large on the target benchmarks. These results attribute gains to the architecture because the same pretraining objective and data are used, with only the parameterization changed. We will add a short paragraph in §4 referencing the effective parameter counts and scaling curves to make this attribution more explicit.
Revision: partial
-
Referee: [§3.3 and §4.3] §3.3 and §4.3: The claim that SOP supplies additive benefit beyond masked LM requires training-step-matched ablations against BERT baselines; without these controls, it remains possible that efficiency-driven longer schedules (rather than the architectural changes) drive the reported gains on multi-sentence benchmarks.
Authors: We acknowledge the value of step-matched controls. Within ALBERT, Table 4 reports SOP versus NSP ablations under identical training steps and data, showing consistent gains on multi-sentence tasks (MNLI, QNLI, RACE). The manuscript notes that ALBERT's efficiency permits 1M training steps, matching the step count reported for the original BERT-large. We will revise §4.3 to explicitly state the shared step count and add a sentence clarifying that the SOP improvement is measured under these fixed-step ablations rather than extended schedules alone.
Revision: partial
Circularity Check
No significant circularity; empirical SOTA claims rest on held-out benchmarks
full rationale
The paper proposes two parameter-reduction techniques (factorized embeddings in §3.1, cross-layer sharing in §3.2) and replaces NSP with SOP loss (§3.3), then evaluates the resulting models on GLUE, RACE, and SQuAD. No derivation chain exists that reduces any claimed result to its own inputs by construction, fitted parameters renamed as predictions, or load-bearing self-citations. Performance numbers are external benchmark scores, not internal tautologies. This is the normal non-circular case for an empirical architecture paper.
Axiom & Free-Parameter Ledger
free parameters (1)
- model hyperparameters (layers, hidden size, etc.)
axioms (1)
- domain assumption: Transformer layers can share parameters across depth without catastrophic loss of capacity
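To make the parameter-sharing assumption concrete, the Python sketch below compares a rough encoder weight count with and without cross-layer sharing. It is a back-of-the-envelope estimate under stated assumptions: biases and layer normalization are ignored, the feed-forward block is assumed to expand to 4x the hidden size, and the function names and the hidden size of 1,024 are hypothetical.

```python
def layer_param_count(hidden_size: int) -> int:
    """Rough weight count for one transformer layer (no biases, no layer norm).

    Attention uses four H x H projections; the feed-forward block uses
    H x 4H and 4H x H matrices (the 4x expansion is an assumption).
    """
    attention = 4 * hidden_size * hidden_size
    feed_forward = 2 * hidden_size * (4 * hidden_size)
    return attention + feed_forward


def encoder_param_count(num_layers: int, hidden_size: int, share_layers: bool) -> int:
    """Total encoder weights; with sharing, one layer's weights serve every depth."""
    per_layer = layer_param_count(hidden_size)
    return per_layer if share_layers else num_layers * per_layer


H = 1_024  # illustrative hidden size
for depth in (12, 24, 48):
    unshared = encoder_param_count(depth, H, share_layers=False)
    shared = encoder_param_count(depth, H, share_layers=True)
    print(f"L={depth:>2}  unshared={unshared:,}  shared={shared:,}")
```

Sharing reduces the number of stored weights only. Each layer is still executed at every depth, so the forward-pass compute for a given depth and width is unchanged.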
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced · unclear · "we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT... our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large."
-
IndisputableMonolith.Foundation.PhiForcing · phi_equation · unclear · "we also use a self-supervised loss that focuses on modeling inter-sentence coherence"
Forward citations
Cited by 24 Pith papers
-
Measuring Massive Multitask Language Understanding
Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.
-
Language Models are Few-Shot Learners
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
-
SMolLM: Small Language Models Learn Small Molecular Grammar
A 53K-parameter model generates 95% valid SMILES on ZINC-250K, outperforming larger models, by resolving chemical constraints in fixed order: brackets first, rings second, valence last.
-
Learning Posterior Predictive Distributions for Node Classification from Synthetic Graph Priors
NodePFN pre-trains on synthetic graphs with controllable homophily and causal feature-label models to achieve 71.27 average accuracy on 23 node classification benchmarks without graph-specific training.
-
GuardPhish: Securing Open-Source LLMs from Phishing Abuse
Open-source LLMs detect phishing intent at high rates but still generate actionable phishing content, and GuardPhish supplies a dataset plus modular classifiers to close the gap.
-
SecureRouter: Encrypted Routing for Efficient Secure Inference
SecureRouter accelerates secure transformer inference by 1.95x via an encrypted router that selects input-adaptive models from an MPC-optimized pool with negligible accuracy loss.
-
LA-Sign: Looped Transformers with Geometry-aware Alignment for Skeleton-based Sign Language Recognition
LA-Sign achieves state-of-the-art skeleton-based sign language recognition on WLASL and MSASL by using recurrent looped transformers with adaptive hyperbolic geometry alignment.
-
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
-
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
BART introduces a denoising pretraining method for seq2seq models that matches RoBERTa on GLUE and SQuAD while setting new state-of-the-art results on abstractive summarization, dialogue, and QA with up to 6 ROUGE gains.
-
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...
-
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.
-
BoolXLLM: LLM-Assisted Explainability for Boolean Models
BoolXLLM augments an existing Boolean rule learner with LLMs for feature selection, discretization thresholds, and natural-language rule translation to improve interpretability while preserving accuracy.
-
Shortcut Solutions Learned by Transformers Impair Continual Compositional Reasoning
BERT learns shortcut solutions that impair generalization and forward transfer in continual LEGO, while ALBERT learns loop-like solutions for better performance, yet both fail at cross-experience composition, with ALB...
-
Fight Poison with Poison: Enhancing Robustness in Few-shot Machine-Generated Text Detection with Adversarial Training
REACT uses a RAG-powered attacker to generate challenging adversarial examples and trains a detector with contrastive learning in an alternating loop, raising average F1 by 4.95 points and lowering attack success rate...
-
ADE: Adaptive Dictionary Embeddings -- Scaling Multi-Anchor Representations to Large Language Models
ADE scales multi-anchor word representations to transformers via Vocabulary Projection, Grouped Positional Encoding, and context-aware reweighting, achieving 98.7% fewer trainable parameters than DeBERTa-v3-base while...
-
Language Models (Mostly) Know What They Know
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
-
Unsupervised Dense Information Retrieval with Contrastive Learning
Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.
-
A General Language Assistant as a Laboratory for Alignment
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
-
HuggingFace's Transformers: State-of-the-art Natural Language Processing
Hugging Face releases an open-source Python library that supplies a unified API and pretrained weights for major Transformer architectures used in natural language processing.
-
Natural Language Processing: A Comprehensive Practical Guide from Tokenisation to RLHF
A structured practicum guides readers through the complete modern NLP pipeline with reproducible sessions and new linguistic resources for Tajik and Tatar.
-
Hyperloop Transformers
Hyperloop Transformers outperform standard and mHC Transformers with roughly 50% fewer parameters by looping a middle block of layers and applying hyper-connections only after each loop.
-
Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities
A survey synthesizing challenges, system architectures, model optimizations, deployment methods, and resource management techniques for large language model inference at the network edge.
-
Large Language Models: A Survey
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
-
Natural Language Processing: A Comprehensive Practical Guide from Tokenisation to RLHF
The work provides a reproducible, session-based guide to the NLP pipeline with original adaptations and resources for morphologically rich low-resource languages.