DeBERTa: Decoding-enhanced BERT with Disentangled Attention
Pith reviewed 2026-05-13 04:45 UTC · model grok-4.3
The pith
DeBERTa uses separate vectors for word content and position to compute attention, plus absolute positions in the mask decoder, yielding better NLP performance than RoBERTa with half the training data and the first single-model SuperGLUE score above the human baseline.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By representing each word with two vectors that separately encode its content and its position, and by computing attention weights through disentangled matrices on contents and relative positions, together with an enhanced mask decoder that injects absolute positions into the prediction of masked tokens, the DeBERTa architecture improves both pre-training efficiency and downstream accuracy on natural language understanding and generation tasks. When scaled, this model achieves a macro-average score of 89.9 on SuperGLUE, surpassing the human baseline of 89.8 for the first time with a single model.
What carries the argument
Disentangled attention, in which each word is represented by separate content and position vectors and attention weights are computed with distinct matrices for content and relative-position information.
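To make the mechanism concrete, here is a minimal sketch of the disentangled score computation in PyTorch. The single-head, unbatched layout and the helper input rel_idx (the clipped relative distance delta(i, j) mapped into [0, 2K)) are simplifying assumptions for illustration; the authors' released implementation differs in bucketing, batching, and efficiency details.

```python
import torch

def disentangled_attention_scores(Hc, P_rel, Wq_c, Wk_c, Wq_r, Wk_r, rel_idx):
    """
    Hc:        (L, d)  content hidden states for L tokens
    P_rel:     (2K, d) shared relative-position embeddings
    W*_c, W*_r:(d, d)  content / position projection matrices
    rel_idx:   (L, L)  long tensor, index of the clipped relative distance delta(i, j) in [0, 2K)
    Returns unnormalized attention scores of shape (L, L).
    """
    Qc, Kc = Hc @ Wq_c, Hc @ Wk_c        # content queries and keys
    Qr, Kr = P_rel @ Wq_r, P_rel @ Wk_r  # relative-position queries and keys

    c2c = Qc @ Kc.T                              # content-to-content
    c2p = torch.gather(Qc @ Kr.T, 1, rel_idx)    # content-to-position: Qc[i] . Kr[delta(i, j)]
    p2c = torch.gather(Kc @ Qr.T, 1, rel_idx).T  # position-to-content: Kc[j] . Qr[delta(j, i)]

    d = Hc.shape[-1]
    return (c2c + c2p + p2c) / (3 * d) ** 0.5    # scale by sqrt(3d) before softmax
```

The sum keeps three of the four possible content/position cross terms; a position-to-position term is dropped because, with purely relative embeddings, it adds little information.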
If this is right
- A model trained on half the data can still exceed the accuracy of prior RoBERTa models on MNLI, SQuAD v2.0, and RACE.
- Scaling the architecture to 48 layers and 1.5 billion parameters produces a single-model SuperGLUE macro-average above the human baseline.
- Adding virtual adversarial training during fine-tuning further improves generalization on the same benchmarks (see the sketch after this list).
- An ensemble of DeBERTa models widens the margin over the human baseline on the SuperGLUE leaderboard.
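On the virtual adversarial training point, the sketch below shows a generic one-step embedding-space perturbation of the kind applied at fine-tuning time. It is a simplified stand-in for the paper's scale-invariant procedure; the model.embed / model.head hooks, the single power-iteration step, and the hyperparameter values are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def virtual_adversarial_loss(model, inputs, logits_clean, eps=1e-3, xi=1e-6):
    """One-step virtual adversarial loss on normalized input embeddings.
    model.embed and model.head are hypothetical hooks: token ids -> embeddings,
    embeddings -> task logits."""
    emb = model.embed(inputs)                    # (B, L, d)
    emb = F.layer_norm(emb, emb.shape[-1:])      # normalize embeddings before perturbing

    # Random direction refined by one power-iteration step.
    d = torch.randn_like(emb, requires_grad=True)
    kl = F.kl_div(F.log_softmax(model.head(emb + xi * d), dim=-1),
                  F.softmax(logits_clean.detach(), dim=-1), reduction="batchmean")
    grad, = torch.autograd.grad(kl, d)
    r_adv = eps * F.normalize(grad.flatten(1), dim=-1).view_as(emb)

    # Penalize prediction drift under the adversarial perturbation.
    return F.kl_div(F.log_softmax(model.head(emb + r_adv), dim=-1),
                    F.softmax(logits_clean.detach(), dim=-1), reduction="batchmean")
```

The fine-tuning objective would then be the ordinary task loss plus a weighted copy of this regularizer computed on the same batch.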
Where Pith is reading between the lines
- The explicit separation of content and relative-position signals may reduce the amount of pre-training data needed for competitive performance in future transformer variants.
- The same disentanglement pattern could be tested in non-language sequence tasks such as protein folding or time-series forecasting to see whether similar efficiency gains appear.
- If the absolute-position injection in the decoder proves critical, future mask-prediction objectives in other architectures might benefit from making position information available only at the final decoding step.
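The last bullet turns on how the enhanced mask decoder works, so a schematic helps: the encoder attends with relative positions only, and absolute position embeddings are added just before the masked-token prediction head. The module layout below, the layer(query, key_value) signature, and the two-layer depth are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class EnhancedMaskDecoder(nn.Module):
    """Predicts masked tokens from encoder states that, up to this point,
    have only seen relative positions; absolute positions are injected here."""
    def __init__(self, make_layer, hidden, vocab_size, max_len=512, n_layers=2):
        super().__init__()
        self.abs_pos = nn.Embedding(max_len, hidden)  # absolute position table
        self.layers = nn.ModuleList([make_layer() for _ in range(n_layers)])
        self.lm_head = nn.Linear(hidden, vocab_size)

    def forward(self, encoder_states):                # (B, L, d)
        B, L, _ = encoder_states.shape
        pos = self.abs_pos(torch.arange(L, device=encoder_states.device))
        query = encoder_states + pos                  # absolute positions enter only here
        for layer in self.layers:
            # hypothetical signature: layer(query_states, key_value_states)
            query = layer(query, encoder_states)
        return self.lm_head(query)                    # vocabulary logits for the MLM objective
```

If the premise in that bullet holds, ablating self.abs_pos, or moving it back to the encoder input as BERT does, is the experiment that would isolate its contribution.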
Load-bearing premise
The reported accuracy gains come from the disentangled attention and enhanced mask decoder rather than from unreported differences in training data volume, optimizer settings, or other implementation details.
What would settle it
Retraining a standard RoBERTa-Large model on exactly the same data and with the same hyperparameters as the reported DeBERTa model, but without the disentangled attention or enhanced mask decoder, and checking whether its scores on MNLI, SQuAD v2.0, RACE, and SuperGLUE still fall short.
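A compact way to state that control, with hypothetical config fields and values; the only thing allowed to vary between the two runs is the pair of architectural switches.

```python
# Hypothetical experiment grid for the controlled comparison described above.
shared = dict(
    layers=24, hidden=1024, params="~355M",             # RoBERTa-Large-sized
    data="identical pre-training mixture and token budget",
    optimizer="same optimizer, schedule, and batch size",
    seeds=(1, 2, 3),                                    # report mean and spread per task
)

runs = [
    dict(shared, name="baseline", disentangled_attention=False, enhanced_mask_decoder=False),
    dict(shared, name="deberta",  disentangled_attention=True,  enhanced_mask_decoder=True),
]

tasks = ["MNLI", "SQuAD v2.0", "RACE", "SuperGLUE"]
# The attribution claim survives only if "deberta" beats "baseline" on these
# tasks under this matched setup.
```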
read the original abstract
Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural langauge generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transform layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, out performing the human baseline by a decent margin (90.3 versus 89.8).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DeBERTa, a new pre-trained language model architecture that augments BERT/RoBERTa with two techniques: (1) a disentangled attention mechanism in which each token is represented by separate content and position vectors and attention weights are computed via disentangled matrices on content and relative positions, and (2) an enhanced mask decoder that injects absolute position information when predicting masked tokens during pre-training. It further applies virtual adversarial training at fine-tuning time. The authors report that a DeBERTa model trained on half the data used for RoBERTa-Large outperforms it on MNLI (+0.9%), SQuAD v2.0 (+2.3%), and RACE (+3.6%), and that a 1.5-billion-parameter, 48-layer DeBERTa model achieves a SuperGLUE macro-average of 89.9, exceeding the human baseline of 89.8 (with the ensemble at 90.3).
Significance. If the performance gains are shown to stem from the disentangled attention and enhanced mask decoder rather than model scale, data volume, or unreported hyper-parameter differences, the work would offer a concrete architectural improvement in how transformers handle positional information and would mark a notable milestone by being the first single model to surpass human performance on SuperGLUE. The empirical results on standard NLU benchmarks are a strength, but their attribution to the proposed mechanisms remains the central open question.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experimental Results): The headline comparisons (MNLI 91.1%, SQuAD 90.7%, RACE 86.8%) pit a 1.5 B-parameter DeBERTa against RoBERTa-Large (355 M parameters) trained on more data; without a same-size, same-data baseline that uses standard attention and mask decoding, the reported deltas cannot be unambiguously attributed to the disentangled attention or enhanced mask decoder.
- [§4.3] §4.3 (SuperGLUE evaluation): The claim that the single 1.5 B DeBERTa model surpasses human performance (89.9 vs. 89.8) is load-bearing for the paper’s significance, yet no ablation is presented that trains an otherwise identical 1.5 B model with conventional BERT attention and mask decoder on the same pre-training mixture to test whether the architectural changes are necessary for exceeding the human baseline.
- [Methods and §3] Methods and §3 (Model Architecture): The training recipe for the 1.5 B model (exact data mixture, number of tokens, optimizer schedule, and whether VATT is used in pre-training) is not fully specified, preventing verification that the observed gains exceed what would be expected from scaling laws alone.
minor comments (3)
- [Abstract] Abstract: “natural langauge generation” contains a typo and should read “natural language generation.”
- [Abstract] The SuperGLUE leaderboard snapshot is dated “January 6, 2021”; the manuscript should clarify whether this reflects the state at submission or a later update.
- [§4] No error bars, standard deviations, or number of random seeds are reported for any benchmark score, which weakens confidence in the small margins (e.g., 89.9 vs. 89.8).
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications on our experimental controls, additional details on the training setup, and commitments to revise the manuscript accordingly. Our responses emphasize the evidence from controlled smaller-scale experiments while acknowledging computational constraints on large-scale ablations.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experimental Results): The headline comparisons (MNLI 91.1%, SQuAD 90.7%, RACE 86.8%) pit a 1.5 B-parameter DeBERTa against RoBERTa-Large (355 M parameters) trained on more data; without a same-size, same-data baseline that uses standard attention and mask decoding, the reported deltas cannot be unambiguously attributed to the disentangled attention or enhanced mask decoder.
Authors: We acknowledge the value of matched baselines at the 1.5B scale. However, §4.1 and §4.2 report controlled experiments with DeBERTa-base (comparable parameter count to RoBERTa-base) trained on the same data volume and mixture as RoBERTa, where disentangled attention and the enhanced mask decoder yield consistent gains (e.g., +1.5% MNLI, +2.5% SQuAD v2.0). These isolate the architectural contributions independent of scale. The 1.5B results build on this foundation, and we will revise the abstract and §4 to foreground the base-scale controls while noting that full 1.5B matched baselines are left for future work due to resource limits. revision: partial
-
Referee: [§4.3] §4.3 (SuperGLUE evaluation): The claim that the single 1.5 B DeBERTa model surpasses human performance (89.9 vs. 89.8) is load-bearing for the paper’s significance, yet no ablation is presented that trains an otherwise identical 1.5 B model with conventional BERT attention and mask decoder on the same pre-training mixture to test whether the architectural changes are necessary for exceeding the human baseline.
Authors: An identical 1.5B ablation with standard attention would directly test necessity for the SuperGLUE result. Unfortunately, the compute cost of training a second 1.5B model on the same mixture makes this infeasible in the current study. We instead demonstrate the mechanisms' effectiveness through base-scale ablations where DeBERTa outperforms RoBERTa equivalents under matched conditions, with gains that compound at larger scales. In revision we will add a limitations paragraph, qualify the attribution language in §4.3, and emphasize the base-model evidence supporting the architectural improvements. revision: no
-
Referee: [Methods and §3] Methods and §3 (Model Architecture): The training recipe for the 1.5 B model (exact data mixture, number of tokens, optimizer schedule, and whether VATT is used in pre-training) is not fully specified, preventing verification that the observed gains exceed what would be expected from scaling laws alone.
Authors: We will expand the methods section and add a dedicated appendix with complete pre-training details for the 1.5B model. VATT is applied exclusively at fine-tuning time and not during pre-training. The appendix will list the precise data mixture proportions, total tokens processed, AdamW optimizer settings (betas, weight decay, epsilon), learning-rate schedule with warmup steps and decay, batch size, and other hyperparameters. This will enable readers to situate the results relative to scaling-law expectations. revision: yes
requested experiments (1)
- Ablation training an otherwise identical 1.5B-parameter model using conventional BERT attention and mask decoder on the exact same pre-training mixture, to verify whether the architectural changes are required to exceed the human baseline on SuperGLUE.
Circularity Check
No circularity: DeBERTa's claims rest on empirical training results, not derivations reducing to fitted inputs or self-citations.
full rationale
The manuscript introduces disentangled attention (content/position vectors with separate matrices) and an enhanced mask decoder as architectural proposals, then reports benchmark scores from pre-training and fine-tuning runs. No equations, uniqueness theorems, or first-principles derivations appear that equate the reported gains (e.g., SuperGLUE 89.9) to quantities already present in the training data, RoBERTa baselines, or prior self-citations. The central performance claims are direct outcomes of model training and evaluation on held-out benchmarks, not statistical predictions forced by parameter fitting or renamed empirical patterns. External citations (BERT, RoBERTa, SuperGLUE) supply independent baselines rather than load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Transformer layers with self-attention can be stacked to form effective language models
invented entities (2)
-
Disentangled attention mechanism
no independent evidence
-
Enhanced mask decoder
no independent evidence
Forward citations
Cited by 27 Pith papers
-
ViLegalNLI: Natural Language Inference for Vietnamese Legal Texts
ViLegalNLI is the first 42k-pair Vietnamese legal NLI dataset built via semi-automatic LLM-assisted generation and validation.
-
RoFormer: Enhanced Transformer with Rotary Position Embedding
RoFormer introduces rotary position embeddings that encode absolute positions via rotation matrices and relative dependencies in attention, outperforming prior position methods on long text classification tasks.
-
Directed Social Regard: Surfacing Targeted Advocacy, Opposition, Aid, Harms, and Victimization in Online Media
DSR uses transformer models to detect sentiment targets in text and score them along three theory-motivated axes, with validation showing correlations to existing social science datasets.
-
RSAT: Structured Attribution Makes Small Language Models Faithful Table Reasoners
RSAT uses SFT on verified traces followed by GRPO with NLI faithfulness rewards to make 1-8B models produce verifiable table reasoning with cell citations, raising faithfulness 3.7x to 0.826.
-
RSAT: Structured Attribution Makes Small Language Models Faithful Table Reasoners
RSAT makes 1-8B language models produce faithful table reasoning by training them to output structured steps with cell citations, using SFT followed by GRPO with an NLI-based faithfulness reward.
-
Just Pass Twice: Efficient Token Classification with LLMs for Zero-Shot NER
JPT enables bidirectional token classification in causal LLMs for zero-shot NER via input concatenation plus definition-guided embeddings, delivering +7.9 F1 gains and over 20x speedup on benchmarks.
-
The Indra Representation Hypothesis for Multimodal Alignment
Unimodal model representations converge to a relational structure captured by the Indra representation via V-enriched Yoneda embedding, which is unique and structure-preserving and improves cross-model and cross-modal...
-
Context-Aware Spear Phishing: Generative AI-Enabled Attacks Against Individuals via Public Social Media Data
Generative AI enables scalable, context-aware spear phishing by extracting profiles from public social media, producing emails that outperform real-world phishing samples in personalization and lower recipient suspicion.
-
An Information-theoretic Propagation Denoising and Fusion Framework for Fake News Detection
InfoPDF uses mutual information to suppress noise in LLM-generated synthetic propagation graphs and adaptively fuse them with real data, yielding more discriminative representations for fake news detection.
-
TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning
TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
-
ADE: Adaptive Dictionary Embeddings -- Scaling Multi-Anchor Representations to Large Language Models
ADE scales multi-anchor word representations to transformers via Vocabulary Projection, Grouped Positional Encoding, and context-aware reweighting, achieving 98.7% fewer trainable parameters than DeBERTa-v3-base while...
-
EPM-RL: Reinforcement Learning for On-Premise Product Mapping in E-Commerce
EPM-RL uses PEFT followed by RL with agent-based rewards from judge models to create a trainable in-house product mapping model that improves on fine-tuning alone and beats API baselines in quality-cost while enabling...
-
Beyond Importance Sampling: Rejection-Gated Policy Optimization
RGPO replaces importance sampling with a smooth [0,1] acceptance gate in policy gradients, unifying TRPO/PPO/REINFORCE, bounding variance for heavy-tailed ratios, and showing gains in online RLHF experiments.
-
RouterWise: Joint Resource Allocation and Routing for Latency-Aware Multi-Model LLM Serving
Joint resource allocation and routing for multi-model LLM serving can produce up to 87% variation in achievable output quality across setups on the same GPU cluster.
-
Entities as Retrieval Signals: A Systematic Study of Coverage, Supervision, and Evaluation in Entity-Oriented Ranking
Entity signals cover only 19.7% of relevant documents on Robust04 and no configuration among 443 systems improves MAP by more than 0.05 in open-world evaluation, despite gains when entities are pre-restricted.
-
Million Tutoring Moves (MTM): An Open Multimodal Dataset for the Science of Tutoring
MTM v1 releases 4,654 open math tutoring transcripts as the first step toward a large-scale multimodal repository for studying and improving tutoring.
-
Overconfidence and Calibration in Medical VQA: Empirical Findings and Hallucination-Aware Mitigation
Empirical study finds overconfidence persists in medical VLMs despite scaling and prompting; post-hoc calibration reduces error while hallucination-aware calibration improves both calibration and AUROC.
-
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.
-
Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation
Semantic entropy improves uncertainty estimation in natural language generation by incorporating semantic equivalences, outperforming standard entropy baselines on predicting model accuracy for question answering.
-
Ethical and social risks of harm from Language Models
The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job...
-
Revisiting Semantic Role Labeling: Efficient Structured Inference with Dependency-Informed Analysis
A new encoder-based SRL system with dependency-informed analysis delivers 10x faster inference and comparable or better F1 scores using BERT, RoBERTa, and DeBERTa while supporting multilingual projection.
-
MIPIAD: Multilingual Indirect Prompt Injection Attack Defense with Qwen -- TF-IDF Hybrid and Meta-Ensemble Learning
MIPIAD reports a hybrid Qwen-TF-IDF ensemble defense that reaches F1 0.9205 and reduces the English-Bangla performance gap on a 1.43-million-sample synthetic benchmark derived from BIPIA templates.
-
BiMind: A Dual-Head Reasoning Model with Attention-Geometry Adapter for Incorrect Information Detection
BiMind outperforms existing methods in incorrect information detection by disentangling content and knowledge reasoning with attention geometry adaptation and self-retrieval.
-
Attribution-Driven Explainable Intrusion Detection with Encoder-Based Large Language Models
Encoder-based LLMs detect SDN intrusions with decisions driven by meaningful traffic behaviors, as validated by attribution analysis aligning with established intrusion principles.
-
LLMs Struggle with Abstract Meaning Comprehension More Than Expected
LLMs struggle with abstract meaning comprehension on SemEval-2021 Task 4 more than fine-tuned models, and a new bidirectional attention classifier yields small accuracy gains of 3-4%.
-
Predicting User Satisfaction in Online Education Platforms: A Large Language Model Based Multi-Modal Review Mining Framework
An LLM multi-modal system integrates topic modeling, transformer sentiment, and behavioral features to predict MOOC learner satisfaction more accurately than single-modality baselines.
-
Large Language Models: A Survey
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
Reference graph
Works this paper leans on
- [1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- [2] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
- [3] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
- [4] Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. SemEval-2017 task 1: Semantic textual similarity - multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055, 2017.
- [5] Kezhen Chen, Qiuyuan Huang, Hamid Palangi, Paul Smolensky, Kenneth D. Forbus, and Jianfeng Gao. Natural- to formal-language generation using tensor product representations. arXiv preprint arXiv:1910.02339, 2019.
- [6] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
- [7] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of NAACL-HLT 2019.
- [8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, Volume 1 (Long and Short Papers), pp. 4171-4186.
- [9] William B. Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), 2005.
- [10] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961, 2021.
- [11] Pengcheng He, Xiaodong Liu, Weizhu Chen, and Jianfeng Gao. A hybrid neural network model for commonsense reasoning. arXiv preprint arXiv:1907.11983, 2019. Pengcheng He, Yi Mao, Kaushik Chakrabarti, and Weizhu Chen. X-SQL: Reinforce schema representation with context. arXiv preprint arXiv:1908.08113, 2019.
- [12] Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64-77, 2020.
- [13] Kamal Raj Kanakarajan, Bhuvana Kundumani, and Malaikannan Sankarasubbu. Small-Bench NLP: Benchmark for small single GPU trained models in natural language processing. arXiv preprint arXiv:2109.10847, 2021.
- [14] Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of NAACL-HLT 2018, Volume 1 (Long Papers), pp. 252-262.
- [15] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [16] Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale reading comprehension dataset from examinations. In Proceedings of EMNLP 2017, pp. 785-794.
- [17] Hector J. Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, volume 46, pp. 47, 2011.
- [18] Xiaodong Liu, Hao Cheng, Pengcheng He, Weizhu Chen, Yu Wang, Hoifung Poon, and Jianfeng Gao. Adversarial training for large neural language models. arXiv preprint, 2020. Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. In International Conference on Learning Representations, 2019. Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. In Proceedings of ACL 2019.
- [19] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019. Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in Adam. 2018.
- [20] Shervin Minaee, Nal Kalchbrenner, Erik Cambria, Narjes Nikzad, Meysam Chenaghlu, and Jianfeng Gao. Deep learning based text classification: A comprehensive review. arXiv preprint arXiv:2004.03705, 2020.
- [21] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1979-1993, 2018.
- [22] Mohammad Taher Pilehvar and Jose Camacho-Collados. WiC: The word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of NAACL-HLT 2019, Volume 1 (Long and Short Papers), pp. 1267-1273.
- [23] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP 2016.
- [24] Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series.
- [25] Imanol Schlag, Paul Smolensky, Roland Fernandez, Nebojsa Jojic, Jürgen Schmidhuber, and Jianfeng Gao. Enhancing the transformer with explicit relational encoding for math problem solving. arXiv preprint arXiv:1910.06611, 2019.
- [26] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In Proceedings of NAACL-HLT 2018, Volume 2 (Short Papers), pp. 464-468.
- [27] Tao Shen, Yi Mao, Pengcheng He, Guodong Long, Adam Trischler, and Weizhu Chen. Exploiting structured knowledge in text via graph-guided representation learning. arXiv preprint arXiv:2004.14224, 2020.
- [28] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.
- [29] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP 2013, pp. 1631-1642.
- [30] Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. ERNIE: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223, 2019.
- [31] Trieu H. Trinh and Quoc V. Le. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847, 2018.
- [32] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, pp. 3266-3280, 2019. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019.
- [33] Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL-HLT 2018, Volume 1 (Long Papers), pp. 1112-1122.
- [34] Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of EMNLP 2018, pp. 93-104.
- [35] Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint arXiv:1810.12885, 2018.
discussion (0)