DeBERTa: Decoding-enhanced BERT with Disentangled Attention
Pith reviewed 2026-05-13 04:45 UTC · model grok-4.3
The pith
DeBERTa uses separate vectors for word content and position to compute attention, plus absolute positions in the mask decoder, yielding better NLP performance than RoBERTa with half the training data and the first single-model SuperGLUE score above the human baseline.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By representing each word with two vectors that separately encode its content and its position, and by computing attention weights through disentangled matrices on contents and relative positions, together with an enhanced mask decoder that injects absolute positions into the prediction of masked tokens, the DeBERTa architecture improves both pre-training efficiency and downstream accuracy on natural language understanding and generation tasks. When scaled, this model achieves a macro-average score of 89.9 on SuperGLUE, surpassing the human baseline of 89.8 for the first time with a single model.
What carries the argument
Disentangled attention, in which each word is represented by separate content and position vectors and attention weights are computed with distinct matrices for content and relative-position information.
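To make the mechanism concrete, here is a minimal sketch of the disentangled score computation in PyTorch. The single-head, unbatched layout and the helper input rel_idx (the clipped relative distance delta(i, j) mapped into [0, 2K)) are simplifying assumptions for illustration; the authors' released implementation differs in bucketing, batching, and efficiency details.

```python
import torch

def disentangled_attention_scores(Hc, P_rel, Wq_c, Wk_c, Wq_r, Wk_r, rel_idx):
    """
    Hc:        (L, d)  content hidden states for L tokens
    P_rel:     (2K, d) shared relative-position embeddings
    W*_c, W*_r:(d, d)  content / position projection matrices
    rel_idx:   (L, L)  long tensor, index of the clipped relative distance delta(i, j) in [0, 2K)
    Returns unnormalized attention scores of shape (L, L).
    """
    Qc, Kc = Hc @ Wq_c, Hc @ Wk_c        # content queries and keys
    Qr, Kr = P_rel @ Wq_r, P_rel @ Wk_r  # relative-position queries and keys

    c2c = Qc @ Kc.T                              # content-to-content
    c2p = torch.gather(Qc @ Kr.T, 1, rel_idx)    # content-to-position: Qc[i] . Kr[delta(i, j)]
    p2c = torch.gather(Kc @ Qr.T, 1, rel_idx).T  # position-to-content: Kc[j] . Qr[delta(j, i)]

    d = Hc.shape[-1]
    return (c2c + c2p + p2c) / (3 * d) ** 0.5    # scale by sqrt(3d) before softmax
```

The sum keeps three of the four possible content/position cross terms; a position-to-position term is dropped because, with purely relative embeddings, it adds little information.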
If this is right
- A model trained on half the data can still exceed the accuracy of prior RoBERTa models on MNLI, SQuAD v2.0, and RACE.
- Scaling the architecture to 48 layers and 1.5 billion parameters produces a single-model SuperGLUE macro-average above the human baseline.
- Adding virtual adversarial training during fine-tuning further improves generalization on the same benchmarks (see the sketch after this list).
- An ensemble of DeBERTa models widens the margin over the human baseline on the SuperGLUE leaderboard.
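On the virtual adversarial training point, the sketch below shows a generic one-step embedding-space perturbation of the kind applied at fine-tuning time. It is a simplified stand-in for the paper's scale-invariant procedure; the model.embed / model.head hooks, the single power-iteration step, and the hyperparameter values are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def virtual_adversarial_loss(model, inputs, logits_clean, eps=1e-3, xi=1e-6):
    """One-step virtual adversarial loss on normalized input embeddings.
    model.embed and model.head are hypothetical hooks: token ids -> embeddings,
    embeddings -> task logits."""
    emb = model.embed(inputs)                    # (B, L, d)
    emb = F.layer_norm(emb, emb.shape[-1:])      # normalize embeddings before perturbing

    # Random direction refined by one power-iteration step.
    d = torch.randn_like(emb, requires_grad=True)
    kl = F.kl_div(F.log_softmax(model.head(emb + xi * d), dim=-1),
                  F.softmax(logits_clean.detach(), dim=-1), reduction="batchmean")
    grad, = torch.autograd.grad(kl, d)
    r_adv = eps * F.normalize(grad.flatten(1), dim=-1).view_as(emb)

    # Penalize prediction drift under the adversarial perturbation.
    return F.kl_div(F.log_softmax(model.head(emb + r_adv), dim=-1),
                    F.softmax(logits_clean.detach(), dim=-1), reduction="batchmean")
```

The fine-tuning objective would then be the ordinary task loss plus a weighted copy of this regularizer computed on the same batch.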
Where Pith is reading between the lines
- The explicit separation of content and relative-position signals may reduce the amount of pre-training data needed for competitive performance in future transformer variants.
- The same disentanglement pattern could be tested in non-language sequence tasks such as protein folding or time-series forecasting to see whether similar efficiency gains appear.
- If the absolute-position injection in the decoder proves critical, future mask-prediction objectives in other architectures might benefit from making position information available only at the final decoding step.
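The last bullet turns on how the enhanced mask decoder works, so a schematic helps: the encoder attends with relative positions only, and absolute position embeddings are added just before the masked-token prediction head. The module layout below, the layer(query, key_value) signature, and the two-layer depth are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class EnhancedMaskDecoder(nn.Module):
    """Predicts masked tokens from encoder states that, up to this point,
    have only seen relative positions; absolute positions are injected here."""
    def __init__(self, make_layer, hidden, vocab_size, max_len=512, n_layers=2):
        super().__init__()
        self.abs_pos = nn.Embedding(max_len, hidden)  # absolute position table
        self.layers = nn.ModuleList([make_layer() for _ in range(n_layers)])
        self.lm_head = nn.Linear(hidden, vocab_size)

    def forward(self, encoder_states):                # (B, L, d)
        B, L, _ = encoder_states.shape
        pos = self.abs_pos(torch.arange(L, device=encoder_states.device))
        query = encoder_states + pos                  # absolute positions enter only here
        for layer in self.layers:
            # hypothetical signature: layer(query_states, key_value_states)
            query = layer(query, encoder_states)
        return self.lm_head(query)                    # vocabulary logits for the MLM objective
```

If the premise in that bullet holds, ablating self.abs_pos, or moving it back to the encoder input as BERT does, is the experiment that would isolate its contribution.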
Load-bearing premise
The reported accuracy gains come from the disentangled attention and enhanced mask decoder rather than from unreported differences in training data volume, optimizer settings, or other implementation details.
What would settle it
Retraining a standard RoBERTa-Large model on exactly the same data and with the same hyperparameters as the reported DeBERTa model, but without the disentangled attention or enhanced mask decoder, and checking whether its scores on MNLI, SQuAD v2.0, RACE, and SuperGLUE still fall short.
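A compact way to state that control, with hypothetical config fields and values; the only thing allowed to vary between the two runs is the pair of architectural switches.

```python
# Hypothetical experiment grid for the controlled comparison described above.
shared = dict(
    layers=24, hidden=1024, params="~355M",             # RoBERTa-Large-sized
    data="identical pre-training mixture and token budget",
    optimizer="same optimizer, schedule, and batch size",
    seeds=(1, 2, 3),                                    # report mean and spread per task
)

runs = [
    dict(shared, name="baseline", disentangled_attention=False, enhanced_mask_decoder=False),
    dict(shared, name="deberta",  disentangled_attention=True,  enhanced_mask_decoder=True),
]

tasks = ["MNLI", "SQuAD v2.0", "RACE", "SuperGLUE"]
# The attribution claim survives only if "deberta" beats "baseline" on these
# tasks under this matched setup.
```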
read the original abstract
Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural langauge generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transform layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, out performing the human baseline by a decent margin (90.3 versus 89.8).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DeBERTa, a new pre-trained language model architecture that augments BERT/RoBERTa with two techniques: (1) a disentangled attention mechanism in which each token is represented by separate content and position vectors and attention weights are computed via disentangled matrices on content and relative positions, and (2) an enhanced mask decoder that injects absolute position information when predicting masked tokens during pre-training. It further applies virtual adversarial training at fine-tuning time. The authors report that a DeBERTa model trained on half the data used for RoBERTa-Large outperforms it on MNLI (+0.9%), SQuAD v2.0 (+2.3%), and RACE (+3.6%), and that a 1.5-billion-parameter, 48-layer DeBERTa model achieves a SuperGLUE macro-average of 89.9, exceeding the human baseline of 89.8 (with the ensemble at 90.3).
Significance. If the performance gains are shown to stem from the disentangled attention and enhanced mask decoder rather than model scale, data volume, or unreported hyper-parameter differences, the work would offer a concrete architectural improvement in how transformers handle positional information and would mark a notable milestone by being the first single model to surpass human performance on SuperGLUE. The empirical results on standard NLU benchmarks are a strength, but their attribution to the proposed mechanisms remains the central open question.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experimental Results): The headline comparisons (MNLI 91.1%, SQuAD 90.7%, RACE 86.8%) pit a 1.5 B-parameter DeBERTa against RoBERTa-Large (355 M parameters) trained on more data; without a same-size, same-data baseline that uses standard attention and mask decoding, the reported deltas cannot be unambiguously attributed to the disentangled attention or enhanced mask decoder.
- [§4.3] §4.3 (SuperGLUE evaluation): The claim that the single 1.5 B DeBERTa model surpasses human performance (89.9 vs. 89.8) is load-bearing for the paper’s significance, yet no ablation is presented that trains an otherwise identical 1.5 B model with conventional BERT attention and mask decoder on the same pre-training mixture to test whether the architectural changes are necessary for exceeding the human baseline.
- [Methods and §3] Methods and §3 (Model Architecture): The training recipe for the 1.5 B model (exact data mixture, number of tokens, optimizer schedule, and whether VATT is used in pre-training) is not fully specified, preventing verification that the observed gains exceed what would be expected from scaling laws alone.
minor comments (3)
- [Abstract] Abstract: “natural langauge generation” contains a typo and should read “natural language generation.”
- [Abstract] The SuperGLUE leaderboard snapshot is dated “January 6, 2021”; the manuscript should clarify whether this reflects the state at submission or a later update.
- [§4] No error bars, standard deviations, or number of random seeds are reported for any benchmark score, which weakens confidence in the small margins (e.g., 89.9 vs. 89.8).
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications on our experimental controls, additional details on the training setup, and commitments to revise the manuscript accordingly. Our responses emphasize the evidence from controlled smaller-scale experiments while acknowledging computational constraints on large-scale ablations.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experimental Results): The headline comparisons (MNLI 91.1%, SQuAD 90.7%, RACE 86.8%) pit a 1.5 B-parameter DeBERTa against RoBERTa-Large (355 M parameters) trained on more data; without a same-size, same-data baseline that uses standard attention and mask decoding, the reported deltas cannot be unambiguously attributed to the disentangled attention or enhanced mask decoder.
Authors: We acknowledge the value of matched baselines at the 1.5B scale. However, §4.1 and §4.2 report controlled experiments with DeBERTa-base (comparable parameter count to RoBERTa-base) trained on the same data volume and mixture as RoBERTa, where disentangled attention and the enhanced mask decoder yield consistent gains (e.g., +1.5% MNLI, +2.5% SQuAD v2.0). These isolate the architectural contributions independent of scale. The 1.5B results build on this foundation, and we will revise the abstract and §4 to foreground the base-scale controls while noting that full 1.5B matched baselines are left for future work due to resource limits. revision: partial
-
Referee: [§4.3] §4.3 (SuperGLUE evaluation): The claim that the single 1.5 B DeBERTa model surpasses human performance (89.9 vs. 89.8) is load-bearing for the paper’s significance, yet no ablation is presented that trains an otherwise identical 1.5 B model with conventional BERT attention and mask decoder on the same pre-training mixture to test whether the architectural changes are necessary for exceeding the human baseline.
Authors: An identical 1.5B ablation with standard attention would directly test necessity for the SuperGLUE result. Unfortunately, the compute cost of training a second 1.5B model on the same mixture makes this infeasible in the current study. We instead demonstrate the mechanisms' effectiveness through base-scale ablations where DeBERTa outperforms RoBERTa equivalents under matched conditions, with gains that compound at larger scales. In revision we will add a limitations paragraph, qualify the attribution language in §4.3, and emphasize the base-model evidence supporting the architectural improvements. revision: no
-
Referee: [Methods and §3] Methods and §3 (Model Architecture): The training recipe for the 1.5 B model (exact data mixture, number of tokens, optimizer schedule, and whether VATT is used in pre-training) is not fully specified, preventing verification that the observed gains exceed what would be expected from scaling laws alone.
Authors: We will expand the methods section and add a dedicated appendix with complete pre-training details for the 1.5B model. VATT is applied exclusively at fine-tuning time and not during pre-training. The appendix will list the precise data mixture proportions, total tokens processed, AdamW optimizer settings (betas, weight decay, epsilon), learning-rate schedule with warmup steps and decay, batch size, and other hyperparameters. This will enable readers to situate the results relative to scaling-law expectations. revision: yes
requested experiments (1)
- Ablation training an otherwise identical 1.5B-parameter model using conventional BERT attention and mask decoder on the exact same pre-training mixture, to verify whether the architectural changes are required to exceed the human baseline on SuperGLUE.
Circularity Check
No circularity: DeBERTa's claims rest on empirical training results, not derivations reducing to fitted inputs or self-citations.
full rationale
The manuscript introduces disentangled attention (content/position vectors with separate matrices) and an enhanced mask decoder as architectural proposals, then reports benchmark scores from pre-training and fine-tuning runs. No equations, uniqueness theorems, or first-principles derivations appear that equate the reported gains (e.g., SuperGLUE 89.9) to quantities already present in the training data, RoBERTa baselines, or prior self-citations. The central performance claims are direct outcomes of model training and evaluation on held-out benchmarks, not statistical predictions forced by parameter fitting or renamed empirical patterns. External citations (BERT, RoBERTa, SuperGLUE) supply independent baselines rather than load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Transformer layers with self-attention can be stacked to form effective language models
invented entities (2)
-
Disentangled attention mechanism
no independent evidence
-
Enhanced mask decoder
no independent evidence
Forward citations
Cited by 27 Pith papers
-
ViLegalNLI: Natural Language Inference for Vietnamese Legal Texts
ViLegalNLI is the first 42k-pair Vietnamese legal NLI dataset built via semi-automatic LLM-assisted generation and validation.
-
RoFormer: Enhanced Transformer with Rotary Position Embedding
RoFormer introduces rotary position embeddings that encode absolute positions via rotation matrices and relative dependencies in attention, outperforming prior position methods on long text classification tasks.
-
Directed Social Regard: Surfacing Targeted Advocacy, Opposition, Aid, Harms, and Victimization in Online Media
DSR uses transformer models to detect sentiment targets in text and score them along three theory-motivated axes, with validation showing correlations to existing social science datasets.
-
RSAT: Structured Attribution Makes Small Language Models Faithful Table Reasoners
RSAT uses SFT on verified traces followed by GRPO with NLI faithfulness rewards to make 1-8B models produce verifiable table reasoning with cell citations, raising faithfulness 3.7x to 0.826.
-
RSAT: Structured Attribution Makes Small Language Models Faithful Table Reasoners
RSAT makes 1-8B language models produce faithful table reasoning by training them to output structured steps with cell citations, using SFT followed by GRPO with an NLI-based faithfulness reward.
-
Just Pass Twice: Efficient Token Classification with LLMs for Zero-Shot NER
JPT enables bidirectional token classification in causal LLMs for zero-shot NER via input concatenation plus definition-guided embeddings, delivering +7.9 F1 gains and over 20x speedup on benchmarks.
-
The Indra Representation Hypothesis for Multimodal Alignment
Unimodal model representations converge to a relational structure captured by the Indra representation via V-enriched Yoneda embedding, which is unique and structure-preserving and improves cross-model and cross-modal...
-
Context-Aware Spear Phishing: Generative AI-Enabled Attacks Against Individuals via Public Social Media Data
Generative AI enables scalable, context-aware spear phishing by extracting profiles from public social media, producing emails that outperform real-world phishing samples in personalization and lower recipient suspicion.
-
An Information-theoretic Propagation Denoising and Fusion Framework for Fake News Detection
InfoPDF uses mutual information to suppress noise in LLM-generated synthetic propagation graphs and adaptively fuse them with real data, yielding more discriminative representations for fake news detection.
-
TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning
TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
-
ADE: Adaptive Dictionary Embeddings -- Scaling Multi-Anchor Representations to Large Language Models
ADE scales multi-anchor word representations to transformers via Vocabulary Projection, Grouped Positional Encoding, and context-aware reweighting, achieving 98.7% fewer trainable parameters than DeBERTa-v3-base while...
-
EPM-RL: Reinforcement Learning for On-Premise Product Mapping in E-Commerce
EPM-RL uses PEFT followed by RL with agent-based rewards from judge models to create a trainable in-house product mapping model that improves on fine-tuning alone and beats API baselines in quality-cost while enabling...
-
Beyond Importance Sampling: Rejection-Gated Policy Optimization
RGPO replaces importance sampling with a smooth [0,1] acceptance gate in policy gradients, unifying TRPO/PPO/REINFORCE, bounding variance for heavy-tailed ratios, and showing gains in online RLHF experiments.
-
RouterWise: Joint Resource Allocation and Routing for Latency-Aware Multi-Model LLM Serving
Joint resource allocation and routing for multi-model LLM serving can produce up to 87% variation in achievable output quality across setups on the same GPU cluster.
-
Entities as Retrieval Signals: A Systematic Study of Coverage, Supervision, and Evaluation in Entity-Oriented Ranking
Entity signals cover only 19.7% of relevant documents on Robust04 and no configuration among 443 systems improves MAP by more than 0.05 in open-world evaluation, despite gains when entities are pre-restricted.
-
Million Tutoring Moves (MTM): An Open Multimodal Dataset for the Science of Tutoring
MTM v1 releases 4,654 open math tutoring transcripts as the first step toward a large-scale multimodal repository for studying and improving tutoring.
-
Overconfidence and Calibration in Medical VQA: Empirical Findings and Hallucination-Aware Mitigation
Empirical study finds overconfidence persists in medical VLMs despite scaling and prompting; post-hoc calibration reduces error while hallucination-aware calibration improves both calibration and AUROC.
-
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.
-
Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation
Semantic entropy improves uncertainty estimation in natural language generation by incorporating semantic equivalences, outperforming standard entropy baselines on predicting model accuracy for question answering.
-
Ethical and social risks of harm from Language Models
The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job...
-
Revisiting Semantic Role Labeling: Efficient Structured Inference with Dependency-Informed Analysis
A new encoder-based SRL system with dependency-informed analysis delivers 10x faster inference and comparable or better F1 scores using BERT, RoBERTa, and DeBERTa while supporting multilingual projection.
-
MIPIAD: Multilingual Indirect Prompt Injection Attack Defense with Qwen -- TF-IDF Hybrid and Meta-Ensemble Learning
MIPIAD reports a hybrid Qwen-TF-IDF ensemble defense that reaches F1 0.9205 and reduces the English-Bangla performance gap on a 1.43-million-sample synthetic benchmark derived from BIPIA templates.
-
BiMind: A Dual-Head Reasoning Model with Attention-Geometry Adapter for Incorrect Information Detection
BiMind outperforms existing methods in incorrect information detection by disentangling content and knowledge reasoning with attention geometry adaptation and self-retrieval.
-
Attribution-Driven Explainable Intrusion Detection with Encoder-Based Large Language Models
Encoder-based LLMs detect SDN intrusions with decisions driven by meaningful traffic behaviors, as validated by attribution analysis aligning with established intrusion principles.
-
LLMs Struggle with Abstract Meaning Comprehension More Than Expected
LLMs struggle with abstract meaning comprehension on SemEval-2021 Task 4 more than fine-tuned models, and a new bidirectional attention classifier yields small accuracy gains of 3-4%.
-
Predicting User Satisfaction in Online Education Platforms: A Large Language Model Based Multi-Modal Review Mining Framework
An LLM multi-modal system integrates topic modeling, transformer sentiment, and behavioral features to predict MOOC learner satisfaction more accurately than single-modality baselines.
-
Large Language Models: A Survey
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
Reference graph
Works this paper leans on
- [1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- [2] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
- [3] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
- [4] Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. SemEval-2017 task 1: Semantic textual similarity - multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055, 2017.
- [5] Kezhen Chen, Qiuyuan Huang, Hamid Palangi, Paul Smolensky, Kenneth D. Forbus, and Jianfeng Gao. Natural- to formal-language generation using tensor product representations. arXiv preprint arXiv:1910.02339, 2019.
- [6] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
- [7] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of NAACL-HLT 2019.
- [8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, Volume 1 (Long and Short Papers), pp. 4171-4186.
- [9] William B. Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), 2005.
- [10] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961, 2021.
- [11] Pengcheng He, Xiaodong Liu, Weizhu Chen, and Jianfeng Gao. A hybrid neural network model for commonsense reasoning. arXiv preprint arXiv:1907.11983, 2019. Pengcheng He, Yi Mao, Kaushik Chakrabarti, and Weizhu Chen. X-SQL: Reinforce schema representation with context. arXiv preprint arXiv:1908.08113, 2019.
- [12] Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64-77, 2020.
- [13] Kamal Raj Kanakarajan, Bhuvana Kundumani, and Malaikannan Sankarasubbu. Small-Bench NLP: Benchmark for small single GPU trained models in natural language processing. arXiv preprint arXiv:2109.10847, 2021.
- [14] Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of NAACL-HLT 2018, Volume 1 (Long Papers), pp. 252-262.
- [15] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [16] Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale reading comprehension dataset from examinations. In Proceedings of EMNLP 2017, pp. 785-794.
- [17] Hector J. Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, volume 46, pp. 47, 2011.
- [18] Xiaodong Liu, Hao Cheng, Pengcheng He, Weizhu Chen, Yu Wang, Hoifung Poon, and Jianfeng Gao. Adversarial training for large neural language models. arXiv preprint, 2020. Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. In International Conference on Learning Representations, 2019. Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. In Proceedings of ACL 2019.
- [19] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019. Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in Adam. 2018.
- [20] Shervin Minaee, Nal Kalchbrenner, Erik Cambria, Narjes Nikzad, Meysam Chenaghlu, and Jianfeng Gao. Deep learning based text classification: A comprehensive review. arXiv preprint arXiv:2004.03705, 2020.
- [21] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1979-1993, 2018.
- [22] Mohammad Taher Pilehvar and Jose Camacho-Collados. WiC: The word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of NAACL-HLT 2019, Volume 1 (Long and Short Papers), pp. 1267-1273.
- [23] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP 2016.
- [24] Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series.
- [25] Imanol Schlag, Paul Smolensky, Roland Fernandez, Nebojsa Jojic, Jürgen Schmidhuber, and Jianfeng Gao. Enhancing the transformer with explicit relational encoding for math problem solving. arXiv preprint arXiv:1910.06611, 2019.
- [26] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In Proceedings of NAACL-HLT 2018, Volume 2 (Short Papers), pp. 464-468.
- [27] Tao Shen, Yi Mao, Pengcheng He, Guodong Long, Adam Trischler, and Weizhu Chen. Exploiting structured knowledge in text via graph-guided representation learning. arXiv preprint arXiv:2004.14224, 2020.
- [28] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.
- [29] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP 2013, pp. 1631-1642.
- [30] Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. ERNIE: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223, 2019.
- [31] Trieu H. Trinh and Quoc V. Le. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847, 2018.
- [32] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, pp. 3266-3280, 2019. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019.
- [33] Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL-HLT 2018, Volume 1 (Long Papers), pp. 1112-1122.
- [34] Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of EMNLP 2018, pp. 93-104.
- [35] Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint arXiv:1810.12885, 2018.
discussion (0)