pith. machine review for the scientific record.

arxiv: 2111.09543 · v4 · submitted 2021-11-18 · 💻 cs.CL · cs.LG

Recognition: 2 theorem links · Lean Theorem

DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 12:43 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords DeBERTa · ELECTRA · replaced token detection · embedding sharing · pre-training · GLUE benchmark · natural language understanding · gradient disentanglement

The pith

DeBERTaV3 replaces masked language modeling with replaced token detection and introduces gradient-disentangled embedding sharing to raise accuracy on natural language understanding benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard embedding sharing between generator and discriminator in ELECTRA-style training creates opposing gradient forces on the same token representations. This conflict reduces training efficiency. By decoupling the gradients while still sharing the embedding parameters, the new method removes the conflict and lets the pre-trained model learn higher-quality representations. When the resulting DeBERTaV3 Large model is evaluated on the GLUE benchmark it records an average score of 91.37 percent, exceeding the original DeBERTa and ELECTRA baselines. The same change produces even larger relative gains when applied to a multilingual variant on the XNLI zero-shot task.

Core claim

DeBERTaV3 adopts the replaced token detection objective in place of masked language modeling and pairs it with gradient-disentangled embedding sharing so that the generator and discriminator losses no longer tug token embeddings in opposite directions; this change produces a pre-trained model whose downstream performance on GLUE reaches 91.37 percent average and on XNLI reaches 79.8 percent zero-shot accuracy.
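
A minimal sketch of what a replaced token detection objective typically looks like in an ELECTRA-style setup, written as a PyTorch-style function. This is illustrative rather than the authors' implementation: the generator and discriminator callables, the 15 percent corruption rate, and the 50x weight on the detection term are assumptions standing in for whatever the paper actually uses.

    import torch
    import torch.nn.functional as F

    def rtd_step(generator, discriminator, input_ids, mask_token_id, mask_prob=0.15):
        # Pick positions to corrupt and let a small generator solve MLM on them.
        mask = torch.rand(input_ids.shape, device=input_ids.device) < mask_prob
        gen_logits = generator(input_ids.masked_fill(mask, mask_token_id))  # [B, T, vocab]
        mlm_loss = F.cross_entropy(gen_logits[mask], input_ids[mask])

        # Sample replacements from the generator to build the corrupted sequence.
        with torch.no_grad():
            sampled = torch.distributions.Categorical(logits=gen_logits[mask]).sample()
        corrupted = input_ids.clone()
        corrupted[mask] = sampled
        is_replaced = (corrupted != input_ids).float()  # per-token binary labels

        # Discriminator predicts replaced-vs-original at every position.
        disc_logits = discriminator(corrupted)  # [B, T]
        rtd_loss = F.binary_cross_entropy_with_logits(disc_logits, is_replaced)

        # ELECTRA-style training up-weights the detection term; 50.0 is illustrative.
        return mlm_loss + 50.0 * rtd_loss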

What carries the argument

Gradient-disentangled embedding sharing, which keeps a single embedding matrix but routes the generator and discriminator gradients through separate paths before they update that matrix.
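
A minimal sketch of how that routing can be written, assuming a PyTorch-style embedding module. The class and argument names here (GDESEmbedding, for_discriminator) are ours for illustration; the released DeBERTa code may organize the same idea differently.

    import torch
    import torch.nn as nn

    class GDESEmbedding(nn.Module):
        """Gradient-disentangled embedding sharing, sketched: the generator's MLM
        loss updates the shared table E_G, while the discriminator reads
        stop_gradient(E_G) + E_delta and can only push gradients into E_delta."""

        def __init__(self, vocab_size, hidden_size):
            super().__init__()
            self.shared = nn.Embedding(vocab_size, hidden_size)  # E_G, trained by the generator
            self.delta = nn.Embedding(vocab_size, hidden_size)   # E_delta, trained by the discriminator
            nn.init.zeros_(self.delta.weight)                    # starts out identical to plain sharing

        def forward(self, input_ids, for_discriminator=False):
            if not for_discriminator:
                return self.shared(input_ids)  # generator path: gradients reach E_G
            # Discriminator path: detach E_G so the detection loss cannot drag the
            # shared table in the opposite direction of the MLM loss.
            return self.shared(input_ids).detach() + self.delta(input_ids)

        def discriminator_table(self):
            # After pre-training, the discriminator's effective embedding is E_G + E_delta.
            return self.shared.weight.detach() + self.delta.weight

Under plain embedding sharing the detach would be absent and both losses would update the same matrix directly; under no sharing the discriminator would own a separate table. GDES keeps the single table while removing the conflicting update.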

If this is right

  • Replaced token detection becomes compatible with shared-embedding architectures without the efficiency penalty previously observed.
  • DeBERTaV3 Large sets a new state-of-the-art average score among models of comparable size on the eight GLUE tasks.
  • The multilingual mDeBERTa Base model improves zero-shot cross-lingual accuracy on XNLI by 3.6 points over XLM-R Base.
  • Pre-trained models released under this scheme can be directly substituted into existing pipelines to raise accuracy on classification, entailment, and similarity tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same gradient-separation trick could be tested on other dual-objective pre-training setups such as those combining denoising and contrastive losses.
  • If the method scales to larger models it would reduce the compute needed to reach a given accuracy target.
  • One could measure whether the disentangled embeddings also improve sample efficiency during fine-tuning on low-resource languages.

Load-bearing premise

The measured improvements come from the gradient disentanglement itself rather than from any unstated differences in training data, schedule, or hyperparameters relative to the cited baselines.

What would settle it

Retrain the original DeBERTa and ELECTRA models using exactly the same data, batch sizes, and optimizer settings as DeBERTaV3 and check whether the 1.37-point GLUE gap disappears.
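
For concreteness, a hypothetical sketch of that controlled comparison as a small ablation grid in which everything is pinned except the objective and the embedding-sharing mode. The field names and values are placeholders, not the DeBERTa codebase's actual configuration schema or hyper-parameters.

    # Hypothetical ablation grid: every run copies BASE exactly; only the
    # objective and the sharing mode vary. All values are placeholders.
    BASE = dict(
        corpus="same pre-training data as DeBERTa",
        seed=0,
        batch_size=2048,
        steps=125_000,
        optimizer="adam",
        learning_rate=1e-4,
    )

    RUNS = {
        "DeBERTa (MLM)":    {**BASE, "objective": "mlm"},
        "RTD + vanilla ES": {**BASE, "objective": "rtd", "embedding_sharing": "shared"},
        "RTD + no sharing": {**BASE, "objective": "rtd", "embedding_sharing": "none"},
        "RTD + GDES":       {**BASE, "objective": "rtd", "embedding_sharing": "gdes"},
    }

    for name, config in RUNS.items():
        # Any GLUE gap that survives this grid is attributable to the objective and
        # sharing mechanism rather than to data, schedule, or optimizer choices.
        print(name, config)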

Original abstract

This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model by replacing mask language modeling (MLM) with replaced token detection (RTD), a more sample-efficient pre-training task. Our analysis shows that vanilla embedding sharing in ELECTRA hurts training efficiency and model performance. This is because the training losses of the discriminator and the generator pull token embeddings in different directions, creating the "tug-of-war" dynamics. We thus propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics, improving both training efficiency and the quality of the pre-trained model. We have pre-trained DeBERTaV3 using the same settings as DeBERTa to demonstrate its exceptional performance on a wide range of downstream natural language understanding (NLU) tasks. Taking the GLUE benchmark with eight tasks as an example, the DeBERTaV3 Large model achieves a 91.37% average score, which is 1.37% over DeBERTa and 1.91% over ELECTRA, setting a new state-of-the-art (SOTA) among the models with a similar structure. Furthermore, we have pre-trained a multi-lingual model mDeBERTa and observed a larger improvement over strong baselines compared to English models. For example, the mDeBERTa Base achieves a 79.8% zero-shot cross-lingual accuracy on XNLI and a 3.6% improvement over XLM-R Base, creating a new SOTA on this benchmark. We have made our pre-trained models and inference code publicly available at https://github.com/microsoft/DeBERTa.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces DeBERTaV3, which replaces masked language modeling in DeBERTa with replaced token detection (RTD) from ELECTRA and adds gradient-disentangled embedding sharing to avoid tug-of-war dynamics between generator and discriminator losses on shared embeddings. Models are pre-trained under the same settings as DeBERTa; the Large variant reports 91.37% average GLUE score (1.37% above DeBERTa, 1.91% above ELECTRA) and sets new SOTA among comparable models, while mDeBERTa Base reaches 79.8% zero-shot XNLI accuracy (3.6% above XLM-R Base). Pre-trained models and inference code are released publicly.

Significance. If the reported gains are attributable to the proposed changes, the work offers a concrete, practical improvement to ELECTRA-style pre-training by resolving a specific inefficiency in embedding updates, supported by public models that enable direct verification and extension. The tug-of-war analysis provides useful diagnostic insight into discriminator-generator interactions.

major comments (1)
  1. [Table 2 and §4] The central performance claim (Table 2, GLUE results) attributes the 1.37% gain over DeBERTa and new SOTA status to RTD plus gradient-disentangled sharing, yet the manuscript only states that DeBERTaV3 was trained “using the same settings as DeBERTa” without providing verification of identical data order, random seeds, or optimizer trajectories, nor an ablation that differs solely in the embedding-sharing mechanism. ELECTRA baselines are taken from prior publications using a separate codebase. This leaves open the possibility that uncontrolled factors contribute to the observed deltas.
minor comments (2)
  1. [Abstract] The abstract reports the 91.37% GLUE score but does not restate the exact DeBERTa and ELECTRA baseline numbers cited in the text; adding them would improve immediate readability.
  2. [Figure 1] Figure 1 (tug-of-war illustration) would benefit from explicit gradient arrows or a small equation showing the disentanglement operation to make the mechanism clearer to readers unfamiliar with the ELECTRA setup.
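
For concreteness, one plausible form of the small equation that comment asks for, in our notation rather than the paper's:

    % Sketch (notation ours): the discriminator reads a stop-gradient copy of the
    % shared table E_G plus a residual E_Delta, so the RTD loss never back-propagates
    % into E_G and the MLM loss never touches E_Delta.
    \[
      E_D = \operatorname{sg}(E_G) + E_\Delta, \qquad
      \frac{\partial \mathcal{L}_{\mathrm{RTD}}}{\partial E_G} = 0, \qquad
      \frac{\partial \mathcal{L}_{\mathrm{MLM}}}{\partial E_\Delta} = 0 .
    \]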

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment regarding experimental controls and reproducibility below.

point-by-point responses
  1. Referee: [Table 2 and §4] The central performance claim (Table 2, GLUE results) attributes the 1.37% gain over DeBERTa and new SOTA status to RTD plus gradient-disentangled sharing, yet the manuscript only states that DeBERTaV3 was trained “using the same settings as DeBERTa” without providing verification of identical data order, random seeds, or optimizer trajectories, nor an ablation that differs solely in the embedding-sharing mechanism. ELECTRA baselines are taken from prior publications using a separate codebase. This leaves open the possibility that uncontrolled factors contribute to the observed deltas.

    Authors: We appreciate the referee's emphasis on rigorous experimental controls. The DeBERTaV3 models were pre-trained using the identical training data, batch size, learning rate schedule, number of steps, and optimizer settings as the original DeBERTa (as stated in Section 4), with our implementation extending the publicly released DeBERTa codebase. This ensures that the primary differences are the RTD objective and the gradient-disentangled embedding sharing. We acknowledge that the manuscript does not explicitly verify or report the exact random seeds and data shuffling order from the original DeBERTa runs. To directly address this, we will revise the paper to include a detailed training configuration appendix and add an ablation experiment that trains a variant using standard (non-disentangled) embedding sharing under identical random seeds, data order, and optimizer trajectory as DeBERTaV3. This isolates the contribution of the proposed sharing mechanism. For the ELECTRA baselines, we cite the published results from Clark et al. (2020) following standard practice in the field; our primary claims focus on improvements over DeBERTa using the shared codebase. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results are measured on held-out tasks

full rationale

The paper's central claims consist of measured GLUE and XNLI accuracies obtained after pre-training DeBERTaV3 with the proposed gradient-disentangled embedding sharing. These downstream scores are independent of any fitted parameters or self-citations by construction; they are evaluated on standard held-out benchmarks using the same settings as prior DeBERTa runs. The method description introduces the tug-of-war analysis and the new sharing technique without redefining inputs in terms of outputs or smuggling an ansatz via self-citation. No derivation step reduces to its own inputs, and the performance deltas are reported as experimental outcomes rather than predictions forced by the model equations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical superiority of the new pre-training recipe; no new mathematical axioms or invented physical entities are introduced. The only free parameters are the usual training hyper-parameters (learning rate, batch size, etc.) that are not enumerated in the abstract.

pith-pipeline@v0.9.0 · 5620 in / 1110 out tokens · 24469 ms · 2026-05-15T12:43:34.120976+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.DAlembert.Inevitability bilinear_family_forced · tag: unclear

    Relation between the paper passage and the cited Recognition theorem.

    Paper passage: "We have pre-trained DeBERTaV3 using the same settings as DeBERTa to demonstrate its exceptional performance on a wide range of downstream natural language understanding (NLU) tasks."

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. BOOKMARKS: Efficient Active Storyline Memory for Role-playing

    cs.CL 2026-05 unverdicted novelty 7.0

    BOOKMARKS introduces searchable bookmarks as reusable answers to storyline questions, enabling active initialization and passive synchronization for more consistent role-playing agent memory than recurrent summarization.

  2. Proactive Instance Navigation with Comparative Judgment for Ambiguous User Queries

    cs.AI 2026-05 unverdicted novelty 7.0

    ProCompNav improves success rate and shortens user responses in ambiguous instance navigation by using comparative binary questions that prune a candidate pool rather than requesting detailed descriptions.

  3. Proactive Instance Navigation with Comparative Judgment for Ambiguous User Queries

    cs.AI 2026-05 unverdicted novelty 7.0

    ProCompNav disambiguates ambiguous instance navigation queries via candidate-pool construction followed by attribute-based comparative binary questions that prune distractors, yielding higher success rates and shorter...

  4. Fine-Grained Perspectives: Modeling Explanations with Annotator-Specific Rationales

    cs.CL 2026-04 unverdicted novelty 7.0

    A framework jointly models annotator-specific NLI labels and explanations using conditioned representations and two explainer architectures, improving predictive performance over baselines.

  5. RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration

    cs.CL 2026-04 unverdicted novelty 7.0

    RAGognizer adds a detection head to LLMs for joint training on generation and token-level hallucination detection, yielding SOTA detection and fewer hallucinations in RAG while preserving output quality.

  6. Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

    q-bio.QM 2026-04 unverdicted novelty 7.0

    Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...

  7. Effective Performance Measurement: Challenges and Opportunities in KPI Extraction from Earnings Calls

    cs.CL 2026-05 unverdicted novelty 6.0

    Encoder models trained on SEC filings struggle with earnings calls due to domain shift, while LLMs enable open-ended KPI extraction with 79.7% human-verified precision on newly introduced benchmarks.

  8. TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning

    cs.CR 2026-04 unverdicted novelty 6.0

    TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.

  9. ADE: Adaptive Dictionary Embeddings -- Scaling Multi-Anchor Representations to Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    ADE scales multi-anchor word representations to transformers via Vocabulary Projection, Grouped Positional Encoding, and context-aware reweighting, achieving 98.7% fewer trainable parameters than DeBERTa-v3-base while...

  10. Mind the Unseen Mass: Unmasking LLM Hallucinations via Soft-Hybrid Alphabet Estimation

    cs.CL 2026-04 unverdicted novelty 6.0

    SHADE adaptively combines coverage and spectral signals to estimate semantic alphabet size from few LLM samples, yielding better performance than baselines in low-sample regimes for alphabet estimation and QA error detection.

  11. AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning

    cs.CL 2023-03 unverdicted novelty 6.0

    AdaLoRA uses SVD-based pruning to allocate the parameter budget for low-rank fine-tuning updates according to per-matrix importance scores, yielding better performance than uniform allocation especially under tight budgets.

  12. Feature-Augmented Transformers for Robust AI-Text Detection Across Domains and Generators

    cs.CL 2026-05 conditional novelty 5.0

    Feature-augmented DeBERTa-v3-base with attention-based fusion reaches 85.9% balanced accuracy on the multi-domain M4 benchmark under fixed-threshold evaluation, outperforming zero-shot baselines by up to 7.22 points.

  13. SHIELD: A Diverse Clinical Note Dataset and Distilled Small Language Models for Enterprise-Scale De-identification

    cs.CL 2026-05 conditional novelty 5.0

    SHIELD dataset and distilled DeBERTa v3 model achieve 0.88 micro precision and 0.86 recall on PHI de-identification while matching teacher performance on structured categories.

  14. Optimized Deferral for Imbalanced Settings

    cs.LG 2026-04 unverdicted novelty 5.0

    MILD reformulates two-stage learning to defer as cost-sensitive learning over the input-expert domain and derives new margin-based losses with guarantees, yielding better performance than baselines on image classifica...

  15. ZSG-IAD: A Multimodal Framework for Zero-Shot Grounded Industrial Anomaly Detection

    cs.CV 2026-04 unverdicted novelty 5.0

    ZSG-IAD is a zero-shot multimodal system that uses language-guided two-hop grounding and rule-based reinforcement learning to produce anomaly masks and explainable reports from industrial sensor data.

  16. AGSC: Adaptive Granularity and Semantic Clustering for Uncertainty Quantification in Long-text Generation

    cs.CL 2026-04 unverdicted novelty 5.0

    AGSC combines NLI neutral probabilities for adaptive granularity with GMM semantic clustering to improve uncertainty quantification in long-text LLM generation, claiming SOTA factuality correlation and 60% faster inference.

  17. A Cascaded Generative Approach for e-Commerce Recommendations

    cs.AI 2026-05 unverdicted novelty 4.0

    A cascaded generative system for e-commerce recommendations using theme and keyword generation with teacher-student fine-tuning achieves a 2.7% lift in cart adds per page view.

  18. Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

    cs.AI 2026-05 unverdicted novelty 4.0

    Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.

  19. MKJ at SemEval-2026 Task 9: A Comparative Study of Generalist, Specialist, and Ensemble Strategies for Multilingual Polarization

    cs.CL 2026-04 unverdicted novelty 4.0

    A language-adaptive combination of generalist, specialist, and ensemble transformer models achieves 0.796 macro F1 and 0.826 accuracy on multilingual polarization detection across 22 languages.

  20. Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

    cs.AI 2026-05 unverdicted novelty 3.0

    Safactory combines parallel simulation, trustworthy data management, and asynchronous evolution platforms into a single pipeline claimed to be the first unified framework for trustworthy autonomous agents.

  21. YEZE at SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization via Heterogeneous Ensembling

    cs.CL 2026-05 unverdicted novelty 2.0

    Independent task modeling with class weighting outperforms multi-task learning and translation augmentation in a multilingual model ensemble for SemEval-2026 Task 9 polarization detection.

  22. YEZE at SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization via Heterogeneous Ensembling

    cs.CL 2026-05 unverdicted novelty 2.0

    A heterogeneous ensemble of XLM-RoBERTa-large and mDeBERTa-v3-base with independent task modeling and class weighting is reported as effective for multilingual, multicultural, and multievent online polarization detection.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 19 Pith papers · 8 internal anchors

  1. [1]

    Language Models are Few-Shot Learners

    Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165,

  2. [2]

    SemEval-2017 Task 1: Semantic Textual Similarity - Multilingual and Cross-lingual Focused Evaluation

    Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055,

  3. [3]

    Xlm-e: Cross-lingual language model pre-training via electra

    Zewen Chi, Shaohan Huang, Li Dong, Shuming Ma, Saksham Singhal, Payal Bajaj, Xia Song, and Furu Wei. Xlm-e: Cross-lingual language model pre-training via electra. arXiv preprint arXiv:2106.16138,

  4. [4]

    Xnli: Evaluating cross-lingual sentence representations

    Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. Xnli: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2475–2485,

  5. [5]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186,

  6. [6]

    Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961,

  7. [7]

    CodeBERT: A Pre-Trained Model for Programming and Natural Languages

    Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. Codebert: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155,

  8. [8]

    Embracing Change: Continual Learning in Deep Neural Networks

    Association for Computational Linguistics. URL https://www.aclweb.org/anthology/W07-1401. Raia Hadsell, Dushyant Rao, Andrei A Rusu, and Razvan Pascanu. Embracing change: Continual learning in deep neural networks. Trends in Cognitive Sciences, 24(12):1028–1040,

  9. [9]

    Tinybert: Distilling bert for natural language understanding

    Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. Tinybert: Distilling bert for natural language understanding. arXiv preprint arXiv:1909.10351,

  10. [10]

    Small-bench nlp: Benchmark for small single gpu trained models in natural language processing

    Kamal Raj Kanakarajan, Bhuvana Kundumani, and Malaikannan Sankarasubbu. Small-bench nlp: Benchmark for small single gpu trained models in natural language processing. ArXiv, abs/2109.10847,

  11. [11]

    Adam: A Method for Stochastic Optimization

    Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,

  12. [12]

    Race: Large-scale reading comprehension dataset from examinations

    Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. Race: Large-scale reading comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 785–794,

  13. [13]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692,

  14. [14]

    Coco-lm: Correcting and contrasting text sequences for language model pretraining

    Yu Meng, Chenyan Xiong, Payal Bajaj, Saurabh Tiwary, Paul Bennett, Jiawei Han, and Xia Song. Coco-lm: Correcting and contrasting text sequences for language model pretraining. arXiv preprint arXiv:2102.08473,

  15. [15]

    SQuAD: 100,000+ Questions for Machine Comprehension of Text

    URL http://jmlr.org/papers/v21/20-074.html. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, November

  16. [16]

    Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition

    Erik Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142–147,

  17. [17]

    Self-Attention with Relative Position Representations

    Association for Computational Linguistics. doi: 10.18653/v1/P16-1162. URL https://aclanthology.org/P16-1162. Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies...

  18. [18]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using gpu model parallelism. arXiv preprint arXiv:1909.08053,

  19. [19]

    Recursive deep models for semantic compositionality over a sentiment treebank

    Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642,

  20. [20]

    Superglue: A stickier benchmark for general-purpose language understanding systems

    Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. In Advances in neural information processing systems, pp. 3266–3280, 2019a. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samue...

  21. [21]

    A broad-coverage challenge corpus for sentence understanding through inference

    Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technolo- gies, Volume 1 (Long Papers), pp. 1112–1122. Association for Computational Linguistics,

  22. [22]

    mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

    URL http://aclweb.org/anthology/N18-1101. Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies...

  23. [23]

    Swag: A large-scale adversarial dataset for grounded commonsense inference

    Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. Swag: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 93–104,

  24. [24]

    ReCoRD: Bridging the Gap between Human and Machine Commonsense Reading Comprehension

    Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint 1810.12885,

  25. [25]

    Appendix A.1 (Dataset), Table 7: Summary information of the NLP application benchmarks. Columns: Corpus, Task, #Train, #Dev, #Test, #Label, Metrics. GLUE rows include CoLA (Acceptability, 8.5k/1k/1k, 2 labels, Matthews corr), SST (Sentiment, 67k/872/1.8k, 2 labels, Accuracy), MNLI (NLI, 393k/20k/20k, 3 labels, Accuracy), RTE (NLI, 2...

  26. [26]

    The training set has been machine-translated to the remaining 14 languages, providing synthetic training data for these languages as well

    comes with ground truth dev and test sets in 15 languages, and a ground-truth English training set which is same as MNLI training set. The training set has been machine-translated to the remaining 14 languages, providing synthetic training data for these languages as well. A.2 Pre-training dataset: For DeBERTaV3 pre-training, we use same data as RoBERTa a...

  27. [27]

    Table 9: Fine-tuning results on MNLI and SQuAD v2.0 tasks of base ELECTRA models trained with different embedding sharing methods (MNLI-m/mm accuracy; SQuAD v2.0 F1/EM). ELECTRA_base as reported: 85.8/- and -/-; reimplemented with ES: 87.9/87.4 and 85.0/82.3; NES: 86.3/85.6 and 81.7/78.9; GDES: 88.3/87.8 and 85.9/83.1. A.4 Implementation details: Our pre-training almost follo...

  28. [28]

    For fine-tuning, we use Adam (Kingma & Ba,

    as the optimizer with weight decay (Loshchilov & Hutter, 2018). For fine-tuning, we use Adam (Kingma & Ba,

  29. [29]

    Our code is implemented based on DeBERTa (He et al., 2020)5 and ELECTRA (Clark et al., 2020)6

    The model selection is based on the performance on the task-specific development sets. Our code is implemented based on DeBERTa (He et al., 2020) and ELECTRA (Clark et al., 2020): https://github.com/microsoft/DeBERTa and https://github.com/google-research/electra. Table 10: Hyper-parameters for pre-training De...