Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding

Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding , author= · 2019 · cs.CL · arXiv 1904.09482

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

open full Pith review browse 5 citing papers arXiv PDF

abstract

This paper explores the use of knowledge distillation to improve a Multi-Task Deep Neural Network (MT-DNN) (Liu et al., 2019) for learning text representations across multiple natural language understanding tasks. Although ensemble learning can improve model performance, serving an ensemble of large DNNs such as MT-DNN can be prohibitively expensive. Here we apply the knowledge distillation method (Hinton et al., 2015) in the multi-task learning setting. For each task, we train an ensemble of different MT-DNNs (teacher) that outperforms any single model, and then train a single MT-DNN (student) via multi-task learning to \emph{distill} knowledge from these ensemble teachers. We show that the distilled MT-DNN significantly outperforms the original MT-DNN on 7 out of 9 GLUE tasks, pushing the GLUE benchmark (single model) to 83.7\% (1.5\% absolute improvement\footnote{ Based on the GLUE leaderboard at https://gluebenchmark.com/leaderboard as of April 1, 2019.}). The code and pre-trained models will be made publicly available at https://github.com/namisan/mt-dnn.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Language Models are Few-Shot Learners

cs.CL · 2020-05-28 · accept · novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

OPT: Open Pre-trained Transformer Language Models

cs.CL · 2022-05-02 · unverdicted · novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

cs.CL · 2019-05-02 · accept · novelty 6.0

SuperGLUE is a new benchmark with more difficult language understanding tasks, a toolkit, and leaderboard to drive further progress beyond GLUE.

A Single Rewrite Suffices: Empirical Lessons from Production Skill Description Optimization

cs.CL · 2026-06-29 · unverdicted · novelty 5.0

A single LLM rewrite of skill descriptions using false positive and negative cases matches manual optimization performance in production, with most other pipeline components adding little value.

RoBERTa: A Robustly Optimized BERT Pretraining Approach

cs.CL · 2019-07-26 · accept · novelty 5.0

With better hyperparameters, more data, and longer training, an unchanged BERT-Large architecture matches or exceeds XLNet and other successors on GLUE, SQuAD, and RACE.

citing papers explorer

Showing 5 of 5 citing papers.

Language Models are Few-Shot Learners cs.CL · 2020-05-28 · accept · none · ref 39
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
OPT: Open Pre-trained Transformer Language Models cs.CL · 2022-05-02 · unverdicted · none · ref 71
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems cs.CL · 2019-05-02 · accept · none · ref 119 · internal anchor
SuperGLUE is a new benchmark with more difficult language understanding tasks, a toolkit, and leaderboard to drive further progress beyond GLUE.
A Single Rewrite Suffices: Empirical Lessons from Production Skill Description Optimization cs.CL · 2026-06-29 · unverdicted · none · ref 75 · internal anchor
A single LLM rewrite of skill descriptions using false positive and negative cases matches manual optimization performance in production, with most other pipeline components adding little value.
RoBERTa: A Robustly Optimized BERT Pretraining Approach cs.CL · 2019-07-26 · accept · none · ref 25
With better hyperparameters, more data, and longer training, an unchanged BERT-Large architecture matches or exceeds XLNet and other successors on GLUE, SQuAD, and RACE.

Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer