Do Deep Nets Really Need to be Deep?

Lei Jimmy Ba; Rich Caruana

arxiv: 1312.6184 · v7 · pith:2TD26MENnew · submitted 2013-12-21 · 💻 cs.LG · cs.NE

keywords deep · nets · shallow · neural · complex · currently · feed-forward · functions

Currently, deep neural networks are the state of the art on problems such as speech recognition and computer vision. In this extended abstract, we show that shallow feed-forward networks can learn the complex functions previously learned by deep nets and achieve accuracies previously only achievable with deep models. Moreover, in some cases the shallow neural nets can learn these deep functions using a total number of parameters similar to the original deep model. We evaluate our method on the TIMIT phoneme recognition task and are able to train shallow fully-connected nets that perform similarly to complex, well-engineered, deep convolutional architectures. Our success in training shallow neural nets to mimic deeper models suggests that there probably exist better algorithms for training shallow feed-forward nets than those currently available.
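The abstract does not spell out the objective used to make a shallow net mimic a deep one. As a minimal sketch, assuming the student is trained by regressing its pre-softmax outputs (logits) onto those of the deep teacher with a squared-error loss, the quantity to minimize could look like the following. The function name `mimic_loss` and the toy values are hypothetical and chosen only for illustration.

```python
def mimic_loss(student_logits, teacher_logits):
    """Average over examples of half the squared L2 distance between logit vectors.

    Assumption (not stated in the abstract): the student is fit to the
    teacher's logits rather than to the hard class labels.
    """
    if len(student_logits) != len(teacher_logits):
        raise ValueError("student and teacher must cover the same examples")
    total = 0.0
    for s, z in zip(student_logits, teacher_logits):
        # Half the squared distance between the two logit vectors for one example.
        total += 0.5 * sum((si - zi) ** 2 for si, zi in zip(s, z))
    return total / len(student_logits)


# Toy usage: two examples, three classes each (values are made up).
teacher = [[2.0, -1.0, 0.5], [0.0, 3.0, -2.0]]
student = [[1.5, -0.5, 0.5], [0.0, 2.0, -2.0]]
loss = mimic_loss(student, teacher)
```

Under this assumption the teacher's logits act as soft regression targets, so the student can be fit on any inputs the teacher can score, labeled or not.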

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

On the Generalization of Knowledge Distillation: An Information-Theoretic View
cs.IT 2026-05 unverdicted novelty 7.0

Knowledge distillation generalization bounds are derived via a new distillation divergence measuring teacher-student kernel difference, with tighter bounds from teacher loss flatness.
Derives upper and lower generalization bounds for the student relative to the teacher using a new distillation divergence, plus a loss-sharpness-aware bound and a bias-variance-rank decomposition in the linear Gaussian case.

Multi-Modality Distillation via Learning the teacher's modality-level Gram Matrix
cs.AI 2021-12 unverdicted novelty 4.0

Proposes a modality relation distillation method that transfers teacher modality relationships via the modality-level Gram Matrix.