Smaller Text Classifiers with Discriminative Cluster Embeddings

Kevin Gimpel; Mingda Chen

arxiv: 1906.09532 · v1 · pith:YIXYCFWDnew · submitted 2019-06-23 · 💻 cs.CL

Pith reviewed 2026-05-25 18:05 UTC · model grok-4.3

classification 💻 cs.CL

keywords text classification · model compression · word embeddings · Gumbel-Softmax · clustering · parameter efficiency · neural networks

The pith

Text classifiers shrink by learning hard word clusters end-to-end with the task loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that word embedding parameters, which usually dominate model size, can be replaced by a smaller set of cluster embeddings whose assignments are learned jointly with the classification objective. This joint optimization uses the Gumbel-Softmax distribution to select discrete cluster memberships while still allowing gradients to flow back to the clustering decisions. The result is a deployed model whose embedding layer uses far fewer parameters. Selective extra parameters can be added to a subset of words to recover accuracy without losing most of the size savings.

Core claim

By maximizing over latent word-to-cluster assignments with the Gumbel-Softmax distribution while minimizing the downstream task loss, the method produces a hard clustering that replaces individual word embeddings with shared cluster embeddings, yielding smaller classifiers; optional per-word parameter additions further improve accuracy at modest extra cost.

What carries the argument

Gumbel-Softmax relaxation that selects discrete cluster assignments for each word while the entire model is trained on the classification loss.
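
The mechanism named above can be illustrated with a minimal, standard-library sketch of Gumbel-Softmax sampling for a single word. This is not the authors' code: the function names (`gumbel_softmax`, `harden`, `embed`), the three-cluster toy logits, and the temperature value are assumptions made for illustration, and the gradient path (for example a straight-through estimator) is omitted because it needs an autodiff framework.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Draw one relaxed (soft) cluster assignment for a single word."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(u)) for u in (0, 1); clamp u for safety.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def harden(soft):
    """Turn a relaxed sample into a one-hot vector at its argmax."""
    best = max(range(len(soft)), key=soft.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(soft))]


def embed(weights, cluster_embeddings):
    """Weighted sum of cluster embeddings; a one-hot weight picks one row."""
    dim = len(cluster_embeddings[0])
    return [
        sum(w * row[j] for w, row in zip(weights, cluster_embeddings))
        for j in range(dim)
    ]


rng = random.Random(0)
logits = [2.0, 0.5, -1.0]                      # hypothetical assignment logits
cluster_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # K=3, d=2 toy table
soft = gumbel_softmax(logits, tau=0.5, rng=rng)
hard = harden(soft)
vector = embed(hard, cluster_embeddings)
```

With a hard (one-hot) assignment, `vector` is exactly one row of the cluster table, which is why only the K cluster embeddings need to be stored once assignments are discrete.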

If this is right

Deployed text classifiers require substantially less memory for the embedding table at comparable accuracy.
The resulting clusters are task-specific rather than generic.
Accuracy-size trade-offs can be tuned by assigning extra parameters only to selected words.
The approach applies directly to any neural text classifier whose size is dominated by the embedding matrix.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same clustering mechanism could be tested on sequence labeling or generation tasks where embedding size also dominates.
Combining the cluster embeddings with other compression methods such as quantization might yield further reductions.
The learned clusters might reveal task-specific semantic groupings that differ from those produced by unsupervised methods.

Load-bearing premise

The Gumbel-Softmax approximation stays close enough to true hard clustering that the joint optimization produces useful discrete assignments.

What would settle it

A controlled experiment in which models trained with the learned clusters show no accuracy gain over models that use the same number of randomly chosen or fixed clusters at identical parameter budgets would falsify the central claim.
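
One way to set up the random-cluster control described above is sketched below. It is an assumption-laden illustration, not a procedure from the paper: `random_assignment` and `embedding_parameters` are hypothetical helpers, and the vocabulary size, cluster count, and dimension are toy values. The point is only that the baseline shares K and d with the learned clustering, so the parameter budget is identical by construction.

```python
import random


def random_assignment(vocab_size, num_clusters, seed):
    """Assign every word id to a uniformly random cluster (control condition)."""
    rng = random.Random(seed)
    return [rng.randrange(num_clusters) for _ in range(vocab_size)]


def embedding_parameters(num_clusters, dim):
    """Embedding parameters stored at deployment for a K-by-d cluster table."""
    return num_clusters * dim


vocab_size, num_clusters, dim = 1000, 16, 50   # toy values, chosen for illustration
baseline = random_assignment(vocab_size, num_clusters, seed=0)
budget_learned = embedding_parameters(num_clusters, dim)
budget_random = embedding_parameters(num_clusters, dim)
```

Training the same classifier once with the learned assignments and once with `baseline` held fixed would then compare accuracy at equal size.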

Figures

Figures reproduced from arXiv: 1906.09532 by Kevin Gimpel, Mingda Chen.

**Figure 1.** Schematic of deployed cluster embedding … [PITH_FULL_IMAGE:figures/full_fig_p001_1.png]

**Figure 2.** Development accuracy vs. model size (MB) on four datasets. ME consistently outperforms other models under various size budgets. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png]

**Figure 4.** Varying the fraction of training data used on … [PITH_FULL_IMAGE:figures/full_fig_p005_4.png]

**Figure 5.** Plots of 2-dimensional RNN hidden states … [PITH_FULL_IMAGE:figures/full_fig_p007_5.png]

Original abstract

Word embedding parameters often dominate overall model sizes in neural methods for natural language processing. We reduce deployed model sizes of text classifiers by learning a hard word clustering in an end-to-end manner. We use the Gumbel-Softmax distribution to maximize over the latent clustering while minimizing the task loss. We propose variations that selectively assign additional parameters to words, which further improves accuracy while still remaining parameter-efficient.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper uses Gumbel-Softmax to train hard word clusters end-to-end so classifiers store only K embeddings instead of V, but the abstract gives no results to show the compression actually happens.

The main takeaway is that this work trains a latent hard clustering of the vocabulary jointly with the downstream classifier loss, using Gumbel-Softmax to keep the assignment step differentiable. At deployment only the cluster embeddings are kept, which is the stated route to smaller models. The selective-parameter variations are a straightforward way to recover some accuracy without restoring the full vocabulary table. That formulation is the concrete new piece relative to prior two-stage clustering approaches. The description is clear and the motivation lines up with real deployment constraints on embedding size. The math is standard Gumbel-Softmax plus a task loss; nothing circular or invented. The citation pattern pulls the expected references without obvious gaps.

The soft spot is the complete absence of numbers. No accuracy figures, no size measurements, no ablation on temperature or cluster count, and no check on how close the assignments actually get to one-hot. If the relaxation stays soft, the memory saving disappears and the method reduces to ordinary soft clustering. That is exactly the point the stress-test note flags, and the abstract supplies no evidence that annealing or regularization solves it.

A reader working on efficient NLP models would still find the training recipe worth trying, but only after seeing the experiments. The paper is coherent on its own terms and shows straightforward thinking about the discrete bottleneck. I would send it for peer review so referees can examine the runs and the hardness diagnostics.

Referee Report

2 major / 1 minor

Summary. The paper claims that word embedding parameters dominate model sizes in neural NLP, and proposes to reduce deployed sizes of text classifiers by learning a hard word-to-cluster assignment end-to-end via the Gumbel-Softmax distribution (maximizing over the latent clustering while minimizing task loss). It also introduces variations that selectively assign extra parameters to individual words while remaining parameter-efficient.

Significance. If the method reliably produces near-discrete assignments, it would offer a practical route to smaller deployed text classifiers by replacing a V-by-d embedding matrix with a K-by-d matrix (plus a small assignment table). The end-to-end discriminative training of the clustering is a clear strength relative to post-hoc clustering approaches.

major comments (2)

[Abstract, paragraph 2] The central size-reduction claim requires that the Gumbel-Softmax relaxation converge to sufficiently hard (near one-hot) assignments so that only K embeddings are stored at inference; the manuscript provides no guarantee, temperature schedule, or post-training discretization procedure that would ensure the effective parameter count is K·d rather than closer to V·d.
[Method section (Gumbel-Softmax formulation)] Without an explicit analysis or ablation showing that the learned cluster-assignment entropy is low (or that the straight-through estimator yields discrete behavior at test time), the claimed compression benefit remains unsecured even if task accuracy is preserved.
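
The hardness diagnostic requested in these comments could be computed roughly as follows. This is a sketch under stated assumptions: the per-word assignment logits are taken to be available as a list of rows, and `mean_assignment_entropy` is a hypothetical helper name. Entropy near 0 bits indicates near one-hot assignments; entropy near log₂K indicates the assignments are still soft.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one word's cluster logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def mean_assignment_entropy(assignment_logits):
    """Average per-word entropy of the cluster-assignment distribution."""
    rows = [entropy_bits(softmax(row)) for row in assignment_logits]
    return sum(rows) / len(rows)


# Toy logits for two words and K=3 clusters (illustrative values only).
sharp = [[12.0, 0.0, 0.0], [0.0, 15.0, 0.0]]   # nearly one-hot
flat = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]      # uniform over clusters
sharp_entropy = mean_assignment_entropy(sharp)
flat_entropy = mean_assignment_entropy(flat)
```

Tracking this quantity over training would show whether the relaxation actually converges toward discrete assignments.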

minor comments (1)

The abstract states that selective additional parameters 'further improves accuracy' but does not indicate the criterion used to decide which words receive extra parameters.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need to secure the parameter-efficiency claim through explicit analysis of assignment hardness. We address both major comments below and will revise the manuscript accordingly.

Referee: [Abstract, paragraph 2] The central size-reduction claim requires that the Gumbel-Softmax relaxation converge to sufficiently hard (near one-hot) assignments so that only K embeddings are stored at inference; the manuscript provides no guarantee, temperature schedule, or post-training discretization procedure that would ensure the effective parameter count is K·d rather than closer to V·d.

Authors: We agree that the size-reduction claim depends on near-discrete assignments at inference. The original manuscript describes the Gumbel-Softmax but does not detail the temperature schedule or a discretization step. In the revision we will add: (1) the annealing schedule used (starting at τ=1.0 and decaying to 0.1), (2) the explicit post-training procedure of taking argmax over the learned assignment logits for each word, and (3) a statement that only the K cluster embeddings plus the resulting V-to-K lookup table are stored at deployment. We will also report the final average assignment entropy on the test sets. Revision: yes.
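
The schedule and discretization step promised in this response might look like the sketch below. The endpoints τ=1.0 and τ=0.1 come from the response itself; the exponential decay shape, the step counts, and the helper names (`temperature`, `discretize`, `lookup`) are assumptions, since the response does not specify them.

```python
def temperature(step, total_steps, tau_start=1.0, tau_end=0.1):
    """Anneal tau from tau_start to tau_end (exponential shape is assumed)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return tau_start * (tau_end / tau_start) ** frac


def discretize(assignment_logits):
    """Post-training argmax: one cluster index per word (the V-to-K table)."""
    return [max(range(len(row)), key=row.__getitem__) for row in assignment_logits]


def lookup(word_id, word_to_cluster, cluster_embeddings):
    """Deployment-time embedding lookup through the stored cluster index."""
    return cluster_embeddings[word_to_cluster[word_id]]


# Toy example: V=2 words, K=2 clusters, d=2 dimensions (illustrative values).
assignment_logits = [[0.2, 1.5], [3.0, -1.0]]
cluster_embeddings = [[0.1, 0.2], [0.3, 0.4]]
word_to_cluster = discretize(assignment_logits)
first_word_vector = lookup(0, word_to_cluster, cluster_embeddings)
```

After `discretize`, the assignment logits can be discarded; only `word_to_cluster` and `cluster_embeddings` need to ship with the model.
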

Referee: [Method section (Gumbel-Softmax formulation)] Without an explicit analysis or ablation showing that the learned cluster-assignment entropy is low (or that the straight-through estimator yields discrete behavior at test time), the claimed compression benefit remains unsecured even if task accuracy is preserved.

Authors: We acknowledge the absence of such an ablation. The revision will include a new subsection with: (a) plots of assignment entropy versus temperature and training epoch, (b) comparison of straight-through vs. soft Gumbel-Softmax at test time, and (c) measured compression ratios (effective storage = K·d embedding parameters plus V·log₂K bits for the assignment table) on the reported datasets. These additions will directly address the concern that the compression benefit may not be realized. Revision: yes.
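
The storage accounting in item (c) can be worked through numerically. The sketch below converts everything to bits so the two terms are comparable; the 32-bit float width, the rounding of log₂K up to a whole number of bits, and the values of V, K, and d are illustrative assumptions, not figures from the paper.

```python
def full_table_bits(vocab_size, dim, bits_per_float=32):
    """Storage for an ordinary V-by-d embedding matrix."""
    return vocab_size * dim * bits_per_float


def clustered_bits(vocab_size, num_clusters, dim, bits_per_float=32):
    """Storage for a K-by-d cluster table plus one cluster index per word."""
    # Whole bits needed to index K clusters, i.e. ceil(log2(K)), at least 1.
    index_bits = max(1, (num_clusters - 1).bit_length())
    return num_clusters * dim * bits_per_float + vocab_size * index_bits


V, K, d = 100_000, 256, 300          # hypothetical sizes chosen for illustration
full = full_table_bits(V, d)         # 960,000,000 bits
small = clustered_bits(V, K, d)      # 2,457,600 + 800,000 = 3,257,600 bits
ratio = full / small
```

For these toy sizes the assignment table accounts for roughly a quarter of the compressed storage, so it is not negligible and belongs in any reported compression ratio.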

Circularity Check

0 steps flagged

No circularity; proposal is a new training procedure with independent content

The paper introduces an end-to-end optimization for hard word clustering via Gumbel-Softmax to reduce embedding parameters at deployment. No equations or claims reduce to fitted quantities defined within the paper itself, no self-citation chains justify core premises, and no predictions are statistically forced by construction from inputs. The method applies an established external relaxation technique to a task loss without renaming known results or smuggling ansatzes. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated beyond the standard use of Gumbel-Softmax.

pith-pipeline@v0.9.0 · 5578 in / 946 out tokens · 30640 ms · 2026-05-25T18:05:15.652571+00:00 · methodology

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

[3]

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek...

[4]

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of machine learning research 3(Feb):1137–1155.

[5]

Jan A. Botha, Emily Pitler, Ji Ma, Anton Bakalov, Alex Salcianu, David Weiss, Ryan McDonald, and Slav Petrov. 2017. Natural language processing with small feed-forward networks. http://aclweb.org/anthology/D17-1309. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages...

[6]

Peter F. Brown, Peter V. deSouza, Robert L. Mercer, T. J. Watson, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. http://www.aclweb.org/anthology/J92-4003. Computational Linguistics 18(4).

[7]

Song Han, Huizi Mao, and William J Dally. 2016. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations.

[8]

Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both weights and connections for efficient neural network. http://papers.nips.cc/paper/5784-learning-both-weights-and-connections-for-efficient-neural-network.pdf. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems...

[9]

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9(8):1735–1780.

[10]

Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with Gumbel-softmax. In International Conference on Learning Representations.

[11]

Herve Jegou, Matthijs Douze, and Cordelia Schmid. 2011. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(1):117–128.

[12]

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2017. Fasttext.zip: Compressing text classification models. In International Conference on Learning Representations.

[13]

Yoon Kim. 2014. Convolutional neural networks for sentence classification. https://doi.org/10.3115/v1/D14-1181. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, pages 1746–1751.

[14]

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.

[15]

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. http://www.aclweb.org/anthology/P11-1015. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, p...

[16]

Chris J Maddison, Andriy Mnih, and Yee Whye Teh. 2016. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations.

[17]

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. https://doi.org/10.3115/v1/D14-1162. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, pages 1532–1543.

[18]

Raphael Shu and Hideki Nakayama. 2018. Compressing word embeddings via deep compositional code learning. In International Conference on Learning Representations.

[19]

Dan Tito Svenstrup, Jonas Hansen, and Ole Winther. 2017. Hash embeddings for efficient word representations. http://papers.nips.cc/paper/7078-hash-embeddings-for-efficient-word-representations.pdf. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30,...

[20]

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. http://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, Curran Ass...
