pith. machine review for the scientific record.

arxiv: 2605.10933 · v2 · submitted 2026-05-11 · 💻 cs.LG · cs.CL

Recognition: 2 theorem links · Lean Theorem

DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 07:25 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords mixture-of-experts · sparse models · end-side deployment · transformer · ReLU routing · model inference · activation functions

The pith

DECO sparse MoE matches dense Transformer performance while activating only 20% of experts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

DECO introduces a sparse Mixture-of-Experts architecture that seeks to match the performance of dense Transformers using the same total number of parameters and training data. It achieves this through ReLU-based routing with learnable expert-wise scaling to balance expert contributions and a new NormSiLU activation that promotes stable and higher sparsity. A sympathetic reader would care because this could enable powerful models to run efficiently on end-side devices like phones, where storage and memory access are limited. Experiments indicate that activating just 20% of experts suffices to reach dense levels of performance while delivering a 3x speedup on hardware.
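
To make the load-bearing mechanism concrete, the following is a minimal PyTorch sketch of ReLU-based routing with learnable expert-wise scaling over non-gated experts, assembled from the description above rather than from the paper's released code; module names, tensor shapes, and the placeholder expert activation are illustrative assumptions.

```python
# Hypothetical sketch of a DECO-style MoE layer: ReLU routing, learnable
# per-expert scaling, non-gated experts, plus an always-on shared expert.
# Not the authors' implementation; shapes and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReLURoutedMoE(nn.Module):
    def __init__(self, d_model: int, d_expert: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Learnable expert-wise scaling factors that rebalance routed-expert
        # contributions against the shared expert.
        self.expert_scale = nn.Parameter(torch.ones(n_experts))
        # Non-gated MLP experts: up-projection, activation, down-projection.
        self.up = nn.Parameter(0.02 * torch.randn(n_experts, d_model, d_expert))
        self.down = nn.Parameter(0.02 * torch.randn(n_experts, d_expert, d_model))
        self.shared = nn.Sequential(
            nn.Linear(d_model, d_expert), nn.SiLU(), nn.Linear(d_expert, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). ReLU routing: an expert is active for a token
        # whenever its router logit is positive, so the activation ratio is
        # data-dependent rather than a fixed top-k.
        gate = F.relu(self.router(x)) * self.expert_scale   # (tokens, n_experts)
        # Dense reference computation over all experts; a real kernel would
        # evaluate only the experts whose gate is nonzero.
        h = torch.einsum("td,edf->tef", x, self.up)          # (tokens, n_experts, d_expert)
        h = F.silu(h)                                        # placeholder; DECO uses NormSiLU here
        out = torch.einsum("tef,efd->ted", h, self.down)     # (tokens, n_experts, d_model)
        routed = (gate.unsqueeze(-1) * out).sum(dim=1)
        return routed + self.shared(x)


if __name__ == "__main__":
    moe = ReLURoutedMoE(d_model=64, d_expert=32, n_experts=8)
    tokens = torch.randn(10, 64)
    print(moe(tokens).shape)
    ratio = (moe.router(tokens) > 0).float().mean().item()
    print(f"routed-expert activation ratio ≈ {ratio:.2f}")
```

Because the gate is ReLU applied to the router logits rather than a top-k softmax, the number of active experts per token is data-dependent; per Figure 2, the paper controls this with an adaptive sparsity regularization term, which is omitted from this sketch.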

Core claim

DECO achieves performance comparable to dense Transformers of the same total parameter count by activating only 20% of its experts through differentiable ReLU-based routing enhanced by learnable expert-wise scaling and the NormSiLU activation function, which stabilizes the routed-expert activation ratio and increases intrinsic sparsity. The architecture also benefits from using non-gated MLP experts.

What carries the argument

ReLU-based routing with learnable expert-wise scaling that adaptively balances routed and shared experts, together with the NormSiLU activation function for stable sparsity.
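
The paper specifies NormSiLU only as normalizing inputs before the SiLU operator. A speculative reading, consistent with the "w/o RMS" and "w/o Mean" ablation labels in Figure 6, is a zero-mean RMS normalization followed by SiLU, sketched below; the paper's exact formulation may differ.

```python
# Speculative NormSiLU-style activation: normalize, then apply SiLU.
# The normalization here (mean-centering + RMS scaling) is a guess inferred
# from the ablation labels, not the paper's stated definition.
import torch
import torch.nn.functional as F


def norm_silu(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    centered = x - x.mean(dim=-1, keepdim=True)
    normed = centered * torch.rsqrt(centered.pow(2).mean(dim=-1, keepdim=True) + eps)
    return F.silu(normed)
```

Per the abstract, the intended effect is a more stable trend of the routed-expert activation ratio over training and a higher intrinsic sparsity level than plain SiLU.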

Load-bearing premise

The learned expert-wise scaling and NormSiLU will continue to produce stable sparsity and dense-matching performance when model size, data distribution, or hardware changes substantially.

What would settle it

Observe whether a DECO model trained at larger scale or on shifted data distributions maintains performance parity with its dense counterpart and keeps the 20% activation ratio stable.
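
A hedged sketch of the monitoring this would involve: track the fraction of nonzero ReLU router outputs on held-out or distribution-shifted data and compare it with the nominal 20% level. The `router_logits` hook below is hypothetical and would need to be adapted to the released code.

```python
# Minimal monitoring sketch for the routed-expert activation ratio.
# `model.router_logits(batch)` is a hypothetical accessor returning the
# pre-ReLU router logits with shape (tokens, n_experts).
import torch


@torch.no_grad()
def activation_ratio(model, batches) -> float:
    active, total = 0, 0
    for batch in batches:
        logits = model.router_logits(batch)
        active += (logits > 0).sum().item()   # ReLU gate is nonzero iff logit > 0
        total += logits.numel()
    return active / max(total, 1)
```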

Figures

Figures reproduced from arXiv: 2605.10933 by Chaojun Xiao, Chenyang Song, Weilin Zhao, Xu Han, Yingfa Chen, Zhiyuan Liu.

Figure 1. The “ideal triangle” of end-side MoE. Beyond the high performance and reduced computational cost of sparse MoE, the model should maintain a minimal storage footprint, achieving high performance within dense-comparable total parameter budgets.
Figure 2. The overall architecture of DECO. For router design, we adopt ReLU-based routing enhanced by learnable expert-wise router scaling. For expert design, we propose NormSiLU as a better routed-expert activation function and employ non-gated MLP experts. For precise sparsity control, we employ adaptive sparsity regularization.
Figure 3. The evaluation results of DECO versus baseline settings. “PPL” and “Task” indicate the C4 validation perplexity and the average accuracy (%) on downstream benchmarks, respectively. DeepSeek-V3 uses gated MLP experts, and ReMoE uses non-gated ones, due to their better performance than the opposite settings; see Section 4.4 for detailed discussion.
Figure 4. The distribution of routed-expert output norms in the first MoE layer of DECO (Medium) on the C4 validation set, which shows clear expert-wise heterogeneity.
Figure 6. The trend of the regularization coefficient of DECO (Small) and ablation settings without different steps of NormSiLU. The baseline “SiLU” and “w/o RMS” settings show significantly higher coefficients, which potentially harm performance.
Figure 8. The trend of routed-expert activation ratio of DECO (Small) using different expert gating policies.
Figure 9. The impact of the routed-expert activation ratio on the performance of DECO (Small and Medium).
read the original abstract

While Mixture-of-Experts (MoE) scales model capacity without proportionally increasing computation, its massive total parameter footprint creates significant storage and memory-access bottlenecks, which hinder efficient end-side deployment that simultaneously requires high performance, low computational cost, and small storage overhead. To achieve these properties, we present DECO, a sparse MoE architecture designed to match the performance of dense Transformers under identical total parameter budgets and training tokens. DECO utilizes the differentiable and flexible ReLU-based routing enhanced by learnable expert-wise scaling, which adaptively balances the contributions of routed and shared experts. Furthermore, we introduce NormSiLU, an activation function that normalizes inputs prior to SiLU operators, producing a more stable trend of routed-expert activation ratio and a higher intrinsic sparsity level. We also identify an empirical advantage in using non-gated MLP experts with ReLU-based routing, indicating the possibility of MoE architecture simplification. Experiments demonstrate that DECO, activating only 20% of experts, matches dense performance and outperforms established MoE baselines. Our specialized acceleration kernel delivers a 3.00$\times$ speedup on real hardware compared with dense inference. Codes and checkpoints are all available at https://github.com/thunlp/DECO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DECO, a sparse Mixture-of-Experts architecture for end-side devices that matches dense Transformer performance under identical total parameter budgets and training tokens. It employs ReLU-based routing augmented by learnable expert-wise scaling factors to balance routed and shared experts, introduces the NormSiLU activation (input normalization before SiLU) to stabilize the routed-expert activation ratio at ~20% sparsity, and uses non-gated MLP experts. Experiments report that DECO outperforms established MoE baselines while a custom kernel achieves 3.00× inference speedup on real hardware; code and checkpoints are released.

Significance. If the empirical performance match holds under the reported constraints, the result is significant for memory-constrained deployment of high-capacity models, as it decouples total parameters from active computation and storage overhead without requiring post-training compression. The open release of code, checkpoints, and hardware measurements strengthens reproducibility and enables direct verification of the claimed speedup.

major comments (2)
  1. [Experiments] Experiments section (and associated tables/figures): the central claim that DECO matches dense performance while activating only 20% of experts is presented without reported standard deviations, number of random seeds, or statistical significance tests for the accuracy comparisons. This makes it impossible to assess whether the match is robust or within variance of the dense baseline.
  2. [Method] Section describing NormSiLU and expert-wise scaling: the stabilization of the routed-expert activation ratio is asserted empirically, yet no scaling curves, ablations on expert count, or analysis under distribution shift are provided. The interaction between input normalization in NormSiLU and growing expert capacity therefore remains untested, which is load-bearing for the claim that the 20% sparsity level and dense-comparable accuracy will persist beyond the evaluated end-side model sizes.
minor comments (2)
  1. [Method] The abstract and method sections use “non-gated MLP experts” without an explicit equation or diagram contrasting them to standard gated experts; a short comparison equation would clarify the claimed simplification. A generic sketch of the contrast appears after these comments.
  2. [Experiments] Figure captions and table footnotes should explicitly state the exact model sizes, dataset, and token budget used for each dense vs. DECO comparison to allow immediate replication.
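
For reference, the generic contrast that the first minor comment asks for looks like the following sketch; this is the standard gated (SwiGLU-style) versus non-gated MLP formulation, not an excerpt from the paper.

```python
# Generic gated vs. non-gated MLP experts; illustrative, not the paper's code.
import torch.nn as nn
import torch.nn.functional as F


class GatedExpert(nn.Module):
    """Gated expert: FFN(x) = W_down( SiLU(W_gate x) * (W_up x) )."""
    def __init__(self, d_model: int, d_expert: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_expert, bias=False)
        self.w_up = nn.Linear(d_model, d_expert, bias=False)
        self.w_down = nn.Linear(d_expert, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class NonGatedExpert(nn.Module):
    """Non-gated expert: FFN(x) = W_down( act(W_up x) ), one fewer matrix."""
    def __init__(self, d_model: int, d_expert: int, act=F.silu):
        super().__init__()
        self.w_up = nn.Linear(d_model, d_expert, bias=False)
        self.w_down = nn.Linear(d_expert, d_model, bias=False)
        self.act = act

    def forward(self, x):
        return self.w_down(self.act(self.w_up(x)))
```

Dropping the gate projection is the architecture simplification the abstract alludes to when it reports an empirical advantage for non-gated experts under ReLU-based routing.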

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help improve the clarity and rigor of our work. We address each major comment point by point below and will revise the manuscript to incorporate the suggested additions.

read point-by-point responses
  1. Referee: [Experiments] Experiments section (and associated tables/figures): the central claim that DECO matches dense performance while activating only 20% of experts is presented without reported standard deviations, number of random seeds, or statistical significance tests for the accuracy comparisons. This makes it impossible to assess whether the match is robust or within variance of the dense baseline.

    Authors: We agree that reporting variability and statistical tests is essential for robust evaluation. In the revised manuscript, we will explicitly state the number of random seeds (we used 3 seeds for all main experiments) and include standard deviations alongside mean accuracy values in Tables 1-3 and Figure 2. We will also add pairwise t-test results (with p-values) comparing DECO to the dense baseline to confirm that observed differences fall within expected variance (a minimal example of such a test is sketched after this exchange). revision: yes

  2. Referee: [Method] Section describing NormSiLU and expert-wise scaling: the stabilization of the routed-expert activation ratio is asserted empirically, yet no scaling curves, ablations on expert count, or analysis under distribution shift are provided. The interaction between input normalization in NormSiLU and growing expert capacity therefore remains untested, which is load-bearing for the claim that the 20% sparsity level and dense-comparable accuracy will persist beyond the evaluated end-side model sizes.

    Authors: We acknowledge that additional analysis would better support the generalizability of NormSiLU. In the revision, we will add a new figure with scaling curves of the routed-expert activation ratio versus expert count (from 4 to 32 experts) and model size. We will also include an ablation table on expert count and a short analysis of activation ratio stability under distribution shift using a held-out validation set from a different domain. revision: yes
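
To make the statistical check promised in the first response concrete, here is a minimal sketch of a paired t-test over per-seed accuracies; the values are placeholders, not results from the paper.

```python
# Paired t-test over per-seed downstream accuracies, DECO vs. dense baseline.
# The accuracy values below are placeholders for illustration only.
from scipy.stats import ttest_rel

deco_acc = [52.1, 51.8, 52.4]    # hypothetical per-seed mean accuracies (%)
dense_acc = [52.0, 52.2, 51.9]   # hypothetical per-seed mean accuracies (%)

stat, p_value = ttest_rel(deco_acc, dense_acc)
print(f"t = {stat:.3f}, p = {p_value:.3f}")
# A large p-value is consistent with the gap lying within seed-to-seed
# variance, though it does not by itself establish equivalence.
```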

Circularity Check

0 steps flagged

No circularity; empirical results independent of inputs

full rationale

The paper introduces DECO via ReLU routing with learnable expert-wise scaling and the NormSiLU activation, then validates the 20% activation claim through end-to-end training and hardware benchmarks on fixed model sizes and token budgets. No equations reduce a prediction to a fitted parameter by construction, no load-bearing premise rests on self-citation, and no ansatz or uniqueness result is smuggled in. The architecture choices and performance match are presented as design decisions confirmed experimentally rather than tautologically derived from the same quantities.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The architecture adds learnable scaling parameters and a new activation on top of standard MoE and transformer assumptions.

free parameters (1)
  • expert-wise scaling factors
    Learnable per-expert multipliers that balance routed and shared experts.
axioms (1)
  • domain assumption: Standard dense Transformer pre-training assumptions hold for the MoE variant.
    Training tokens and optimizer settings are assumed comparable to dense baselines.
invented entities (1)
  • NormSiLU (no independent evidence)
    purpose: Activation that normalizes before SiLU to stabilize routed-expert ratios.
    New function introduced to increase intrinsic sparsity.

pith-pipeline@v0.9.0 · 5534 in / 1051 out tokens · 58530 ms · 2026-05-13T07:25:41.711109+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

192 extracted references · 192 canonical work pages · 20 internal anchors
