
arxiv: 2606.18304 · v1 · pith:WDCIBQBTnew · submitted 2026-06-16 · 💻 cs.LG · cs.AI

Attribution-Guided and Coverage-Maximized Pruning for Structural MoE Compression

Pith reviewed 2026-06-27 02:04 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords mixture of experts · model compression · structural pruning · channel pruning · attribution scores · quantization-aware pruning

The pith

Channel-level pruning for MoE models preserves accuracy at 50% or 25% structured rates by maximizing coverage of important channels via attribution scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a pruning method for Mixture-of-Experts models that prunes individual channels inside experts rather than removing whole experts. It starts from the observation that the information in each expert concentrates in a small number of channels, so expert-level decisions leave intra-expert redundancy untouched. The approach recasts the allocation of prune ratios as a coverage-maximization problem over channel importance scores and solves it with an attribution-based approximation. Experiments on DeepSeek and Qwen models show that the compressed models retain accuracy when pruning is paired with 4-bit quantization, and that the method delivers a 5.27× memory reduction on Qwen3-30B-A3B while beating prior baselines.

Core claim

The paper claims that reformulating prune-ratio allocation as a channel-score coverage maximization problem and solving it with an attribution-based approximation yields a structural pruning method that maintains model accuracy under 50% or 25% structured pruning on MoE architectures when combined with 4-bit quantization, outperforming expert-level baselines.

What carries the argument

Attribution-guided coverage maximization that assigns each expert a prune ratio (a channel budget) so that the retained channels cover as much of the total importance score as possible.
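
The allocation step can be sketched as a binary search over a shared coverage level, in the spirit of the Algorithm 1 fragment reproduced with Figure 5. That excerpt is truncated, so the rule mapping the coverage level `alpha` and the weight `phi[g]` to a per-group coverage target is an assumption here (`rho_g = min(1, alpha * phi_g)`, with `phi` scaled so its maximum is 1), as are the function names. The 50-iteration cap follows the hyperparameter note quoted with Figure 12.

```python
import numpy as np

def channels_for_coverage(sorted_scores, rho):
    """Smallest n whose top-n channels cover a fraction rho of the group's score."""
    if rho <= 0 or len(sorted_scores) == 0:
        return 0
    prefix = np.cumsum(sorted_scores)  # prefix sums S_g(n)
    n = int(np.searchsorted(prefix, rho * prefix[-1], side="left")) + 1
    return min(n, len(sorted_scores))

def allocate_channels(scores_per_group, phi, budget, eps=0.01, max_iters=50):
    """Binary-search a coverage level alpha so the kept channels match the budget.

    ASSUMPTION: the per-group target is rho_g = min(1, alpha * phi_g); the
    source excerpt does not show this step.
    """
    groups = [np.sort(np.asarray(s, dtype=float))[::-1] for s in scores_per_group]
    phi = np.asarray(phi, dtype=float)
    phi = phi / phi.max()                      # hypothetical normalization
    n_tot = sum(len(g) for g in groups)
    lo, hi = 0.0, 1.0
    alloc = [0] * len(groups)
    for _ in range(max_iters):
        alpha = 0.5 * (lo + hi)
        alloc = [channels_for_coverage(g, min(1.0, alpha * w))
                 for g, w in zip(groups, phi)]
        used = sum(alloc)
        if abs(used - budget) <= eps * n_tot:  # close enough to the budget
            break
        if used > budget:
            hi = alpha                         # too many channels: lower coverage
        else:
            lo = alpha                         # room left: raise coverage
    # If the search does not converge, the last candidate is returned as-is.
    return alloc

# One concentrated expert and one flat expert, with a budget of 4 channels.
print(allocate_channels([[10, 1, 1, 1], [1, 1, 1, 1]], phi=[1, 1], budget=4))
```

In this toy case the concentrated expert keeps a single channel while the flat expert keeps three, which is the behaviour the coverage argument predicts: budget flows to experts whose score is spread over many channels.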

If this is right

  • 50% structured channel pruning plus 4-bit quantization can reduce memory footprint by more than 5 times on 30B-scale MoE models without accuracy collapse.
  • Prune budgets can be allocated across channels inside retained experts rather than across experts.
  • The same coverage-maximization procedure applies to both DeepSeek and Qwen MoE families and beats existing expert-ranking methods on standard benchmarks.
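
As a rough sanity check on memory figures of this kind, the ideal compression ratio from pruning plus quantization can be computed directly. This is back-of-the-envelope arithmetic, not the paper's accounting: it assumes a 16-bit baseline, ignores quantization metadata such as scales and zero points, and uses a hypothetical `compressible_frac` parameter for the share of weights that are actually pruned and quantized.

```python
def compression_ratio(prune_frac, bits, baseline_bits=16, compressible_frac=1.0):
    """Ideal memory reduction factor under the stated assumptions.

    prune_frac        -- fraction of compressible weights removed by pruning
    bits              -- bit width of the remaining compressible weights
    compressible_frac -- ASSUMED share of all weights that is compressed;
                         the rest stays at baseline_bits
    """
    untouched = 1.0 - compressible_frac
    compressed = compressible_frac * (1.0 - prune_frac) * bits / baseline_bits
    return 1.0 / (untouched + compressed)

for prune_frac in (0.25, 0.50):
    ratio = compression_ratio(prune_frac, bits=4)
    print(f"prune {prune_frac:.0%} + 4-bit: {ratio:.2f}x")
```

Under these assumptions the upper bound is about 5.33× for 25% pruning with 4-bit weights and 8× for 50% pruning. A measured figure would fall below the bound whenever part of the model is left uncompressed or quantization metadata is counted.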

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the channel-concentration pattern generalizes beyond the tested models, the same allocation logic could be applied to other sparsely activated architectures.
  • The method could be combined with expert merging or routing adjustments to further reduce inference latency.
  • A direct measurement of channel importance variance across layers would give a simple diagnostic for when the approach is likely to succeed.
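
One way to make the suggested diagnostic concrete is to measure, per expert, what fraction of channels is needed to reach a fixed share of the total score, then summarize that fraction per layer. This is a minimal sketch: the 90% default, the function names, and the input layout (a dict mapping layer to a list of per-expert score arrays) are illustrative assumptions rather than anything specified in the paper.

```python
import numpy as np

def channels_to_cover(scores, target=0.9):
    """Fraction of channels needed for the top channels to hold `target` of the score."""
    s = np.sort(np.asarray(scores, dtype=float))[::-1]
    prefix = np.cumsum(s)
    n = int(np.searchsorted(prefix, target * prefix[-1], side="left")) + 1
    return min(n, len(s)) / len(s)

def layer_report(layer_scores, target=0.9):
    """Mean and variance of the per-expert coverage fraction for each layer."""
    report = {}
    for layer, experts in layer_scores.items():
        fracs = np.array([channels_to_cover(e, target) for e in experts])
        report[layer] = {"mean_frac": float(fracs.mean()),
                         "var_frac": float(fracs.var())}
    return report

# Toy input: one concentrated expert and one flat expert in layer 0.
toy = {0: [[10, 1, 1, 1], [1, 1, 1, 1]]}
print(layer_report(toy, target=0.75))
```

A low mean fraction suggests concentrated experts where channel-level pruning has room to work; a high variance suggests that a uniform per-expert ratio would misallocate the budget.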

Load-bearing premise

Information inside each MoE expert is concentrated enough in a small subset of channels that channel-level decisions reliably capture more redundancy than expert-level decisions.

What would settle it

A controlled test on an MoE model where expert-level pruning at the same ratio retains higher accuracy than the channel-level method would falsify the claimed advantage.

Figures

Figures reproduced from arXiv: 2606.18304 by Dacheng Tao, Ge Yang, Jiacheng Wang, Jinyang Guo, Xianglong Liu, Yifu Ding, Yongcheng Jing.

Figure 1
Figure 1: Overview of our pruning framework: estimating expert importance via an attribution-based approximation (left), maximizing score coverage to avoid wasting capacity (middle), and applying alignment-aware redistribution for compact storage and kernel-friendly low-bit inference (right).
Figure 2
Figure 2: Misalignment between router outputs and expert-wise ablated NLL. (a) and (b) rank the top 50 experts by router weight and token usage. The NLL (bars) demonstrates a weak correlation with router outputs. Notably, the orange bars highlight that even selected experts can provide negative contributions.
Figure 4
Figure 4: (a) Cumulative channel score distribution, which reveals that many experts have highly concentrated channel scores. (b) Layer-wise output loss under various prune ratios; for some experts, the loss drops rapidly after keeping only a small fraction of channels.
Contribution can be recovered by few channels. Given the concentration patterns, whole-expert ablation becomes an overly coarse proxy for redundancy.
Figure 5
Figure 5: The overview of the Attribution-Guided & Coverage-Maximized Expert-wise Pruning framework for MoE models.
Algorithm 1 Coverage-Maximized Allocation Search
1: Input: score allocation weights $\phi \in \mathbb{R}_+^{|G|}$; prefix sums $\{S_g(n)\}_{g \in G}$; total scores $\{S_g^{tot}\}_{g \in G}$; channel budget $N_{budget}$; total channels $N_{tot}$; tolerance $\varepsilon$.
2: Output: channel budgets $\{N_g^\star\}_{g \in G}$
3: $\alpha_{min} \leftarrow 0$, $\alpha_{max} \leftarrow 1$
4: while $\alpha_{min} < \alpha_{max}$ do
5: $\alpha \leftarrow (\alpha_{min} +$ …
Figure 9
Figure 9: Pareto frontier of average downstream-task accuracy versus compressed model storage (GB) for Qwen1.5-MoE-A2.7B. Our channel-level pruning consistently dominates expert-level baselines across the full compression range.
C.2.2. Wider Pruning–Quantization Combinations. Based on our current experiments, P25% Q4b gives the best accuracy-efficiency tradeoff among the default deployment-oriented settings, but we d…
Figure 10
Figure 10: Throughput and runtime memory usage of Qwen1.5-MoE-A2.7B with different minimal channel numbers and alignment granularity.
• Weight Magnitude (channel-wise L2 norm). For expert $e$ in layer $\ell$ and a projection $\phi$ with weight matrix $W^{(\phi)}_{\ell,e} \in \mathbb{R}^{O_\phi \times I_\phi}$, we define the importance of input channel $c$ by the L2 norm of the corresponding weight column: $s^{(\phi,W)}_{\ell,e,c} = \lVert W^{(\phi)}_{\ell,e,c} \rVert_2 =$ …
Figure 11
Figure 11: Losses (raw and smoothed), coverage ratio and channel keep ratio after pruning.
Alternative Smoothing Functions for Layerwise Loss. The square-root smoothing used in inter-layer allocation is not a theoretically essential component; rather, it is a simple realization of monotone-concave dynamic-range compression. Without smoothing, a few high-loss layers capture most of the channel budget while low-loss la…
Figure 12
Figure 12: Smoothed layerwise loss under different monotone-concave smoothing functions.
C.3.4. Hyperparameter Sensitivity in CBA and AAR. For Coverage-Maximized Budget Allocation, we set the maximum number of binary-search iterations to 50. In practice, the search usually converges within 30 iterations, and the maximum only serves as a safeguard. For Alignment-Aware Redistribution, the minimum kept-channel threshold…
Figure 13
Figure 13: Router entropy distributions across tasks, layer depths, and MoE models evaluated in the main text. The distributions illustrate the routing-dynamics variation used to evaluate robustness across architectures and tasks.
Despite the diverse aspects, most existing works only rank experts at the granularity of entire experts and do not explicitly analyze the redundancy within each expert. MoE-I2 (Yang et al…
Figure 14
Figure 14: Distribution of expert activation magnitudes across representative shallow, middle, and deep layers of Qwen1.5-MoE-A2.7B under different calibration corpora. The figure supports the calibration robustness analysis by showing that activation heterogeneity persists across data sources.
Figure 15
Figure 15: Cumulative score fractions (blue stacked bars), kept channels (red lines), and attribution scores (yellow diamond markers) for each expert in specific layers of Qwen1.5-MoE-A2.7B.
Figure 16
Figure 16: Cumulative channel-score fractions (blue stacked bars), kept channels (red lines), and attribution scores (yellow diamonds) for experts within a representative layer of Qwen3-30B-A3B. Panels (a) and (b) use activation-based scores, while (c) and (d) use weight-gradient scores (weight times gradient). The channel-score concentration pattern appears under both metrics. Although experts exhibit substantial h…
Original abstract

Mixture-of-Experts (MoE) models scale compute efficiently, yet remain expensive to deploy due to their substantial memory footprint and inference overhead. Prior compression methods mainly operate at the expert level, either removing entire experts or ranking experts by coarse-grained importance scores. However, such expert-wise decisions are often too coarse to capture fine-grained redundancy, leading to misallocated pruning budgets and limited compression. To address this problem, we observe that information within MoE experts is highly concentrated in a small subset of channels, leaving substantial redundancy even in experts deemed important. Based on this observation, we propose a structural pruning framework tailored for MoE models. Our method reformulates prune-ratio allocation as a channel-score coverage maximization problem and solves it efficiently using an attribution-based approximation. Experiments on DeepSeek and Qwen MoE models show that our method preserves model accuracy under 50% or 25% structured pruning when combined with 4-bit quantization. On Qwen3-30B-A3B, our approach reduces memory footprint by 5.27$\times$ and consistently outperforms state-of-the-art baselines across diverse benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a structural pruning framework for Mixture-of-Experts (MoE) models based on the observation that information within experts is highly concentrated in a small subset of channels. It reformulates prune-ratio allocation as a channel-score coverage maximization problem solved via an attribution-based approximation. Experiments on DeepSeek and Qwen MoE models claim that the method preserves accuracy under 50% or 25% structured pruning combined with 4-bit quantization, achieving a 5.27× memory reduction on Qwen3-30B-A3B while outperforming state-of-the-art baselines across diverse benchmarks.

Significance. If the empirical results hold under rigorous controls, the work could meaningfully advance efficient deployment of large MoE models by enabling finer-grained pruning that captures intra-expert redundancy better than expert-level baselines. The reported memory reduction and accuracy preservation on models like Qwen3-30B-A3B indicate potential practical utility for memory-constrained inference.

major comments (2)
  1. [Abstract] The central claim that the method 'preserves model accuracy' under 50%/25% pruning + 4-bit quantization rests on experimental outcomes whose details (specific benchmarks, variance across runs, controls for the channel-concentration observation) are not visible; this makes it difficult to evaluate whether the coverage-maximization step is load-bearing or if results could be explained by the quantization alone.
  2. [Abstract] The weakest assumption—that channel-level decisions capture fine-grained redundancy better than expert-level decisions—requires explicit validation (e.g., ablation comparing channel vs. expert pruning ratios on the same models); without it, the reformulation as coverage maximization risks being an ad-hoc improvement rather than a principled advance.
minor comments (2)
  1. The abstract states results on 'DeepSeek and Qwen MoE models' but does not name the exact model variants or pruning ratios per model; adding a table summarizing these would improve clarity.
  2. Consider defining 'attribution-based approximation' more precisely even at the abstract level, or adding a short methods paragraph, to allow readers to assess the approximation's fidelity without the full text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for minor revision. We address each major comment below and will revise the manuscript accordingly to improve clarity and strengthen the presentation of our results.

Point-by-point responses
  1. Referee: [Abstract] The central claim that the method 'preserves model accuracy' under 50%/25% pruning + 4-bit quantization rests on experimental outcomes whose details (specific benchmarks, variance across runs, controls for the channel-concentration observation) are not visible; this makes it difficult to evaluate whether the coverage-maximization step is load-bearing or if results could be explained by the quantization alone.

    Authors: We agree that the abstract is too concise to convey the full experimental context. The manuscript already reports these details in Sections 4 and 5 (specific benchmarks including MMLU, GSM8K, HumanEval; tables with mean/std over 3 seeds; direct comparisons to quantization-only and expert-pruning baselines showing the additional gain from coverage maximization). In the revision we will expand the abstract to briefly list the key benchmarks, note that results are averaged across runs, and reference the controls for the channel-concentration observation, making the load-bearing role of the proposed step explicit. revision: yes

  2. Referee: [Abstract] The weakest assumption—that channel-level decisions capture fine-grained redundancy better than expert-level decisions—requires explicit validation (e.g., ablation comparing channel vs. expert pruning ratios on the same models); without it, the reformulation as coverage maximization risks being an ad-hoc improvement rather than a principled advance.

    Authors: The manuscript already contains head-to-head comparisons against expert-level pruning baselines at identical overall pruning ratios (Table 2, Figure 3), demonstrating consistent gains from channel-level allocation. Nevertheless, we acknowledge that a dedicated side-by-side ablation isolating the effect of channel versus expert granularity would strengthen the argument. We will add this ablation in the revised version, reporting accuracy and memory metrics for both strategies on the same DeepSeek and Qwen models. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's core contribution is an empirical pruning method: it states an observation on channel-level concentration within MoE experts, reformulates prune-ratio allocation as a coverage-maximization problem, and solves it with an attribution approximation. No equations, derivations, or load-bearing steps are described that reduce by construction to fitted parameters, self-definitions, or self-citation chains. The central accuracy-preservation claims rest on direct experiments with DeepSeek/Qwen models and external baselines, which constitute independent empirical support rather than internal re-derivation of inputs. This is the normal case of a self-contained applied method paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that channel-level redundancy is substantial and that attribution scores provide a sufficient proxy for coverage maximization. No free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: Attribution scores can be used as a proxy for channel importance in coverage maximization.
    Invoked to justify the approximation that solves the prune-ratio allocation problem.
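
To make this assumption easier to inspect, here is a minimal sketch of two per-channel importance proxies that the reproduced figure text mentions: the channel-wise L2 norm of a weight column (text beside Figure 10) and a weight-times-gradient score (Figure 16 caption). The function names and the way the weight-times-gradient terms are summed into one value per channel are assumptions; the page does not show the paper's exact aggregation.

```python
import numpy as np

def weight_magnitude_scores(W):
    """L2 norm of each input-channel column of W, where W has shape (out, in)."""
    return np.linalg.norm(np.asarray(W, dtype=float), axis=0)

def weight_times_gradient_scores(W, G):
    """First-order proxy per input channel: |sum over outputs of W * dL/dW|.

    ASSUMPTION: summing over the output dimension before taking the absolute
    value is one common choice, not necessarily the paper's.
    """
    W = np.asarray(W, dtype=float)
    G = np.asarray(G, dtype=float)
    return np.abs((W * G).sum(axis=0))

W = [[3.0, 0.0], [4.0, 1.0]]    # toy weight matrix: 2 outputs, 2 input channels
G = [[1.0, -1.0], [1.0, 1.0]]   # toy gradient of the loss with respect to W
print(weight_magnitude_scores(W))
print(weight_times_gradient_scores(W, G))
```

Comparing the channel rankings produced by two such proxies on the same expert is a cheap way to probe how sensitive the coverage allocation is to the choice of score.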

pith-pipeline@v0.9.1-grok · 5745 in / 1172 out tokens · 40351 ms · 2026-06-27T02:04:52.704184+00:00 · methodology


Reference graph

Works this paper leans on

28 extracted references · 21 canonical work pages · 11 internal anchors

  1. [1]

    OpenCodeReasoning: Advancing Data Distillation for Competitive Coding

    Ahmad, W. U., Narenthiran, S., Majumdar, S., Ficek, A., Jain, S., Huang, J., Noroozi, V., and Ginsburg, B. OpenCodeReasoning: Advancing data distillation for competitive coding. arXiv preprint arXiv:2504.01943.

  2. [2]

    Orchestrating hidden-intermediate pruning-and-distill for MoEs slimming

    Anonymous. Orchestrating hidden-intermediate pruning-and-distill for MoEs slimming. Anonymous ICML 2026 submission (under review).

  3. [3]

    Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning

    Cao, M., Li, G., Ji, J., Zhang, J., Ma, X., Liu, S., and Yin, L. Condense, don’t just prune: Enhancing efficiency and performance in MoE layer pruning. arXiv preprint arXiv:2412.00069.

  4. [4]

    A provably effective method for pruning experts in fine-tuned sparse mixture-of-experts. arXiv preprint arXiv:2405.16646, 2024

    doi: 10.48550/arXiv.2405.16646. Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In North American Chapter of the Association for Computational Linguistics.

  5. [5]

    doi: 10.48550/arXiv.2410.11988. Gong, R., Ding, Y., Wang, Z., Lv, C., Zheng, X., Du, J., Yong, Y., Gu, S., Qin, H., et al. A survey of low-bit large language models: Basics, systems, and algorithms. Neural Networks, pp. 107856.

  6. [6]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025a. Guo, H., Yao, J., Wang, B., Du, J., Cao, S., Di, D., Zhang, S., and Li, Z. Cluster-driven expert pruning for mixture-of-experts lar...

  7. [7]

    Measuring Massive Multitask Language Understanding

    ISSN 2835-8856. URL https://openreview.net/forum?id=HTpMOl6xSI. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. ArXiv, abs/2009.03300.

  8. [8]

    Modes: Accelerating mixture-of-experts multimodal large language models via dynamic expert skipping

    doi: 10.48550/arXiv.2410.06270. Huang, Y., Wang, Z., Yuan, Z., Ding, Y., Gong, R., Guo, J., Liu, X., and Zhang, J. Modes: Accelerating mixture-of-experts multimodal large language models via dynamic expert skipping. arXiv preprint arXiv:2511.15690.

  9. [9]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Jain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., and Stoica, I. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974.

  10. [10]

    Mixtral of Experts

    Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., Casas, D. d. l., Hanna, E. B., Bressand, F., et al. Mixtral of experts. arXiv preprint arXiv:2401.04088.

  11. [11]

    Let's Verify Step by Step

    Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let’s verify step by step. arXiv preprint arXiv:2305.20050.

  12. [12]

    Efficient expert pruning for sparse mixture-of-experts language models: Enhancing performance and reducing inference costs

    Liu, E., Zhu, J., Lin, Z., Ning, X., Blaschko, M. B., Yan, S., Dai, G., Yang, H., and Wang, Y. Efficient expert pruning for sparse mixture-of-experts language models: Enhancing performance and reducing inference costs. In arXiv.org, 2024a. doi: 10.48550/arXiv.2407.00945. Liu, S.-Y., Wang, C.-Y., Yin, H., Molchanov, P., Wang, Y.-C. F., Cheng, K.-T., an...

  13. [13]

    LLMC+: Benchmarking vision-language model compression with a plug-and-play toolkit

    doi: 10.48550/arXiv.2402.14800. Lv, C., Zhang, B., Yong, Y., Gong, R., Huang, Y., Gu, S., Wu, J., Shi, Y., Guo, J., et al. LLMC+: Benchmarking vision-language model compression with a plug-and-play toolkit. In AAAI Conference on Artificial Intelligence.

  14. [14]

    arXiv preprint arXiv:2404.05089

    doi: 10.48550/arXiv.2404.05089. Qwen-Team. Qwen3 technical report.

  15. [15]

    Qwen3 Technical Report

    URL https://arxiv.org/abs/2505.09388. Raffel, C., Shazeer, N. M., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.

  16. [16]

    GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    URL https://arxiv.org/abs/2311.12022. Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. Winogrande: An adversarial winograd schema challenge at scale. Proceedings of the AAAI Conference on Artificial Intelligence. Skean, O., Arefin, M. R., Zhao, D., Pate...

  17. [17]

    Sun, M., Liu, Z., Bair, A., and Kolter, J. Z. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695.

  18. [18]

    Metamorphic testing of large language models for natural language processing

    doi: 10.48550/arXiv.2410.12013. Xu, H., Wu, H., Ke, X., Wu, J., Xu, R., and Xu, J. Mc-moe: Completing missing modalities with mixture of experts for incomplete multimodal action quality assessment.

  19. [19]

    OpenMoE: An early effort on open mixture-of-experts language models

    URL https://arxiv.org/abs/2511.17397. Xue, F., Zheng, Z., Fu, Y., Ni, J., Zheng, Z., Zhou, W., and You, Y. OpenMoE: An early effort on open mixture-of-experts language models. arXiv preprint arXiv:2402.01739.

  20. [20]

    MoE-I2: Compressing mixture of experts models through inter-expert pruning and intra-expert low-rank decomposition

    Yang, C., Sui, Y., Xiao, J., Huang, L., Gong, Y., Duan, Y., Jia, W., Yin, M., Cheng, Y., and Yuan, B. MoE-I2: Compressing mixture of experts models through inter-expert pruning and intra-expert low-rank decomposition. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 10456–10466, Miami, Florida, USA, November 2024.

  21. [21]

    American invitational mathematics examination (AIME) 2025

    Zhang, Y. and Math-AI, T. American invitational mathematics examination (AIME) 2025.

  22. [22]

    Puzzlemoe: Efficient compression of large mixture-of-experts models via sparse expert merging and bit-packed inference

    doi: 10.48550/arXiv.2407.09590. Zhao, Y., Wang, Z., and Zhang, M. Puzzlemoe: Efficient compression of large mixture-of-experts models via sparse expert merging and bit-packed inference. arXiv preprint arXiv:2511.04805.

  23. [23]

    A.1 Complete Process of Maximum Coverage Allocation Algorithm

    Contents. A Algorithms. A.1 Complete Process of Maximum Coverage Allocation Algorithm ...

  24. [24]

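    Step 3: Compute remaining quota and available blocks. The channel budget released by trimming and alignment is collected and segmented as units of a-blocks

    10: end for
    11: $N_\ell(\rho) \leftarrow \sum_{e \in E_\ell} N_{\ell,e}(\rho_{\ell,e})$
    12: if $N_\ell(\rho) - N_\ell^\star \le \varepsilon N_\ell^{tot}$ then
    13: $N_{\ell,e}^\star \leftarrow N_{\ell,e}(\rho_{\ell,e}), \forall e \in E_\ell$
    14: break
    15: end if
    16: if $N_\ell(\rho) > N_\ell^\star$ then
    17: $\alpha_{max} \leftarrow \alpha$
    18: else
    19: $\alpha_{min} \leftarrow \alpha$
    20: end if
    21: end while
    22: return $\{N_{\ell,e}^\star\}_{e \in E_\ell}$
    Step 2: Downward alignment. For each remaining expert $e \in A_\ell$, we round $\tilde{N}_{\ell,e}$ down to the nearest multiple of $a$, ensuring compatibil...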

  25. [25]

    We fine-tune the MoE blocks using DoRA (Liu et al., 2024b) with rank 32 and learning rate 1e−4, while adapting the routing module with rank 4 and learning rate 1e−6

    for 2 epochs. We fine-tune the MoE blocks using DoRA (Liu et al., 2024b) with rank 32 and learning rate 1e−4, while adapting the routing module with rank 4 and learning rate 1e−6. We use AdamW with warmup ratio 0.1 and clip gradients exceeding 0.5, without weight decay. All training is conducted on 4× H20 GPUs. The training cost is 12 GPU hours for Qwen1.5-...

  26. [26]

    The framework therefore supports flexible operating points depending on deployment constraints

    Stronger compression gives lower storage but larger accuracy drop, while milder compression preserves accuracy better. The framework therefore supports flexible operating points depending on deployment constraints. Footnote 2: https://github.com/EleutherAI/lm-evaluation-harness Footnote 3: https://github.com/open-compass/opencompass

  27. [27]

    C.4. Sensitivity and Robustness Analysis

    C.4. Sensitivity and Robustness Analysis. C.4.1. Sensitivity to Calibration Corpus. Our default setup follows common post-training compression practice, using C4 for general tasks, GSM8K for math, and OpenCodeReasoning for code. To examine the sensitivity systematically, we conduct an ablation on six calibration corpora: WikiText2, C4, Pile, RedPajama, GSM8K ...

  28. [28]

    reduces the parameters via low rank decomposition and assigns higher ranks to more important experts while using lower ranks for less important ones. However, the speedup is limited: the fragmentation into small kernels makes it difficult to reach peak throughput of one larger kernel, introducing additional overhead in kernel launching, cache hit, and mem...