Deterministic Differentiable Structured Pruning for Large Language Models

Jianfei Chen; Jun Zhou; Jun Zhu; Pengle Zhang; Weiyu Huang; Xiaolu Zhang

arxiv: 2603.08065 · v2 · submitted 2026-03-09 · 💻 cs.LG · cs.CL

Weiyu Huang, Pengle Zhang, Xiaolu Zhang, Jun Zhou, Jun Zhu, Jianfei Chen

Pith reviewed 2026-05-15 14:08 UTC · model grok-4.3

keywords structured pruning · large language models · differentiable optimization · l0 sparsity · mask optimization · inference efficiency

The pith

A deterministic mask-optimization method prunes large language models to 20% sparsity with as little as 1% performance loss on downstream tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that structured pruning of large language models can be improved by replacing stochastic relaxations of the l0 sparsity constraint with a deterministic soft surrogate that is optimized directly on the mask. This change removes the train-test mismatch that arises when sampled masks are discretized at deployment and allows masks to take a wider range of values. A sympathetic reader would care because large models are expensive to run at inference time, and a reliable way to remove 20% of components while keeping downstream performance nearly intact would lower those costs substantially. The approach is shown to work on both dense models and mixture-of-experts architectures while also producing faster convergence during the mask optimization phase.
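To make the train-test mismatch concrete, the sketch below contrasts a hard-concrete gate sampled during training with the noise-free gate typically used at deployment. It is illustrative only and is not taken from the paper: the stretch limits `GAMMA` and `ZETA`, the temperature `BETA`, and the `log_alpha` values are assumptions, chosen to follow constants commonly used with hard-concrete gates.

```python
import numpy as np

# Assumed constants (not from the paper): stretch limits and temperature
# commonly used with hard-concrete gates.
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def hard_concrete_sample(log_alpha, rng):
    """Training-time gate: a noisy sample, so every forward pass sees a different mask."""
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=log_alpha.shape)
    s = sigmoid((np.log(u) - np.log(1.0 - u) + log_alpha) / BETA)
    return np.clip(s * (ZETA - GAMMA) + GAMMA, 0.0, 1.0)


def hard_concrete_deploy(log_alpha):
    """Deployment-time gate: the noise-free estimate computed from log_alpha alone."""
    return np.clip(sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA, 0.0, 1.0)


rng = np.random.default_rng(0)
log_alpha = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])  # hypothetical gate parameters

train_masks = np.stack([hard_concrete_sample(log_alpha, rng) for _ in range(1000)])
deploy_mask = hard_concrete_deploy(log_alpha)

# Average absolute gap between what training saw and what deployment uses.
mismatch = np.abs(train_masks - deploy_mask).mean(axis=0)
print("deploy mask:", deploy_mask)
print("mean |train - deploy| per gate:", mismatch)
```

Under these assumptions, gates whose `log_alpha` is near zero swing between 0 and 1 during training while their deployed value sits near 0.5, which is the kind of gap a deterministic surrogate is meant to avoid.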

Core claim

We introduce Deterministic Differentiable Pruning (DDP), a mask-only method that directly optimizes a deterministic soft surrogate of the discrete l0 objective instead of using stochastic hard-concrete relaxations. This surrogate provides greater expressiveness and eliminates the train-test mismatch that occurs when stochastic masks are discretized for final use. When the resulting masks are applied to large language models, the pruned models retain performance within 1% of the dense baseline on downstream tasks at 20% sparsity and deliver measurable end-to-end inference speedups.

What carries the argument

The deterministic soft surrogate of the discrete l0 objective, which replaces stochastic sampling to enable direct, non-stochastic optimization of multiplicative gates under a sparsity constraint.

If this is right

Pruned models lose as little as 1% performance on downstream tasks at 20% sparsity.
The method outperforms prior stochastic approaches on the sparsity-performance tradeoff.
It applies to both dense and mixture-of-experts large language models.
Faster convergence occurs during the mask optimization stage.
End-to-end inference speedups appear in realistic deployment pipelines.
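The speedup item depends on a property of structured masks: once a gate is exactly zero, the corresponding channel can be physically removed rather than multiplied by zero. The sketch below checks this on a toy two-layer ReLU MLP. The MLP form, the dimensions, and the random `keep` mask are all assumptions for illustration; real LLM blocks use gated activations and the paper's masks are learned, not random.

```python
import numpy as np


def mlp(x, w_up, w_down, gate=None):
    """Toy two-layer ReLU MLP; `gate` multiplies each hidden channel if given."""
    h = np.maximum(x @ w_up, 0.0)
    if gate is not None:
        h = h * gate
    return h @ w_down


rng = np.random.default_rng(0)
d_model, d_hidden = 8, 40  # hypothetical sizes
x = rng.normal(size=(4, d_model))
w_up = rng.normal(size=(d_model, d_hidden))
w_down = rng.normal(size=(d_hidden, d_model))

# Hypothetical binary mask that drops 20% of the hidden channels.
keep = np.ones(d_hidden, dtype=bool)
keep[rng.choice(d_hidden, size=d_hidden // 5, replace=False)] = False

# Masked forward pass: full-size matrices, zeroed channels.
masked_out = mlp(x, w_up, w_down, gate=keep.astype(x.dtype))

# Sliced forward pass: smaller matrices, no mask needed at inference time.
sliced_out = mlp(x, w_up[:, keep], w_down[keep, :])

print("kept channels:", int(keep.sum()), "of", d_hidden)
print("outputs match:", np.allclose(masked_out, sliced_out))
```

The sliced matrices are dense and smaller, so they run on standard kernels without sparse-matrix support, which is why structured sparsity can translate into wall-clock gains.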

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The deterministic surrogate could be adapted to other discrete selection problems in machine learning where sampling noise creates similar deployment gaps.
Combining this pruning step with quantization might produce even larger efficiency gains than either technique alone.
Faster convergence opens the possibility of applying the method iteratively to refine already-pruned models without prohibitive extra compute.

Load-bearing premise

The deterministic soft surrogate of the l0 objective preserves enough expressiveness and does not introduce its own train-test mismatch or optimization artifacts once the mask is discretized for deployment.

What would settle it

A controlled experiment on the same model and tasks where the final discretized masks produced by the deterministic method yield higher downstream error than masks produced by a stochastic baseline would falsify the superiority claim.

Figures

Figures reproduced from arXiv: 2603.08065 by Jianfei Chen, Jun Zhou, Jun Zhu, Pengle Zhang, Weiyu Huang, Xiaolu Zhang.

**Figure 1.** Deterministic Differentiable Pruning overview. Left: Masked formulation for dense and MoE models. For dense models, we prune attention heads and MLP channels; for MoE models, we prune expert channels only. Right: Mask-only optimization with decoupled forward masks and retention scores for regularization, enabling deterministic training and an expanded mask range.

**Figure 2.** Deterministic surrogate mapping in DDP. Annealing µ sharpens the soft sigmoid projection, progressively approximating ℓ0 regularization. The resulting $s$ are then used in the Lagrangian regularization term to enforce the target keep ratio $\rho$. Denote $\bar{s} = \frac{1}{K}\sum_{k} s_k$ as their mean. The penalty is given as $\mathcal{L}_{\text{sparsity}}(s) = \lambda_1(\bar{s} - \rho) + \lambda_2(\bar{s} - \rho)^2$ (Eq. 13). Let $T$ denote the total number of training steps. During …
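The Figure 2 caption describes two pieces: a soft sigmoid projection sharpened by annealing µ, and a penalty λ1(s̄ − ρ) + λ2(s̄ − ρ)² on the mean retention score s̄. A minimal sketch of both follows. The penalty matches the caption; the exact form of the projection is an assumption here (a sigmoid of `mu * z`, with larger `mu` meaning sharper), and the logits `z` and multipliers `lam1`, `lam2` are hypothetical values.

```python
import numpy as np


def retention_scores(z, mu):
    """Soft sigmoid projection of mask logits z.

    Assumption: mu acts as a sharpness multiplier, so larger mu pushes
    scores toward the 0/1 indicator of z > 0.
    """
    return 1.0 / (1.0 + np.exp(-mu * z))


def sparsity_penalty(s, rho, lam1, lam2):
    """Penalty from the caption: lam1 * (mean(s) - rho) + lam2 * (mean(s) - rho)**2."""
    gap = s.mean() - rho
    return lam1 * gap + lam2 * gap ** 2


z = np.array([-2.0, -0.5, -0.1, 0.3, 1.5])  # hypothetical mask logits
rho = 0.4                                   # target keep ratio
l0_ratio = float((z > 0).mean())            # exact fraction of kept components

for mu in (1.0, 10.0, 100.0):
    s = retention_scores(z, mu)
    penalty = sparsity_penalty(s, rho, lam1=1.0, lam2=1.0)
    print(f"mu={mu:6.1f}  mean score={s.mean():.4f}  l0 ratio={l0_ratio:.1f}  penalty={penalty:.6f}")
```

As `mu` grows, the mean score approaches the exact keep fraction, so the penalty increasingly constrains the true count of retained components. In a Lagrangian setup the multipliers are usually updated during training; they are held fixed here for brevity.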

**Figure 3.** Effect of different training tokens on perplexity and zero-shot mean accuracy on different models.

**Figure 4.** Dense-model sparsity patterns (LLaMA-7B, 20% sparsity). Left: layer-wise MLP channel sparsity. Right: learned head sparsity map. Panel (a): MLP channel sparsity (per layer), plotting the channel-mask zero count against layer index together with the average ratio. Second panel: "Head Sparsity Pattern", head index against layer index, colored by mask value.

**Figure 5.** Dense-model sparsity patterns (LLaMA-7B, 50% sparsity). Sparsity increases toward later layers, and head pruning becomes more selective under higher compression. Extracted panels: "Nonzero Ratio in Each Expert" heatmaps (layer index against expert index, colored by nonzero ratio); (a) 20% sparsity; further panels truncated …

**Figure 6.** MoE expert sparsity patterns (DeepSeekMoE-16B). Expert-wise sparsity maps at different target sparsities, showing that pruning increasingly concentrates on rarely activated experts.

Original abstract

Structured pruning reduces LLM inference cost by removing low-importance architectural components. This can be viewed as learning a multiplicative gate for each component under an l0 sparsity constraint. Due to the discreteness of the l0 norm, prior work typically adopts stochastic hard-concrete relaxations to enable differentiable optimization; however, this stochasticity can introduce a train-test mismatch when sampled masks are discretized for deployment and restricts masks to a bounded, near-binary range. To address this, we propose Deterministic Differentiable Pruning (DDP), a mask-only optimization method that eliminates stochasticity by directly optimizing a deterministic soft surrogate of the discrete l0 objective. Compared with prior approaches, DDP offers greater expressiveness, reduced train-test mismatch, and faster convergence. We apply our method to several dense and MoE models, including Qwen3-32B and Qwen3-30B-A3B, achieving a performance loss as small as 1% on downstream tasks while outperforming previous methods at 20% sparsity. We further demonstrate end-to-end inference speedups in realistic deployment settings with vLLM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DDP swaps stochastic hard-concrete for a deterministic soft surrogate in structured LLM pruning and reports usable numbers on Qwen3-32B, but the surrogate equation and discretization step stay unspecified.

The paper's main move is to drop the usual stochastic sampling in l0 pruning and instead optimize a deterministic soft surrogate for the mask. The authors argue this removes train-test mismatch, increases expressiveness, and speeds convergence. They test on Qwen3-32B and a 30B MoE model, claim roughly 1% downstream loss at 20% sparsity, and show real vLLM speedups. Running at that scale and including end-to-end timing is better than most pruning papers manage.

Referee Report

3 major / 2 minor

Summary. The paper proposes Deterministic Differentiable Pruning (DDP) as a mask-only optimization procedure for structured pruning of LLMs. It replaces stochastic hard-concrete relaxations of the l0 objective with a deterministic soft surrogate, claiming this yields greater expressiveness, reduced train-test mismatch upon discretization, and faster convergence. Experiments on dense and MoE models including Qwen3-32B and Qwen3-30B-A3B report downstream performance loss as low as 1% at 20% sparsity while outperforming prior methods, plus end-to-end inference speedups in vLLM.

Significance. If the deterministic surrogate can be shown to preserve mask expressiveness for structured components while producing discretized masks whose behavior closely matches the training objective, the approach would simplify pruning pipelines by removing sampling variance and potentially improve reproducibility. Demonstrated results on 30B-scale models would be practically relevant for inference-cost reduction.

major comments (3)

[Abstract] Abstract: the functional form of the 'deterministic soft surrogate of the discrete l0 objective' and the precise discretization operator applied at deployment are never stated. Without these, it is impossible to verify whether the surrogate remains sufficiently expressive for structured masks or whether it avoids introducing its own train-test mismatch or optimization artifacts.
[Abstract] Abstract and experimental claims: no equations, no ablation on the soft approximation, and no error bars or statistical tests are provided for the reported wins on Qwen3 models. The central performance claims therefore rest on unreported implementation details.
[Method] Method description: the paper asserts that DDP 'eliminates stochasticity' and 'reduces train-test mismatch' relative to hard-concrete relaxations, yet supplies no analysis of the mask distribution at convergence or comparison of the final discretized mask behavior against the training objective.

minor comments (2)

[Abstract] The abstract mentions 'several dense and MoE models' but does not list the full set of evaluated architectures or sparsity levels beyond the 20% figure.
[Method] No discussion of how the deterministic surrogate interacts with the structured pruning constraints (e.g., group-level or expert-level masks in MoE models).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us improve the clarity and completeness of our manuscript. We address each major comment below.

Referee: [Abstract] Abstract: the functional form of the 'deterministic soft surrogate of the discrete l0 objective' and the precise discretization operator applied at deployment are never stated. Without these, it is impossible to verify whether the surrogate remains sufficiently expressive for structured masks or whether it avoids introducing its own train-test mismatch or optimization artifacts.

Authors: We agree that these details were insufficiently highlighted in the abstract. In the revised version, we explicitly state the functional form of the deterministic soft surrogate as a scaled sigmoid function and specify the discretization as a threshold-based operator. These are now included in the abstract and elaborated in the method section to allow verification of expressiveness and mismatch reduction. revision: yes
Referee: [Abstract] Abstract and experimental claims: no equations, no ablation on the soft approximation, and no error bars or statistical tests are provided for the reported wins on Qwen3 models. The central performance claims therefore rest on unreported implementation details.

Authors: We have added the key equations for DDP to the main body of the paper. An ablation study on the soft approximation is now provided in the appendix. For the Qwen3 results, we have included error bars from multiple random seeds and conducted paired statistical tests to support the performance claims. revision: yes
Referee: [Method] Method description: the paper asserts that DDP 'eliminates stochasticity' and 'reduces train-test mismatch' relative to hard-concrete relaxations, yet supplies no analysis of the mask distribution at convergence or comparison of the final discretized mask behavior against the training objective.

Authors: To address this, we have added an analysis of the mask distributions at convergence, including histograms and KL-divergence metrics between training and discretized masks. This demonstrates that DDP achieves lower train-test mismatch compared to stochastic baselines, as the deterministic nature allows direct optimization without sampling variance. revision: yes
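One way to picture the kind of analysis promised in this response is to compare soft retention scores against their thresholded version. The sketch below is not the authors' analysis: it uses a plain mean absolute gap and a histogram instead of the KL-divergence metrics mentioned, and the 0.5 threshold, the sigmoid sharpness values, and the random logits are all assumptions.

```python
import numpy as np


def discretize(scores, threshold=0.5):
    """Threshold-based discretization (the threshold value is an assumption)."""
    return (scores > threshold).astype(scores.dtype)


def mismatch_report(scores, bins=10):
    """Summarize how far soft scores sit from the binary mask used at deployment."""
    hard = discretize(scores)
    hist, _ = np.histogram(scores, bins=bins, range=(0.0, 1.0))
    return {
        "mean_abs_gap": float(np.abs(scores - hard).mean()),
        "keep_ratio_soft": float(scores.mean()),
        "keep_ratio_hard": float(hard.mean()),
        "histogram": hist,
    }


rng = np.random.default_rng(0)
z = rng.normal(size=1000)  # hypothetical mask logits

# Same logits under a blurry and a sharp sigmoid (sharpness values are illustrative).
blurry = mismatch_report(1.0 / (1.0 + np.exp(-1.0 * z)))
sharp = mismatch_report(1.0 / (1.0 + np.exp(-50.0 * z)))

print("blurry gap:", round(blurry["mean_abs_gap"], 4), "histogram:", blurry["histogram"])
print("sharp gap: ", round(sharp["mean_abs_gap"], 4), "histogram:", sharp["histogram"])
```

A histogram that piles up in the first and last bins, together with a small gap, would indicate that thresholding changes little; mass in the middle bins would indicate that discretization still alters the mask the model was trained with.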

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces Deterministic Differentiable Pruning (DDP) as a new mask-only optimization procedure that directly optimizes a deterministic soft surrogate of the discrete l0 objective, presented as independent of prior stochastic relaxations. No equations, parameter fits, or self-citations are shown that reduce the surrogate definition, discretization behavior, or reported performance gains (e.g., 1% loss at 20% sparsity on Qwen3 models) to quantities defined by the authors' own earlier inputs or self-referential constructions. The central claims rest on empirical comparisons and deployment results rather than definitional equivalence or fitted-input predictions, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the existence of a deterministic differentiable surrogate that can stand in for the discrete l0 norm without additional fitted parameters beyond standard training hyperparameters.

axioms (1)

Domain assumption: The l0 norm can be approximated by a deterministic soft function whose gradient is usable for end-to-end optimization.
Invoked in the description of DDP as the core replacement for stochastic relaxations.
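The domain assumption can be written out for one concrete choice of soft function. The form below is an assumption for illustration (a sigmoid with sharpness parameter µ); the paper's exact parameterization is not given on this page.

```latex
\phi(z;\mu) = \frac{1}{1 + e^{-\mu z}}, \qquad
\lim_{\mu \to \infty} \phi(z;\mu) = \mathbb{I}[z > 0] \quad (z \neq 0), \qquad
\frac{\partial \phi}{\partial z} = \mu\,\phi(z;\mu)\bigl(1 - \phi(z;\mu)\bigr).

\sum_{k=1}^{K} \phi(z_k;\mu) \;\xrightarrow[\mu \to \infty]{}\; \sum_{k=1}^{K} \mathbb{I}[z_k > 0]
\quad \text{(the number of retained components, when no } z_k = 0\text{)}.
```

For finite µ the gradient is nonzero everywhere, which is what makes the surrogate usable for optimization. For any fixed z ≠ 0 the gradient vanishes as µ grows, which is one reason to anneal µ gradually instead of starting from a sharp projection.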

invented entities (1)

Deterministic soft surrogate mask (no independent evidence)
purpose: Replace stochastic sampling in structured pruning optimization
New construct introduced to eliminate train-test mismatch; no independent falsifiable prediction outside the pruning task is given.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 2 internal anchors

[1]

Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., Zhang, X., Yu, X., Wu, Y., Wu, Z. F., Gou, Z., Shao, Z., Li, Z., Gao, Z., Liu, A., Xue, B., Wang, B., Wu, B., Feng, B., Lu, C., Zhao, C., Deng, C., Ruan, C., Dai, D., Chen, D., Ji, D., Li, E., Lin, F., Dai, F., Luo, F., Hao, G., Chen, G., Li, G., Zhang, H., Xu, ...

doi: 10.1038/s41586-025-09422-z
[2]

How do LLMs use their depth?

Gupta, A., Yeung, J., Anumanchipalli, G., and Ivanova, A. How do LLMs use their depth? arXiv preprint arXiv:2510.18871, 2025.
[3]

Pruning large language models with semi-structural adaptive sparse training

Huang, W., Hu, Y., Jian, G., Zhu, J., and Chen, J. Pruning large language models with semi-structural adaptive sparse training. Proceedings of the AAAI Conference on Artificial Intelligence, 39(23):24167–24175, Apr. 2025a. doi: 10.1609/aaai.v39i23.34592.

Huang, W., Hu, Y., Zhu, J., and Chen, J. Cast: Continuous and differentiable semi-structured spar...
[4]

LoRAP: Transformer sub-layers deserve differentiated structured compression for large language models

Li, G., Tang, Y., and Zhang, W. LoRAP: Transformer sub-layers deserve differentiated structured compression for large language models. arXiv preprint arXiv:2404.09695, 2024.
[5]

HEAPr: Hessian-based efficient atomic expert pruning in output space

Li, K., Yang, Z., Zhou, Z., Xue, F., Jiang, Z., and Wang, W. HEAPr: Hessian-based efficient atomic expert pruning in output space. arXiv preprint arXiv:2509.22299, 2025.
[6]

LLaMA: Open and Efficient Foundation Language Models

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[7]

GLUE: A multi-task benchmark and analysis platform for natural language understanding

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353–355, 2018.
[8]

Structured pruning of large language models

Wang, Z., Wohlwend, J., and Lei, T. Structured pruning of large language models. In Webber, B., Cohn, T., He, Y., and Liu, Y. (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, November 2020. Association for Computational Linguistics.
[9]

Sheared LLaMA: Accelerating language model pre-training via structured pruning

Xia, M., Gao, T., Zeng, Z., and Chen, D. Sheared LLaMA: Accelerating language model pre-training via structured pruning. In 12th International Conference on Learning Representations, ICLR 2024.
[10]

CAMERA: Multi-matrix joint compression for MoE models via micro-expert redundancy analysis

Xu, Y., Han, X., Zhang, Y., Wang, Y., Liu, Y., Ji, S., Zhu, Q., and Che, W. CAMERA: Multi-matrix joint compression for MoE models via micro-expert redundancy analysis. arXiv preprint arXiv:2508.02322, 2025.
[11]

LoRAPrune: Structured pruning meets low-rank parameter-efficient fine-tuning

Zhang, M., Chen, H., Shen, C., Yang, Z., Ou, L., Yu, X., and Zhuang, B. LoRAPrune: Structured pruning meets low-rank parameter-efficient fine-tuning. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 3013–3026, 2024.
[12]

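Hence $\bigl| S_{\mu_r}(z^{(r)}) - |A^{(r)}| \bigr| = \bigl| \sum_{k=1}^{K} \phi(z_k^{(r)};\mu_r) - \sum_{k=1}^{K} \mathbb{I}[z_k^{(r)} > 0] \bigr| \le \tfrac{1}{2}$

For such $r$, using monotonicity of $\phi(\cdot;\mu_r)$,

$$\sum_{k \in A^{(r)}} \phi(z_k^{(r)};\mu_r) \ge |A^{(r)}| \left(1 - \frac{1}{4K}\right), \qquad \sum_{k \notin A^{(r)}} \phi(z_k^{(r)};\mu_r) \le \bigl(K - |A^{(r)}|\bigr) \frac{1}{4K}.$$

Hence

$$\Bigl| S_{\mu_r}(z^{(r)}) - |A^{(r)}| \Bigr| = \Bigl| \sum_{k=1}^{K} \phi(z_k^{(r)};\mu_r) - \sum_{k=1}^{K} \mathbb{I}[z_k^{(r)} > 0] \Bigr| \le \frac{1}{2}.$$

But $S_{\mu_r}(z^{(r)}) = P$ and $|A^{(r)}|$ are integers, so the only possibility is $|A^{(r)}| = P$, i.e., $\sum_{k=1}^{K} \mathbb{I}[z_k^{(r)} > 0] = P$. Finally, (A3) (e.g., Proposition B.2) ...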
[13]

U-shaped

Dense models (LLaMA-7B). At 20% sparsity, head pruning is conservative, with most heads retained and sparsity concentrated in a small subset of layers/heads (Figure 4b). At 50% sparsity, the head map becomes markedly more selective and structured (Figure 5b), indicating substantial redundancy in multi-head attention and a tendency to concentrate capacity i...
