Beyond FLOPs: Benchmarking Real Inference Acceleration of LLM Pruning under a GEMM-Centric Taxonomy

Anhao Zhao; Hao Wu; Haozhe Hu; Longwei Ding; Peiran Yin; Xiaoyu Shen; Yunpu Ma

arxiv: 2606.09080 · v1 · pith:NZQWVV3Wnew · submitted 2026-06-08 · 💻 cs.LG · cs.CL


Pith reviewed 2026-06-27 17:01 UTC · model grok-4.3


keywords: LLM pruning, inference acceleration, GEMM taxonomy, Pareto frontier, depth pruning, width pruning, dynamic pruning, memory-bound inference


The pith

Static depth pruning stays closest to theoretical acceleration limits and leads the Pareto frontier for LLM inference at low quality loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a taxonomy that groups LLM pruning methods by how they shrink the M, N, or K dimensions of matrix multiplications. It then runs a single benchmarking setup to measure actual wall-clock speedups instead of counting operations. The measurements reveal that static depth pruning delivers the best realized acceleration for a given drop in model quality when the workload is memory-bound. As allowed quality loss increases during the prefill phase, the best method shifts first to dynamic depth pruning and later to static width pruning. These patterns give a practical map of which pruning families are worth implementing on current hardware.

Core claim

Static depth pruning remains the strongest Pareto-optimal baseline and stays closest to its theoretical acceleration upper bound in memory-bounded scenarios. During prefill, the frontier transitions from static depth at low quality loss (0%–4%), to dynamic depth at moderate loss (5%–16%), and finally to static width pruning at higher loss levels (17%–26%).
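To make the notion of an acceleration-quality Pareto frontier concrete, here is a minimal sketch that filters a set of (quality loss, speedup) measurements down to the non-dominated ones. The data points and the `pareto_frontier` helper are hypothetical and invented for illustration; they are not results or code from the paper.

```python
from typing import List, Tuple

# (method label, quality loss in percent, speedup over dense)
Point = Tuple[str, float, float]


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates.

    A point dominates another if its quality loss is no higher and its
    speedup is no lower, with at least one of the two strictly better.
    """
    frontier = []
    for label, loss, speed in points:
        dominated = any(
            (l2 <= loss and s2 >= speed) and (l2 < loss or s2 > speed)
            for _, l2, s2 in points
        )
        if not dominated:
            frontier.append((label, loss, speed))
    return sorted(frontier, key=lambda p: p[1])


# Hypothetical measurements, for illustration only.
points = [
    ("static-depth", 2.0, 1.30),
    ("dynamic-depth", 3.0, 1.20),   # dominated by static-depth at 2%
    ("dynamic-depth", 10.0, 1.60),
    ("static-width", 12.0, 1.50),   # dominated by dynamic-depth at 10%
    ("static-width", 20.0, 1.90),
]
frontier = pareto_frontier(points)
for label, loss, speed in frontier:
    print(f"{label}: loss={loss}%, speedup={speed}x")
```

With these invented numbers the surviving labels change as the loss budget grows, which is the shape of transition the core claim describes; the actual thresholds come only from the paper's measurements.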

What carries the argument

GEMM-centric taxonomy that reclassifies every pruning technique by the logical M, N, or K dimension it reduces in general matrix multiplication.
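As a rough illustration of why a FLOP count alone cannot separate the three dimensions, the sketch below assumes the usual GEMM convention Y = XW with X of shape M×K and W of shape K×N. The layer shapes and the `prune` helper are hypothetical, and the paper's own mapping of pruning families onto dimensions is not reproduced here.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """FLOPs of Y = X @ W with X: (m, k) and W: (k, n).

    Each of the m * n outputs needs k multiplies and k adds.
    """
    return 2 * m * n * k


def prune(dim: int, sparsity: float) -> int:
    """Size of a dimension after removing a `sparsity` fraction of it."""
    return int(round(dim * (1.0 - sparsity)))


# Hypothetical shapes: 1024 tokens through a 4096 -> 14336 projection.
M, N, K = 1024, 14336, 4096
dense = gemm_flops(M, N, K)

ratios = {
    "M": gemm_flops(prune(M, 0.5), N, K) / dense,
    "N": gemm_flops(M, prune(N, 0.5), K) / dense,
    "K": gemm_flops(M, N, prune(K, 0.5)) / dense,
}
for dim, ratio in ratios.items():
    print(f"50% pruning along {dim}: {ratio:.2f} of dense FLOPs")
```

All three ratios are identical, so any difference in realized speedup between the dimensions has to come from memory traffic and kernel behavior, which is what wall-clock measurement exposes.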

If this is right

Static depth pruning should be the default starting point for memory-bound inference when quality loss must stay under 4 percent.
Dynamic depth methods become preferable once moderate quality loss is acceptable.
Static width pruning only justifies its implementation cost at high quality-loss budgets above 17 percent.
Future pruning work can target the specific GEMM dimension that matches the desired operating point on the frontier.
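The "memory-bound" qualifier in the points above can be checked with a simple roofline estimate. The sketch below is an idealized model under stated assumptions: it counts one read of each input and one write of the output, ignores caches, attention, and launch overhead, and uses hypothetical hardware peaks rather than any figure from the paper.

```python
def gemm_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for Y = X @ W, reading X and W once and writing Y once."""
    flops = 2 * m * n * k
    traffic = bytes_per_elem * (m * k + k * n + m * n)
    return flops / traffic


def is_memory_bound(m: int, n: int, k: int,
                    peak_flops: float, peak_bandwidth: float,
                    bytes_per_elem: int = 2) -> bool:
    """True when arithmetic intensity is below the hardware ridge point."""
    ridge = peak_flops / peak_bandwidth
    return gemm_intensity(m, n, k, bytes_per_elem) < ridge


# Hypothetical accelerator: 300 TFLOP/s and 2 TB/s, so the ridge is 150 FLOPs/byte.
PEAK_FLOPS = 300e12
PEAK_BW = 2e12

decode = is_memory_bound(1, 4096, 4096, PEAK_FLOPS, PEAK_BW)       # one token
prefill = is_memory_bound(4096, 4096, 4096, PEAK_FLOPS, PEAK_BW)   # 4096 tokens
print("decode memory-bound:", decode)
print("prefill memory-bound:", prefill)
```

In this model a single-token GEMM is dominated by reading the K×N weight matrix, so shrinking M barely changes the bytes moved, while removing whole layers removes their weight reads entirely.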

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hardware designers could prioritize kernels that accelerate the depth-pruned case first, since that region dominates low-loss regimes.
The same taxonomy could be applied to quantization or speculative decoding to see whether similar frontier transitions appear.
If memory bandwidth improves faster than compute, the width-pruning region may shrink and depth methods could remain dominant longer.

Load-bearing premise

The GEMM-centric taxonomy and unified benchmarking framework capture the dominant execution behaviors of all pruning families in an implementation-consistent manner without being dominated by unmodeled kernel or hardware specifics.

What would settle it

If measurements on the same models but with a different kernel library or GPU show width pruning beating depth pruning at quality losses below 4 percent, the reported frontier ordering would be falsified.

Figures

Figures reproduced from arXiv: 2606.09080 by Anhao Zhao, Hao Wu, Haozhe Hu, Longwei Ding, Peiran Yin, Xiaoyu Shen, Yunpu Ma.

**Figure 2.** Llama3.1-8B’s latency breakdown on each for …

**Figure 3.** An overview of MNK-dimension pruning and their propagation through attention and FFN layers. The empty blocks denote zero-filled elements that make no contributions to the computation, lavender blocks denote deviations compared to the dense results. The head-wise pruning can also be treated as M-dimension pruning over each head with head dim set to 1, where sparse attention can be treated as pruning over k…

**Figure 4.** The execution pipeline of dynamic M, with the number of tokens set to 8 and the M-axis tile size set to 4; blocks that contain any active tokens will be executed.

… implementation performs mask-to-indices conversion and gather–scatter fusion with indices $I$: $K(X, W, I) = \mathrm{Scatter}_I(\mathrm{OP}(\mathrm{Gather}_I(X), W))$. Thus an additional mask-reordering preprocessing scheme is critical for such tile-based intra-kernel skipping (…
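The tile-skipping rule and the gather–scatter form in the Figure 4 caption can be sketched in plain Python. This is an illustrative toy, not the paper's kernel: `dense_matmul` stands in for OP, pruned rows are assumed to be left as zeros, and sorting the mask is only one hypothetical way to reorder active tokens.

```python
from typing import List

Matrix = List[List[float]]


def active_tiles(mask: List[bool], tile: int) -> int:
    """Count M-axis tiles that contain at least one active token."""
    return sum(any(mask[i:i + tile]) for i in range(0, len(mask), tile))


def mask_to_indices(mask: List[bool]) -> List[int]:
    """Convert a boolean token mask into the index list I."""
    return [i for i, keep in enumerate(mask) if keep]


def dense_matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain dense X @ W, standing in for OP."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def gather_op_scatter(x: Matrix, w: Matrix, idx: List[int]) -> Matrix:
    """Scatter_I(OP(Gather_I(X), W)); rows outside I stay zero (an assumption)."""
    out = [[0.0] * len(w[0]) for _ in x]
    compact = dense_matmul([x[i] for i in idx], w)   # gather, then compute
    for dst, row in zip(idx, compact):               # scatter back
        out[dst] = row
    return out


# 8 tokens, tile size 4, two active tokens that sit in different tiles.
mask = [True, False, False, False, False, True, False, False]
tiles_before = active_tiles(mask, 4)
tiles_after = active_tiles(sorted(mask, reverse=True), 4)  # active tokens first
print(tiles_before, tiles_after)

x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
y = gather_op_scatter(x, w, mask_to_indices([True, False, True]))
print(y)
```

With the scattered mask both tiles must run even though only two of eight tokens are active; after reordering, one tile suffices, which is why the caption calls mask reordering critical for tile-based skipping.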

**Figure 5.** (a-d) Prefill and decode speedup on Llama3.1-8B under 50% sparsity. (e-f) Speedup over different sparsity, …

**Figure 6.** Breakdown of wall-clock time for all operations in one layer. The …

**Figure 7.** Prefill throughput with 50% sparsity and dif…

**Figure 8.** Prefill & decode speedup on vanilla static …

**Figure 9.** Speedup over different sparsity, and their gap with the theoretical upper bound. Context length is set to …
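One simple way to read the "gap with the theoretical upper bound" in the Figure 9 caption is to compare measured speedup against an ideal in which latency scales linearly with the remaining work. That ideal, 1 / (1 − sparsity), is an assumption made here for illustration; the paper may define its bound differently, and the latencies below are invented.

```python
def ideal_speedup(sparsity: float) -> float:
    """Upper bound if latency shrank in exact proportion to the work removed."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    return 1.0 / (1.0 - sparsity)


def realized_fraction(dense_ms: float, pruned_ms: float, sparsity: float) -> float:
    """Measured speedup divided by the ideal speedup (1.0 means no gap)."""
    return (dense_ms / pruned_ms) / ideal_speedup(sparsity)


# Sparsity levels that appear in the figure captions.
for s in (0.125, 0.25, 0.375, 0.5):
    print(f"sparsity {s:.3f}: ideal speedup {ideal_speedup(s):.3f}x")

# Hypothetical latencies: 100 ms dense, 60 ms at 50% sparsity.
fraction = realized_fraction(100.0, 60.0, 0.5)
print(f"realized fraction of ideal: {fraction:.3f}")
```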

**Figure 10.** Inference speedup on Llama3.1-8B with different pruning strategies, with sparsity set to 12.5%.

**Figure 11.** Inference speedup on Llama3.1-8B with different pruning strategies, with sparsity set to 25%.

**Figure 12.** Inference speedup on Llama3.1-8B with different pruning strategies, with sparsity set to 37.5%.

**Figure 13.** Inference speedup on Qwen3-14B with different pruning strategies, with sparsity set to 50%.

**Figure 14.** Inference speedup on Llama3.1-8B with different pruning strategies, with sparsity set to A800-80G.

Abstract

Pruning has emerged as a dominant paradigm for accelerating large language model (LLM) inference, spanning a broad spectrum of methods that remove computation across tokens, layers, heads, dimensions, and attention patterns. Despite sharing the same objective, these pruning approaches induce fundamentally different execution behaviors, causing realized speedups to depend heavily on hardware and kernel implementations. Consequently, the practical acceleration benefits of different pruning families remain poorly understood. In this work, we introduce a GEMM-centric taxonomy that reorganizes existing pruning methods according to the logical M, N, and K dimensions of general matrix multiplication (GEMM). Leveraging this abstraction, we build a unified benchmarking framework that enables implementation-consistent comparison across the pruning design space and systematically characterizes the acceleration-quality Pareto frontier. Our results show that static depth pruning remains the strongest Pareto-optimal baseline and stays closest to its theoretical acceleration upper bound in memory-bounded scenarios. During prefill, the frontier transitions from static depth at low quality loss (0%–4%), to dynamic depth at moderate loss (5%–16%), and finally to static width pruning at higher loss levels (17%–26%). These findings establish the first unified view of the practical limits of pruning-based LLM acceleration and provide guidance for future pruning research. Code is available at https://github.com/EIT-NLP/LLM-Pruning/tree/main/PruningInferSim

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The GEMM taxonomy and unified simulator give a cleaner way to compare pruning families on real speedups, but the reported Pareto transitions rest on how well that simulator matches actual kernels.


The paper's main contribution is reorganizing pruning methods by their effect on GEMM M/N/K dimensions and then measuring realized inference speed on one framework instead of relying on FLOPs counts. That produces concrete guidance: static depth pruning stays closest to its theoretical bound in memory-bound cases, with the optimal choice shifting to dynamic depth then static width as quality loss increases.

The work is useful because it ships code and runs the same measurement protocol across families, which is more informative than scattered prior results. The prefill-stage transitions at 0-4%, 5-16%, and 17-26% loss are the kind of numbers practitioners can test against.

The soft spot is that everything flows through their simulator. If the dense GEMM baseline is applied uniformly, methods that could use sparse kernels or different memory layouts may appear worse than they would on real hardware. The stress-test concern about unmodeled kernel effects is real; without more detail on how the simulator handles sparsity or specific accelerators, the exact transition points could move. The abstract does not spell out error bars or data exclusion rules, so the strength of the claims depends on how thoroughly the full methods section addresses those.

This is worth a serious referee for the model-compression community. People building deployment pipelines will get value from the taxonomy even if they later re-measure on their own stack. I would send it to review rather than desk reject.

Referee Report

2 major / 1 minor

Summary. The paper introduces a GEMM-centric taxonomy that reorganizes LLM pruning methods according to the logical M, N, and K dimensions of matrix multiplication. It develops a unified benchmarking framework (with code released) to enable implementation-consistent comparisons across pruning families and uses it to map the acceleration-quality Pareto frontier for prefill and decode phases on LLMs. The central empirical claim is that static depth pruning is the strongest Pareto-optimal baseline and remains closest to its theoretical upper bound in memory-bounded regimes, with the frontier transitioning to dynamic depth pruning at moderate quality loss (5%–16%) and to static width pruning at higher loss (17%–26%).

Significance. If the unified framework delivers truly implementation-consistent measurements, the work supplies the first systematic, cross-family view of realized versus theoretical acceleration limits for pruning-based LLM inference. The release of code is a concrete strength that supports reproducibility and allows independent verification of the reported transitions.

major comments (2)

[Abstract / Methods (GEMM-centric taxonomy and simulator)] The claim of 'implementation-consistent comparison' across pruning families is load-bearing for all reported Pareto transitions and the superiority of static depth pruning. The manuscript must explicitly document whether the simulator applies a uniform dense GEMM baseline to width-pruning methods (which could otherwise exploit sparsity) or incorporates family-specific kernels; otherwise the acceleration gaps and the 17%–26% transition point may reflect modeling choices rather than inherent pruning behavior.
[Results (prefill frontier transitions)] The specific quality-loss thresholds (0%–4% static depth, 5%–16% dynamic depth, 17%–26% static width) are presented without accompanying error bars, model/hardware sensitivity analysis, or explicit data-exclusion rules. These details are required to confirm that the reported transitions are robust rather than sensitive to post-hoc selection or hardware-specific bias.

minor comments (1)

[Abstract] The abstract footnote states code availability, but the main text should include a brief pointer to the exact repository path and commit used for the reported experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify key aspects of our GEMM-centric taxonomy and benchmarking framework. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our results.


Referee: [Abstract / Methods (GEMM-centric taxonomy and simulator)] The claim of 'implementation-consistent comparison' across pruning families is load-bearing for all reported Pareto transitions and the superiority of static depth pruning. The manuscript must explicitly document whether the simulator applies a uniform dense GEMM baseline to width-pruning methods (which could otherwise exploit sparsity) or incorporates family-specific kernels; otherwise the acceleration gaps and the 17%–26% transition point may reflect modeling choices rather than inherent pruning behavior.

Authors: We agree that explicit documentation is required to support the implementation-consistent claim. Our simulator applies a uniform dense GEMM baseline to all pruning families (including width pruning) to isolate the effects of the logical M/N/K reductions under the taxonomy, rather than confounding results with family-specific sparse kernels. This design choice ensures fair cross-family comparison focused on pruning-induced dimension changes. We will add a new subsection in Methods (and update the abstract if needed) that details the simulator architecture, confirms the uniform dense baseline, and explains why family-specific kernels were not used. This revision will directly address the concern that gaps may stem from modeling choices. revision: yes
Referee: [Results (prefill frontier transitions)] The specific quality-loss thresholds (0%–4% static depth, 5%–16% dynamic depth, 17%–26% static width) are presented without accompanying error bars, model/hardware sensitivity analysis, or explicit data-exclusion rules. These details are required to confirm that the reported transitions are robust rather than sensitive to post-hoc selection or hardware-specific bias.

Authors: We will incorporate the requested robustness elements. The thresholds were obtained by aggregating Pareto-optimal points across multiple LLMs (Llama-2/3 variants) and hardware configurations in the prefill phase. In revision, we will add error bars (standard deviation over 5 runs per configuration), a sensitivity analysis subsection examining variations across model sizes, batch sizes, and two hardware platforms, and an explicit statement of data-exclusion rules (e.g., configurations with >20% variance or incomplete kernel support were excluded). These will appear in the main Results and an expanded appendix to demonstrate that the frontier transitions (static depth → dynamic depth → static width) are stable. revision: yes
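The reporting protocol described in this response (repeated runs, error bars, and a variance-based exclusion rule) could look roughly like the sketch below. It is an assumption-laden illustration: ">20% variance" is interpreted here as a coefficient of variation above 0.20, `summarize_runs` is a hypothetical helper, and the latency samples are invented.

```python
import statistics
from typing import List, Optional, Tuple


def summarize_runs(latencies_ms: List[float],
                   max_cv: float = 0.20) -> Optional[Tuple[float, float]]:
    """Return (mean, sample standard deviation) for repeated latency runs.

    Returns None when the coefficient of variation (std / mean) exceeds
    `max_cv`, which marks the configuration as excluded.
    """
    mean = statistics.mean(latencies_ms)
    std = statistics.stdev(latencies_ms)  # sample std, needs >= 2 runs
    if std / mean > max_cv:
        return None
    return mean, std


# Hypothetical latencies (ms) from 5 runs of two configurations.
kept = summarize_runs([10.0, 10.2, 9.8, 10.1, 9.9])
dropped = summarize_runs([10.0, 20.0, 5.0, 15.0, 10.0])

if kept is not None:
    print(f"latency: {kept[0]:.2f} ± {kept[1]:.2f} ms")
print("noisy configuration excluded:", dropped is None)
```

Stating the exclusion threshold in code form makes it easy to report how many configurations were dropped, which is the information the referee asks for.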

Circularity Check

0 steps flagged

No circularity: claims rest on empirical measurements


The paper's core contribution is a GEMM-centric taxonomy for reorganizing pruning methods by M/N/K dimensions, followed by construction of a unified benchmarking framework whose outputs are direct runtime and quality measurements. The reported Pareto frontiers and transitions (static depth at low loss, dynamic depth at moderate loss, static width at high loss) are presented as observed results from this framework, not as predictions derived from equations that reduce to the inputs by construction. No self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described methodology. The framework is an implementation choice whose fidelity is an external validity question, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the domain assumption that GEMM operations dominate LLM inference latency and introduces a new classification scheme; no free parameters or invented physical entities are required by the central claims in the abstract.

axioms (1)

domain assumption: GEMM operations are the primary bottleneck in LLM inference
The entire taxonomy is constructed around the M, N, K dimensions of GEMM, presupposing this captures execution behavior across pruning families.



work page doi:10.36227/techrxiv.177220375.55495124/v1
[70]

2026 , eprint=

ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention , author=. 2026 , eprint=

2026
