pith. machine review for the scientific record.

arxiv: 2604.09595 · v1 · submitted 2026-03-05 · 💻 cs.DC · cs.AI

Recognition: 1 theorem link

· Lean Theorem

Why Smaller Is Slower? Dimensional Misalignment in Compressed LLMs

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 15:36 UTC · model grok-4.3

classification 💻 cs.DC cs.AI
keywords LLM compression · dimensional misalignment · GPU performance · model inference · knapsack optimization · singular value decomposition

The pith

Compressed LLMs often run no faster than uncompressed ones because their tensor dimensions misalign with GPU hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Post-training compression shrinks large language models but creates irregular tensor dimensions that GPUs execute inefficiently. In the case of Llama-3-8B compressed via activation-aware singular value decomposition, a 15 percent parameter reduction yields no speedup because 95 percent of dimensions are misaligned at the framework, library, and hardware levels. The proposed GAC method wraps existing compressors and re-selects dimensions using multi-choice knapsack optimization to reach 100 percent alignment under the same parameter budget. This recovers up to 1.5 times speedup on models like Llama-3-8B while keeping output quality intact.

Core claim

The paper shows that dimensional misalignment in compressed LLMs prevents expected performance gains from parameter reduction, with 95 percent of dimensions in an ASVD-compressed Llama-3-8B proving unfriendly to the GPU execution stack. GAC solves this by wrapping any dimension-reducing compressor and applying multi-choice knapsack optimization to pick hardware-aligned dimensions within the original parameter count, achieving full alignment and restoring runtime speedups up to 1.5 times on tested models without quality loss.

What carries the argument

GPU-Aligned Compression (GAC), a wrapper that re-selects hardware-aligned dimensions from any base compressor using multi-choice knapsack optimization under a fixed parameter budget.
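
To make the wrapper concrete, below is a minimal sketch of one way a multi-choice knapsack selection over per-layer dimension candidates could be implemented. This page does not spell out GAC's solver, so the candidate representation, the greedy upgrade heuristic, and the names (Candidate, select_aligned_dims) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: GAC is described as a multi-choice knapsack over
# per-layer dimension candidates, but its solver is not published on this page.
# The greedy upgrade heuristic and all names here are assumptions.
from dataclasses import dataclass

@dataclass
class Candidate:
    dim: int          # candidate compressed dimension (e.g., retained rank)
    params: int       # parameters this choice costs in the layer
    importance: float # quality proxy inherited from the base compressor

def aligned(dim: int, multiple: int = 8) -> bool:
    # Figure 1 gives d mod 8 = 0 as the example alignment test.
    return dim % multiple == 0

def select_aligned_dims(layers: list[list[Candidate]], budget: int) -> list[Candidate]:
    """Pick one hardware-aligned candidate per layer, maximizing summed
    importance under a total parameter budget (greedy heuristic for MCKP)."""
    # Keep only aligned choices; assumes every layer offers at least one.
    options = [sorted((c for c in cands if aligned(c.dim)), key=lambda c: c.params)
               for cands in layers]
    choice = [opts[0] for opts in options]  # cheapest aligned option per layer
    spent = sum(c.params for c in choice)   # assumed to fit within the budget
    while True:
        best = None  # (importance gain per extra parameter, layer index, candidate)
        for i, opts in enumerate(options):
            for cand in opts:
                extra = cand.params - choice[i].params
                gain = cand.importance - choice[i].importance
                if extra > 0 and gain > 0 and spent + extra <= budget:
                    score = gain / extra
                    if best is None or score > best[0]:
                        best = (score, i, cand)
        if best is None:
            return choice
        _, i, cand = best
        spent += cand.params - choice[i].params
        choice[i] = cand
```

An exact dynamic program or an off-the-shelf integer solver could replace the greedy loop; the essential constraint is that every selected dimension is aligned and the total parameter count never exceeds the original budget.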

If this is right

  • Existing compressors such as ASVD and LLM-Pruner can deliver actual speedups when wrapped with alignment selection.
  • Compressed models can become both smaller and faster on standard GPUs without extra training steps.
  • Compression pipelines must incorporate hardware dimension constraints to realize efficiency benefits.
  • Full alignment eliminates the performance penalty from irregular tensor shapes across the GPU stack.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same misalignment problem may appear in compression applied to other model families or hardware accelerators.
  • Knapsack-based selection could be extended to optimize additional constraints such as memory bandwidth.
  • Inherently aligned compression algorithms might be designed to avoid the need for post-hoc re-selection.

Load-bearing premise

Re-selecting dimensions with multi-choice knapsack optimization under a fixed parameter budget preserves model quality without any retraining or fine-tuning.

What would settle it

Inference benchmarks on GAC-compressed Llama-3-8B that show either an accuracy drop or no speedup on GPU hardware would disprove the claim.

Figures

Figures reproduced from arXiv: 2604.09595 by Jihao Xin, Kesen Wang, Marco Canini, Qilong Pan, Tian Lyu.

Figure 1. GEMM: Y = X·W with X ∈ ℝ^(M×K), W ∈ ℝ^(K×N).
Figure 2. Llama-3-8B at ρ = 20%. Shape: ◦ = Q head, □ = K head, △ = V head; color: green = 8-aligned, red = misaligned.
Figure 3. Llama-3-8B after PaLU compression at ρ = 20%.
Figure 4. PyTorch SDPA execution stack. (The full-stack analysis runs on an NVIDIA A100-80GB with PyTorch 2.9.1, CUDA 12.8, FP16; latency is measured with CUDA events over 50 warmup and 200 measurement iterations, 3 trials.)
Figure 5. PyTorch SDPA latency across dimensions.
Figure 7. GEMM latency with dimension sweep.
Figure 8. Hardware-level alignment penalties (sweep near 4096): (a, b) Tensor Core throughput, (c) L2 cache bandwidth.
Figure 10. Llama-3-8B latency across sequence lengths.
Figure 11. Compressed dimension distributions across compression ratios.
Figure 12. PyTorch SDPA latency across dimensions on H100.
Figure 13. Hardware-level alignment penalties on H100: (a, b) Tensor Core throughput, (c) L2 cache bandwidth.
Figure 14. Llama-3-8B latency across sequence lengths on H100.
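
For readers who want to reproduce the flavor of Figures 7 and 12, a minimal sketch of the underlying microbenchmark follows, assuming the CUDA-event methodology noted under Figure 4 (50 warmup and 200 measurement iterations); the exact shapes, kernels, and sweep ranges used in the paper are not reproduced here.

```python
# Minimal sketch: time an FP16 GEMM with CUDA events for an aligned vs. a
# misaligned inner dimension. The shapes and the 4096/4095 contrast are
# illustrative choices, not the paper's exact sweep.
import torch

def gemm_latency_ms(m: int, k: int, n: int, warmup: int = 50, iters: int = 200) -> float:
    x = torch.randn(m, k, device="cuda", dtype=torch.float16)
    w = torch.randn(k, n, device="cuda", dtype=torch.float16)
    for _ in range(warmup):
        x @ w
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        x @ w
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # average milliseconds per GEMM

if __name__ == "__main__":
    for k in (4096, 4095):  # multiple of 8 vs. misaligned
        print(f"K={k}: {gemm_latency_ms(4096, k, 4096):.3f} ms")
```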
read the original abstract

Post-training compression reduces LLM parameter counts but often produces irregular tensor dimensions that degrade GPU performance -- a phenomenon we call dimensional misalignment. We present a full-stack analysis tracing root causes at three levels: framework, library, and hardware. The key insight is that model inference becomes slower because the resulting dimensions are unfriendly with the GPU execution stack. For example, compressing Llama-3-8B with activation-aware singular value decomposition (ASVD) has 15% fewer parameters yet runs no faster than the uncompressed baseline, because 95% of its dimensions are misaligned. We propose GAC (GPU-Aligned Compression), a new compression paradigm that wraps any dimension-reducing compressor and re-selects hardware-aligned dimensions via multi-choice knapsack optimization under the same parameter budget. We evaluate GAC on Llama-3-8B with ASVD and LLM-Pruner, achieving 100% alignment and recovering up to 1.5× speedup while preserving model quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that post-training compression of LLMs produces irregular tensor dimensions that cause dimensional misalignment with the GPU execution stack, resulting in no inference speedup despite fewer parameters. For example, ASVD compression of Llama-3-8B yields 15% fewer parameters but runs at baseline speed because 95% of dimensions are misaligned. The authors propose GAC, a wrapper around existing compressors (ASVD, LLM-Pruner) that re-selects dimensions via multi-choice knapsack optimization under a fixed parameter budget to achieve 100% alignment and up to 1.5× speedup while preserving model quality.

Significance. If the central result holds, the work identifies a concrete, previously under-appreciated source of performance loss in compressed LLMs and supplies a practical, compressor-agnostic fix that recovers hardware efficiency without retraining. The full-stack tracing from framework through library to hardware is a positive contribution; the knapsack formulation itself is parameter-free once the alignment constraint and budget are fixed.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Evaluation): the headline claim that GAC 'preserves model quality' while replacing dimensions chosen by the base compressor is unsupported by any ablation. The knapsack objective encodes only alignment and parameter count; no experiment shows that perplexity or downstream accuracy remains unchanged when high-singular-value dimensions are swapped for lower-importance but aligned ones.
  2. [§3.2] §3.2 (GAC formulation): the multi-choice knapsack is described as operating 'under the same parameter budget,' yet the manuscript provides neither the precise importance-weighting scheme inherited from the base compressor nor an explicit statement that the original singular-value or pruning scores are folded into the knapsack objective. Without this, the substitution risk identified in the stress-test note cannot be ruled out.
minor comments (2)
  1. [Abstract] Abstract: supply error bars, exact baseline configurations, and the precise definition of 'alignment' (e.g., divisibility by 128 for tensor-core tiles) so that the reported 95% and 100% figures can be reproduced.
  2. [Figure 1 and §2] Figure 1 and §2: clarify whether the reported 1.5× speedup is measured on the same hardware and batch size as the uncompressed baseline, and whether any tensor-core utilization or memory-bandwidth counters are provided to substantiate the misalignment diagnosis.
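
For concreteness, a minimal sketch of the alignment audit the minor comments ask to see specified: it counts how many linear-layer dimensions in a compressed model are divisible by a chosen multiple, using the d mod 8 = 0 example from Figure 1 as a default. The paper may apply different multiples at the framework, library, and hardware levels, so this is an illustrative check, not the authors' measurement code.

```python
# Hedged sketch: fraction of nn.Linear dimensions divisible by `multiple`.
# The divisor 8 mirrors the d mod 8 = 0 example in Figure 1; the paper's
# precise alignment criteria may differ per stack level.
import torch.nn as nn

def alignment_fraction(model: nn.Module, multiple: int = 8) -> float:
    dims = []
    for module in model.modules():
        if isinstance(module, nn.Linear):
            dims.extend((module.in_features, module.out_features))
    if not dims:
        return 1.0
    return sum(d % multiple == 0 for d in dims) / len(dims)
```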

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments that help strengthen the manuscript. We address each major comment below with clarifications and planned revisions.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Evaluation): the headline claim that GAC 'preserves model quality' while replacing dimensions chosen by the base compressor is unsupported by any ablation. The knapsack objective encodes only alignment and parameter count; no experiment shows that perplexity or downstream accuracy remains unchanged when high-singular-value dimensions are swapped for lower-importance but aligned ones.

    Authors: In §4 we report that GAC achieves perplexity and downstream accuracy comparable to the base compressors (ASVD, LLM-Pruner) under identical parameter budgets, as shown in Tables 2–3. We acknowledge that an explicit ablation isolating the effect of replacing high-singular-value dimensions with aligned lower-importance ones is absent. We will add this ablation study in the revision to directly confirm quality preservation. revision: partial

  2. Referee: [§3.2] §3.2 (GAC formulation): the multi-choice knapsack is described as operating 'under the same parameter budget,' yet the manuscript provides neither the precise importance-weighting scheme inherited from the base compressor nor an explicit statement that the original singular-value or pruning scores are folded into the knapsack objective. Without this, the substitution risk identified in the stress-test note cannot be ruled out.

    Authors: We agree the formulation in §3.2 requires clarification. The knapsack objective maximizes the sum of importance scores inherited from the base compressor (singular values for ASVD, pruning scores for LLM-Pruner) subject to the alignment and budget constraints. We will revise §3.2 to state this weighting explicitly and show that high-importance dimensions are prioritized, thereby addressing the substitution concern. revision: yes
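
Read literally, the rebuttal's description corresponds to a multi-choice knapsack of the following shape; this is a reconstruction from the rebuttal and the budget definition quoted under Figure 2, with assumed notation for the per-layer candidate sets and scores, not the paper's verbatim formulation.

```latex
\max_{d_\ell \in \mathcal{D}_\ell} \; \sum_{\ell=1}^{L} s_\ell(d_\ell)
\quad \text{s.t.} \quad
\sum_{\ell=1}^{L} p_\ell(d_\ell) \;\le\; B = (1-\rho)\,\lvert \mathcal{W} \rvert,
\qquad d_\ell \bmod a = 0 \;\; \forall \ell
```

Here s_ℓ is the importance score inherited from the base compressor (singular values for ASVD, pruning scores for LLM-Pruner), p_ℓ(d_ℓ) the parameter cost of choosing dimension d_ℓ for layer ℓ, B the parameter budget, and a the alignment multiple.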

Circularity Check

0 steps flagged

No significant circularity; GAC is an independent optimization wrapper

full rationale

The paper's derivation consists of an empirical root-cause analysis of GPU performance degradation due to irregular dimensions after compression, followed by the introduction of GAC as a post-hoc multi-choice knapsack re-selection step that enforces alignment while respecting the original parameter budget. Speedup and alignment percentages are reported as measured hardware outcomes on Llama-3-8B, not as quantities derived by algebraic reduction from the input compressor outputs. Quality preservation is presented as an empirical evaluation result rather than a definitional necessity. No equations, self-citations, or fitted parameters are shown to collapse the central claims into tautologies or input re-labelings. The approach remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis rests on the domain assumption that certain tensor dimensions are inherently faster on GPUs due to hardware and library constraints; no free parameters or new invented entities are introduced in the abstract.

axioms (1)
  • domain assumption GPU execution stack favors specific tensor dimensions for peak throughput
    Invoked when explaining why misaligned dimensions degrade performance at the hardware level.

pith-pipeline@v0.9.0 · 5484 in / 1234 out tokens · 46735 ms · 2026-05-15T15:36:56.073726+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/DimensionForcing.lean · reality_from_one_distinction · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    compressing Llama-3-8B with ASVD has 15% fewer parameters yet runs no faster ... because 95% of its dimensions are misaligned. ... GAC ... re-selects hardware-aligned dimensions via multi-choice knapsack optimization under the same parameter budget

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 3 internal anchors

  1. [1]

    Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, and Wen Xiao. 2025. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling. arXiv:2406.02069 [cs.CL] https://arxiv.org/abs/2406.02069

  2. [2]

    Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Mohamed S. Abdelfattah, and Kai-Chiang Wu. 2025. Palu: KV-Cache Compression with Low-Rank Projection. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=LWMS4pk2vK

  3. [3]

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 16344–16359. https://proceedings.neurips.cc/p...

  4. [4]

    Elias Frantar and Dan Alistarh. 2023. SparseGPT: massive language models can be accurately pruned in one-shot. In Proceedings of the 40th International Conference on Machine Learning (Honolulu, Hawaii, USA) (ICML'23). JMLR.org, Article 414, 15 pages.

  5. [5]

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh

  6. [6]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv:2210.17323 [cs.LG] https://arxiv.org/abs/2210.17323

  7. [7]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (Koblenz, Germany) (SOSP '23). Association for Computing Machinery, New York, NY, USA, 611–626. doi:10.1145/3600006.3613165

  8. [8]

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han

  9. [9]

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. In Proceedings of Machine Learning and Systems, P. Gibbons, G. Pekhimenko, and C. De Sa (Eds.), Vol. 6. 87–100. https://proceedin...

  10. [10]

    Xinyin Ma, Gongfan Fang, and Xinchao Wang. 2023. LLM-Pruner: On the Structural Pruning of Large Language Models. In Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 21702–21720. https://proceedings.neurips.cc/paper_files/paper/2023/file/44956...

  11. [11]

    Meta AI. 2024. Llama 3 Model Card. https://github.com/meta-llama/llama3

  12. [12]

    Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. 2019. Importance Estimation for Neural Network Pruning. arXiv:1906.10771 [cs.LG] https://arxiv.org/abs/1906.10771

  13. [13]

    NVIDIA. 2024. NVIDIA TensorRT: Programmable Inference Accelerator. https://developer.nvidia.com/tensorrt

  14. [14]

    NVIDIA Corporation. 2022. NVIDIA H100 Tensor Core GPU Architecture. White Paper. https://resources.nvidia.com/en-us-hopper-architecture/nvidia-h100-tensor-c

  15. [15]

    SemiAnalysis. 2024. NVIDIA Tensor Core Evolution: From Volta To Blackwell. https://newsletter.semianalysis.com/p/nvidia-tensor-core-evolution-from-volta-to-blackwell

  16. [16]

    Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. 2024. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. In Advances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. 37. Curran Associates, Inc., 68658–686...

  17. [17]

    Maying Shen, Hongxu Yin, Pavlo Molchanov, Lei Mao, Jianna Liu, and Jose M. Alvarez. 2021. HALP: Hardware-Aware Latency Pruning. arXiv:2110.10811 [cs.CV] https://arxiv.org/abs/2110.10811

  18. [18]

    Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. 2024. A Simple and Effective Pruning Approach for Large Language Models. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=PxoFut3dWW

  19. [19]

    Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. 2024. QUEST: query-aware sparsity for efficient long-context LLM inference. In Proceedings of the 41st International Conference on Machine Learning (Vienna, Austria) (ICML'24). JMLR.org, Article 1955, 11 pages.

  20. [20]

    Xin Wang, Yu Zheng, Zhongwei Wan, and Mi Zhang. 2025. SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=LNYIUouhdt

  21. [21]

    Jinqi Xiao, Chengming Zhang, Yu Gong, Miao Yin, Yang Sui, Lizhi Xiang, Dingwen Tao, and Bo Yuan. 2023. HALOC: hardware-aware automatic low-rank compression for compact neural networks. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thir...

  22. [22]

    Zhihang Yuan, Yuzhang Shang, Yue Song, Dawei Yang, Qiang Wu, Yan Yan, and Guangyu Sun. 2025. ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models. arXiv:2312.05821 [cs.CL] https://arxiv.org/abs/2312.05821

  23. [23]

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang "Atlas" Wang, and Beidi Chen. 2023. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. In Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson...