CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning

Ammar Ali; Denis Makhov; Dmitriy Shopkhoev; Magauiya Zhussip; Stamatios Lefkimmiatis

arXiv: 2509.22075 · v6 · pith:Q7ASRHN6new · submitted 2025-09-26 · cs.CL · cs.AI

Pith reviewed 2026-05-18 13:14 UTC · model grok-4.3

keywords: LLM compression · sparse dictionary learning · post-training compression · structured sparsity · calibration guided · union of subspaces · low-rank approximation · Llama Qwen

The pith

CoSpaDi replaces low-rank factorization with a sparse dictionary model that better preserves LLM accuracy at 20-40 percent compression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models are often compressed after training by approximating each weight matrix with a low-rank factorization that forces every column into the same low-dimensional subspace. CoSpaDi instead decomposes each weight matrix as a dense dictionary multiplied by a column-sparse coefficient matrix, so that different columns can combine different subsets of dictionary atoms. The dictionary and coefficients are chosen by minimizing the difference between original and compressed layer outputs on a small calibration set rather than minimizing weight error directly. An activation-based Gram orthonormalization turns this objective into a standard dictionary learning problem that can be solved per layer or with shared dictionaries across similar layers. Experiments on Llama and Qwen families show improved accuracy-compression and perplexity-compression curves compared with strong SVD and structured pruning baselines.
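A minimal sketch of the representation described above, using random data. The sizes `d1`, `d2`, `k`, `s` and all variable names are hypothetical choices for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 64, 96      # weight matrix shape (hypothetical sizes)
k, s = 32, 4         # number of dictionary atoms, nonzeros per coefficient column

# Dense dictionary: every entry is stored.
D = rng.standard_normal((d1, k))

# Column-sparse coefficients: each column keeps exactly s nonzeros,
# and each column may select a different subset of atoms.
S = np.zeros((k, d2))
for j in range(d2):
    support = rng.choice(k, size=s, replace=False)
    S[support, j] = rng.standard_normal(s)

W_hat = D @ S        # reconstructed weight matrix, shape (d1, d2)

# Every column of W_hat lies in the span of its own s atoms, so the columns
# together form a union of s-dimensional subspaces instead of one shared subspace.
nnz_per_column = np.count_nonzero(S, axis=0)
rank_W_hat = np.linalg.matrix_rank(W_hat)
```

A rank-r low-rank factorization would instead place every column in the same r-dimensional subspace; here the overall rank of `W_hat` can exceed `s` even though each column uses only `s` atoms.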

Core claim

Each weight matrix is expressed as the product of a dense dictionary and a column-sparse coefficient matrix, producing a union-of-subspaces representation. The factorization is obtained by minimizing functional reconstruction error of layer outputs on a calibration set; this data-aware objective is converted via activation-derived Gram orthonormalization into a conventional dictionary learning task. The resulting structured sparsity supports efficient sparse-dense computation and post-training quantization of the coefficients while allowing optional cross-layer dictionary sharing.
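The reformulation can be written out as follows. This is a sketch under assumed notation: X denotes the calibration activations (one row per token), G their Gram matrix, and L a Cholesky factor of G. The paper's exact orthonormalization may be defined differently.

```latex
\begin{aligned}
&\min_{D,\,S}\ \lVert XW - XDS \rVert_F^2
  \quad \text{s.t.}\ \lVert s_j \rVert_0 \le s \ \text{for every column } s_j \text{ of } S, \\
&G = X^\top X = L L^\top, \\
&\lVert X(W - DS) \rVert_F^2
  = \operatorname{tr}\!\big((W - DS)^\top L L^\top (W - DS)\big)
  = \lVert L^\top W - (L^\top D)\,S \rVert_F^2 .
\end{aligned}
```

With W_L = Lᵀ W and D_L = Lᵀ D, the last expression is an ordinary sparse dictionary learning objective in (D_L, S). Assuming G is positive definite, the dictionary in the original space is recovered as D = L⁻ᵀ D_L.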

What carries the argument

Calibration-guided sparse dictionary learning that reformulates functional reconstruction error minimization into dictionary learning on Gram-orthonormalized transformed weights.
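The pipeline named above could be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: it uses a Cholesky factor of the ridge-regularized Gram matrix, a simple thresholding step for sparse coding, and a least-squares (MOD-style) dictionary update, any of which may differ from the solver used in the paper. The function names and the `ridge` parameter are hypothetical.

```python
import numpy as np

def sparse_code(D, Y, s):
    """Per column of Y: pick the s atoms most correlated with it,
    then least-squares fit the coefficients on that support."""
    norms = np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1e-12)
    corr = np.abs((D / norms).T @ Y)              # (k, d2) correlations
    S = np.zeros((D.shape[1], Y.shape[1]))
    for j in range(Y.shape[1]):
        support = np.argsort(corr[:, j])[-s:]     # indices of the s largest correlations
        coef, *_ = np.linalg.lstsq(D[:, support], Y[:, j], rcond=None)
        S[support, j] = coef
    return S

def fit_calibrated_dictionary(X, W, k, s, iters=10, ridge=1e-6, seed=0):
    """Hypothetical helper: learn D (d1, k) and column-sparse S (k, d2)
    so that X @ D @ S approximates X @ W."""
    rng = np.random.default_rng(seed)
    d1, d2 = W.shape
    G = X.T @ X + ridge * np.eye(d1)   # Gram matrix; ridge keeps it positive definite
    L = np.linalg.cholesky(G)          # G = L @ L.T
    W_L = L.T @ W                      # transformed weights
    # Initialise atoms from random columns of the transformed weights (requires k <= d2).
    D_L = W_L[:, rng.choice(d2, size=k, replace=False)].copy()
    for _ in range(iters):
        S = sparse_code(D_L, W_L, s)
        D_L = W_L @ np.linalg.pinv(S)  # least-squares dictionary update given S
    S = sparse_code(D_L, W_L, s)       # final codes for the final dictionary
    D = np.linalg.solve(L.T, D_L)      # map back: L.T @ D = D_L
    return D, S

# Toy usage on random data (sizes are arbitrary).
rng = np.random.default_rng(1)
X = rng.standard_normal((256, 32))     # calibration activations
W = rng.standard_normal((32, 48))      # original weight matrix
D, S = fit_calibrated_dictionary(X, W, k=24, s=6)
rel_err = np.linalg.norm(X @ W - X @ D @ S) / np.linalg.norm(X @ W)
```

The alternating structure (fix the dictionary and code, then fix the codes and update the dictionary) is the standard shape of dictionary learning solvers; the whitening by `L.T` is what makes the fit target layer outputs rather than raw weights.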

Load-bearing premise

Minimizing layer output error on a small calibration set produces a factorization whose downstream task accuracy stays close to the original model without any fine-tuning.

What would settle it

A side-by-side evaluation on Llama-7B or Qwen-7B at 30 percent compression showing equal or higher downstream accuracy and lower perplexity for an SVD baseline than for CoSpaDi would falsify the reported trade-off improvement.

Figures

Figures reproduced from arXiv: 2509.22075 by Ammar Ali, Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Stamatios Lefkimmiatis.

**Figure 1.** Left side: weight factorization methods using low-rank decomposition. Low-rank approximation decomposes a matrix into two dense matrices of lower rank. Right side: proposed CoSpaDi. A dictionary of k atoms and a column-sparse coefficient matrix are employed. No restrictions on size of k (undercomplete: k < d1, complete: k = d1, or overcomplete: k > d1 dictionaries are possible), while sparsity is defined…

**Figure 2.** Dual-axis plot showing average accuracy (solid lines, left axis) and perplexity (dashed lines, right axis, logarithmic scale with inverted direction) as functions of ρ for Llama3.2-1B under three compression levels: 0.2, 0.3 and 0.4. Perplexity decreases upward due to axis inversion.

Table fragment captured with the caption (columns: CR, Bitwidth, Avg. Acc., PPL): 0.1686, bFP16, 0.6198, 1.94E+01; 0.1843, bFP15, 0.6195, 1.95E+01; 0.2001, bFP14, 0.6176, 1.97E+01; 0.…

**Figure 3.** Average benchmark accuracy and WikiText perplexity for (a) LLaMA-3.2-1B and (b) …

**Figure 4.** Inference time for different projection layers of Llama3.2 1B for different compression …

**Figure 5.** Inference time for different projection layers of Llama3 8B for different compression …

**Figure 6.** Inference time for different projection layers of Qwen3 0.6B for different compression …

**Figure 7.** Average benchmark accuracy and WikiText perplexity with respect to the number of K …

Original abstract

Post-training LLM compression often relies on low-rank approximations, which force all columns of a projection matrix to share a single low-dimensional subspace. We propose CoSpaDi, a training-free compression framework that replaces this single-subspace assumption with a union-of-subspaces model via sparse dictionary learning. CoSpaDi factorizes each weight matrix into a dense dictionary and column-sparse coefficients, allowing different columns to select different subsets of dictionary atoms at the same storage budget. To preserve model behavior, we use calibration activations to transform functional reconstruction into a standard dictionary learning problem. Across Llama and Qwen models, CoSpaDi improves accuracy-compression and perplexity-compression trade-offs over SVD-based and structured pruning baselines at 20-40% compression ratios, while naturally supporting sparse-dense computation and post-training quantization of sparse coefficients.
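To illustrate what sparse-dense computation with column-sparse coefficients can look like, here is a small NumPy sketch. The packed storage layout (one index array and one value array of shape `(s, d2)`) and both function names are assumptions for illustration; the paper's kernels may use a different format.

```python
import numpy as np

def pack_columns(S, s):
    """Store a column-sparse (k, d2) matrix as two (s, d2) arrays:
    row indices and values of the s largest-magnitude entries per column."""
    idx = np.argsort(-np.abs(S), axis=0)[:s]
    val = np.take_along_axis(S, idx, axis=0)
    return idx, val

def sparse_dict_linear(x, D, idx, val):
    """Compute x @ D @ S without materialising S or the product D @ S."""
    z = x @ D                                   # dense step, shape (batch, k)
    # Output column j sums the s selected entries of z, weighted by val[:, j].
    return np.einsum("bsj,sj->bj", z[:, idx], val)

# Toy check against the dense computation (sizes are arbitrary).
rng = np.random.default_rng(0)
d1, d2, k, s = 16, 20, 12, 3
D = rng.standard_normal((d1, k))
S = np.zeros((k, d2))
for j in range(d2):
    S[rng.choice(k, size=s, replace=False), j] = rng.standard_normal(s)

x = rng.standard_normal((5, d1))
idx, val = pack_columns(S, s)
y_fast = sparse_dict_linear(x, D, idx, val)
y_ref = x @ D @ S
```

Per input row, the dense layer costs about `d1 * d2` multiply-adds, while the factorized path costs about `d1 * k + s * d2`, which is where the potential saving comes from when `k` and `s` are small enough.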

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CoSpaDi improves accuracy-compression trade-offs over SVD and pruning at 20-40% ratios via calibration-guided sparse dictionary learning, but the gains rest on how well a small calibration set represents inference activations.

The main thing to know is that CoSpaDi replaces low-rank factorization with a sparse dictionary approach for post-training LLM compression. It optimizes the decomposition using a small calibration set to minimize reconstruction error at the layer outputs, which the authors report yields better trade-offs than SVD and pruning baselines at 20 to 40 percent compression on the Llama and Qwen families.

The technique uses column-sparse coefficients over a dense dictionary, which allows each weight column to draw from a different subset of atoms. The activation-derived Gram orthonormalization is a clever way to convert the functional error minimization into a more standard dictionary learning problem on transformed weights. The authors also explore sharing the dictionary across groups of similar layers. This keeps the method training-free and the output sparsity structured enough to combine with quantization.

On the good side, moving to a union-of-subspaces model fits heterogeneous weight matrices better than forcing everything into one low-dimensional space. Targeting the actual output error rather than weight error makes sense for preserving model behavior.

The soft spots are mostly around verification and assumptions. The central claim rests on the calibration set being representative enough that the compressed model holds up on downstream tasks without further adaptation. A mismatch there could mean larger effective errors than the baselines. Without specific quantitative results or details on calibration set size in the summary, it is hard to judge the magnitude of the gains or their reliability across different setups. The choice of dictionary size and sparsity level as free parameters also calls for more exploration.

This work is for people focused on practical compression methods for large models. Readers in the efficient inference area will find the ideas and the open code valuable. It shows clear technical engagement and deserves a serious referee to sort out the details. I would recommend sending it for peer review.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes CoSpaDi, a training-free framework for post-training compression of LLMs. It replaces low-rank weight approximations with a structured sparse decomposition using a dense dictionary and column-sparse coefficients, optimized to minimize functional reconstruction error of layer outputs on a small calibration set via activation-derived Gram orthonormalization. The paper claims that this union-of-subspaces model improves accuracy-compression and perplexity-compression trade-offs over SVD-based and structured pruning baselines at 20-40% compression ratios on Llama and Qwen model families.

Significance. If the empirical results hold, the approach provides a more expressive parameterization for weight compression at fixed parameter budgets, potentially reducing accuracy loss compared to rigid low-rank methods. The calibration-guided objective and support for cross-layer dictionary sharing are notable technical elements. The training-free design and compatibility with quantization are practical strengths that could influence future work in efficient LLM deployment.

major comments (2)

Abstract: The abstract states that CoSpaDi 'consistently improves' the trade-offs but provides no quantitative numbers, error bars, details on calibration set size, dictionary size selection, or statistical significance tests. This absence makes the central empirical claim difficult to evaluate and verify from the provided text.
Central claim (calibration-guided reconstruction): The assumption that minimizing functional reconstruction error on a small calibration set will yield a factorization whose zero-shot accuracy and perplexity remain superior to SVD/pruning baselines without fine-tuning is load-bearing. If the calibration set under-samples rare patterns or task-specific activations, the union-of-subspaces model can still incur larger effective error on downstream benchmarks than weight-space methods at identical parameter budgets. The manuscript should include ablations or analysis on calibration set size, distribution, and representativeness to support this.

minor comments (1)

Abstract: The code repository link is provided, supporting reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each of the major comments below and indicate the revisions we will make to strengthen the paper.

Referee: Abstract: The abstract states that CoSpaDi 'consistently improves' the trade-offs but provides no quantitative numbers, error bars, details on calibration set size, dictionary size selection, or statistical significance tests. This absence makes the central empirical claim difficult to evaluate and verify from the provided text.

Authors: We agree that incorporating specific quantitative details in the abstract would enhance the clarity and verifiability of our claims. In the revised manuscript, we will modify the abstract to include key performance metrics, such as the observed improvements in perplexity and zero-shot accuracy at various compression ratios. We will also specify the calibration set size used (128 samples from the C4 dataset), the method for selecting dictionary size (based on minimizing reconstruction error on the calibration set), and note that error bars and statistical details are provided in the experimental results section of the full paper.
Revision: yes

Referee: Central claim (calibration-guided reconstruction): The assumption that minimizing functional reconstruction error on a small calibration set will yield a factorization whose zero-shot accuracy and perplexity remain superior to SVD/pruning baselines without fine-tuning is load-bearing. If the calibration set under-samples rare patterns or task-specific activations, the union-of-subspaces model can still incur larger effective error on downstream benchmarks than weight-space methods at identical parameter budgets. The manuscript should include ablations or analysis on calibration set size, distribution, and representativeness to support this.

Authors: This is a valid concern regarding the generalizability of the calibration-guided optimization. The current manuscript uses a fixed calibration set of 128 samples and demonstrates consistent improvements across Llama and Qwen models on standard benchmarks. To further support the robustness of this approach, we will add an ablation study in the revised version analyzing the effects of varying the calibration set size and using different data distributions (e.g., C4 versus other corpora). We will also include a discussion on the limitations of finite calibration sets and how the functional reconstruction objective helps mitigate issues with rare patterns by focusing on activation statistics. We believe these additions will address the referee's point without altering the core methodology.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper derives its method by first posing a data-aware objective that minimizes layer-output reconstruction error on a calibration set, then applying an activation-derived Gram orthonormalization to recast this exactly as a standard dictionary learning problem on transformed weights. This is a mathematical equivalence that enables use of existing solvers rather than a self-definitional loop or fitted input renamed as prediction. Empirical gains over SVD and structured pruning baselines at 20-40% compression are reported via direct accuracy and perplexity measurements on Llama and Qwen families; these do not reduce to the calibration inputs by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the described chain, leaving the central claims self-contained against external benchmarks.
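The claimed equivalence can be checked numerically. The sketch below assumes the transformation is a Cholesky factor of the activation Gram matrix (one common way to realize such an orthonormalization, not necessarily the paper's) and uses random data with arbitrary sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))               # calibration activations, more rows than columns
W = rng.standard_normal((16, 24))                # original weights
W_hat = W + 0.1 * rng.standard_normal(W.shape)   # any candidate approximation

L = np.linalg.cholesky(X.T @ X)                  # X.T @ X = L @ L.T

# Error measured on layer outputs ...
functional_err = np.linalg.norm(X @ W - X @ W_hat)
# ... equals the plain Frobenius error between the transformed weights.
transformed_err = np.linalg.norm(L.T @ W - L.T @ W_hat)
```

Because the two norms agree for any `W_hat`, a solver that only sees the transformed weights is still minimizing the output error on the calibration set.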

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the expressiveness of the union-of-subspaces model for weight matrices and the sufficiency of a small calibration set for guiding the factorization; no new physical entities or unproven mathematical axioms are introduced beyond standard linear algebra assumptions.

free parameters (1)

dictionary size and sparsity level
Hyperparameters chosen to achieve target 20-40% compression ratios; not fitted to final task metrics in the abstract description.
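A back-of-the-envelope sketch of how dictionary size and sparsity trade off against a target compression ratio. It counts stored values only and ignores the cost of storing nonzero positions, which is a simplifying assumption; the paper's accounting may include that overhead. The layer size 2048 × 2048 and the function names are hypothetical.

```python
def low_rank_params(d1, d2, r):
    """Values stored by a rank-r factorization: (d1 x r) plus (r x d2)."""
    return r * (d1 + d2)

def sparse_dict_params(d1, d2, k, s):
    """Values stored by a dense (d1 x k) dictionary plus s nonzeros per
    coefficient column. Index storage is NOT counted (assumption)."""
    return d1 * k + s * d2

def compression_ratio(stored, d1, d2):
    """Fraction of the dense parameter count that is removed."""
    return 1.0 - stored / (d1 * d2)

def rank_for_ratio(d1, d2, target):
    """Largest rank whose storage fits the budget for the target ratio."""
    budget = (1.0 - target) * d1 * d2
    return int(budget // (d1 + d2))

def sparsity_for_ratio(d1, d2, k, target):
    """Largest per-column sparsity that fits the budget once the dictionary is paid for."""
    budget = (1.0 - target) * d1 * d2 - d1 * k
    return int(budget // d2)

d1 = d2 = 2048
r = rank_for_ratio(d1, d2, 0.3)
s = sparsity_for_ratio(d1, d2, k=1024, target=0.3)
cr_low_rank = compression_ratio(low_rank_params(d1, d2, r), d1, d2)
cr_sparse = compression_ratio(sparse_dict_params(d1, d2, 1024, s), d1, d2)
```

Under these assumptions, a fixed budget fixes the rank of a low-rank factorization, whereas the dictionary model leaves one degree of freedom: a larger `k` buys more atoms at the price of a smaller `s`.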

axioms (1)

Domain assumption: weight matrices admit a good approximation as a dense dictionary times column-sparse coefficients.
Invoked when replacing low-rank factorization with the proposed sparse decomposition.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Motion-Compensated Weight Compression
cs.CV · 2026-05 · unverdicted · novelty 6.0

MCWC aligns permutation-symmetric blocks across layers to enable sequential prediction and residual entropy coding, improving rate-accuracy tradeoffs versus quantization and prior codecs on language and vision models.