arxiv: 2605.09281 · v1 · submitted 2026-05-10 · 💻 cs.LG

Recognition: no theorem link

TileQ: Efficient Low-Rank Quantization of Mixture-of-Experts with 2D Tiling

Hongyaoxing Gu , Xinzhe Chen , Lijuan Hu , Fangfang Liu

Authors on Pith no claims yet

Pith reviewed 2026-05-12 04:13 UTC · model grok-4.3

classification 💻 cs.LG

keywords mixture of expertslow-rank quantizationpost-training quantization2D tilingmodel compressioninference latencymemory efficiency

0 comments

The pith

Mixture-of-Experts models can be quantized with 2D tiling to share low-rank factors across dimensions, reducing extra memory use by up to 10 times and latency to 5 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes TileQ, a post-training quantization method for Mixture-of-Experts models that applies 2D-tiling structured low-rank approximation. This approach shares low-rank factors across both the input and output dimensions of the expert layers. By doing so, it enables fusing multiple expert computations into a single operation during inference. The result is substantially lower memory requirements and faster inference times without any fine-tuning or loss in accuracy. This matters for deploying large MoE models on resource-limited hardware where parameter count has been a barrier.

Core claim

TileQ employs 2D-tiling structured low-rank quantization to share low-rank factors across input and output dimensions of MoE experts. It also introduces an efficient inference technique that fuses multiple low-rank expert computations into a single-pass operation. Experiments demonstrate that this cuts down additional memory usage up to 10 times and reduces inference latency to about 5 percent while preserving state-of-the-art accuracy.

What carries the argument

2D-tiling structured low-rank quantization that shares low-rank factors across input and output dimensions to support fused single-pass inference.

Load-bearing premise

That sharing low-rank factors across input and output dimensions via 2D tiling without any fine-tuning will preserve accuracy for different Mixture-of-Experts models and tasks.

What would settle it

Testing TileQ on a held-out MoE model and task combination where the accuracy falls noticeably below the unquantized baseline.

Figures

Figures reproduced from arXiv: 2605.09281 by Fangfang Liu, Hongyaoxing Gu, Lijuan Hu, Xinzhe Chen.

**Figure 1.** Figure 1: Overview of TILEQ. (a) discusses the observations and motivations of low-rank PTQs, (b) presents the challenges and strategies of our algorithm, and (c) shows the evaluations results. The pseudo-code is shown in Algorithm 1. where Em,n ∈ {0, 1}M×N is the basis matrix with 1 at (m, n), and ⊗ denotes the Kronecker product in block (i.e., placing W˜ k at the (mk, nk)-th tile). The resulting mapping is ϕ(k) 7→… view at source ↗

**Figure 2.** Figure 2: Efficient MoE with Low-Rank Decomposition in TILEQ: Enable 2D-Tiling to In-Place Optimization and Fast Inference pass, where the weights in Eq.3 are replaced with quantized matrices, ensuring compatibility with established inference frameworks (Kwon et al., 2023; Zheng et al., 2024). Meanwhile, LoTileMoE is an efficient inference method for the tiled low-rank. It exploits the low-rank structure to fused c… view at source ↗

**Figure 3.** Figure 3: Inference latency in MoE MLP-block on A800. Here TILEQ-2D denotes our proposed 2D-tiling low-rank algorithm. The Element-wise baseline refers to the traditional per expert low-rank approach. The results of H800 and 5090 are given in Appendix D.1. 1 4 8 16 32 Batch Size 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Throughput in log scale (token/s) 7.8% 52.2% 40.0% 7.8% 51.7% 40.6% 7.8% 51.6% 40.6% 7.8% 51.6% 40.6% 7.1% 48.3… view at source ↗

**Figure 4.** Figure 4: Inference throughput and the time proportion of each module in Qwen1.5-MoE-A2.7B with sequence length 4096 [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Implementation of inference codes in TILEQ. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗

**Figure 6.** Figure 6: Inference latency in MoE MLP-block on H800 1 4 16 32 0% 25% 50% 75% 100% Prefill Latency (%) Deepseek-MoE-16B 1 4 16 32 Mixtral-8x7B 1 4 16 32 Qwen1.5-MoE-A2.7B 1 4 16 32 Qwen3-30B-A3B 1 4 16 32 Qwen3-Next-80B-A3B 1 4 16 32 0% 25% 50% 75% 100% Decode Latency (%) Deepseek-MoE-16B 1 4 16 32 Mixtral-8x7B 1 4 16 32 Qwen1.5-MoE-A2.7B 1 4 16 32 Qwen3-30B-A3B 1 4 16 32 Qwen3-Next-80B-A3B Element-Wise MiLo-1D Tile… view at source ↗

**Figure 7.** Figure 7: Inference latency in MoE MLP-block on 5090 support for the newer model families evaluated in our work; • MXMOE further relies on custom operators or hardware-specific primitives that are incompatible with our experimental infrastructure. Nevertheless, to situate our approach within the broader research landscape, we include in this appendix a brief overview of these concurrent methods and perform limited c… view at source ↗

read the original abstract

Mixture-of-Experts (MoE) models achieve remarkable performance by sparsely activating specialized experts, yet their massive parameters in experts pose significant challenges for deployment. While low-rank quantization offers a promising route to compress MoE models, existing methods still incur nonnegligible memory overhead and inference latency. To address these limitations, we propose \textsc{TileQ}, a fine-tuning-free post-training quantization (PTQ) method that employs 2D-tiling structured low-rank quantization to share low-rank factors across both input and output dimensions of MoE experts. Furthermore, we introduce an efficient inference technique for \textsc{TileQ} that fuses multiple low-rank expert computations into a single-pass operation, significantly improving hardware utilization. Experiments show that \textsc{TileQ} cuts down additional memory usage up to 10$\times$ and reduces inference latency to $\sim$5\% while preserving state-of-the-art accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TileQ's 2D-tiling low-rank PTQ for MoE experts plus fused inference is a targeted systems tweak that could help deployment, but accuracy without fine-tuning is the part that needs the most checking.

read the letter

TileQ's main contribution is a fine-tuning-free PTQ scheme that uses 2D tiling to share low-rank factors across both the input and output dimensions of each MoE expert, then fuses the low-rank computations for multiple experts into a single inference pass. This directly targets the memory overhead and latency that come from having many separate expert matrices in sparse models. The reported gains—up to 10x less additional memory and latency down to roughly 5% of baseline while keeping accuracy—address a practical bottleneck for running large MoE models on ordinary hardware. The fusion step in particular looks like a sensible engineering choice to improve hardware utilization during the sparse activation phase. What is actually new is the specific 2D structure applied to the shared low-rank factors in the MoE setting rather than a generic low-rank quantizer. The paper does a reasonable job framing the problem around real deployment constraints and proposing a combined quantization-plus-inference solution. The soft spot is the accuracy claim. MoE experts are sparsely activated and highly specialized, so forcing the same low-rank factors across input and output dimensions for every expert risks more distortion than standard per-expert or 1D low-rank approaches. The abstract gives no derivation showing why 2D tiling bounds that error better, and without seeing detailed ablations, error bars, or comparisons to prior PTQ baselines on multiple MoE architectures, it is hard to know how well the no-fine-tuning assumption holds. If the full experiments include those controls and the numbers are reproducible, the central argument is fine; otherwise the accuracy preservation is the load-bearing part that could weaken. This is for systems-oriented readers working on model compression or efficient inference for sparse transformers. A practitioner looking for concrete quantization tricks for MoE would get value from the method and any released code. It deserves a serious referee to verify the experimental details and check whether the 2D sharing actually delivers the claimed robustness.

Referee Report

2 major / 1 minor

Summary. The paper proposes TileQ, a fine-tuning-free post-training quantization (PTQ) method for Mixture-of-Experts (MoE) models. It employs 2D-tiling structured low-rank quantization to share low-rank factors across both input and output dimensions of the experts. An efficient inference technique is introduced that fuses multiple low-rank expert computations into a single-pass operation. The authors claim that this approach reduces additional memory usage by up to 10× and inference latency to approximately 5% while preserving state-of-the-art accuracy.

Significance. If the experimental claims hold with proper validation, TileQ would represent a meaningful advance in compressing and accelerating large MoE models for deployment. By avoiding fine-tuning and using structured low-rank approximations with fusion, it addresses both memory and latency bottlenecks in a way that could be broadly applicable to various MoE architectures, potentially lowering the barrier for using high-performance sparse models on edge or resource-limited hardware.

major comments (2)

[§3] §3 (Method): The description of the 2D-tiling structured low-rank approximation does not include a derivation, error bound, or comparison to 1D alternatives showing that sharing low-rank factors across input and output dimensions sufficiently bounds distortion for highly specialized and sparsely activated MoE experts in a pure PTQ setting without fine-tuning; this is load-bearing for the central accuracy-preservation claim.
[§4] §4 (Experiments): The reported performance numbers (up to 10× memory reduction, ~5% latency, preserved SOTA accuracy) are stated without experimental details such as the specific MoE models and tasks tested, baselines, ablation studies, number of runs, or error bars, preventing verification of the results.

minor comments (1)

[Abstract] Abstract: The terms 'additional memory usage' and 'state-of-the-art accuracy' are used without defining the exact metrics or reference models/benchmarks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The two major comments identify areas where additional rigor and detail will strengthen the paper. We address each point below and will incorporate the suggested improvements in the revised manuscript.

read point-by-point responses

Referee: [§3] §3 (Method): The description of the 2D-tiling structured low-rank approximation does not include a derivation, error bound, or comparison to 1D alternatives showing that sharing low-rank factors across input and output dimensions sufficiently bounds distortion for highly specialized and sparsely activated MoE experts in a pure PTQ setting without fine-tuning; this is load-bearing for the central accuracy-preservation claim.

Authors: We agree that a formal derivation and error analysis would make the central claim more robust. In the revision we will add a dedicated subsection to §3 that derives the 2D-tiling low-rank factorization, provides a Frobenius-norm error bound for the shared factors, and includes a direct comparison to standard 1D low-rank quantization. The new material will also report empirical distortion measurements on the actual expert weight matrices of the evaluated MoE models, confirming that the 2D sharing keeps approximation error within acceptable limits for PTQ without fine-tuning. revision: yes
Referee: [§4] §4 (Experiments): The reported performance numbers (up to 10× memory reduction, ~5% latency, preserved SOTA accuracy) are stated without experimental details such as the specific MoE models and tasks tested, baselines, ablation studies, number of runs, or error bars, preventing verification of the results.

Authors: We acknowledge that the current experimental section lacks the level of detail needed for full reproducibility. The revised §4 will explicitly list the MoE models (Mixtral-8x7B, DeepSeek-MoE-16B, etc.), evaluation tasks (MMLU, GSM8K, HumanEval), all baselines (including GPTQ, AWQ, and prior MoE quantizers), the ablation configurations for tiling size and fusion, and results reported as mean ± standard deviation over three independent runs. Hardware platform details and measurement methodology for the latency figures will also be provided. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical PTQ method with no load-bearing derivations or self-referential fits

full rationale

The paper presents TileQ as a fine-tuning-free PTQ algorithm that applies 2D-tiling structured low-rank quantization to MoE experts and fuses inference computations. All central claims (memory reduction up to 10×, latency to ~5%, accuracy preservation) are supported by experimental results on diverse models and tasks rather than any closed mathematical derivation. No equations, fitted parameters renamed as predictions, self-citations used as uniqueness theorems, or ansatzes smuggled via prior work appear in the provided text. The 2D-tiling choice is motivated by efficiency goals and validated empirically; it does not reduce to its own inputs by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract alone supplies no explicit free parameters, axioms, or invented entities; the method description implies standard low-rank approximation assumptions but does not enumerate them.

pith-pipeline@v0.9.0 · 5470 in / 981 out tokens · 32713 ms · 2026-05-12T04:13:44.158247+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 2 internal anchors

[1]

Frantar, E

URL https://openreview.net/forum? id=pXoZLGMNDm. Frantar, E. and Alistarh, D. Qmoe: Practical sub-1-bit compression of trillion-parameter models.arXiv preprint arXiv:2310.16795, 2023. Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. Gptq: Accurate post-training quantization for generative pre- trained transformers. InThe Eleventh International Con...

work page doi:10.1016/j.comgeo.2004.03.003 2023
[2]

URL https://qwenlm.github.io/blog/ qwen-moe/. Team, Q. Qwen3 technical report, 2025. URL https: //arxiv.org/abs/2505.09388. Tseng, A., Chee, J., Sun, Q., Kuleshov, V ., and De Sa, C. Quip# : Even better llm quantization with hadamard inco- herence and lattice codebooks. InForty-first International Conference on Machine Learning, 2024a. Tseng, A., Sun, Q.,...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2024.findings-emnlp 2025
[3]

HellaSwag: Can a Machine Really Finish Your Sentence?

URL https://aclanthology.org/2024. findings-emnlp.612/. Yu, Y ., Wang, T., and Samworth, R. J. A useful variant of the davis–kahan theorem for statisticians.Biometrika, 102(2):315–323, 2015. Zellers, R., Holtzman, A., Bisk, Y ., Farhadi, A., and Choi, Y . Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019. Zhang, C...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

This term preserves individual expert functionality despite parameter sharing

Per-module reconstruction error:PK k=1 ∥Wk − ˜W tile k ∥2 F penalizes deviations between each original expert weight Wk and its reconstructed version from the tiled low-rank components. This term preserves individual expert functionality despite parameter sharing

work page
[5]

This regularizer maintains the semantic alignment between expert similarity and tile placement

Structural consistency: ϕ enforces spatial coherence in the 2D tiling layout—e.g., by discouraging large dis- placements from ideal cluster-assigned positions (mk, nk). This regularizer maintains the semantic alignment between expert similarity and tile placement

work page
[6]

Minimizing this term ensures that the structured decomposition captures the dominant subspace of the aggregated expert weights

Global low-rank error: ∥Wbig −UΣV ⊤∥2 F measures the fidelity of the shared 2D-tiling low-rank approxi- mation across all experts. Minimizing this term ensures that the structured decomposition captures the dominant subspace of the aggregated expert weights

work page
[7]

From the review of TILEQ, the core optimization balances global compression, per-expert fidelity, structural coherence, and quantization compatibility

Quantization error:PK k=1 ∥Rk − Q(Rk)∥2 F , Rk =W k − ˜Wk accounts for the distortion introduced when mapping full-precision weights to discrete quantized values. From the review of TILEQ, the core optimization balances global compression, per-expert fidelity, structural coherence, and quantization compatibility. In the steps, errors do not accumulate pro...

work page 2004
[8]

Global low-rank structure: The block matrix Wbig admits an accurate low-rank approximation when its singular val- ues decay rapidly. This occurs precisely when experts exhibit strong similarity in their activation-aware subspaces—i.e., when the optimal clustering costsOPT U andOPT V are small—as formalized in ¶ A.2

work page
[9]

Local subspace alignment: The residual error ϵk quantifies the deviation of expertk from the shared subspace assigned to its tile. By clustering experts based on their activation-aware left and right singular vectors (uk, vk), the biclustering step promotes smallϵ k, provided the underlying subspaces are well-separated. Critically, TILEQ avoids the over-c...

work page 2025
[10]

Global Input Projection.The input tensor X∈R B×i is multiplied with the reshaped shared factor (UΣ) reshape ∈ Ri×(M r): Xproj =X·(UΣ) reshape ∈R B×(M r).(52) This is a single dense GEMM with time complexityO(BiM r)

work page
[11]

The selection involves indexing into a (B, M r) tensor using (B,K, r) indices—costing O(BKr) memory operations

Routing-Weighted Selection and Accumulation.For each token–expert pair(b, t), the algorithm: • Extracts a rank-rslice fromX proj using precomputed tile coordinates(m b,t, nb,t), • Scales it by the routing weightg b,t, • Accumulates the result into a buffer indexed by column tilen b,t viascatter add. The selection involves indexing into a (B, M r) tensor u...

work page
[12]

tall-and-skinny

Output Reconstruction.The accumulated buffer Xsum ∈R B×(N r) is multiplied with the reshaped shared right factor Vflat ∈R (N r)×o: Y=X sum ·V flat ∈R B×o,(53) which is another dense GEMM with complexityO(BN ro). Total Time Complexity.Summing the dominant terms, the overall time complexity of TILEQ inference is: TTileQ =O BiM r+BN ro+BKr .(54) InsightsWhy ...

work page 2020
[13]

For the rotation component, our approach aligns with the rotation technique used in LOPRO(Gu et al., 2026)

and GPTVQ (Van Baalen et al., 2024), without incorporating advanced optimizations such as weight clipping or learnable codebooks. For the rotation component, our approach aligns with the rotation technique used in LOPRO(Gu et al., 2026). Baseline Methods.We primarily compare against three established baselines: GPTQ (Frantar et al., 2023), GPTVQ (Van Baal...

work page 2024