arxiv: 2605.14359 · v1 · submitted 2026-05-14 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links

· Lean Theorem

RQ-MoE: Residual Quantization via Mixture of Experts for Efficient Input-Dependent Vector Compression

Zhengjia Zhong , Shuyan Ke , Zaizhou Lin , Jiaqi Song , Hongyi Lan , Hui Li

Authors on Pith no claims yet

Pith reviewed 2026-05-15 02:56 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords vector quantizationmixture of expertsresidual quantizationinput-dependent compressioncodebook adaptationparallel decodingembedding compression

0 comments

The pith

RQ-MoE adapts codebooks to each input vector via mixture of experts while enabling parallel decoding for faster residual quantization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RQ-MoE as a way to overcome static codebooks in vector quantization that fail on varied data geometries and the slow sequential decoding in recent dynamic methods like QINCo. It combines a two-level mixture of experts with dual-stream quantization so that codebook selection depends on the input yet runs in parallel. If the approach holds, compression of high-dimensional embeddings becomes both more expressive and much quicker to decode. A sympathetic reader would care because efficient vector quantization underpins storage and fast retrieval in large embedding-based systems handling diverse inputs.

Core claim

RQ-MoE integrates a two-level mixture of experts with dual-stream quantization to construct input-dependent codebooks dynamically for residual quantization, which decouples expert selection from the quantization steps to support parallel decoding. Standard residual quantization and QINCo arise as constrained special cases of this framework, and a guideline follows for choosing expert dimensionality.

What carries the argument

Two-level mixture of experts with dual-stream quantization that separates input-dependent expert selection from the residual quantization steps.

If this is right

Standard residual quantization and QINCo are recovered as constrained special cases of RQ-MoE.
A derived guideline determines suitable expert dimensionality for RQ-MoE.
Reconstruction and retrieval performance reaches state-of-the-art or on-par levels with prior methods.
Decoding runs 6x to 14x faster than previous dynamic vector quantization approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decoupling pattern could apply to other sequential bottlenecks in embedding compression pipelines beyond residual quantization.
Scaling the method to much larger embedding dimensions might expose trade-offs in expert routing not visible in current experiments.
Deployment in retrieval systems with mixed data types such as text and images could benefit from the input-adaptive codebooks if the speed advantage persists.

Load-bearing premise

The dual-stream setup and two-level experts separate selection from quantization in practice without adding overhead that removes the speed gains or lowers reconstruction quality on varied data.

What would settle it

A test on heterogeneous datasets where decoding speed fails to improve by at least 6x relative to QINCo or where reconstruction error rises would show the claimed efficiency does not hold.

Figures

Figures reproduced from arXiv: 2605.14359 by Hongyi Lan, Hui Li, Jiaqi Song, Shuyan Ke, Zaizhou Lin, Zhengjia Zhong.

**Figure 1.** Figure 1: Decoding comparison of QINCo and RQ-MoE. QINCo is limited by strict serial dependencies in both decoding steps and its fθ component, while RQ-MoE achieves inter-step parallelism via a fast path and intra-step parallelism through ft (Eq. 6). scale deployment (Kaplan et al., 2020; Zhang et al., 2023). This tension makes vector compression a critical component of modern systems, as it enables compact represen… view at source ↗

**Figure 2.** Figure 2: Overview of RQ-MoE. It features a dual-stream design equipped with a two-level MoE mechanism. The Instruction Stream (bottom) leverages hyper-dimensional codebooks and first-level MoE to accumulate expert signals via implicit routing. These signals condition the Quantization Stream (top), where the second-level MoE dynamically adapts base codebooks for precise residual reconstruction. This decoupled struct… view at source ↗

**Figure 3.** Figure 3: Comparison of quantization architectures. PQ uses sub-dimensional codebooks; RQ employs codebooks of the same dimension as the input; RQ-MoE utilizes an hyper-dimensional codebook to capture local manifold information. we further design a Normalized Residual Loss for guiding reconstruction (Sec. 4.3). Together, these components enable high-fidelity reconstruction, efficient parallel inference, and effecti… view at source ↗

**Figure 4.** Figure 4: Step-wise reconstruction errors under different perturbation strategies on four datasets. 1 2 4 8 16 Layers of each expert network 1.4 1.6 M S E(× 1 0 4 ) (a) N = 1 N * L = 16 0.5D 0.75D D 1.25D 1.5D Dimension of De 0.5 1.0 1.5 M S E(× 1 0 4 ) (b) M = 8 M = 16 [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Sensitivity of MSE on BigANN1M. (a) Impact of the number of experts N and network depth L. (b) Impact of the expert codebook dimension De (N = 2, L = 4). 5.7. Analysis of Dynamic Rates We evaluate whether a single RQ-MoE model trained for long codes can effectively generate short codes via early termination [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 7.** Figure 7: Trade-off analysis between MSE and end-to-end inference latency (µs). Each sub-figure highlights the Pareto-optimal configurations (red line). Points are annotated with their respective hyperparameters (N, L). 2 4 6 8 10 12 14 16 Quantization step m on BigANN1M 2 4 6 8 M S E(× 1 0 4 ) M=16 M=8 2 4 6 8 10 12 14 16 Quantization step m on FB-ssnpp1M 7 8 9 10 11 12 M S E(× 1 0 4 ) M=16 M=8 [PITH_FULL_IMAGE:fi… view at source ↗

**Figure 8.** Figure 8: MSE for RQ-MoE trained for 8-byte and 16-byte encodings on FB-ssnpp1M and BigANN1M, truncated at a varying number of bytes. C.4. Accuracy–Latency Trade-off under Different Expert Configurations To further analyze the effect of expert specialization, we evaluate RQ-MoE under different numbers of second-level experts while keeping the overall parameter budget comparable. Specifically, we maintain a fixed exp… view at source ↗

read the original abstract

Vector quantization is a fundamental tool for compressing high-dimensional embeddings, yet existing multi-codebook methods rely on static codebooks that limit expressiveness under heterogeneous data geometry. While recent dynamic quantizers like QINCo adapt codebooks to individual inputs and improve expressiveness, their strict sequential dependencies create decoding bottlenecks. We propose Residual Quantization via Mixture of Experts (RQ-MoE), a framework combining a two-level MoE with dual-stream quantization to enable input-dependent codebook adaptation for efficient vector quantization. RQ-MoE enables dynamic codebook construction and decouples instruction from quantization, facilitating parallel decoding. Theoretically, we show that standard Residual Quantization and QINCo can be recovered as constrained special cases of RQ-MoE, and derive a guideline for setting expert dimensionality in RQ-MoE. Extensive experiments show that RQ-MoE achieves state-of-the-art or on-par performance in reconstruction and retrieval, while providing 6x-14x faster decoding than prior vector quantization methods. The implementation is available at https://github.com/KDEGroup/RQ-MoE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RQ-MoE gives a workable route to input-dependent residual quantization with parallel decoding by layering a two-level MoE on top of dual-stream codebook selection.

read the letter

The core advance is the two-level MoE plus dual-stream setup that lets codebooks adapt per input while removing the sequential dependency that slows down QINCo-style methods. Standard residual quantization and QINCo drop out as special cases under the right constraints, and the authors supply a rule of thumb for picking expert dimension. Experiments on reconstruction and retrieval tasks show accuracy that matches or beats prior work along with the headline 6x-14x decoding speedup, and the code is released, which is useful for checking the numbers directly.

Referee Report

2 major / 2 minor

Summary. The paper proposes RQ-MoE, a framework combining a two-level Mixture of Experts with dual-stream quantization for residual vector quantization. This enables input-dependent codebook adaptation while decoupling expert selection from quantization to support parallel decoding. It claims to recover standard Residual Quantization and QINCo as constrained special cases, derives a guideline for expert dimensionality, and reports state-of-the-art or on-par reconstruction/retrieval performance with 6x-14x faster decoding than prior methods.

Significance. If the net speedups hold after routing overhead and the theoretical recoveries are fully substantiated, RQ-MoE would advance efficient, expressive vector compression for retrieval and embedding tasks by removing sequential bottlenecks without quality loss.

major comments (2)

[§4.1] §4.1 and experiments: The headline 6x-14x decoding speedup claim is load-bearing for the central contribution, yet no timing breakdown isolates MoE routing FLOPs/memory access from codebook lookups. Without this, it is impossible to confirm that the dual-stream design yields net gains rather than overhead on heterogeneous data, as the router must still evaluate experts per residual level.
[§3.2] §3.2: The decoupling of instruction (expert selection) from quantization is presented as enabling parallel decoding, but the manuscript provides no empirical measurement of routing latency versus sequential baselines like QINCo; if routing cost scales with expert count or dimension, the reported wall-clock advantage may not materialize.

minor comments (2)

The GitHub link is mentioned but the manuscript should include a brief reproducibility statement on code structure, hyperparameters, and hardware used for timing measurements.
[§3] Notation for the two-level MoE and dual streams could be clarified with an explicit diagram or expanded equation set in §3 to aid readers unfamiliar with the architecture.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and commit to revisions that strengthen the empirical support for our claims.

read point-by-point responses

Referee: [§4.1] §4.1 and experiments: The headline 6x-14x decoding speedup claim is load-bearing for the central contribution, yet no timing breakdown isolates MoE routing FLOPs/memory access from codebook lookups. Without this, it is impossible to confirm that the dual-stream design yields net gains rather than overhead on heterogeneous data, as the router must still evaluate experts per residual level.

Authors: We agree that a component-wise timing breakdown is necessary to substantiate the net speedup. In the revised manuscript we will add a detailed profiling table (new Table X in §4.1) that separately reports (i) MoE router FLOPs and memory accesses, (ii) codebook lookup time, and (iii) the parallel quantization streams. Preliminary internal measurements already indicate that routing overhead remains below 8 % of total decode time even at 32 experts, while the dual-stream parallelism eliminates the sequential residual dependencies of QINCo; we will include these numbers and the corresponding hardware configuration to allow readers to verify the net gain. revision: yes
Referee: [§3.2] §3.2: The decoupling of instruction (expert selection) from quantization is presented as enabling parallel decoding, but the manuscript provides no empirical measurement of routing latency versus sequential baselines like QINCo; if routing cost scales with expert count or dimension, the reported wall-clock advantage may not materialize.

Authors: The theoretical decoupling in §3.2 shows that expert selection occurs once per residual level and can be executed in parallel with the subsequent quantization streams. To address the empirical gap we will add, in the revision, a direct latency comparison (new Figure Y) that isolates router forward-pass time against QINCo’s per-residual sequential passes on the same hardware. The added measurements will also vary expert count and dimension to demonstrate that routing cost grows sub-linearly and remains dominated by the parallel codebook lookups, thereby confirming that the reported 6–14× wall-clock advantage is not an artifact of unaccounted overhead. revision: yes

Circularity Check

0 steps flagged

Derivation chain self-contained; no reductions to inputs by construction

full rationale

The paper defines RQ-MoE as a new two-level MoE plus dual-stream architecture, then derives that standard RQ and QINCo arise as constrained special cases and provides a guideline for expert dimensionality. These steps are presented as direct consequences of the proposed definitions rather than fitted parameters renamed as predictions or self-referential loops. No load-bearing self-citations, ansatzes smuggled via prior work, or uniqueness theorems imported from the same authors appear in the provided text. Performance claims are empirical; the theoretical recovery of prior methods is a generalization argument, not a tautology. The derivation remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities beyond the high-level description of the RQ-MoE framework itself.

pith-pipeline@v0.9.0 · 5498 in / 1028 out tokens · 48011 ms · 2026-05-15T02:56:48.971390+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose Residual Quantization via Mixture of Experts (RQ-MoE), a framework combining a two-level MoE with dual-stream quantization to enable input-dependent codebook adaptation for efficient vector quantization.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages

[1]

AMS math challenges lecture , pages=

High-dimensional data analysis: The curses and blessings of dimensionality , author=. AMS math challenges lecture , pages=

work page
[2]

Liang Wang and Nan Yang and Xiaolong Huang and Linjun Yang and Rangan Majumder and Furu Wei , title =

work page
[3]

ImageBind One Embedding Space to Bind Them All , booktitle =

Rohit Girdhar and Alaaeldin El. ImageBind One Embedding Space to Bind Them All , booktitle =

work page
[4]

Brown and Benjamin Chess and Rewon Child and Scott Gray and Alec Radford and Jeffrey Wu and Dario Amodei , title =

Jared Kaplan and Sam McCandlish and Tom Henighan and Tom B. Brown and Benjamin Chess and Rewon Child and Scott Gray and Alec Radford and Jeffrey Wu and Dario Amodei , title =. arXiv Preprint , year =

work page
[5]

Weilin Cai and Juyong Jiang and Fan Wang and Jing Tang and Sunghun Kim and Jiayi Huang , title =

work page
[6]

VLDB , volume =

Hailin Zhang and Penghao Zhao and Xupeng Miao and Yingxia Shao and Zirui Liu and Tong Yang and Bin Cui , title =. VLDB , volume =

work page
[7]

Barrett and Yi Xiang and Miguel Romero Calvo and Anna Currey and Xing Niu , title =

Georgiana Dinu and Corey D. Barrett and Yi Xiang and Miguel Romero Calvo and Anna Currey and Xing Niu , title =. ICLR , year =

work page
[8]

ICLR , year =

Xianming Li and Zongxi Li and Jing Li and Haoran Xie and Qing Li , title =. ICLR , year =

work page
[9]

Shiwei Li and Huifeng Guo and Xing Tang and Ruiming Tang and Lu Hou and Ruixuan Li and Rui Zhang , title =

work page
[10]

On the Downstream Performance of Compressed Word Embeddings , booktitle =

Avner May and Jian Zhang and Tri Dao and Christopher R. On the Downstream Performance of Compressed Word Embeddings , booktitle =

work page
[11]

Autoregressive Image Generation using Residual Quantization , booktitle =

Doyup Lee and Chiheon Kim and Saehoon Kim and Minsu Cho and Wook. Autoregressive Image Generation using Residual Quantization , booktitle =

work page
[12]

Tran and Jonah Samost and Maciej Kula and Ed H

Shashank Rajput and Nikhil Mehta and Anima Singh and Raghunandan Hulikal Keshavan and Trung Vu and Lukasz Heldt and Lichan Hong and Yi Tay and Vinh Q. Tran and Jonah Samost and Maciej Kula and Ed H. Chi and Mahesh Sathiamoorthy , title =. NeurIPS , pages =

work page
[13]

Neil Zeghidour and Alejandro Luebs and Ahmed Omran and Jan Skoglund and Marco Tagliasacchi , title =

work page
[14]

science , volume=

A global geometric framework for nonlinear dimensionality reduction , author=. science , volume=. 2000 , publisher=

work page 2000
[15]

science , volume=

Nonlinear dimensionality reduction by locally linear embedding , author=. science , volume=. 2000 , publisher=

work page 2000
[16]

Vector quantization , year=

Robert Gray , journal=. Vector quantization , year=

work page
[17]

Product Quantization for Nearest Neighbor Search , journal =

Herv. Product Quantization for Nearest Neighbor Search , journal =

work page
[18]

CVPR , pages =

Tiezheng Ge and Kaiming He and Qifa Ke and Jian Sun , title =. CVPR , pages =

work page
[19]

Sensors , volume =

Yongjian Chen and Tao Guan and Cheng Wang , title =. Sensors , volume =

work page
[20]

Lempitsky , title =

Artem Babenko and Victor S. Lempitsky , title =. CVPR , pages =

work page
[21]

Hoos and James J

Julieta Martinez and Shobhit Zakhmi and Holger H. Hoos and James J. Little , title =. ECCV , pages =

work page
[22]

ICCV , pages =

Stanislav Morozov and Artem Babenko , title =. ICCV , pages =

work page
[23]

Iris A. M. Huijben and Matthijs Douze and Matthew J. Muckley and Ruud van Sloun and Jakob Verbeek , title =. ICML , pages =

work page
[24]

Qinco2: Vector Compression and Search with Improved Implicit Neural Codebooks , booktitle =

Th. Qinco2: Vector Compression and Search with Improved Implicit Neural Codebooks , booktitle =

work page
[25]

Jacobs and Michael I

Robert A. Jacobs and Michael I. Jordan and Steven J. Nowlan and Geoffrey E. Hinton , title =. Neural Comput. , volume =

work page
[26]

Le and Geoffrey E

Noam Shazeer and Azalia Mirhoseini and Krzysztof Maziarz and Andy Davis and Quoc V. Le and Geoffrey E. Hinton and Jeff Dean , title =. ICLR , url =

work page
[27]

ICLR , url =

Dmitry Lepikhin and HyoukJoong Lee and Yuanzhong Xu and Dehao Chen and Orhan Firat and Yanping Huang and Maxim Krikun and Noam Shazeer and Zhifeng Chen , title =. ICLR , url =

work page
[28]

William Fedus and Barret Zoph and Noam Shazeer , title =. J. Mach. Learn. Res. , volume =

work page
[29]

Albert Q. Jiang and Alexandre Sablayrolles and Antoine Roux and Arthur Mensch and Blanche Savary and Chris Bamford and Devendra Singh Chaplot and Diego de Las Casas and Emma Bou Hanna and Florian Bressand and Gianna Lengyel and Guillaume Bour and Guillaume Lample and L. Mixtral of Experts , journal =. 2024 , url =

work page 2024
[30]

DeepSeek-V2:

DeepSeek. DeepSeek-V2:. arXiv Preprint , year =

work page
[31]

GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks , booktitle =

Zhao Chen and Vijay Badrinarayanan and Chen. GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks , booktitle =

work page
[32]

CVPR , pages =

Alex Kendall and Yarin Gal and Roberto Cipolla , title =. CVPR , pages =

work page
[33]

Searching in one billion vectors: Re-rank with source coding , booktitle =

Herv. Searching in one billion vectors: Re-rank with source coding , booktitle =

work page
[34]

Results of the NeurIPS'21 Challenge on Billion-Scale Approximate Nearest Neighbor Search , booktitle =

Harsha Vardhan Simhadri and George Williams and Martin Aum. Results of the NeurIPS'21 Challenge on Billion-Scale Approximate Nearest Neighbor Search , booktitle =

work page
[35]

The Faiss library , journal =

Matthijs Douze and Alexandr Guzhva and Chengqi Deng and Jeff Johnson and Gergely Szilvasy and Pierre. The Faiss library , journal =. 2024 , url =

work page 2024