pith. machine review for the scientific record.

arxiv: 2605.04952 · v1 · submitted 2026-05-06 · 💻 cs.LG

Recognition: 3 theorem links


Adaptive Inverted-Index Routing for Granular Mixtures-of-Experts

Bernhard Sch\"olkopf, Klaus-Rudolf Kladny, Maximilian Mordig, Michael Muehlebach

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 17:52 UTC · model grok-4.3

classification 💻 cs.LG
keywords mixture of experts · routing mechanism · vector quantization · inverted index · granular experts · transformer models · approximate top-k

The pith

AIR-MoE approximates top-k expert routing in granular mixtures using vector quantization shortlisting followed by fine scoring.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces adaptive inverted-index routing for mixture-of-experts models to manage the increased routing costs that come with using many small experts. It proposes a two-stage process where tokens are first assigned to vector quantization codewords to build a shortlist of candidate experts, after which exact routing scores are computed only on that shortlist. This approach approximates the performance of full top-k routing without requiring any changes to the expert parameters or model architecture. A lower bound on the mass recall of the shortlist is derived to explain its behavior. Empirical evaluations indicate that this method outperforms previous routing strategies in settings with high expert granularity.

Core claim

AIR-MoE performs coarse shortlisting by assigning tokens to VQ codewords to construct a candidate set of experts, then computes exact routing scores restricted to this shortlist. This two-stage procedure approximates true top-k routing while avoiding full expert scoring and imposing no structural constraints on expert parameters. It serves as a drop-in replacement requiring no modifications to the model architecture or loss function, and a lower bound on mass recall is provided.
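
To make the two-stage flow concrete, here is a minimal NumPy sketch of routing in this style. It is a reading of the mechanism described above, not the paper's implementation: the shapes, the nearest-codeword assignment rule, the way the inverted index is built, and every name (expert_keys, codebook, shortlists, route) are illustrative assumptions.

    # Hedged sketch of two-stage routing: coarse VQ shortlisting, then exact
    # scoring restricted to the shortlist. All names and shapes are assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    d, num_experts, num_codewords, shortlist_size, k = 64, 1024, 32, 128, 8

    expert_keys = rng.standard_normal((num_experts, d))   # per-expert routing keys
    codebook = rng.standard_normal((num_codewords, d))    # VQ codewords

    # Inverted index: for each codeword, a pre-computed shortlist of the experts
    # whose keys score highest against it (one plausible construction).
    shortlists = np.argsort(codebook @ expert_keys.T, axis=1)[:, -shortlist_size:]

    def route(token):
        # Stage 1 (coarse): assign the token to its nearest codeword.
        code = np.argmin(((codebook - token) ** 2).sum(axis=1))
        candidates = shortlists[code]                      # candidate expert ids
        # Stage 2 (fine): exact routing scores, only on the shortlist.
        scores = expert_keys[candidates] @ token
        top = np.argsort(scores)[-k:]                      # approximate top-k
        gate = np.exp(scores[top] - scores[top].max())
        return candidates[top], gate / gate.sum()          # expert ids, gate weights

    experts, gate = route(rng.standard_normal(d))          # toy usage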

What carries the argument

The two-stage adaptive inverted-index routing mechanism, where vector quantization codewords enable coarse candidate set construction for subsequent fine scoring.

If this is right

  • Granular MoE models can scale to more experts without routing computation becoming prohibitive.
  • The method maintains compatibility with standard training procedures since no architectural or loss changes are needed.
  • Performance gains are observed over existing routing approaches in high-granularity regimes.
  • The mass recall lower bound offers a way to analyze and tune the shortlisting stage.
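
The mass-recall quantity that the bound concerns is easy to probe empirically. Below is a minimal sketch of one way to measure it, assuming "mass recall" means the fraction of the true top-k routing probability mass that the shortlist retains; the paper's precise definition and bound may differ, and all names are illustrative.

    # Hedged sketch: empirical mass recall of a shortlist against true top-k
    # routing weights. `scores` are full routing logits over all experts.
    import numpy as np

    def mass_recall(scores, shortlist, k):
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                      # full softmax routing weights
        true_topk = np.argsort(probs)[-k:]        # experts a full router would pick
        captured = np.intersect1d(true_topk, shortlist)
        return probs[captured].sum() / probs[true_topk].sum()

    # Toy check on random logits and a random 128-expert shortlist.
    rng = np.random.default_rng(0)
    print(mass_recall(rng.standard_normal(1024),
                      rng.choice(1024, size=128, replace=False), k=8))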

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar inverted-index techniques might apply to other sparse selection problems beyond MoE.
  • Increasing the number of VQ codewords could trade off shortlist size against recall quality in practice (a synthetic sweep illustrating this is sketched after this list).
  • Testing on even larger models would reveal if the approximation remains effective as model scale grows.
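
To illustrate the codebook-size trade-off noted above on purely synthetic data (this is not the paper's experiment or protocol), the sketch below sweeps the number of VQ codewords and reports how often a token's true top-k experts land in its shortlist. The clustering step, the shortlist rule, and all sizes are assumptions chosen for illustration; with real token statistics the curve could look quite different.

    # Hedged synthetic sweep: codebook size vs. shortlist hit rate (share of true
    # top-k experts that the coarse stage shortlists). Illustration only.
    import numpy as np

    rng = np.random.default_rng(0)
    d, num_experts, n_tokens, shortlist_size, k = 32, 512, 2000, 64, 8
    expert_keys = rng.standard_normal((num_experts, d))
    tokens = rng.standard_normal((n_tokens, d))

    def kmeans(x, m, iters=20):
        """Tiny Lloyd's k-means returning m centroids."""
        c = x[rng.choice(len(x), m, replace=False)]
        for _ in range(iters):
            assign = ((x[:, None, :] - c[None]) ** 2).sum(-1).argmin(axis=1)
            for j in range(m):
                if np.any(assign == j):
                    c[j] = x[assign == j].mean(axis=0)
        return c

    for m in (4, 16, 64):                          # codebook sizes to sweep
        codebook = kmeans(tokens, m)
        # Inverted index: top-scoring experts per codeword.
        shortlists = np.argsort(codebook @ expert_keys.T, axis=1)[:, -shortlist_size:]
        hits = []
        for t in tokens[:200]:
            code = np.argmin(((codebook - t) ** 2).sum(axis=1))
            true_topk = np.argsort(expert_keys @ t)[-k:]
            hits.append(np.isin(true_topk, shortlists[code]).mean())
        print(f"codewords={m:3d}  mean top-{k} hit rate={np.mean(hits):.2f}")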

Load-bearing premise

The vector quantization codewords must generate candidate sets with sufficiently high mass recall so that the fine scoring stage can recover near-optimal experts.
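
One natural way to formalize this premise, stated as a hedged reading rather than the paper's exact definitions or its bound, is in terms of the routing mass retained by the shortlist:

    % Hedged formalization; the symbols are ours, not necessarily the paper's.
    % p_e(x): full-softmax routing weight of expert e for token x,
    % T_k(x): true top-k experts, L(x): the VQ shortlist for token x.
    \mathrm{recall}_k(x) \;=\; \frac{\sum_{e \in T_k(x) \cap L(x)} p_e(x)}{\sum_{e \in T_k(x)} p_e(x)},
    \qquad \text{premise:}\quad \mathbb{E}_x\!\left[\mathrm{recall}_k(x)\right] \ge 1 - \varepsilon \ \text{for small } \varepsilon.

Under this reading, the fine-scoring stage can only redistribute weight among shortlisted experts, so recall close to one is what licenses treating the two-stage output as an approximation of true top-k routing.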

What would settle it

Observing a case where the shortlist recall falls significantly below the derived lower bound or where the final model performance underperforms a true top-k router on identical expert setups.

Figures

Figures reproduced from arXiv: 2605.04952 by Bernhard Schölkopf, Klaus-Rudolf Kladny, Maximilian Mordig, Michael Muehlebach.

Figure 1
Figure 1. Overview of AIR-MoE. The method consists of two parts: the coarse shortlisting stage uses vector quantization to select a codeword (blue) that stores a pre-computed expert shortlist L that references specific expert centroids (green) and is updated after each optimizer step. The fine scoring stage takes the shortlisted expert weights and scores them exactly. Notably, the codebook is learned via gradient-free…
Figure 2
Figure 2. Relative Reduction in PPL. The plots show the relative improvement in PPL with respect to the best baseline (PEER). Larger means better. AIR-MoE achieves consistent PPL improvements (up to 10%) in comparison to the PEER baseline, across model sizes and data sets.
Figure 3
Figure 3. Feedback Loop. AIR-MoE forms a bi-level optimization loop between gradient-learned token representations and a codebook updated via adaptive clustering. A Full Optimization Loop: Alg. 3 summarizes the full training procedure of AIR-MoE. The algorithm can be viewed as a bi-level optimization loop between gradient-trained model parameters and a gradient-free adaptive codebook. Let θ denote all parameters tra…
Figure 4
Figure 4. Qualitative Comparison of Expert Usage. Existing routing methods impose hard structural constraints on the expert selection mechanism, either via structural constraints (PEER) or grouping constraints. These constraints, in turn, lead to routing artifacts depicted in (a) and (b), respectively. AIR-MoE, in contrast, imposes no assumptions by computing an approximation of the top-K expert scores whose qualit…
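
Figures 1 and 3 describe a codebook that is refreshed without gradients after each optimizer step, forming a feedback loop with the gradient-trained token representations. The sketch below shows one common gradient-free update in that general spirit, an exponential-moving-average adaptive-clustering refresh; it is not the paper's Algorithm 3, and the decay constant, shapes, and names are assumptions.

    # Hedged sketch of a gradient-free codebook refresh: assign the step's token
    # representations to codewords, then nudge each codeword toward the mean of
    # its assigned tokens with an EMA. Not the paper's Alg. 3.
    import numpy as np

    def refresh_codebook(codebook, tokens, decay=0.99):
        """One adaptive-clustering update, run after each optimizer step."""
        d2 = ((tokens[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)               # hard VQ assignment per token
        new = codebook.copy()
        for j in range(len(codebook)):
            members = tokens[assign == j]
            if len(members):                     # leave unused codewords untouched
                new[j] = decay * codebook[j] + (1 - decay) * members.mean(axis=0)
        return new

    rng = np.random.default_rng(0)
    codebook = rng.standard_normal((32, 64))     # 32 codewords in a 64-dim space
    tokens = rng.standard_normal((4096, 64))     # token representations from a step
    codebook = refresh_codebook(codebook, tokens)
    # Per Figure 1, the per-codeword expert shortlists would then be rebuilt.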
Original abstract

Mixture-of-experts (MoE) models enable scalable transformer architectures by activating only a subset of experts per token. Recent evidence suggests that performance improves with increasingly granular experts, i.e., many small experts instead of a few large ones. However, this regime substantially increases routing cost, which can dominate computation. We introduce adaptive inverted-index routing for MoE (AIR-MoE), an inverted-index-inspired routing architecture based on vector quantization (VQ). In a first stage, AIR-MoE performs coarse shortlisting by assigning tokens to VQ codewords to construct a candidate set of experts. In a second stage, fine scoring computes exact routing scores restricted to this shortlist. This two-stage procedure approximates true top-k routing while avoiding full expert scoring and, in contrast to prior work, imposing no structural constraints on expert parameters. AIR-MoE serves as a drop-in replacement for standard routers and requires no modifications to the model architecture or loss function. We further provide a lower bound on the mass recall achieved by AIR-MoE that yields insights into its inner workings. Empirically, we demonstrate that AIR-MoE achieves improved performance compared to existing routing approaches in granular MoE settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces AIR-MoE, a two-stage routing architecture for granular Mixture-of-Experts models. The first stage uses vector quantization to assign tokens to codewords and construct a shortlist of candidate experts; the second stage performs exact fine-grained scoring only on this shortlist. The approach is presented as a drop-in replacement for standard routers that approximates true top-k routing without structural constraints on expert parameters or changes to the model/loss. A lower bound on the mass recall of the VQ shortlisting stage is derived, and empirical results are reported showing improved performance over existing routing methods in granular MoE regimes.

Significance. If the mass-recall lower bound is tight enough in practice and the reported gains are reproducible with proper controls, the work would be significant for scaling MoE transformers to finer expert granularity. The drop-in compatibility and avoidance of expert-parameter constraints distinguish it from prior constrained routers, potentially enabling broader adoption in large-scale training.

major comments (2)
  1. [Abstract] The central claim that AIR-MoE approximates true top-k routing rests on the VQ shortlist achieving sufficiently high mass recall for the fine-scoring stage to recover near-optimal experts. The abstract states that a lower bound is derived but provides neither the bound itself, its dependence on codebook size or granularity, nor any tightness analysis or empirical mass-recall values; without these, the approximation quality asserted in the strongest claim cannot be evaluated.
  2. [Empirical Evaluation] The empirical section is only referenced in the abstract: the abstract reports performance improvements over existing routers but supplies no experimental details, baselines, error bars, number of runs, or ablation on codebook size. This absence makes it impossible to verify whether the gains are attributable to the two-stage procedure or to other factors, directly affecting the soundness of the performance claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below. Where the comments identify opportunities for greater clarity, we have revised the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] The central claim that AIR-MoE approximates true top-k routing rests on the VQ shortlist achieving sufficiently high mass recall for the fine-scoring stage to recover near-optimal experts. The abstract states that a lower bound is derived but provides neither the bound itself, its dependence on codebook size or granularity, nor any tightness analysis or empirical mass-recall values; without these, the approximation quality asserted in the strongest claim cannot be evaluated.

    Authors: We agree that the abstract would benefit from explicit reference to the mass-recall lower bound and its properties to better support the central approximation claim. The full manuscript derives this bound in Section 3, establishing its dependence on codebook size and expert granularity. In the revised version we have updated the abstract with a concise statement of the bound and its key dependencies. We have also added empirical mass-recall values together with a tightness discussion in the experimental section so that readers can directly assess approximation quality. revision: yes

  2. Referee: [Empirical Evaluation] The empirical section is only referenced in the abstract: the abstract reports performance improvements over existing routers but supplies no experimental details, baselines, error bars, number of runs, or ablation on codebook size. This absence makes it impossible to verify whether the gains are attributable to the two-stage procedure or to other factors, directly affecting the soundness of the performance claim.

    Authors: We acknowledge that the abstract's brevity omits experimental specifics and that the empirical section would be strengthened by additional controls. The manuscript already compares against standard top-k and other routing baselines and includes codebook-size ablations; however, to directly address verifiability we have revised both the abstract and the empirical section to report results averaged over multiple independent runs with error bars, to list all baselines explicitly, and to present the codebook ablation results more prominently. These additions allow clearer attribution of gains to the two-stage procedure. revision: yes

Circularity Check

0 steps flagged

No circularity: AIR-MoE routing and mass-recall bound are independently derived

full rationale

The paper defines AIR-MoE as a two-stage procedure (VQ-based shortlisting followed by exact fine scoring on the candidate set) and separately derives a lower bound on mass recall from the VQ assignment properties. Neither the performance claim nor the bound reduces to a fitted parameter, self-referential definition, or self-citation chain; the approximation to top-k is presented as a consequence of the (bounded) recall rather than being assumed in the derivation. No load-bearing step matches any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the effectiveness of VQ shortlisting for preserving mass recall and on the empirical superiority of the resulting router; both are introduced by the paper rather than derived from prior results.

invented entities (1)
  • AIR-MoE two-stage router (no independent evidence)
    purpose: Efficient approximation of top-k expert selection for granular MoE
    New routing architecture introduced in the paper; no independent evidence supplied in the abstract.

pith-pipeline@v0.9.0 · 5522 in / 1184 out tokens · 92607 ms · 2026-05-08T17:52:50.415523+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

53 extracted references · 14 canonical work pages · 5 internal anchors

  1. [1]

    To Index or Not to Index: Optimizing Exact Maximum Inner Product Search

    Firas Abuzaid, Geet Sethi, Peter Bailis, and Matei Zaharia. To Index or Not to Index: Optimizing Exact Maximum Inner Product Search . International Conference on Data Engineering, pp. 1250--1261, 2019

  2. [2]

    Sampling Techniques for Kernel Methods

    Dimitris Achlioptas, Frank McSherry, and Bernhard Schölkopf. Sampling Techniques for Kernel Methods . Advances in Neural Information Processing Systems, 14, 2001

  3. [3]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints . Empirical Methods in Natural Language Processing, pp. 4895--4901, 2023

  4. [4]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer Normalization . arXiv preprint arXiv:1607.06450, 2016

  5. [5]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation . arXiv preprint arXiv:1308.3432, 2013

  6. [6]

    Datasheet for the pile

    Stella Biderman, Kieran Bicheno, and Leo Gao. Datasheet for the Pile . arXiv preprint arXiv:2201.07311, 2022

  7. [7]

    On the Representation Collapse of Sparse Mixture of Experts

    Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Bo Li, Saksham Singhal, Prakhar Bajaj, Xia Song, and Furu Wei. On the Representation Collapse of Sparse Mixture of Experts . Advances in Neural Information Processing Systems, 2022

  8. [8]

    Unified Scaling Laws for Routed Language Models

    Aidan Clark, Diego de Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, et al. Unified Scaling Laws for Routed Language Models . International Conference on Machine Learning, pp. 4057--4086, 2022

  9. [9]

    Common crawl creative commons corpus (c5)

    Common Crawl Foundation . Common crawl creative commons corpus (c5). Common Crawl, 2025. Available at https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons, accessed 2025-08-11

  10. [10]

    Concept Decompositions for Large Sparse Text Data Using Clustering

    Inderjit S. Dhillon and Dharmendra S. Modha. Concept Decompositions for Large Sparse Text Data Using Clustering . Machine Learning, 42(1): 143--175, 2001

  11. [11]

    On the Benefits of Learning to Route in Mixture-of-Experts Models

    Nishanth Dikkala, Nikhil Ghosh, Raghu Meka, Rina Panigrahy, Nikhil Vyas, and Xin Wang. On the Benefits of Learning to Route in Mixture-of-Experts Models . Conference on Empirical Methods in Natural Language Processing, pp. 9376--9396, 2023

  12. [12]

    On the Role of Discrete Representation in Sparse Mixture of Experts

    Giang Do, Kha Pham, Hung Le, and Truyen Tran. On the Role of Discrete Representation in Sparse Mixture of Experts . arXiv preprint arXiv:2411.19402, 2024

  13. [13]

    Kimi-VL Technical Report

    Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-VL Technical Report . arXiv preprint arXiv:2504.07491, 2025

  14. [14]

    GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

    Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. GLaM: Efficient Scaling of Language Models with Mixture-of-Experts . International Conference on Machine Learning, pp. 5547--5569, 2022

  15. [15]

    Taming Transformers for High-Resolution Image Synthesis

    Patrick Esser, Robin Rombach, and Björn Ommer. Taming Transformers for High-Resolution Image Synthesis . Conference on Computer Vision and Pattern Recognition, pp. 12873--12883, 2021

  16. [16]

    Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

    William Fedus, Barret Zoph, and Noam Shazeer. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity . Journal of Machine Learning Research, 23(120): 1--39, 2022

  17. [17]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020

  18. [18]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 Herd of Models . arXiv preprint arXiv:2407.21783, 2024

  19. [19]

    Vector Quantization

    Robert Gray. Vector Quantization . IEEE ASSP Magazine, 1(2): 4--29, 1984

  20. [20]

    Mixture of a Million Experts

    Xu Owen He. Mixture of a Million Experts . arXiv preprint arXiv:2407.04153, 2024

  21. [21]

    Approximate nearest neighbors: towards removing the curse of dimensionality

    Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality . ACM Symposium on Theory of Computing, pp. 604--613, 1998

  22. [22]

    Adaptive Mixtures of Local Experts

    Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive Mixtures of Local Experts . Neural Computation, 3(1): 79--87, 1991

  23. [23]

    Mixtral of Experts

    Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of Experts . arXiv preprint arXiv:2401.04088, 2024

  24. [24]

    Billion-scale similarity search with GPUs

    Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs . IEEE Transactions on Big Data, 7(3): 535--547, 2019

  25. [25]

    Hierarchies of Adaptive Experts

    Michael I. Jordan and Robert A. Jacobs. Hierarchies of Adaptive Experts . Advances in Neural Information Processing Systems, 4, 1991

  26. [26]

    Hierarchical Mixtures of Experts and the EM Algorithm

    Michael I. Jordan and Robert A. Jacobs. Hierarchical Mixtures of Experts and the EM Algorithm . Neural Computation, 6(2): 181--214, 1994

  27. [27]

    Scaling Laws for Fine-Grained Mixture of Experts

    Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski, et al. Scaling Laws for Fine-Grained Mixture of Experts . International Conference on Machine Learning, 2024

  28. [28]

    Large Memory Layers with Product Keys

    Guillaume Lample, Alexandre Sablayrolles, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. Large Memory Layers with Product Keys . Advances in Neural Information Processing Systems, 32, 2019

  29. [29]

    Evaluation of text clustering methods and their dataspace embeddings: an exploration

    Alain Lelu and Martine Cadot. Evaluation of text clustering methods and their dataspace embeddings: an exploration. International Federation of Classification Societies, pp. 131--139, 2019

  30. [30]

    GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding . International Conference on Learning Representations, 2021

  31. [31]

    An Algorithm for Vector Quantizer Design

    Yoseph Linde, Andres Buzo, and Robert Gray. An Algorithm for Vector Quantizer Design . IEEE Transactions on Communications, 28(1): 84--95, 1980

  32. [32]

    Some Methods of Classification and Analysis of Multivariate Observations

    James B. McQueen. Some Methods of Classification and Analysis of Multivariate Observations . Proc. of 5th Berkeley Symposium on Math. Stat. and Prob., pp. 281--297, 1967

  33. [33]

    Pointer Sentinel Mixture Models

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer Sentinel Mixture Models . International Conference on Learning Representations, 2017

  34. [34]

    Statistical advantages of perturbing cosine router in sparse mixture of experts

    Huy Nguyen, Pedram Akbarian, Trang Pham, Trang Nguyen, Shujian Zhang, and Nhat Ho. Statistical advantages of perturbing cosine router in sparse mixture of experts. arXiv preprint arXiv:2405.14131, 6, 2024

  35. [35]

    Memory Augmented Language Models through Mixture of Word Experts

    Cicero Nogueira Dos Santos, James Lee-Thorp, Isaac Noble, Chung-Ching Chang, and David Uthus. Memory Augmented Language Models through Mixture of Word Experts . North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4425--4438, 2024

  36. [36]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer . Journal of Machine Learning Research, 21(140): 1--67, 2020

  37. [37]

    Generating Diverse High-Fidelity Images with VQ-VAE-2

    Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Generating Diverse High-Fidelity Images with VQ-VAE-2 . Advances in Neural Information Processing Systems, 32, 2019

  38. [38]

    Hash Layers For Large Sparse Models

    Stephen Roller, Sainbayar Sukhbaatar, and Jason Weston. Hash Layers For Large Sparse Models . Advances in Neural Information Processing Systems, 34: 17555--17566, 2021

  39. [39]

    Extended Boolean Information Retrieval

    Gerard Salton, Edward A. Fox, and Harry Wu. Extended Boolean Information Retrieval . Communications of the ACM, 26(11): 1022--1036, 1983

  40. [40]

    GLU Variants Improve Transformer

    Noam Shazeer. GLU Variants Improve Transformer . arXiv preprint arXiv:2002.05202, 2020

  41. [41]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer . International Conference on Learning Representations, 2017

  42. [42]

    Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips)

    Anshumali Shrivastava and Ping Li. Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips). Advances in Neural Information Processing Systems, 27, 2014

  43. [43]

    Video Google: A text retrieval approach to object matching in videos

    Sivic and Zisserman. Video Google: A text retrieval approach to object matching in videos . International Conference on Computer Vision, pp. 1470--1477, 2003

  44. [44]

    RoFormer: Enhanced transformer with Rotary Position Embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with Rotary Position Embedding . Neurocomputing, 568: 127063, 2024a

  45. [45]

    MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts

    Zhenpeng Su, Zijia Lin, Xue Bai, Xing Wu, Yizhe Xiong, Haoran Lian, Guangyuan Ma, Hui Chen, Guiguang Ding, Wei Zhou, et al. MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts . arXiv preprint arXiv:2407.09816, 2024b

  46. [46]

    Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity

    Yehui Tang, Xiaosong Li, Fangcheng Liu, Wei Guo, Hang Zhou, Yaoyuan Wang, Kai Han, Xianzhi Yu, Jinpeng Li, Hui Zang, et al. Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity . arXiv preprint arXiv:2505.21411, 2025

  47. [47]

    Neural Discrete Representation Learning

    Aaron Van Den Oord, Oriol Vinyals, et al. Neural Discrete Representation Learning . Advances in Neural Information Processing Systems, 30, 2017

  48. [48]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention Is All You Need . Advances in Neural Information Processing Systems, 30, 2017

  49. [49]

    A Neural Corpus Indexer for Document Retrieval

    Yujing Wang, Yingyan Hou, Haonan Wang, Ziming Miao, Shibin Wu, Qi Chen, Yuqing Xia, Chengmin Chi, Guoshuai Zhao, Zheng Liu, et al. A Neural Corpus Indexer for Document Retrieval . Advances in Neural Information Processing Systems, 35: 25600--25614, 2022

  50. [50]

    Hierarchical Quantized Autoencoders

    Will Williams, Sam Ringer, Tom Ash, David MacLeod, Jamie Dougherty, and John Hughes. Hierarchical Quantized Autoencoders . Advances in Neural Information Processing Systems, 33: 4524--4535, 2020

  51. [51]

    MoEC: Mixture of Expert Clusters

    Yuan Xie, Shaohan Huang, Tianyu Chen, and Furu Wei. MoEC: Mixture of Expert Clusters . AAAI Conference on Artificial Intelligence, 37(11): 13807--13815, 2023

  52. [52]

    Mixture-of-Experts with Expert Choice Routing

    Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Quoc V Le, James Laudon, et al. Mixture-of-Experts with Expert Choice Routing . Advances in Neural Information Processing Systems, 35: 7103--7114, 2022

  53. [53]

    Bridging the Gap Between Indexing and Retrieval for Differentiable Search Index with Query Generation

    Shengyao Zhuang, Houxing Ren, Linjun Shou, Jian Pei, Ming Gong, Guido Zuccon, and Daxin Jiang. Bridging the Gap Between Indexing and Retrieval for Differentiable Search Index with Query Generation . arXiv preprint arXiv:2206.10128, 2022