pith. machine review for the scientific record.

arxiv: 2605.04952 · v1 · submitted 2026-05-06 · 💻 cs.LG

Recognition: 3 theorem links


Adaptive Inverted-Index Routing for Granular Mixtures-of-Experts

Bernhard Sch\"olkopf, Klaus-Rudolf Kladny, Maximilian Mordig, Michael Muehlebach

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 17:52 UTC · model grok-4.3

classification 💻 cs.LG
keywords mixture of experts · routing mechanism · vector quantization · inverted index · granular experts · transformer models · approximate top-k

The pith

AIR-MoE approximates top-k expert routing in granular mixtures using vector quantization shortlisting followed by fine scoring.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces adaptive inverted-index routing for mixture-of-experts models to manage the increased routing costs that come with using many small experts. It proposes a two-stage process where tokens are first assigned to vector quantization codewords to build a shortlist of candidate experts, after which exact routing scores are computed only on that shortlist. This approach approximates the performance of full top-k routing without requiring any changes to the expert parameters or model architecture. A lower bound on the mass recall of the shortlist is derived to explain its behavior. Empirical evaluations indicate that this method outperforms previous routing strategies in settings with high expert granularity.

Core claim

AIR-MoE performs coarse shortlisting by assigning tokens to VQ codewords to construct a candidate set of experts, then computes exact routing scores restricted to this shortlist. This two-stage procedure approximates true top-k routing while avoiding full expert scoring and imposing no structural constraints on expert parameters. It serves as a drop-in replacement requiring no modifications to the model architecture or loss function, and a lower bound on mass recall is provided.
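
To make the two-stage flow concrete, here is a minimal NumPy sketch of routing in this style. It is a reading of the mechanism described above, not the paper's implementation: the shapes, the nearest-codeword assignment rule, the way the inverted index is built, and every name (expert_keys, codebook, shortlists, route) are illustrative assumptions.

    # Hedged sketch of two-stage routing: coarse VQ shortlisting, then exact
    # scoring restricted to the shortlist. All names and shapes are assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    d, num_experts, num_codewords, shortlist_size, k = 64, 1024, 32, 128, 8

    expert_keys = rng.standard_normal((num_experts, d))   # per-expert routing keys
    codebook = rng.standard_normal((num_codewords, d))    # VQ codewords

    # Inverted index: for each codeword, a pre-computed shortlist of the experts
    # whose keys score highest against it (one plausible construction).
    shortlists = np.argsort(codebook @ expert_keys.T, axis=1)[:, -shortlist_size:]

    def route(token):
        # Stage 1 (coarse): assign the token to its nearest codeword.
        code = np.argmin(((codebook - token) ** 2).sum(axis=1))
        candidates = shortlists[code]                      # candidate expert ids
        # Stage 2 (fine): exact routing scores, only on the shortlist.
        scores = expert_keys[candidates] @ token
        top = np.argsort(scores)[-k:]                      # approximate top-k
        gate = np.exp(scores[top] - scores[top].max())
        return candidates[top], gate / gate.sum()          # expert ids, gate weights

    experts, gate = route(rng.standard_normal(d))          # toy usage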

What carries the argument

The two-stage adaptive inverted-index routing mechanism, where vector quantization codewords enable coarse candidate set construction for subsequent fine scoring.

If this is right

  • Granular MoE models can scale to more experts without routing computation becoming prohibitive.
  • The method maintains compatibility with standard training procedures since no architectural or loss changes are needed.
  • Performance gains are observed over existing routing approaches in high-granularity regimes.
  • The mass recall lower bound offers a way to analyze and tune the shortlisting stage.
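
The mass-recall quantity that the bound concerns is easy to probe empirically. Below is a minimal sketch of one way to measure it, assuming "mass recall" means the fraction of the true top-k routing probability mass that the shortlist retains; the paper's precise definition and bound may differ, and all names are illustrative.

    # Hedged sketch: empirical mass recall of a shortlist against true top-k
    # routing weights. `scores` are full routing logits over all experts.
    import numpy as np

    def mass_recall(scores, shortlist, k):
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                      # full softmax routing weights
        true_topk = np.argsort(probs)[-k:]        # experts a full router would pick
        captured = np.intersect1d(true_topk, shortlist)
        return probs[captured].sum() / probs[true_topk].sum()

    # Toy check on random logits and a random 128-expert shortlist.
    rng = np.random.default_rng(0)
    print(mass_recall(rng.standard_normal(1024),
                      rng.choice(1024, size=128, replace=False), k=8))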

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar inverted-index techniques might apply to other sparse selection problems beyond MoE.
  • Increasing the number of VQ codewords could trade off shortlist size against recall quality in practice (a synthetic sweep illustrating this is sketched after this list).
  • Testing on even larger models would reveal if the approximation remains effective as model scale grows.
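
To illustrate the codebook-size trade-off noted above on purely synthetic data (this is not the paper's experiment or protocol), the sketch below sweeps the number of VQ codewords and reports how often a token's true top-k experts land in its shortlist. The clustering step, the shortlist rule, and all sizes are assumptions chosen for illustration; with real token statistics the curve could look quite different.

    # Hedged synthetic sweep: codebook size vs. shortlist hit rate (share of true
    # top-k experts that the coarse stage shortlists). Illustration only.
    import numpy as np

    rng = np.random.default_rng(0)
    d, num_experts, n_tokens, shortlist_size, k = 32, 512, 2000, 64, 8
    expert_keys = rng.standard_normal((num_experts, d))
    tokens = rng.standard_normal((n_tokens, d))

    def kmeans(x, m, iters=20):
        """Tiny Lloyd's k-means returning m centroids."""
        c = x[rng.choice(len(x), m, replace=False)]
        for _ in range(iters):
            assign = ((x[:, None, :] - c[None]) ** 2).sum(-1).argmin(axis=1)
            for j in range(m):
                if np.any(assign == j):
                    c[j] = x[assign == j].mean(axis=0)
        return c

    for m in (4, 16, 64):                          # codebook sizes to sweep
        codebook = kmeans(tokens, m)
        # Inverted index: top-scoring experts per codeword.
        shortlists = np.argsort(codebook @ expert_keys.T, axis=1)[:, -shortlist_size:]
        hits = []
        for t in tokens[:200]:
            code = np.argmin(((codebook - t) ** 2).sum(axis=1))
            true_topk = np.argsort(expert_keys @ t)[-k:]
            hits.append(np.isin(true_topk, shortlists[code]).mean())
        print(f"codewords={m:3d}  mean top-{k} hit rate={np.mean(hits):.2f}")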

Load-bearing premise

The vector quantization codewords must generate candidate sets with sufficiently high mass recall so that the fine scoring stage can recover near-optimal experts.
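
One natural way to formalize this premise, stated as a hedged reading rather than the paper's exact definitions or its bound, is in terms of the routing mass retained by the shortlist:

    % Hedged formalization; the symbols are ours, not necessarily the paper's.
    % p_e(x): full-softmax routing weight of expert e for token x,
    % T_k(x): true top-k experts, L(x): the VQ shortlist for token x.
    \mathrm{recall}_k(x) \;=\; \frac{\sum_{e \in T_k(x) \cap L(x)} p_e(x)}{\sum_{e \in T_k(x)} p_e(x)},
    \qquad \text{premise:}\quad \mathbb{E}_x\!\left[\mathrm{recall}_k(x)\right] \ge 1 - \varepsilon \ \text{for small } \varepsilon.

Under this reading, the fine-scoring stage can only redistribute weight among shortlisted experts, so recall close to one is what licenses treating the two-stage output as an approximation of true top-k routing.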

What would settle it

Observing a case where the shortlist recall falls significantly below the derived lower bound or where the final model performance underperforms a true top-k router on identical expert setups.

Figures

Figures reproduced from arXiv: 2605.04952 by Bernhard Schölkopf, Klaus-Rudolf Kladny, Maximilian Mordig, Michael Muehlebach.

Figure 1
Figure 1. Overview of AIR-MoE. The method consists of two parts: the coarse shortlisting stage uses vector quantization to select a codeword (blue) that stores a pre-computed expert shortlist L that references specific expert centroids (green) and is updated after each optimizer step. The fine scoring stage takes the shortlisted expert weights and scores them exactly. Notably, the codebook is learned via gradient-free…
Figure 2
Figure 2. Relative Reduction in PPL. The plots show the relative improvement in PPL with respect to the best baseline (PEER). Larger means better. AIR-MoE achieves consistent PPL improvements (up to 10%) in comparison to the PEER baseline, across model sizes and data sets.
Figure 3
Figure 3. Feedback Loop. AIR-MoE forms a bi-level optimization loop between gradient-learned token representations and a codebook updated via adaptive clustering. A Full Optimization Loop: Alg. 3 summarizes the full training procedure of AIR-MoE. The algorithm can be viewed as a bi-level optimization loop between gradient-trained model parameters and a gradient-free adaptive codebook. Let θ denote all parameters tra…
Figure 4
Figure 4. Qualitative Comparison of Expert Usage. Existing routing methods impose hard structural constraints on the expert selection mechanism, either via structural constraints (PEER) or grouping constraints. These constraints, in turn, lead to routing artifacts depicted in (a) and (b), respectively. AIR-MoE, in contrast, imposes no assumptions by computing an approximation of the top-K expert scores whose qualit…
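
Figures 1 and 3 describe a codebook that is refreshed without gradients after each optimizer step, forming a feedback loop with the gradient-trained token representations. The sketch below shows one common gradient-free update in that general spirit, an exponential-moving-average adaptive-clustering refresh; it is not the paper's Algorithm 3, and the decay constant, shapes, and names are assumptions.

    # Hedged sketch of a gradient-free codebook refresh: assign the step's token
    # representations to codewords, then nudge each codeword toward the mean of
    # its assigned tokens with an EMA. Not the paper's Alg. 3.
    import numpy as np

    def refresh_codebook(codebook, tokens, decay=0.99):
        """One adaptive-clustering update, run after each optimizer step."""
        d2 = ((tokens[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)               # hard VQ assignment per token
        new = codebook.copy()
        for j in range(len(codebook)):
            members = tokens[assign == j]
            if len(members):                     # leave unused codewords untouched
                new[j] = decay * codebook[j] + (1 - decay) * members.mean(axis=0)
        return new

    rng = np.random.default_rng(0)
    codebook = rng.standard_normal((32, 64))     # 32 codewords in a 64-dim space
    tokens = rng.standard_normal((4096, 64))     # token representations from a step
    codebook = refresh_codebook(codebook, tokens)
    # Per Figure 1, the per-codeword expert shortlists would then be rebuilt.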
Original abstract

Mixture-of-experts (MoE) models enable scalable transformer architectures by activating only a subset of experts per token. Recent evidence suggests that performance improves with increasingly granular experts, i.e., many small experts instead of a few large ones. However, this regime substantially increases routing cost, which can dominate computation. We introduce adaptive inverted-index routing for MoE (AIR-MoE), an inverted-index-inspired routing architecture based on vector quantization (VQ). In a first stage, AIR-MoE performs coarse shortlisting by assigning tokens to VQ codewords to construct a candidate set of experts. In a second stage, fine scoring computes exact routing scores restricted to this shortlist. This two-stage procedure approximates true top-k routing while avoiding full expert scoring and, in contrast to prior work, imposing no structural constraints on expert parameters. AIR-MoE serves as a drop-in replacement for standard routers and requires no modifications to the model architecture or loss function. We further provide a lower bound on the mass recall achieved by AIR-MoE that yields insights into its inner workings. Empirically, we demonstrate that AIR-MoE achieves improved performance compared to existing routing approaches in granular MoE settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces AIR-MoE, a two-stage routing architecture for granular Mixture-of-Experts models. The first stage uses vector quantization to assign tokens to codewords and construct a shortlist of candidate experts; the second stage performs exact fine-grained scoring only on this shortlist. The approach is presented as a drop-in replacement for standard routers that approximates true top-k routing without structural constraints on expert parameters or changes to the model/loss. A lower bound on the mass recall of the VQ shortlisting stage is derived, and empirical results are reported showing improved performance over existing routing methods in granular MoE regimes.

Significance. If the mass-recall lower bound is tight enough in practice and the reported gains are reproducible with proper controls, the work would be significant for scaling MoE transformers to finer expert granularity. The drop-in compatibility and avoidance of expert-parameter constraints distinguish it from prior constrained routers, potentially enabling broader adoption in large-scale training.

major comments (2)
  1. [Abstract] The central claim that AIR-MoE approximates true top-k routing rests on the VQ shortlist achieving sufficiently high mass recall for the fine-scoring stage to recover near-optimal experts. The abstract states that a lower bound is derived but provides neither the bound itself, its dependence on codebook size or granularity, nor any tightness analysis or empirical mass-recall values; without these, the approximation quality asserted in the strongest claim cannot be evaluated.
  2. [Empirical Evaluation] The empirical section is only referenced in the abstract: the abstract reports performance improvements over existing routers but supplies no experimental details, baselines, error bars, number of runs, or ablation on codebook size. This absence makes it impossible to verify whether the gains are attributable to the two-stage procedure or to other factors, directly affecting the soundness of the performance claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below. Where the comments identify opportunities for greater clarity, we have revised the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] The central claim that AIR-MoE approximates true top-k routing rests on the VQ shortlist achieving sufficiently high mass recall for the fine-scoring stage to recover near-optimal experts. The abstract states that a lower bound is derived but provides neither the bound itself, its dependence on codebook size or granularity, nor any tightness analysis or empirical mass-recall values; without these, the approximation quality asserted in the strongest claim cannot be evaluated.

    Authors: We agree that the abstract would benefit from explicit reference to the mass-recall lower bound and its properties to better support the central approximation claim. The full manuscript derives this bound in Section 3, establishing its dependence on codebook size and expert granularity. In the revised version we have updated the abstract with a concise statement of the bound and its key dependencies. We have also added empirical mass-recall values together with a tightness discussion in the experimental section so that readers can directly assess approximation quality. revision: yes

  2. Referee: [Empirical Evaluation] The empirical section is only referenced in the abstract: the abstract reports performance improvements over existing routers but supplies no experimental details, baselines, error bars, number of runs, or ablation on codebook size. This absence makes it impossible to verify whether the gains are attributable to the two-stage procedure or to other factors, directly affecting the soundness of the performance claim.

    Authors: We acknowledge that the abstract's brevity omits experimental specifics and that the empirical section would be strengthened by additional controls. The manuscript already compares against standard top-k and other routing baselines and includes codebook-size ablations; however, to directly address verifiability we have revised both the abstract and the empirical section to report results averaged over multiple independent runs with error bars, to list all baselines explicitly, and to present the codebook ablation results more prominently. These additions allow clearer attribution of gains to the two-stage procedure. revision: yes

Circularity Check

0 steps flagged

No circularity: AIR-MoE routing and mass-recall bound are independently derived

full rationale

The paper defines AIR-MoE as a two-stage procedure (VQ-based shortlisting followed by exact fine scoring on the candidate set) and separately derives a lower bound on mass recall from the VQ assignment properties. Neither the performance claim nor the bound reduces to a fitted parameter, self-referential definition, or self-citation chain; the approximation to top-k is presented as a consequence of the (bounded) recall rather than being assumed in the derivation. No load-bearing step matches any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the effectiveness of VQ shortlisting for preserving mass recall and on the empirical superiority of the resulting router; both are introduced by the paper rather than derived from prior results.

invented entities (1)
  • AIR-MoE two-stage router (no independent evidence)
    purpose: Efficient approximation of top-k expert selection for granular MoE
    New routing architecture introduced in the paper; no independent evidence supplied in the abstract.

pith-pipeline@v0.9.0 · 5522 in / 1184 out tokens · 92607 ms · 2026-05-08T17:52:50.415523+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

53 extracted references · 14 canonical work pages · 5 internal anchors

  1. [1]

    To Index or Not to Index: Optimizing Exact Maximum Inner Product Search

    Firas Abuzaid, Geet Sethi, Peter Bailis, and Matei Zaharia. To Index or Not to Index: Optimizing Exact Maximum Inner Product Search . International Conference on Data Engineering, pp. 1250--1261, 2019

  2. [2]

    Sampling Techniques for Kernel Methods

    Dimitris Achlioptas, Frank McSherry, and Bernhard Schölkopf. Sampling Techniques for Kernel Methods . Advances in Neural Information Processing Systems, 14, 2001

  3. [3]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints . Empirical Methods in Natural Language Processing, pp. 4895--4901, 2023

  4. [4]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer Normalization . arXiv preprint arXiv:1607.06450, 2016

  5. [5]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation . arXiv preprint arXiv:1308.3432, 2013

  6. [6]

    Datasheet for the pile

    Stella Biderman, Kieran Bicheno, and Leo Gao. Datasheet for the Pile . arXiv preprint arXiv:2201.07311, 2022

  7. [7]

    On the Representation Collapse of Sparse Mixture of Experts

    Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Bo Li, Saksham Singhal, Prakhar Bajaj, Xia Song, and Furu Wei. On the Representation Collapse of Sparse Mixture of Experts . Advances in Neural Information Processing Systems, 2022

  8. [8]

    Unified Scaling Laws for Routed Language Models

    Aidan Clark, Diego de Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, et al. Unified Scaling Laws for Routed Language Models . International Conference on Machine Learning, pp. 4057--4086, 2022

  9. [9]

    Common crawl creative commons corpus (c5)

    Common Crawl Foundation . Common crawl creative commons corpus (c5). Common Crawl, 2025. Available at https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons, accessed 2025-08-11

  10. [10]

    Concept Decompositions for Large Sparse Text Data Using Clustering

    Inderjit S. Dhillon and Dharmendra S. Modha. Concept Decompositions for Large Sparse Text Data Using Clustering . Machine Learning, 42(1): 143--175, 2001

  11. [11]

    On the Benefits of Learning to Route in Mixture-of-Experts Models

    Nishanth Dikkala, Nikhil Ghosh, Raghu Meka, Rina Panigrahy, Nikhil Vyas, and Xin Wang. On the Benefits of Learning to Route in Mixture-of-Experts Models . Conference on Empirical Methods in Natural Language Processing, pp. 9376--9396, 2023

  12. [12]

    On the Role of Discrete Representation in Sparse Mixture of Experts

    Giang Do, Kha Pham, Hung Le, and Truyen Tran. On the Role of Discrete Representation in Sparse Mixture of Experts . arXiv preprint arXiv:2411.19402, 2024

  13. [13]

    Kimi-VL Technical Report

    Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-VL Technical Report . arXiv preprint arXiv:2504.07491, 2025

  14. [14]

    GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

    Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. GLaM: Efficient Scaling of Language Models with Mixture-of-Experts . International Conference on Machine Learning, pp. 5547--5569, 2022

  15. [15]

    Taming Transformers for High-Resolution Image Synthesis

    Patrick Esser, Robin Rombach, and Björn Ommer. Taming Transformers for High-Resolution Image Synthesis . Conference on Computer Vision and Pattern Recognition, pp. 12873--12883, 2021

  16. [16]

    Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

    William Fedus, Barret Zoph, and Noam Shazeer. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity . Journal of Machine Learning Research, 23(120): 1--39, 2022

  17. [17]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020

  18. [18]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 Herd of Models . arXiv preprint arXiv:2407.21783, 2024

  19. [19]

    Vector Quantization

    Robert Gray. Vector Quantization . IEEE ASSP Magazine, 1(2): 4--29, 1984

  20. [20]

    Mixture of a Million Experts

    Xu Owen He. Mixture of a Million Experts . arXiv preprint arXiv:2407.04153, 2024

  21. [21]

    Approximate nearest neighbors: towards removing the curse of dimensionality

    Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality . ACM Symposium on Theory of Computing, pp. 604--613, 1998

  22. [22]

    Adaptive Mixtures of Local Experts

    Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive Mixtures of Local Experts . Neural Computation, 3(1): 79--87, 1991

  23. [23]

    Mixtral of Experts

    Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of Experts . arXiv preprint arXiv:2401.04088, 2024

  24. [24]

    Billion-scale similarity search with GPUs

    Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs . IEEE Transactions on Big Data, 7(3): 535--547, 2019

  25. [25]

    Hierarchies of Adaptive Experts

    Michael I. Jordan and Robert A. Jacobs. Hierarchies of Adaptive Experts . Advances in Neural Information Processing Systems, 4, 1991

  26. [26]

    Hierarchical Mixtures of Experts and the EM Algorithm

    Michael I. Jordan and Robert A. Jacobs. Hierarchical Mixtures of Experts and the EM Algorithm . Neural Computation, 6(2): 181--214, 1994

  27. [27]

    Scaling Laws for Fine-Grained Mixture of Experts

    Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski, et al. Scaling Laws for Fine-Grained Mixture of Experts . International Conference on Machine Learning, 2024

  28. [28]

    Large Memory Layers with Product Keys

    Guillaume Lample, Alexandre Sablayrolles, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. Large Memory Layers with Product Keys . Advances in Neural Information Processing Systems, 32, 2019

  29. [29]

    Evaluation of text clustering methods and their dataspace embeddings: an exploration

    Alain Lelu and Martine Cadot. Evaluation of text clustering methods and their dataspace embeddings: an exploration. International Federation of Classification Societies, pp. 131--139, 2019

  30. [30]

    GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding . International Conference on Learning Representations, 2021

  31. [31]

    An Algorithm for Vector Quantizer Design

    Yoseph Linde, Andres Buzo, and Robert Gray. An Algorithm for Vector Quantizer Design . IEEE Transactions on Communications, 28(1): 84--95, 1980

  32. [32]

    Some Methods of Classification and Analysis of Multivariate Observations

    James B. McQueen. Some Methods of Classification and Analysis of Multivariate Observations . Proc. of 5th Berkeley Symposium on Math. Stat. and Prob., pp. 281--297, 1967

  33. [33]

    Pointer Sentinel Mixture Models

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer Sentinel Mixture Models . International Conference on Learning Representations, 2017

  34. [34]

    Statistical advantages of perturbing cosine router in sparse mixture of experts

    Huy Nguyen, Pedram Akbarian, Trang Pham, Trang Nguyen, Shujian Zhang, and Nhat Ho. Statistical advantages of perturbing cosine router in sparse mixture of experts. arXiv preprint arXiv:2405.14131, 6, 2024

  35. [35]

    Memory Augmented Language Models through Mixture of Word Experts

    Cicero Nogueira Dos Santos, James Lee-Thorp, Isaac Noble, Chung-Ching Chang, and David Uthus. Memory Augmented Language Models through Mixture of Word Experts . North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4425--4438, 2024

  36. [36]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer . Journal of Machine Learning Research, 21(140): 1--67, 2020

  37. [37]

    Generating Diverse High-Fidelity Images with VQ-VAE-2

    Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Generating Diverse High-Fidelity Images with VQ-VAE-2 . Advances in Neural Information Processing Systems, 32, 2019

  38. [38]

    Hash Layers For Large Sparse Models

    Stephen Roller, Sainbayar Sukhbaatar, and Jason Weston. Hash Layers For Large Sparse Models . Advances in Neural Information Processing Systems, 34: 17555--17566, 2021

  39. [39]

    Extended Boolean Information Retrieval

    Gerard Salton, Edward A. Fox, and Harry Wu. Extended Boolean Information Retrieval . Communications of the ACM, 26(11): 1022--1036, 1983

  40. [40]

    GLU Variants Improve Transformer

    Noam Shazeer. GLU Variants Improve Transformer . arXiv preprint arXiv:2002.05202, 2020

  41. [41]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer . International Conference on Learning Representations, 2017

  42. [42]

    Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips)

    Anshumali Shrivastava and Ping Li. Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips). Advances in Neural Information Processing Systems, 27, 2014

  43. [43]

    Video Google: A text retrieval approach to object matching in videos

    Sivic and Zisserman. Video Google: A text retrieval approach to object matching in videos . International Conference on Computer Vision, pp. 1470--1477, 2003

  44. [44]

    RoFormer: Enhanced transformer with Rotary Position Embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with Rotary Position Embedding . Neurocomputing, 568: 127063, 2024a

  45. [45]

    MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts

    Zhenpeng Su, Zijia Lin, Xue Bai, Xing Wu, Yizhe Xiong, Haoran Lian, Guangyuan Ma, Hui Chen, Guiguang Ding, Wei Zhou, et al. MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts . arXiv preprint arXiv:2407.09816, 2024b

  46. [46]

    Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity

    Yehui Tang, Xiaosong Li, Fangcheng Liu, Wei Guo, Hang Zhou, Yaoyuan Wang, Kai Han, Xianzhi Yu, Jinpeng Li, Hui Zang, et al. Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity . arXiv preprint arXiv:2505.21411, 2025

  47. [47]

    Neural Discrete Representation Learning

    Aaron Van Den Oord, Oriol Vinyals, et al. Neural Discrete Representation Learning . Advances in Neural Information Processing Systems, 30, 2017

  48. [48]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention Is All You Need . Advances in Neural Information Processing Systems, 30, 2017

  49. [49]

    A Neural Corpus Indexer for Document Retrieval

    Yujing Wang, Yingyan Hou, Haonan Wang, Ziming Miao, Shibin Wu, Qi Chen, Yuqing Xia, Chengmin Chi, Guoshuai Zhao, Zheng Liu, et al. A Neural Corpus Indexer for Document Retrieval . Advances in Neural Information Processing Systems, 35: 25600--25614, 2022

  50. [50]

    Hierarchical Quantized Autoencoders

    Will Williams, Sam Ringer, Tom Ash, David MacLeod, Jamie Dougherty, and John Hughes. Hierarchical Quantized Autoencoders . Advances in Neural Information Processing Systems, 33: 4524--4535, 2020

  51. [51]

    MoEC: Mixture of Expert Clusters

    Yuan Xie, Shaohan Huang, Tianyu Chen, and Furu Wei. MoEC: Mixture of Expert Clusters . AAAI Conference on Artificial Intelligence, 37(11): 13807--13815, 2023

  52. [52]

    Mixture-of-Experts with Expert Choice Routing

    Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Quoc V Le, James Laudon, et al. Mixture-of-Experts with Expert Choice Routing . Advances in Neural Information Processing Systems, 35: 7103--7114, 2022

  53. [53]

    Bridging the Gap Between Indexing and Retrieval for Differentiable Search Index with Query Generation

    Shengyao Zhuang, Houxing Ren, Linjun Shou, Jian Pei, Ming Gong, Guido Zuccon, and Daxin Jiang. Bridging the Gap Between Indexing and Retrieval for Differentiable Search Index with Query Generation . arXiv preprint arXiv:2206.10128, 2022