pith. machine review for the scientific record.

arxiv: 2602.01219 · v5 · submitted 2026-02-01 · 💻 cs.LG · cs.CV

Recognition: no theorem link

Mixture-of-Top-k Attention: Efficient Attention via Scalable Fast Weights

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 08:49 UTC · model grok-4.3

classification: 💻 cs.LG · cs.CV
keywords: efficient attention · mixture of experts · fast weights · transformers · scalable attention · vision transformers · top-k routing

The pith

Mixture-of-Top-k Attention scales self-attention by routing queries to deformable fast-weight experts selected via landmark queries, combined with a shared compressed expert.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper views vanilla self-attention as an N-width fast-weight MLP whose expressive capacity grows with sequence length but which becomes unscalable for long sequences. It frames prior efficient attention methods as ways to make those fast weights scalable via routing or compression, then introduces Mixture-of-Top-k Attention (MiTA). MiTA gathers top-k key-value pairs through a small set of landmark queries to create query-aware deformable experts and compresses the wide MLP into one narrower shared expert. Experiments on vision tasks show it matches or exceeds prior attention variants in accuracy while using less compute, with side effects such as emergent token pruning.

Core claim

MiTA employs a small set of landmark queries to gather top-k attended key-value pairs as query-aware, deformable routed experts, while compressing the N-width MLP into a narrower shared expert. This improves the flexibility of prior MoE attention (from rigid to deformable fast-weight experts) and the scalability of prior top-k attention (from a query-specific set to a reusable top-k set).

What carries the argument

Mixture-of-Top-k Attention (MiTA): landmark queries select a reusable top-k set of key-value pairs as deformable routed experts plus one narrower shared expert that replaces the full N-width fast-weight MLP.
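
To make the routed-plus-shared construction concrete, here is a minimal sketch, assuming single-head attention, 1D window pooling for landmark extraction, top-1 query routing, and pooled key-value pairs standing in for the compressed shared expert; these are illustrative choices, not the paper's exact implementation (see the linked repository for that).

    # Minimal single-head NumPy sketch of the routing arithmetic described
    # above. Window pooling, top-1 routing, and the pooled shared expert
    # are illustrative assumptions, not the paper's exact method.
    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def mita_attention(Q, K, V, m=4, k=8):
        """Q, K, V: (N, d) arrays; m landmarks; top-k pairs per expert."""
        N, d = Q.shape
        win = N // m  # evenly split, non-overlapping windows (1D here)

        # 1) Landmark queries: average-pool the queries in each window.
        landmarks = Q[:m * win].reshape(m, win, d).mean(axis=1)    # (m, d)

        # 2) Each landmark gathers its top-k attended key-value pairs,
        #    forming a reusable, deformable expert.
        scores = landmarks @ K.T / np.sqrt(d)                      # (m, N)
        topk = np.argsort(scores, axis=1)[:, -k:]                  # (m, k)

        # 3) Compressed shared expert: pool the N key-value pairs down
        #    to m, a stand-in for the narrower shared fast-weight MLP.
        Kc = K[:m * win].reshape(m, win, d).mean(axis=1)           # (m, d)
        Vc = V[:m * win].reshape(m, win, d).mean(axis=1)           # (m, d)

        # 4) Route each query to one expert, then attend over that
        #    expert's k pairs plus the m compressed pairs: (k + m) << N.
        route = np.argmax(Q @ landmarks.T, axis=1)                 # (N,)
        out = np.empty_like(Q)
        for i in range(N):
            idx = topk[route[i]]
            Ki = np.concatenate([K[idx], Kc])                      # (k+m, d)
            Vi = np.concatenate([V[idx], Vc])
            out[i] = softmax(Q[i] @ Ki.T / np.sqrt(d)) @ Vi
        return out

    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((64, 16)) for _ in range(3))
    print(mita_attention(Q, K, V).shape)  # (64, 16)

The cost shape is the point: each top-k set is computed once per landmark and reused by every query routed to it, so each query attends over only k + m pairs instead of all N.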

If this is right

  • The reusable top-k set allows attention cost to remain sub-quadratic in sequence length while still being query-dependent (a back-of-envelope cost sketch follows this list).
  • The shared compressed expert reduces parameter count and memory footprint relative to maintaining separate experts per block.
  • Emergent token pruning appears during inference, dynamically shortening effective sequence length.
  • The design generalizes from standard attention without requiring architectural changes at test time.
  • Performance on vision benchmarks improves in both accuracy and speed over rigid-block MoE attention and query-specific top-k methods.
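
On the first bullet's sub-quadratic claim: a back-of-envelope FLOP count under the same single-head, top-1-routing assumptions as the sketch above; the constants and the choices of m and k are illustrative, not the paper's.

    # Rough per-layer FLOP counts; illustrative constants only.
    def full_attention_flops(N, d):
        return 2 * N * N * d                 # QK^T plus weights @ V

    def mita_flops(N, d, m, k):
        landmark_scan = 2 * m * N * d        # landmarks score all N keys once
        routing = 2 * N * m * d              # each query scores m landmarks
        attend = 2 * N * (k + m) * d         # each query attends over k+m pairs
        return landmark_scan + routing + attend

    for N in (1_024, 16_384):
        d, m, k = 64, 32, 64
        print(N, round(full_attention_flops(N, d) / mita_flops(N, d, m, k), 1))
    # With m and k fixed, every MiTA term is linear in N, so the ratio to
    # full attention grows linearly with sequence length.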

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same landmark-plus-top-k routing could be applied to autoregressive language models to handle contexts of tens of thousands of tokens without retraining the entire stack.
  • Because the top-k set is reusable across queries, it opens a path to caching those selected pairs in a persistent memory module for multi-turn or retrieval-augmented generation.
  • The observed token-pruning effect suggests a natural way to add dynamic early-exit or sequence-compression layers on top of the attention block.
  • If the compression step can be made adaptive per layer, the method might yield a single model that automatically trades compute for accuracy at different sequence lengths.

Load-bearing premise

Routing via a small set of landmark queries to top-k attended key-value pairs plus compression of the N-width MLP into a narrower shared expert preserves the expressive capacity and performance of full self-attention.

What would settle it

A controlled test on long-sequence vision or language modeling in which MiTA is trained and evaluated against full self-attention at increasing sequence lengths: the premise holds if the accuracy gap stays within a small threshold as sequence length grows, and fails if the gap widens beyond it.
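
A sketch of the decision rule that test implies; the threshold and accuracy numbers are hypothetical placeholders, not measurements.

    # Decision rule for the settling test. All numbers below are
    # hypothetical placeholders, not results from the paper.
    def verdict(acc_full, acc_mita, threshold=0.5):
        """acc_*: maps sequence length -> top-1 accuracy (%) from matched runs."""
        gaps = {n: acc_full[n] - acc_mita[n] for n in sorted(acc_full)}
        ok = all(g <= threshold for g in gaps.values())
        return gaps, ("premise holds" if ok else "premise fails")

    # Hypothetical numbers, purely to exercise the logic:
    print(verdict({256: 80.0, 4096: 82.0}, {256: 79.8, 4096: 81.7}))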

Figures

Figures reproduced from arXiv: 2602.01219 by Chun-Guang Li, Qishuai Wen, Wei He, Xianghan Meng, Zhiyuan Huang.

Figure 1
Figure 1. As the context extends, the width of the two-layer fast-weight MLP induced by full attention increases accordingly. We categorize efficient fast-weight scaling approaches into two strategies: scaling a) by routing and b) by compression, and illustrate each with a representative method: MoBA (Lu et al., 2025) and TTT (Sun et al., 2025). view at source ↗
Figure 2
Figure 2. … MiTA Attention concatenates these compressed key-value pairs with a routed, deformable subset of the original key-value pairs for each query. view at source ↗
Figure 3
Figure 3. Visualization of experts’ gathered key-value pairs, and routed queries. The red box marks the local window from which the landmark query is obtained via average pooling. The attention heatmap (averaged over attention heads) indicates key-value pairs within each expert (top row) and the queries routed to it (bottom row). Notably, neither the expert’s key–value pairs nor the routed queries are confined to th… view at source ↗
Figure 4
Figure 4. The token pruning effect of MiTA Attention. Each row visualizes, for each layer, the positions of key-value pairs (aggregated over heads) selected as experts; the leftmost image shows the original input. In later layers, most tokens are effectively “pruned” (i.e., not selected as experts), and attention concentrates on class-relevant regions. The examples are sampled from the ImageNet-1K training set. view at source ↗
Figure 5
Figure 5. The layer-wise positional overlap (between the key–value pairs gathered by an expert and the queries routed to it) is quantified by mIoU, averaged over experts and attention heads. view at source ↗
Figure 6
Figure 6. Inference throughput. The results are measured on a three-layer Transformer with an embedding dimension of 128, and the batch size is tuned for each run to maximize throughput. view at source ↗
Figure 8
Figure 8. Evaluation with a different inference attention. The x-axis indexes the training attention, while the y-axis indexes the inference attention. We omit the diagonal entries from the heatmap since they are not of interest. Currently, only standard attention, Agent Attention, and MiTA Attention are included. view at source ↗
Figure 9
Figure 9. Ablation study of m and k on CIFAR-100. Experiments with other (m, k) were not conducted due to out-of-memory (OOM). view at source ↗
Figure 10
Figure 10. Visual comparison of related attention methods. view at source ↗
Original abstract

The vanilla self-attention mechanism in Transformers can be viewed as a two-layer fast-weight MLP, whose weights are dynamically induced by inputs and whose hidden dimension is equal to the sequence length $N$. As the context extends, the expressive capacity of such an $N$-width MLP increases, but it becomes unscalable for extremely long sequences. Recently, this fast-weight perspective has motivated the Mixture-of-Experts (MoE) attention mechanism, which partitions the sequence into rigid blocks, treats them as fast-weight experts, and sparsely routes the tokens to them. In this paper, we elevate this perspective to a unifying framework for efficient attention mechanisms, interpreting them as making fast weights scalable through either routing or compression, and organizing them into a five-dimensional taxonomy. Then, we propose Mixture-of-Top-$k$ Attention (MiTA), which employs a small set of landmark queries to gather top-$k$ attended key-value pairs as query-aware and deformable routed experts, while compressing the $N$-width MLP into a narrower shared expert. Consequently, our MiTA improves the flexibility of prior MoE attention from rigid to deformable fast-weight experts, as well as the scalability of prior top-$k$ attention from query-specific set to reusable top-$k$ set. We conduct extensive experiments on vision tasks showing the superior effectiveness and efficiency of our MiTA, and also uncovering intriguing properties such as an emergent token-pruning effect and easy generalization from standard attention. Code is available at https://github.com/QishuaiWen/MiTA.
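
The abstract's opening identity can be verified numerically. A minimal sketch, assuming the standard 1/√d scaling: for one query q, softmax(qKᵀ/√d)V is exactly a two-layer MLP of hidden width N whose weights K and Vᵀ are induced by the input sequence.

    # Numerical check: one query's attention output equals a two-layer
    # MLP of hidden width N with input-induced weights (first layer K,
    # second layer V^T) and softmax as the hidden nonlinearity. The
    # 1/sqrt(d) scaling is the standard convention, assumed here.
    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    rng = np.random.default_rng(1)
    N, d = 32, 8
    K, V = rng.standard_normal((N, d)), rng.standard_normal((N, d))
    q = rng.standard_normal(d)

    attn = softmax(q @ K.T / np.sqrt(d)) @ V   # standard attention, one query
    hidden = softmax(K @ q / np.sqrt(d))       # MLP hidden layer, width N
    mlp = V.T @ hidden                         # MLP output layer
    print(np.allclose(attn, mlp))              # True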

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper frames vanilla self-attention as an N-width fast-weight MLP whose capacity grows with sequence length N but becomes unscalable. It organizes prior efficient attention methods into a five-dimensional taxonomy based on routing and compression strategies, then introduces Mixture-of-Top-k Attention (MiTA). MiTA selects a small fixed set of landmark queries to retrieve a reusable top-k set of key-value pairs (deformable routed experts) while compressing the N-width MLP into a narrower shared expert. The central claim is that this yields both greater flexibility than rigid-block MoE attention and greater scalability than query-specific top-k attention, with experiments on vision tasks demonstrating superior effectiveness, efficiency, an emergent token-pruning effect, and easy generalization from standard attention.

Significance. If the approximation preserves the expressive capacity of full attention, the work would supply a practical unification of routing-based and compression-based efficiency techniques together with a concrete mechanism (landmark-driven reusable top-k experts) that could scale attention to longer contexts without quadratic cost. The taxonomy itself is a useful organizing contribution, and the reported emergent properties (token pruning, generalization) would be of independent interest if quantitatively substantiated.

major comments (3)
  1. [§3.2–3.3] The claim that landmark routing plus shared-expert compression preserves the N-dependent expressive capacity of full self-attention is load-bearing for the central contribution, yet no rank bound, approximation error analysis, or formal characterization of the span of realizable query-key interactions is provided. The landmark selection could systematically omit query-specific structure for large N, and the paper supplies neither a proof nor a counter-example construction showing when the reduced form remains universal.
  2. [Experimental section, vision tasks] The abstract and introduction assert superior effectiveness, but the manuscript does not report error bars, statistical significance tests, or direct comparisons against full self-attention on sequences where N exceeds the chosen k and landmark count. Without these, the claim that performance is retained while scalability improves cannot be evaluated.
  3. [§4.2, ablation studies] The scalability claim depends on the number of landmark queries and the top-k value remaining small relative to N; however, no ablation is shown that varies these hyperparameters while measuring both accuracy and wall-clock cost across increasing sequence lengths, leaving the practical regime of the method uncharacterized.
minor comments (2)
  1. Notation: the symbols for the landmark query matrix and the shared expert width are introduced without an explicit table of symbols, making cross-references between the taxonomy and the MiTA equations harder to follow.
  2. Figure 2 caption: the diagram of deformable versus rigid experts would benefit from an explicit legend indicating which arrows correspond to the landmark routing step versus the shared-expert compression step.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for strengthening the theoretical grounding and experimental validation of MiTA. We address each major comment below and indicate the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [§3.2–3.3] The claim that landmark routing plus shared-expert compression preserves the N-dependent expressive capacity of full self-attention is load-bearing for the central contribution, yet no rank bound, approximation error analysis, or formal characterization of the span of realizable query-key interactions is provided. The landmark selection could systematically omit query-specific structure for large N, and the paper supplies neither a proof nor a counter-example construction showing when the reduced form remains universal.

    Authors: We acknowledge that the manuscript does not include a formal rank bound or approximation-error analysis, which would provide stronger theoretical support. Our primary contribution is the algorithmic unification via the fast-weight perspective and the empirical demonstration that MiTA retains competitive performance while improving scalability. In the revision we will add a dedicated discussion subsection that (i) explains the design rationale for landmark selection (representative queries obtained via average pooling over local windows) and why it is expected to preserve the dominant query-key interactions, (ii) states the conditions under which the reduced form is intended to approximate full attention, and (iii) explicitly notes the absence of a universality proof as a limitation for future theoretical work. We do not claim strict preservation of capacity for arbitrary N; rather, we show that the practical expressive power suffices for the vision tasks considered. revision: partial

  2. Referee: [Experimental section] The abstract and introduction assert superior effectiveness, but the manuscript does not report error bars, statistical significance tests, or direct comparisons against full self-attention on sequences where N exceeds the chosen k and landmark count. Without these, the claim that performance is retained while scalability improves cannot be evaluated.

    Authors: We will revise the experimental section to report mean performance and standard deviation across multiple random seeds, together with paired t-tests for statistical significance against the strongest baselines. Direct comparisons to full self-attention are already present for all datasets where N is small enough for quadratic attention to run (e.g., standard ImageNet patch sequences); we will add explicit statements clarifying that full attention becomes infeasible once N exceeds the chosen k and landmark count, which is precisely the regime where MiTA’s scalability advantage is intended to apply. Additional results on longer synthetic sequences will be included to illustrate the scaling behavior. revision: yes

  3. Referee: [§4.2] The scalability claim depends on the number of landmark queries and the top-k value remaining small relative to N; however, no ablation is shown that varies these hyperparameters while measuring both accuracy and wall-clock cost across increasing sequence lengths, leaving the practical regime of the method uncharacterized.

    Authors: We agree that a more systematic characterization is needed. In the revised §4.2 we will present new ablation tables that sweep the number of landmarks and the top-k value over a range of settings, reporting both top-1 accuracy and wall-clock time (forward + backward) for sequence lengths from 256 up to 4096 tokens. These results will delineate the practical operating regime in which MiTA remains accurate while delivering clear efficiency gains. revision: yes

Circularity Check

0 steps flagged

No circularity: proposal introduces independent routing and compression without reduction to inputs

Full rationale

The paper frames vanilla attention as an N-width fast-weight MLP and proposes MiTA via landmark queries for deformable top-k experts plus narrower shared-expert compression. No quoted equations or steps reduce the claimed flexibility/scalability gains to a fitted parameter, self-defined quantity, or load-bearing self-citation chain; the taxonomy and method are presented as new organizing constructs with empirical support on vision tasks. The derivation remains self-contained as an architectural proposal rather than a tautological renaming or prediction-by-construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Review based on the abstract only; specific free parameters such as the exact value of k and the number of landmarks, plus any additional assumptions in the full derivations, cannot be audited.

free parameters (2)
  • k (top-k count)
    Hyperparameter controlling how many key-value pairs each landmark gathers into its expert; value chosen during experiments.
  • number of landmark queries
    Size of the small set used to gather top-k experts; not numerically specified in the abstract.
axioms (1)
  • domain assumption: Vanilla self-attention can be viewed as a two-layer fast-weight MLP whose hidden dimension equals the sequence length N.
    This perspective is used to motivate the entire taxonomy and the MiTA design.

pith-pipeline@v0.9.0 · 5589 in / 1333 out tokens · 27029 ms · 2026-05-16T08:49:23.475693+00:00 · methodology


Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 9 internal anchors

  1. [1]

    Titans: Learning to Memorize at Test Time

    Behrouz, A., Zhong, P., and Mirrokni, V. Titans: Learning to memorize at test time. arXiv preprint arXiv:2501.00663.

  2. [2]

    Longformer: The Long-Document Transformer

    Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

  3. [3]

    Mixture of Contexts for Long Video Generation

    Cai, S., Yang, C., Zhang, L., Guo, Y., Xiao, J., Yang, Z., Xu, Y., Yang, Z., Yuille, A., Guibas, L., et al. Mixture of contexts for long video generation. arXiv preprint arXiv:2508.21058.

  4. [4]

    ViT$^3$: Unlocking Test-Time Training in Vision

    Han, D., Pu, Y., Xia, Z., Han, Y., Pan, X., Li, X., Lu, J., Song, S., and Huang, G. Bridging the divide: Reconsidering softmax and linear attention. In NeurIPS, 2024a. Han, D., Ye, T., Han, Y., Xia, Z., Pan, S., Wan, P., Song, S., and Huang, G. Agent attention: On the integration of softmax and linear attention. In ECCV, 2024b. Han, D., Li, Y., Li, T., ...

  5. [5]

    Distilling the Knowledge in a Neural Network

    Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

  6. [6]

    MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation

    Jia, W., Lu, Y., Huang, M., Wang, H., Huang, B., Chen, N., Liu, M., Jiang, J., and Mao, Z. MoGA: Mixture-of-groups attention for end-to-end long video generation. arXiv preprint arXiv:2510.18692.

  7. [7]

    Scaling Laws for Neural Language Models

    Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

  8. [8]

    General-Purpose In-Context Learning by Meta-Learning Transformers

    Kirsch, L., Harrison, J., Sohl-Dickstein, J., and Metz, L. General-purpose in-context learning by meta-learning transformers. arXiv preprint arXiv:2212.04458.

  9. [9]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    Liu, A., Mei, A., Lin, B., Xue, B., Wang, B., Xu, B., Wu, B., Zhang, B., Lin, C., Dong, C., et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556.

  10. [10]

    MoBA: Mixture of Block Attention for Long-Context LLMs

    Lu, E., Jiang, Z., Liu, J., Du, Y., Jiang, T., Hong, C., Liu, S., He, W., Yuan, E., Wang, Y., et al. MoBA: Mixture of block attention for long-context LLMs. arXiv preprint arXiv:2502.13189.

  11. [11]

    Online normalizer calculation for softmax

    Milakov, M. and Gimelshein, N. Online normalizer calculation for softmax. arXiv preprint arXiv:1805.02867.

  12. [12]

    ViT-5: Vision transformers for the mid-2020s

    Wang, F., Ren, S., Zhang, T., Neskovic, P., Bhattad, A., Xie, C., and Yuille, A. ViT-5: Vision transformers for the mid-2020s. arXiv preprint arXiv:2602.08071.

  13. [13]

    Linformer: Self-Attention with Linear Complexity

    Wang, S., Li, B. Z., Khabsa, M., Fang, H., and Ma, H. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768.

  14. [14]

    Parallel Loop Transformer for Efficient Test-Time Computation Scaling

    Wu, B., Chen, M., Luo, X., Yan, S., Yu, Q., Xia, F., Zhang, T., Zhan, H., Zhong, Z., Zhou, X., et al. Parallel loop transformer for efficient test-time computation scaling. arXiv preprint arXiv:2510.24824, 2025a. Wu, J., Hou, L., Yang, H., Tao, X., Tian, Y., Wan, P., Zhang, D., and Tong, Y. VMoBA: Mixture-of-block attention for video diffusion models. arX...

  15. [15]

    MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head

    Zhang, K., Huang, Y., Deng, Y., Yu, J., Chen, J., Ling, H., Xie, E., and Zhou, D. MHLA: Restoring expressivity of linear attention via token-level multi-head. arXiv preprint arXiv:2601.07832.

  16. [16]

    Experiments with other (m, k) were not conducted due to out-of-memory (OOM)

    Setting                               Acc.   ∆
    Landmark Extraction
      Random Selection                    70.6   -0.5
      Learnable Parameters                66.3   -4.8
      1D Average Pooling                  70.4   -0.7
      2D Average Pooling (default)        71.1
      Convolution                         67.0   -4.1
      Depth-Wise Convolution              68.8   -2.3
      Token Merging (Bolya et al., 2023)  71.3   +0.2
    m×k
      16×16                               70.0   -1.1
      16×25                               71.0   -0.1
      25×16                               70.5   -0.6
      25×25 (default)                     71.1
      25×36                               71.7   +0.6
      36×25                               71.2   +0.1
      36...

  17. [17]

    For practical use, we provide a simple rule of thumb: first choose a fixed ratio m×k/N; then start from m = k and explore k > m during subsequent tuning

    …and DeepSeek Sparse Attention (Liu et al., 2025)) is redundant, and that many such activations can instead be shared through routing. For practical use, we provide a simple rule of thumb: first choose a fixed ratio m×k/N; then start from m = k and explore k > m during subsequent tuning. Compression and routing. Routing is more critical than compression. Nevert...