Mixture-of-Top-k Attention: Efficient Attention via Scalable Fast Weights
Pith reviewed 2026-05-16 08:49 UTC · model grok-4.3
The pith
Mixture-of-Top-k Attention scales self-attention by routing queries to deformable fast-weight experts selected via landmark queries, combined with a shared compressed expert.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MiTA employs a small set of landmark queries to gather top-k attended key-value pairs as query-aware, deformable routed experts, while compressing the N-width MLP into a narrower shared expert. This improves the flexibility of prior MoE attention, from rigid blocks to deformable fast-weight experts, and the scalability of prior top-k attention, from a query-specific set to a reusable top-k set.
What carries the argument
Mixture-of-Top-k Attention (MiTA): landmark queries select a reusable top-k set of key-value pairs as deformable routed experts plus one narrower shared expert that replaces the full N-width fast-weight MLP.
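The description above is compact, so here is a minimal single-head sketch of how such a routing-plus-compression layer could look, written only from this description and not from the authors' code. The landmark extraction (average pooling), the random compression matrix, and the names `num_landmarks`, `top_k`, and `shared_dim` are our illustrative assumptions.

```python
# Minimal single-head sketch of a landmark-routed, shared-compressed attention layer.
# This is an interpretation of the review's description, not the paper's implementation.
import torch
import torch.nn.functional as F


def mita_attention(q, k, v, num_landmarks=8, top_k=16, shared_dim=32):
    """q, k, v: (N, d) single-head tensors for one sequence."""
    N, d = q.shape

    # 1) Landmark queries: average pooling here stands in for whatever
    #    landmark-extraction scheme the paper actually uses.
    landmarks = F.adaptive_avg_pool1d(q.t().unsqueeze(0), num_landmarks)  # (1, d, m)
    landmarks = landmarks.squeeze(0).t()                                  # (m, d)

    # 2) Each landmark scores all keys and keeps its top-k; the union of these
    #    indices is the *reusable* set shared by every query (the routed experts).
    landmark_scores = landmarks @ k.t() / d ** 0.5                        # (m, N)
    topk_idx = landmark_scores.topk(min(top_k, N), dim=-1).indices        # (m, k)
    routed_idx = topk_idx.flatten().unique()                              # (<= m*k,)
    k_routed, v_routed = k[routed_idx], v[routed_idx]

    # 3) Routed experts: every query attends only to the shared top-k set.
    routed_out = F.softmax(q @ k_routed.t() / d ** 0.5, dim=-1) @ v_routed  # (N, d)

    # 4) Shared compressed expert: project the N "hidden units" (key/value pairs)
    #    down to a small fixed width, mimicking a narrower fast-weight MLP.
    #    A fixed random projection is used purely for illustration.
    proj = torch.randn(N, shared_dim) / N ** 0.5
    k_shared, v_shared = proj.t() @ k, proj.t() @ v                         # (s, d)
    shared_out = F.softmax(q @ k_shared.t() / d ** 0.5, dim=-1) @ v_shared  # (N, d)

    return routed_out + shared_out


out = mita_attention(torch.randn(196, 64), torch.randn(196, 64), torch.randn(196, 64))
print(out.shape)  # torch.Size([196, 64])
```

The cost structure of this sketch is linear in N once the landmark count and top-k are fixed, which is the property the claims below rely on.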
If this is right
- The reusable top-k set allows attention cost to remain sub-quadratic in sequence length while still being query-dependent (a rough cost sketch follows this list).
- The shared compressed expert reduces parameter count and memory footprint relative to maintaining separate experts per block.
- Emergent token pruning appears during inference, dynamically shortening effective sequence length.
- The design generalizes from standard attention without requiring architectural changes at test time.
- Performance on vision benchmarks improves in both accuracy and speed over rigid-block MoE attention and query-specific top-k methods.
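To illustrate the first bullet, the following back-of-the-envelope FLOP comparison uses an assumed cost model, ours rather than the paper's: full attention pays roughly 4N²d FLOPs, while the routed path pays for m·N landmark scores plus attention over at most m·k reused pairs, with an additional term for an s-wide shared expert. The values m = k = 25 and s = 32 are illustrative.

```python
# Assumed cost model (ours, not the paper's): matmul FLOPs only, single head.
def full_attention_flops(N, d):
    return 2 * N * N * d * 2              # QK^T scores + scores @ V

def mita_flops(N, d, m=25, k=25, s=32):
    landmark = 2 * m * N * d              # landmark-query scores against all keys
    routed   = 2 * N * (m * k) * d * 2    # every query attends to <= m*k reused pairs
    shared   = 2 * N * s * d * 2          # attention against the s-wide shared expert
    return landmark + routed + shared

for N in (196, 1024, 4096, 16384):
    ratio = full_attention_flops(N, 64) / mita_flops(N, 64)
    print(f"N={N:6d}  full/MiTA FLOP ratio ~ {ratio:5.1f}")
```

With these made-up constants the advantage only appears once N clearly exceeds m·k, which matches the regime the authors point to in their rebuttal below.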
Where Pith is reading between the lines
- The same landmark-plus-top-k routing could be applied to autoregressive language models to handle contexts of tens of thousands of tokens without retraining the entire stack.
- Because the top-k set is reusable across queries, it opens a path to caching those selected pairs in a persistent memory module for multi-turn or retrieval-augmented generation.
- The observed token-pruning effect suggests a natural way to add dynamic early-exit or sequence-compression layers on top of the attention block.
- If the compression step can be made adaptive per layer, the method might yield a single model that automatically trades compute for accuracy at different sequence lengths.
Load-bearing premise
Routing via a small set of landmark queries to top-k attended key-value pairs plus compression of the N-width MLP into a narrower shared expert preserves the expressive capacity and performance of full self-attention.
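One way this premise could be stated precisely, in our own formalization rather than the paper's, is as a uniform approximation bound over the inputs of interest:

```latex
% A candidate formalization of the load-bearing premise (our phrasing, not the paper's):
% MiTA with m landmark queries, top-k routing, and a shared expert of width s should
% approximate full attention uniformly well over the relevant input distribution D.
\[
  \sup_{(Q,K,V) \in \mathcal{D}}
  \bigl\lVert \operatorname{Attn}(Q,K,V) - \operatorname{MiTA}_{m,k,s}(Q,K,V) \bigr\rVert_{F}
  \;\le\; \varepsilon(m,k,s),
  \qquad \varepsilon \ \text{small and non-increasing in } m,\, k,\, s .
\]
```

The referee's first major comment below asks, in effect, for a bound of this shape, or for a counter-example showing it fails as N grows.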
What would settle it
A controlled test on long-sequence vision or language modeling in which MiTA is trained and evaluated against full self-attention, checking whether the accuracy gap stays within a small threshold or widens as sequence length grows.
Original abstract
The vanilla self-attention mechanism in Transformers can be viewed as a two-layer fast-weight MLP, whose weights are dynamically induced by inputs and whose hidden dimension is equal to the sequence length $N$. As the context extends, the expressive capacity of such an $N$-width MLP increases, but it becomes unscalable for extremely long sequences. Recently, this fast-weight perspective has motivated the Mixture-of-Experts (MoE) attention mechanism, which partitions the sequence into rigid blocks, treats them as fast-weight experts, and sparsely routes the tokens to them. In this paper, we elevate this perspective to a unifying framework for efficient attention mechanisms, interpreting them as making fast weights scalable through either routing or compression, and organizing them into a five-dimensional taxonomy. Then, we propose Mixture-of-Top-$k$ Attention (MiTA), which employs a small set of landmark queries to gather top-$k$ attended key-value pairs as query-aware and deformable routed experts, while compressing the $N$-width MLP into a narrower shared expert. Consequently, our MiTA improves the flexibility of prior MoE attention from rigid to deformable fast-weight experts, as well as the scalability of prior top-$k$ attention from query-specific set to reusable top-$k$ set. We conduct extensive experiments on vision tasks showing the superior effectiveness and efficiency of our MiTA, and also uncovering intriguing properties such as an emergent token-pruning effect and easy generalization from standard attention. Code is available at https://github.com/QishuaiWen/MiTA.
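For readers new to the fast-weight framing in the abstract's first sentence, the correspondence can be written out directly. This is a standard rendering in our notation (column vectors, single head, the usual 1/√d scaling), not a quotation from the paper.

```latex
% Self-attention for one query q_i, with keys K in R^{N x d} and values V in R^{N x d},
% read as a two-layer MLP whose input-dependent ("fast") weights are K and V and whose
% hidden width equals the sequence length N:
\[
  \operatorname{attn}(q_i)
  \;=\; V^{\top}\, \sigma\!\left(\tfrac{1}{\sqrt{d}}\, K\, q_i\right)
  \;=\; W_2^{\top}\, \sigma(W_1\, q_i),
  \qquad
  W_1 = \tfrac{1}{\sqrt{d}}\,K \in \mathbb{R}^{N \times d},\quad
  W_2 = V \in \mathbb{R}^{N \times d},\quad
  \sigma = \operatorname{softmax}.
\]
% MiTA's two moves then read as: routing keeps only a reusable top-k subset of the
% N hidden units, and compression projects the rest down to a narrow shared expert.
```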
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper frames vanilla self-attention as an N-width fast-weight MLP whose capacity grows with sequence length N but becomes unscalable. It organizes prior efficient attention methods into a five-dimensional taxonomy based on routing and compression strategies, then introduces Mixture-of-Top-k Attention (MiTA). MiTA selects a small fixed set of landmark queries to retrieve a reusable top-k set of key-value pairs (deformable routed experts) while compressing the N-width MLP into a narrower shared expert. The central claim is that this yields both greater flexibility than rigid-block MoE attention and greater scalability than query-specific top-k attention, with experiments on vision tasks demonstrating superior effectiveness, efficiency, an emergent token-pruning effect, and easy generalization from standard attention.
Significance. If the approximation preserves the expressive capacity of full attention, the work would supply a practical unification of routing-based and compression-based efficiency techniques together with a concrete mechanism (landmark-driven reusable top-k experts) that could scale attention to longer contexts without quadratic cost. The taxonomy itself is a useful organizing contribution, and the reported emergent properties (token pruning, generalization) would be of independent interest if quantitatively substantiated.
major comments (3)
- [§3.2–3.3] The claim that landmark routing plus shared-expert compression preserves the N-dependent expressive capacity of full self-attention is load-bearing for the central contribution, yet no rank bound, approximation-error analysis, or formal characterization of the span of realizable query-key interactions is provided. The landmark selection could systematically omit query-specific structure for large N, and the paper supplies neither a proof nor a counter-example construction showing when the reduced form remains universal.
- [Experimental section (vision tasks)] The abstract and introduction assert superior effectiveness, but the manuscript does not report error bars, statistical significance tests, or direct comparisons against full self-attention on sequences where N exceeds the chosen k and landmark count. Without these, the claim that performance is retained while scalability improves cannot be evaluated.
- [§4.2 (ablation studies)] The scalability claim depends on the number of landmark queries and the top-k value remaining small relative to N; however, no ablation is shown that varies these hyperparameters while measuring both accuracy and wall-clock cost across increasing sequence lengths, leaving the practical regime of the method uncharacterized.
minor comments (2)
- Notation: the symbols for the landmark query matrix and the shared expert width are introduced without an explicit table of symbols, making cross-references between the taxonomy and the MiTA equations harder to follow.
- Figure 2 caption: the diagram of deformable versus rigid experts would benefit from an explicit legend indicating which arrows correspond to the landmark routing step versus the shared-expert compression step.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for strengthening the theoretical grounding and experimental validation of MiTA. We address each major comment below and indicate the revisions we will make to the manuscript.
Point-by-point responses
- Referee: [§3.2–3.3] The claim that landmark routing plus shared-expert compression preserves the N-dependent expressive capacity of full self-attention is load-bearing for the central contribution, yet no rank bound, approximation-error analysis, or formal characterization of the span of realizable query-key interactions is provided. The landmark selection could systematically omit query-specific structure for large N, and the paper supplies neither a proof nor a counter-example construction showing when the reduced form remains universal.
Authors: We acknowledge that the manuscript does not include a formal rank bound or approximation-error analysis, which would provide stronger theoretical support. Our primary contribution is the algorithmic unification via the fast-weight perspective and the empirical demonstration that MiTA retains competitive performance while improving scalability. In the revision we will add a dedicated discussion subsection that (i) explains the design rationale for landmark selection (representative queries obtained via clustering) and why it is expected to preserve the dominant query-key interactions, (ii) states the conditions under which the reduced form is intended to approximate full attention, and (iii) explicitly notes the absence of a universality proof as a limitation for future theoretical work. We do not claim strict preservation of capacity for arbitrary N; rather, we show that the practical expressive power suffices for the vision tasks considered. revision: partial
- Referee: [Experimental section] The abstract and introduction assert superior effectiveness, but the manuscript does not report error bars, statistical significance tests, or direct comparisons against full self-attention on sequences where N exceeds the chosen k and landmark count. Without these, the claim that performance is retained while scalability improves cannot be evaluated.
Authors: We will revise the experimental section to report mean performance and standard deviation across multiple random seeds, together with paired t-tests for statistical significance against the strongest baselines. Direct comparisons to full self-attention are already present for all datasets where N is small enough for quadratic attention to run (e.g., standard ImageNet patch sequences); we will add explicit statements clarifying that full attention becomes infeasible once N exceeds the chosen k and landmark count, which is precisely the regime where MiTA’s scalability advantage is intended to apply. Additional results on longer synthetic sequences will be included to illustrate the scaling behavior. revision: yes
- Referee: [§4.2] The scalability claim depends on the number of landmark queries and the top-k value remaining small relative to N; however, no ablation is shown that varies these hyperparameters while measuring both accuracy and wall-clock cost across increasing sequence lengths, leaving the practical regime of the method uncharacterized.
Authors: We agree that a more systematic characterization is needed. In the revised §4.2 we will present new ablation tables that sweep the number of landmarks and the top-k value over a range of settings, reporting both top-1 accuracy and wall-clock time (forward + backward) for sequence lengths from 256 up to 4096 tokens. These results will delineate the practical operating regime in which MiTA remains accurate while delivering clear efficiency gains. revision: yes
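The promised sweep is straightforward to script. Below is a minimal, self-contained timing skeleton under our own assumptions about the two matmul patterns (full N×N attention versus m·N landmark scoring plus attention over at most m·k gathered pairs); softmax and accuracy measurement are omitted, and none of this is the authors' protocol.

```python
# Minimal wall-clock sweep skeleton (our harness, not the paper's): compares the matmul
# pattern of full attention against a landmark-routed pattern across sequence lengths
# and (m, k) settings. Accuracy numbers would require actual training.
import time
import numpy as np

def time_full(N, d, reps=3):
    q, k, v = (np.random.randn(N, d).astype(np.float32) for _ in range(3))
    t0 = time.perf_counter()
    for _ in range(reps):
        scores = q @ k.T / np.sqrt(d)          # (N, N) -- the quadratic term
        _ = scores @ v
    return (time.perf_counter() - t0) / reps

def time_routed(N, d, m, topk, reps=3):
    q, k, v = (np.random.randn(N, d).astype(np.float32) for _ in range(3))
    landmarks = np.random.randn(m, d).astype(np.float32)
    t0 = time.perf_counter()
    for _ in range(reps):
        lm_scores = landmarks @ k.T                                       # (m, N)
        idx = np.unique(np.argpartition(-lm_scores, topk, axis=1)[:, :topk])
        k_r, v_r = k[idx], v[idx]              # reusable top-k set (<= m*topk pairs)
        _ = (q @ k_r.T / np.sqrt(d)) @ v_r
    return (time.perf_counter() - t0) / reps

for N in (256, 1024, 4096):
    for m, topk in ((16, 16), (25, 25), (36, 36)):
        print(f"N={N:5d} m={m:2d} k={topk:2d}  "
              f"full={time_full(N, 64)*1e3:7.2f} ms  "
              f"routed={time_routed(N, 64, m, topk)*1e3:7.2f} ms")
```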
Circularity Check
No circularity: the proposal introduces routing and compression as independent constructs rather than deriving its claimed gains from fitted or self-defined quantities.
Full rationale
The paper frames vanilla attention as an N-width fast-weight MLP and proposes MiTA via landmark queries for deformable top-k experts plus narrower shared-expert compression. No quoted equations or steps reduce the claimed flexibility/scalability gains to a fitted parameter, self-defined quantity, or load-bearing self-citation chain; the taxonomy and method are presented as new organizing constructs with empirical support on vision tasks. The derivation remains self-contained as an architectural proposal rather than a tautological renaming or prediction-by-construction.
Axiom & Free-Parameter Ledger
free parameters (2)
- k (top-k count)
- number of landmark queries
axioms (1)
- domain assumption: Vanilla self-attention can be viewed as a two-layer fast-weight MLP whose hidden dimension equals sequence length N.