Recognition: no theorem link
MoBA: Mixture of Block Attention for Long-Context LLMs
Pith reviewed 2026-05-16 06:12 UTC · model grok-4.3
The pith
Mixture of Block Attention lets LLMs learn their own sparse attention patterns and match full-attention performance on long contexts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MoBA applies mixture-of-experts routing to the attention mechanism by partitioning the input sequence into fixed-size blocks and learning a router that selects which blocks each query block attends to. The architecture supports both dense and sparse modes within the same trained model, allowing seamless transitions during inference. The method has been integrated into production long-context serving for Kimi and is released with code.
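To make the core claim concrete, here is a minimal sketch of block-routed attention as described above, not the released MoBA implementation: each query scores mean-pooled key blocks, keeps the top-k blocks, and attends only to tokens inside them. The function name, the mean-pool scoring, and the non-causal simplification are illustrative assumptions; the official code lives in the linked GitHub repository.

```python
import torch
import torch.nn.functional as F

def moba_attention(q, k, v, block_size=4, top_k=2):
    """Toy single-head block-routed attention (illustrative, not the released kernel).

    Each query scores mean-pooled key blocks, keeps the top_k blocks, and
    attends only to tokens inside them. Causality is ignored for brevity.
    """
    seq_len, dim = q.shape
    assert seq_len % block_size == 0, "toy version assumes divisible lengths"
    n_blocks = seq_len // block_size

    # Router: score each block by the query's similarity to the block's mean key.
    block_repr = k.view(n_blocks, block_size, dim).mean(dim=1)    # [n_blocks, dim]
    gate = q @ block_repr.T                                       # [seq_len, n_blocks]
    chosen = gate.topk(top_k, dim=-1).indices                     # [seq_len, top_k]

    # Expand the block choices to a query-by-key-token mask, then attend.
    block_mask = F.one_hot(chosen, n_blocks).sum(dim=1).bool()    # [seq_len, n_blocks]
    token_mask = block_mask.repeat_interleave(block_size, dim=1)  # [seq_len, seq_len]
    scores = (q @ k.T) / dim ** 0.5
    scores = scores.masked_fill(~token_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Same weights, sparsity chosen per call: top_k equal to the block count is dense.
q, k, v = (torch.randn(16, 32) for _ in range(3))
sparse_out = moba_attention(q, k, v, block_size=4, top_k=2)
dense_out = moba_attention(q, k, v, block_size=4, top_k=4)
```

With top_k equal to the number of blocks, the mask admits every key and the call reduces to full attention over the same weights, which is the dense/sparse transition the core claim describes; lowering top_k at call time is also the simplest reading of the per-request sparsity idea listed further down.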
What carries the argument
Mixture of Block Attention, which uses learned routers to select subsets of token blocks for attention computation instead of fixed or full patterns.
If this is right
- The same model weights can be used for both high-accuracy dense inference and lower-cost sparse inference without separate training runs.
- Context length can increase beyond current limits while keeping attention compute sub-quadratic on average (a rough operation count follows this list).
- Attention patterns emerge from data rather than task-specific heuristics such as sink or window attention.
- The block-routing idea can be inserted into existing transformer stacks with only local changes to the attention layer.
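To put rough numbers on the sub-quadratic bullet above, a back-of-the-envelope count of query-key score computations per head; the sequence length, block size, and routed-block count are illustrative assumptions, not figures reported in the paper.

```python
# Rough count of query-key score computations per head (illustrative numbers,
# not measurements from the paper).
N = 1_000_000        # sequence length
B = 4_096            # block size
k = 8                # routed blocks per query

full_attention = N * N          # every query scores every key
moba_attention = N * k * B      # every query scores k blocks of B keys

print(f"full:  {full_attention:.3e} scores")
print(f"MoBA:  {moba_attention:.3e} scores")
print(f"ratio: {full_attention / moba_attention:.0f}x fewer score computations")
```

In this sketch the router still scores every query against every block representative (about N·N/B comparisons), so the saving is a large constant factor rather than a change in asymptotic order unless the routing stage itself is made hierarchical.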
Where Pith is reading between the lines
- Similar block-wise routing could be applied to other compute-heavy transformer components, such as feed-forward layers, to further reduce long-sequence costs.
- The learned routers might reveal interpretable patterns of long-range dependency that differ across domains or tasks.
- Production systems could dynamically adjust the sparsity level per request by changing the number of routed blocks at inference time (the top_k argument in the sketch under the core claim above).
Load-bearing premise
End-to-end trained block routing will reliably discover attention patterns that are both more efficient than fixed sparse structures and at least as effective as full attention across diverse long-context tasks.
What would settle it
A controlled long-context reasoning benchmark on which a trained MoBA model scores measurably below a full-attention baseline of the same size while also failing to show clear compute savings over fixed block-sparse attention.
Original abstract
Scaling the effective context length is essential for advancing large language models (LLMs) toward artificial general intelligence (AGI). However, the quadratic increase in computational complexity inherent in traditional attention mechanisms presents a prohibitive overhead. Existing approaches either impose strongly biased structures, such as sink or window attention which are task-specific, or radically modify the attention mechanism into linear approximations, whose performance in complex reasoning tasks remains inadequately explored. In this work, we propose a solution that adheres to the "less structure" principle, allowing the model to determine where to attend autonomously, rather than introducing predefined biases. We introduce Mixture of Block Attention (MoBA), an innovative approach that applies the principles of Mixture of Experts (MoE) to the attention mechanism. This novel architecture demonstrates superior performance on long-context tasks while offering a key advantage: the ability to seamlessly transition between full and sparse attention, enhancing efficiency without the risk of compromising performance. MoBA has already been deployed to support Kimi's long-context requests and demonstrates significant advancements in efficient attention computation for LLMs. Our code is available at https://github.com/MoonshotAI/MoBA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Mixture of Block Attention (MoBA), which applies Mixture-of-Experts routing to the attention mechanism by partitioning the context into blocks and training a router to select a subset of blocks for each query. This design aims to reduce quadratic complexity for long contexts while following a 'less structure' principle that lets the model discover attention patterns autonomously rather than imposing fixed biases such as sliding windows or sink tokens. The authors report superior performance on long-context benchmarks, the ability to transition seamlessly between full and sparse attention without performance degradation, and production deployment in Kimi; code is released at the provided GitHub link.
Significance. If the empirical results survive proper controls, MoBA would constitute a meaningful advance in efficient long-context modeling by demonstrating that learned block routing can preserve full-attention quality on complex reasoning tasks while enabling sparsity. The avoidance of hand-crafted biases and the open-sourced implementation are concrete strengths that could influence subsequent work on dynamic attention mechanisms.
major comments (2)
- [§4 (Experiments)] §4 (Experiments) and associated tables: the central claim that end-to-end block routing discovers patterns superior to fixed sparsity requires direct ablations against non-learned block structures (e.g., sliding-window blocks or global+local patterns). Without these controls it remains unclear whether reported gains derive from the learned router or simply from the block decomposition itself.
- [§3.2 (Router)] §3.2 (Router) and §4.3 (Analysis): no routing entropy, load-balance statistics, or per-task router activation maps are provided. Given known MoE collapse risks, explicit verification that the router learns dynamic, task-adaptive behavior rather than converging to a near-static pattern is load-bearing for the 'autonomous discovery' claim.
minor comments (2)
- [Abstract] Abstract and §1: the phrase 'seamlessly transition between full and sparse attention' should be accompanied by a concrete metric (e.g., performance delta at varying sparsity levels) rather than left as a qualitative statement.
- [Related Work] Related Work: recent block-sparse and hierarchical attention papers should be cited to situate MoBA more precisely against contemporaneous alternatives.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to strengthen the empirical support for our claims.
Point-by-point responses
-
Referee: [§4 (Experiments)] §4 (Experiments) and associated tables: the central claim that end-to-end block routing discovers patterns superior to fixed sparsity requires direct ablations against non-learned block structures (e.g., sliding-window blocks or global+local patterns). Without these controls it remains unclear whether reported gains derive from the learned router or simply from the block decomposition itself.
Authors: We agree that direct ablations against fixed block structures are required to isolate the contribution of the learned router. In the revised manuscript we have added these controls in §4, comparing MoBA against sliding-window block attention and global+local patterns on the same long-context benchmarks. The new results show that learned routing outperforms the fixed alternatives, indicating that performance gains are not attributable solely to the block decomposition. revision: yes
-
Referee: [§3.2 (Router)] §3.2 (Router) and §4.3 (Analysis): no routing entropy, load-balance statistics, or per-task router activation maps are provided. Given known MoE collapse risks, explicit verification that the router learns dynamic, task-adaptive behavior rather than converging to a near-static pattern is load-bearing for the 'autonomous discovery' claim.
Authors: We acknowledge that explicit router diagnostics are necessary to substantiate the dynamic, task-adaptive behavior. The revised §4.3 now includes routing entropy, load-balance statistics across tasks, and per-task activation maps. These additions demonstrate that the router does not collapse to static patterns and instead exhibits task-dependent selection, consistent with the autonomous-discovery principle. revision: yes
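For concreteness, a minimal sketch of the kind of diagnostics the referee asks for, assuming the router exposes each query's chosen block indices; the function name and tensor layout are illustrative assumptions, not taken from the paper or its code.

```python
import torch

def router_diagnostics(selected_blocks, n_blocks):
    """Routing entropy and load balance from hard block selections.

    selected_blocks: LongTensor [n_queries, top_k] of chosen block indices.
    Returns the entropy of the aggregate block-usage distribution (in bits)
    and the max/mean load ratio (1.0 = perfectly balanced).
    """
    counts = torch.bincount(selected_blocks.flatten(), minlength=n_blocks).float()
    usage = counts / counts.sum()                        # block-usage distribution
    entropy = -(usage * (usage + 1e-12).log2()).sum()    # collapse -> entropy near 0
    load_ratio = counts.max() / counts.mean()            # hot blocks -> ratio >> 1
    return entropy.item(), load_ratio.item()

# A near-static router (every query picks block 0) vs. a spread-out router.
static = torch.zeros(128, 2, dtype=torch.long)
spread = torch.randint(0, 16, (128, 2))
print(router_diagnostics(static, 16))   # low entropy, high load ratio
print(router_diagnostics(spread, 16))   # entropy near log2(16) = 4, ratio near 1
```

Per-task activation maps would repeat the same computation separately per task and compare the resulting usage distributions.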
Circularity Check
No circularity: empirical architecture with external benchmarks
Full rationale
The paper proposes MoBA as a novel attention architecture applying MoE principles to block-level attention, allowing dynamic transition between full and sparse modes. No derivation chain exists; claims rest on end-to-end training and empirical results on long-context tasks, not on any equation that reduces to its own inputs by construction. No self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central advantage (seamless full-to-sparse transition without performance loss) is presented as an empirical outcome measured against external benchmarks, making the work self-contained.
Axiom & Free-Parameter Ledger
free parameters (1)
- number of blocks and router capacity
axioms (1)
- [standard math] Standard transformer attention can be partitioned into independent blocks without changing the underlying computation graph when routing is dense.
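The axiom can be checked numerically in a toy setting: computing attention blockwise and recombining the partial results with log-sum-exp weights reproduces full softmax attention when every block is included. A minimal sketch under simplifying assumptions (non-causal, single head), not the paper's formulation:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, dim, block_size = 16, 8, 4
q, k, v = (torch.randn(seq_len, dim) for _ in range(3))

# Reference: full (non-causal, single-head) softmax attention.
full = F.softmax(q @ k.T / dim ** 0.5, dim=-1) @ v

# Blockwise: attend within each key block, then combine the per-block partial
# outputs with log-sum-exp weights. With every block included ("dense routing"),
# this reproduces full attention up to floating-point error.
partial_out, partial_lse = [], []
for kb, vb in zip(k.split(block_size), v.split(block_size)):
    s = q @ kb.T / dim ** 0.5                      # scores against this block
    partial_out.append(F.softmax(s, dim=-1) @ vb)  # block-local attention output
    partial_lse.append(torch.logsumexp(s, dim=-1, keepdim=True))

lse = torch.cat(partial_lse, dim=-1)               # [seq_len, n_blocks]
weights = F.softmax(lse, dim=-1)                   # each block's share of the softmax mass
blocked = (torch.stack(partial_out, dim=1) * weights.unsqueeze(-1)).sum(dim=1)

print(torch.allclose(full, blocked, atol=1e-5))    # True
```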
Forward citations
Cited by 19 Pith papers
-
CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives
CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.
-
MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
-
Long Context Pre-Training with Lighthouse Attention
Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower los...
-
Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation
GVR uses previous-step Top-K predictions, pre-indexed stats, secant counting, and shared-memory verification to deliver 1.88x average speedup over radix-select while preserving bit-exact Top-K on DeepSeek-V3.2 workloads.
-
AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding
AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.
-
Mixture-of-Top-k Attention: Efficient Attention via Scalable Fast Weights
MiTA makes attention scalable by gathering query-aware top-k key-value pairs through landmarks as deformable routed experts and compressing the N-width fast-weight MLP into a shared narrower expert.
-
AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference
AB-Sparse adaptively allocates per-head block sizes for sparse attention, adds lossless centroid quantization and custom variable-block GPU kernels, and reports up to 5.43% accuracy gain over fixed-block baselines wit...
-
UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification
UniPrefill accelerates LLM prefill via block-wise dynamic sparsification, achieving up to 2.1x TTFT speedup while supporting hybrid architectures and native vLLM continuous batching.
-
MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems
MoE-Hub enables seamless MoE communication overlap via hardware-accelerated destination-agnostic data transmission, delivering 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedups over prior systems.
-
Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding
Salca is a new ASIC accelerator that achieves 3.82× speedup and 74.19× energy efficiency over A100 for long-context attention via dual-compression dynamic sparse attention and pipelined hardware.
-
Forget, Then Recall: Learnable Compression and Selective Unfolding via Gist Sparse Attention
Gist Sparse Attention uses learnable gist compression tokens as both summaries and routing signals, then selectively unfolds relevant raw chunks for fine-grained attention, outperforming compression and sparse-attenti...
-
HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention
HISA speeds up fine-grained sparse attention indexers via block-then-token hierarchy, delivering substantial speedups at 64K context with no training and quality matching the original DSA on long-context benchmarks.
-
Why Attend to Everything? Focus is the Key
Focus learns a few centroids to gate long-range token attention, producing sparse attention that matches or beats full attention quality with up to 8.6x speedup at million-token lengths.
-
RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference
RAT+ pretrains a single dense recurrent-augmented attention model that supports flexible dilated sparse inference after short adaptation, matching dense accuracy at moderate dilation and losing only 1-3 points at high...
-
Kimi Linear: An Expressive, Efficient Attention Architecture
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
-
MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.
-
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...
-
An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference
Fluxion achieves 1.5x-3.7x speedup in long-context LLM inference with CPU KV caches while limiting accuracy degradation to at most 0.26 relative to full attention.
-
VFA: Relieving Vector Operations in Flash Attention with Global Maximum Pre-computation
VFA optimizes Flash Attention by pre-computing global max approximations from key blocks and reordering traversal to reduce vector bottlenecks while preserving exact computation.