Recognition: no theorem link
MoBA: Mixture of Block Attention for Long-Context LLMs
Pith reviewed 2026-05-16 06:12 UTC · model grok-4.3
The pith
Mixture of Block Attention lets LLMs learn their own sparse attention patterns and match full-attention performance on long contexts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MoBA applies mixture-of-experts routing to the attention mechanism by partitioning the input sequence into fixed-size blocks and learning a router that selects which blocks each query block attends to. The architecture supports both dense and sparse modes within the same trained model, allowing seamless transitions during inference. The method has been integrated into production long-context serving for Kimi and is released with code.
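To make the core claim concrete, here is a minimal sketch of block-routed attention as described above, not the released MoBA implementation: each query scores mean-pooled key blocks, keeps the top-k blocks, and attends only to tokens inside them. The function name, the mean-pool scoring, and the non-causal simplification are illustrative assumptions; the official code lives in the linked GitHub repository.

```python
import torch
import torch.nn.functional as F

def moba_attention(q, k, v, block_size=4, top_k=2):
    """Toy single-head block-routed attention (illustrative, not the released kernel).

    Each query scores mean-pooled key blocks, keeps the top_k blocks, and
    attends only to tokens inside them. Causality is ignored for brevity.
    """
    seq_len, dim = q.shape
    assert seq_len % block_size == 0, "toy version assumes divisible lengths"
    n_blocks = seq_len // block_size

    # Router: score each block by the query's similarity to the block's mean key.
    block_repr = k.view(n_blocks, block_size, dim).mean(dim=1)    # [n_blocks, dim]
    gate = q @ block_repr.T                                       # [seq_len, n_blocks]
    chosen = gate.topk(top_k, dim=-1).indices                     # [seq_len, top_k]

    # Expand the block choices to a query-by-key-token mask, then attend.
    block_mask = F.one_hot(chosen, n_blocks).sum(dim=1).bool()    # [seq_len, n_blocks]
    token_mask = block_mask.repeat_interleave(block_size, dim=1)  # [seq_len, seq_len]
    scores = (q @ k.T) / dim ** 0.5
    scores = scores.masked_fill(~token_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Same weights, sparsity chosen per call: top_k equal to the block count is dense.
q, k, v = (torch.randn(16, 32) for _ in range(3))
sparse_out = moba_attention(q, k, v, block_size=4, top_k=2)
dense_out = moba_attention(q, k, v, block_size=4, top_k=4)
```

With top_k equal to the number of blocks, the mask admits every key and the call reduces to full attention over the same weights, which is the dense/sparse transition the core claim describes; lowering top_k at call time is also the simplest reading of the per-request sparsity idea listed further down.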
What carries the argument
Mixture of Block Attention, which uses learned routers to select subsets of token blocks for attention computation instead of fixed or full patterns.
If this is right
- The same model weights can be used for both high-accuracy dense inference and lower-cost sparse inference without separate training runs.
- Context length can increase beyond current limits while keeping attention compute sub-quadratic on average (a rough operation count follows this list).
- Attention patterns emerge from data rather than task-specific heuristics such as sink or window attention.
- The block-routing idea can be inserted into existing transformer stacks with only local changes to the attention layer.
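To put rough numbers on the sub-quadratic bullet above, a back-of-the-envelope count of query-key score computations per head; the sequence length, block size, and routed-block count are illustrative assumptions, not figures reported in the paper.

```python
# Rough count of query-key score computations per head (illustrative numbers,
# not measurements from the paper).
N = 1_000_000        # sequence length
B = 4_096            # block size
k = 8                # routed blocks per query

full_attention = N * N          # every query scores every key
moba_attention = N * k * B      # every query scores k blocks of B keys

print(f"full:  {full_attention:.3e} scores")
print(f"MoBA:  {moba_attention:.3e} scores")
print(f"ratio: {full_attention / moba_attention:.0f}x fewer score computations")
```

In this sketch the router still scores every query against every block representative (about N·N/B comparisons), so the saving is a large constant factor rather than a change in asymptotic order unless the routing stage itself is made hierarchical.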
Where Pith is reading between the lines
- Similar block-wise routing could be applied to other compute-heavy transformer components, such as feed-forward layers, to further reduce long-sequence costs.
- The learned routers might reveal interpretable patterns of long-range dependency that differ across domains or tasks.
- Production systems could dynamically adjust the sparsity level per request by changing the number of routed blocks at inference time (the top_k argument in the sketch under the core claim above).
Load-bearing premise
End-to-end trained block routing will reliably discover attention patterns that are both more efficient than fixed sparse structures and at least as effective as full attention across diverse long-context tasks.
What would settle it
A controlled long-context reasoning benchmark on which a trained MoBA model scores measurably below a full-attention baseline of the same size while also failing to show clear compute savings over fixed block-sparse attention.
Original abstract
Scaling the effective context length is essential for advancing large language models (LLMs) toward artificial general intelligence (AGI). However, the quadratic increase in computational complexity inherent in traditional attention mechanisms presents a prohibitive overhead. Existing approaches either impose strongly biased structures, such as sink or window attention which are task-specific, or radically modify the attention mechanism into linear approximations, whose performance in complex reasoning tasks remains inadequately explored. In this work, we propose a solution that adheres to the "less structure" principle, allowing the model to determine where to attend autonomously, rather than introducing predefined biases. We introduce Mixture of Block Attention (MoBA), an innovative approach that applies the principles of Mixture of Experts (MoE) to the attention mechanism. This novel architecture demonstrates superior performance on long-context tasks while offering a key advantage: the ability to seamlessly transition between full and sparse attention, enhancing efficiency without the risk of compromising performance. MoBA has already been deployed to support Kimi's long-context requests and demonstrates significant advancements in efficient attention computation for LLMs. Our code is available at https://github.com/MoonshotAI/MoBA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Mixture of Block Attention (MoBA), which applies Mixture-of-Experts routing to the attention mechanism by partitioning the context into blocks and training a router to select a subset of blocks for each query. This design aims to reduce quadratic complexity for long contexts while following a 'less structure' principle that lets the model discover attention patterns autonomously rather than imposing fixed biases such as sliding windows or sink tokens. The authors report superior performance on long-context benchmarks, the ability to transition seamlessly between full and sparse attention without performance degradation, and production deployment in Kimi; code is released at the provided GitHub link.
Significance. If the empirical results survive proper controls, MoBA would constitute a meaningful advance in efficient long-context modeling by demonstrating that learned block routing can preserve full-attention quality on complex reasoning tasks while enabling sparsity. The avoidance of hand-crafted biases and the open-sourced implementation are concrete strengths that could influence subsequent work on dynamic attention mechanisms.
major comments (2)
- [§4 (Experiments)] §4 (Experiments) and associated tables: the central claim that end-to-end block routing discovers patterns superior to fixed sparsity requires direct ablations against non-learned block structures (e.g., sliding-window blocks or global+local patterns). Without these controls it remains unclear whether reported gains derive from the learned router or simply from the block decomposition itself.
- [§3.2 (Router)] §3.2 (Router) and §4.3 (Analysis): no routing entropy, load-balance statistics, or per-task router activation maps are provided. Given known MoE collapse risks, explicit verification that the router learns dynamic, task-adaptive behavior rather than converging to a near-static pattern is load-bearing for the 'autonomous discovery' claim.
minor comments (2)
- [Abstract] Abstract and §1: the phrase 'seamlessly transition between full and sparse attention' should be accompanied by a concrete metric (e.g., performance delta at varying sparsity levels) rather than left as a qualitative statement.
- [Related Work] Related Work: recent block-sparse and hierarchical attention papers should be cited to situate MoBA more precisely against contemporaneous alternatives.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to strengthen the empirical support for our claims.
Point-by-point responses
-
Referee: [§4 (Experiments)] §4 (Experiments) and associated tables: the central claim that end-to-end block routing discovers patterns superior to fixed sparsity requires direct ablations against non-learned block structures (e.g., sliding-window blocks or global+local patterns). Without these controls it remains unclear whether reported gains derive from the learned router or simply from the block decomposition itself.
Authors: We agree that direct ablations against fixed block structures are required to isolate the contribution of the learned router. In the revised manuscript we have added these controls in §4, comparing MoBA against sliding-window block attention and global+local patterns on the same long-context benchmarks. The new results show that learned routing outperforms the fixed alternatives, indicating that performance gains are not attributable solely to the block decomposition. revision: yes
-
Referee: [§3.2 (Router)] §3.2 (Router) and §4.3 (Analysis): no routing entropy, load-balance statistics, or per-task router activation maps are provided. Given known MoE collapse risks, explicit verification that the router learns dynamic, task-adaptive behavior rather than converging to a near-static pattern is load-bearing for the 'autonomous discovery' claim.
Authors: We acknowledge that explicit router diagnostics are necessary to substantiate the dynamic, task-adaptive behavior. The revised §4.3 now includes routing entropy, load-balance statistics across tasks, and per-task activation maps. These additions demonstrate that the router does not collapse to static patterns and instead exhibits task-dependent selection, consistent with the autonomous-discovery principle. revision: yes
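For concreteness, a minimal sketch of the kind of diagnostics the referee asks for, assuming the router exposes each query's chosen block indices; the function name and tensor layout are illustrative assumptions, not taken from the paper or its code.

```python
import torch

def router_diagnostics(selected_blocks, n_blocks):
    """Routing entropy and load balance from hard block selections.

    selected_blocks: LongTensor [n_queries, top_k] of chosen block indices.
    Returns the entropy of the aggregate block-usage distribution (in bits)
    and the max/mean load ratio (1.0 = perfectly balanced).
    """
    counts = torch.bincount(selected_blocks.flatten(), minlength=n_blocks).float()
    usage = counts / counts.sum()                        # block-usage distribution
    entropy = -(usage * (usage + 1e-12).log2()).sum()    # collapse -> entropy near 0
    load_ratio = counts.max() / counts.mean()            # hot blocks -> ratio >> 1
    return entropy.item(), load_ratio.item()

# A near-static router (every query picks block 0) vs. a spread-out router.
static = torch.zeros(128, 2, dtype=torch.long)
spread = torch.randint(0, 16, (128, 2))
print(router_diagnostics(static, 16))   # low entropy, high load ratio
print(router_diagnostics(spread, 16))   # entropy near log2(16) = 4, ratio near 1
```

Per-task activation maps would repeat the same computation separately per task and compare the resulting usage distributions.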
Circularity Check
No circularity: empirical architecture with external benchmarks
Full rationale
The paper proposes MoBA as a novel attention architecture applying MoE principles to block-level attention, allowing dynamic transition between full and sparse modes. No derivation chain exists; claims rest on end-to-end training and empirical results on long-context tasks, not on any equation that reduces to its own inputs by construction. No self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central advantage (seamless full-to-sparse transition without performance loss) is presented as an empirical outcome measured against external benchmarks, making the work self-contained.
Axiom & Free-Parameter Ledger
free parameters (1)
- number of blocks and router capacity
axioms (1)
- [standard math] Standard transformer attention can be partitioned into independent blocks without changing the underlying computation graph when routing is dense.
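The axiom can be checked numerically in a toy setting: computing attention blockwise and recombining the partial results with log-sum-exp weights reproduces full softmax attention when every block is included. A minimal sketch under simplifying assumptions (non-causal, single head), not the paper's formulation:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, dim, block_size = 16, 8, 4
q, k, v = (torch.randn(seq_len, dim) for _ in range(3))

# Reference: full (non-causal, single-head) softmax attention.
full = F.softmax(q @ k.T / dim ** 0.5, dim=-1) @ v

# Blockwise: attend within each key block, then combine the per-block partial
# outputs with log-sum-exp weights. With every block included ("dense routing"),
# this reproduces full attention up to floating-point error.
partial_out, partial_lse = [], []
for kb, vb in zip(k.split(block_size), v.split(block_size)):
    s = q @ kb.T / dim ** 0.5                      # scores against this block
    partial_out.append(F.softmax(s, dim=-1) @ vb)  # block-local attention output
    partial_lse.append(torch.logsumexp(s, dim=-1, keepdim=True))

lse = torch.cat(partial_lse, dim=-1)               # [seq_len, n_blocks]
weights = F.softmax(lse, dim=-1)                   # each block's share of the softmax mass
blocked = (torch.stack(partial_out, dim=1) * weights.unsqueeze(-1)).sum(dim=1)

print(torch.allclose(full, blocked, atol=1e-5))    # True
```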
Forward citations
Cited by 19 Pith papers
-
CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives
CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.
-
MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
-
Long Context Pre-Training with Lighthouse Attention
Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower los...
-
Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation
GVR uses previous-step Top-K predictions, pre-indexed stats, secant counting, and shared-memory verification to deliver 1.88x average speedup over radix-select while preserving bit-exact Top-K on DeepSeek-V3.2 workloads.
-
AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding
AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.
-
Mixture-of-Top-k Attention: Efficient Attention via Scalable Fast Weights
MiTA makes attention scalable by gathering query-aware top-k key-value pairs through landmarks as deformable routed experts and compressing the N-width fast-weight MLP into a shared narrower expert.
-
AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference
AB-Sparse adaptively allocates per-head block sizes for sparse attention, adds lossless centroid quantization and custom variable-block GPU kernels, and reports up to 5.43% accuracy gain over fixed-block baselines wit...
-
UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification
UniPrefill accelerates LLM prefill via block-wise dynamic sparsification, achieving up to 2.1x TTFT speedup while supporting hybrid architectures and native vLLM continuous batching.
-
MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems
MoE-Hub enables seamless MoE communication overlap via hardware-accelerated destination-agnostic data transmission, delivering 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedups over prior systems.
-
Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding
Salca is a new ASIC accelerator that achieves 3.82× speedup and 74.19× energy efficiency over A100 for long-context attention via dual-compression dynamic sparse attention and pipelined hardware.
-
Forget, Then Recall: Learnable Compression and Selective Unfolding via Gist Sparse Attention
Gist Sparse Attention uses learnable gist compression tokens as both summaries and routing signals, then selectively unfolds relevant raw chunks for fine-grained attention, outperforming compression and sparse-attenti...
-
HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention
HISA speeds up fine-grained sparse attention indexers via block-then-token hierarchy, delivering substantial speedups at 64K context with no training and quality matching the original DSA on long-context benchmarks.
-
Why Attend to Everything? Focus is the Key
Focus learns a few centroids to gate long-range token attention, producing sparse attention that matches or beats full attention quality with up to 8.6x speedup at million-token lengths.
-
RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference
RAT+ pretrains a single dense recurrent-augmented attention model that supports flexible dilated sparse inference after short adaptation, matching dense accuracy at moderate dilation and losing only 1-3 points at high...
-
Kimi Linear: An Expressive, Efficient Attention Architecture
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
-
MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.
-
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...
-
An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference
Fluxion achieves 1.5x-3.7x speedup in long-context LLM inference with CPU KV caches while limiting accuracy degradation to at most 0.26 relative to full attention.
-
VFA: Relieving Vector Operations in Flash Attention with Global Maximum Pre-computation
VFA optimizes Flash Attention by pre-computing global max approximations from key blocks and reordering traversal to reduce vector bottlenecks while preserving exact computation.