Recognition: 2 theorem links
· Lean Theorem · Mixture of Layers with Hybrid Attention
Pith reviewed 2026-05-12 03:46 UTC · model grok-4.3
The pith
Mixture of Layers replaces monolithic transformer blocks with K parallel thin blocks at reduced width, routed by top-k selection and linked by down/up projections.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Mixture of Layers (MoL) replaces the conventional full-width transformer block of dimension d_model with K parallel thin blocks of dimension d_thin much smaller than d_model. These blocks are connected by learned down-projection and up-projection matrices, and each token is routed to its top-k blocks by a learned router. The resulting attention coverage problem is addressed by hybrid attention: a single shared softmax attention block supplies global context, while each routed thin block uses Gated DeltaNet linear attention.
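To make the routed data path concrete, here is a minimal sketch of one such layer, assuming a simple linear top-k router, a small MLP standing in for the Gated DeltaNet body, and shared softmax attention added into the residual stream before routing. The names (`MoLLayer`, `ThinBlock`, `d_thin`, `top_k`) and the placement of the shared block relative to routing are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a Mixture-of-Layers layer: K thin blocks at width d_thin,
# reached through learned down/up projections and top-k routing, plus one shared
# full-width softmax-attention block for global context. Not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThinBlock(nn.Module):
    """One routed thin block; a tiny MLP stands in for the Gated DeltaNet body."""
    def __init__(self, d_model: int, d_thin: int):
        super().__init__()
        self.down = nn.Linear(d_model, d_thin, bias=False)   # learned down-projection
        self.body = nn.Sequential(nn.Linear(d_thin, d_thin), nn.GELU(),
                                  nn.Linear(d_thin, d_thin))
        self.up = nn.Linear(d_thin, d_model, bias=False)      # learned up-projection

    def forward(self, x):                                      # x: (tokens, d_model)
        return self.up(self.body(self.down(x)))

class MoLLayer(nn.Module):
    def __init__(self, d_model=512, d_thin=64, num_blocks=8, top_k=2, n_heads=8):
        super().__init__()
        self.shared_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.router = nn.Linear(d_model, num_blocks, bias=False)
        self.blocks = nn.ModuleList([ThinBlock(d_model, d_thin) for _ in range(num_blocks)])
        self.top_k = top_k

    def forward(self, h):                                      # h: (batch, seq, d_model)
        # Shared softmax attention supplies global context to every token.
        g, _ = self.shared_attn(h, h, h, need_weights=False)
        h = h + g
        # Route each token to its top-k thin blocks.
        flat = h.reshape(-1, h.size(-1))                       # (tokens, d_model)
        gates = F.softmax(self.router(flat), dim=-1)
        w, idx = gates.topk(self.top_k, dim=-1)                # (tokens, top_k)
        out = torch.zeros_like(flat)
        for b, block in enumerate(self.blocks):
            tok_pos, slot = (idx == b).nonzero(as_tuple=True)  # tokens routed to block b
            if tok_pos.numel():
                out[tok_pos] += w[tok_pos, slot, None] * block(flat[tok_pos])
        return h + out.view_as(h)

if __name__ == "__main__":
    layer = MoLLayer()
    y = layer(torch.randn(2, 16, 512))
    print(y.shape)  # torch.Size([2, 16, 512])
```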
What carries the argument
Mixture of Layers (MoL) with hybrid attention: K thin parallel blocks connected by down/up projections, selected by top-k routing, and equipped with one shared softmax attention block plus Gated DeltaNet linear attention in the routed blocks.
If this is right
- Transformer capacity can be increased by adding more thin blocks rather than widening existing blocks.
- Compute per token stays roughly constant even as the total number of blocks grows, because only top-k blocks are activated (see the back-of-envelope sketch after this list).
- Global context is preserved by the shared softmax block while local computation uses efficient linear attention.
- The same routing and projection mechanism can be applied at every layer without changing the overall model depth.
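A back-of-envelope calculation under assumed dimensions (d_model = 2048, d_thin = 256, top_k = 2; not the paper's numbers) makes the constant-compute point concrete: per-token cost in the routed path scales with top_k and d_thin, while the routed parameter pool grows with K.

```python
# Back-of-envelope mult-adds per token in the routed path, under assumed sizes
# (illustrative only; the paper's actual dimensions may differ). The point: active
# compute depends on top_k and d_thin, not on the total number of blocks K.
d_model, d_thin, top_k = 2048, 256, 2

def thin_block_cost(d_model: int, d_thin: int) -> int:
    # down-projection + a square d_thin x d_thin body + up-projection
    return d_model * d_thin + d_thin * d_thin + d_thin * d_model

active = top_k * thin_block_cost(d_model, d_thin)
for K in (8, 32, 128):
    pool = K * thin_block_cost(d_model, d_thin)  # routed-pool parameters grow with K
    print(f"K={K:4d}  active mult-adds/token ≈ {active:,}  routed-pool parameters ≈ {pool:,}")
```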
Where Pith is reading between the lines
- The design may allow models to allocate specialized thin blocks to different domains or modalities within the same network.
- Memory bandwidth during inference could decrease because only the active thin blocks and their projections need to be loaded.
- Interpretability might improve if individual thin blocks learn distinct functions that can be inspected separately.
Load-bearing premise
Hybrid attention consisting of one shared softmax block plus Gated DeltaNet linear attention in the routed blocks is enough to preserve attention coverage and model quality when the number of thin blocks grows large.
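For context, the routed blocks' linear attention in the cited Gated DeltaNet work maintains a fixed-size matrix state per head rather than attending over all past tokens, which is why coverage becomes the question as K grows. A sketch of that gated delta-rule recurrence, up to normalization and parameterization details the abstract does not spell out:

$$
S_t = S_{t-1}\bigl(\alpha_t\,(I - \beta_t\,k_t k_t^{\top})\bigr) + \beta_t\, v_t k_t^{\top},
\qquad o_t = S_t\, q_t,
$$

with $\alpha_t \in (0,1)$ a learned decay gate and $\beta_t \in (0,1)$ a learned writing strength.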
What would settle it
Train two otherwise identical MoL models at increasing block counts; replace the shared softmax block with additional Gated DeltaNet blocks in one model and measure whether perplexity or downstream accuracy degrades faster in the version without the shared softmax block.
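One way to organize that comparison as a controlled grid, with a hypothetical `train_and_eval` pipeline left as a stand-in (the configuration fields and block counts below are illustrative assumptions, not the paper's setup):

```python
# Sketch of the proposed ablation grid: for each block count K, train one arm with
# the shared softmax block and one arm that replaces it with additional Gated
# DeltaNet capacity, then compare perplexity / downstream accuracy per K.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class MoLConfig:
    num_blocks: int        # K routed thin blocks
    d_thin: int
    top_k: int
    shared_softmax: bool   # True: hybrid attention; False: extra DeltaNet blocks instead

def experiment_grid(block_counts=(8, 16, 32, 64)):
    return [MoLConfig(num_blocks=k, d_thin=256, top_k=2, shared_softmax=s)
            for k, s in product(block_counts, (True, False))]

if __name__ == "__main__":
    for cfg in experiment_grid():
        print(cfg)  # in practice: ppl = train_and_eval(cfg); compare the two arms at each K
```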
Original abstract
Standard Mixture-of-Experts (MoE) transformers route tokens to expert subnetworks within each layer, but the layer structure itself remains monolithic. We introduce Mixture of Layers (MoL), which replaces full-width transformer blocks (d_model) with K parallel thin blocks at reduced dimensionality (d_thin << d_model), connected via learned down/up projections and composed via top-k block routing. Scaling sparse block routing to many blocks creates an attention coverage problem, as each block sees fewer tokens. We address this by introducing hybrid attention, which pairs one shared softmax block for global context with Gated DeltaNet linear attention in routed blocks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Mixture of Layers (MoL) as an alternative to standard transformer blocks: each full-width (d_model) block is replaced by K parallel thin blocks (d_thin << d_model) linked by learned down/up projections and composed via top-k routing. To mitigate the resulting attention coverage loss (each routed block sees only a subset of tokens), the authors introduce hybrid attention consisting of one shared softmax attention block for global context plus Gated DeltaNet linear attention inside the routed thin blocks.
Significance. If the hybrid attention mechanism can be shown to propagate global context effectively through the projections and preserve expressivity under sparsity, MoL would represent a distinct scaling axis for transformers that sparsifies at the layer level rather than only within experts. This could yield new efficiency trade-offs, but the absence of any empirical results, ablations, or scaling analysis leaves the practical significance speculative.
major comments (2)
- [Abstract] The claim that hybrid attention 'addresses' the attention coverage problem created by scaling sparse block routing is backed by neither a derivation of how global context from the shared softmax block reaches every routed path via the down/up projections nor any argument that Gated DeltaNet preserves sufficient expressivity as K grows.
- [Abstract] Architecture description: no ablation studies, baseline comparisons (standard transformer, MoE, or thin-block variants), or scaling curves are presented to test whether model quality is maintained under top-k routing and reduced per-block width; without such evidence the central architectural claim remains unverified.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript introducing Mixture of Layers with hybrid attention. We address each major comment point by point below, indicating planned revisions to strengthen the presentation and support for the claims.
Point-by-point responses
- Referee: [Abstract] The claim that hybrid attention 'addresses' the attention coverage problem created by scaling sparse block routing is backed by neither a derivation of how global context from the shared softmax block reaches every routed path via the down/up projections nor any argument that Gated DeltaNet preserves sufficient expressivity as K grows.
Authors: We agree the abstract is high-level and does not contain a full derivation. The manuscript body describes the architecture but lacks explicit propagation math. In revision we will add a dedicated subsection deriving how the shared softmax output is linearly combined into each routed path: the down-projection matrix maps the global context vector into the thin-block space, the Gated DeltaNet processes it locally, and the up-projection integrates the result back into the residual stream, ensuring every token receives the global signal regardless of routing. We will also add an expressivity argument showing that Gated DeltaNet's recurrent state and gating preserve per-block capacity as K grows, because routing sparsity is offset by the fixed shared block and the fact that each thin block only needs to model a subset of the token distribution. Revision: yes.
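Read literally, the promised derivation amounts to a per-token composition along these lines (a reconstruction from the rebuttal's own wording, with $g_t$ the shared softmax-attention output, $\mathcal{T}_t$ the token's top-k routed blocks, and gating weights $w_{t,i}$; normalization and gating details are not specified in the manuscript):

$$
\tilde h_t = h_t + g_t,
\qquad
h_t^{\mathrm{out}} = \tilde h_t + \sum_{i \in \mathcal{T}_t} w_{t,i}\; W^{(i)}_{\uparrow}\,
\mathrm{GatedDeltaNet}_i\!\bigl(W^{(i)}_{\downarrow}\,\tilde h_t\bigr),
$$

so the global signal $g_t$ enters every routed path through the down-projection $W^{(i)}_{\downarrow}$ and also persists in the residual stream for blocks a token is not routed to.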
- Referee: [Abstract] Architecture description: no ablation studies, baseline comparisons (standard transformer, MoE, or thin-block variants), or scaling curves are presented to test whether model quality is maintained under top-k routing and reduced per-block width; without such evidence the central architectural claim remains unverified.
Authors: The current manuscript is an architectural proposal and does not yet contain empirical results. We accept that verification requires evidence. In the revised version we will add an Experiments section reporting (i) direct comparisons against a standard Transformer and a comparable MoE baseline on language-modeling perplexity, (ii) ablations isolating the shared softmax block versus pure Gated DeltaNet routing, and (iii) scaling curves that vary K and d_thin while holding total parameter count fixed, demonstrating that quality is preserved under the proposed sparsity. Revision: yes.
Circularity Check
No circularity; architectural proposal is self-contained
full rationale
The manuscript presents Mixture of Layers as a new architectural construction with hybrid attention introduced to mitigate coverage under top-k routing. No equations, fitted parameters, or predictions are shown that reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, and the hybrid attention is described as an explicit design choice rather than derived from prior self-referential results. The derivation chain consists of forward engineering steps without self-definitional loops or renaming of known results.
Axiom & Free-Parameter Ledger
invented entities (2)
- Mixture of Layers (MoL): no independent evidence
- Hybrid attention: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tag: unclear)
The relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "We introduce Mixture of Layers (MoL), which replaces full-width transformer blocks (d_model) with K parallel thin blocks at reduced dimensionality (d_thin ≪ d_model), connected via learned down/up projections and composed via top-k block routing... hybrid attention, which pairs one shared softmax block for global context with Gated DeltaNet linear attention in routed blocks."
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt (tag: unclear)
The relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "Scaling sparse block routing to many blocks creates an attention coverage problem... We address this by introducing hybrid attention..."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Yilong Chen, Naibin Gu, Junyuan Shang, Zhenyu Zhang, Yuchen Feng, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, and Haifeng Wang. Mixture of universal experts: Scaling virtual width via depth-width transformation. arXiv preprint arXiv:2603.04971.
- [2] Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, et al. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066.
- [3] DeepSeek-AI. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437.
- [4] Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, and Beren Millidge. Zamba: A compact 7B SSM hybrid model. arXiv preprint arXiv:2405.16712.
- [5] Ivan Ternovtsii and Yurii Bilak. Equifinality in mixture of experts: Routing topology does not determine language modeling quality. arXiv preprint arXiv:2604.14419.
- [6] Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving Mamba2 with delta rule. arXiv preprint arXiv:2412.06464.
- [7] Yujiao Yang, Jing Lian, and Linhui Li. Union of experts: Adapting hierarchical routing to equivalently decomposed transformer. arXiv preprint arXiv:2503.02495.
- [8] Minshen Zhang, Xiang Hu, Jianguo Li, Wei Wu, and Kewei Tu. Flash multi-head feed-forward network. arXiv preprint (under review at ICLR 2026).
- [9] Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. ST-MoE: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906.