Multi-Mixer Models: Flexible Sequence Modeling with Shared Representations
Pith reviewed 2026-06-29 14:21 UTC · model grok-4.3
The pith
A model can share most parameters between attention and linear recurrent mixers while switching between them along the sequence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Oryx is a hybrid model that can flexibly switch between different token mixers throughout a sequence, for example using quadratic attention for rich context and linear recurrences for efficient generation, while tying at least 90 percent of its parameters across the mixers so they share internal representations. Validated with Mamba-2 and Gated DeltaNet variants up to 1.4B parameters, Oryx achieves comparable or better performance than single-mixer baselines on language modeling tasks under fixed token budgets and a mixed-training strategy. At the 1.4B scale it outperforms each baseline by at least 0.7 percentage points on averaged tasks, and it reaches Transformer-level retrieval performance while processing fewer than 10 percent of tokens in attention mode.
What carries the argument
Sequence-axis hybridization in Oryx, which switches between mixers along the token sequence with at least 90 percent parameter sharing across modes.
If this is right
- Hybrid models can outperform pure attention or pure recurrent baselines on averaged language modeling tasks at the 1.4B scale.
- Retrieval performance comparable to a Transformer can be obtained while routing only a small fraction of tokens through attention.
- Attention and linear recurrent models can operate over largely shared internal representations without large performance loss.
- Sequence-axis hybridization offers an alternative to static block interleaving for combining sub-quadratic and quadratic mixers.
Where Pith is reading between the lines
- If representations can be shared this way, separate training runs for attention-only and recurrence-only models may become less necessary.
- Dynamic mixer selection during inference could further improve efficiency on tasks with varying context demands.
- Extending the approach to more than two mixer types or to learned switch decisions could be tested directly on the same training setup.
Load-bearing premise
The reported performance gains are produced by the flexible sequence switching and parameter sharing rather than by the mixed-training procedure, token allocation, or the particular choice of base mixers.
What would settle it
A controlled comparison in which Oryx is trained without sequence switching but with the same mixed strategy and token budget, and still matches the reported gains, would show that the switching itself is not required.
Original abstract
Softmax attention is the cornerstone of modern large language models, but its memory scales linearly and compute quadratically with sequence length. Linear recurrent models, such as linear attention and state space models, have become widely studied as alternatives to attention due to their linear compute and constant memory. While these sub-quadratic token mixing methods, or mixers, achieve promising efficiency gains and competitive results on a wide range of benchmarks, current linear recurrent models still lag behind on tasks that require long-context retrieval or in-context learning. A growing body of work studies hybrid architectures that attempt to mitigate these trade-offs by statically interleaving or merging attention and recurrent blocks. In this work, we explore a new axis of developing hybrid models: across the token sequence. We propose Oryx, a hybrid model that can, throughout a sequence, flexibly switch between different mixers, for example quadratic attention for rich context utilization and linear recurrences for efficient generation. Oryx ties at least 90% of its parameters across mixers, enabling attention and recurrent modes to operate over shared internal representations. We validate our design with Mamba-2 and Gated DeltaNet variants, up to 1.4B models. Under fixed token budgets and a mixed-training strategy, Oryx achieves comparable or better performance than its single-mixer baselines. At the 1.4B scale, all instances of Oryx outperform their respective baselines by at least 0.7 percentage points on averaged language modeling tasks. On retrieval tasks, Oryx achieves performance comparable to the Transformer baseline even when processing only a tiny fraction (<10%) of the tokens in attention mode. These results suggest that attention and linear recurrent models can share internal representations, and motivate sequence-axis hybridization as a promising direction.
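The abstract describes the switching idea without pseudocode, so the following is only a toy illustration of what sequence-axis hybridization could look like, not the authors' implementation. Everything named here is hypothetical: `toy_hybrid_mixer`, the `modes` list, and the `chunk` size. It assumes that the tied parameters are the shared query/key/value projections, that routing happens per chunk, that the recurrent state is updated on every token regardless of mode, and it uses plain ungated linear attention as a stand-in for Mamba-2 or Gated DeltaNet.

```python
import numpy as np

def toy_hybrid_mixer(x, Wq, Wk, Wv, modes, chunk):
    """Toy sequence-axis hybrid (illustrative assumption, not Oryx itself).

    x:     (T, D) token representations
    Wq/Wk/Wv: projections shared by both mixers (the "tied" parameters)
    modes: one entry per chunk, either "attn" or "lin"
    chunk: chunk length in tokens
    """
    T = x.shape[0]
    assert len(modes) * chunk >= T, "need one mode per chunk"
    q, k, v = x @ Wq, x @ Wk, x @ Wv          # same projections for both modes
    out = np.zeros_like(v)
    state = np.zeros((k.shape[1], v.shape[1]))  # linear recurrent state
    for t in range(T):
        # The recurrent state is updated for every token, in either mode.
        state += np.outer(k[t], v[t])
        if modes[t // chunk] == "attn":
            # Attention mode: causal softmax over all tokens seen so far.
            scores = q[t] @ k[: t + 1].T / np.sqrt(k.shape[1])
            w = np.exp(scores - scores.max())
            out[t] = (w / w.sum()) @ v[: t + 1]
        else:
            # Linear mode: read the output from the fixed-size state.
            out[t] = q[t] @ state
    return out

# Small demo: 4 chunks of 2 tokens, attention used on the first and last chunk.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
y = toy_hybrid_mixer(x, Wq, Wk, Wv, ["attn", "lin", "lin", "attn"], chunk=2)
print(y.shape)
```

In this sketch the two modes differ only in how the output is read out, which is one simple way for both to operate over the same internal representations. The real model may tie parameters differently.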
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Oryx, a hybrid model that flexibly switches between quadratic attention and linear recurrent mixers (e.g., Mamba-2, Gated DeltaNet) along the token sequence axis while tying at least 90% of parameters across modes. Under fixed token budgets and a mixed-training strategy, Oryx is claimed to outperform single-mixer baselines by at least 0.7 pp on averaged language modeling tasks at the 1.4B scale and to match Transformer retrieval performance with <10% of tokens in attention mode, suggesting that attention and linear recurrent models can share internal representations.
Significance. If the central claim holds after isolating the hybridization effect, the work would be significant for establishing sequence-axis hybridization as a viable alternative to static block interleaving in hybrid architectures. The high degree of parameter sharing combined with empirical gains on retrieval using minimal attention tokens would provide concrete evidence that distinct mixing mechanisms can operate over shared representations, potentially guiding more efficient long-context model designs.
major comments (2)
- [Abstract] The claim that performance gains (0.7+ pp on LM tasks, retrieval parity with <10% attention tokens) are driven by flexible sequence-axis switching plus ≥90% parameter tying is not isolated from the mixed-training strategy. The abstract states results hold 'under fixed token budgets and a mixed-training strategy' but does not confirm that single-mixer baselines receive identical training; without this control the hybridization benefit cannot be attributed to the proposed mechanism.
- [Abstract] Performance numbers are reported with no experimental details, error bars, dataset descriptions, ablation controls, or confirmation of baseline equivalence, making it impossible to verify whether the data support the central claims about shared representations.
minor comments (2)
- [Abstract] The term 'mixed-training strategy' is introduced without definition or pointer to a methods section, leaving the training procedure across modes unclear.
- No equations, pseudocode, or formal description of the switching logic or parameter-tying implementation appear in the provided abstract, which would clarify how shared representations are realized.
Simulated Author's Rebuttal
We thank the referee for the detailed feedback on the abstract. We agree that greater clarity is needed regarding training equivalence and experimental context. We will revise the abstract accordingly while preserving its conciseness. Point-by-point responses follow.
- Referee: [Abstract] The claim that performance gains (0.7+ pp on LM tasks, retrieval parity with <10% attention tokens) are driven by flexible sequence-axis switching plus ≥90% parameter tying is not isolated from the mixed-training strategy. The abstract states results hold 'under fixed token budgets and a mixed-training strategy' but does not confirm that single-mixer baselines receive identical training; without this control the hybridization benefit cannot be attributed to the proposed mechanism.
Authors: We agree the abstract phrasing is ambiguous on this point. In the full manuscript, all models (Oryx variants and single-mixer baselines) are trained under identical conditions: the same fixed token budgets, the same datasets, and the same optimization hyperparameters. The mixed-training strategy applies specifically to Oryx by sampling both attention and recurrent modes during training; baselines remain in single-mixer mode throughout. This setup is detailed in Section 4. We will revise the abstract to explicitly state that single-mixer baselines receive identical training regimes, thereby isolating the hybridization effect. Revision: yes.
- Referee: [Abstract] Performance numbers are reported with no experimental details, error bars, dataset descriptions, ablation controls, or confirmation of baseline equivalence, rendering it impossible to verify whether the data supports the central claims about shared representations.
Authors: We acknowledge that the abstract omits these elements due to length constraints. The full paper supplies them: model scales up to 1.4B, dataset descriptions (standard language modeling and retrieval corpora), error bars from repeated runs, ablation studies on parameter sharing (≥90%) and mode switching ratios, and explicit confirmation of baseline training equivalence (Section 4). We will expand the abstract slightly to reference the 1.4B scale, the averaged LM and retrieval tasks, and direct readers to Sections 4–5 for full experimental details, controls, and baseline equivalence. Revision: yes.
Circularity Check
No circularity: purely empirical results with no derivation chain
The paper contains no equations, derivations, predictions from first principles, or load-bearing self-citations. All claims rest on experimental outcomes from training Oryx variants under fixed token budgets and mixed-training, benchmarked against single-mixer baselines. No step reduces a claimed result to its own inputs by construction, fitting, or self-referential definition. This is the standard case for an empirical architecture paper.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [7] Chunk state: Each chunk $i$ computes its contribution to the recurrent state $\mathbf{S}_i \in \mathbb{R}^{D_k \times D_v}$ via $\mathbf{K}_i^\top \mathbf{V}_i$, which costs $2 C D_k D_v$ FLOPs per chunk and $2 T D_k D_v$ FLOPs for the entire sequence.
- [8] State passing: The chunk states are updated to incorporate prior state information via a scan over the $T/C$ total states of size $D_k \times D_v$. This costs approximately $2 T D_k D_v / C$ FLOPs.
- [9] Intra-chunk output: The output from the intra-chunk interactions is computed via $(\mathbf{L}_i \circ \mathbf{Q}_i \mathbf{K}_i^\top) \mathbf{V}_i$, which costs $2 C^2 D_k + 2 C^2 D_v$ FLOPs per chunk when ignoring the mask. The total for all chunks is then $2 T C D_k + 2 T C D_v$ FLOPs.
- [10] Inter-chunk output: Finally, the output from the cumulative prior hidden states must be computed and added to the output of the intra-chunk calculations. The $\mathbf{Q}_i \mathbf{S}_{i-1}$ calculation costs $2 C D_k D_v$ FLOPs per chunk and $2 T D_k D_v$ FLOPs in total. Summing the inter-chunk and intra-chunk outputs costs $T P$ FLOPs, which is negligible and therefore ignored. I...
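The four steps above (chunk state, state passing, intra-chunk output, inter-chunk output) can be sketched in NumPy and checked against a token-by-token recurrence. This is a minimal sketch under stated assumptions, not the paper's kernel: it assumes an ungated linear mixer, so the mask $\mathbf{L}_i$ is taken to be a lower-triangular matrix of ones (Mamba-2 and Gated DeltaNet use decay or delta-rule variants), and the function names are hypothetical.

```python
import numpy as np

def chunked_linear_attention(Q, K, V, C):
    """Chunked causal linear attention (assumes no gating or decay)."""
    T, Dk = Q.shape
    Dv = V.shape[1]
    assert T % C == 0, "sketch assumes T is a multiple of the chunk size C"
    n = T // C
    Qc, Kc, Vc = Q.reshape(n, C, Dk), K.reshape(n, C, Dk), V.reshape(n, C, Dv)

    # Step 1, chunk state: each chunk's own contribution K_i^T V_i.
    S_local = np.einsum("nck,ncv->nkv", Kc, Vc)

    # Step 2, state passing: exclusive prefix sum, so S_prev[i] holds the
    # state accumulated over all chunks strictly before chunk i.
    S_prev = np.cumsum(S_local, axis=0) - S_local

    # Step 3, intra-chunk output: (L o Q_i K_i^T) V_i with a causal mask L.
    L = np.tril(np.ones((C, C)))
    Y_intra = (np.einsum("ncd,nmd->ncm", Qc, Kc) * L) @ Vc

    # Step 4, inter-chunk output: Q_i S_{i-1}, added to the intra-chunk part.
    Y_inter = np.einsum("nck,nkv->ncv", Qc, S_prev)
    return (Y_intra + Y_inter).reshape(T, Dv)

def naive_linear_attention(Q, K, V):
    """Reference: one rank-1 state update per token."""
    S = np.zeros((K.shape[1], V.shape[1]))
    out = np.empty((Q.shape[0], V.shape[1]))
    for t in range(Q.shape[0]):
        S += np.outer(K[t], V[t])
        out[t] = Q[t] @ S
    return out

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(16, 3)), rng.normal(size=(16, 3)), rng.normal(size=(16, 5))
print(np.allclose(chunked_linear_attention(Q, K, V, C=4), naive_linear_attention(Q, K, V)))
```

The chunked form trades the sequential per-token loop for a few batched matrix products per chunk, which is where the per-step FLOP counts in the list come from.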
- [11] Constant linear state update: All chunks of the sequence need to update the state, resulting in $2 T D_k D_v + 2 T D_k D_v / C$ total FLOPs when accounting for the chunk state and state passing portions of the linear model.
- [12] Linear output: As only a fraction $\delta$ of the chunks is assigned to the linear mixer and thus requires computation, the overall number of FLOPs is approximately $\delta (2 T C D_k + 2 T C D_v + 2 T D_k D_v)$ when accounting for the intra- and inter-chunk operations.
- [13] Attention output: To simplify the analysis, we assume that the routing $\delta$ selects chunks uniformly at random. Under this simplifying condition, the expected computation cost is $(1-\delta) T (T+1) (D_k + D_v)$. When combined, the total compute required for an Oryx forward pass is $2 T D_k D_v + 2 T D_k D_v / C + \delta (2 T C D_k + 2 T C D_v + 2 T D_k D_v) + (1-\delta) T (T+1) (D_k + D_v)$ ...
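The FLOP expressions quoted above can be turned into a small calculator to see how the routing fraction changes the cost. This is a sketch that only encodes the terms visible in the excerpt; the source formula is truncated, so any terms that follow it are not included. The function names and the example sizes are illustrative assumptions, not values confirmed by the paper.

```python
def linear_mixer_flops(T, C, Dk, Dv):
    """FLOPs of the chunked linear mixer: the four steps of the excerpt."""
    chunk_state = 2 * T * Dk * Dv
    state_passing = 2 * T * Dk * Dv / C
    intra_chunk = 2 * T * C * Dk + 2 * T * C * Dv
    inter_chunk = 2 * T * Dk * Dv
    return chunk_state + state_passing + intra_chunk + inter_chunk

def oryx_flops(T, C, Dk, Dv, delta):
    """Expected FLOPs with a fraction `delta` of chunks in linear mode.

    Encodes only the three terms quoted in the excerpt and assumes, as the
    excerpt does, that routing selects chunks uniformly at random.
    """
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    state_update = 2 * T * Dk * Dv + 2 * T * Dk * Dv / C   # always paid
    linear_output = delta * (2 * T * C * Dk + 2 * T * C * Dv + 2 * T * Dk * Dv)
    attention_output = (1 - delta) * T * (T + 1) * (Dk + Dv)
    return state_update + linear_output + attention_output

# Illustrative sizes only (hypothetical, chosen for the demo).
T, C, Dk, Dv = 2048, 64, 64, 64
for delta in (0.0, 0.5, 0.9, 1.0):
    print(delta, oryx_flops(T, C, Dk, Dv, delta))
```

With `delta = 1.0` the expression reduces to the pure linear-mixer cost, and with `delta = 0.0` the state update is still paid on top of the attention term, which matches the "constant linear state update" item.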