Recognition: no theorem link
mHC-SSM: Manifold-Constrained Hyper-Connections for State Space Language Models with Stream-Specialized Adapters
Pith reviewed 2026-05-12 01:16 UTC · model grok-4.3
The pith
Manifold-constrained hyper-connections improve validation loss and perplexity in state space language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By expanding the residual stream into multiple parallel streams around an SSM block, applying simplex-constrained pre- and post-mixing, enforcing Sinkhorn-projected doubly stochastic matrices on the inter-stream residuals, and inserting stream-specialized adapters, the mHC-SSM architecture achieves validation loss of 6.2448 (static) and 6.1353 (with adapters) and perplexity of 515.35 and 461.88 respectively, compared with 6.3507 and 572.91 for the baseline single-stream SSM on WikiText-2.
What carries the argument
Manifold-Constrained Hyper-Connections (mHC) that project residual-stream mixing matrices onto the doubly stochastic manifold via Sinkhorn-Knopp iteration while routing streams through simplex-constrained aggregation and scattering around the SSM block.
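A minimal sketch of the load-bearing projection step, assuming a PyTorch setting; the function name `sinkhorn_project`, the iteration count, and the exponentiation of logits are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (assumed implementation): project an unconstrained
# (n_streams x n_streams) logit matrix toward the doubly stochastic manifold
# by Sinkhorn-Knopp iteration, i.e. alternating row/column normalization.
import torch

def sinkhorn_project(logits: torch.Tensor, n_iters: int = 10, eps: float = 1e-8) -> torch.Tensor:
    m = torch.exp(logits)  # strictly positive entries
    for _ in range(n_iters):
        m = m / (m.sum(dim=1, keepdim=True) + eps)  # rows sum to ~1
        m = m / (m.sum(dim=0, keepdim=True) + eps)  # columns sum to ~1
    return m

# Usage: after a few iterations both row and column sums are close to 1.
mix = sinkhorn_project(torch.randn(4, 4))
print(mix.sum(dim=1), mix.sum(dim=0))
```

Each normalization is differentiable, so mixing logits parameterized this way can be trained end to end while the projected matrix stays approximately doubly stochastic.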
If this is right
- Static mHC reduces validation loss from 6.3507 to 6.2448 and perplexity from 572.91 to 515.35.
- Adding stream-specialized adapters further lowers loss to 6.1353 and perplexity to 461.88.
- Throughput falls from 1025.52 tokens per second (baseline) to 964.81 (static mHC) and 938.90 (with adapters), while peak memory rises from 2365 MB to 2568 MB and 3092 MB.
Where Pith is reading between the lines
- The same Sinkhorn projection step could be inserted into other recurrent or state-space blocks to test whether doubly stochastic mixing stabilizes deeper or longer-sequence training.
- The shared-bottleneck adapter pattern offers a low-parameter way to increase per-stream expressivity that might transfer to mixture-of-experts variants of SSMs (a sketch of the pattern follows this list).
- If the constraint reduces variance in gradient flow, it may allow higher learning rates or fewer regularization terms without divergence.
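A hedged sketch of the shared-bottleneck adapter pattern mentioned in the second bullet: one down/up projection shared across streams, plus a learned per-stream scale. The module name, dimensions, activation, and zero-initialized scales are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StreamAdapter(nn.Module):
    """Shared bottleneck with per-stream scaling (illustrative, assumed shapes)."""
    def __init__(self, d_model: int, n_streams: int, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)   # shared across all streams
        self.up = nn.Linear(bottleneck, d_model)     # shared across all streams
        self.scale = nn.Parameter(torch.zeros(n_streams, 1, 1, 1))  # one gain per stream

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_streams, batch, seq_len, d_model)
        delta = self.up(F.gelu(self.down(x)))
        return x + self.scale * delta  # residual update, scaled independently per stream
```

Zero-initializing the per-stream scales would make the adapter start as an identity map; whether the paper uses that initialization is not stated.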
Load-bearing premise
The measured drops in loss and perplexity are caused by the manifold constraint and stream adapters rather than uncontrolled differences in code, seeds, or training schedule.
What would settle it
Re-running the three configurations multiple times with independent random seeds and verifying whether the loss and perplexity gaps remain stable across runs.
Original abstract
Manifold-Constrained Hyper-Connections (mHC) introduce a stability-motivated variant of multi-stream residual mixing by constraining residual stream mixing matrices to the manifold of doubly stochastic matrices via Sinkhorn-Knopp projection. In this work, we study whether mHC-style constrained multi-stream residual topology transfers effectively to state space model (SSM) language modeling. We implement a static mHC mechanism around an SSM block by expanding the residual stream into multiple parallel streams, aggregating streams into a single SSM input through simplex-constrained pre-mixing, scattering the SSM output back to streams through simplex-constrained post-mixing, and applying Sinkhorn-projected residual stream mixing at each layer. We further introduce stream-specialized adapters that add lightweight stream-specific capacity through a shared bottleneck with per-stream scaling, applied both before stream aggregation and after the SSM output prior to scattering. We evaluate a baseline single-stream SSM, a static mHC SSM, and an mHC SSM with adapters on WikiText-2 using identical training settings and report checkpoint-based validation loss, perplexity, throughput, and peak GPU memory. Under the reported fair checkpoint evaluation, static mHC improves validation loss from 6.3507 to 6.2448 and reduces perplexity from 572.91 to 515.35, while mHC with adapters further improves validation loss to 6.1353 and perplexity to 461.88. These gains are accompanied by modest throughput reductions from 1025.52 to 964.81 and 938.90 tokens per second, and increased peak memory from 2365 MB to 2568 MB and 3092 MB. The results suggest that mHC-inspired constrained multi-stream residual mixing can yield measurable quality improvements in SSM language models and that stream-specialized adapter capacity can further enhance performance with predictable efficiency tradeoffs.
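To make the pipeline in the abstract concrete, here is a minimal sketch of one layer's forward pass under assumed shapes and parameterizations (softmax for the simplex constraints, a Sinkhorn helper like the one sketched earlier for the residual mixing); the SSM block is left abstract and the authors' exact wiring may differ.

```python
import torch
import torch.nn as nn

def sinkhorn_project(logits, n_iters=10, eps=1e-8):
    # Same helper as in the earlier sketch: alternating row/column normalization.
    m = torch.exp(logits)
    for _ in range(n_iters):
        m = m / (m.sum(dim=1, keepdim=True) + eps)
        m = m / (m.sum(dim=0, keepdim=True) + eps)
    return m

class MHCSSMLayer(nn.Module):
    """One plausible arrangement of the described data flow (assumed, not the authors' code)."""
    def __init__(self, ssm_block: nn.Module, n_streams: int = 4):
        super().__init__()
        self.ssm = ssm_block
        self.pre_logits = nn.Parameter(torch.zeros(n_streams))             # pre-mixing weights
        self.post_logits = nn.Parameter(torch.zeros(n_streams))            # post-mixing weights
        self.mix_logits = nn.Parameter(torch.zeros(n_streams, n_streams))  # residual mixing logits

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (n_streams, batch, seq_len, d_model)
        pre = torch.softmax(self.pre_logits, dim=0)        # simplex-constrained pre-mixing
        post = torch.softmax(self.post_logits, dim=0)      # simplex-constrained post-mixing
        x = torch.einsum("s,sbld->bld", pre, streams)      # aggregate streams into one SSM input
        y = self.ssm(x)                                    # single SSM block
        scattered = torch.einsum("s,bld->sbld", post, y)   # scatter SSM output back to streams
        mix = sinkhorn_project(self.mix_logits)            # doubly stochastic residual mixing
        return torch.einsum("st,tbld->sbld", mix, streams) + scattered
```

In the adapter variant, the abstract places stream-specialized adapters before stream aggregation and on the SSM output prior to scattering; their exact form is not specified here.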
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes manifold-constrained hyper-connections (mHC) for state space models (SSMs) by expanding the residual stream into multiple parallel streams, applying simplex-constrained pre- and post-mixing around the SSM block, and projecting residual mixing matrices onto the doubly stochastic manifold via Sinkhorn-Knopp iterations. It further introduces stream-specialized adapters that inject lightweight per-stream capacity via a shared bottleneck with scaling factors. On WikiText-2 under identical training settings, the paper reports that static mHC improves validation loss from 6.3507 to 6.2448 and perplexity from 572.91 to 515.35, with adapters yielding further gains to 6.1353 and 461.88, accompanied by modest throughput reductions and increased peak memory.
Significance. If the empirical gains prove robust, the work establishes that stability-motivated manifold constraints on multi-stream residuals transfer to SSM language modeling and can be augmented by stream adapters, yielding measurable quality improvements with predictable efficiency trade-offs. The provision of concrete checkpoint-based metrics (loss, perplexity, tokens/s, GPU memory) supplies a direct, reproducible comparison point for follow-up work.
Major comments (1)
- [Experimental evaluation] Experimental evaluation (results table reporting the 6.3507/6.2448/6.1353 losses and corresponding perplexities): all comparisons rest on single training runs per configuration with no error bars, no multi-seed averages, and no statistical tests. In SSM training, loss differences of this magnitude commonly arise from random initialization, data order, or optimizer stochasticity even under fixed hyperparameters; without quantifying this variance, the reported deltas cannot be confidently attributed to the manifold constraint or adapters rather than uncontrolled experimental factors.
Minor comments (2)
- [Method / Experimental setup] The description of how baseline single-stream SSM capacity and hyperparameters were exactly matched to the mHC variants (e.g., parameter count, hidden dimension adjustments) is not detailed enough to allow independent reproduction of the 'identical training settings' claim.
- [Method] Notation for the pre-mixing and post-mixing matrices (simplex-constrained vs. Sinkhorn-projected) could be clarified with an explicit equation or diagram showing the data flow through aggregation, SSM, scattering, and residual mixing at each layer.
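One way to write the requested data flow explicitly, sketched here under assumed notation ($S$ streams $h^{(l)}_s$, simplex weights $\alpha,\beta$, mixing logits $\Theta^{(l)}$); the authors' notation and the exact placement of the residual sum may differ.

```latex
% Assumed notation, not the paper's own: h^{(l)}_s are the S residual streams at
% layer l, \alpha,\beta lie on the simplex, and Sinkhorn(.) denotes the
% Sinkhorn-Knopp projection of a positive matrix onto doubly stochastic matrices.
\begin{align}
  x^{(l)}     &= \sum_{s=1}^{S} \alpha_s\, h^{(l)}_s
              && \text{(simplex-constrained aggregation)} \\
  y^{(l)}     &= \mathrm{SSM}^{(l)}\!\bigl(x^{(l)}\bigr)
              && \text{(SSM block)} \\
  h^{(l+1)}_s &= \sum_{t=1}^{S} M^{(l)}_{st}\, h^{(l)}_t + \beta_s\, y^{(l)},
  \qquad M^{(l)} = \mathrm{Sinkhorn}\!\bigl(\exp(\Theta^{(l)})\bigr)
              && \text{(scatter + residual mixing)}
\end{align}
```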
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address the major concern regarding the experimental evaluation below and outline the changes we will make in the revised version.
Point-by-point responses
Referee: Experimental evaluation (results table reporting the 6.3507/6.2448/6.1353 losses and corresponding perplexities): all comparisons rest on single training runs per configuration with no error bars, no multi-seed averages, and no statistical tests. In SSM training, loss differences of this magnitude commonly arise from random initialization, data order, or optimizer stochasticity even under fixed hyperparameters; without quantifying this variance, the reported deltas cannot be confidently attributed to the manifold constraint or adapters rather than uncontrolled experimental factors.
Authors: We agree that single-run results constitute a limitation in the current evaluation, as the referee correctly notes that SSM training can exhibit variance from initialization and stochastic factors. In the revised manuscript we will rerun all three configurations (baseline SSM, static mHC, and mHC with adapters) using at least three independent random seeds under identical hyperparameters. We will report mean validation loss and perplexity together with standard deviations, and we will include statistical significance tests (paired t-tests) to quantify whether the observed improvements are robust. These additions will be presented in an updated results table and discussed in the experimental section.
Revision: yes
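A minimal sketch of the promised multi-seed reporting, assuming three seeds per configuration and SciPy's paired t-test; the loss values below are placeholders, not results from the paper.

```python
import numpy as np
from scipy import stats

# Placeholder per-seed validation losses (hypothetical numbers, one entry per seed).
baseline   = np.array([6.35, 6.36, 6.34])
mhc_static = np.array([6.25, 6.24, 6.26])

print(f"baseline:   {baseline.mean():.4f} ± {baseline.std(ddof=1):.4f}")
print(f"static mHC: {mhc_static.mean():.4f} ± {mhc_static.std(ddof=1):.4f}")

# Pair runs by seed, so each baseline run is compared with the mHC run sharing its seed.
t_stat, p_value = stats.ttest_rel(baseline, mhc_static)
print(f"paired t-test: t = {t_stat:.3f}, p = {p_value:.4f}")
```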
Circularity Check
No significant circularity; empirical results are direct measurements
Full rationale
The paper describes an architectural proposal (mHC via Sinkhorn projection on residual mixing matrices plus stream adapters) and reports direct empirical measurements of validation loss and perplexity on WikiText-2 under fixed training settings. No derivation chain exists that reduces the reported deltas (6.3507 → 6.2448 → 6.1353 loss; 572.91 → 515.35 → 461.88 perplexity) to quantities defined by the method itself or by self-citations. The Sinkhorn-Knopp step is a standard, externally defined projection; the adapters are explicitly parameterized additions. Central claims rest on observed checkpoint values rather than on predictions that merely restate fitted inputs or on self-definitional reductions.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1]–[5] Excerpts from the reviewed paper's own sections (Introduction, Methods, Results, Discussion, Conclusion).
- [6] He, K., Zhang, X., Ren, S. & Sun, J. Deep Residual Learning for Image Recognition. Preprint at https://doi.org/10.48550/arXiv.1512.03385 (2015).
- [7] He, K., Zhang, X., Ren, S. & Sun, J. Identity Mappings in Deep Residual Networks. Preprint at https://doi.org/10.48550/arXiv.1603.05027 (2016).
- [8] Vaswani, A. et al. Attention Is All You Need. Preprint at https://doi.org/10.48550/arXiv.1706.03762 (2023).
- [9] Zhu, D. et al. Hyper-Connections. Preprint at https://doi.org/10.48550/arXiv.2409.19606 (2025).
- [10] Xie, Z. et al. mHC: Manifold-Constrained Hyper-Connections. Preprint at https://doi.org/10.48550/arXiv.2512.24880 (2026).
- [11] Sinkhorn, R. & Knopp, P. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics 21, 343–348 (1967).
- [12] Farahat, H. K. The semigroup of doubly-stochastic matrices. Proceedings of the Glasgow Mathematical Association 7, 178–183 (1966).
- [13] Gu, A., Goel, K. & Ré, C. Efficiently Modeling Long Sequences with Structured State Spaces. Preprint at https://doi.org/10.48550/arXiv.2111.00396 (2022).
- [14] Gu, A. & Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. Preprint at https://doi.org/10.48550/arXiv.2312.00752 (2024).
- [15] Merity, S., Xiong, C., Bradbury, J. & Socher, R. Pointer Sentinel Mixture Models. Preprint at https://doi.org/10.48550/arXiv.1609.07843 (2016).
- [16] Huang, G., Sun, Y., Liu, Z., Sedra, D. & Weinberger, K. Deep Networks with Stochastic Depth. Preprint at https://doi.org/10.48550/arXiv.1603.09382 (2016).
- [17] Bachlechner, T., Majumder, B. P., Mao, H., Cottrell, G. & McAuley, J. ReZero is all you need: fast convergence at large depth. In Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence (eds de Campos, C. & Maathuis, M. H.) vol. 161, 1352–1361 (PMLR, 2021).
- [18] Fu, D. Y. et al. Hungry Hungry Hippos: Towards Language Modeling with State Space Models. Preprint at https://doi.org/10.48550/arXiv.2212.14052 (2023).
- [19] Gupta, A., Gu, A. & Berant, J. Diagonal State Spaces are as Effective as Structured State Spaces.
- [20] Houlsby, N., Giurgiu, A., Jastrzebski, S. & Morrone, B. Parameter-Efficient Transfer Learning for NLP.
- [21] Hu, E. J. et al. LoRA: Low-Rank Adaptation of Large Language Models. Preprint at https://doi.org/10.48550/arXiv.2106.09685 (2021).
- [22] Poli, M. et al. Hyena Hierarchy: Towards Larger Convolutional Language Models. Preprint at https://doi.org/10.48550/arXiv.2302.10866 (2023).
- [23] huggingface. transformers/src/transformers/models/gpt2/tokenization_gpt2.py at main · huggingface/transformers. GitHub. https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/tokenization_gpt2.py
- [24] Zhang, B. & Sennrich, R. Root Mean Square Layer Normalization. Preprint at https://doi.org/10.48550/arXiv.1910.07467 (2019).
- [25]
- [26] Automatic Mixed Precision — PyTorch Tutorials documentation. https://docs.pytorch.org/tutorials/recipes/recipes/amp_recipe.html
- [27] Loshchilov, I. & Hutter, F. Decoupled Weight Decay Regularization. Preprint at https://doi.org/10.48550/arXiv.1711.05101 (2019).
- [28] National Academies of Sciences, Engineering, and Medicine et al. Understanding Reproducibility and Replicability. In Reproducibility and Replicability in Science (National Academies Press (US), 2019).