mHC: Manifold-Constrained Hyper-Connections

Zhenda Xie, Yixuan Wei, Huanqi Cao, Chenggang Zhao, Chengqi Deng, Jiashi Li · 2025 · cs.CL · arXiv 2512.24880

42 Pith papers cite this work. Polarity classification is still indexing.

abstract

Recently, studies exemplified by Hyper-Connections (HC) have extended the ubiquitous residual connection paradigm established over the past decade by expanding the residual stream width and diversifying connectivity patterns. While yielding substantial performance gains, this diversification fundamentally compromises the identity mapping property intrinsic to the residual connection, which causes severe training instability and restricted scalability, and additionally incurs notable memory access overhead. To address these challenges, we propose Manifold-Constrained Hyper-Connections (mHC), a general framework that projects the residual connection space of HC onto a specific manifold to restore the identity mapping property, while incorporating rigorous infrastructure optimization to ensure efficiency. Empirical experiments demonstrate that mHC is effective for training at scale, offering tangible performance improvements and superior scalability. We anticipate that mHC, as a flexible and practical extension of HC, will contribute to a deeper understanding of topological architecture design and suggest promising directions for the evolution of foundational models.
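
As a rough illustration of the projection idea in the abstract, the Python sketch below assumes the manifold is the set of doubly stochastic matrices (the Birkhoff polytope). The abstract does not state this; the assumption is borrowed from citing-paper summaries on this page such as TBP-mHC. The function names, the Sinkhorn-style normalization, the stream count, and the depth are all hypothetical choices for illustration, not the authors' implementation. The sketch checks that a product of many such mixing matrices keeps every row and column sum near 1, so the mean signal across residual streams is neither amplified nor attenuated with depth.

```python
import math
import random


def sinkhorn(logits, iters=100):
    """Approximately project exp(logits) onto doubly stochastic matrices
    by alternating row and column normalization (Sinkhorn-style).
    Assumption: this stands in for the unspecified manifold projection."""
    n = len(logits)
    m = [[math.exp(v) for v in row] for row in logits]
    for _ in range(iters):
        # Row step: scale each row so it sums to 1.
        row_sums = [sum(row) for row in m]
        m = [[m[i][j] / row_sums[i] for j in range(n)] for i in range(n)]
        # Column step: scale each column so it sums to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


def matmul(a, b):
    """Plain square matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def max_sum_error(m):
    """Largest deviation of any row or column sum from 1."""
    n = len(m)
    rows = [abs(sum(row) - 1.0) for row in m]
    cols = [abs(sum(m[i][j] for i in range(n)) - 1.0) for j in range(n)]
    return max(rows + cols)


random.seed(0)
n_streams, depth = 4, 32  # hypothetical sizes, chosen only for the demo

# Start from the identity, which is itself doubly stochastic.
product = [[float(i == j) for j in range(n_streams)] for i in range(n_streams)]
for _ in range(depth):
    logits = [[random.gauss(0.0, 1.0) for _ in range(n_streams)]
              for _ in range(n_streams)]
    # Compose one constrained stream-mixing matrix per layer.
    product = matmul(sinkhorn(logits), product)

product_error = max_sum_error(product)
print(f"max row/column sum error after {depth} layers: {product_error:.2e}")
```

Under this assumption, an ordinary residual connection corresponds to the identity matrix, which is one point of the constrained set, and the set is closed under matrix multiplication, which is why the composed product stays well behaved.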

citation-role summary

background 4

citation-polarity summary

background 3 · unclear 1

representative citing papers

A First-Order Mean Field Control Analysis of Transformer Layers under Cross-Entropy Training

math.OC · 2026-06-22 · unverdicted · novelty 7.0

Transformer residual layers are approximated as an explicit Euler scheme for a controlled hidden-state flow whose mean-field limit is a first-order transport control problem with Pontryagin terminal condition given by the softmax residual.

DisPOSE: Projected Polystochastic Diffusion for Self-Supervised Multi-View 3D Human Pose Estimation

cs.CV · 2026-06-05 · unverdicted · novelty 7.0

DisPOSE approximates the multi-view person-assignment problem as a generative diffusion process over polystochastic tensors using differentiable Sinkhorn projections and a hypergraph decoder for self-supervised 3D pose estimation.

Depth-Attention: Cross-Layer Value Mixing for Language Models

cs.CL · 2026-06-03 · unverdicted · novelty 7.0

Depth-Attention mixes values from earlier layers into the current attention value by having the query attend to previous-layer keys at the same position, yielding lower perplexity and up to 2.3 points higher average accuracy than vanilla transformers on Qwen3-style models with negligible extra FLOPs.

TBP-mHC: full expressivity for manifold-constrained hyper connections through transportation polytopes

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

TBP-mHC proposes parameterizations of the Birkhoff polytope via transportation polytopes that achieve exact double stochasticity for hyper-connections using only (n-1)^2 degrees of freedom.

Delta Attention Residuals

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

Delta Attention Residuals attend over per-sublayer deltas instead of cumulative hidden states, producing higher-contrast attention weights and 1.7-8.2% validation perplexity gains over standard and attention residuals across 220M-7.6B models.

Efficient and provably convergent end-to-end training of deep neural networks with linear constraints

math.OC · 2026-05-12 · unverdicted · novelty 7.0

An efficiently computable HS-Jacobian acts as a conservative mapping for projections onto polyhedral sets, supporting provably convergent Adam-based end-to-end training of linearly constrained deep neural networks.

FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connections for better information flow.

Transformers with Selective Access to Early Representations

cs.LG · 2026-05-05 · unverdicted · novelty 7.0 · 2 refs

SATFormer uses a context-dependent gate for selective reuse of early Transformer representations, improving validation loss and zero-shot accuracy especially on retrieval benchmarks.

Can an MLP Absorb Its Own Skip Connection?

cs.LG · 2026-04-26 · accept · novelty 7.0

Skip-connected MLPs and residual-free MLPs of equal width represent generically disjoint function classes for common activations, with explicit impossibility proofs and a non-generic absorption condition for ReLU and GELU.

Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning

cs.LG · 2026-04-24 · unverdicted · novelty 7.0

A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and commonsense QA benchmarks.

LoopCTR: Unlocking the Loop Scaling Power for Click-Through Rate Prediction

cs.IR · 2026-04-21 · unverdicted · novelty 7.0

LoopCTR trains CTR models with recursive layer reuse and process supervision so that zero-loop inference outperforms baselines on public and industrial datasets.

Deep Delta Learning

cs.LG · 2026-01-01 · unverdicted · novelty 7.0

Deep Delta Learning replaces additive residual updates with a gated delta-rule that selectively overwrites residual content along learned directions, improving language modeling quality over standard ResNet-style accumulation.

Mechanism-Driven Monitors for Preemptive Detection of LLM Training Instability

cs.CL · 2026-06-26 · unverdicted · novelty 6.0

Derives mechanism-based monitors from module functional roles and validates them via fault-injection experiments showing early detection of LLM training instability.

Variable-Width Transformers

cs.CL · 2026-06-16 · conditional · novelty 6.0

×-shaped variable-width transformers outperform parameter-matched uniform baselines on language modeling loss with 22% fewer FLOPs and 15% smaller KV cache.

DeRes: Decoupling Residual Stability and Adaptivity for Scalable CTR Prediction

cs.IR · 2026-06-06 · unverdicted · novelty 6.0

DeRes decouples residual stability and adaptivity via identity and block-attention paths with SiLU pointwise attention, delivering up to 0.32% AUC gains and steeper scaling laws on industrial and public CTR datasets.

Mitigating the Curse of Dimensionality in Uniform Convergence of Deep Neural Networks via Smooth Activations

cs.LG · 2026-06-04 · unverdicted · novelty 6.0

Smoothly activated DNNs (feedforward and residual) achieve non-asymptotic uniform convergence rates that mitigate the curse of dimensionality by adaptively using hierarchical composition structure of the target function.

Rethinking Cross-Layer Information Routing in Diffusion Transformers

cs.CV · 2026-05-20 · unverdicted · novelty 6.0 · 2 refs

DAR replaces residual addition in DiTs with learnable, timestep-adaptive aggregation of sublayer outputs, yielding 2.11 FID improvement on SiT-XL/2 and 8.75x faster convergence on ImageNet 256x256.

SNLP: Layer-Parallel Inference via Structured Newton Corrections

cs.LG · 2026-05-18 · unverdicted · novelty 6.0 · 2 refs

SNLP achieves up to 2.58x wall-clock speedup on 0.5B Transformers via architecture-specific Newton corrections (IDN/HCN) that enable layer-parallel inference while preserving perplexity in milder settings.

AOT-POT: Adaptive Operator Transformation for Large-Scale PDE Pre-training

cs.LG · 2026-05-15 · unverdicted · novelty 6.0

AOT-POT adaptively reshapes complex PDE solution operators via input-dependent transformations and parallel stream mixing to enable effective large-scale pre-training, yielding SOTA results on 12 benchmarks with minimal added parameters.

Optimistic Dual Averaging Unifies Modern Optimizers

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

SODA unifies several modern optimizers under optimistic dual averaging and supplies a 1/k decay wrapper that improves performance without weight decay tuning.

The E$\Delta$-MHC-Geo Transformer: Adaptive Geodesic Operations with Guaranteed Orthogonality

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

The EΔ-MHC-Geo Transformer achieves input-adaptive unconditionally orthogonal residual connections via a Cayley-based rotation that works for all parameters, combined with a learned hybrid gate for reflections.

Graph Normalization: Fast Binarizing Dynamics for Differentiable MWIS

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

Graph Normalization is a convergent dynamical system that approximates MWIS by always reaching a binary maximum independent set via majorization-minimization and evolutionary game equivalence.

HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

cs.AI · 2026-04-22 · unverdicted · novelty 6.0

HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

Beyond the Laplacian: Doubly Stochastic Matrices for Graph Neural Networks

cs.LG · 2026-04-16 · unverdicted · novelty 6.0

DsmNet substitutes Laplacian matrices with approximated doubly stochastic matrices in GNNs, using Neumann truncation and residual mass compensation to achieve O(K|E|) efficiency and bound Dirichlet energy decay for reduced over-smoothing.

citing papers explorer

Showing 27 of 27 citing papers after filters.

TBP-mHC: full expressivity for manifold-constrained hyper connections through transportation polytopes · cs.LG · 2026-05-20 · unverdicted · none · ref 21 · internal anchor
TBP-mHC proposes parameterizations of the Birkhoff polytope via transportation polytopes that achieve exact double stochasticity for hyper-connections using only (n-1)^2 degrees of freedom.
Delta Attention Residuals · cs.LG · 2026-05-13 · unverdicted · none · ref 31 · internal anchor
Delta Attention Residuals attend over per-sublayer deltas instead of cumulative hidden states, producing higher-contrast attention weights and 1.7-8.2% validation perplexity gains over standard and attention residuals across 220M-7.6B models.
FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning · cs.LG · 2026-05-06 · unverdicted · none · ref 55 · internal anchor
FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connections for better information flow.
Transformers with Selective Access to Early Representations · cs.LG · 2026-05-05 · unverdicted · none · ref 17 · 2 links · internal anchor
SATFormer uses a context-dependent gate for selective reuse of early Transformer representations, improving validation loss and zero-shot accuracy especially on retrieval benchmarks.
Can an MLP Absorb Its Own Skip Connection? · cs.LG · 2026-04-26 · accept · none · ref 11 · internal anchor
Skip-connected MLPs and residual-free MLPs of equal width represent generically disjoint function classes for common activations, with explicit impossibility proofs and a non-generic absorption condition for ReLU and GELU.
Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning · cs.LG · 2026-04-24 · unverdicted · none · ref 31 · internal anchor
A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and commonsense QA benchmarks.
Deep Delta Learning · cs.LG · 2026-01-01 · unverdicted · none · ref 14 · internal anchor
Deep Delta Learning replaces additive residual updates with a gated delta-rule that selectively overwrites residual content along learned directions, improving language modeling quality over standard ResNet-style accumulation.
Mitigating the Curse of Dimensionality in Uniform Convergence of Deep Neural Networks via Smooth Activations · cs.LG · 2026-06-04 · unverdicted · none · ref 38 · internal anchor
Smoothly activated DNNs (feedforward and residual) achieve non-asymptotic uniform convergence rates that mitigate the curse of dimensionality by adaptively using hierarchical composition structure of the target function.
SNLP: Layer-Parallel Inference via Structured Newton Corrections · cs.LG · 2026-05-18 · unverdicted · none · ref 44 · 2 links · internal anchor
SNLP achieves up to 2.58x wall-clock speedup on 0.5B Transformers via architecture-specific Newton corrections (IDN/HCN) that enable layer-parallel inference while preserving perplexity in milder settings.
AOT-POT: Adaptive Operator Transformation for Large-Scale PDE Pre-training · cs.LG · 2026-05-15 · unverdicted · none · ref 44 · internal anchor
AOT-POT adaptively reshapes complex PDE solution operators via input-dependent transformations and parallel stream mixing to enable effective large-scale pre-training, yielding SOTA results on 12 benchmarks with minimal added parameters.
Optimistic Dual Averaging Unifies Modern Optimizers · cs.LG · 2026-05-11 · unverdicted · none · ref 18 · internal anchor
SODA unifies several modern optimizers under optimistic dual averaging and supplies a 1/k decay wrapper that improves performance without weight decay tuning.
The E$\Delta$-MHC-Geo Transformer: Adaptive Geodesic Operations with Guaranteed Orthogonality · cs.LG · 2026-05-07 · unverdicted · none · ref 23 · internal anchor
The EΔ-MHC-Geo Transformer achieves input-adaptive unconditionally orthogonal residual connections via a Cayley-based rotation that works for all parameters, combined with a learned hybrid gate for reflections.
Graph Normalization: Fast Binarizing Dynamics for Differentiable MWIS · cs.LG · 2026-05-06 · unverdicted · none · ref 53 · internal anchor
Graph Normalization is a convergent dynamical system that approximates MWIS by always reaching a binary maximum independent set via majorization-minimization and evolutionary game equivalence.
Beyond the Laplacian: Doubly Stochastic Matrices for Graph Neural Networks · cs.LG · 2026-04-16 · unverdicted · none · ref 18 · internal anchor
DsmNet substitutes Laplacian matrices with approximated doubly stochastic matrices in GNNs, using Neumann truncation and residual mass compensation to achieve O(K|E|) efficiency and bound Dirichlet energy decay for reduced over-smoothing.
ResBM: Residual Bottleneck Models for Low-Bandwidth Pipeline Parallelism · cs.LG · 2026-04-13 · unverdicted · none · ref 10 · internal anchor
ResBM achieves 128x activation compression in pipeline-parallel transformer training by adding a residual bottleneck module that preserves a low-rank identity path, with no major loss in convergence or added overhead.
SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm · cs.LG · 2026-02-08 · unverdicted · none · ref 25 · internal anchor
SiameseNorm is a two-stream architecture that reconciles Pre-Norm and Post-Norm in Transformers by coupling streams via shared residual blocks, yielding performance gains with maintained stability on language, vision, and diffusion models.
The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning · cs.LG · 2026-01-25 · unverdicted · none · ref 23 · internal anchor
TGR performs manifold-informed latent foresight search to boost trajectory coverage in long-context reasoning tasks by up to 13 AUC points with minimal overhead.
PC Layer: Polynomial Weight Preconditioning for Improving LLM Pre-Training · cs.LG · 2026-06-04 · unverdicted · none · ref 106 · internal anchor
A polynomial preconditioning layer controls singular value spectra of transformer weights to stabilize pre-training, shown effective on Llama-1B and supported by convergence theory for deep linear networks.
S$^3$GNN: Efficient Global Mixing and Local Message Passing for Long-Range Graph Learning · cs.LG · 2026-05-22 · unverdicted · none · ref 17 · internal anchor
S³GNN mitigates oversquashing in message-passing networks via lightweight global mixing without strong prior assumptions, yielding up to 10x error reduction and 50% fewer parameters across multiple domains.
Exact Linear Attention · cs.LG · 2026-05-13 · unverdicted · none · ref 8 · 3 links · internal anchor
Exact Linear Attention uses kernel decomposition for exact linear-complexity attention in Transformers, with proposed kernels addressing gradient and dilution issues plus new modules for memory and MoE.
mHC-SSM: Manifold-Constrained Hyper-Connections for State Space Language Models with Stream-Specialized Adapters · cs.LG · 2026-05-08 · unverdicted · none · ref 10 · internal anchor
Manifold-constrained multi-stream mixing plus per-stream adapters improves SSM language model validation loss from 6.3507 to 6.1353 and perplexity from 572.91 to 461.88 on WikiText-2.
Cubit: Token Mixer with Kernel Ridge Regression · cs.LG · 2026-05-07 · unverdicted · none · ref 89 · 2 links · internal anchor
Cubit replaces Transformer's attention with a closed-form Kernel Ridge Regression token mixer and reports larger gains as training sequence length increases.
Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling · cs.LG · 2026-04-21 · unverdicted · none · ref 23 · internal anchor
Nexusformer uses a three-stage nonlinear mapping in attention to enable stable, inheritable scaling of transformers, matching baseline perplexity with up to 41.5% less compute when growing from 240M to 440M parameters.
Reversible Foundations: Training a 120B Sparse MoE through State-Preserving Scaling · cs.LG · 2026-06-05 · unverdicted · none · ref 68 · internal anchor
A 120B sparse MoE model with 460 experts was trained on one 8-GPU node to loss 1.78 using reversible recurrence and state-preserving scaling from a 1.78B dense seed, with 5.93B active parameters.
Multi-Gate Residuals · cs.LG · 2026-05-22 · unverdicted · none · ref 12 · internal anchor
Multi-Gate Residuals stabilizes activation scales in deep residual networks via multi-stream gating and attention pooling without added communication overhead.
Hyperloop Transformers · cs.LG · 2026-04-23 · unreviewed · ref 27 · internal anchor
Beyond Linearity in Attention Projections: The Case for Nonlinear Queries · cs.LG · 2026-03-11 · unreviewed · ref 26 · internal anchor
