Hyperloop Transformers
5 Pith papers cite this work.
Abstract
LLM architecture research generally aims to maximize model quality subject to fixed compute/latency budgets. However, many applications of interest, such as edge and on-device deployment, are further constrained by the model's memory footprint, which motivates parameter-efficient architectures for language modeling. This paper describes a simple architecture that improves the parameter efficiency of LLMs. Our architecture uses looped Transformers as a core primitive: they reuse Transformer layers across depth and are therefore more parameter-efficient than ordinary (depth-matched) Transformers. We organize the looped Transformer into three blocks--begin, middle, and end--where each block consists of multiple Transformer layers and only the middle block is applied recurrently across depth. We augment the looped middle block with hyper-connections (Xie et al., 2026), which expand the residual stream into matrix-valued residual streams. Hyper-connections are applied only after each loop and therefore add minimal new parameters and compute. Across various model scales, we find that our Hyper-Connected Looped Transformer (Hyperloop Transformer) outperforms depth-matched Transformer and mHC Transformer baselines despite using approximately 50% fewer parameters. The outperformance persists through post-training weight quantization, positioning Hyperloop Transformers as an attractive architecture for memory-efficient language modeling.
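For concreteness, here is a minimal PyTorch sketch of the begin/middle/end block structure and the per-loop hyper-connection step described above. The layer counts, the number of residual streams, the use of nn.TransformerEncoderLayer as the layer primitive, and the HyperConnection parameterization are all illustrative assumptions made for this sketch, not the paper's actual configuration.

```python
# Minimal sketch of the Hyperloop Transformer structure described in the abstract.
# Assumptions: layer counts, number of residual streams, the nn.TransformerEncoderLayer
# primitive, and the HyperConnection parameterization below are illustrative only.
import torch
import torch.nn as nn


class HyperConnection(nn.Module):
    """Keeps n parallel residual streams and mixes them once per loop.

    Simplified stand-in for hyper-connections (matrix-valued residual streams
    mixed by small learned weights); the paper's exact form may differ.
    """

    def __init__(self, n_streams: int):
        super().__init__()
        self.stream_mix = nn.Parameter(torch.eye(n_streams))                 # stream-to-stream mixing
        self.read = nn.Parameter(torch.full((n_streams,), 1.0 / n_streams))  # streams -> block input
        write = torch.zeros(n_streams)
        write[0] = 1.0
        self.write = nn.Parameter(write)                                     # block output -> streams

    def read_input(self, streams: torch.Tensor) -> torch.Tensor:
        # (n, batch, seq, d_model) -> (batch, seq, d_model): weighted sum over streams.
        return torch.einsum("n,nbsd->bsd", self.read, streams)

    def write_output(self, streams: torch.Tensor, block_out: torch.Tensor) -> torch.Tensor:
        # Mix the existing streams, then add the middle-block output back into each stream.
        mixed = torch.einsum("mn,nbsd->mbsd", self.stream_mix, streams)
        return mixed + self.write.view(-1, 1, 1, 1) * block_out.unsqueeze(0)


class HyperloopTransformer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_streams=4,
                 begin_layers=2, middle_layers=4, end_layers=2, n_loops=4):
        super().__init__()

        def stack(k):
            return nn.ModuleList(
                nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
                for _ in range(k)
            )

        self.begin = stack(begin_layers)    # applied once
        self.middle = stack(middle_layers)  # reused n_loops times (weights shared across depth)
        self.end = stack(end_layers)        # applied once
        self.hc = HyperConnection(n_streams)
        self.n_streams = n_streams
        self.n_loops = n_loops

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Begin block: ordinary, non-looped layers.
        for layer in self.begin:
            x = layer(x)

        # Expand the single residual stream into n matrix-valued streams.
        streams = x.unsqueeze(0).expand(self.n_streams, -1, -1, -1).contiguous()

        # Middle block: the same layers applied recurrently; the hyper-connection
        # runs only once per loop, adding little parameter/compute overhead.
        for _ in range(self.n_loops):
            h = self.hc.read_input(streams)
            for layer in self.middle:
                h = layer(h)
            streams = self.hc.write_output(streams, h)

        # Collapse the streams back into one residual stream for the end block.
        x = streams.mean(dim=0)
        for layer in self.end:
            x = layer(x)
        return x
```

With these illustrative defaults, a forward pass such as HyperloopTransformer()(torch.randn(2, 16, 512)) uses 2 + 4 + 2 = 8 unique layers but performs 2 + 4*4 + 2 = 20 layer applications, which is where the parameter saving relative to a depth-matched Transformer comes from; the hyper-connection step runs only once per loop, here 4 times in total.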
Years: 2026 (5)
Verdicts: UNVERDICTED (5)
Citing papers explorer
- SMolLM: Small Language Models Learn Small Molecular Grammar
  A 53K-parameter model generates 95% valid SMILES on ZINC-250K, outperforming larger models, by resolving chemical constraints in fixed order: brackets first, rings second, valence last.
- How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models
  A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.
- Solve the Loop: Attractor Models for Language and Reasoning
  Attractor Models solve for fixed points in transformer embeddings using implicit differentiation to enable stable iterative refinement, delivering better perplexity, accuracy, and efficiency than standard or looped transformers.
- Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models
  MELT decouples reasoning depth from memory in looped LLMs by sharing a single gated KV cache per layer and using two-phase chunk-wise distillation from Ouro, delivering constant memory use while matching or beating standard LLM performance.
- bViT: Investigating Single-Block Recurrence in Vision Transformers for Image Recognition
  A 12-step single-block recurrent ViT-B reaches accuracy comparable to a standard ViT-B on ImageNet-1K while using an order of magnitude fewer parameters.