Hyperloop Transformers

· 2026 · cs.LG · arXiv 2604.21254

10 Pith papers cite this work. Polarity classification is still indexing.

10 Pith papers citing it

open full Pith review browse 10 citing papers arXiv PDF

abstract

LLM architecture research generally aims to maximize model quality subject to fixed compute/latency budgets. However, many applications of interest such as edge and on-device deployment are further constrained by the model's memory footprint, thus motivating parameter-efficient architectures for language modeling. This paper describes a simple architecture that improves the parameter-efficiency of LLMs. Our architecture makes use of looped Transformers as a core primitive, which reuse Transformer layers across depth and are thus more parameter-efficient than ordinary (depth-matched) Transformers. We organize the looped Transformer into three blocks--begin, middle, and end blocks--where each block itself consists of multiple Transformer layers, and only the middle block is applied recurrently across depth. We augment the looped middle block with hyper-connections (Xie et al., 2026), which expand the residual stream into matrix-valued residual streams. Hyper-connections are applied only after each loop, and therefore add minimal new parameters and compute cost. Across various model scales, we find that our Hyper-Connected Looped Transformer (Hyperloop Transformer) is able to outperform depth-matched Transformer and mHC Transformer baselines despite using approximately 50% fewer parameters. The outperformance persists through post-training weight quantization, thus positioning Hyperloop Transformers as an attractive architecture for memory-efficient language modeling.

citation-role summary

background 2 other 1

citation-polarity summary

background 1 support 1 unclear 1

representative citing papers

LoopMoE: Unifying Iterative Computation with Mixture-of-Experts for Language Modeling

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

LoopMoE is a looped MoE language model that outperforms matched vanilla MoE on 8 of 9 downstream benchmarks at 3B scale and continues to outperform at 9B scale under strictly controlled budgets.

LoopQ: Quantization for Recursive Transformers

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

LoopQ provides a loop-aware PTQ framework for recursive Transformers that mitigates distribution shift, state reuse, and recursive error accumulation, yielding 68.8% higher average accuracy and 87.7% lower perplexity under W4A4 versus static baselines.

How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.

Looped World Models

cs.LG · 2026-06-16 · unverdicted · novelty 6.0

Introduces looped transformer architectures for world models that iteratively refine latent states to achieve up to 100x parameter efficiency via adaptive computation depth.

SNLP: Layer-Parallel Inference via Structured Newton Corrections

cs.LG · 2026-05-18 · unverdicted · novelty 6.0 · 2 refs

SNLP achieves up to 2.58x wall-clock speedup on 0.5B Transformers via architecture-specific Newton corrections (IDN/HCN) that enable layer-parallel inference while preserving perplexity in milder settings.

Solve the Loop: Attractor Models for Language and Reasoning

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Attractor Models solve for fixed points in transformer embeddings using implicit differentiation to enable stable iterative refinement, delivering better perplexity, accuracy, and efficiency than standard or looped transformers.

Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models

cs.CL · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

MELT decouples reasoning depth from memory in looped language models by sharing a single gated KV cache per layer and training it via chunk-wise distillation from Ouro starting models.

bViT: Investigating Single-Block Recurrence in Vision Transformers for Image Recognition

cs.CV · 2026-05-11 · unverdicted · novelty 5.0

A 12-step single-block recurrent ViT-B reaches accuracy comparable to a standard ViT-B on ImageNet-1K while using an order of magnitude fewer parameters.

SMolLM: Small Language Models Learn Small Molecular Grammar

cs.LG · 2026-05-07 · unverdicted · novelty 5.0

A 53K-parameter weight-shared transformer generates novel valid SMILES at 95% rate on ZINC-250K and resolves constraints hierarchically via bracket, ring, and valence stages as shown by probing and ablation.

CART: Context-Anchored Recurrent Transformer -- A Parameter-Efficient Architecture with Learned Stability

cs.LG · 2026-05-31 · unverdicted · novelty 4.0

CART is a recurrent transformer with shared core, frozen prelude KV tensors, and LTI stability gate that fails to beat dense baselines at parameter parity across tested widths.

citing papers explorer

Showing 2 of 2 citing papers after filters.

How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models cs.LG · 2026-04-22 · unverdicted · none · ref 15 · internal anchor
A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.
Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models cs.CL · 2026-05-08 · unverdicted · none · ref 15 · 2 links · internal anchor
MELT decouples reasoning depth from memory in looped language models by sharing a single gated KV cache per layer and training it via chunk-wise distillation from Ouro starting models.

Hyperloop Transformers

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer