LoopMoE is a looped MoE language model that outperforms matched vanilla MoE on 8 of 9 downstream benchmarks at 3B scale and continues to outperform at 9B scale under strictly controlled budgets.
Hyperloop Transformers
10 Pith papers cite this work. Polarity classification is still indexing.
abstract
LLM architecture research generally aims to maximize model quality subject to fixed compute/latency budgets. However, many applications of interest such as edge and on-device deployment are further constrained by the model's memory footprint, thus motivating parameter-efficient architectures for language modeling. This paper describes a simple architecture that improves the parameter-efficiency of LLMs. Our architecture makes use of looped Transformers as a core primitive, which reuse Transformer layers across depth and are thus more parameter-efficient than ordinary (depth-matched) Transformers. We organize the looped Transformer into three blocks--begin, middle, and end blocks--where each block itself consists of multiple Transformer layers, and only the middle block is applied recurrently across depth. We augment the looped middle block with hyper-connections (Xie et al., 2026), which expand the residual stream into matrix-valued residual streams. Hyper-connections are applied only after each loop, and therefore add minimal new parameters and compute cost. Across various model scales, we find that our Hyper-Connected Looped Transformer (Hyperloop Transformer) is able to outperform depth-matched Transformer and mHC Transformer baselines despite using approximately 50% fewer parameters. The outperformance persists through post-training weight quantization, thus positioning Hyperloop Transformers as an attractive architecture for memory-efficient language modeling.
citation-role summary
citation-polarity summary
years
2026 10verdicts
UNVERDICTED 10representative citing papers
LoopQ provides a loop-aware PTQ framework for recursive Transformers that mitigates distribution shift, state reuse, and recursive error accumulation, yielding 68.8% higher average accuracy and 87.7% lower perplexity under W4A4 versus static baselines.
A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.
Introduces looped transformer architectures for world models that iteratively refine latent states to achieve up to 100x parameter efficiency via adaptive computation depth.
SNLP achieves up to 2.58x wall-clock speedup on 0.5B Transformers via architecture-specific Newton corrections (IDN/HCN) that enable layer-parallel inference while preserving perplexity in milder settings.
Attractor Models solve for fixed points in transformer embeddings using implicit differentiation to enable stable iterative refinement, delivering better perplexity, accuracy, and efficiency than standard or looped transformers.
MELT decouples reasoning depth from memory in looped language models by sharing a single gated KV cache per layer and training it via chunk-wise distillation from Ouro starting models.
A 12-step single-block recurrent ViT-B reaches accuracy comparable to a standard ViT-B on ImageNet-1K while using an order of magnitude fewer parameters.
A 53K-parameter weight-shared transformer generates novel valid SMILES at 95% rate on ZINC-250K and resolves constraints hierarchically via bracket, ring, and valence stages as shown by probing and ablation.
CART is a recurrent transformer with shared core, frozen prelude KV tensors, and LTI stability gate that fails to beat dense baselines at parameter parity across tested widths.
citing papers explorer
-
How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models
A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.
-
Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models
MELT decouples reasoning depth from memory in looped language models by sharing a single gated KV cache per layer and training it via chunk-wise distillation from Ouro starting models.