arxiv: 2604.21254 · v2 · submitted 2026-04-23 · 💻 cs.LG · cs.CL

Hyperloop Transformers

Abbas Zeitoun, Lucas Torroba-Hennigen, Yoon Kim

Pith reviewed 2026-05-09 23:05 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords parameter-efficient transformers · looped transformers · hyper-connections · language modeling · model quantization · edge deployment · residual streams

The pith

A looped Transformer with selective hyper-connections outperforms standard models at matched depth while using about 50 percent fewer parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a parameter-efficient architecture for language models that builds on looped Transformers, which reuse the same layers across multiple depths instead of adding new ones. It structures the loop into a begin block, a middle block that repeats, and an end block, then adds hyper-connections only after each middle-block iteration to expand the residual stream into matrices. This selective reuse and augmentation lets the model exceed the quality of both ordinary Transformers and other hyper-connected baselines across tested scales. The quality edge survives post-training weight quantization, which directly addresses the memory limits of edge and on-device deployment.

Core claim

By organizing a looped Transformer into begin, middle, and end blocks and applying hyper-connections only after each loop of the middle block, the resulting Hyperloop Transformer achieves higher language-modeling performance than depth-matched standard Transformers and mHC Transformers while using approximately 50 percent fewer parameters; the advantage remains after post-training quantization.

What carries the argument

The begin-middle-end block organization of the looped Transformer, with hyper-connections applied only after each middle-block loop to create matrix-valued residual streams.
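
The control flow described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the names `run_block`, `read_streams`, `hyper_connect`, `read_w`, `write_w`, and `mix` are hypothetical, the "layers" are arbitrary callables, and fixed scalar weights stand in for the input-dependent projections, scalars, and biases listed in the Figure 1 caption. The final readout of the streams is also an assumption.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def run_block(layers: List[Layer], x: Vector) -> Vector:
    """Apply a stack of layers in order."""
    for layer in layers:
        x = layer(x)
    return x


def read_streams(streams: List[Vector], read_w: List[float]) -> Vector:
    """Collapse the parallel streams into one vector (weighted sum)."""
    return [sum(w * s[k] for w, s in zip(read_w, streams))
            for k in range(len(streams[0]))]


def hyper_connect(streams, block_out, mix, write_w):
    """new_stream[i] = sum_j mix[i][j] * streams[j] + write_w[i] * block_out."""
    n, d = len(streams), len(block_out)
    return [
        [sum(mix[i][j] * streams[j][k] for j in range(n)) + write_w[i] * block_out[k]
         for k in range(d)]
        for i in range(n)
    ]


def hyperloop_forward(x, begin, middle, end, num_loops, read_w, write_w, mix):
    x = run_block(begin, x)                      # begin block runs once
    streams = [list(x) for _ in read_w]          # expand into parallel streams
    for _ in range(num_loops):
        h = read_streams(streams, read_w)        # input to the shared block
        out = run_block(middle, h)               # same layers on every loop
        streams = hyper_connect(streams, out, mix, write_w)  # once per loop
    x = read_streams(streams, read_w)            # collapse (assumed readout)
    return run_block(end, x)                     # end block runs once
```

The point of the sketch is placement: `hyper_connect` is called once per loop iteration rather than once per layer, so its cost does not grow with the number of layers inside the middle block.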

If this is right

Hyperloop Transformers can be used in place of standard Transformers when memory footprint is the binding constraint.
The architecture supports post-training quantization without losing its relative advantage.
Parameter count can be halved relative to depth-matched baselines while preserving or improving quality on the tested language-modeling tasks.
The design positions the model as suitable for memory-efficient language modeling on edge devices.
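
To see where the parameter saving comes from, it helps to separate the number of unique layers (which sets the parameter count) from the effective depth (which sets the compute). The helper below is a minimal sketch; the function name and the block sizes in the usage note are hypothetical, and layer counts only approximate parameter counts because embeddings and other non-layer parameters are not reduced by looping.

```python
from typing import Tuple


def layer_counts(begin: int, middle: int, end: int, loops: int) -> Tuple[int, int]:
    """Return (unique_layers, effective_depth) for a begin/middle/end looped model."""
    unique = begin + middle + end            # layers that own parameters
    effective = begin + middle * loops + end  # layers actually executed
    return unique, effective
```

For example, `layer_counts(2, 4, 2, 3)` gives 8 unique layers executed as an effective depth of 16, so a depth-matched non-looped model would need twice as many layers.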

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the pattern scales, the same looping-plus-selective-connection approach could let practitioners fit larger effective models inside fixed on-device memory budgets.
The selective reuse of only the middle block might combine with other compression methods such as pruning or distillation to multiply efficiency gains.
The same block organization could be tested on non-language sequence tasks where depth is currently limited by memory rather than compute.

Load-bearing premise

The specific placement of loops and hyper-connections in begin-middle-end blocks produces the observed gains without needing changes to training procedures or hyperparameters at new scales or tasks.

What would settle it

Training a Hyperloop Transformer and a depth-matched standard Transformer under identical conditions, on a held-out task or at a larger scale, and finding that the Hyperloop model (at roughly half the parameters) no longer shows lower perplexity or higher accuracy.

Figures

Figures reproduced from arXiv: 2604.21254 by Abbas Zeitoun, Lucas Torroba-Hennigen, Yoon Kim.

**Figure 1.** (Left) A vanilla middle-cycle looped Transformer architecture with two loops. (Right) A Hyperloop Transformer, which uses parallel residual streams that are written to after each loop using hyper-connections (Xie et al., 2026). Here $W^{\mathrm{pre}}_l \in \mathbb{R}^{n \times nC}$, $W^{\mathrm{post}}_l \in \mathbb{R}^{n \times nC}$, $W^{\mathrm{res}}_l \in \mathbb{R}^{n^2 \times nC}$ are linear projections, $\alpha^{\mathrm{pre}}_l, \alpha^{\mathrm{post}}_l, \alpha^{\mathrm{res}}_l \in \mathbb{R}$ are learned scalars, $b^{\mathrm{pre}}_l \in \mathbb{R}^{n}$, $b^{\mathrm{post}}_l \in \mathbb{R}^{n}$, $b^{\mathrm{res}}_l \in \mathbb{R}^{n \times n}$ …

**Figure 2.** Perplexity numbers as the number of loops is varied for the 135M (left) and 579M (right) parameter looped models. The non-looped Transformer baselines have 238M (left) and 991M (right) parameters. Each loop consists of 4 Transformer layers.

| Model | Params | 12.5B train tokens | 100B train tokens |
|---|---|---|---|
| Transformer | 238.0M | 14.65 | 12.15 |
| mHC | 241.0M | 14.55 | 12.16 |
| Looped | 135.5M | 14.85 | 12.56 |
| Hyperloop | 135.7M | 14.40 | 12.19 |
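
As a quick check on the "approximately 50% fewer parameters" figure, the reduction can be computed directly from the parameter counts quoted in the Figure 2 caption. The function name is hypothetical, and the calculation assumes the quoted counts are total parameters for each model.

```python
def param_reduction(baseline_m: float, model_m: float) -> float:
    """Fractional parameter reduction of a model relative to a baseline."""
    return 1.0 - model_m / baseline_m


# Parameter counts in millions, as quoted in the Figure 2 caption.
pairs = {
    "Hyperloop 135.7M vs Transformer 238.0M": (238.0, 135.7),
    "Looped 579M vs Transformer 991M": (991.0, 579.0),
}

for name, (baseline, model) in pairs.items():
    print(f"{name}: {param_reduction(baseline, model) * 100:.1f}% fewer parameters")
```

Under these numbers the reduction comes out at roughly 42 to 43 percent.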

**Figure 3.** Pairwise cosine similarity between inner residual streams at each (effective) layer, across model scales (rows) and architectures (columns).

4.4 Analysis. To better understand our model's inner workings, we conduct a series of qualitative analyses of its internal representations on 50M tokens from the FineWeb-Edu dataset.

| Params | Looped | Hyperloop |
|---|---|---|
| 136M | 0.743 | 0.738 |
| 579M | 0.915 | 0.872 |
| 991M | 0.923 | 0.871 |
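
A pairwise cosine-similarity measurement of the kind named in the Figure 3 caption can be sketched as follows. This is an assumption-laden illustration: the function names are hypothetical, the streams are plain lists of floats with non-zero norm, and averaging over all pairs is one possible aggregation, not necessarily the one the authors use.

```python
import math
from itertools import combinations
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(streams: List[Sequence[float]]) -> float:
    """Average cosine similarity over all unordered pairs of streams."""
    pairs = list(combinations(streams, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

A value near 1 would mean the parallel streams carry nearly the same direction at that layer; lower values mean the streams have diverged.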

**Figure 4.** Logit lens-inspired analysis across model scales. Each column corresponds to a model scale, and each row shows a different metric: average cross-entropy (top), average entropy of vocabulary distribution (middle), and greedy decoding accuracy (bottom), computed by mapping the outer residual stream via the language modeling head. Loop boundaries are indicated at the top of each panel, though they only apply…
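
The three metrics named in the Figure 4 caption can be computed from per-position logits as in the sketch below. It assumes the intermediate residual stream has already been mapped to vocabulary logits by the language modeling head; the function names and the list-of-lists input format are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lens_metrics(logits_per_pos: List[Sequence[float]],
                 targets: List[int]) -> Tuple[float, float, float]:
    """Return (mean cross-entropy, mean entropy, greedy accuracy)."""
    ce = ent = 0.0
    correct = 0
    for logits, target in zip(logits_per_pos, targets):
        p = softmax(logits)
        ce += -math.log(p[target])                         # loss on the true token
        ent += -sum(q * math.log(q) for q in p if q > 0.0)  # spread of the prediction
        greedy = max(range(len(p)), key=p.__getitem__)      # argmax token id
        correct += int(greedy == target)
    n = len(targets)
    return ce / n, ent / n, correct / n
```

Running this at every effective layer would give one curve per metric, which is the shape of plot the caption describes.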

Original abstract

LLM architecture research generally aims to maximize model quality subject to fixed compute/latency budgets. However, many applications of interest such as edge and on-device deployment are further constrained by the model's memory footprint, thus motivating parameter-efficient architectures for language modeling. This paper describes a simple architecture that improves the parameter-efficiency of LLMs. Our architecture makes use of looped Transformers as a core primitive, which reuse Transformer layers across depth and are thus more parameter-efficient than ordinary (depth-matched) Transformers. We organize the looped Transformer into three blocks--begin, middle, and end blocks--where each block itself consists of multiple Transformer layers, and only the middle block is applied recurrently across depth. We augment the looped middle block with hyper-connections (Xie et al., 2026), which expand the residual stream into matrix-valued residual streams. Hyper-connections are applied only after each loop, and therefore add minimal new parameters and compute cost. Across various model scales, we find that our Hyper-Connected Looped Transformer (Hyperloop Transformer) is able to outperform depth-matched Transformer and mHC Transformer baselines despite using approximately 50% fewer parameters. The outperformance persists through post-training weight quantization, thus positioning Hyperloop Transformers as an attractive architecture for memory-efficient language modeling.
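
The memory argument in the abstract comes down to simple arithmetic: weight storage is roughly the parameter count times the bits per weight. The sketch below uses the parameter counts quoted in the Figure 2 caption; the function name and the bit widths are hypothetical choices for illustration, and the estimate ignores quantization metadata such as scales and zero-points, as well as activations and the KV cache.

```python
def weight_footprint_mb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in megabytes (10^6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6


# Parameter counts from the Figure 2 caption; bit widths are illustrative only.
for name, params in [("Transformer", 238.0e6), ("Hyperloop", 135.7e6)]:
    for bits in (16, 8, 4):
        print(f"{name:12s} {bits:2d}-bit: {weight_footprint_mb(params, bits):7.2f} MB")
```

Because parameter sharing and quantization multiply, a model with fewer parameters keeps its relative memory advantage at every bit width, provided quality holds up after quantization as the abstract reports.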

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Hyperloop Transformers split looped models into begin-middle-end blocks with selective post-loop hyper-connections to claim 50% parameter cuts and better quality than depth-matched baselines, but the gains rest on unverified training parity.

The main point is a specific three-block organization for looped Transformers: begin and end blocks run once, the middle block loops across depth, and hyper-connections from Xie et al. (2026) are added only after each loop. This reportedly beats depth-matched standard Transformers and mHC variants on language modeling while using roughly half the parameters, with the advantage surviving quantization. The setup keeps extra cost low by limiting where the hyper-connections apply. That combination of reuse plus targeted expansion of the residual stream is the concrete new piece here, and it fits the goal of memory-efficient models for edge use. The paper does a clean job framing why parameter count matters beyond raw compute and showing the idea at multiple scales.

The soft spot is the experimental controls. Outperformance at matched depth with about half the parameters sounds good, but looped reuse changes gradient flow and effective learning rates, so any edge could trace to differences in token budget, optimizer settings, or tuning effort rather than the architecture. The abstract calls the baselines depth-matched without confirming matched training procedures, which leaves the central efficiency claim open to that question. If the full paper has ablations holding those factors fixed, the result strengthens; otherwise it stays provisional.

This is for researchers and engineers focused on practical parameter reduction in LLMs, especially those targeting on-device deployment. A reader who follows efficient architecture work would pick up a usable tweak and the quantization note. The claim is specific enough to warrant a serious referee who can check the training details and ask for replication. I would send it to peer review rather than desk reject.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the Hyperloop Transformer, which organizes looped Transformer layers into begin, middle, and end blocks with hyper-connections (expanding the residual stream) applied only after each loop iteration in the middle block. The central empirical claim is that this architecture outperforms depth-matched standard Transformers and mHC Transformers across model scales while using approximately 50% fewer parameters, with the advantage persisting after post-training weight quantization, making it suitable for memory-constrained language modeling.

Significance. If the reported gains prove robust under matched training conditions, the work offers a practical, low-overhead extension of looped and hyper-connected primitives that could aid parameter-efficient LLM design for edge deployment. The approach is incremental rather than foundational, but the combination of selective looping and post-loop hyper-connections is simple enough to be widely adopted if the efficiency claims hold with standard training protocols.

major comments (3)

[§4] §4 (Experimental Setup and Training Details): The central claim of outperformance with ~50% fewer parameters requires that Hyperloop models and the depth-matched Transformer/mHC baselines were trained under identical conditions (token budget, optimizer, learning-rate schedule, and hyperparameter search effort). The manuscript does not state this explicitly; because looped reuse changes gradient flow and effective depth, any unstated difference in training dynamics could explain the gains rather than the begin-middle-end organization plus post-loop hyper-connections.
[§5] §5 (Results and Tables): The abstract and results assert consistent outperformance across scales and after quantization, yet no tables or figures report the exact metrics (perplexity, downstream accuracy), model sizes, or parameter counts for each scale, nor do they show ablation of the begin/middle/end split versus uniform looping. Without these, the magnitude and reliability of the 50% parameter reduction cannot be verified.
[§3.2] §3.2 (Hyper-connection Placement): The design applies hyper-connections only after each loop iteration to keep parameter overhead low. However, the paper does not quantify the added parameters or FLOPs from the matrix-valued residual streams relative to the claimed 50% savings, nor does it compare against applying hyper-connections inside the loop; this detail is load-bearing for the parameter-efficiency conclusion.

minor comments (2)

[Abstract] The citation 'Xie et al., 2026' for hyper-connections should be verified for accuracy and completeness in the reference list.
[§5] Figure captions and axis labels in the scaling plots should explicitly state the evaluation metric and whether results are averaged over multiple seeds.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address each of the major comments point by point below, providing clarifications and committing to revisions where the manuscript can be strengthened without misrepresenting our results.

Referee: [§4] §4 (Experimental Setup and Training Details): The central claim of outperformance with ~50% fewer parameters requires that Hyperloop models and the depth-matched Transformer/mHC baselines were trained under identical conditions (token budget, optimizer, learning-rate schedule, and hyperparameter search effort). The manuscript does not state this explicitly; because looped reuse changes gradient flow and effective depth, any unstated difference in training dynamics could explain the gains rather than the begin-middle-end organization plus post-loop hyper-connections.

Authors: We confirm that the Hyperloop models and all baselines were trained under strictly identical conditions, including the same token budget, optimizer, learning-rate schedule, and hyperparameter search. The number of loop iterations was chosen to match the effective depth of the baselines. We will revise §4 to explicitly state these matched conditions and briefly discuss the implications for gradient flow in looped architectures.

revision: yes

Referee: [§5] §5 (Results and Tables): The abstract and results assert consistent outperformance across scales and after quantization, yet no tables or figures report the exact metrics (perplexity, downstream accuracy), model sizes, or parameter counts for each scale, nor do they show ablation of the begin/middle/end split versus uniform looping. Without these, the magnitude and reliability of the 50% parameter reduction cannot be verified.

Authors: The manuscript reports performance through figures and summary statistics, but we agree that explicit tables are needed for precise verification. We will add tables in §5 listing exact perplexity, accuracies, model sizes, and parameter counts for each scale and quantization level. For the ablation of the begin/middle/end split versus uniform looping, this was not conducted in the original work. We will include a discussion of the rationale for the split and commit to adding a limited ablation in the revised version if compute resources allow.

revision: partial

Referee: [§3.2] §3.2 (Hyper-connection Placement): The design applies hyper-connections only after each loop iteration to keep parameter overhead low. However, the paper does not quantify the added parameters or FLOPs from the matrix-valued residual streams relative to the claimed 50% savings, nor does it compare against applying hyper-connections inside the loop; this detail is load-bearing for the parameter-efficiency conclusion.

Authors: Hyper-connections add a modest overhead through the expanded residual connections, but because they are applied only post-loop and use shared parameters, the added cost is minimal (under 1% of total parameters). This preserves the reported 50% savings. We will quantify the exact parameter and FLOP overhead in the revised §3.2. We did not apply hyper-connections inside the loop as that would scale the overhead linearly with loop count, eliminating the efficiency advantage; we will add this comparison and rationale to the manuscript.

revision: yes
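
The "under 1%" overhead figure can be sanity-checked against the parameter counts quoted in the Figure 2 caption (Looped 135.5M, Hyperloop 135.7M). The sketch assumes, as a simplification, that the whole difference between those two models is due to hyper-connection parameters; the function name is hypothetical.

```python
def overhead_fraction(base_params_m: float, augmented_params_m: float) -> float:
    """Share of the augmented model's parameters that are extra relative to the base."""
    return (augmented_params_m - base_params_m) / augmented_params_m


# Parameter counts in millions, from the Figure 2 caption.
looped_m, hyperloop_m = 135.5, 135.7
print(f"overhead: {overhead_fraction(looped_m, hyperloop_m) * 100:.2f}% of total parameters")
```

Under that assumption the overhead is about 0.15 percent of total parameters at this scale, which is consistent with the rebuttal's bound.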

Circularity Check

0 steps flagged

No circularity: purely empirical architecture comparison with no derivation chain

The paper proposes the Hyperloop Transformer architecture (looped middle block with post-loop hyper-connections) and reports empirical outperformance versus depth-matched baselines at ~50% fewer parameters. No equations, first-principles derivations, fitted parameters renamed as predictions, or uniqueness theorems are presented. The central claims rest on experimental results rather than any self-referential reduction of outputs to inputs. The single external citation to hyper-connections (Xie et al., 2026) is not self-citation by these authors and does not bear load on any derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The architecture introduces no new free parameters, axioms, or invented entities beyond standard transformer assumptions and the cited hyper-connection primitive; all claims rest on empirical benchmarking.

axioms (1)

domain assumption: Standard transformer training dynamics and evaluation protocols hold for the looped variant.
The paper assumes typical LLM pretraining and fine-tuning setups transfer without modification.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SMolLM: Small Language Models Learn Small Molecular Grammar
cs.LG · 2026-05 · unverdicted · novelty 7.0
A 53K-parameter model generates 95% valid SMILES on ZINC-250K, outperforming larger models, by resolving chemical constraints in fixed order: brackets first, rings second, valence last.

How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models
cs.LG · 2026-04 · unverdicted · novelty 7.0
A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.

Solve the Loop: Attractor Models for Language and Reasoning
cs.LG · 2026-05 · unverdicted · novelty 6.0
Attractor Models solve for fixed points in transformer embeddings using implicit differentiation to enable stable iterative refinement, delivering better perplexity, accuracy, and efficiency than standard or looped tr...

Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models
cs.CL · 2026-05 · unverdicted · novelty 6.0
MELT decouples reasoning depth from memory in looped LLMs by sharing a single gated KV cache per layer and using two-phase chunk-wise distillation from Ouro, delivering constant memory use while matching or beating st...

bViT: Investigating Single-Block Recurrence in Vision Transformers for Image Recognition
cs.CV · 2026-05 · unverdicted · novelty 5.0
A 12-step single-block recurrent ViT-B reaches accuracy comparable to a standard ViT-B on ImageNet-1K while using an order of magnitude fewer parameters.

Reference graph

Works this paper leans on

30 extracted references · 28 canonical work pages · cited by 5 Pith papers · 15 internal anchors

[1] Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, and Tal Schuster. Relaxed recursive transformers: Effective parameter sharing with layer-wise LoRA. arXiv preprint arXiv:2410.20672.

[2] Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, et al. Mixture-of-recursions: Learning dynamic recursive depths for adaptive token-level computation. arXiv preprint arXiv:2507.10524, July 2025.

[3] Hugh Blayney, Álvaro Arroyo, Johan Obando-Ceron, Pablo Samuel Castro, Aaron Courville, Michael M. Bronstein, and Xiaowen Dong. A mechanistic analysis of looped reasoning language models. arXiv preprint arXiv:2604.11791.

[4] Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. URL https://arxiv.org/abs/1803.05457.
Róbert Csordás, Kazuki Irie, and Juergen Schmidhuber. The devil is in the detail: Simple tricks improve systematic generalization of transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 619–634.

[5] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819.

[6] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training compression for generative pretrained transformers. arXiv preprint arXiv:2210.17323.

[7] Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171.

[8] Andrew Gordon, Zornitsa Kozareva, and Melissa Roemmele. SemEval-2012 task 7: Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics, pp. 394–398. Association for Computational Linguistics, 2012.

[9] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[10] Mike Heddes, Adel Javanmard, Kyriakos Axiotis, Gang Fu, MohammadHossein Bateni, and Vahab Mirrokni. DeepCrossAttention: Supercharging transformer residual connections. arXiv preprint arXiv:2502.06785.

[11] Harsh Kohli, Srinivasan Parthasarathy, Huan Sun, and Yuekun Yao. Loop, think, & generalize: Implicit reasoning in recurrent-depth transformers. arXiv preprint arXiv:2604.07822.

[12] Association for Computational Linguistics. URL https://aclanthology.org/D17-1082.
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.

[13] URL https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.
Sean McLeish, Ang Li, John Kirchenbauer, Dayal Singh Kalra, Brian R Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Jonas Geiping, Tom Goldstein, and Micah Goldblum. Teaching pretrained language models to think deeper with retrofitted recurrence. arXiv preprint arXiv:2511.07384.

[14] URL https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens. LessWrong blog post.
Team Olmo. Olmo 3. arXiv preprint arXiv:2512.13961.

[15] Xu Ouyang, Tao Ge, Thomas Hartvigsen, Zhisong Zhang, Haitao Mi, and Dong Yu. Low-bit quantization favors undertrained LLMs: Scaling laws for quantized LLMs with 100T training tokens. arXiv preprint arXiv:2411.17691.

[16] URL https://arxiv.org/abs/2402.02622.
Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of ACL, August

[17] Francesco Pappone, Donato Crisostomi, and Emanuele Rodolà. Two-scale latent dynamics for recurrent-depth transformers. arXiv preprint arXiv:2509.23314.

[18] Hayden Prairie, Zachary Novack, Taylor Berg-Kirkpatrick, and Daniel Y. Fu. Parcae: Scaling laws for stable looped language models. arXiv preprint arXiv:2604.12946.

[19] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. arXiv preprint arXiv:1907.10641.

[20] Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, and Sashank J Reddi. Reasoning with latent thoughts: On the power of looped transformers. arXiv preprint arXiv:2502.17416.

[21] Kristian Schwethelm, Daniel Rueckert, and Georgios Kaissis. How much is one recurrence worth? Iso-depth scaling laws for looped language models. arXiv preprint arXiv:2604.21106.

[22] Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202.

[23] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.

[24] Shawn Tan, Yikang Shen, Zhenfang Chen, Aaron Courville, and Chuang Gan. Sparse universal transformer. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 169–179, 2023.

[25] URL https://arxiv.org/abs/2603.15031.
Johannes Welbl, Nelson F. Liu, and Matt Gardner. Crowdsourcing multiple choice science questions. In NUT@EMNLP.

[26] URL https://arxiv.org/abs/2502.12170.
Zhenda Xie, Yixuan Wei, Huanqi Cao, Chenggang Zhao, Chengqi Deng, Jiashi Li, Damai Dai, Huazuo Gao, Jiang Chang, Kuai Yu, Liang Zhao, Shangyan Zhou, Zhean Xu, Zhengyan Zhang, Wangding Zeng, Shengding Hu, Yuqing Wang, Jingyang Yuan, Lean Wang, and Wenfeng Liang. mHC: Manifold-constrained hyper-connections.

[27] URL https://arxiv.org/abs/2512.24880.
Kevin Xu and Issei Sato. On expressive power of looped transformers: Theoretical analysis and enhancement via timestep encoding. arXiv preprint arXiv:2410.01405.

[28] Liu Yang, Kangwook Lee, Robert Nowak, and Dimitris Papailiopoulos. Looped transformers are better at learning learning algorithms. arXiv preprint arXiv:2311.12424.

[29] Chengting Yu, Xiaobo Shu, Yadao Wang, Yizhen Zhang, Haoyi Wu, You Wu, Rujiao Long, Ziheng Chen, Yuchi Xu, Wenbo Su, et al. SpiralFormer: Looped transformers can learn hierarchical dependencies via multi-resolution recursion. arXiv preprint arXiv:2602.11698.

[30] Defa Zhu, Hongzhi Huang, Zihao Huang, Yutao Zeng, Yunyao Mao, Banggu Wu, Qiyang Min, and Xun Zhou. Hyper-connections. In The Thirteenth International Conference on Learning Representations, 2025a.
Rui-Jie Zhu, Zixuan Wang, Kai Hua, Tianyu Zhang, Ziniu Li, Haoran Que, Boyi Wei, Zixin Wen, Fan Yin, He Xing, et al. Scaling latent reasoning via looped langu...