pith. machine review for the scientific record. sign in

arxiv: 2604.21254 · v2 · submitted 2026-04-23 · 💻 cs.LG · cs.CL

Recognition: unknown

Hyperloop Transformers

Abbas Zeitoun, Lucas Torroba-Hennigen, Yoon Kim

Authors on Pith no claims yet

Pith reviewed 2026-05-09 23:05 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords parameter-efficient transformerslooped transformershyper-connectionslanguage modelingmodel quantizationedge deploymentresidual streams
0
0 comments X

The pith

A looped Transformer with selective hyper-connections outperforms standard models at matched depth while using about 50 percent fewer parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a parameter-efficient architecture for language models that builds on looped Transformers, which reuse the same layers across multiple depths instead of adding new ones. It structures the loop into a begin block, a middle block that repeats, and an end block, then adds hyper-connections only after each middle-block iteration to expand the residual stream into matrices. This selective reuse and augmentation lets the model exceed the quality of both ordinary Transformers and other hyper-connected baselines across tested scales. The quality edge survives post-training weight quantization, which directly addresses the memory limits of edge and on-device deployment.

Core claim

By organizing a looped Transformer into begin, middle, and end blocks and applying hyper-connections only after each loop of the middle block, the resulting Hyperloop Transformer achieves higher language-modeling performance than depth-matched standard Transformers and mHC Transformers while using approximately 50 percent fewer parameters; the advantage remains after post-training quantization.

What carries the argument

The begin-middle-end block organization of the looped Transformer, with hyper-connections applied only after each middle-block loop to create matrix-valued residual streams.

If this is right

  • Hyperloop Transformers can be used in place of standard Transformers when memory footprint is the binding constraint.
  • The architecture supports post-training quantization without losing its relative advantage.
  • Parameter count can be halved relative to depth-matched baselines while preserving or improving quality on the tested language-modeling tasks.
  • The design positions the model as suitable for memory-efficient language modeling on edge devices.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the pattern scales, the same looping-plus-selective-connection approach could let practitioners fit larger effective models inside fixed on-device memory budgets.
  • The selective reuse of only the middle block might combine with other compression methods such as pruning or distillation to multiply efficiency gains.
  • The same block organization could be tested on non-language sequence tasks where depth is currently limited by memory rather than compute.

Load-bearing premise

The specific placement of loops and hyper-connections in begin-middle-end blocks produces the observed gains without needing changes to training procedures or hyperparameters at new scales or tasks.

What would settle it

Training a Hyperloop Transformer and a depth-matched standard Transformer to the same parameter count on a held-out task or larger scale and finding that the looped version no longer shows higher accuracy or lower perplexity.

Figures

Figures reproduced from arXiv: 2604.21254 by Abbas Zeitoun, Lucas Torroba-Hennigen, Yoon Kim.

Figure 1
Figure 1. Figure 1: (Left) A vanilla middle-cycle looped Transformer architecture with two loops. (Right) A Hyperloop Transformer, which uses parallel residual streams that are written to after each loop using hyper-connections (Xie et al., 2026). Here W pre l ∈ Rn×nC, W post l ∈ Rn×nC, Wres l ∈ Rn 2×nC are linear projections, α pre l , α post l , α res l ∈ R are learned scalars, b pre l ∈ Rn , b post l ∈ Rn , b res l ∈ Rn×n … view at source ↗
Figure 2
Figure 2. Figure 2: Perplexity numbers as the number of loops is varied for the 135M (left) and 579M (right) parameter looped models. The non-looped Transformer baselines have 238M (left) and 991M (right) parameters. Each loop consists of 4 Transformer layers. Train Tokens Model Params 12.5B 100B Transformer 238.0 M 14.65 12.15 mHC 241.0 M 14.55 12.16 Looped 135.5 M 14.85 12.56 Hyperloop 135.7 M 14.40 12.19 [PITH_FULL_IMAGE:… view at source ↗
Figure 3
Figure 3. Figure 3: Pairwise cosine similarity between inner residual streams at each (effective) layer, across model scales (rows) and architectures (columns). 4.4 Analysis To better understand our model’s inner workings, we conduct a series of qualitative analyses of its internal representations on 50M tokens from the FineWeb-Edu dataset. Params Looped Hyperloop 136M 0.743 0.738 579M 0.915 0.872 991M 0.923 0.871 [PITH_FULL… view at source ↗
Figure 4
Figure 4. Figure 4: Logit lens-inspired analysis across model scales. Each column corresponds to a model scale, and each row shows a different metric: average cross-entropy (top), aver￾age entropy of vocabulary distribution (middle), and greedy decoding accuracy (bottom), computed by mapping the outer residual stream via the language modeling head. Loop boundaries are indicated at the top of each panel, though they only apply… view at source ↗
read the original abstract

LLM architecture research generally aims to maximize model quality subject to fixed compute/latency budgets. However, many applications of interest such as edge and on-device deployment are further constrained by the model's memory footprint, thus motivating parameter-efficient architectures for language modeling. This paper describes a simple architecture that improves the parameter-efficiency of LLMs. Our architecture makes use of looped Transformers as a core primitive, which reuse Transformer layers across depth and are thus more parameter-efficient than ordinary (depth-matched) Transformers. We organize the looped Transformer into three blocks--begin, middle, and end blocks--where each block itself consists of multiple Transformer layers, and only the middle block is applied recurrently across depth. We augment the looped middle block with hyper-connections (Xie et al., 2026), which expand the residual stream into matrix-valued residual streams. Hyper-connections are applied only after each loop, and therefore add minimal new parameters and compute cost. Across various model scales, we find that our Hyper-Connected Looped Transformer (Hyperloop Transformer) is able to outperform depth-matched Transformer and mHC Transformer baselines despite using approximately 50% fewer parameters. The outperformance persists through post-training weight quantization, thus positioning Hyperloop Transformers as an attractive architecture for memory-efficient language modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the Hyperloop Transformer, which organizes looped Transformer layers into begin, middle, and end blocks with hyper-connections (expanding the residual stream) applied only after each loop iteration in the middle block. The central empirical claim is that this architecture outperforms depth-matched standard Transformers and mHC Transformers across model scales while using approximately 50% fewer parameters, with the advantage persisting after post-training weight quantization, making it suitable for memory-constrained language modeling.

Significance. If the reported gains prove robust under matched training conditions, the work offers a practical, low-overhead extension of looped and hyper-connected primitives that could aid parameter-efficient LLM design for edge deployment. The approach is incremental rather than foundational, but the combination of selective looping and post-loop hyper-connections is simple enough to be widely adopted if the efficiency claims hold with standard training protocols.

major comments (3)
  1. [§4] §4 (Experimental Setup and Training Details): The central claim of outperformance with ~50% fewer parameters requires that Hyperloop models and the depth-matched Transformer/mHC baselines were trained under identical conditions (token budget, optimizer, learning-rate schedule, and hyperparameter search effort). The manuscript does not state this explicitly; because looped reuse changes gradient flow and effective depth, any unstated difference in training dynamics could explain the gains rather than the begin-middle-end organization plus post-loop hyper-connections.
  2. [§5] §5 (Results and Tables): The abstract and results assert consistent outperformance across scales and after quantization, yet no tables or figures report the exact metrics (perplexity, downstream accuracy), model sizes, or parameter counts for each scale, nor do they show ablation of the begin/middle/end split versus uniform looping. Without these, the magnitude and reliability of the 50% parameter reduction cannot be verified.
  3. [§3.2] §3.2 (Hyper-connection Placement): The design applies hyper-connections only after each loop iteration to keep parameter overhead low. However, the paper does not quantify the added parameters or FLOPs from the matrix-valued residual streams relative to the claimed 50% savings, nor does it compare against applying hyper-connections inside the loop; this detail is load-bearing for the parameter-efficiency conclusion.
minor comments (2)
  1. [Abstract] The citation 'Xie et al., 2026' for hyper-connections should be verified for accuracy and completeness in the reference list.
  2. [§5] Figure captions and axis labels in the scaling plots should explicitly state the evaluation metric and whether results are averaged over multiple seeds.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address each of the major comments point by point below, providing clarifications and committing to revisions where the manuscript can be strengthened without misrepresenting our results.

read point-by-point responses
  1. Referee: [§4] §4 (Experimental Setup and Training Details): The central claim of outperformance with ~50% fewer parameters requires that Hyperloop models and the depth-matched Transformer/mHC baselines were trained under identical conditions (token budget, optimizer, learning-rate schedule, and hyperparameter search effort). The manuscript does not state this explicitly; because looped reuse changes gradient flow and effective depth, any unstated difference in training dynamics could explain the gains rather than the begin-middle-end organization plus post-loop hyper-connections.

    Authors: We confirm that the Hyperloop models and all baselines were trained under strictly identical conditions, including the same token budget, optimizer, learning-rate schedule, and hyperparameter search. The number of loop iterations was chosen to match the effective depth of the baselines. We will revise §4 to explicitly state these matched conditions and briefly discuss the implications for gradient flow in looped architectures. revision: yes

  2. Referee: [§5] §5 (Results and Tables): The abstract and results assert consistent outperformance across scales and after quantization, yet no tables or figures report the exact metrics (perplexity, downstream accuracy), model sizes, or parameter counts for each scale, nor do they show ablation of the begin/middle/end split versus uniform looping. Without these, the magnitude and reliability of the 50% parameter reduction cannot be verified.

    Authors: The manuscript reports performance through figures and summary statistics, but we agree that explicit tables are needed for precise verification. We will add tables in §5 listing exact perplexity, accuracies, model sizes, and parameter counts for each scale and quantization level. For the ablation of the begin/middle/end split versus uniform looping, this was not conducted in the original work. We will include a discussion of the rationale for the split and commit to adding a limited ablation in the revised version if compute resources allow. revision: partial

  3. Referee: [§3.2] §3.2 (Hyper-connection Placement): The design applies hyper-connections only after each loop iteration to keep parameter overhead low. However, the paper does not quantify the added parameters or FLOPs from the matrix-valued residual streams relative to the claimed 50% savings, nor does it compare against applying hyper-connections inside the loop; this detail is load-bearing for the parameter-efficiency conclusion.

    Authors: Hyper-connections add a modest overhead through the expanded residual connections, but because they are applied only post-loop and use shared parameters, the added cost is minimal (under 1% of total parameters). This preserves the reported 50% savings. We will quantify the exact parameter and FLOP overhead in the revised §3.2. We did not apply hyper-connections inside the loop as that would scale the overhead linearly with loop count, eliminating the efficiency advantage; we will add this comparison and rationale to the manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical architecture comparison with no derivation chain

full rationale

The paper proposes the Hyperloop Transformer architecture (looped middle block with post-loop hyper-connections) and reports empirical outperformance versus depth-matched baselines at ~50% fewer parameters. No equations, first-principles derivations, fitted parameters renamed as predictions, or uniqueness theorems are presented. The central claims rest on experimental results rather than any self-referential reduction of outputs to inputs. The single external citation to hyper-connections (Xie et al., 2026) is not self-citation by these authors and does not bear load on any derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The architecture introduces no new free parameters, axioms, or invented entities beyond standard transformer assumptions and the cited hyper-connection primitive; all claims rest on empirical benchmarking.

axioms (1)
  • domain assumption Standard transformer training dynamics and evaluation protocols hold for the looped variant
    The paper assumes typical LLM pretraining and fine-tuning setups transfer without modification.

pith-pipeline@v0.9.0 · 5519 in / 1136 out tokens · 20853 ms · 2026-05-09T23:05:35.582971+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SMolLM: Small Language Models Learn Small Molecular Grammar

    cs.LG 2026-05 unverdicted novelty 7.0

    A 53K-parameter model generates 95% valid SMILES on ZINC-250K, outperforming larger models, by resolving chemical constraints in fixed order: brackets first, rings second, valence last.

  2. How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.

  3. Solve the Loop: Attractor Models for Language and Reasoning

    cs.LG 2026-05 unverdicted novelty 6.0

    Attractor Models solve for fixed points in transformer embeddings using implicit differentiation to enable stable iterative refinement, delivering better perplexity, accuracy, and efficiency than standard or looped tr...

  4. Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    MELT decouples reasoning depth from memory in looped LLMs by sharing a single gated KV cache per layer and using two-phase chunk-wise distillation from Ouro, delivering constant memory use while matching or beating st...

  5. bViT: Investigating Single-Block Recurrence in Vision Transformers for Image Recognition

    cs.CV 2026-05 unverdicted novelty 5.0

    A 12-step single-block recurrent ViT-B reaches accuracy comparable to a standard ViT-B on ImageNet-1K while using an order of magnitude fewer parameters.

Reference graph

Works this paper leans on

30 extracted references · 28 canonical work pages · cited by 5 Pith papers · 15 internal anchors

  1. [1]

    Relaxed recursive transformers: Effective parameter sharing with layer-wise LoRA

    Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, and Tal Schuster. Relaxed recursive transformers: Effective parameter sharing with layer-wise lora.arXiv preprint arXiv:2410.20672,

  2. [2]

    Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation, July 2025

    Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, et al. Mixture-of-recursions: Learn- ing dynamic recursive depths for adaptive token-level computation.arXiv preprint arXiv:2507.10524,

  3. [3]

    A Mechanistic Analysis of Looped Reasoning Language Models

    Hugh Blayney, ´Alvaro Arroyo, Johan Obando-Ceron, Pablo Samuel Castro, Aaron Courville, and Michael M. Bronstein andXiaowen Dong. A mechanistic analysis of looped reasoning language models.arXiv preprint arXiv:2604.11791,

  4. [4]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    URLhttps://arxiv.org/abs/1803.05457. R´obert Csord´as, Kazuki Irie, and Juergen Schmidhuber. The devil is in the detail: Simple tricks improve systematic generalization of transformers. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 619–634,

  5. [5]

    Universal Transformers

    Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers.arXiv preprint arXiv:1807.03819,

  6. [6]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training compression for generative pretrained transformers.arXiv preprint arXiv:2210.17323,

  7. [7]

    Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach.arXiv preprint arXiv:2502.05171,

  8. [8]

    SemEval-2012 task 7: Choice of plausible alternatives: An evaluation of commonsense causal reasoning

    Andrew Gordon, Zornitsa Kozareva, and Melissa Roemmele. SemEval-2012 task 7: Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In*SEM 2012: The First Joint Conference on Lexical and Computational Semantics –, pp. 394–398. Association for Computational Linguistics,

  9. [9]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

  10. [10]

    DeepCrossAttention: Supercharging transformer residual connections.arXiv preprint arXiv:2502.06785,

    Mike Heddes, Adel Javanmard, Kyriakos Axiotis, Gang Fu, MohammadHossein Bateni, and Vahab Mirrokni. Deepcrossattention: Supercharging transformer residual connections. arXiv preprint arXiv:2502.06785,

  11. [11]

    Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers

    Harsh Kohli, Srinivasan Parthasarathy, Huan Sun, and Yuekun Yao. Loop, think, & general- ize: Implicit reasoning in recurrent-depth transformers.arXiv preprint arXiv:2604.07822,

  12. [12]

    ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

    Association for Computational Linguistics. URL https://aclanthology.org/D17-1082. Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942,

  13. [13]

    Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Jonas Geiping, Tom Goldstein, and Micah Goldblum

    URL https://huggingface.co/datasets/ HuggingFaceFW/fineweb-edu. Sean McLeish, Ang Li, John Kirchenbauer, Dayal Singh Kalra, Brian R Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Jonas Geiping, Tom Goldstein, and Micah Goldblum. Teaching pretrained language models to think deeper with retrofitted recurrence.arXiv preprint arXiv:2511.07384,

  14. [14]

    Olmo 3

    URL https://www.lesswrong. com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens . LessWrong blog post. Team Olmo. Olmo 3.arXiv preprint arXiv:2512.13961,

  15. [15]

    Low-Bit Quantization Favors Undertrained LLMs,

    Xu Ouyang, Tao Ge, Thomas Hartvigsen, Zhisong Zhang, Haitao Mi, and Dong Yu. Low-bit quantization favors undertrained llms: Scaling laws for quantized llms with 100t training tokens.arXiv preprint arXiv:2411.17691,

  16. [16]

    Denis Paperno, Germ ´an Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fern´andez

    URLhttps://arxiv.org/abs/2402.02622. Denis Paperno, Germ ´an Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fern´andez. The LAMBADA dataset: Word prediction requiring a broad discourse context. InProceedings of ACL, August

  17. [17]

    Two-scale latent dynamics for recurrent-depth transformers.arXiv preprint arXiv:2509.23314,

    Francesco Pappone, Donato Crisostomi, and Emanuele Rodol`a. Two-scale latent dynamics for recurrent-depth transformers.arXiv preprint arXiv:2509.23314,

  18. [18]

    Hayden Prairie, Zachary Novack, Taylor Berg-Kirkpatrick, and Daniel Y. Fu. Parcae: Scaling laws for stable looped language models.arXiv preprint arXiv:2604.12946,

  19. [19]

    WinoGrande: An Adversarial Winograd Schema Challenge at Scale

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale.arXiv preprint arXiv:1907.10641,

  20. [20]

    Reasoning with latent thoughts: On the power of looped transformers

    12 Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, and Sashank J Reddi. Rea- soning with latent thoughts: On the power of looped transformers.arXiv preprint arXiv:2502.17416,

  21. [21]

    How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models

    Kristian Schwethelm, Daniel Rueckert, and Georgios Kaissis. How much is one recurrence worth? iso-depth scaling laws for looped language models.arXiv preprint arXiv:2604.21106,

  22. [22]

    GLU Variants Improve Transformer

    Noam Shazeer. Glu variants improve transformer.arXiv preprint arXiv:2002.05202,

  23. [23]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture- of-experts layer.arXiv preprint arXiv:1701.06538,

  24. [24]

    Sparse universal transformer

    Shawn Tan, Yikang Shen, Zhenfang Chen, Aaron Courville, and Chuang Gan. Sparse universal transformer. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 169–179,

  25. [25]

    arXiv preprint arXiv:2603.15031 (2026)

    URL https://arxiv.org/abs/ 2603.15031. Johannes Welbl, Nelson F. Liu, and Matt Gardner. Crowdsourcing multiple choice science questions. InNUT@EMNLP,

  26. [26]

    URL https: //arxiv.org/abs/2502.12170. Zhenda Xie, Yixuan Wei, Huanqi Cao, Chenggang Zhao, Chengqi Deng, Jiashi Li, Damai Dai, Huazuo Gao, Jiang Chang, Kuai Yu, Liang Zhao, Shangyan Zhou, Zhean Xu, Zhengyan Zhang, Wangding Zeng, Shengding Hu, Yuqing Wang, Jingyang Yuan, Lean Wang, and Wenfeng Liang. mhc: Manifold-constrained hyper-connections,

  27. [27]

    arXiv preprint arXiv:2512.24880 , year=

    URL https: //arxiv.org/abs/2512.24880. Kevin Xu and Issei Sato. On expressive power of looped transformers: Theoretical analysis and enhancement via timestep encoding.arXiv preprint arXiv:2410.01405,

  28. [28]

    Looped transformers are better at learning learning algorithms.arXiv preprint arXiv:2311.12424,

    Liu Yang, Kangwook Lee, Robert Nowak, and Dimitris Papailiopoulos. Looped transformers are better at learning learning algorithms.arXiv preprint arXiv:2311.12424,

  29. [29]

    SpiralFormer: Looped Transformers Can Learn Hierarchical Dependencies via Multi-Resolution Recursion

    Chengting Yu, Xiaobo Shu, Yadao Wang, Yizhen Zhang, Haoyi Wu, You Wu, Rujiao Long, Ziheng Chen, Yuchi Xu, Wenbo Su, et al. Spiralformer: Looped transformers can learn hierarchical dependencies via multi-resolution recursion.arXiv preprint arXiv:2602.11698,

  30. [30]

    I ’m born in 1 8 7 1

    13 Defa Zhu, Hongzhi Huang, Zihao Huang, Yutao Zeng, Yunyao Mao, Banggu Wu, Qiyang Min, and Xun Zhou. Hyper-connections. InThe Thirteenth International Conference on Learning Representations, 2025a. Rui-Jie Zhu, Zixuan Wang, Kai Hua, Tianyu Zhang, Ziniu Li, Haoran Que, Boyi Wei, Zixin Wen, Fan Yin, He Xing, et al. Scaling latent reasoning via looped langu...