pith. machine review for the scientific record.

arxiv: 2604.11791 · v1 · submitted 2026-04-13 · 💻 cs.LG · cs.AI

Recognition: unknown

A Mechanistic Analysis of Looped Reasoning Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:50 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords looped language models · mechanistic interpretability · recurrent blocks · fixed points · inference stages · attention stabilization · latent trajectories · reasoning models

The pith

Looped language models repeat the same inference stages as feedforward models inside each recurrent cycle.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates the internal workings of language models that reuse their layers in a loop to improve reasoning. It shows that these recurrent blocks settle into a repeating pattern in which each layer reaches its own fixed state, so the model traces the same sequence of computations over and over. This pattern closely mirrors the step-by-step inference seen in ordinary feedforward models but applies it multiple times in depth. The analysis examines how loop size, input handling, and normalization shape whether these cycles form and stay stable. The findings matter because they turn abstract observations about model behavior into concrete guidance for building more reliable reasoning systems.

Core claim

Recurrent blocks learn stages of inference that closely mirror those of feedforward models, repeating these stages in depth with each iteration. For many studied models, each layer in the cycle converges to a distinct fixed point, so the recurrent block follows a consistent cyclic trajectory in the latent space. As these fixed points are reached, attention-head behavior stabilizes and remains constant across recurrences.
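
Below is a minimal sketch of the kind of diagnostic behind this claim, not the authors' code: it assumes a hypothetical `states` array of per-layer residual streams collected from a looped model, and compares each layer's state at every recurrence to that layer's state at the final recurrence, the "approximate fixed point" used in the paper's figures.

```python
# Sketch only: per-layer fixed-point diagnostic for a looped model.
# `states` is a hypothetical array of shape (recurrences, layers, dim)
# holding one token's residual stream after each recurrent layer at
# each recurrence; how it is collected depends on the model's hooks.
import numpy as np

def fixed_point_similarity(states: np.ndarray) -> np.ndarray:
    """Cosine similarity of each layer's state at every recurrence to
    that layer's state at the final recurrence."""
    fixed = states[-1]                                   # (layers, dim)
    dots = np.einsum("rld,ld->rl", states, fixed)        # (rec, layers)
    norms = (np.linalg.norm(states, axis=-1) *
             np.linalg.norm(fixed, axis=-1)[None, :])
    return dots / np.clip(norms, 1e-12, None)

# Placeholder data; a cyclic fixed point would show every column
# tending to 1 while different layers settle on distinct states.
states = np.random.randn(128, 4, 64)
print(fixed_point_similarity(states)[-3:].round(3))
```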

What carries the argument

Cyclic recurrence of latent states, in which each layer converges to its own fixed point and produces a stable repeating trajectory.

If this is right

  • Larger recurrent blocks tend to produce more stable fixed points and clearer cyclic trajectories.
  • Input injection and normalization choices control how quickly and reliably the fixed points emerge.
  • Once attention stabilizes, the model applies the same reasoning steps on every loop iteration.
  • These dynamics offer practical rules for choosing loop depth and architectural details in new models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The mirroring effect implies that adding more loop iterations could increase reasoning depth without enlarging the underlying model.
  • Disrupting the fixed-point convergence might serve as a test for whether a looped model truly benefits from recurrence.
  • The same cyclic structure could appear in other recurrent designs outside language models, offering a general principle for iterative computation.

Load-bearing premise

The convergence to distinct per-layer fixed points and the stabilization of attention are general properties of looped architectures rather than artifacts of the specific models or training procedures studied.

What would settle it

Train a looped model with a different block size or normalization scheme and check whether the per-layer fixed points and mirroring of feedforward stages still appear in the latent trajectories.

Figures

Figures reproduced from arXiv: 2604.11791 by Aaron Courville, Álvaro Arroyo, Hugh Blayney, Johan Obando-Ceron, Michael M. Bronstein, Pablo Samuel Castro, Xiaowen Dong.

Figure 1
Figure 1: Latent states after each block in a recurrent model frequently tend towards separate fixed points, meaning that the application of a recurrent block tends towards a consistent trajectory in latent space. view at source ↗
Figure 2
Figure 2: Frobenius norm between attention patterns at different depths, averaged across the batch and head dimensions. Depth index visualized on each axis; cells show the norms between attention patterns at each pair of depth indices. Left: Ouro 1.4B (Zhu et al., 2025). Center: Retrofitted Llama (McLeish et al., 2025). Right: Huginn-0125 (Geiping et al., 2025). All models looped 8 times. view at source ↗
Figure 5
Figure 5: Retrofitted Llama (McLeish et al., 2025) latent space trajectory traced out by the hidden states of the final sequence position on a single test prompt; reduced to two dimensions by computing PCA over all final sequence position embeddings. Trajectories perfectly overlap in the second plot, demonstrating that a cyclic fixed point has been reached. view at source ↗
Figure 4
Figure 4: Norm of the difference between the residual stream after each layer in the recurrent block and its “approximate fixed point” - the residual stream after that layer in the 128th recurrence. While Huginn-0125 and retrofitted Llama quickly reach a fixed point, Ouro does not - despite small successive differences evidenced by … view at source ↗
Figure 6
Figure 6: Cosine similarity between residual streams after each layer and the approximate fixed point for a range of norms, with and without input injection. Each model is randomly initialized with 12 layers. Cosine similarity is taken between the residual stream after the first layer at each recurrence and left: the approximate fixed point of the first layer, right: the approximate fixed point of the layer with th… view at source ↗
Figure 8
Figure 8: Stages of inference for each recurrent loop in left: Ouro 1.4B (Zhu et al., 2025), center: retrofitted Llama and right: retrofitted OLMo (McLeish et al., 2025). Ouro 1.4B resembles Llama stages of inference, and the two retrofitted to their associated base models. view at source ↗
Figure 9
Figure 9: Norms of the residual stream for a range of models, demonstrating that Huginn-0125 is unable to develop the activation magnitude changes required for stages of inference due to its repeated normalization of the residual stream. view at source ↗
Figure 10
Figure 10: ColSum concentrations for small-scale trained Looped Transformers with a simplified loss function and constant train recurrence schedule of 4 recurrences. Also visualized in red is a “control” feedforward Transformer of depth 12. All models have 2 prelude and 2 coda layers with no input injection, left: 4 recurrent layers, center: 8 recurrent layers, right: 12 recurrent layers. view at source ↗
Figure 11
Figure 11: Colsum concentration of each layer with successive recurrences for left: retrofitted Llama and right: Ouro 1.4B, both using 128 recurrences. While the layers of retrofitted Llama quickly converge to constant ColSum concentration, the layers of Ouro continually change throughout the recurrences tested. view at source ↗
Figure 12
Figure 12: Colsum concentration of each layer vs the percentage depth at which that layer appears in the recurrent block. Left: retrofitted Llama and right: Ouro 1.4B, both using 128 recurrences. Feedforward Llama shown in dashed red. view at source ↗
Figure 13
Figure 13: Visualizing the component parts of the Orbit detection algorithm of Algorithm 1. The input sequence (cosine similarities for the residual streams of a given token and layer and successive recursions, as compared to their final residual stream) is visualized in the leftmost column. The center column visualizes the effect of windowing and de-trending, and the rightmost column shows the FFT magnitudes. … view at source ↗
Figure 14
Figure 14: Conditional probabilities of co-occurrence for the different limiting behaviors. view at source ↗
Figure 15
Figure 15: PCA trajectories in the intermediate layers of Huginn-0125: this reproduces the leftmost column of … view at source ↗
Figure 16
Figure 16: Stages of inference metrics (sink rate, mixing score and colsum concentration) for Huginn-0125 across 128 recurrences, for the GSM8k test prompt that exhibited the largest orbit amplitude. Visualized in realized depth. view at source ↗
Figure 17
Figure 17: Stages of inference metrics (sink rate, mixing score and colsum concentration) for Huginn-0125 across 128 recurrences, for the GSM8k test prompt that exhibited the largest orbit amplitude: only the final 64 recurrences are visualized, to isolate the impact of the orbit. Visualized in percentage block depth. view at source ↗
Figure 18
Figure 18: Stages of inference metrics (sink rate, mixing score and colsum concentration) for Retrofitted Llama across 128 recurrences, for the GSM8k test prompt that exhibited the largest orbit amplitude: only the final 64 recurrences are visualized, to isolate the impact of the orbit. Visualized in percentage block depth. view at source ↗
Figure 19
Figure 19: complements … view at source ↗
Figure 20
Figure 20: Cosine similarity between residual streams after every pair of layers for different Transformer models, averaged across the batch and sequence dimensions. Left: Huginn-0125 (Geiping et al., 2025). Center Left: Retrofitted Llama. Center Right: Retrofitted OLMo. Right: Retrofitted TinyLlama (McLeish et al., 2025). All models looped 32 times. Extended version of … view at source ↗
Figure 21
Figure 21: Frobenius norm between attention matrices for different Transformer models, averaged across the batch and head dimensions. Left: Huginn-0125 (Geiping et al., 2025). Center Left: Retrofitted Llama. Center Right: Retrofitted OLMo. Right: Retrofitted TinyLlama (McLeish et al., 2025). All models looped 32 times. Extended version of … view at source ↗
Figure 22
Figure 22: Cosine similarity between the residual stream after each layer in the recurrent block and its “approximate fixed point” - the residual stream after that layer in the 128th recursion. While Huginn-0125 and retrofitted Llama quickly reach a fixed point, Ouro does not - even though the cosine similarity between successive recursions tends towards one, as evidenced by … view at source ↗
Figure 23
Figure 23: Frobenius norm between attention matrices of each layer in the recurrent block and their corresponding “approximate fixed point”; the attention matrices of the same layer in the 128th recursion. While Huginn-0125 and retrofitted Llama quickly reach a fixed point, Ouro does not. view at source ↗
Figure 24
Figure 24: Cosine similarities between successive recursions of the residual stream after the same layer. view at source ↗
Figure 25
Figure 25: Frobenius norm between attention matrices of each layer between successive recurrences. view at source ↗
Figure 26
Figure 26: Ouro 1.4B (Zhu et al., 2025) latent space trajectory traced out by the hidden states of the final sequence position on the test prompt; reduced to two dimensions by computing PCA over all final sequence position embeddings. view at source ↗
Figure 27
Figure 27: Huginn-0125 (Geiping et al., 2025) latent space trajectory traced out by the hidden states of the final sequence position on the test prompt; reduced to two dimensions by computing PCA over all final sequence position embeddings. view at source ↗
Figure 28
Figure 28: Ouro 1.4B (Zhu et al., 2025) latent space trajectory traced out by the hidden states of the final sequence position on a “maths” test prompt (The square root of 16 is); reduced to two dimensions by computing PCA over all final sequence position embeddings. view at source ↗
Figure 29
Figure 29: Norm of the difference between the residual stream after the first layer in the recurrent block for successive recurrences and an “approximate fixed point” – the residual stream in the 128th recurrence. Two fixed point differences are visualized: the difference to the fixed point of the same (first) layer (blue) and the difference to the fixed point which has the greatest norm difference from the first la… view at source ↗
Figure 30
Figure 30: Cosine similarity between the residual stream after the first layer in the recurrent block for successive recurrences and an “approximate fixed point” – the residual stream in the 128th recurrence. Two fixed point differences are visualized: the difference to the fixed point of the same (first) layer (blue) and the difference to the fixed point which has the lowest cosine similarity to the first layer (gr… view at source ↗
Figure 31
Figure 31: Cosine similarity between the residual stream after the first layer in the recurrent block for successive recurrences and an “approximate fixed point” – the residual stream in the 128th recurrence. Two fixed point differences are visualized: the difference to the fixed point of the same (first) layer (blue) and the difference to the fixed point which has the lowest cosine similarity to the first layer (gr… view at source ↗
Figure 32
Figure 32: presents results for a selection of feedforward models used throughout the paper, and … view at source ↗
Figure 33
Figure 33: Fraction of prediction and suppression neurons in a selection of looped models used throughout the paper. view at source ↗
Figure 34
Figure 34: Stages of inference for a selection of Looped transformers, all using 8 recurrences: Huginn-0125 (Geiping et al., 2025), Ouro 1.4B (Zhu et al., 2025) and Llama with retrofitted recurrences (McLeish et al., 2025). Note Huginn-0125 and Retrofitted Llama have prelude and coda layers too: each 2 layers in Huginn-0125 and each 4 in Retrofitted Llama. For completeness, we plot these stages of inference for all … view at source ↗
Figure 35
Figure 35: Stages of inference for each recurrent loop in Ouro 1.4B. The close overlap with feedforward stages of inference is a particularly striking result as this model is trained from scratch with recurrence. view at source ↗
Figure 36
Figure 36: Stages of inference for each recurrent loop in Huginn-0125. This represents a negative result: stages of inference do not occur. We discuss possible causes for this in the main text. view at source ↗
Figure 37
Figure 37: Stages of inference for each recurrent loop in the retrofitted Llama model (McLeish et al., 2025). Each block demonstrates very similar stages of inference to Llama, the base model from which pretrained layers are taken. view at source ↗
Figure 38
Figure 38: Stages of inference for each recurrent loop in the retrofitted OLMo model (McLeish et al., 2025). Similarly, each block demonstrates very similar stages of inference to OLMo, the base model from which pretrained layers are taken. view at source ↗
Figure 39
Figure 39: Stages of inference for each recurrent loop in the retrofitted TinyLlama model (McLeish et al., 2025). Similarly, each block demonstrates very similar stages of inference to TinyLlama, the base model from which pretrained layers are taken. view at source ↗
Figure 40
Figure 40: Stages of inference for each recurrent loop in Ouro 2.6B. For this model we separate out the first and second half of the recurrent block and overlay them, demonstrating that both halves have close alignment with the Llama feedforward stages of inference. We suggest that this likely arises due to the training regime of Zhu et al. (2025), which first trains a single 24 layer 1.4B parameter model, and then … view at source ↗
Figure 41
Figure 41 view at source ↗
Figure 42
Figure 42: Stages of inference for each recurrent loop in Ouro 1.4B, run on the HellaSwag dataset. view at source ↗
Figure 43
Figure 43: Stages of inference for each recurrent loop in Huginn-0125, run on the HellaSwag dataset. view at source ↗
Figure 44
Figure 44: Stages of inference for each recurrent loop in the retrofitted Llama model (McLeish et al., 2025), run on the HellaSwag dataset. view at source ↗
Figure 45
Figure 45: Stages of inference for each recurrent loop in the retrofitted OLMo model (McLeish et al., 2025), run on the HellaSwag dataset. ColSum concentration deviates slightly from its GSM8k counterpart here, but still broadly follows the same stages of inference as the feedforward OLMo model. view at source ↗
Figure 46
Figure 46: Stages of inference for each recurrent loop in the retrofitted TinyLlama model (McLeish et al., 2025), run on the HellaSwag dataset. view at source ↗
Figure 47
Figure 47: Stages of inference for each of the distinct blocks in Ouro (Zhu et al., 2025), as they are reapplied throughout the model for 128 recurrences. These consistently change throughout the realized depth of the model, reaching no clear fixed point. Mean and standard deviation are over separate inputs to the model, taken over the GSM8k subset. view at source ↗
Figure 48
Figure 48: Stages of inference for each of the distinct blocks in Huginn-0125 (Geiping et al., 2025), as they are reapplied throughout the model for 128 recurrences. These converge to constant behavior. Mean and standard deviation are over separate inputs to the model, taken over the GSM8k subset. view at source ↗
Figure 49
Figure 49: Stages of inference for each of the distinct blocks in retrofitted Llama (McLeish et al., 2025), as they are reapplied throughout the model for 128 recurrences. These converge to constant behavior. Mean and standard deviation are over separate inputs to the model, taken over the GSM8k subset. We additionally plot the extended versions of … view at source ↗
Figure 50
Figure 50: Stages of inference for Ouro with each of 128 recurrences, visualized over percentage recurrent depth. These consistently change with successive recurrences, deviating significantly from the stages of inference seen with train-time recurrences. view at source ↗
Figure 51
Figure 51: Stages of inference for Huginn-0125 with each of 128 recurrences, visualized over percentage recurrent depth. These quickly reach a fixed point and do not deviate far from their starting stages of inference. view at source ↗
Figure 52
Figure 52: Stages of inference for retrofitted Llama with each of 128 recurrences, visualized over percentage recurrent depth. These quickly reach a fixed point and do not deviate far from their starting stages of inference. view at source ↗
Figure 53
Figure 53: Stages of inference for retrofitted OLMo-2 with each of 128 recurrences, visualized over percentage recurrent depth. These quickly reach a fixed point and do not deviate far from their starting stages of inference. view at source ↗
Figure 54
Figure 54: Stages of inference metrics for a small-scale Looped Transformer of configuration (2, 4 ⊗ 4, 2), compared to a “control” feedforward Transformer with the same training configuration and depth 12. view at source ↗
Figure 55
Figure 55: Stages of inference metrics for a small-scale Looped Transformer of configuration (2, 8 ⊗ 4, 2), compared to a “control” feedforward Transformer with the same training configuration and depth 12. view at source ↗
Figure 56
Figure 56: Stages of inference metrics for a small-scale Looped Transformer of configuration (2, 12 ⊗ 4, 2), compared to a “control” feedforward Transformer with the same training configuration and depth 12. view at source ↗
Figure 57
Figure 57: Stages of inference metrics for a small-scale Looped Transformer of configuration (2, 4 ⊗ 4, 2)𝐼, compared to a “control” feedforward Transformer with the same training configuration and depth 12. view at source ↗
Figure 58
Figure 58: Stages of inference metrics for a small-scale Looped Transformer of configuration (2, 8 ⊗ 4, 2)𝐼, compared to a “control” feedforward Transformer with the same training configuration and depth 12. view at source ↗
Figure 59
Figure 59: Stages of inference metrics for a small-scale Looped Transformer of configuration (2, 12 ⊗ 4, 2)𝐼, compared to a “control” feedforward Transformer with the same training configuration and depth 12. view at source ↗
Figure 60
Figure 60: Stages of inference metrics for a small-scale Looped Transformer of configuration (0, 4 ⊗ 4, 0), compared to a “control” feedforward Transformer with the same training configuration and depth 12. view at source ↗
Figure 61
Figure 61: Stages of inference metrics for a small-scale Looped Transformer of configuration (0, 8 ⊗ 4, 0), compared to a “control” feedforward Transformer with the same training configuration and depth 12. view at source ↗
Figure 62
Figure 62: Stages of inference metrics for a small-scale Looped Transformer of configuration (0, 12 ⊗ 4, 0), compared to a “control” feedforward Transformer with the same training configuration and depth 12. view at source ↗
Figure 63
Figure 63: Entire attention pattern floorplan for the retrofitted Llama model, illustrating the cyclic similarity between recurrences. view at source ↗
read the original abstract

Reasoning has become a central capability in large language models. Recent research has shown that reasoning performance can be improved by looping an LLM's layers in the latent dimension, resulting in looped reasoning language models. Despite promising results, few works have investigated how their internal dynamics differ from those of standard feedforward models. In this paper, we conduct a mechanistic analysis of the latent states in looped language models, focusing in particular on how the stages of inference observed in feedforward models compare to those observed in looped ones. To this end, we analyze cyclic recurrence and show that for many of the studied models each layer in the cycle converges to a distinct fixed point; consequently, the recurrent block follows a consistent cyclic trajectory in the latent space. We provide evidence that as these fixed points are reached, attention-head behavior stabilizes, leading to constant behavior across recurrences. Empirically, we discover that recurrent blocks learn stages of inference that closely mirror those of feedforward models, repeating these stages in depth with each iteration. We study how recurrent block size, input injection, and normalization influence the emergence and stability of these cyclic fixed points. We believe these findings help translate mechanistic insights into practical guidance for architectural design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts a mechanistic analysis of looped reasoning language models, showing that recurrent blocks converge to distinct per-layer fixed points in latent space, producing consistent cyclic trajectories. Attention-head behavior stabilizes upon reaching these points, and the blocks learn inference stages that mirror those in feedforward models, repeating the stages across iterations. The work includes ablations examining the effects of recurrent block size, input injection, and normalization on fixed-point emergence and stability.

Significance. If the empirical observations hold, the results offer concrete mechanistic understanding of why looped architectures improve reasoning performance and supply practical guidance for architectural choices. The paper earns credit for grounding claims in direct measurements of latent trajectories and attention patterns rather than indirect performance metrics, along with targeted ablations that test sensitivity to block size and normalization.

major comments (2)
  1. [§4] §4 (latent trajectory analysis): the assertion that recurrent blocks 'closely mirror' feedforward inference stages requires an explicit quantitative metric (e.g., layer-wise activation similarity or stage-transition clustering) and reporting of variance across seeds; without it the mirroring claim rests on qualitative description and cannot be evaluated for robustness.
  2. [§5.2] §5.2 (ablations on block size and normalization): while the experiments show influence on fixed-point stability, the paper does not report statistical tests (e.g., t-tests or bootstrap confidence intervals) comparing convergence rates across conditions, leaving open whether observed differences are reliable or could be artifacts of the specific training runs examined.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'for many of the studied models' should be accompanied by the exact count and identities of models examined to allow readers to assess scope.
  2. [Figures] Figure captions (e.g., those showing cyclic trajectories): add explicit labels for iteration count and fixed-point convergence threshold so that the stabilization behavior is immediately interpretable without cross-referencing the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation for minor revision. We address the major comments point by point below.

read point-by-point responses
  1. Referee: [§4] §4 (latent trajectory analysis): the assertion that recurrent blocks 'closely mirror' feedforward inference stages requires an explicit quantitative metric (e.g., layer-wise activation similarity or stage-transition clustering) and reporting of variance across seeds; without it the mirroring claim rests on qualitative description and cannot be evaluated for robustness.

    Authors: We agree that a quantitative metric would make the mirroring claim more robust and reproducible. In the revised manuscript we will add a layer-wise cosine similarity metric between the fixed-point activations of each recurrent block and the corresponding layers of the feedforward baseline, together with standard deviations computed across five independent random seeds. We will also include a simple stage-transition clustering analysis based on k-means on the activation trajectories to quantify how consistently the inference stages repeat. revision: yes

  2. Referee: [§5.2] §5.2 (ablations on block size and normalization): while the experiments show influence on fixed-point stability, the paper does not report statistical tests (e.g., t-tests or bootstrap confidence intervals) comparing convergence rates across conditions, leaving open whether observed differences are reliable or could be artifacts of the specific training runs examined.

    Authors: We acknowledge that statistical quantification would strengthen the ablation results. In the revision we will report bootstrap confidence intervals (1,000 resamples) for the convergence rates and fixed-point stability metrics across block sizes and normalization variants, computed from five independent training runs per condition (a minimal sketch of this computation follows below). revision: yes
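
A hedged sketch of the bootstrap computation described above: the resample count of 1,000 matches the rebuttal, while the per-run convergence rates and the helper name are placeholders, not values from the paper.

```python
# Sketch only: percentile-bootstrap confidence interval for the mean
# of a per-condition convergence-rate metric (1,000 resamples).
import numpy as np

def bootstrap_ci(per_run_values, n_resamples=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    values = np.asarray(per_run_values, dtype=float)
    means = np.array([
        rng.choice(values, size=values.size, replace=True).mean()
        for _ in range(n_resamples)
    ])
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return values.mean(), (lo, hi)

# Hypothetical convergence rates from five independent runs of one
# ablation condition (block size / normalization variant).
mean, (lo, hi) = bootstrap_ci([0.81, 0.78, 0.84, 0.80, 0.79])
print(f"mean={mean:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```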

Circularity Check

0 steps flagged

No significant circularity: claims rest on direct empirical measurements of activations and fixed-point convergence

full rationale

The paper's central claims—that recurrent blocks converge to per-layer fixed points, stabilize attention, and mirror feedforward inference stages—are supported by direct latent-state analysis, ablations on block size/input injection/normalization, and observation of cyclic trajectories. No equations define a quantity in terms of itself, no fitted parameters are relabeled as predictions, and no load-bearing steps reduce to self-citations or imported uniqueness theorems. The derivation chain consists of measurements on trained models rather than tautological redefinitions, making the analysis self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not introduce new mathematical axioms, free parameters, or invented entities; the analysis relies on standard assumptions of mechanistic interpretability applied to trained transformer models.

pith-pipeline@v0.9.0 · 5530 in / 998 out tokens · 22516 ms · 2026-05-10T15:50:13.721658+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SMolLM: Small Language Models Learn Small Molecular Grammar

    cs.LG 2026-05 unverdicted novelty 7.0

    A 53K-parameter model generates 95% valid SMILES on ZINC-250K, outperforming larger models, by resolving chemical constraints in fixed order: brackets first, rings second, valence last.

  2. Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    MELT decouples reasoning depth from memory in looped LLMs by sharing a single gated KV cache per layer and using two-phase chunk-wise distillation from Ouro, delivering constant memory use while matching or beating st...

  3. Hyperloop Transformers

    cs.LG 2026-04 unverdicted novelty 5.0

    Hyperloop Transformers outperform standard and mHC Transformers with roughly 50% fewer parameters by looping a middle block of layers and applying hyper-connections only after each loop.

Reference graph

Works this paper leans on

63 extracted references · 34 canonical work pages · cited by 3 Pith papers · 9 internal anchors

  1. [1] Alon, U. and Yahav, E. On the bottleneck of graph neural networks and its practical implications. arXiv preprint arXiv:2006.05205.

  2. [2] Arroyo, Á., Gravina, A., Gutteridge, B., Barbero, F., Gallicchio, C., Dong, X., Bronstein, M., and Vandergheynst, P. On vanishing gradients, over-smoothing, and over-squashing in GNNs: Bridging recurrent and graph learning. arXiv preprint arXiv:2502.10818.

  3. [3] Bae, S., Kim, Y., Bayat, R., Kim, S., Ha, J., Schuster, T., Fisch, A., Harutyunyan, H., Ji, Z., Courville, A., et al. Mixture-of-recursions: Learning dynamic recursive depths for adaptive token-level computation. arXiv preprint arXiv:2507.10524.

  4. [4] Banino, A., Balaguer, J., and Blundell, C. PonderNet: Learning to ponder. arXiv preprint arXiv:2107.05407.

  5. [5] Barbero, F., Arroyo, Á., Gu, X., Perivolaropoulos, C., Bronstein, M., Veličković, P., and Pascanu, R. Why do LLMs attend to the first token? arXiv preprint arXiv:2504.02732.

  6. [6] Blayney, H., Arroyo, Á., Dong, X., and Bronstein, M. M. gLSTM: Mitigating over-squashing by increasing storage capacity. arXiv preprint arXiv:2510.08450.

  7. [7] Cai, C. and Wang, Y. A note on over-smoothing for graph neural networks. arXiv preprint arXiv:2006.13318.

  8. [8] Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

  9. [9] Darlow, L., Regan, C., Risi, S., Seely, J., and Jones, L. Continuous thought machines. arXiv preprint arXiv:2505.05522.

  10. [10] Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J., and Kaiser, Ł. Universal Transformers. arXiv preprint arXiv:1807.03819.

  11. [11] Geiping, J., McLeish, S., Jain, N., Kirchenbauer, J., Singh, S., Bartoldson, B. R., Kailkhura, B., Bhatele, A., and Goldstein, T. Scaling up test-time compute with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171.

  12. [12] Graves, A. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983.

  13. [13] Gu, X., Pang, T., Du, C., Liu, Q., Zhang, F., Du, C., Wang, Y., and Lin, M. When attention sink emerges in language models: An empirical view. arXiv preprint arXiv:2410.10781.

  14. [14] Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.

  15. [15] Hariri, A., Arroyo, Á., Gravina, A., Eliasof, M., Schönlieb, C.-B., Bacciu, D., Azizzadenesheli, K., Dong, X., and Vandergheynst, P. Return of ChebNet: Understanding and improving an overlooked GNN on long range tasks. arXiv preprint arXiv:2506.07624.

  16. [16] Jolicoeur-Martineau, A. Less is more: Recursive reasoning with tiny networks. arXiv preprint arXiv:2510.04871.

  17. [17] Ke, Y., Li, X., Liang, Y., Shi, Z., and Song, Z. Advancing the understanding of fixed point iterations in deep neural networks: A detailed analytical study. arXiv preprint arXiv:2410.11279.

  18. [18] Koishekenov, Y., Lipani, A., and Cancedda, N. Encode, think, decode: Scaling test-time reasoning with recursive latent thoughts. arXiv preprint arXiv:2510.07358.

  19. [19] Lad, V., Lee, J. H., Gurnee, W., and Tegmark, M. The remarkable robustness of LLMs: Stages of inference? arXiv preprint arXiv:2406.19384.

  20. [20] Lu, W., Yang, Y., Lee, K., Li, Y., and Liu, E. Latent chain-of-thought? Decoding the depth-recurrent transformer. arXiv preprint arXiv:2507.02199.

  21. [21] McLeish, S., Li, A., Kirchenbauer, J., Kalra, D. S., Bartoldson, B. R., Kailkhura, B., Schwarzschild, A., Geiping, J., Goldstein, T., and Goldblum, M. Teaching pretrained language models to think deeper with retrofitted recurrence. arXiv preprint arXiv:2511.07384.

  22. [22] Nair, P. Softmax is 1/2-Lipschitz: A tight bound across all ℓ𝑝 norms. arXiv preprint arXiv:2510.23012.

  23. [23] Pappone, F., Crisostomi, D., and Rodolà, E. Two-scale latent dynamics for recurrent-depth transformers. arXiv preprint arXiv:2509.23314.

  24. [24] Queipo-de Llano, E., Arroyo, Á., Barbero, F., Dong, X., Bronstein, M., LeCun, Y., and Shwartz-Ziv, R. Attention sinks and compression valleys in LLMs are two sides of the same coin. arXiv preprint arXiv:2510.06477.

  25. [25] Sandoval-Segura, P., Wang, X., Panda, A., Goldblum, M., Basri, R., Goldstein, T., and Jacobs, D. Using attention sinks to identify and evaluate dormant heads in pretrained LLMs. arXiv preprint arXiv:2504.03889.

  26. [26] Saunshi, N., Dikkala, N., Li, Z., Kumar, S., and Reddi, S. J. Reasoning with latent thoughts: On the power of looped transformers. arXiv preprint arXiv:2502.17416.

  27. [27] Tan, S., Shen, Y., Chen, Z., Courville, A., and Gan, C. Sparse universal transformer. arXiv preprint arXiv:2310.07096.

  28. [28] Veličković, P., Perivolaropoulos, C., Barbero, F., and Pascanu, R. Softmax is not enough (for sharp size generalisation). arXiv preprint arXiv:2410.01104.

  29. [29] Wang, G., Li, J., Sun, Y., Chen, X., Liu, C., Wu, Y., Lu, M., Song, S., and Yadkori, Y. A. Hierarchical reasoning model. arXiv preprint arXiv:2506.21734.

  30. [30] Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453.

  31. [31] Xu, K. and Sato, I. On expressive power of looped transformers: Theoretical analysis and enhancement via timestep encoding. arXiv preprint arXiv:2410.01405.

  32. [32] Yang, L., Lee, K., Nowak, R., and Papailiopoulos, D. Looped transformers are better at learning learning algorithms. arXiv preprint arXiv:2311.12424.

  33. [33] Yudin, N., Gaponov, A., Kudriashov, S., and Rakhuba, M. Pay attention to attention distribution: A new local Lipschitz bound for transformers. arXiv preprint arXiv:2507.07814.

  34. [34] Zhu, R.-J., Wang, Z., Hua, K., Zhang, T., Li, Z., Que, H., Wei, B., Wen, Z., Yin, F., Xing, H., et al. Scaling latent reasoning via looped language models. arXiv preprint arXiv:2510.25741.

  35. [35]

    B_{f_{j+1}(0)}(Y)

    Therefore B_{f_{j+1}(k−1)}(B_{f_{j+1}(k−2)}(… B_{f_{j+1}(0)}(Y) …)) = B_{f_j(k)}(B_{f_j(k−1)}(… B_{f_j(1)}(Y) …)) (9) = B_{f_j(0)}(B_{f_j(k−1)}(… B_{f_j(1)}(Y) …)) (10). Now take Eq. (8) and apply B_{f_j(0)} to both sides, defining a new fixed point Z″ = B_{f_j(0)}(Z′): B_{f_j(0)}(B_{f_j(k−1)}(B_{f_j(k−2)}(… B_{f_j(0)}(Z′) …...

  36. [36]

    Hello! I’ve been well. I hope that you’re doing well

    dataset. A few illustrative plots (for example, latent space trajectories) are instead produced with a test sequence that we obtain from Barbero et al. (2025): “Hello! I’ve been well. I hope that you’re doing well.” Additional results targetting non-reasoning behavior using the HellaSwag dataset (following an identical setup of running inference on the sa...

  37. [37]

    Our small training runs in Sec

    We use standard settings for the tokenizers of each model, and as such some models prepend a BOS token whereas others do not: we make this clear in the ‘Prepends BOS’ column of the same table. Our small training runs in Sec. 5.1 are performed by adapting a publicly available fork of Nanochat (Karpathy, 2025), https://github.com/TrelisResearch/nanochat/tre...

  38. [38]

    and train for 3.7B tokens. As discussed in the main text, loss is the same as that of a regular feedforward model: cross entropy loss on the final output representation (as opposed to the summed loss of Zhu et al. (2025)). Each model is trained for a constant 4 recurrences (as opposed to the Poisson sampling of Geiping et al. (2025)). All models use pre-n...

  39. [39]

    orbits” and “sliders

    Additional Huggingface details on pretrained Looped models used. C. Non-Fixed-Point Limiting Behavior C.1. How Frequent is Non-Fixed-Point Behavior? In this section we investigate more closely the “orbits” and “sliders” initially observed by Geiping et al. (2025). These are important as they appear to represent stable limiting behavior that are not fixed ...

  40. [40]

    Using this algorithm, we classify the limiting behavior over all tokens in the GSM8k test set for the Huginn-0125 and Retrofitted Llama models

    We set threshold 𝜏 = 0.05 and fixed-point fraction 𝜌 = 0.9. Using this algorithm, we classify the limiting behavior over all tokens in the GSM8k test set for the Huginn-0125 and Retrofitted Llama models. We discover that the system prompt used before presenting the GSM8k question has a large impact on th...

  41. [41]

    This percentage can be significantly increased with the longer system prompt, but these behaviors remain rare at 0.14%

    These results reveal that these non-fixed-point limiting behaviors appear to be extremely rare in practice: without a system prompt (the setting used throughout this paper) only approximately 0.02% of tokens exhibit non-fixed-point behavior. This percentage can be significantly increased with the longer system prompt, but these behaviors remain rare at 0....

  42. [42]

    Long Persona

    The input sequence (cosine similarities for the residual streams of a given token and layer and successive recursions, as compared to their final residual stream) is visualized in the leftmost column. The center column visualizes the effect of windowing and de-trending, and the rightmost column shows the FFT magnitudes. The top row visualizes the detected...

  43. [43]

    worst case

    PCA trajectories in the intermediate layers of Huginn-0125: this reproduces the leftmost column of Fig. 16 in Geiping et al. (2025) (the first two principal components) and additionally plots the latent trajectories for the intermediate layers in the recurrent block. C.3. How Does Non-Fixed-Point Behavio...

  44. [44]

    Left: Ouro 1.4B (Zhu et al., 2025)

    Cosine similarity between residual streams after every pair of layers for different Transformer models, averaged across the batch and sequence dimensions. Left: Ouro 1.4B (Zhu et al., 2025). Center: Retrofitted Llama (McLeish et al., 2025). Right: Huginn-0125 (Geiping et al., 2025). All models looped 8 times. Diagonal patterns indicate that the residual s...

  45. [45]

    Left: Huginn-0125 (Geiping et al., 2025)

    Cosine similarity between residual streams after every pair of layers for different Transformer models, averaged across the batch and sequence dimensions. Left: Huginn-0125 (Geiping et al., 2025). Center Left: Retrofitted Llama. Center Right: Retrofitted OLMo. Right: Retrofitted TinyLlama (McLeish et al., 2025). All models looped 32 times. Extended version of Fig

  46. [46]

    Left: Huginn-0125 (Geiping et al., 2025)

    Frobenius norm between attention matrices for different Transformer models, averaged across the batch and head dimensions. Left: Huginn-0125 (Geiping et al., 2025). Center Left: Retrofitted Llama. Center Right: Retrofitted OLMo. Right: Retrofitted TinyLlama (McLeish et al., 2025). All models looped 32 times. Extended version of Fig

  47. [47]

    approximate fixed point

    Cosine similarity between the residual stream after the first layer in the recurrent block for successive recurrences and an “approximate fixed point” – the residual stream in the 128th recurrence. Two fixed point differences are visualized: the difference to the fixed point of the same (first) layer (blue) and the difference to the fixed point which has ...

  48. [48]

    Fraction of prediction and suppression neurons in a selection of looped models used throughout the paper. E.2. Input Dependent Metrics One well-studied phenomenon by which Transformers drastically reduce the mixing in given layer is that of the attention sink (Xiao et al., 2023; Barbero et al., 2025), whereby the layer focuses the majority of the attentio...

  49. [49]

    Stages of inference for a selection of Looped transformers, all using 8 recurrences: Huginn-0125 (Geiping et al., 2025), Ouro 1.4B (Zhu et al.,

  50. [50]

    Note Huginn-0125 and Retrofitted Llama have prelude and coda layers too: each 2 layers in Huginn-0125 and each 4 in Retrofitted Llama

    and Llama with retrofitted recurrences (McLeish et al., 2025). Note Huginn-0125 and Retrofitted Llama have prelude and coda layers too: each 2 layers in Huginn-0125 and each 4 in Retrofitted Llama. For completeness, we plot these stages of inference for all other models referenced in the paper. See Fig. 35 (Ouro 1.4B), Fig. 36 (Huginn-0125), Figs. 37 to 3...

  51. [51]

    upcycles

    This model is interesting due to the training regime followed by Zhu et al. (2025), which “upcycles” a 48 layer model from the 24 layer 1.4B parameter model. As a consequence, the first and second half of each recurrent block each independently align with the Llama feedforward stages of inference. In Sec. 5 we suggested that the lack of stages of inferenc...

  52. [52]

    Stages of inference for each recurrent loop in the retrofitted Llama model (McLeish et al., 2025). Each block demonstrates very similar stages of inference to Llama, the base model from which pretrained layers are taken. ...

  53. [53]

    Similarly, each block demonstrates very similar stages of inference to OLMo, the base model from which pretrained layers are taken

    Stages of inference for each recurrent loop in the retrofitted OLMo model (McLeish et al., 2025). Similarly, each block demonstrates very similar stages of inference to OLMo, the base model from which pretrained layers are taken. ...

  54. [54]

    Stages of inference for each recurrent loop in the retrofitted TinyLlama model (McLeish et al., 2025). Similarly, each block demonstrates very similar stages of inference to TinyLlama, the base model from which pretrained layers are taken. ...

  55. [55]

    upcycles

    Stages of inference for each recurrent loop in Ouro 2.6B.For this model we separate out the first and second half of the recurrent block and overlay them, demonstrating that both halves have close alignment with the Llama feedforward stages of inference. We suggest that this likely arises due to the training regime of Zhu et al. (2025), which first trains...

  56. [56]

    E.4. Non-Reasoning Stages of Inference

    Stages of inference for each recurrent loop in Retrofitted Llama for which the massive activations have been ablated. E.4. Non-Reasoning Stages of Inference: Throughout the rest of the paper, experiments are conducted on the GSM8k dataset. In this appendix we verify that the stages of inference we observe...

  57. [57]

    Stages of inference for each recurrent loop in the retrofitted Llama model (McLeish et al., 2025), run on the HellaSwag dataset. ...

  58. [58]

    ColSum concentration deviates slightly from its GSM8k counterpart here, but still broadly follows the same stages of inference as the feedforward OLMo model

    Stages of inference for each recurrent loop in the retrofitted OLMo model (McLeish et al., 2025), run on the HellaSwag dataset. ColSum concentration deviates slightly from its GSM8k counterpart here, but still broadly follows the same stages of inference as the feedforward OLMo model. ...

  59. [59]

    E.5. Stability To Unseen Test-Time Recurrences

    Stages of inference for each recurrent loop in the retrofitted TinyLlama model (McLeish et al., 2025), run on the HellaSwag dataset. E.5. Stability To Unseen Test-Time Recurrences: This section extends the results presented in Sec. 5.2. We supplement Fig. 11 by plotting how stages of inference change per-...

  60. [60]

    The large standard deviations in Huginn-0125 and retrofitted Llama mixing scores reflect the fact that these models tend to reach different, but still stable, constant states. ...

  61. [61]

    These consistently change throughout the realized depth of the model, reaching no clear fixed point

    Stages of inference for each of the distinct blocks in Ouro (Zhu et al., 2025), as they are reapplied throughout the model for 128 recurrences. These consistently change throughout the realized depth of the model, reaching no clear fixed point. Mean and standard deviation are over separate inputs to the model, taken over the GSM8k subset. ...

  62. [62]

    These converge to constant behavior

    Stages of inference for each of the distinct blocks in Huginn-0125 (Geiping et al., 2025), as they are reapplied throughout the model for 128 recurrences. These converge to constant behavior. Mean and standard deviation are over separate inputs to the model, taken over the GSM8k subset. ...

  63. [63]

    These converge to constant behavior

    Stages of inference for each of the distinct blocks in retrofitted Llama (McLeish et al., 2025), as they are reapplied throughout the model for 128 recurrences. These converge to constant behavior. Mean and standard deviation are over separate inputs to the model, taken over the GSM8k subset. We additionally plot the extended versions of Fig. 12 in Figs. 50 to