pith. machine review for the scientific record.

arxiv: 2605.09403 · v1 · submitted 2026-05-10 · 💻 cs.LG · cs.AI · cs.NE

Recognition: no theorem link

Sparsity Moves Computation: How FFN Architecture Reshapes Attention in Small Transformers

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:53 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.NE

keywords transformer · mixture of experts · sparsity · feedforward network · attention mechanism · arithmetic tasks · interpretability · GLU gating

The pith

Sparse MoE routing in the FFN block shifts computation into the attention layers of one-layer Transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how choices inside the feedforward network change what the attention mechanism ends up computing in small Transformers. On tasks such as carry addition, modular arithmetic, and histogram counting, models with sparse mixture-of-experts routing move more work out of the per-token FFN and into attention. This shift appears even when routing is frozen at random values, indicating that the sparsity of the architecture itself, rather than any learned specialization, drives the redistribution. A secondary result shows that GLU gating spreads task-relevant structure across many neurons instead of concentrating it in single units. The authors support these observations with controls that match parameter count, vary width, and ablate routing type.

Core claim

In one-layer Transformers trained on digit addition with carry, modular arithmetic, and histogram counting, sparse MoE routing in the FFN shifts computation to attention; this redistribution arises from reduced per-token FFN capacity and sparse partitioning across experts, as shown by frozen random routing nearly matching the effects of learned routing.

What carries the argument

Sparse MoE routing in the FFN, which lowers per-token capacity and partitions tokens across experts, thereby forcing attention to handle more of the overall computation.
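
To make the mechanism concrete, here is a minimal editorial sketch (PyTorch, not the authors' code) of a top-1 MoE FFN in which each token is processed by a single expert of 1/E the dense width; the freeze_router flag mimics the paper's frozen-random-routing control by leaving the router at its random initialization. Module and argument names are illustrative.

    # Minimal top-1 MoE FFN sketch (names and defaults are illustrative assumptions).
    import torch
    import torch.nn as nn

    class Top1MoEFFN(nn.Module):
        def __init__(self, d_model=128, d_ff=512, n_experts=4, freeze_router=False):
            super().__init__()
            d_expert = d_ff // n_experts  # per-token FFN capacity is 1/E of the dense width
            self.router = nn.Linear(d_model, n_experts)
            if freeze_router:  # frozen random routing: sparsity kept, no learned specialization
                self.router.requires_grad_(False)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_expert), nn.ReLU(), nn.Linear(d_expert, d_model))
                for _ in range(n_experts)
            )

        def forward(self, x):  # x: (batch, seq, d_model)
            flat = x.reshape(-1, x.shape[-1])
            gates = self.router(flat).softmax(-1)
            top1 = gates.argmax(-1)  # each token is handled by exactly one expert
            out = torch.zeros_like(flat)
            for e, expert in enumerate(self.experts):
                mask = top1 == e
                if mask.any():
                    out[mask] = gates[mask, e:e + 1] * expert(flat[mask])
            return out.reshape(x.shape)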

If this is right

  • Attention layers learn to perform operations such as carry propagation that would otherwise be handled inside the FFN.
  • The same sparsity-induced redistribution occurs under random fixed routing, showing that learned router specialization is not required.
  • GLU multiplicative gating moves structured Fourier components out of individual neuron bases and into distributed subspaces.
  • The effect is strongest on carry-based addition and weaker on modular arithmetic and counting tasks.
  • Parameter-matched narrow FFNs and top-2 MoE variants reproduce the same pattern of attention compensation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers of larger sparse models may need to monitor attention capacity when increasing FFN sparsity, since the compensation effect observed here could scale.
  • Interpretability methods that inspect single FFN neurons may miss task structure once GLU gating is used, requiring instead analysis of attention or combined subspaces.
  • The finding raises the possibility that sparsity in one block systematically alters the division of labor across all blocks, a pattern worth testing on sequence modeling tasks beyond arithmetic.

Load-bearing premise

The observed shift of computation can be cleanly measured and attributed to reduced per-token FFN capacity and sparse expert partitioning without unaccounted interactions from training dynamics or multi-layer effects.

What would settle it

A controlled run in which widening the FFN experts or switching to dense routing eliminates the measured increase in attention-layer activity on carry-addition examples while keeping total parameters fixed.
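
One way to make the "total parameters fixed" condition checkable is a back-of-envelope count. The formulas below are an editorial sketch assuming standard two-matrix FFN blocks plus a linear router, ignoring biases; they are not the paper's own accounting.

    # Rough FFN parameter accounting (assumed formulas; biases ignored).
    def dense_ffn_params(d_model, d_ff):
        return 2 * d_model * d_ff

    def moe_ffn_params(d_model, d_expert, n_experts):
        # E experts of width d_expert plus a linear router
        return n_experts * 2 * d_model * d_expert + d_model * n_experts

    # With d_expert = d_ff / E the totals match up to the small router term,
    # while per-token capacity drops to 1/E of dense.
    print(dense_ffn_params(128, 512))    # 131072
    print(moe_ffn_params(128, 128, 4))   # 131584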

Figures

Figures reproduced from arXiv: 2605.09403 by Chris Mascioli, Gabriel Smithline.

Figure 1: Sparse partitioning, not learned routing, drives FFN-to-attention redistribution. Mean no-FFN accuracy on add-7 over 5 seeds. Narrow FFN isolates reduced per-token capacity; top-2 relaxes the bottleneck; random routing preserves sparse partitioning without learned router adaptation and retains comparable no-FFN accuracy to learned routing. Full decomposition in Sec. 4.2.
Figure 2: Active FFN capacity per token across architecture variants. Green blocks are active for a token; hatched gray blocks are inactive for that token but active for other routed tokens. MoE top-k activates only routed experts, producing the per-token FFN bottleneck studied in Sec. 4.1. Widths are total-parameter matched; full details are in Tab. 2.
Figure 3: Component ablation across tasks at matched parameter widths. All variants are total-parameter-matched within 1-2%: FFN, GLU with h = (2/3)h_dense, MoE with h_E = h_dense/E, and MoE-GLU with h_E = (2/3)h_dense/E, so total expert width is (2/3)h_dense (Tab. 2). At parity, MoE retains much higher no-FFN accuracy on tasks with ablation-visible redistribution pressure: largest on add-7 (+32.4 pp over FFN), smaller but consis…
Figure 4: Position-dependent redistribution on add-7 (5 seeds). Left: without FFN, MoE's attention handles all positions at 38-56%, while FFN collapses to ∼10% (chance). Right: without attention, FFN retains high overflow accuracy while MoE's FFN drops substantially. Because overflow is highly imbalanced, this no-attention result should be interpreted as localization evidence rather than proof that the FFN computes…
Figure 5: GLU reshapes neuron-level Fourier structure. Per-neuron Fourier concentration on (a + b) mod 113 at total-parameter-matched widths, pooled across 5 seeds. Mean per-neuron concentration ± seed-mean std: FFN 0.34 ± 0.13, MoE 0.48 ± 0.01, GLU 0.17 ± 0.12, MoE-GLU 0.16 ± 0.10. The gap between MoE-class and GLU-class variants is robust across seed-to-seed variation. All variants reach ∼100% test accuracy. Full…
Figure 6: Norm comparisons and weight norm dynamics during grokking.
Figure 7: Norm effect on component ablation (add-7). No-norm produces cleaner separation.
Figure 8: Direct logit attribution across tasks (no norm, 5 seeds). Each bar shows the fraction of the correct output logit attributable to attention vs. FFN. (a) On modular addition, FFN dominates (88-99%). (b) On add-7, MoE's attention contributes ∼60% vs. ∼40% for FFN. Error bars: ±1 std across 5 seeds.
Figure 9: DLA by output position on add-7 (5 seeds). Error bars: …
Figure 10: DLA by carry-chain length on add-7 (5 seeds). Each row is a variant; columns are…
Figure 11: Head-level attention concentration tracks the FFN capacity bottleneck. Per-head direct logit attribution on add-7, split by output position. Heads are sorted within each seed by total absolute correct-logit contribution before averaging, so Rank 1 is the largest contributor within that seed rather than a fixed head index. The four conditions match the causal decomposition in Sec. 4.2: FFN, narrow FFN, ran…
Figure 12: Ablation stratified by carry-chain length on add-7. MoE advantage under no-FFN ablation…
Figure 13: Attention pattern heatmaps across all 4 variants on add-7 (no norm). MoE develops…
Figure 14: Activation patching flip rates by position and variant on add-7. In most position–variant…
Figure 15: MoE accelerates grokking on modular addition (5 seeds). (a) Shaded regions show ±1 standard deviation across seeds. (b) The E=1 control has the auxiliary loss but no routing; it does not accelerate grokking, confirming sparse routing itself contributes. Dropout = 0.3 achieves a faster speedup than MoE (App. D.3). All error bars and shaded regions in this paper show ±1 std across 5 seeds unless otherwise n…
Figure 16: Regularization baselines on modular addition. Dropout…
Figure 17: Top-1 vs. top-2 routing on modular addition. Top-2 eliminates the grokking advantage…
Figure 18: Width scaling of MoE grokking advantage on modular addition. Mean epoch-to-…
Figure 19: Expert-operation routing on add-7 (MoE-GLU, seed 42, the strongest-specialization seed; see App. E.1 for all 5 seeds). Left: routing heatmap showing expert 2 receives 88% of +7 tokens. Right: expert ablation confirms causal impact. Specialization varies substantially across seeds (NMI = 0.28 ± 0.20). On add-7, MoE-GLU achieves normalized mutual information of 0.28 ± 0.20 (mean ± std across 5 seeds) betwee…
Figure 20: Expert routing heatmaps across all 5 seeds for MoE and MoE-GLU on add-7. Specializa…
Figure 21: GLU decomposition on modular addition (matched-width…
Figure 22: Weight vs. activation Fourier concentration for GLU and MoE-GLU on…
Figure 23: Narrow dense FFN vs. full-width dense FFN vs. MoE on modular addition (5 seeds).
Figure 24: Auxiliary loss sweep for MoE on modular addition (5 seeds).
Figure 25: GELU vs. SiLU accuracy across all tasks and GLU variants (5 seeds, error bars: 95% CI).
Figure 26: Mechanistic verification under SiLU. (a) Redistribution: MoE-GLU's no-FFN survival advantage over GLU is preserved (∼19 pp for both activations). (b) Fourier opacity: per-neuron concentration stays low while top-PC concentration stays high for both activations, confirming the rotated representation. (c) Routing MI: paired per-seed NMI values are virtually identical (means 0.253 vs. 0.256), confirming spec…
read the original abstract

Architectural choices inside the Transformer feedforward network (FFN) block do not merely affect the block itself; they reshape the computations learned by the rest of the model. We study this effect in one-layer Transformers trained on digit addition with carry, modular arithmetic, and histogram counting. Comparing dense FFNs, gated linear units (GLUs), mixture-of-experts (MoE), and MoE-GLUs, we find that sparse MoE routing can shift computation from FFN to attention, with the strongest ablation-visible effect on carry-based addition. We decompose this redistribution into reduced per-token FFN capacity and sparse partitioning across experts. Critically, frozen random routing nearly matches learned routing, suggesting that redistribution is driven largely by architectural sparsity rather than router-learned specialization. As a secondary finding, GLU-style multiplicative gating rotates task-relevant Fourier structure out of the per-neuron basis and into distributed subspaces, making neuron-level interpretability less informative while preserving structured computation. We validate these conclusions with random-routing, narrow-FFN, and top-2 MoE controls, plus parameter-matching, activation-function, and width-scaling analyses. Together, these results show that local FFN design choices can have nonlocal consequences for Transformer computation.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript claims that in one-layer Transformers trained on carry-based addition, modular arithmetic, and histogram counting, sparse MoE FFN architectures shift computation from the FFN to attention layers. This redistribution is driven primarily by reduced per-token FFN capacity and sparse expert partitioning rather than learned router specialization, as frozen random routing nearly matches learned routing. A secondary claim is that GLU multiplicative gating rotates task-relevant Fourier structure into distributed subspaces, reducing neuron-level interpretability. These conclusions are supported by ablations including random-routing, narrow-FFN, top-2 MoE, parameter-matching, activation-function, and width-scaling controls.

Significance. If the results hold, they establish that local FFN design choices have measurable nonlocal effects on Transformer computation allocation, with implications for interpretability and architecture search in small models. The empirical strength comes from the explicit set of controls (random routing, parameter matching, width scaling) that isolate architectural sparsity as the driver, providing reproducible evidence for the redistribution hypothesis on simple tasks where effects are visible.

major comments (1)
  1. [Abstract, §3-4] The central claim that sparse MoE 'shifts computation from FFN to attention' depends on a precise, reproducible definition of the shift metric (e.g., changes in attention output norms, head-wise activation magnitudes, or task-specific attribution). Without an explicit formula or procedure for this quantification, it is difficult to evaluate the magnitude of the effect or to confirm that the ablations (random routing, narrow FFN) correctly attribute it to sparsity rather than other factors.
minor comments (2)
  1. [Methods] Expand the description of how 'frozen random routing' is implemented (e.g., whether the router is replaced by fixed random weights or sampling) and confirm it preserves the same parameter count and sparsity pattern as learned routing.
  2. [§4 (Ablations)] The width-scaling and parameter-matching analyses should report exact parameter counts and FLOPs for each variant to allow direct verification that controls are fair.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive evaluation and for identifying a clarity issue in our central claim. We address the major point below and will revise the manuscript to incorporate an explicit definition of the shift metric.

read point-by-point responses
  1. Referee: [Abstract, §3-4] The central claim that sparse MoE 'shifts computation from FFN to attention' depends on a precise, reproducible definition of the shift metric (e.g., changes in attention output norms, head-wise activation magnitudes, or task-specific attribution). Without an explicit formula or procedure for this quantification, it is difficult to evaluate the magnitude of the effect or to confirm that the ablations (random routing, narrow FFN) correctly attribute it to sparsity rather than other factors.

    Authors: We agree that an explicit, reproducible definition strengthens the paper. In the current manuscript the redistribution is quantified via two complementary measures reported in §3–4: (1) the relative L2 norm of the attention-block output versus the FFN-block output, normalized by the total residual-stream norm at each token (averaged over the test set), and (2) head-wise mean activation magnitudes in attention together with per-expert activation frequencies in the MoE FFN. These quantities are compared across dense, GLU, MoE, and MoE-GLU architectures under identical training protocols. To address the concern we will add a dedicated subsection titled “Quantifying Computation Redistribution” immediately before the experimental results. It will contain the exact formulas, the aggregation procedure over tokens and heads, and a short pseudocode block showing how the same metric is applied to the random-routing and narrow-FFN ablations. This addition will make it straightforward to verify that the observed shift is driven by capacity reduction and partitioning rather than router specialization. revision: yes
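
    A minimal sketch of the relative-norm measure described in this response, assuming per-token embedding, attention, and FFN block outputs have been cached from a forward pass. This is an editorial illustration with hypothetical array names, not the authors' code.

        # Share of the residual stream written by attention vs. FFN, averaged over tokens.
        import numpy as np

        def redistribution_shares(x_embed, x_attn, x_ffn):
            """Each argument: (n_tokens, d_model) block outputs cached on the test set."""
            resid = np.linalg.norm(x_embed + x_attn + x_ffn, axis=-1) + 1e-9
            attn_share = np.linalg.norm(x_attn, axis=-1) / resid
            ffn_share = np.linalg.norm(x_ffn, axis=-1) / resid
            return attn_share.mean(), ffn_share.mean()

        # The same function is applied unchanged to the random-routing and narrow-FFN runs,
        # so any difference in shares is attributable to the architecture variant.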

Circularity Check

0 steps flagged

No significant circularity; empirical study with independent experimental controls

full rationale

The paper is an empirical investigation of FFN architecture effects on attention in one-layer Transformers, relying on direct ablation experiments (random routing, narrow FFN, parameter-matched, width-scaling) across tasks like carry addition and modular arithmetic. No mathematical derivation chain, self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the claims or abstract. The redistribution findings are grounded in observable ablation differences rather than any reduction to inputs by construction, satisfying the criteria for a self-contained empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim is supported by empirical ablations rather than theoretical derivations or new postulates; it relies on standard assumptions about what the synthetic tasks test and how to measure computation shift.

axioms (1)
  • domain assumption: Training on synthetic arithmetic and counting tasks reveals general principles of Transformer computation redistribution
    The paper uses these tasks as proxies for studying computation without specifying why they generalize beyond the specific settings.

pith-pipeline@v0.9.0 · 5524 in / 1485 out tokens · 65620 ms · 2026-05-12T02:53:46.647927+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 1 internal anchor

  1. [1]

    Searching for Activation Functions

    URL http://arxiv.org/abs/1710.05941. 530 citations (Semantic Scholar/arXiv) [2024-05-16] arXiv:1710.05941 [cs]. Noam Shazeer. GLU Variants Improve Transformer, February 2020. URL http://arxiv.org/abs/2002.05202. 390 citations (Semantic Scholar/arXiv) [2024-05-16] arXiv:2002.05202 [cs, stat] version: 1. Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, ...

  2. [2]

    Train/test: 30%/70% of all 113² pairs

    Loss at position 2. Train/test: 30%/70% of all 113² pairs. Full-batch training, 40k epochs, weight decay 1.0. Networks develop discrete Fourier circuits with per-neuron frequency selectivity [Nanda et al., 2023]. Histogram counting. Given L=10 tokens from alphabet T=32, predict each token's count. Illustrative L=6 example (the full task uses L=10): [2,0...
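
    An editorial sketch of the histogram counting task as described in this snippet; only L=10 and the 32-symbol alphabet come from the paper, everything else is illustrative.

        # Generate one histogram-counting example: the target is each token's own count.
        import numpy as np

        def histogram_example(L=10, alphabet=32, rng=np.random.default_rng(0)):
            tokens = rng.integers(0, alphabet, size=L)
            counts = np.array([(tokens == t).sum() for t in tokens])
            return tokens, counts

        # e.g. tokens [3, 7, 3] would map to counts [2, 1, 2] (illustrative toy case, L=3)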

  3. [3]

    This reveals each component’s contribution to the final prediction

    Component ablation: Zero attention or FFN output at inference and measure per-token output accuracy degradation. This reveals each component’s contribution to the final prediction
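
    A minimal sketch of this ablation using a PyTorch forward hook; the module handles such as model.attn are hypothetical, not the authors' API.

        # Zero one block's output at inference and measure accuracy degradation.
        import torch

        def accuracy_without(model, module, inputs, targets):
            handle = module.register_forward_hook(lambda m, inp, out: torch.zeros_like(out))
            with torch.no_grad():
                preds = model(inputs).argmax(-1)
            handle.remove()
            return (preds == targets).float().mean().item()

        # e.g. compare accuracy_without(model, model.attn, x, y) vs. accuracy_without(model, model.ffn, x, y)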

  4. [4]

    Each term’s projection through WU gives that component’s contribution to the correct output logit

    Direct logit attribution (DLA): Decompose output logits into additive contributions via the residual stream [Nanda et al., 2023, Quirke and Barez, 2023]: logits=W U(xembed +x attn + xffn). Each term’s projection through WU gives that component’s contribution to the correct output logit
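
    The decomposition written out in code, as an editorial sketch; shapes are assumed (W_U maps d_model to vocab, each x_* is one cached block output per token).

        # Fraction of the correct-answer logit contributed by each residual-stream term.
        import numpy as np

        def dla_shares(W_U, x_embed, x_attn, x_ffn, correct_idx):
            rows = np.arange(len(correct_idx))
            embed_c = (x_embed @ W_U)[rows, correct_idx]
            attn_c = (x_attn @ W_U)[rows, correct_idx]
            ffn_c = (x_ffn @ W_U)[rows, correct_idx]
            total = embed_c + attn_c + ffn_c
            return (attn_c / total).mean(), (ffn_c / total).mean()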

  5. [5]

    This provides causal evidence for which component carries the decision [Meng et al., 2022]

    Activation patching: For pairs of inputs requiring different operations at the same position, swap attention or FFN activations and measure whether the prediction flips. This provides causal evidence for which component carries the decision [Meng et al., 2022]
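
    An editorial sketch of the swap-and-flip procedure with forward hooks; module handles and tensor layouts are assumptions, not taken from the paper.

        # Splice a component's activation from a corrupt run into a clean run
        # at one position and check whether the prediction there flips.
        import torch

        def patch_flip_rate(model, module, clean_x, corrupt_x, position):
            cache = {}
            h = module.register_forward_hook(lambda m, i, o: cache.update(act=o.detach()))
            with torch.no_grad():
                model(corrupt_x)                      # cache the corrupt-run activation
            h.remove()

            with torch.no_grad():
                base = model(clean_x).argmax(-1)[:, position]   # unpatched prediction

            def splice(m, i, o):
                o = o.clone()
                o[:, position] = cache["act"][:, position]
                return o

            h = module.register_forward_hook(splice)
            with torch.no_grad():
                patched = model(clean_x).argmax(-1)[:, position]
            h.remove()
            return (patched != base).float().mean().item()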

  6. [6]

    Trained with Adam (lr = 10⁻³, no weight decay), cross-entropy loss, batch size 128, for 500 steps on a 70/30 train/test split of the 1000 examples

    Linear probes: Train a single layer (logistic regression) to predict the operation type (+7/+1/+0) from attention or FFN outputs. Trained with Adam (lr = 10⁻³, no weight decay), cross-entropy loss, batch size 128, for 500 steps on a 70/30 train/test split of the 1000 examples. This is the weakest possible probe, so 100% accuracy means the features are line...
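
    A sketch of the probe with the hyperparameters quoted in this snippet; tensor names are placeholders, and the code is an editorial illustration rather than the authors' script.

        # Single linear layer trained to predict the operation type from cached activations.
        import torch
        import torch.nn as nn

        def probe_accuracy(train_x, train_y, test_x, test_y, n_classes=3):
            probe = nn.Linear(train_x.shape[-1], n_classes)
            opt = torch.optim.Adam(probe.parameters(), lr=1e-3)   # no weight decay
            loss_fn = nn.CrossEntropyLoss()
            for _ in range(500):                                   # 500 steps
                idx = torch.randint(0, len(train_x), (128,))       # batch size 128
                opt.zero_grad()
                loss_fn(probe(train_x[idx]), train_y[idx]).backward()
                opt.step()
            with torch.no_grad():
                return (probe(test_x).argmax(-1) == test_y).float().mean().item()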

  7. [7]

    High concentration means the neuron responds cleanly to one frequency

    Fourier analysis (activation-based): For modular addition, compute per-neuron Fourier concentration as the fraction of spectral power at the dominant frequency when activations are viewed as a function of (a+b) mod p [Nanda et al., 2023]. High concentration means the neuron responds cleanly to one frequency. For add-7, where there is no canonical cyclic gr...
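
    An editorial sketch of the per-neuron concentration measure as described here, assuming activations have been cached for all (a, b) pairs so every residue value appears.

        # Fraction of each neuron's spectral power at its dominant frequency,
        # treating mean activation as a function of (a + b) mod p.
        import numpy as np

        def fourier_concentration(acts, s, p):
            """acts: (n_examples, n_neurons); s: (n_examples,) values of (a + b) mod p."""
            profile = np.stack([acts[s == k].mean(axis=0) for k in range(p)])  # (p, n_neurons)
            power = np.abs(np.fft.rfft(profile, axis=0)) ** 2
            power = power[1:]                                  # drop the DC term
            return power.max(axis=0) / (power.sum(axis=0) + 1e-9)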

  8. [8]

    Used in Fig. 22 as a weight-space analogue of the activation-based metric above

    Fourier analysis (weight-based, GLU only): Used in Fig. 22 as a weight-space analogue of the activation-based metric above. The pipeline: (a) Bilinear tensor. Following the bilinear interpretation of GLU [Pearce et al., 2024], form the third-order tensor T_ijk = Σ_m W_down[i, m] W_gate[m, j] W_up[m, k] from the raw weight matrices, so that GLU(x)_i ≈ Σ_{j,k} T_ij...
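
    The tensor construction written as an einsum, as an editorial sketch; weight shapes are assumed (W_gate, W_up: (d_ff, d_model), W_down: (d_model, d_ff)).

        # T[i, j, k] = sum_m W_down[i, m] * W_gate[m, j] * W_up[m, k]
        import numpy as np

        def bilinear_tensor(W_down, W_gate, W_up):
            return np.einsum("im,mj,mk->ijk", W_down, W_gate, W_up)

        def glu_bilinear(T, x):
            # GLU(x)_i ≈ sum_{j,k} T[i, j, k] x_j x_k (exact only if the gate nonlinearity is dropped)
            return np.einsum("ijk,j,k->i", T, x, x)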

  9. [9]

    Ablate individual experts and measure per-operation accuracy drops

    Routing analysis (MoE): Compute normalized mutual information between expert assignment and operation type. Ablate individual experts and measure per-operation accuracy drops. C Redistribution Supporting Evidence. C.1 Norm vs. No-Norm Comparison. We train without RMSNorm following standard practice in mechanistic interpretability [Nanda et al., 2023, Quirk...
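
    A sketch of the specialization measurement; sklearn's NMI is used here as an assumed stand-in for whatever estimator the paper uses.

        # Normalized mutual information between top-1 expert assignment and operation type.
        from sklearn.metrics import normalized_mutual_info_score

        def routing_nmi(expert_assignment, operation_type):
            """Both arguments are per-token label arrays, e.g. expert index vs. +7/+1/+0."""
            return normalized_mutual_info_score(operation_type, expert_assignment)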

  10. [10]

    Increasing per-token capacity from 1/E of dense to full dense, while keeping the sparse-partitioning architecture, removes most of the no-FFN gap

    On add-7, per-active matching more than halves the redistribution (41.7% → 18.3%). Increasing per-token capacity from 1/E of dense to full dense, while keeping the sparse-partitioning architecture, removes most of the no-FFN gap. This is consistent with the narrow-dense-FFN control in the main text (Sec. 4.2), which attributes a substantial fraction of th...

  11. [11]

    1.6%, both close to dense FFN's 1.3%)

    On modular addition, both conventions give negligible redistribution (2.6% vs. 1.6%, both close to dense FFN's 1.3%). Modular addition's reliance on Fourier circuits exceeds attention's expressive capacity [Nanda et al., 2023], so attention cannot absorb FFN computation regardless of bottleneck severity, and the matching convention has little effect

  12. [12]

    MoE-GLU does better only because it has more total parameters

    On histogram, both conventions give identical no-FFN accuracy (10.4% vs. 10.2%, both at the chance line for L=10 tokens). Histogram is the non-substitutable control: both matching conventions leave no-FFN accuracy at chance. Appendix H shows that histogram strategies can shift internally, especially in the FFN-family, but the FFN remains necessary for the count re...