Pith: machine review for the scientific record.

arxiv: 2602.18523 · v3 · submitted 2026-02-19 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links · Lean Theorem

The Geometry of Multi-Task Grokking: Transverse Instability, Superposition, and Weight Decay Phase Structure

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 20:36 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords multi-task grokking · weight decay · transformer · modular arithmetic · superposition subspace · parameter space geometry · transverse instability

The pith

Multi-task grokking builds a compact superposition subspace in parameter space where weight decay supplies compression pressure and excess parameters add geometric redundancy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends single-task geometric analysis to Transformers trained jointly on modular arithmetic tasks such as addition and multiplication. It reports that grokking unfolds in a consistent staggered sequence across seeds, with multiplication generalizing first. Optimization remains confined to an empirically invariant low-dimensional manifold, and defects orthogonal to that manifold precede the jump to generalization. Weight decay sweeps produce distinct dynamical regimes, including a sharp failure mode with no decay, while final solutions prove holographically incompressible: they occupy only a handful of principal trajectory directions yet sit in full-rank weights and collapse under small perturbations. Overparameterization supplies redundant center manifolds that allow partial recovery after aggressive removal of transverse gradient components.

Core claim

Multi-task grokking constructs a compact superposition subspace in parameter space, with weight decay acting as compression pressure and excess parameters supplying geometric redundancy in optimization pathways. The supporting observations are staggered grokking order, universal integrability on an invariant manifold, systematic weight-decay phase structure, holographic incompressibility of the final weights, and transverse fragility offset by redundancy.

What carries the argument

The low-dimensional execution manifold that confines all optimization trajectories together with the orthogonal commutator defects that reliably precede generalization.
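The review does not spell out how manifold confinement or the defect signal is measured; the paper's figures point to PCA over weight trajectories. A minimal numpy sketch of that diagnostic, assuming checkpoints are flattened parameter vectors; the out-of-subspace fraction of an update is used here as a crude stand-in for the orthogonal defect signal, and the `k=8` cutoff is illustrative, not the paper's:

```python
import numpy as np

def trajectory_pca(checkpoints, k=8):
    """PCA over a training trajectory of flattened parameter vectors.

    checkpoints: (T, D) array, one saved step per row.
    Returns the top-k principal directions (rows), their explained-variance
    ratios, and the trajectory mean.
    """
    mean = checkpoints.mean(axis=0)
    X = checkpoints - mean
    # SVD of the centered trajectory: rows of Vt are principal directions.
    _, S, Vt = np.linalg.svd(X, full_matrices=False)
    var_ratio = S**2 / np.sum(S**2)
    return Vt[:k], var_ratio[:k], mean

def transverse_residual(step, basis):
    """Fraction of an update's norm lying outside the top-k subspace --
    a crude proxy for the orthogonal 'defect' signal."""
    parallel = basis.T @ (basis @ step)
    orth = step - parallel
    return np.linalg.norm(orth) / (np.linalg.norm(step) + 1e-12)
```

A trajectory dominated by one direction yields a PC1% near 1, and steps inside the fitted subspace give a residual near 0 while orthogonal steps give a residual near 1.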

If this is right

  • Grokking timescale, curvature depth, and defect lead covary systematically with weight decay, producing distinct dynamical regimes.
  • Final solutions occupy only 4–8 principal trajectory directions yet are destroyed by SVD truncation, magnitude pruning, or uniform scaling.
  • Removal of less than 10% of orthogonal gradient components eliminates grokking, although dual-task models show partial recovery.
  • Multiplication generalizes before squaring, which precedes addition, with consistent delays across random seeds.
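The incompressibility bullet names three concrete perturbations. A sketch of those three probes in numpy; the paper's evaluation harness is not described here, so accuracy checks are omitted and only the perturbations themselves are shown:

```python
import numpy as np

def svd_truncate(W, rank):
    """Keep only the top `rank` singular directions of a weight matrix."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vt[:rank]

def magnitude_prune(W, keep_frac):
    """Zero all but the largest-magnitude `keep_frac` fraction of entries."""
    k = max(1, int(keep_frac * W.size))
    thresh = np.sort(np.abs(W), axis=None)[-k]
    return np.where(np.abs(W) >= thresh, W, 0.0)

def uniform_scale(W, alpha):
    """Multiply every weight by a single scalar."""
    return alpha * W
```

The holographic-incompressibility claim is that applying any of these to the final weights and re-evaluating destroys test accuracy, even though the training trajectory itself was low-dimensional.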

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The redundancy supplied by excess parameters may confer robustness against transverse instabilities when the number of tasks increases.
  • The observed staggered order could reflect algebraic complexity differences among the modular operations rather than model architecture alone.
  • If the superposition subspace scales sublinearly with task count, multi-task training could remain parameter-efficient even for larger task sets.

Load-bearing premise

The low-dimensional execution manifold stays invariant during training and commutator defects always precede generalization in the modular tasks examined.

What would settle it

An experiment in which grokking occurs without any detectable commutator defects preceding it or in which the observed manifold dimensionality changes measurably during the transition.
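The precedence half of that test reduces to comparing two change points in logged time series. A hedged sketch, assuming a simple threshold definition of onset (10% of the series maximum) and a 90% test-accuracy criterion for the grokking step; both thresholds are illustrative, not the paper's:

```python
import numpy as np

def onset_step(series, frac=0.1):
    """First step at which a series exceeds `frac` of its maximum."""
    thresh = frac * np.max(series)
    return int(np.argmax(series >= thresh))

def defect_leads_grokking(defect, test_acc, acc_thresh=0.9):
    """True if the defect signal turns on before test accuracy crosses
    `acc_thresh` -- the precedence pattern the paper reports."""
    grok = int(np.argmax(test_acc >= acc_thresh))
    return onset_step(defect) < grok
```

A falsifying run would be one where this returns False across seeds, i.e. grokking with no detectable defect lead.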

Figures

Figures reproduced from arXiv: 2602.18523 by Yongzhong Xu.

Figure 1: Multi-task grokking dynamics. (a) Dual-task: multiplication leads addition. (b) Tri-task: a three-way staggered ordering emerges.
Figure 2: PC1% decreases with task count. (a) Dual-task expanding-window PC1% (seed 42), 55–77%, declining over training. (b) Tri-task expanding-window PC1%, 49–56%, same declining pattern. The manifold is no longer rank-1 but remains strongly low-dimensional.
Figure 3: PC1% declines over training in multi-task settings, unlike single-task grokking …
Figure 4: Grok (WD=1.0) vs. no-WD (WD=0.0) eigenspectra for tri-task (seed 42). No-WD has …
Figure 5: Task-specific head weights are nearly orthogonal. The shared trunk learns a representation …
Figure 6: The execution manifold is empirically integrable in multi-task settings. The invariance …
Figure 7: Cross-task gradient structure. (a) Same-task cosines are high (∼0.8), cross-task cosines are moderate (∼0.3). (b) Cross-task defect has roughly the same magnitude as total-loss defect. The bottom Hessian eigenvalue is computed via power iteration [Li et al., 2018b, Fort and Jastrzebski, 2019] …
Figure 8: Hessian curvature depth scales with weight decay.
Figure 9: Per-task bottom Hessian eigenvalues (seed 42, WD=1.0). Mul (blue) has deeper negative …
Figure 10: Tri-task Hessian analysis replicates the dual-task pattern: stronger WD drives deeper …
Figure 11: Grokking timescale vs. weight decay (dual-task, 3 seeds, log scale). The relationship is …
Figure 12: Tri-task defect onset always precedes grokking (27 …
Figure 13: Dual-task defect onset always precedes grokking (15/15 conditions), with a non-monotonic …
Figure 14: V-shaped lead fraction across weight decay (dual-task only).
Figure 15: Defect traces across five WD values (seed 42).
Figure 16: Reconstruction threshold. The grokking solution requires 5–10 PCA directions, with …
Figure 17: The grokking solution is incompressible by any post-hoc method. Only trajectory …
Figure 18: Orthogonal deletion dose-response for both tri-task and dual-task. Grok delay increases …
Figure 19: Phase portrait for tri-task arithmetic (seed 42, layer 0) …
Figure 20: Grokking vs. memorizing phase portraits for dual-task arithmetic (modular addition + …
Figure 21: Layer-wise phase portrait overlay for tri-task arithmetic (seed 42). Layer 0 (blue) traces …
Figure 22: Constraint-induced compression. (a) Dual-task models require k* = 5–10 PCA directions depending on WD, with the orthogonal complement carrying 0.2–0.8% of trajectory variance. (b) Tri-task models require fewer components (k* = 3–8), consistent with stronger superposition pressure from additional task constraints. See Tables 14 and 15 for per-k accuracy breakdowns.
Figure 23: Top-10 eigenspectra for dual-task and tri-task. Both show a dominant first eigenvalue, …
Figure 24: SVD of weight deltas. Top-5 singular values grow concurrently in multi-task settings, …
Figure 25: Commutator defect time series for both multi-task settings.
Figure 26: Tri-task defect vs. per-task test accuracy across three WD values …
Figure 27: Hero figure: defect predicts grokking in the tri-task setting (add task, …
Figure 28: Combined defect magnitude and integrability over training. Integrability remains at …
Figure 29: Hessian curvature scaling across WD values (multi-seed). Curvature depth increases …
Figure 30: Tri-task cross-WD defect summary. WD=0.5 shows the largest final defect …
Figure 31: Dual-task PCA eigenspectrum across weight decay values. The top 5–10 eigenvalues …
Figure 32: Detailed comparison of reconstruction threshold …
Figure 33: Heatmap comparison of reconstruction accuracy across WD and …
Figure 34: Dual-task gradient projection ablation (averaged across 3 seeds, WD=1.0). PCA …
Original abstract

Grokking -- the abrupt transition from memorization to generalization long after near-zero training loss -- has been studied mainly in single-task settings. We extend geometric analysis to multi-task modular arithmetic, training shared-trunk Transformers on dual-task (mod-add + mod-mul) and tri-task (mod-add + mod-mul + mod-sq) objectives across a systematic weight decay sweep. Five consistent phenomena emerge. (1) Staggered grokking order: multiplication generalizes first, followed by squaring, then addition, with consistent delays across seeds. (2) Universal integrability: optimization trajectories remain confined to an empirically invariant low-dimensional execution manifold; commutator defects orthogonal to this manifold reliably precede generalization. (3) Weight decay phase structure: grokking timescale, curvature depth, reconstruction threshold, and defect lead covary systematically with weight decay, revealing distinct dynamical regimes and a sharp no-decay failure mode. (4) Holographic incompressibility: final solutions occupy only 4--8 principal trajectory directions yet are distributed across full-rank weights and destroyed by minimal perturbations; SVD truncation, magnitude pruning, and uniform scaling all fail to preserve performance. (5) Transverse fragility and redundancy: removing less than 10% of orthogonal gradient components eliminates grokking, yet dual-task models exhibit partial recovery under extreme deletion, suggesting redundant center manifolds enabled by overparameterization. Together, these results support a dynamical picture in which multi-task grokking constructs a compact superposition subspace in parameter space, with weight decay acting as compression pressure and excess parameters supplying geometric redundancy in optimization pathways.
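Figure 7's caption says the bottom Hessian eigenvalue is computed via power iteration. One standard way to reach the most negative eigenvalue with only Hessian-vector products is shifted power iteration; a numpy sketch in which the two-stage scheme and iteration counts are illustrative, not taken from the paper:

```python
import numpy as np

def bottom_eigenvalue(hvp, dim, iters=500, seed=0):
    """Estimate the most negative Hessian eigenvalue via shifted power
    iteration. `hvp(v)` returns H @ v (in practice a Hessian-vector product
    from autodiff; any linear callable works).

    Power-iterating shift*I - H, with shift above the top eigenvalue,
    makes H's bottom eigenvector the dominant one.
    """
    rng = np.random.default_rng(seed)
    # Stage 1: top eigenvalue of H by plain power iteration.
    v = rng.standard_normal(dim)
    for _ in range(iters):
        v = hvp(v)
        v /= np.linalg.norm(v)
    lam_max = float(v @ hvp(v))
    shift = abs(lam_max) + 1.0
    # Stage 2: power-iterate the shifted operator shift*I - H.
    w = rng.standard_normal(dim)
    for _ in range(iters):
        w = shift * w - hvp(w)
        w /= np.linalg.norm(w)
    # Rayleigh quotient of the converged vector under H itself.
    return float(w @ hvp(w))
```

A deeper (more negative) return value at the transition is what the paper's "curvature depth" sweeps track across weight-decay values.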

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper extends single-task grokking analysis to multi-task modular arithmetic (dual-task: mod-add + mod-mul; tri-task adds mod-sq) using shared-trunk Transformers. It reports five empirical phenomena across weight-decay sweeps: staggered generalization order; confinement to an invariant low-dimensional execution manifold, with orthogonal commutator defects preceding generalization; systematic weight-decay phase structure in timescales and curvature; holographic incompressibility of final solutions (4–8 principal directions, yet full-rank and fragile to perturbation); and transverse fragility with partial redundancy from overparameterization. These are interpreted as evidence for the construction of a compact superposition subspace under weight-decay compression.

Significance. If the reported patterns are robust and the manifold invariance holds under controlled interventions, the work supplies a concrete geometric mechanism linking weight decay, parameter redundancy, and multi-task generalization. The phase-structure and holographic-incompressibility observations are particularly novel and could guide regularization design in overparameterized models.

major comments (2)
  1. [Abstract (2)] Abstract, phenomenon (2): the assertion that commutator defects 'reliably precede generalization' and support a mechanistic superposition-subspace picture rests on observational correlation across seeds and sweeps. No ablation, projection, or regularizer is described that selectively suppresses or amplifies these defects while holding other trajectory statistics fixed; without such tests the defects could be downstream consequences rather than load-bearing drivers of transverse instability.
  2. [Abstract (2),(5)] Abstract, phenomena (2) and (5): the claimed 'universal integrability' and 'transverse fragility' require that the low-dimensional execution manifold remain invariant across the tested scales and tasks. The manuscript provides no quantitative test (e.g., distance to the manifold under controlled perturbations or across model widths) that would falsify invariance; the reported consistency across seeds is necessary but not sufficient for the geometric claim.
minor comments (2)
  1. [Abstract] The abstract and described results omit error bars, exact seed counts, precise hyperparameter ranges, and data-exclusion criteria; these details are needed to evaluate whether fitting choices affect the reported phase boundaries and defect-lead times.
  2. [Methods] Notation for 'commutator defects' and 'principal trajectory directions' is introduced without an explicit definition or reference to the underlying SVD or Lie-bracket construction; a short methods subsection would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications based on the presented evidence and indicate revisions where they strengthen the claims without misrepresenting the work.

Point-by-point responses
  1. Referee: [Abstract (2)] Abstract, phenomenon (2): the assertion that commutator defects 'reliably precede generalization' and support a mechanistic superposition-subspace picture rests on observational correlation across seeds and sweeps. No ablation, projection, or regularizer is described that selectively suppresses or amplifies these defects while holding other trajectory statistics fixed; without such tests the defects could be downstream consequences rather than load-bearing drivers of transverse instability.

    Authors: We agree that the evidence for commutator defects as load-bearing drivers is observational, drawn from consistent patterns across multiple random seeds and weight-decay sweeps in the multi-task setting. The manuscript shows these defects appearing orthogonally to the execution manifold prior to generalization in all tested configurations, supporting the superposition-subspace interpretation. However, we have not performed selective ablations or interventions that hold other statistics fixed. We will revise the abstract and add a discussion section explicitly characterizing the evidence as correlational while outlining targeted future ablation experiments to test causality. revision: partial

  2. Referee: [Abstract (2),(5)] Abstract, phenomena (2) and (5): the claimed 'universal integrability' and 'transverse fragility' require that the low-dimensional execution manifold remain invariant across the tested scales and tasks. The manuscript provides no quantitative test (e.g., distance to the manifold under controlled perturbations or across model widths) that would falsify invariance; the reported consistency across seeds is necessary but not sufficient for the geometric claim.

    Authors: The referee is correct that invariance is supported by empirical consistency of trajectory confinement across seeds, tasks (dual- and tri-task), and scales rather than explicit falsification via perturbation distances or width variations. The full manuscript demonstrates this through PCA projections and manifold reconstruction metrics, but lacks quantitative tests such as measuring deviation under controlled perturbations. We will incorporate such quantitative invariance tests, including perturbation-based distance metrics and width-scaling experiments, into a revised version to provide stronger geometric validation. revision: yes
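The quantitative invariance test promised here could be as simple as tracking the distance from perturbed checkpoints to the fitted trajectory subspace. A sketch, assuming the manifold is represented by an orthonormal PCA basis (rows) and the trajectory mean; the perturbation protocol itself is left to the authors:

```python
import numpy as np

def distance_to_manifold(theta, basis, mean):
    """Euclidean distance from a parameter vector to the affine subspace
    spanned by the rows of `basis` (orthonormal) through `mean`."""
    d = theta - mean
    proj = basis.T @ (basis @ d)
    return float(np.linalg.norm(d - proj))
```

Invariance would predict this distance stays small for checkpoints along training and grows only in proportion to injected orthogonal noise; a measurable drift of the fitted basis during the transition would count against it.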

Circularity Check

0 steps flagged

No circularity: results are direct empirical observations from training runs

full rationale

The paper reports five consistent phenomena observed across training runs on shared-trunk Transformers for multi-task modular arithmetic, including staggered grokking order, confinement to an empirically invariant low-dimensional execution manifold, weight decay phase structure, holographic incompressibility, and transverse fragility. These are presented as patterns from systematic sweeps over seeds, tasks, and weight decay values, with no derivation chain, equations, or self-citations that reduce a claimed prediction or first-principles result to fitted inputs or prior author work by construction. The central dynamical picture is framed as supported by these observations rather than any tautological redefinition or imported uniqueness theorem.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Central claims rest on empirical observations of training trajectories; the invariant low-dimensional manifold and superposition subspace are interpretive constructs drawn from the data rather than derived from first principles.

free parameters (1)
  • weight decay coefficient
    Systematic sweep used to delineate phase structure; specific values not stated in abstract.
axioms (1)
  • domain assumption Optimization trajectories remain confined to an empirically invariant low-dimensional execution manifold
    Stated as universal integrability and used to interpret commutator defects as predictors of generalization.
invented entities (1)
  • compact superposition subspace (no independent evidence)
    purpose: Describes the low-rank yet full-weight structure of final solutions
    Introduced to unify the holographic incompressibility and redundancy observations

pith-pipeline@v0.9.0 · 5596 in / 1347 out tokens · 52148 ms · 2026-05-15T20:36:59.473932+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Lifecycle of the Spectral Edge: From Gradient Learning to Weight-Decay Compression

    cs.LG 2026-04 unverdicted novelty 7.0

    The spectral edge transitions from a gradient-driven functional direction before grokking to a perturbation-flat, ablation-critical compression axis at grokking, forming three universality classes predicted by a gap f...

  2. Spectral Edge Dynamics Reveal Functional Modes of Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    Spectral edge dynamics during grokking reveal task-dependent low-dimensional functional modes over inputs, such as Fourier modes for modular addition and cross-term decompositions for x squared plus y squared.

  3. Spectral Edge Dynamics: An Analytical-Empirical Study of Phase Transitions in Neural Network Training

    cs.LG 2026-03 unverdicted novelty 6.0

    Spectral gaps in the Gram matrix of parameter updates control phase transitions such as grokking in neural network training.

  4. Gradient-Direction Sensitivity Reveals Linear-Centroid Coupling Hidden by Optimizer Trajectories

    cs.LG 2026-04 unverdicted novelty 5.0

    Gradient-based SVD diagnostic uncovers hidden SED-LCH coupling in single and multitask settings and shows rank-3 subspace constraints speed up grokking by 2.3x.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · cited by 4 Pith papers · 3 internal anchors

  1. [1] Xander Davies, Lauro Langosco, and David Krueger. Unifying grokking and double descent. arXiv preprint arXiv:2303.06173.

  2. [2] Stanislav Fort and Stanislaw Jastrzebski. Large scale structure of neural network loss landscapes. In Advances in Neural Information Processing Systems, volume 32.

  3. [3] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

  4. [4] Tanishq Kumar, Blake Bordelon, Samuel J. Gershman, and Cengiz Pehlevan. Grokking as the transition from lazy to rich training dynamics. arXiv preprint arXiv:2310.06110.

  5. [5] Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. In International Conference on Learning Representations, 2018a. · Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems.

  6. [6] Kaifeng Lyu, Jikai Jin, Zhiyuan Li, Simon S. Du, Jason D. Lee, and Wei Hu. Dichotomy of early and late phase implicit biases can provably induce grokking. arXiv preprint arXiv:2311.18817.

  7. [7] William Merrill, Nikolaos Tsilivis, and Aman Shukla. A tale of two circuits: Grokking as competition of sparse and dense subnetworks. arXiv preprint arXiv:2303.11873.

  8. [9] Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217.

  9. [10] Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. In ICLR 2022 Workshop on MATH-AI. URL https://arxiv.org/abs/2201.02177.

  10. [11] Vimal Thilak, Etai Littwin, Shuangfei Zhai, Omid Saremi, Roni Paiss, and Joshua Susskind. The slingshot mechanism: An empirical study of adaptive optimizers and the grokking phenomenon. arXiv preprint arXiv:2206.04817.

  11. [12] Vikrant Varma, Rohin Shah, Zachary Kenton, János Kramár, and Neel Nanda. Explaining grokking through circuit efficiency. arXiv preprint arXiv:2309.02390.

  12. [13] Yongzhong Xu. Early-warning signals of grokking via loss-landscape geometry. arXiv preprint arXiv:2602.16967, 2026a. URL https://arxiv.org/abs/2602.16967. · Yongzhong Xu. Low-dimensional and transversely curved optimization dynamics in grokking. arXiv preprint arXiv:2602.16746, 2026b. URL https://arxiv.org/abs/2602.16746. · Yongzhong Xu. Low-dimensional execution…