
arxiv: 2605.20449 · v1 · pith:77Y4IXDJnew · submitted 2026-05-19 · 💻 cs.LG · cs.AI

LLM Pretraining Shapes a Generalizable Manifold: Insights into Cross-Modal Transfer to Time Series

Pith reviewed 2026-05-21 07:21 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: cross-modal transfer · time series forecasting · LLM pretraining · linear probing · low-rank updates · manifold · transfer learning

The pith

Language pretraining builds a reusable manifold in LLM states that allows linear decoding of time series trajectories and competitive forecasting via retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that cross-modal transfer to time series forecasting succeeds because language pretraining of LLMs creates a generalizable manifold in the model's hidden states. This manifold contains structure for time series dynamics, as shown by a linear probe on frozen states decoding realistic trajectories without any paired supervision. Retrieval in the projected space produces competitive forecasts, indicating the necessary structure exists before any finetuning. The paper also reports that pretrained initialization leads to coherent gradients and an anisotropic loss landscape, while finetuning performs low-dimensional alignment through low-rank updates that reuse existing directions for features like periodicity and trend.

Core claim

Cross-modal transfer arises because language pretraining preconditions time series training with a reusable manifold. A linear probe on frozen LLM states decodes realistic time-series trajectories without paired supervision, and retrieval in this projected space yields competitive forecasts, showing that structure and dynamics exist before finetuning. Pretrained initialization also improves optimization, producing coherent gradients and a highly anisotropic loss landscape unlike random initialization. Finetuning then acts as low-dimensional alignment, reusing existing directions rather than learning temporal primitives from scratch, as evidenced by low-rank updates, subspace alignment, and shared features for periodicity, trend, and repetition.
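To make the probing claim concrete, here is a minimal sketch of an unpaired linear probe in the spirit of the procedure named in the Figure 3 caption (a single linear map fit on frozen hidden states by EM-style nearest-neighbor matching against a bank of real series). It is not the authors' implementation. The helper name `fit_unpaired_probe`, the mean-squared-error matching rule, the least-squares refit, the random starting point, and the iteration count are all assumptions made for illustration.

```python
import numpy as np

def fit_unpaired_probe(hidden, bank, n_iters=10, seed=0):
    """Fit y_hat[t] = w @ h[t] + b without paired targets (illustrative only).

    hidden: (N, T, D) frozen hidden states for N inputs of length T.
    bank:   (M, T) real time series, not paired with the inputs.
    Returns (w, b, per-input MSE to the nearest bank series).
    """
    rng = np.random.default_rng(seed)
    n, t, d = hidden.shape
    # Append a constant feature so the bias b is fit jointly with w.
    ones = np.ones((n, t, 1))
    feats = np.concatenate([hidden, ones], axis=-1).reshape(n * t, d + 1)
    params = rng.normal(scale=0.01, size=d + 1)  # assumed random start
    for _ in range(n_iters):
        decoded = (feats @ params).reshape(n, t)
        # E-step: match each decoded trajectory to its nearest bank series.
        dists = ((decoded[:, None, :] - bank[None, :, :]) ** 2).mean(axis=-1)
        targets = bank[dists.argmin(axis=1)]
        # M-step: refit the linear map to the matched targets (least squares).
        params, *_ = np.linalg.lstsq(feats, targets.reshape(n * t), rcond=None)
    decoded = (feats @ params).reshape(n, t)
    dists = ((decoded[:, None, :] - bank[None, :, :]) ** 2).mean(axis=-1)
    return params[:-1], params[-1], dists.min(axis=1)
```

In this toy form nothing prevents every input from matching the same bank series. Any safeguards the paper uses against that kind of collapse are not stated in this summary.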

What carries the argument

The reusable manifold induced by language pretraining in the LLM's hidden states, which supports linear decoding of time series trajectories and low-rank alignment during finetuning.

Load-bearing premise

The observed structure in frozen LLM states and the low-rank nature of finetuning updates are caused by language pretraining creating a generalizable manifold, rather than by model architecture, optimizer choices, or properties of the time-series datasets themselves.

What would settle it

If a transformer trained from random initialization on the same time series tasks shows comparable linear probe performance for decoding realistic trajectories and similar low-rank finetuning updates, the claim that language pretraining specifically shapes the manifold would be falsified.
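One way to operationalize "similar low-rank finetuning updates" for such a comparison is to compute an effective rank of the weight change under each initialization. The sketch below uses the entropy-based effective rank (the exponential of the Shannon entropy of the normalized singular values). Whether the paper uses this exact definition is an assumption, and `effective_rank` is a hypothetical helper name.

```python
import numpy as np

def effective_rank(matrix, eps=1e-12):
    """Entropy-based effective rank: exp(H(p)), p = singular values / their sum."""
    s = np.linalg.svd(matrix, compute_uv=False)
    total = s.sum()
    if total <= eps:
        return 0.0  # an all-zero update spans no directions
    p = s / total
    p = p[p > eps]  # drop zeros so that 0 * log(0) is treated as 0
    return float(np.exp(-(p * np.log(p)).sum()))

# Hypothetical usage: delta = W_after - W_before for one weight matrix.
rng = np.random.default_rng(0)
low_rank_update = np.outer(rng.normal(size=64), rng.normal(size=64))  # rank 1
dense_update = rng.normal(size=(64, 64))                              # full rank
print(effective_rank(low_rank_update), effective_rank(dense_update))
```

Running the same measurement on updates from the randomly initialized model and from the language-pretrained one would give the two numbers that the falsification test above asks to compare.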

Figures

Figures reproduced from arXiv: 2605.20449 by Alexis Roger, Andrei Mircea, Gwen Legate, Irina Rish, Prateek Humane, Vasilii Feofanov, Zhenghan Tai.

Figure 1: Phase and gradient coherence across training. The higher the coherence, the better the model captures the periodic nature of the input (a sine wave here). Using t-SNE, we visualize hidden states at training steps 1, 512, and 8192 and see that the pretrained LLM (LangInit) inherits periodic structure from pretraining, starting high. Random initialization (RandInit) starts near zero and undergoes an abrupt phase transit…
Figure 2: Training progression of h=1 forecasting metrics across training steps for four training…
Figure 3: Pretrained hidden states contain time-series-compatible structure. A single linear map $\hat{y}_t = w^\top h_t + b$ is trained on frozen LLM hidden states via EM-style nearest-neighbor matching to a bank of 10,000 real time series (no paired data). (a) Three decoded outputs (blue) overlaid with their nearest real time series (green). Input text shown above each plot (gray). MSE = 0.23–0.38; full distribution in Appendix…
Figure 4: Forecasting example with largest gap compared to baseline from 500 evaluation queries.
Figure 5: Gradient alignment and evaluation loss across training for four adaptation regimes. Each panel shows per-sample gradient alignment (left axis; mean pairwise cosine similarity of individual time-series gradients over 32 held-out examples) and CRPS evaluation loss (right axis) as a function of training steps (log scale). Solid lines denote language-pretrained (LangInit) models; dashed lines denote randomly i…
Figure 6: Effective data transfer (see Appendix A.4) from language pretraining, following the…
Figure 7: Effective rank dynamics during training. (a) Mean effective rank across all 28 layers. Left: on time-series input, RandInit starts near-isotropic (∼440) and collapses to ∼10 within 500 steps; LangInit declines more gradually from the pretrained baseline (∼50) to ∼27. Transparent lines show CRPS loss on a secondary axis, where decreasing loss correlates with rank. Right: on text input, LangInit’s effective …
Figure 8: Left: Sine wave representations across layers. A sine wave (period 64) passed through LangInit (top) and RandInit (bottom) at layers 8, 13, and 18, shown as 2D PCA colored by input phase. RandInit produces clean arcs at every layer (72–73% PCA variance); LangInit produces complex, layer-varying loops (25–41%). Right: Hidden-state trajectories for synthetic inputs at Layer 13 (PCA). Each row is a different …
Figure 9: Training progression of all h = 1 forecasting metrics across training steps for four training regimes. Solid lines denote language-pretrained initialization (Qwen3-0.6B); dashed lines denote random initialization. Horizontal dotted lines indicate baseline performance (Chronos-T5, Chronos-Bolt, Chronos-2, ARIMA). Language-initialized models consistently converge earlier than their randomly initialized count…
Figure 10: Training progression of all h = 64 forecasting metrics across training steps for four training regimes. Solid lines denote language-pretrained initialization (Qwen3-0.6B); dashed lines denote random initialization. Horizontal dotted lines indicate baseline performance (Chronos-T5, Chronos-Bolt, Chronos-2, ARIMA). Language-initialized models consistently converge earlier than their randomly initialized cou…
Figure 11: Hidden-state trajectories for synthetic inputs at Layer 13 (2 PCA). Trajectories are 2D PCA projections of hidden states, colored by input phase. Percentages show variance captured by 2 PCs. Random and Base produce unstructured trajectories. RandInit discovers clean, low-dimensional representations (62–92% PCA variance) while LangInit creates geometrically complex but structured trajectories (34–98%) that…
Figure 12: Hidden-state trajectories for synthetic inputs at Layer 13 (t-SNE). Same setup as…
Figure 13: LangInit: sine wave PCA trajectories across all 28 layers. Each subplot shows the 2D PCA projection of hidden states from a sine wave (period 64) at one transformer layer, colored by input phase. Percentages show variance explained by 2 PCs. LangInit exhibits layer-specific geometry with varying complexity and moderate PCA variance (17–49%), reflecting the rich, heterogeneous representations inherited fro…
Figure 14: RandInit: sine wave PCA trajectories across all 28 layers. Same setup as…
Figure 15: LangInit: sine wave representations across training checkpoints. Rows are training steps (top: PT baseline; bottom: final checkpoint at step 10,000), columns are selected layers. LangInit begins from the pretrained model’s complex trajectories and gradually reshapes them into structured loops, with the transition occurring around steps 256–1024.
Figure 16: RandInit: sine wave representations across training checkpoints. Same setup as…
Figure 17: Feature 1712 (Layer 10): Quantitative magnitude transitions. Top-3 activating time-series windows. Blue: raw signal; red dashed: peak-activation timestep; orange shading: activation intensity. All three windows share a sudden jump from a low baseline to a high-magnitude plateau
Figure 18: Feature 2469 (Layer 9): Tropical weather systems. Top-3 activating windows. The left window shows high-variance oscillations with abrupt drops; the middle and right show diverse volatile patterns with regime switching. 1. (act=8.7) “. . . the mean locus of formation shifts westward to the Caribbean and Gulf of Mexico, reversing the eastward progression of June through August. Wind shear from westerlies in…
Figure 19: Feature 3888 (Layer 7): Naval battle events. Top-3 activating windows. Each shows a low-variance baseline punctuated by sharp spikes at the peak-activation timestep (red dashed line). 1. (act=10.1) “King George V had only 32 percent of her fuel left while Rodney had only enough fuel to continue the chase at high speed until 8:00 the following day. Admiral Tovey signalled his battlegroup. . . ” 2. (act=9.3…
Original abstract

Can language-pretrained transformers become effective time-series forecasters, and why? In this paper, we show that cross-modal transfer arises because language pretraining preconditions time series training with a reusable manifold. A linear probe on frozen LLM states decodes realistic time-series trajectories without paired supervision, and retrieval in this projected space yields competitive forecasts, showing that structure and dynamics exist before finetuning. Pretrained initialization also improves optimization, producing coherent gradients and a highly anisotropic loss landscape unlike random initialization. Finetuning then acts as low-dimensional alignment, reusing existing directions rather than learning temporal primitives from scratch, as evidenced by low-rank updates, subspace alignment, and shared features for periodicity, trend, and repetition. Together, these results support a geometric account of LLM-to-time-series transfer: language pretraining builds the manifold, and finetuning projects numerical dynamics onto task-relevant directions.
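The abstract's "retrieval in this projected space" can be pictured with a small nearest-neighbor forecaster. The sketch below is only one plausible reading: the function name `retrieval_forecast`, the mean-squared-error distance, the averaging over `k` neighbors, and the split of the bank into context windows and futures are assumptions, since the retrieval procedure is not detailed in this summary.

```python
import numpy as np

def retrieval_forecast(query, bank_contexts, bank_futures, k=1):
    """Forecast by averaging the futures of the k nearest bank contexts.

    query:         (T,) trajectory decoded by the probe for the current input.
    bank_contexts: (M, T) context windows taken from real series.
    bank_futures:  (M, H) the H steps that followed each context window.
    """
    # Distance between the decoded query and every stored context window.
    dists = ((bank_contexts - query[None, :]) ** 2).mean(axis=1)
    nearest = np.argsort(dists)[:k]
    # No model parameters are updated: the forecast is read off the bank.
    return bank_futures[nearest].mean(axis=0)
```

Because nothing is trained at forecast time, any accuracy such a retriever achieves has to come from the geometry of the projected space, which is why the paper treats retrieval quality as evidence about the frozen representations.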

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that language pretraining of transformers preconditions time-series training by building a reusable manifold. Evidence includes linear probes on frozen LLM states decoding realistic trajectories without paired supervision, retrieval in the projected space producing competitive forecasts, pretrained initialization yielding coherent gradients and anisotropic loss landscapes unlike random starts, and finetuning consisting of low-rank updates that align to existing directions for periodicity, trend, and repetition rather than learning primitives from scratch.

Significance. If the results hold, the work supplies a geometric account of cross-modal transfer with concrete empirical support from linear probing, retrieval forecasts, and low-rank update analysis. These elements constitute reproducible strengths that could inform initialization strategies and adaptation methods for sequential tasks.

major comments (2)
  1. [Abstract and experiments on frozen states / low-rank updates] Abstract and experiments contrasting pretrained vs. random initialization: the claim that language pretraining specifically builds the reusable manifold requires isolating the pretraining corpus from architecture and optimizer effects. The current design does not hold the transformer architecture fixed while ablating the pretraining data (language corpus vs. synthetic sequences vs. none), leaving open that low-dimensional directions for periodicity/trend and coherent gradients could arise from any transformer on sequential data under standard optimizers.
  2. [Finetuning analysis] § on finetuning updates and subspace alignment: the interpretation that finetuning reuses existing directions rather than learning temporal primitives rests on low-rank updates and shared features, but without controls that vary only the pretraining objective while matching architecture and data statistics, the geometric account remains compatible with architecture-driven priors.
minor comments (2)
  1. [Abstract] The abstract packs multiple distinct experiments into a single paragraph; separating the linear-probe, retrieval, and optimization results into distinct sentences would improve readability.
  2. Notation for the projected space and manifold dimensions is introduced without an explicit definition or reference to a methods subsection; adding a short notation table or equation would clarify reuse across sections.
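The subspace alignment at issue in the second major comment can be quantified with principal angles. The following is a hedged sketch, not the paper's metric: `subspace_overlap` is a hypothetical helper, and comparing the top-k left singular subspaces of two matrices (for example a finetuning update and the pretrained weight it modifies) is an assumed setup.

```python
import numpy as np

def subspace_overlap(a, b, k):
    """Mean squared cosine of the principal angles between the top-k left
    singular subspaces of a and b (1.0 = same subspace, 0.0 = orthogonal)."""
    ua = np.linalg.svd(a, full_matrices=False)[0][:, :k]
    ub = np.linalg.svd(b, full_matrices=False)[0][:, :k]
    # Singular values of Ua^T Ub are the cosines of the principal angles.
    cosines = np.linalg.svd(ua.T @ ub, compute_uv=False)
    return float(np.mean(np.clip(cosines, 0.0, 1.0) ** 2))
```

For two independent random k-dimensional subspaces of an n-dimensional space this quantity equals k/n in expectation, which gives a chance baseline. The controls the referee requests would show whether overlap above that baseline is specific to language pretraining or also appears under other pretraining data.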

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive evaluation of the work's significance. We address each major comment below with clarifications on the experimental design and scope of our claims, while indicating targeted revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract and experiments on frozen states / low-rank updates] Abstract and experiments contrasting pretrained vs. random initialization: the claim that language pretraining specifically builds the reusable manifold requires isolating the pretraining corpus from architecture and optimizer effects. The current design does not hold the transformer architecture fixed while ablating the pretraining data (language corpus vs. synthetic sequences vs. none), leaving open that low-dimensional directions for periodicity/trend and coherent gradients could arise from any transformer on sequential data under standard optimizers.

    Authors: We agree that fully isolating the language corpus would require additional controls such as pretraining the same transformer architecture on synthetic sequences. Our current experiments hold architecture and optimizer fixed while contrasting language-pretrained initialization against random initialization, which isolates the contribution of language pretraining to the observed manifold properties (linear probe decoding, retrieval forecasts, gradient coherence, and anisotropic landscapes). We will revise the abstract and introduction to clarify that our claims concern the sufficiency of language pretraining for building a reusable manifold rather than exclusivity over all possible pretraining regimes, and we will add a limitations paragraph discussing synthetic pretraining ablations as valuable future work. revision: partial

  2. Referee: [Finetuning analysis] § on finetuning updates and subspace alignment: the interpretation that finetuning reuses existing directions rather than learning temporal primitives rests on low-rank updates and shared features, but without controls that vary only the pretraining objective while matching architecture and data statistics, the geometric account remains compatible with architecture-driven priors.

    Authors: We acknowledge that architecture-induced priors could influence low-rank updates and subspace alignments in isolation. However, by comparing pretrained versus randomly initialized models under identical architectures, data statistics during finetuning, and optimizers, our results show differential behavior: pretrained models exhibit more coherent gradients, anisotropic loss landscapes, and low-rank updates aligned to periodicity/trend directions, while random starts do not. This differential evidence supports that language pretraining shapes the manifold. We will expand the finetuning analysis section to explicitly discuss architecture priors as a potential contributing factor and how the pretrained-versus-random contrast addresses the geometric account within the scope of our controls. revision: partial
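The gradient coherence that both responses rely on is defined in the Figure 5 caption as the mean pairwise cosine similarity of individual time-series gradients over held-out examples. A minimal sketch of that statistic follows. The function name `gradient_alignment` and the assumption that each per-sample gradient has already been flattened into one vector are illustrative choices, not details from the paper.

```python
import numpy as np

def gradient_alignment(per_sample_grads, eps=1e-12):
    """Mean pairwise cosine similarity between per-sample gradient vectors.

    per_sample_grads: (N, P) array, one flattened gradient per held-out example.
    """
    g = np.asarray(per_sample_grads, dtype=float)
    n = g.shape[0]
    if n < 2:
        raise ValueError("need at least two per-sample gradients")
    # Normalize each gradient; eps guards against division by zero.
    unit = g / np.maximum(np.linalg.norm(g, axis=1, keepdims=True), eps)
    cos = unit @ unit.T
    # Average over off-diagonal entries only (exclude self-similarity).
    return float((cos.sum() - np.trace(cos)) / (n * (n - 1)))
```

Values near 1 mean the examples pull the parameters in the same direction, and values near 0 mean their gradients are close to orthogonal. Computing this for both initializations at matched training steps is the pretrained-versus-random contrast the authors describe.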

Circularity Check

0 steps flagged

No circularity: claims rest on empirical contrasts between pretrained and random initializations

Full rationale

The paper's derivation consists of experimental observations: linear probes on frozen LLM states decode trajectories, retrieval in the projected space produces forecasts, pretrained initialization yields coherent gradients and anisotropic landscapes unlike random starts, and finetuning produces low-rank updates with subspace alignment. These findings are obtained by direct measurement and comparison on the same models and datasets; they do not reduce to fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations. No equations or uniqueness theorems are invoked that presuppose the target manifold structure, and the geometric interpretation is presented as a post-experimental account rather than a premise that forces the results. The account is therefore self-contained against the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on interpreting linear-probe success and low-rank finetuning as direct evidence of a pre-existing manifold induced by language pretraining; this interpretation assumes that alternative explanations (architecture, data statistics) are ruled out by the experimental design.

axioms (1)
  • domain assumption: The internal representations of a frozen LLM contain decodable temporal structure independent of any time-series supervision.
    Invoked to support the linear probe and retrieval results as evidence of the manifold.
invented entities (1)
  • reusable manifold (no independent evidence)
    purpose: To explain why language pretraining enables effective time-series transfer without learning temporal primitives from scratch.
    Postulated geometric object whose existence is inferred from probe success, retrieval performance, and low-rank update observations.

pith-pipeline@v0.9.0 · 5702 in / 1345 out tokens · 46514 ms · 2026-05-21T07:21:08.529521+00:00 · methodology

