When Do Autoregressive Sequence Models Forecast Physical Wavefields? A Controlled Study on Synthetic Seismograms

Alexander Kappes; Christine Thomas; Jana Klinge; Stuart Russell; Waleed Esmail

arxiv: 2606.10868 · v1 · pith:34DI4ZHMnew · submitted 2026-06-09 · 💻 cs.LG · astro-ph.IM

Pith reviewed 2026-06-27 13:38 UTC · model grok-4.3

keywords autoregressive forecasting · seismograms · rollout stability · multi-token prediction · wavefields · error accumulation · synthetic testbed · oscillatory signals

The pith

Multi-token prediction accounts for nearly all stability gains in long-horizon seismogram forecasting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests when autoregressive models can maintain stable forecasts of oscillatory wavefields such as seismograms over hundreds of steps. Using controlled ablations on synthetic three-component seismograms, it isolates the effect of each architectural choice during free-running rollout. Multi-token prediction delivers almost the entire improvement over single-token baselines. A hybrid prediction head and cross-horizon spectral coherence loss add smaller consistent gains, while performance collapses below a context-ratio threshold near one. The main remaining error is polarity inversion that magnitude-based losses cannot penalize.

Core claim

In intra-architecture ablations evaluated on free-running rollout with paired significance tests, multi-token prediction accounts for almost the entire improvement over a single-token baseline (+0.040 median NCC); a horizon-embedding hybrid prediction head and a cross-horizon STFT-magnitude coherence loss each add a small but consistent further gain. Performance depends sharply on a context-ratio threshold near one, roughly the full P-S interval of observed signal, below which rollout generalization collapses. The dominant residual failure is a polarity inversion that a magnitude-based spectral loss cannot, by construction, penalize.
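The core claim rests on per-event NCC differences and paired significance tests. The summary does not state which NCC variant or which test the authors used, so the following is only a minimal sketch under assumptions: `ncc` is a zero-lag normalized cross-correlation, and `paired_sign_flip_test` is a generic paired permutation test on the median difference. Both function names are hypothetical.

```python
import numpy as np

def ncc(pred, true):
    """Zero-lag normalized cross-correlation of two 1-D traces (assumed definition)."""
    p = pred - pred.mean()
    t = true - true.mean()
    return float(np.dot(p, t) / (np.linalg.norm(p) * np.linalg.norm(t) + 1e-12))

def paired_sign_flip_test(a, b, n_perm=2000, seed=0):
    """Two-sided paired permutation test on the median of per-event differences a - b.

    Under the null hypothesis the sign of each paired difference is exchangeable,
    so the null distribution is built by randomly flipping those signs.
    """
    rng = np.random.default_rng(seed)
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    observed = float(np.median(d))
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    null = np.median(signs * d, axis=1)
    p_value = (np.sum(np.abs(null) >= abs(observed)) + 1) / (n_perm + 1)
    return observed, float(p_value)
```

Here `a` and `b` would hold the NCC of the same events under two ablation variants, so each event acts as its own control.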

What carries the argument

Multi-token prediction head in the autoregressive forecaster, which predicts several future tokens at each step to limit error accumulation during extended rollout.
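A rough way to see why predicting several tokens per step limits error accumulation is to count how often the model is fed its own output. The sketch below is illustrative only: `predict_block` is a hypothetical stand-in for the forecaster, and feeding back all `horizon` predicted tokens per call is an assumption, since the summary does not say how SeismoGPT consumes its multi-token outputs at inference time.

```python
import numpy as np

def rollout(predict_block, context, n_steps, horizon):
    """Free-running rollout: each model call emits `horizon` tokens that are fed back."""
    seq = list(context)
    n_calls = 0
    while len(seq) - len(context) < n_steps:
        block = predict_block(np.asarray(seq), horizon)
        seq.extend(block)          # the model's own outputs become its next input
        n_calls += 1
    generated = np.asarray(seq[len(context):len(context) + n_steps])
    return generated, n_calls

def toy_predict_block(history, horizon):
    """Hypothetical stand-in model: continue a unit-slope ramp."""
    return history[-1] + np.arange(1, horizon + 1)

# Same rollout length, different number of feedback steps.
_, calls_single = rollout(toy_predict_block, [0.0, 1.0, 2.0], n_steps=10, horizon=1)
_, calls_multi = rollout(toy_predict_block, [0.0, 1.0, 2.0], n_steps=10, horizon=4)
```

With `horizon=4` the loop makes 3 model calls instead of 10, so there are fewer points at which a per-step error can re-enter the input.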

If this is right

Switching from single-token to multi-token prediction yields the bulk of the observed stability improvement.
Context length must reach or exceed the full observed P-S wave interval or rollout performance drops sharply.
Adding a horizon-embedding hybrid head and STFT-magnitude coherence loss produces further modest but reliable gains.
Phase-aware loss terms are required to address the residual polarity-inversion failures.
The study frames its results as a controlled examination of rollout stability rather than an architecture benchmark.
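The polarity-inversion point above can be checked numerically: negating a trace leaves every STFT magnitude unchanged, so any loss built only on magnitudes sees a sign-flipped prediction as perfect. The sketch below uses a plain L1 distance between STFT magnitudes as a stand-in; it is not the paper's cross-horizon coherence loss, and the window, frame length, and hop are arbitrary assumptions.

```python
import numpy as np

def stft_mag(x, n_fft=64, hop=16):
    """Magnitude of a simple Hann-windowed STFT built from framed real FFTs."""
    window = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * window for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.asarray(frames), axis=1))

def mag_loss(pred, true):
    """L1 distance between STFT magnitudes (illustrative stand-in loss)."""
    return float(np.mean(np.abs(stft_mag(pred) - stft_mag(true))))

def ncc(pred, true):
    """Zero-lag normalized cross-correlation (assumed definition)."""
    p = pred - pred.mean()
    t = true - true.mean()
    return float(np.dot(p, t) / (np.linalg.norm(p) * np.linalg.norm(t) + 1e-12))

t = np.linspace(0.0, 1.0, 512)
trace = np.sin(2 * np.pi * 5 * t) * np.exp(-3 * t)   # toy decaying oscillation
flipped = -trace                                     # polarity-inverted "prediction"

loss_flipped = mag_loss(flipped, trace)   # essentially zero: the loss cannot see the flip
ncc_flipped = ncc(flipped, trace)         # close to -1: the waveform metric can
```

A phase-sensitive term, for example one acting on the complex STFT or on the waveform itself, would assign the flipped trace a nonzero loss.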

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same multi-token stabilization may extend to other oscillatory signals such as gravitational-wave strain.
The sharp context-ratio threshold points to a physical limit set by the duration of the dominant wave arrivals.
Replacing the magnitude-only spectral loss with a phase-sensitive objective is the direct next design step.
The polarity-inversion failure suggests testing whether sign-flip augmentations during training can reduce that specific error mode.

Load-bearing premise

The synthetic three-component seismograms capture the essential dynamical properties of real oscillatory wavefields sufficiently well that ablation results on rollout stability will generalize beyond the controlled testbed.

What would settle it

Repeating the identical ablation suite on a set of real recorded three-component seismograms and checking whether the same ordering of contributions from multi-token prediction, context ratio, and auxiliary losses is recovered.

Figures

Figures reproduced from arXiv: 2606.10868 by Alexander Kappes, Christine Thomas, Jana Klinge, Stuart Russell, Waleed Esmail.

**Figure 1.** The two non-standard components of SeismoGPT (Esmail et al., 2026). (a) Prediction-head design space. Independent heads place H separate MLPs on the backbone in parallel, with no parameter sharing across horizons; the DeepSeek-V3 sequential variant chains H modules, each conditioned on the previous module’s hidden state; the hybrid head applies a single shared MLP f_θ to z_t + e_h, where e_h is a learned per-h…
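The hybrid head described in the caption can be sketched as one shared MLP applied to the backbone state plus a per-horizon embedding. This is a sketch under assumptions, not the authors' implementation: the caption is truncated, so the two-layer structure, the GELU activation, the layer sizes, and the class name `HybridHead` are all hypothetical, and the weights are random rather than trained.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class HybridHead:
    """One shared MLP f_theta applied to z_t + e_h for every horizon h."""

    def __init__(self, d_model, d_hidden, d_out, n_horizons, seed=0):
        rng = np.random.default_rng(seed)
        # Per-horizon embeddings e_h: the only parameters that grow with H.
        self.e = 0.1 * rng.normal(size=(n_horizons, d_model))
        # Shared MLP weights, reused for all horizons.
        self.w1 = 0.1 * rng.normal(size=(d_model, d_hidden))
        self.b1 = np.zeros(d_hidden)
        self.w2 = 0.1 * rng.normal(size=(d_hidden, d_out))
        self.b2 = np.zeros(d_out)

    def __call__(self, z_t):
        x = z_t[None, :] + self.e            # (H, d_model): same state, shifted per horizon
        h = gelu(x @ self.w1 + self.b1)      # (H, d_hidden)
        return h @ self.w2 + self.b2         # (H, d_out): one prediction per horizon

head = HybridHead(d_model=8, d_hidden=16, d_out=3, n_horizons=4)
preds = head(np.ones(8))                     # four 3-component predictions
```

Compared with H independent heads, the MLP parameter count here does not depend on H; only the embedding table does.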

**Figure 2.** Rollout stability across the autoregressive continuation.

**Figure 3.** Full-model rollout versus ground truth for a median event (50th percentile, NCC…

**Figure 4.** NCC at a fixed 240 s rollout horizon as a function of the context ratio ρ, where ρ sets the observed context length relative to the event-specific P-S interval. Shading shows the interquartile range. Performance is high and nearly flat for ρ ≥ 1 and degrades sharply below this threshold.
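The context ratio in this caption can be turned into a window length with one line of arithmetic. The sketch below assumes that ρ multiplies the P-S interval directly; the function name, the arrival times, and the sampling rate are hypothetical values chosen for illustration, not figures from the paper.

```python
def context_samples(t_p, t_s, rho, sampling_rate_hz):
    """Number of observed samples for context ratio rho and P/S arrival times in seconds."""
    if t_s <= t_p:
        raise ValueError("S arrival must come after P arrival")
    ps_interval = t_s - t_p                      # event-specific P-S interval (s)
    return int(round(rho * ps_interval * sampling_rate_hz))

# Hypothetical event: P at 10 s, S at 40 s, sampled at 20 Hz.
full_context = context_samples(10.0, 40.0, rho=1.0, sampling_rate_hz=20.0)   # 600 samples
half_context = context_samples(10.0, 40.0, rho=0.5, sampling_rate_hz=20.0)   # 300 samples
```

Under this reading, ρ = 1 means the model has observed the whole P-S interval, and ρ < 1 means the context ends before the S arrival has been seen.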

**Figure 5.** Worst-case rollouts (lowest NCC), prediction versus ground truth, all three components. NCC is…

Original abstract

Long-horizon autoregressive forecasting of oscillatory physical signals, such as seismograms, gravitational-wave strain, and similar wavefields, is limited by error accumulation: as a causal model is fed its own outputs over hundreds of steps, small per-step errors compound into phase drift that pointwise metrics fail to detect. We ask when such rollout stays stable, using synthetic three-component seismograms as a physically structured testbed and the SeismoGPT autoregressive forecaster as the model under study. Through controlled, intra-architecture ablations evaluated on free-running rollout with paired significance tests, we isolate the contribution of each design choice. Multi-token prediction is the dominant stabilizer, accounting for almost the entire improvement over a single-token baseline (+0.040 median NCC); a horizon-embedding hybrid prediction head and a cross-horizon STFT-magnitude coherence loss each add a small but consistent further gain. Performance depends sharply on a context-ratio threshold near one, roughly the full P-S interval of observed signal, below which rollout generalization collapses. The dominant residual failure is a polarity inversion that a magnitude-based spectral loss cannot, by construction, penalize, identifying phase-aware objectives as the natural next step. We frame this as a controlled study of rollout stability on oscillatory wavefields, not a benchmark of forecasting architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Controlled ablations on synthetic seismograms show multi-token prediction drives most of the rollout stability gain, with context length and polarity issues as clear secondary factors.

The main result is that multi-token prediction accounts for nearly all the improvement in free-running NCC on these synthetic three-component seismograms, with the hybrid head and spectral loss adding smaller consistent lifts. Context must cover roughly the full P-S interval or generalization collapses, and polarity inversion remains the main failure mode the loss cannot penalize.

The work is a clean intra-architecture ablation study with paired significance tests on rollout metrics. That targeted dissection of design choices for oscillatory signals is the actual novelty; most prior work either compares unrelated architectures or reports single-model results without this level of controlled variation. The paper stays within its stated bounds and does not extrapolate to real data or other domains.

The obvious soft spot is the synthetic testbed. The authors themselves frame the study as controlled rather than a claim about field data, which is appropriate, but it means the practical takeaway for seismology or gravitational-wave applications rests on an untested assumption that the synthetic dynamics capture the essential rollout failure modes. No other major gaps appear in the reported claims.

This is for researchers already building or tuning autoregressive forecasters on wave-like physical signals who need concrete guidance on what stabilizes long horizons. A reader in that niche will extract usable design rules from the ranking of factors.

I would send it to peer review. The experiments are scoped and the quantitative claims are backed by the described tests, so referees can evaluate the details without the paper overreaching.

Referee Report

0 major / 0 minor

Summary. The manuscript reports a controlled intra-architecture ablation study on the SeismoGPT autoregressive forecaster using synthetic three-component seismograms to isolate factors that stabilize long-horizon free-running rollouts of oscillatory physical wavefields. It finds that multi-token prediction accounts for nearly the entire improvement over a single-token baseline (+0.040 median NCC), with smaller consistent gains from a horizon-embedding hybrid prediction head and a cross-horizon STFT-magnitude coherence loss; rollout stability exhibits a sharp threshold at a context-ratio near one (roughly the full P-S interval), below which performance collapses, and identifies polarity inversion as the dominant residual failure mode not correctable by magnitude-based losses.

Significance. If the ablation results and paired significance tests hold, the work supplies useful, scoped empirical guidance on architectural choices for rollout stability in a physically structured synthetic testbed. The explicit framing as an intra-architecture study rather than a general benchmark or real-data claim is a strength, as is the focus on free-running rollouts and the identification of phase-aware objectives as future work. The synthetic testbed limits broader generalization, but the paper does not overclaim this.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation to accept. The review accurately captures the scope and contributions of our controlled intra-architecture study on rollout stability for synthetic seismograms.

Circularity Check

0 steps flagged

No significant circularity identified

The paper is a controlled empirical ablation study on synthetic three-component seismograms. It isolates the effects of multi-token prediction, hybrid heads, and spectral losses via direct rollout evaluations and paired significance tests. No equations, derivations, or self-citations reduce any reported gain or threshold to a quantity defined by the paper's own fitted parameters or prior results; the claims rest on external testbed measurements rather than internal redefinitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical controlled ablation study; the abstract introduces no free parameters, mathematical axioms, or new postulated entities.

Reference graph

Works this paper leans on

30 extracted references · 16 canonical work pages · 8 internal anchors

[1] Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks. Advances in Neural Information Processing Systems (NeurIPS). arXiv:1506.03099.
[2] Improving Multi-Step Prediction of Learned Time Series Models. Proceedings of the AAAI Conference on Artificial Intelligence.
[3] Professor Forcing: A New Algorithm for Training Recurrent Networks. Advances in Neural Information Processing Systems (NeurIPS). arXiv:1610.09038.
[4] Why Exposure Bias Matters: An Imitation Learning Perspective of Error Accumulation in Language Generation. Findings of the Association for Computational Linguistics (ACL). arXiv:2204.01171.
[5] Harnessing Loss Decomposition for Long-Horizon Wave Predictions via Deep Neural Networks. 2024.
[6] ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training. Findings of the Association for Computational Linguistics (EMNLP). arXiv:2001.04063.
[7] Better & Faster Large Language Models via Multi-token Prediction. 2024.
[8] DeepSeek-V3 Technical Report. 2024.
[9] A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. International Conference on Learning Representations (ICLR). arXiv:2211.14730.
[10] A Decoder-Only Foundation Model for Time-Series Forecasting. Proceedings of the International Conference on Machine Learning (ICML). arXiv:2310.10688.
[11] Chronos: Learning the Language of Time Series. Transactions on Machine Learning Research (TMLR). arXiv:2403.07815.
[12] Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). arXiv:1910.11480.
[13] DDSP: Differentiable Digital Signal Processing. International Conference on Learning Representations (ICLR). arXiv:2001.04643.
[14] Data-Driven Forecasting of Three-Component Seismograms Using Transformer Architectures. 2026.
[15] Souhaib. A review and comparison of strategies for multi-step ahead time series forecasting based on the NN5 forecasting competition. 2012. doi:10.1016/j.eswa.2012.01.039.
[16] Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting. International Journal of Forecasting. 2021.
[17] Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting. 2023.
[18] Unified Training of Universal Time Series Forecasting Transformers. Proceedings of the International Conference on Machine Learning (ICML). arXiv:2402.02592.
[19] Learning from Predictions: Fusing Training and Autoregressive Inference for Long-Term Spatiotemporal Forecasts. 2023.
[20] PhaseNet: A Deep-Neural-Network-Based Seismic Arrival-Time Picking Method. Geophysical Journal International. 2019.
[21] Earthquake Transformer: An Attentive Deep-Learning Model for Simultaneous Earthquake Detection and Phase Picking. Nature Communications.
[22] SeisLM: a Foundation Model for Seismic Waveforms. 2024.
[23] Mamba: Linear-Time Sequence Modeling with Selective State Spaces. Conference on Language Modeling (COLM). arXiv:2312.00752.
[24] Message Passing Neural PDE Solvers. International Conference on Learning Representations (ICLR). arXiv:2202.03376.
[25] Neural Speech Phase Prediction Based on Parallel Estimation Architecture and Anti-Wrapping Losses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). arXiv:2211.15974.
[26] Hendrycks, Dan and Gimpel, Kevin. Gaussian Error Linear Units (GELUs). 2016. arXiv:1606.08415.
[27] Su, Jianlin and Ahmed, Murtadha and Lu, Yu and Pan, Shengfeng and Bo, Wen and Liu, Yunfeng. Neurocomputing.
[28] Loshchilov, Ilya and Hutter, Frank. International Conference on Learning Representations (ICLR).
[29] Welch, Peter. The use of fast. 1967.
[30] Layer Normalization. arXiv:1607.06450.
