When Do Autoregressive Sequence Models Forecast Physical Wavefields? A Controlled Study on Synthetic Seismograms
Pith reviewed 2026-06-27 13:38 UTC · model grok-4.3
The pith
Multi-token prediction accounts for nearly all stability gains in long-horizon seismogram forecasting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In intra-architecture ablations evaluated on free-running rollout with paired significance tests, multi-token prediction accounts for almost the entire improvement over a single-token baseline (+0.040 median NCC); a horizon-embedding hybrid prediction head and a cross-horizon STFT-magnitude coherence loss each add a small but consistent further gain. Performance depends sharply on a context-ratio threshold near one, roughly the full P-S interval of observed signal, below which rollout generalization collapses. The dominant residual failure is a polarity inversion that a magnitude-based spectral loss cannot, by construction, penalize.
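The polarity-inversion point can be checked numerically. The sketch below is a minimal illustration, not the paper's code. It assumes NCC means zero-lag normalized cross-correlation, and it uses a plain rectangular-window STFT magnitude difference as a simplified stand-in for the cross-horizon coherence loss. The function names, frame sizes, and toy trace are all hypothetical.

```python
import cmath
import math

def ncc(x, y):
    """Zero-lag normalized cross-correlation (assumed reading of the NCC metric)."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def stft_magnitude(x, frame=16, hop=8):
    """Magnitudes of a naive rectangular-window STFT (one DFT per frame)."""
    frames = []
    for start in range(0, len(x) - frame + 1, hop):
        seg = x[start:start + frame]
        frames.append([
            abs(sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame)
                    for n in range(frame)))
            for k in range(frame // 2 + 1)
        ])
    return frames

def magnitude_loss(x, y):
    """Mean absolute difference between STFT magnitudes (stand-in spectral loss)."""
    fx, fy = stft_magnitude(x), stft_magnitude(y)
    diffs = [abs(a - b) for row_x, row_y in zip(fx, fy) for a, b in zip(row_x, row_y)]
    return sum(diffs) / len(diffs)

# Toy target: a decaying oscillation. The "forecast" is the same trace with flipped sign.
target = [math.exp(-0.02 * t) * math.sin(0.4 * t) for t in range(128)]
flipped = [-v for v in target]

print(ncc(target, flipped))             # close to -1: the traces are anti-correlated
print(magnitude_loss(target, flipped))  # essentially zero: the loss does not see the flip
```

Because negating a signal negates every STFT coefficient without changing its modulus, any loss built only from magnitudes assigns the inverted forecast the same score as a perfect one, while a waveform-level metric such as NCC registers it as maximally wrong.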
What carries the argument
Multi-token prediction head in the autoregressive forecaster, which predicts several future tokens at each step to limit error accumulation during extended rollout.
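The mechanism can be sketched as a rollout loop in which each model call emits a block of k values before the model is fed its own output again. This is a hedged illustration only: `toy_model`, `rollout`, and `OMEGA` are hypothetical names, and the stand-in forecaster is an exact sinusoid recurrence rather than SeismoGPT. It shows how the number of self-feeding steps shrinks with k, not the stability gain reported in the paper.

```python
import math

OMEGA = 0.3  # angular frequency of the toy oscillation (hypothetical value)

def toy_model(history, k):
    """Stand-in forecaster: continue a sinusoid with its exact two-term recurrence."""
    a, b = history[-2], history[-1]
    block = []
    for _ in range(k):
        a, b = b, 2.0 * math.cos(OMEGA) * b - a
        block.append(b)
    return block

def rollout(model, context, horizon, k):
    """Free-running rollout: every call predicts k values that are appended to the input."""
    history = list(context)
    calls = 0
    while len(history) - len(context) < horizon:
        history.extend(model(history, k))
        calls += 1
    forecast = history[len(context):len(context) + horizon]
    return forecast, calls

context = [math.sin(OMEGA * t) for t in range(32)]
single, calls_single = rollout(toy_model, context, horizon=200, k=1)
multi, calls_multi = rollout(toy_model, context, horizon=200, k=4)
print(calls_single, calls_multi)  # 200 feedback steps versus 50
```

With k = 4 the model re-ingests its own predictions a quarter as often over the same horizon, which is the sense in which multi-token prediction offers fewer opportunities for per-step errors to compound.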
If this is right
- Switching from single-token to multi-token prediction yields the bulk of the observed stability improvement.
- Context length must reach or exceed the full observed P-S wave interval or rollout performance drops sharply.
- Adding a horizon-embedding hybrid head and STFT-magnitude coherence loss produces further modest but reliable gains.
- Phase-aware loss terms are required to address the residual polarity-inversion failures.
- The study frames its results as a controlled examination of rollout stability rather than an architecture benchmark.
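One way to read the context-ratio condition is as the length of observed signal divided by the P-S interval. The summary does not give the paper's exact definition, so the helper below is an assumption: `context_ratio` and its arguments are hypothetical names, and the arrival times are made-up values.

```python
def context_ratio(context_seconds, p_arrival, s_arrival):
    """Observed context length relative to the P-S interval (assumed definition)."""
    ps_interval = s_arrival - p_arrival
    if ps_interval <= 0:
        raise ValueError("S arrival must come after P arrival")
    return context_seconds / ps_interval

# Hypothetical trace: P arrives at 10 s, S at 18 s, so the P-S interval is 8 s.
print(context_ratio(8.0, 10.0, 18.0))  # 1.0, at the reported threshold
print(context_ratio(4.0, 10.0, 18.0))  # 0.5, below the threshold
```

Under this reading, a ratio below one means the model has not yet observed the full span between the two dominant arrivals when it starts forecasting.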
Where Pith is reading between the lines
- The same multi-token stabilization may extend to other oscillatory signals such as gravitational-wave strain.
- The sharp context-ratio threshold points to a physical limit set by the duration of the dominant wave arrivals.
- Replacing the magnitude-only spectral loss with a phase-sensitive objective is the direct next design step.
- The polarity-inversion failure suggests testing whether sign-flip augmentations during training can reduce that specific error mode.
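The sign-flip augmentation mentioned in the last item could look like the sketch below. It is one possible design, not something described in the paper: the function name is hypothetical, and flipping all three components of context and target together is an assumption made so that each training pair stays internally consistent.

```python
import random

def sign_flip_augment(context, target, rng, p=0.5):
    """With probability p, negate every component of both context and target."""
    if rng.random() < p:
        def flip(trace):
            return [[-v for v in component] for component in trace]
        return flip(context), flip(target)
    return context, target

# Hypothetical three-component window: two context samples and one target sample per component.
context = [[0.1, 0.2], [0.0, -0.3], [0.5, 0.4]]
target = [[0.3], [-0.1], [0.2]]

rng = random.Random(0)
aug_context, aug_target = sign_flip_augment(context, target, rng, p=1.0)
print(aug_context[0], aug_target[0])  # [-0.1, -0.2] [-0.3]
```

Whether such an augmentation reduces polarity inversions is an open question that this sketch does not answer; it only makes the proposed experiment concrete.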
Load-bearing premise
The synthetic three-component seismograms capture the essential dynamical properties of real oscillatory wavefields sufficiently well that ablation results on rollout stability will generalize beyond the controlled testbed.
What would settle it
Repeating the identical ablation suite on a set of real recorded three-component seismograms and checking whether the same ordering of contributions from multi-token prediction, context ratio, and auxiliary losses is recovered.
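Such a comparison needs a paired test on per-trace scores. The summary does not name the test the paper uses, so the sketch below substitutes a sign-flip permutation test on paired NCC differences. The function name is hypothetical and the scores are placeholder numbers chosen only to make the example run; they are not results from the paper.

```python
import random
import statistics

def paired_sign_flip_test(scores_a, scores_b, n_resamples=2000, seed=0):
    """Two-sided sign-flip permutation test on the mean paired difference."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(statistics.fmean(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.fmean(flipped)) >= observed:
            hits += 1
    p_value = (hits + 1) / (n_resamples + 1)
    return statistics.median(diffs), p_value

# Placeholder per-trace NCC values for a baseline and a variant (not real data).
baseline = [0.50 + 0.01 * i for i in range(20)]
variant = [b + 0.02 + 0.001 * i for i, b in enumerate(baseline)]

median_gain, p_value = paired_sign_flip_test(variant, baseline)
print(median_gain, p_value)
```

Running the same test for each ablation on real recordings, and then comparing the ordering of median gains with the synthetic results, is one concrete form the proposed check could take.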
Original abstract
Long-horizon autoregressive forecasting of oscillatory physical signals, such as seismograms, gravitational-wave strain, and similar wavefields, is limited by error accumulation: as a causal model is fed its own outputs over hundreds of steps, small per-step errors compound into phase drift that pointwise metrics fail to detect. We ask when such rollout stays stable, using synthetic three-component seismograms as a physically structured testbed and the SeismoGPT autoregressive forecaster as the model under study. Through controlled, intra-architecture ablations evaluated on free-running rollout with paired significance tests, we isolate the contribution of each design choice. Multi-token prediction is the dominant stabilizer, accounting for almost the entire improvement over a single-token baseline (+0.040 median NCC); a horizon-embedding hybrid prediction head and a cross-horizon STFT-magnitude coherence loss each add a small but consistent further gain. Performance depends sharply on a context-ratio threshold near one, roughly the full P-S interval of observed signal, below which rollout generalization collapses. The dominant residual failure is a polarity inversion that a magnitude-based spectral loss cannot, by construction, penalize, identifying phase-aware objectives as the natural next step. We frame this as a controlled study of rollout stability on oscillatory wavefields, not a benchmark of forecasting architectures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports a controlled intra-architecture ablation study of the SeismoGPT autoregressive forecaster, using synthetic three-component seismograms to isolate the factors that stabilize long-horizon free-running rollouts of oscillatory physical wavefields. It finds that multi-token prediction accounts for nearly the entire improvement over a single-token baseline (+0.040 median NCC), with smaller but consistent gains from a horizon-embedding hybrid prediction head and a cross-horizon STFT-magnitude coherence loss. Rollout stability shows a sharp threshold at a context ratio near one (roughly the full P-S interval), below which performance collapses. The manuscript also identifies polarity inversion as the dominant residual failure mode, one that magnitude-based losses cannot correct.
Significance. If the ablation results and paired significance tests hold, the work supplies useful, scoped empirical guidance on architectural choices for rollout stability in a physically structured synthetic testbed. The explicit framing as an intra-architecture study rather than a general benchmark or real-data claim is a strength, as is the focus on free-running rollouts and the identification of phase-aware objectives as future work. The synthetic testbed limits broader generalization, but the paper does not overclaim this.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and recommendation to accept. The review accurately captures the scope and contributions of our controlled intra-architecture study on rollout stability for synthetic seismograms.
Circularity Check
No significant circularity identified
The paper is a controlled empirical ablation study on synthetic three-component seismograms. It isolates the effects of multi-token prediction, hybrid heads, and spectral losses via direct rollout evaluations and paired significance tests. No equations, derivations, or self-citations reduce any reported gain or threshold to a quantity defined by the paper's own fitted parameters or prior results; the claims rest on external testbed measurements rather than internal redefinitions.