pith. machine review for the scientific record.

arxiv: 2605.06310 · v1 · submitted 2026-05-07 · 💻 cs.LG

Recognition: unknown

Perceive, Route and Modulate: Dynamic Pattern Recalibration for Time Series Forecasting

Haohuan Fu, Haoyang Li, Qingsong Wen, Siru Zhong, Yuxuan Liang, Zhao Meng

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 12:56 UTC · model grok-4.3

classification 💻 cs.LG
keywords: time series forecasting · dynamic pattern recalibration · soft routing · token-level modulation · backbone-agnostic adapter · Hadamard product · forecasting benchmarks · local temporal patterns

The pith

Dynamic Pattern Recalibration adapts forecasting models to shifting local temporal patterns using token-level modulation instead of fixed weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that real-world time series contain continuously shifting local patterns, so models with globally shared fixed weight matrices settle into a compromised average that cannot respond well to changing dynamics. It introduces Dynamic Pattern Recalibration as a lightweight adapter that runs a Perceive-Route-Modulate pipeline: it perceives the current state, routes softly over a learned basis of response patterns, and modulates the hidden states with a residual Hadamard product. A sympathetic reader would care because the method is backbone-agnostic and adds little overhead, suggesting a general fix rather than architecture-specific redesigns. The standalone DPRNet version is shown to reach competitive accuracy on 12 benchmarks, indicating that dynamic recalibration can substitute for some of the gains from macroscopic parameter scaling.

Core claim

Current deep forecasting models apply fixed transformations uniformly to all temporal tokens and therefore cannot adapt to continuously shifting local patterns; DPR counters this by computing a soft-routing distribution over a learned basis of adaptive response patterns to produce a time-aware modulation vector that recalibrates hidden states through a residual Hadamard product.
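
To make the pipeline concrete, here is a minimal PyTorch-style sketch of one way to read "soft routing over a learned basis, applied as a residual Hadamard product". Everything in it (the module name PRModulate, the hyperparameter n_basis, the 1 + m residual form) is an assumption for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PRModulate(nn.Module):
    """Minimal sketch of Perceive-Route-Modulate (illustrative, not the paper's code).

    Perceive: project each token's hidden state to routing logits.
    Route:    soft-routing distribution over K learned response patterns.
    Modulate: residual Hadamard recalibration of the hidden state.
    """
    def __init__(self, d_model: int, n_basis: int = 8):
        super().__init__()
        self.perceive = nn.Linear(d_model, n_basis)                        # token state -> routing logits
        self.basis = nn.Parameter(torch.randn(n_basis, d_model) * 0.02)    # learned response patterns

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, n_tokens, d_model) hidden states from any backbone layer
        route = F.softmax(self.perceive(h), dim=-1)   # (B, T, K) per-token soft routing
        m = route @ self.basis                        # (B, T, d_model) time-aware modulation vector
        return h * (1.0 + m)                          # residual Hadamard recalibration

# usage: h = PRModulate(d_model=128)(backbone_hidden_states)
```

The 1 + m form keeps the module near the identity at initialization, which is one common way such adapters avoid disturbing a backbone; the paper's exact residual parameterization may differ.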

What carries the argument

The Perceive-Route-Modulate pipeline, which generates a modulation vector from soft routing over a learned basis of adaptive response patterns and applies it via residual Hadamard modulation.

Load-bearing premise

That local temporal patterns shift in ways a learned basis of response patterns plus soft routing and residual modulation can capture and correct, beyond what attention or normalization layers already achieve.

What would settle it

An experiment that adds DPR to a standard transformer or linear forecaster and observes no accuracy gain on multiple benchmarks with documented non-stationary behavior would falsify the claim that the recalibration addresses a general, previously unmet bottleneck.
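
Sketched below is what that settling experiment might look like in code: train the same backbone with and without the recalibration adapter under an identical protocol and compare held-out forecast error on non-stationary benchmarks. The Backbone, DPRAdapter, and val_loader names are placeholders, not the paper's artifacts.

```python
import torch
import torch.nn as nn

def forecast_mse(model: nn.Module, loader) -> float:
    """Average MSE over a held-out split (generic evaluation loop)."""
    model.eval()
    total, batches = 0.0, 0
    with torch.no_grad():
        for x, y in loader:              # x: history window, y: forecast horizon
            total += ((model(x) - y) ** 2).mean().item()
            batches += 1
    return total / max(batches, 1)

# Hypothetical protocol (placeholders, left commented):
# plain    = Backbone()                                # a standard transformer or linear forecaster
# with_dpr = nn.Sequential(Backbone(), DPRAdapter())   # same backbone plus the adapter
# ...train both identically, then compare:
# print(forecast_mse(plain, val_loader), forecast_mse(with_dpr, val_loader))
# A flat comparison across several documented non-stationary benchmarks would count against the claim.
```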

Figures

Figures reproduced from arXiv: 2605.06310 by Haohuan Fu, Haoyang Li, Qingsong Wen, Siru Zhong, Yuxuan Liang, Zhao Meng.

Figure 1: Comparison of forecasting paradigms. (a) Standard backbones: fixed mapping compromises across local dynamics. (b) MoE: discrete expert routing scales parameters and requires load balancing. (c) DPR: dynamic pattern recalibration via a lightweight Perceive-Route-Modulate mechanism. To solve this, we propose Dynamic Pattern Recalibration (DPR), a general mechanism that decouples global temporal mapping from…

Figure 2: Overview of DPR. (Top) Backbone-agnostic adapter. (Middle) Perceive-Route-Modulate: …

Figure 3: Dataset Diversity Landscape. Complexity vs. Non-stationarity; bubble size ∝ data volume. Baselines and Evaluation Protocol. We compare eight models across diverse paradigms: attention-based architectures (Informer [22], Crossformer [24], iTransformer [15], PatchTST [14]), efficient linear/MLP models enhanced by structural priors (TimeMixer [27], WPMixer [28]), and complex filtering operations (TimesNet …

Figure 4: Parameter Scaling vs. DPR. Scaling backbone capacity often degrades performance. DPR achieves better gains at negligible cost. Full results in Appendix 12.

Figure 5: Sensitivity Analysis. (a) λ_orth; (b) basis size K; (c–d) kernel configurations on PatchTST and Crossformer. Solid: per-horizon; dashed: horizon-averaged; ⋆: preferred setting. However, performance plateaus for larger values, suggesting routing redundancy. Optimal kernels are architecture-dependent: PatchTST prefers pointwise filtering (k = 1), whereas Crossformer favors multi-scale kernels (k = (3, 7)), al…

Figure 6: Efficiency Trade-off. Minimal error at negligible parameter cost. Computational and Parameter Efficiency.

Figure 7: Contrasts DPR's adaptive tracking with static backbone drift. Panels (a–b) show static backbones (red) diverging from GT (green) in the forecast window, while DPR (blue) calibrates. Panel (c) traces routing-probability evolution, revealing pattern switching; panel (d) zooms in on volatility spikes. During calm periods, mass concentrates on smooth-trend bases; at spikes, it abruptly redistributes to transie…

Figure 8: Local non-stationarity across the twelve benchmark datasets (4…
Original abstract

Local temporal patterns in real-world time series continuously shift, rendering globally shared transformations suboptimal. Current deep forecasting models, despite their scale and complexity, rely on fixed weight matrices applied uniformly to all temporal tokens. This creates a static pattern response: models settle into a compromised average, unable to adapt to changing local dynamics. We introduce Dynamic Pattern Recalibration (DPR), a backbone-agnostic mechanism that resolves this via token-level recalibration. Through a lightweight "Perceive-Route-Modulate" pipeline, DPR computes a soft-routing distribution over a learned basis of adaptive response patterns, generating a time-aware modulation vector that recalibrates hidden states via a residual Hadamard product. As a backbone-agnostic adapter, DPR enhances forecasting across diverse architectures with minimal overhead, confirming it addresses a general bottleneck. As a minimalist standalone model, DPRNet achieves competitive performance across 12 benchmarks, validating dynamic recalibration against macroscopic parameter scaling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that fixed weight matrices in deep time series forecasting models produce suboptimal static responses to continuously shifting local temporal patterns. It introduces Dynamic Pattern Recalibration (DPR) as a lightweight, backbone-agnostic adapter using a Perceive-Route-Modulate pipeline: a learned basis of adaptive response patterns, soft routing to produce a time-aware modulation vector, and residual Hadamard product recalibration of hidden states. DPR is shown to enhance diverse architectures with minimal overhead and, as standalone DPRNet, achieves competitive results across 12 benchmarks.

Significance. If the central mechanism proves non-redundant with existing dynamic components such as attention, DPR could offer a general, low-cost way to address local pattern shifts without macroscopic scaling, with the standalone competitiveness providing evidence that the recalibration itself is effective.

major comments (2)
  1. [§3] §3 (Perceive-Route-Modulate pipeline) and Eq. (3)–(5): the soft-routing distribution and resulting modulation vector are presented as independent of backbone dynamics, but the formulation (learned basis + token-dependent routing + Hadamard residual) risks being a low-rank parallel to the token-dependent transformations already computed by attention layers in Transformer backbones; without an explicit derivation showing the modulation cannot be absorbed into existing attention weights, the claim that DPR addresses a general bottleneck is not yet load-bearing.
  2. [Table 2, §4.3] Table 2 and §4.3 (adapter experiments on attention-equipped models): performance gains are reported for various backbones, but the ablation does not isolate whether DPR adds value beyond the dynamic weighting already present in attention; the central premise that fixed matrices create a compromised average is least secure here, and the results do not yet confirm the mechanism is necessary rather than redundant.
minor comments (2)
  1. [Abstract] The abstract states '12 benchmarks' without naming the datasets or metrics; this should be expanded in the introduction or §4 for immediate clarity.
  2. [§3] Notation for the modulation vector and routing distribution is introduced without a consolidated table of symbols; adding one would aid readability of the method.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed report. We address each major comment below with clarifications on the mechanism and plans for strengthening the empirical support. We believe these points can be resolved without altering the core claims of the work.

Point-by-point responses
  1. Referee: [§3] §3 (Perceive-Route-Modulate pipeline) and Eq. (3)–(5): the soft-routing distribution and resulting modulation vector are presented as independent of backbone dynamics, but the formulation (learned basis + token-dependent routing + Hadamard residual) risks being a low-rank parallel to the token-dependent transformations already computed by attention layers in Transformer backbones; without an explicit derivation showing the modulation cannot be absorbed into existing attention weights, the claim that DPR addresses a general bottleneck is not yet load-bearing.

    Authors: We agree that an explicit argument is needed to distinguish DPR from attention. DPR computes a modulation vector by soft-routing over a fixed learned basis of response patterns, then applies it as a residual Hadamard product directly to the hidden state after the backbone layer. This is a per-token multiplicative recalibration derived from pattern matching on the current token, independent of cross-token query-key interactions. Attention, by contrast, produces additive updates via weighted sums across tokens. We will add a short derivation in §3 showing that the DPR modulation matrix cannot be folded into the attention weight matrices without changing the functional form (the Hadamard residual introduces a diagonal scaling that attention's outer-product updates do not replicate). This supports the claim that DPR targets a distinct aspect of the static-response problem. revision: partial
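
One way to write down the distinction the rebuttal appeals to, in notation assumed here rather than taken from the paper's Eq. (3)–(5):

```latex
% Assumed notation (not the paper's): h_t is the hidden state of token t, W_p a
% perception projection, {b_k}_{k=1}^K the learned basis, r_t the routing weights.
\begin{align*}
  r_t &= \operatorname{softmax}(W_p h_t), \qquad
        m_t = \sum_{k=1}^{K} r_{t,k}\, b_k            && \text{(perceive, route)} \\
  \tilde{h}_t &= h_t \odot (\mathbf{1} + m_t)
               = \operatorname{diag}(\mathbf{1} + m_t)\, h_t && \text{(modulate: diagonal, token-local)} \\
  \tilde{h}_t^{\mathrm{attn}} &= h_t + \sum_{s} \alpha_{ts}\, W_V h_s && \text{(attention: additive, cross-token)}
\end{align*}
```

If the promised §3 derivation shows that the token-local diagonal rescaling cannot be reproduced by the additive value-mixing above without changing the attention parameterization, the referee's absorption concern is answered; the burden is on that addition to make it precise.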

  2. Referee: [Table 2, §4.3] Table 2 and §4.3 (adapter experiments on attention-equipped models): performance gains are reported for various backbones, but the ablation does not isolate whether DPR adds value beyond the dynamic weighting already present in attention; the central premise that fixed matrices create a compromised average is least secure here, and the results do not yet confirm the mechanism is necessary rather than redundant.

    Authors: The current experiments demonstrate consistent gains when DPR is added to attention-based models, but we accept that they do not yet fully isolate the incremental effect beyond attention. In the revision we will expand §4.3 with a targeted ablation that (i) freezes the attention layers and (ii) compares DPR against a simple learned per-token scaling baseline. These additions will clarify whether the observed improvements stem from the pattern-basis routing rather than generic dynamic weighting. We maintain that the premise remains valid because the fixed linear transformations inside each backbone layer still produce a single compromised response per token; DPR's external recalibration operates orthogonally to that. revision: yes
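
A minimal sketch of the per-token scaling baseline the rebuttal proposes, assuming hidden states indexed as (batch, tokens, channels); the class name and shape conventions are illustrative only.

```python
import torch
import torch.nn as nn

class TokenScale(nn.Module):
    """Learned per-token multiplicative scaling, with no perception or routing.

    The gain at a given token position is static once trained, regardless of the
    input. If DPR's improvements survive this comparison, they plausibly come from
    input-dependent pattern routing rather than generic multiplicative recalibration.
    """
    def __init__(self, n_tokens: int, d_model: int):
        super().__init__()
        self.gain = nn.Parameter(torch.zeros(n_tokens, d_model))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, n_tokens, d_model); residual form keeps it near identity at init
        return h * (1.0 + self.gain)
```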

Circularity Check

0 steps flagged

No circularity: DPR introduced as independent mechanism with empirical validation

Full rationale

The abstract presents Dynamic Pattern Recalibration (DPR) as a novel Perceive-Route-Modulate pipeline that computes a soft-routing distribution over a learned basis to generate a modulation vector for residual Hadamard recalibration. This is positioned as an additive adapter addressing a stated limitation of fixed weights in existing models, without any equations or claims reducing the mechanism to its own fitted inputs, self-citations, or renamed prior results. Standalone DPRNet performance is reported empirically across benchmarks rather than derived by construction from the same data or parameters. The derivation chain remains self-contained with no load-bearing steps that collapse to tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the premise that fixed global transformations are suboptimal for shifting local patterns and that the proposed lightweight pipeline can provide effective adaptation without introducing new free parameters beyond a learned basis.

axioms (1)
  • domain assumption Local temporal patterns in real-world time series continuously shift, rendering globally shared transformations suboptimal.
    Directly stated as the motivating premise in the abstract.
invented entities (1)
  • Dynamic Pattern Recalibration (DPR) mechanism no independent evidence
    purpose: Token-level recalibration of hidden states via soft routing over adaptive response patterns
    Newly introduced construct whose effectiveness is asserted but not independently evidenced in the abstract.

pith-pipeline@v0.9.0 · 5478 in / 1360 out tokens · 41701 ms · 2026-05-08T12:56:11.361708+00:00 · methodology


Reference graph

Works this paper leans on

52 extracted references · 9 canonical work pages · 2 internal anchors

1. [1] Sheikh Mohammad Idrees, M Afshar Alam, and Parul Agarwal. A prediction approach for stock market volatility based on time series data. IEEE Access, 7:17287–17298, 2019.
2. [2] Zahra Karevan and Johan AK Suykens. Transductive LSTM for time-series prediction: An application to weather forecasting. Neural Networks, 125:1–9, 2020.
3. [3] Chirag Deb, Fan Zhang, Junjing Yang, Siew Eang Lee, and Kwok Wei Shah. A review on time series forecasting techniques for building energy consumption. Renewable and Sustainable Energy Reviews, 74:902–924, 2017.
4. [4] Jianhu Zheng and Mingfang Huang. Traffic flow forecast through time series analysis based on deep learning. IEEE Access, 8:82562–82570, 2020.
5. [5] Weilin Ruan, Wenzhuo Wang, Siru Zhong, Wei Chen, Li Liu, and Yuxuan Liang. Cross space and time: A spatio-temporal unitized model for traffic flow forecasting. IEEE Transactions on Intelligent Transportation Systems, 2025.
6. [6] Huaiwu Zhang, Yutong Xia, Siru Zhong, Kun Wang, Zekun Tong, Qingsong Wen, Roger Zimmermann, and Yuxuan Liang. Predicting carpark availability in Singapore with cross-domain data: a new dataset and a data-driven approach. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI), pages 7554–7562, 2024.
7. [7] Yongzheng Liu, Siru Zhong, Gefeng Luo, Weilin Ruan, and Yuxuan Liang. Towards multi-scenario forecasting of building electricity loads with multimodal data. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 2188–2196, 2025.
8. [8] Xingchen Zou, Weilin Ruan, Siru Zhong, Yuehong Hu, and Yuxuan Liang. Fine-grained urban heat island effect forecasting: A context-aware thermodynamic modeling framework. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, pages 4226–4237, 2025.
9. [9] James D. Hamilton. Analysis of time series subject to changes in regime. Journal of Econometrics, 45(1-2):39–70, 1990.
10. [10] Jiayi Liu, Donghua Yang, Kaiqi Zhang, Hong Gao, and Jianzhong Li. Anomaly and change point detection for time series with concept drift. World Wide Web, 26(5):3229–3252, 2023.
11. [11] Siru Zhong, Yiqiu Liu, Zhiqing Cui, Zezhi Shao, Fei Wang, Qingsong Wen, and Yuxuan Liang. Dropoutts: Sample-adaptive dropout for robust time series forecasting. arXiv preprint arXiv:2601.21726, 2026.
12. [12] Yizeng Han, Gao Huang, Shiji Song, Le Yang, Honghui Wang, and Yulin Wang. Dynamic neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7436–7456, 2022.
13. [13] Yinpeng Chen, Xiyang Dai, Mengchen Liu, Dongdong Chen, Lu Yuan, and Zicheng Liu. Dynamic convolution: Attention over convolution kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11030–11039, 2020.
14. [14] Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. In International Conference on Learning Representations (ICLR), 2023.
15. [15] Yong Liu, Tenggan Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. iTransformer: Inverted transformers are effective for time series forecasting. In International Conference on Learning Representations (ICLR), 2024. Spotlight.
16. [16] Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. A decoder-only foundation model for time-series forecasting. In Proceedings of the 41st International Conference on Machine Learning, pages 10148–10167, 2024.
17. [17] Abdul Fatir Ansari, Lorenzo Stella, Ali Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, Syama Sundar Rangapuram, Sebastian Pineda Arango, Shubham Kapoor, et al. Chronos: Learning the language of time series. Transactions on Machine Learning Research, 2024.
18. [18] Xiaoming Shi, Shiyu Wang, Yuqi Nie, Dianqi Li, Zhou Ye, Qingsong Wen, and Ming Jin. Time-MoE: Billion-scale time series foundation models with mixture of experts. In International Conference on Learning Representations (ICLR), 2025. Spotlight.
19. [19] Xiaowen Ma, Shuning Ge, Fan Yang, Xiangyu Li, Yun Chen, Mengting Ma, Wei Zhang, and Zhipeng Liu. Timeexpert: Boosting long time series forecasting with temporal mix of experts. arXiv preprint arXiv:2509.23145, 2025.
20. [20] Taesung Kim, Jinhee Kim, Yunwon Tae, Cheonbok Park, Jang-Ho Choi, and Jaegul Choo. Reversible instance normalization for accurate time-series forecasting against distribution shift. In International Conference on Learning Representations, 2021.
21. [21] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, pages 5998–6008, 2017.
22. [22] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2021.
23. [23] Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
24. [24] Yunhao Zhang and Junchi Yan. Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting. In International Conference on Learning Representations (ICLR), 2023.
25. [25] Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? Proceedings of the AAAI Conference on Artificial Intelligence, 37(9):11121–11128, 2023. doi: 10.1609/aaai.v37i9.26317.
26. [26] Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. TimesNet: Temporal 2d-variation modeling for general time series analysis. In International Conference on Learning Representations (ICLR), 2023.
27. [27] Shiyu Wang, Haixu Wu, Xiaoming Shi, Tengge Hu, Huakun Luo, Lintao Ma, James Y. Zhang, and Jun Zhou. TimeMixer: Decomposable multiscale mixing for time series forecasting. arXiv preprint arXiv:2405.14616, 2024.
28. [28] Md Mahmuddun Nabi Murad, Mehmet Aktukmak, and Yasin Yilmaz. WPMixer: Efficient multi-resolution mixing for long-term time series forecasting. Proceedings of the AAAI Conference on Artificial Intelligence, 39(18):19581–19588, April 2025. doi: 10.1609/aaai.v39i18.34156.
29. [29] Yifan Hu, Guibin Zhang, Peiyuan Liu, Disen Lan, Naiqi Li, Dawei Cheng, Tao Dai, Shu-Tao Xia, and Shirui Pan. TimeFilter: Patch-specific spatial-temporal graph filtration for time series forecasting. In Forty-second International Conference on Machine Learning (ICML), 2025. URL https://openreview.net/forum?id=490VcNtjh7.
30. [30] Gerald Woo, Chenghao Liu, Akshat Kumar, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. Unified training of universal time series forecasting transformers. In Forty-first International Conference on Machine Learning, 2024.
31. [31] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations (ICLR), 2017.
32. [32] William Fedus, Barret Zoph, and Noam Shazeer. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.
33. [33] Joan Puigcerver, Carlos Riquelme, Basil Mustafa, and Neil Houlsby. From sparse to soft mixtures of experts. In International Conference on Learning Representations (ICLR), 2024.
34. [34] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
35. [35] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
36. [36] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
37. [37] Brandon Yang, Gabriel Bender, Quoc V Le, and Jiquan Ngiam. CondConv: Conditionally parameterized convolutions for efficient inference. Advances in Neural Information Processing Systems, 32, 2019.
38. [38] Yong Liu, Haixu Wu, Jianmin Wang, and Mingsheng Long. Non-stationary transformers: Exploring the stationarity in time series forecasting. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, pages 9881–9893, 2022.
39. [39] Yong Liu, Chenyu Li, Jianmin Wang, and Mingsheng Long. Koopa: Learning non-stationary time series dynamics with Koopman predictors. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
40. [40] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
41. [41] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
42. [42] Jake Grigsby, Zhe Wang, Nam Nguyen, and Yanjun Qi. Long-range transformers for dynamic spatiotemporal forecasting. arXiv preprint arXiv:2109.12218, 2021.
43. [43] T. Inouye, K. Shinosaki, H. Sakamoto, S. Toi, S. Ukai, A. Iyama, Y. Katsuda, and M. Hirano. Quantification of EEG irregularity by use of the entropy of the power spectrum. Electroencephalography and Clinical Neurophysiology, 79(3):204–210, 1991.
44. [44] Fulvio Corsi, Stefan Mittnik, Christian Pigorsch, and Uta Pigorsch. The volatility of realized volatility. Econometric Reviews, 27(1-3):46–78, 2008.
45. [45] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
46. [46] COVID19 and Illness (Score 19) are dominated by genuinely non-periodic dynamics. COVID19 has high VoV (1.46) from asymmetric pattern shifts: plateaus punctuated by exponential waves. Illness ranks high on both Hs and VoV, reflecting episodic outbreaks with no fixed periodicity. Neither dataset offers a periodic backbone that a static model can rely on.
47. [47] BeijingAirQuality reaches the same composite Score (19) via a different mechanism. It has a strong daily and seasonal periodic base (rush-hour emission cycles, meteorological patterns), with episodic haze events layered on top as additive bursts. A periodic backbone captures most of the variance; the non-stationarity is real but secondary.
48. [48] Weather achieves VoV of 1.68, the highest overall, but each individual channel remains strongly periodic. The high VoV reflects cross-variable heterogeneity (smooth temperature alongside bursty precipitation and wind) rather than intrinsic unpredictability per channel.
49. [49] VIX (Score 14) shows volatility clustering: narrow low-volatility bands during calm periods with vertical spikes at crises (2008, 2020, 2022).
50. [50] NABCPU (Score 13) has the highest spectral entropy (0.78), indicating nearly broadband dynamics from overlapping periodicities: daily cycles, weekly patterns, and intermittent computational bursts. Sunspots (Score 13) shows the ~11-year solar cycle with strong amplitude variation.
51. [51] ExchangeRate (Score 11) is the only dataset approaching a random walk (ADF p = 0.55; the null of a unit root cannot be rejected at any conventional level). Its dynamics resemble a driftless stochastic process: the log-price today is approximately the log-price yesterday plus noise. This near-random-walk behaviour produces concentrated low-frequency power (...
52. [52] ETTh1/ETTh2/ETTm1/ETTm2 have the cleanest periodic structure (Score 6–9, low Hs, low VoV). These serve as a counterpoint where static backbones suffice and DPR should not degrade accuracy, which our main experiments confirm. Table 7 and Figure 8 show that local non-stationarity is not a corner case but a dominant property of real-world time series, motiva...
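
Entries 46–52 characterize the benchmark datasets through spectral entropy (Hs), volatility of volatility (VoV), and ADF unit-root tests. Below is a rough sketch of how such metrics are commonly computed; the windowing, normalization, and exact definitions the paper uses may differ.

```python
import numpy as np
from scipy.signal import welch
from statsmodels.tsa.stattools import adfuller

def spectral_entropy(x: np.ndarray) -> float:
    """Normalized entropy of the power spectrum (one common definition of Hs)."""
    _, psd = welch(x, nperseg=min(256, len(x)))
    p = psd / psd.sum()
    return float(-(p * np.log(p + 1e-12)).sum() / np.log(len(p)))

def vol_of_vol(x: np.ndarray, window: int = 24) -> float:
    """Dispersion of rolling volatility (one common reading of VoV); assumes len(x) >> window."""
    r = np.diff(x)
    roll = np.array([r[i:i + window].std() for i in range(len(r) - window)])
    return float(roll.std() / (roll.mean() + 1e-12))

def adf_pvalue(x: np.ndarray) -> float:
    """Augmented Dickey-Fuller p-value; large values (e.g. ~0.55 for ExchangeRate
    in entry 51) mean the unit-root null cannot be rejected."""
    return float(adfuller(x)[1])
```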