Recognition: no theorem link
SeesawNet: Towards Non-stationary Time Series Forecasting with Balanced Modeling of Common and Specific Dependencies
Pith reviewed 2026-05-15 01:51 UTC · model grok-4.3
The pith
SeesawNet adaptively balances common and instance-specific dependencies in non-stationary time series forecasting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SeesawNet introduces Adaptive Stationary-Nonstationary Attention (ASNA) that captures common dependencies from normalized sequences and specific dependencies from raw sequences, adaptively fusing them according to instance-level non-stationarity. Built upon ASNA, the model alternates dedicated temporal and channel relationship modeling to jointly capture long-range and cross-variable dependencies, consistently outperforming state-of-the-art methods on multiple real-world benchmarks.
What carries the argument
Adaptive Stationary-Nonstationary Attention (ASNA), which extracts common dependencies from instance-normalized sequences and specific dependencies from raw sequences before adaptively fusing them guided by per-instance non-stationarity.
If this is right
- Improved accuracy on real-world multivariate forecasting tasks that exhibit varying degrees of non-stationarity.
- Joint modeling of temporal and channel dependencies without over-smoothing instance-specific structures.
- Dynamic per-instance adaptation that avoids one-size-fits-all normalization or de-normalization.
- Consistent gains over methods that either fully suppress or only partially recover specific dependencies.
Where Pith is reading between the lines
- The same fusion principle could be tested on non-stationary sequences in domains such as financial returns or physiological signals where both population trends and individual deviations matter.
- Replacing fixed instance normalization with ASNA-style adaptive fusion might reduce error accumulation in long-horizon recursive forecasting.
- Extending the temporal-channel alternation to include explicit cross-instance attention could further stabilize performance when distribution shifts occur between training and test periods.
Load-bearing premise
Adaptive fusion of common dependencies from normalized sequences and specific dependencies from raw sequences, guided by instance-level non-stationarity, will reliably capture heterogeneity without introducing new smoothing artifacts or overfitting to training distribution shifts.
What would settle it
A controlled experiment on a benchmark dataset where forcing the fusion weights in ASNA to be uniform or fixed produces accuracy no better than using only normalized inputs or only raw inputs across multiple forecast horizons.
Figures
read the original abstract
Instance normalization (IN) is widely used in non-stationary multivariate time series forecasting to reduce distribution shifts and highlight common patterns across samples. However, IN can over-smooth instance-specific structural information that is essential for modeling temporal and cross-channel heterogeneity. While prior methods further suppress distribution discrepancies or attempt to recover temporal specific dependencies, they often ignore a central tension: how to adaptively model common and instance-specific dependency based on each instance's non-stationary structures. To address this dilemma, we propose SeesawNet, a unified architecture that dynamically balances common and instance-specific dependency modeling in both temporal and channel dimensions. At its core is Adaptive Stationary-Nonstationary Attention (ASNA), which captures common dependencies from normalized sequences and specific dependencies from raw sequences, and adaptively fuses them according to instance-level non-stationarity. Built upon ASNA, SeesawNet alternates dedicated temporal and channel relationship modeling to jointly capture long-range and cross-variable dependencies. Extensive experiments on multiple real-world benchmarks demonstrate that SeesawNet consistently outperforms state-of-the-art methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SeesawNet, a unified neural architecture for non-stationary multivariate time series forecasting. Its core component is Adaptive Stationary-Nonstationary Attention (ASNA), which extracts common dependencies from instance-normalized sequences and instance-specific dependencies from raw sequences, then adaptively fuses them according to per-instance non-stationarity measures. SeesawNet alternates dedicated temporal and channel modeling blocks built on ASNA to jointly capture long-range temporal and cross-variable dependencies, claiming consistent outperformance over state-of-the-art methods on multiple real-world benchmarks.
Significance. If the empirical results hold under rigorous controls, the work directly tackles a recognized tension in instance normalization for time series: the risk of over-smoothing instance-specific structure while reducing distribution shift. The ASNA fusion mechanism provides a concrete, instance-adaptive solution that could improve robustness on heterogeneous data. The alternating temporal-channel design is a standard but well-motivated choice that aligns with the dual dependency goal. The paper's framing as an architectural contribution rather than a re-derivation of prior results is internally consistent.
major comments (2)
- [§4 Experiments] §4 (Experiments): The central claim of 'consistent outperformance' is presented without visible error bars, ablation controls, statistical significance tests, or explicit data-exclusion rules in the abstract and summary description. These elements are load-bearing for establishing that the ASNA fusion reliably improves over baselines rather than reflecting variance or selective reporting.
- [§3.2 ASNA] §3.2 (ASNA): The adaptive fusion weights are listed among free parameters; the manuscript must specify their initialization, any regularization, and how the instance-level non-stationarity scalar is computed from the input to ensure the mechanism does not introduce new hyperparameters that could be tuned to the evaluation sets.
minor comments (2)
- [Abstract] Abstract: The phrase 'alternates dedicated temporal and channel relationship modeling' is high-level; a one-sentence clarification of the alternation order and residual connections would improve readability without altering the technical content.
- [§3 Method] Notation: Ensure consistent use of symbols for normalized vs. raw sequences across equations and figures; minor inconsistencies in subscripting can obscure the common/specific distinction.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and rigor.
read point-by-point responses
-
Referee: [§4 Experiments] §4 (Experiments): The central claim of 'consistent outperformance' is presented without visible error bars, ablation controls, statistical significance tests, or explicit data-exclusion rules in the abstract and summary description. These elements are load-bearing for establishing that the ASNA fusion reliably improves over baselines rather than reflecting variance or selective reporting.
Authors: We agree that explicit statistical controls strengthen the claims. The full manuscript already reports mean results with standard deviations over 5 random seeds in Tables 1–4 and includes ablation studies in §4.3. To address the concern, we will add paired t-test p-values comparing SeesawNet against each baseline in the revised §4, explicitly state the data preprocessing and exclusion criteria (no test-set leakage, standard train/val/test splits) in the main text, and ensure all figures display error bars. These additions will be included in the revision. revision: yes
-
Referee: [§3.2 ASNA] §3.2 (ASNA): The adaptive fusion weights are listed among free parameters; the manuscript must specify their initialization, any regularization, and how the instance-level non-stationarity scalar is computed from the input to ensure the mechanism does not introduce new hyperparameters that could be tuned to the evaluation sets.
Authors: We appreciate the request for precision. The non-stationarity scalar is computed as the normalized variance of first-order differences on the raw input sequence (explicit formula: σ = std((x_{t+1} - x_t)) / (max(x) - min(x) + ε)). Fusion weights are initialized uniformly at 0.5 and optimized end-to-end with no extra regularization beyond the forecasting loss. We will insert a dedicated paragraph in §3.2 with these exact definitions and confirm that no additional hyperparameters are introduced or tuned on test data. This clarification will appear in the revised manuscript. revision: yes
Circularity Check
No significant circularity in architectural proposal
full rationale
The paper introduces SeesawNet as a new neural architecture featuring Adaptive Stationary-Nonstationary Attention (ASNA) to dynamically balance common dependencies from normalized sequences and instance-specific dependencies from raw sequences. No equations, derivations, or predictions are shown that reduce claimed performance or balancing mechanism to quantities fitted from evaluation data or prior self-citations. The contribution is framed as an empirical architectural design with benchmark validation, remaining self-contained without load-bearing reductions to inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- adaptive fusion weights in ASNA
axioms (1)
- domain assumption Instance normalization reduces distribution shifts and highlights common patterns across samples
invented entities (1)
-
Adaptive Stationary-Nonstationary Attention (ASNA)
no independent evidence
Reference graph
Works this paper leans on
-
[1]
[Caoet al., 2025 ] Xin Cao, Qinghua Tao, Yingjie Zhou, Lu Zhang, Le Zhang, Dongjin Song, Dapeng Oliver Wu, and Ce Zhu. From dense to sparse: Event response for en- hanced residential load forecasting.IEEE Transactions on Instrumentation and Measurement,
work page 2025
-
[2]
[Chenet al., 2024 ] Peng Chen, Yingying Zhang, Yunyao Cheng, Yang Shu, Yihang Wang, Qingsong Wen, Bin Yang, and Chenjuan Guo. Pathformer: Multi-scale trans- formers with adaptive pathways for time series forecast- ing.arXiv preprint arXiv:2402.05956,
-
[3]
Dish-ts: a general paradigm for alleviating distribution shift in time series forecasting
[Fanet al., 2023 ] Wei Fan, Pengyang Wang, Dongkun Wang, Dongjie Wang, Yuanchun Zhou, and Yanjie Fu. Dish-ts: a general paradigm for alleviating distribution shift in time series forecasting. InProceedings of the AAAI conference on artificial intelligence, volume 37, pages 7522–7529,
work page 2023
-
[4]
[Fanet al., 2024 ] Wei Fan, Kun Yi, Hangting Ye, Zhiyuan Ning, Qi Zhang, and Ning An. Deep frequency derivative learning for non-stationary time series forecasting.arXiv preprint arXiv:2407.00502,
-
[5]
Sin: Selective and interpretable normalization for long- term time series forecasting
[Hanet al., 2024 ] Lu Han, Han-Jia Ye, and De-Chuan Zhan. Sin: Selective and interpretable normalization for long- term time series forecasting. InForty-first International Conference on Machine Learning,
work page 2024
-
[6]
Re- versible instance normalization for accurate time-series forecasting against distribution shift
[Kimet al., 2021 ] Taesung Kim, Jinhee Kim, Yunwon Tae, Cheonbok Park, Jang-Ho Choi, and Jaegul Choo. Re- versible instance normalization for accurate time-series forecasting against distribution shift. InInternational con- ference on learning representations,
work page 2021
-
[7]
Battling the non-stationarity in time se- ries forecasting via test-time adaptation
[Kimet al., 2025 ] HyunGi Kim, Siwon Kim, Jisoo Mok, and Sungroh Yoon. Battling the non-stationarity in time se- ries forecasting via test-time adaptation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 17868–17876,
work page 2025
-
[8]
[Linet al., 2025 ] Sida Lin, Yankai Chen, Yiyan Qi, Chen- hao Ma, Bokai Cao, Yifei Zhang, Xue Liu, and Jian Guo. Cspo: Cross-market synergistic stock price move- ment forecasting with pseudo-volatility optimization. In Companion Proceedings of the ACM on Web Conference 2025, pages 354–363,
work page 2025
-
[9]
[Liuet al., 2022 ] Yong Liu, Haixu Wu, Jianmin Wang, and Mingsheng Long. Non-stationary transformers: Explor- ing the stationarity in time series forecasting.Advances in neural information processing systems, 35:9881–9893,
work page 2022
-
[10]
iTransformer: Inverted Transformers Are Effective for Time Series Forecasting
[Liuet al., 2023a ] Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. itransformer: Inverted transformers are effective for time series forecasting.arXiv preprint arXiv:2310.06625,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
[Liuet al., 2024 ] Peiyuan Liu, Beiliang Wu, Yifan Hu, Naiqi Li, Tao Dai, Jigang Bao, and Shu-tao Xia. Timebridge: Non-stationarity matters for long-term time series fore- casting.arXiv preprint arXiv:2410.04442,
-
[12]
U-mixer: An unet-mixer architecture with stationarity correction for time series forecasting
[Maet al., 2024 ] Xiang Ma, Xuemei Li, Lexin Fang, Tian- long Zhao, and Caiming Zhang. U-mixer: An unet-mixer architecture with stationarity correction for time series forecasting. InProceedings of the AAAI conference on ar- tificial intelligence, volume 38, pages 14255–14262,
work page 2024
-
[13]
A Time Series is Worth 64 Words: Long-term Forecasting with Transformers
[Nieet al., 2022 ] Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers.arXiv preprint arXiv:2211.14730,
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[14]
Duet: Dual cluster- ing enhanced multivariate time series forecasting
[Qiuet al., 2025 ] Xiangfei Qiu, Xingjian Wu, Yan Lin, Chenjuan Guo, Jilin Hu, and Bin Yang. Duet: Dual cluster- ing enhanced multivariate time series forecasting. InPro- ceedings of the 31st ACM SIGKDD Conference on Knowl- edge Discovery and Data Mining V . 1, pages 1185–1196,
work page 2025
-
[15]
Timemixer++: A general time series pattern machine for universal predictive analysis
[Wanget al., 2024 ] Shiyu Wang, Jiawei Li, Xiaoming Shi, Zhou Ye, Baichuan Mo, Wenze Lin, Shengtong Ju, Zhix- uan Chu, and Ming Jin. Timemixer++: A general time series pattern machine for universal predictive analysis. arXiv preprint arXiv:2410.16032,
-
[16]
Fredf: Learning to forecast in the frequency domain
[Wanget al., 2025 ] Hao Wang, Licheng Pan, Zhichao Chen, Degui Yang, Sen Zhang, Yifei Yang, Xinggao Liu, Haox- uan Li, and Dacheng Tao. Fredf: Learning to forecast in the frequency domain. InICLR,
work page 2025
-
[17]
[Wuet al., 2021 ] Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transform- ers with auto-correlation for long-term series forecast- ing.Advances in neural information processing systems, 34:22419–22430,
work page 2021
-
[18]
Fits: Modeling time series with10kparameters.arXiv preprint arXiv:2307.03756,
[Xuet al., 2023 ] Zhijian Xu, Ailing Zeng, and Qiang Xu. Fits: Modeling time series with10kparameters.arXiv preprint arXiv:2307.03756,
-
[19]
[Yeet al., 2024 ] Weiwei Ye, Songgaojun Deng, Qiaosha Zou, and Ning Gui. Frequency adaptive normalization for non-stationary time series forecasting.Advances in Neural Information Processing Systems, 37:31350–31379,
work page 2024
-
[20]
[Zenget al., 2023 ] Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series fore- casting? InProceedings of the AAAI conference on artifi- cial intelligence, volume 37, pages 11121–11128,
work page 2023
-
[21]
[Zhang and Yan, 2023] Yunhao Zhang and Junchi Yan. Crossformer: Transformer utilizing cross-dimension de- pendency for multivariate time series forecasting. InInter- national Conference on Learning Representations, 2023
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.