
arxiv: 2508.12247 · v2 · pith:7RAUPXFDnew · submitted 2025-08-17 · 💻 cs.LG · cs.AI

STM3: Mixture of Multiscale Mamba for Long-Term Spatio-Temporal Time-Series Prediction

Pith reviewed 2026-05-21 22:12 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords spatio-temporal prediction · mixture of experts · Mamba · multiscale modeling · time series forecasting · graph causal network · disentangled learning · long-term dependencies

The pith

STM3 integrates multiscale Mamba inside a disentangled mixture-of-experts framework to model long-term spatio-temporal time series dependencies more efficiently.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

STM3 addresses the difficulty of efficiently extracting multiscale information from long temporal sequences and modeling their correlations across nodes in spatio-temporal time series. The model places a Multiscale Mamba architecture inside a Disentangled Mixture-of-Experts framework and pairs it with an adaptive graph causal network for spatial relations. Stable routing and causal contrastive learning ensure each expert learns distinct patterns, which the authors prove leads to smoother routing and better disentanglement, resulting in state-of-the-art accuracy on multiple prediction benchmarks.

Core claim

STM3 integrates a Multiscale Mamba architecture within a novel Disentangled Mixture-of-Experts (DMoE) framework to capture diverse multiscale information efficiently, while utilizing an adaptive graph causal network to model complex spatial dependencies. To ensure robust representation learning, a stable routing strategy and a causal contrastive learning strategy work with hierarchical information aggregation to guarantee scale distinguishability. The authors theoretically prove that STM3 achieves superior routing smoothness and guarantees pattern disentanglement for each expert, delivering state-of-the-art results on 10 real-world benchmarks, including improvements of 7.1% in MAE, 8.5% in RMSE, and 15.9% in MAPE over the second-best model on PEMSD8.

What carries the argument

Disentangled Mixture-of-Experts (DMoE) framework with embedded Multiscale Mamba architecture and adaptive graph causal network, which disentangles multiscale temporal patterns through expert specialization and hierarchical aggregation.
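
To make the routing idea concrete, here is a minimal sketch of temperature-scaled softmax gating with top-k expert selection, in plain Python. It is a generic mixture-of-experts gate, not the paper's stable routing strategy. The function names (`softmax`, `route_top_k`, `mix_experts`), the choice `k=2`, and the toy scalar experts are all illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; a lower temperature gives sharper routing."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(logits, k=2, temperature=1.0):
    """Return {expert_index: weight} for the k highest-scoring experts.

    The kept weights are renormalized so they sum to 1.
    """
    probs = softmax(logits, temperature)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}


def mix_experts(x, experts, gate_logits, k=2, temperature=1.0):
    """Weighted sum of the selected experts' outputs for input x."""
    weights = route_top_k(gate_logits, k, temperature)
    return sum(w * experts[i](x) for i, w in weights.items())


# Hypothetical experts (placeholders, not the paper's Mamba experts).
experts = [lambda x: x, lambda x: 0.5 * x, lambda x: 0.1 * x]
y = mix_experts(2.0, experts, gate_logits=[2.0, 1.0, -1.0], k=2)
```

In a real model the gate logits would come from a learned network applied to the input; here they are passed in directly to keep the sketch self-contained.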

If this is right

  • Efficient extraction of multiscale temporal information from long sequences without quadratic scaling costs.
  • Effective modeling of highly correlated multiscale information across different spatial nodes via the graph causal network.
  • Guaranteed scale distinguishability and expert specialization through the combination of stable routing and causal contrastive learning.
  • State-of-the-art empirical results across 10 diverse real-world spatio-temporal benchmarks.
  • Theoretical guarantees on routing smoothness that support reliable expert assignment during inference.
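
The first consequence, linear rather than quadratic cost in sequence length, can be illustrated with a toy state space recurrence. The sketch below runs a scalar recurrence in a single pass over the sequence and applies it to average-pooled copies of the input at several strides. This is a simplification under stated assumptions: Mamba's selective scan uses input-dependent parameters and vector states, and the paper's multiscale mechanism may differ. The parameters `a`, `b`, `c` and the strides `(1, 2, 4)` are hypothetical.

```python
def ssm_scan(xs, a=0.9, b=0.1, c=1.0):
    """Scalar linear recurrence h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    One pass over the sequence, so the cost is O(T) in its length.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def downsample(xs, stride):
    """Average non-overlapping windows of length `stride`.

    A trailing remainder shorter than `stride` is dropped.
    """
    return [
        sum(xs[i:i + stride]) / stride
        for i in range(0, len(xs) - stride + 1, stride)
    ]


def multiscale_features(xs, strides=(1, 2, 4)):
    """Run the same scan on coarser and coarser views of the sequence."""
    return {s: ssm_scan(downsample(xs, s)) for s in strides}


features = multiscale_features([1.0] * 8)
```

Each scale costs time proportional to its own (shortened) length, so the total stays linear in the original sequence length.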

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The disentanglement approach could transfer to other mixture-of-experts architectures in sequential domains such as video or sensor forecasting.
  • The causal contrastive component may improve interpretability of expert specialization in long-horizon prediction tasks.
  • Hybridizing the multiscale Mamba backbone with additional graph layers could extend applicability to even denser spatial graphs.
  • The efficiency gains from Mamba may allow deployment on resource-constrained edge devices for real-time spatio-temporal monitoring.

Load-bearing premise

The stable routing strategy together with causal contrastive learning is assumed to guarantee both routing smoothness and pattern disentanglement for each expert.

What would settle it

The central performance and disentanglement claims would be falsified if ablation experiments on PEMSD8 showed no measurable degradation in MAE, RMSE, or MAPE when the stable routing or causal contrastive learning modules are removed.
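
For readers who want to check such a comparison, the three error metrics and a relative-improvement figure can be computed as in the sketch below. It assumes the reported percentages are relative reductions in error against the second-best model, which the abstract implies but does not define. Skipping near-zero targets in MAPE is a common convention in traffic benchmarks and is an assumption here, as are the function names.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def mape(y_true, y_pred, eps=1e-8):
    """Mean absolute percentage error, skipping targets with |t| <= eps."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if abs(t) > eps]
    if not pairs:
        raise ValueError("no targets with magnitude above eps")
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in pairs) / len(pairs)


def relative_improvement(baseline_error, model_error):
    """Percentage reduction in a lower-is-better error metric."""
    return 100.0 * (baseline_error - model_error) / baseline_error
```

A falsifying ablation in the sense above would be one where `relative_improvement(ablated_error, full_error)` is indistinguishable from zero for all three metrics.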

Figures

Figures reproduced from arXiv: 2508.12247 by Guangxu Zhu, Haolong Chen, Liang Zhang, Zhengyuan Xin.

Figure 1: Main structure of STM3. … where $h_{\mathrm{ms}}^{(q)} \in \mathbb{R}^{T \times d_{\mathrm{inner}}}$ and $h^{(q)} \in \mathbb{R}^{T \times d_{\mathrm{inner}}}$ denote the input and output feature sequences at scale $q$. We then stack the outputs to obtain $h \in \mathbb{R}^{T \times d_{\mathrm{inner}} \times Q}$, with symbols consistent with Section 4.2. Through scale amplification, the maximum scale expands to $s_0^{(Q)} [s^{(Q)}]^{L}$, where $L$ denotes the layer index of the backbone where the multiscale Mamba module is deployed, …
Figure 2: The comparison between two routing strategies.
Figure 3: Ablation study of STM3. … for optimal spatio-temporal time-series prediction. More ablation study results are detailed in Appendix D.1. 5.4 Hyperparameter Study (RQ3). As shown in …
Figure 6: STM3’s multiscale feature extraction. (a) Expert assignment. (b) Loss
Figure 7: Comparison of routing strategies. 5.5 In-Depth Analysis (RQ4 & RQ5). Expert-Wise Effectiveness. To validate MMM’s expert-wise effectiveness to model complex spatio-temporal patterns, we visualized STM3’s first-layer features using t-SNE [40]. Figure 5a shows distinct feature clusters for each expert, confirming effective pattern disentanglement. Figure 5b further illustrates the gating network’s discrim…
Figure 5: MMM’s feature extraction across experts.
Figure 8: Ablation study of STM3
Figure 9: Hyperparameter analysis of STM3
Original abstract

Recently, spatio-temporal time-series prediction has developed rapidly, yet existing deep learning methods struggle with learning complex long-term spatio-temporal dependencies efficiently. The long-term spatio-temporal dependency learning brings two new challenges: 1) The long-term temporal sequence naturally includes multiscale information, which is hard to extract efficiently; 2) The multiscale temporal information from different nodes is highly correlated and hard to model. To address these challenges, we propose Spatio-Temporal Mixture of Multiscale Mamba (STM3). STM3 integrates a Multiscale Mamba architecture within a novel Disentangled Mixture-of-Experts (DMoE) framework to capture diverse multiscale information efficiently, while utilizing an adaptive graph causal network to model complex spatial dependencies. To ensure robust representation learning, we introduce a stable routing strategy and a causal contrastive learning strategy, which work in tandem with hierarchical information aggregation to guarantee scale distinguishability. We theoretically prove that STM3 achieves superior routing smoothness and guarantees pattern disentanglement for each expert. Extensive experiments on 10 real-world benchmarks across domains demonstrate STM3's superior performance, achieving state-of-the-art results in long-term spatio-temporal time-series prediction. Notably, on the PEMSD8 dataset, it achieves significant improvements, surpassing the second-best model by 7.1% in MAE, 8.5% in RMSE, and 15.9% in MAPE. Code is available at https://github.com/IfReasonable/STM3_KDD26.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes STM3, which integrates a Multiscale Mamba architecture inside a Disentangled Mixture-of-Experts (DMoE) framework together with an adaptive graph causal network, a stable routing strategy, and causal contrastive learning. The central claims are that these components efficiently capture multiscale long-term spatio-temporal dependencies, that the authors provide theoretical proofs of superior routing smoothness and pattern disentanglement for each expert, and that the model achieves state-of-the-art results on ten real-world benchmarks, including a 7.1% MAE, 8.5% RMSE, and 15.9% MAPE improvement over the second-best model on PEMSD8.

Significance. If the theoretical guarantees on routing smoothness and expert disentanglement can be verified and the reported gains are shown to be robust via ablations and statistical reporting, the work would offer a scalable Mamba-based approach for long-horizon spatio-temporal forecasting. The public release of code at the cited GitHub repository strengthens reproducibility.

major comments (3)
  1. [Theoretical Analysis] Theoretical Analysis section: the proof that stable routing plus causal contrastive learning guarantees both routing smoothness and pattern disentanglement for each expert is presented as load-bearing for attributing the observed gains to the DMoE mechanisms rather than increased capacity or standard Mamba scaling; however, the derivation relies on unverified assumptions about how these components interact with multiscale inputs under realistic spatio-temporal correlations, and no empirical validation of those assumptions is provided.
  2. [Experiments] Experiments section, results tables (e.g., PEMSD8 row): the reported improvements (7.1% MAE, 8.5% RMSE, 15.9% MAPE) are given without error bars, standard deviations from multiple random seeds, or statistical significance tests, which is required to establish that the gains are reliable rather than artifacts of a single run.
  3. [Ablation studies] Ablation studies subsection: the manuscript lacks detailed ablations that isolate the contribution of the stable routing strategy and causal contrastive loss from the base Multiscale Mamba and DMoE components; without these, the central claim that the proposed mechanisms are responsible for the performance edge cannot be substantiated.
minor comments (2)
  1. [Methodology] Figure captions for the model architecture diagram could more explicitly label the hierarchical information aggregation and the flow of the causal contrastive loss.
  2. [Experiments] Ensure that all baseline methods in the experimental tables include their original publication references and hyper-parameter settings used for fair comparison.
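
The statistical reporting requested in major comment 2 can be sketched as follows: per-metric mean and sample standard deviation over seeds, a paired t statistic comparing two models run on the same seeds, and a Bonferroni-adjusted significance threshold. This is a minimal standard-library sketch, not the authors' evaluation code. Pairing runs by seed is an assumption, and converting the t statistic to a p-value is left to a statistics library.

```python
import math
import statistics


def mean_std(values):
    """Mean and sample standard deviation (needs at least two values)."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for matched samples.

    a[i] and b[i] are assumed to come from the same seed or data split.
    """
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        raise ValueError("differences have zero variance")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


def bonferroni_alpha(alpha, num_tests):
    """Per-test significance level when running num_tests comparisons."""
    return alpha / num_tests
```

With ten benchmarks and a family-wise level of 0.05, each individual comparison would be tested at `bonferroni_alpha(0.05, 10)`.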

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and outline the revisions we will make to strengthen the manuscript, particularly around theoretical validation, statistical reporting, and ablation depth.

Point-by-point responses
  1. Referee: [Theoretical Analysis] Theoretical Analysis section: the proof that stable routing plus causal contrastive learning guarantees both routing smoothness and pattern disentanglement for each expert is presented as load-bearing for attributing the observed gains to the DMoE mechanisms rather than increased capacity or standard Mamba scaling; however, the derivation relies on unverified assumptions about how these components interact with multiscale inputs under realistic spatio-temporal correlations, and no empirical validation of those assumptions is provided.

    Authors: We agree that linking the theoretical guarantees more explicitly to empirical behavior strengthens the attribution of gains to the proposed mechanisms. The proofs rely on standard MoE assumptions about input separability and correlation structure, which align with the multiscale spatio-temporal setting in our model design. In the revision we will add a new subsection with empirical validation: correlation heatmaps across scales on the benchmark datasets and a sensitivity study showing how routing smoothness and expert specialization respond to controlled changes in multiscale correlation strength. This will make the assumptions verifiable without altering the core proofs. revision: yes

  2. Referee: [Experiments] Experiments section, results tables (e.g., PEMSD8 row): the reported improvements (7.1% MAE, 8.5% RMSE, 15.9% MAPE) are given without error bars, standard deviations from multiple random seeds, or statistical significance tests, which is required to establish that the gains are reliable rather than artifacts of a single run.

    Authors: We concur that single-run results limit confidence in the reported margins. We will re-execute all experiments using five independent random seeds, report mean ± standard deviation for every metric and dataset, and add paired t-test p-values (with Bonferroni correction) comparing STM3 against the second-best baseline on the primary benchmarks including PEMSD8. These additions will appear in the updated tables and a new statistical analysis paragraph. revision: yes

  3. Referee: [Ablation studies] Ablation studies subsection: the manuscript lacks detailed ablations that isolate the contribution of the stable routing strategy and causal contrastive loss from the base Multiscale Mamba and DMoE components; without these, the central claim that the proposed mechanisms are responsible for the performance edge cannot be substantiated.

    Authors: We accept that finer-grained isolation is needed to substantiate the contribution of each new component. We will expand the ablation section with three additional controlled variants on all ten benchmarks: (i) Multiscale Mamba + DMoE without stable routing, (ii) Multiscale Mamba + DMoE with stable routing but without causal contrastive loss, and (iii) the full STM3 model. Performance deltas and routing statistics will be reported to quantify the incremental benefit of each element while holding model capacity fixed. revision: yes
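
The three variants promised in the last response can be written down as a small configuration grid, which makes the intended comparison explicit. The sketch below is hypothetical: the variant names and flag names are invented for illustration, and it assumes variant (i) disables both stable routing and the causal contrastive loss, which the response leaves ambiguous.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """One ablation setting; field names are illustrative only."""
    name: str
    stable_routing: bool
    causal_contrastive: bool


# Ordered from least to most complete, following (i), (ii), (iii).
VARIANTS = [
    Variant("dmoe_base", stable_routing=False, causal_contrastive=False),
    Variant("plus_stable_routing", stable_routing=True, causal_contrastive=False),
    Variant("full_stm3", stable_routing=True, causal_contrastive=True),
]


def incremental_gains(errors, order):
    """Error reduction between consecutive variants.

    errors maps variant name to a lower-is-better metric such as MAE.
    Returns a list of (from_name, to_name, reduction) tuples.
    """
    return [(a, b, errors[a] - errors[b]) for a, b in zip(order, order[1:])]
```

Given measured errors for each variant, `incremental_gains(errors, [v.name for v in VARIANTS])` attributes one reduction to stable routing and one to the causal contrastive loss, provided model capacity is held fixed as the authors state.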

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper introduces STM3 with a Multiscale Mamba inside a Disentangled Mixture-of-Experts (DMoE) framework, plus stable routing and causal contrastive learning. It claims to theoretically prove superior routing smoothness and pattern disentanglement, but these proofs are presented as internal derivations rather than reductions to fitted parameters or prior self-citations. Performance improvements are reported as empirical results on 10 benchmarks (e.g., PEMSD8 gains), not as predictions forced by construction from inputs. No equation or claim reduces the SOTA attribution directly to a hyper-parameter fit or renames a known result via new coordinates. The central mechanisms are motivated by stated challenges in long-term spatio-temporal dependencies and are not shown to be equivalent to their inputs by the paper's own text. This is a self-contained architectural proposal with independent empirical validation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The model rests on the prior Mamba sequence model, standard graph neural network assumptions, and several newly introduced mechanisms whose effectiveness is asserted rather than derived from first principles.

free parameters (1)
  • expert routing temperature and contrastive loss weight
    Hyper-parameters that control the stable routing and disentanglement objectives are chosen during training.
axioms (1)
  • Domain assumption: Mamba blocks can capture long-range temporal dependencies at multiple scales when stacked appropriately
    Invoked when the multiscale Mamba architecture is introduced.
invented entities (1)
  • Disentangled Mixture-of-Experts (DMoE) with stable routing (no independent evidence)
    purpose: To separate multiscale temporal patterns across experts
    New component introduced to address the correlation challenge stated in the abstract.

pith-pipeline@v0.9.0 · 5808 in / 1310 out tokens · 58636 ms · 2026-05-21T22:12:19.236360+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PIMSM: Physics-Informed Multi-Scale Mamba for Stable Neural Representations under Distribution Shift

    cs.LG 2026-05 unverdicted novelty 6.0

    PIMSM is a Mamba-based architecture that maps knee frequencies from spectra to multi-scale discretization parameters to reduce representation drift under distribution shifts in fMRI and weather forecasting.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · cited by 1 Pith paper · 5 internal anchors

  1. [1]

    Khaled Alkilane, Yihang He, and Der-Horng Lee. 2024. MixMamba: Time series modeling with adaptive expertise. Information Fusion 112 (2024), 102589

  2. [2]

    Lei Bai, Lina Yao, Can Li, Xianzhi Wang, and Can Wang. 2020. Adaptive graph convolutional recurrent network for traffic forecasting. Advances in neural information processing systems 33 (2020), 17804–17815

  3. [3]

    Gianni Barlacchi, Marco De Nadai, Roberto Larcher, Antonio Casella, Cristiana Chitic, Giovanni Torrisi, Fabrizio Antonelli, Alessandro Vespignani, Alex Pentland, and Bruno Lepri. 2015. A multi-source dataset of urban life in the city of Milan and the Province of Trentino. Scientific data 2, 1 (2015), 1–15

  4. [4]

    Xiuding Cai, Yaoyao Zhu, Xueyao Wang, and Yu Yao. 2024. MambaTS: improved selective state space models for long-term time series forecasting. arXiv preprint arXiv:2405.16440 (2024)

  5. [5]

    Chao Chen, Karl Petty, Alexander Skabardonis, Pravin Varaiya, and Zhanfeng Jia. 2001. Freeway performance measurement system: mining loop detector data. Transportation research record 1748, 1 (2001), 96–102

  6. [6]

    Min Chen, Guansong Pang, Wenjun Wang, and Cheng Yan. 2025. Information Bottleneck-guided MLPs for Robust Spatial-temporal Forecasting. In Forty-second International Conference on Machine Learning

  7. [7]

    Peng Chen, Yingying Zhang, Yunyao Cheng, Yang Shu, Yihang Wang, Qingsong Wen, Bin Yang, and Chenjuan Guo. 2023. Pathformer: Multi-scale Transformers with Adaptive Pathways for Time Series Forecasting. In The Twelfth International Conference on Learning Representations

  8. [8]

    Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, and Yuanzhi Li. 2022. Towards understanding mixture of experts in deep learning. arXiv preprint arXiv:2208.02813 (2022)

  9. [9]

    Jeongwhan Choi, Hwangyong Choi, Jeehyun Hwang, and Noseong Park. 2022. Graph neural controlled differential equations for traffic forecasting. In Proceedings of the AAAI conference on artificial intelligence, Vol. 36. 6367–6374

  10. [10]

    Jinhyeok Choi, Heehyeon Kim, Minhyeong An, and Joyce Jiyoung Whang. 2024. Spot-mamba: Learning long-range dependency on spatio-temporal graphs with selective state spaces. arXiv preprint arXiv:2406.11244 (2024)

  11. [11]

    Damai Dai, Chengqi Deng, Chenggang Zhao, Rx Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y Wu, et al. 2024. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1280–1297

  12. [12]

    Tri Dao, Daniel Y Fu, Khaled K Saab, Armin W Thomas, Atri Rudra, and Christopher Ré. 2023. Hungry Hungry Hippos: Towards Language Modeling with State Space Models. In Proceedings of the 11th International Conference on Learning Representations (ICLR)

  13. [13]

    Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. GLaM: Efficient scaling of language models with mixture-of-experts. In International conference on machine learning. PMLR, 5547–5569

  14. [14]

    Yuchen Fang, Yanjun Qin, Haiyong Luo, Fang Zhao, Bingbing Xu, Liang Zeng, and Chenxing Wang. 2023. When spatio-temporal meet wavelets: Disentangled traffic forecasting via efficient spectral graph attention networks. In 2023 IEEE 39th international conference on data engineering (ICDE). IEEE, 517–529

  15. [15]

    William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 23, 120 (2022), 1–39

  16. [16]

    Albert Gu and Tri Dao. 2024. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. In First Conference on Language Modeling

  17. [17]

    Albert Gu, Karan Goel, and Christopher Ré. 2021. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396 (2021)

  18. [18]

    Shengnan Guo, Youfang Lin, Ning Feng, Chao Song, and Huaiyu Wan. 2019. Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 922–929

  19. [19]

    Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. Adaptive mixtures of local experts. Neural computation 3, 1 (1991), 79–87

  20. [20]

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. 2024. Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024)

  21. [21]

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668 (2020)

  22. [22]

    Dongyuan Li, Shiyin Tan, Ying Zhang, Ming Jin, Shirui Pan, Manabu Okumura, and Renhe Jiang. 2024. Dyg-mamba: Continuous state space modeling on dynamic graphs. arXiv preprint arXiv:2408.06966 (2024)

  23. [23]

    Hongbo Li, Sen Lin, Lingjie Duan, Yingbin Liang, and Ness B Shroff. 2024. Theory on mixture-of-experts in continual learning. arXiv preprint arXiv:2406.16437 (2024)

  24. [24]

    Lincan Li, Hanchen Wang, Wenjie Zhang, and Adelle Coster. 2024. Stg-mamba: Spatial-temporal graph learning via selective state space model. arXiv preprint arXiv:2403.12418 (2024)

  25. [25]

    Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. 2018. Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. In International Conference on Learning Representations

  26. [26]

    Zhonghang Li, Lianghao Xia, Yong Xu, and Chao Huang. 2023. GPT-ST: generative pre-training of spatio-temporal graph neural networks. Advances in neural information processing systems 36 (2023), 70229–70246

  27. [27]

    Dachuan Liu, Jin Wang, Shuo Shang, and Peng Han. 2022. Msdr: Multi-step dependency relation networks for spatial temporal forecasting. In Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining. 1042–1050

  28. [28]

    Hangchen Liu, Zheng Dong, Renhe Jiang, Jiewen Deng, Jinliang Deng, Quanjun Chen, and Xuan Song. 2023. Spatio-temporal adaptive embedding makes vanilla transformer sota for traffic forecasting. In Proceedings of the 32nd ACM international conference on information and knowledge management. 4125–4129

  29. [29]

    Shang Liu, Miao He, Zhiqiang Wu, Peng Lu, and Weixi Gu. 2024. Spatial–temporal graph neural network traffic prediction based load balancing with reinforcement learning in cellular networks. Information Fusion 103 (2024), 102079

  30. [30]

    Ali Mehrabian, Shahab Bahrami, and Vincent WS Wong. 2023. A dynamic Bernstein graph recurrent network for wireless cellular traffic prediction. In ICC 2023-IEEE International Conference on Communications. IEEE, 3842–3847

  31. [31]

    Ali Mehrabian and Vincent WS Wong. 2025. A-Gamba: An Adaptive Graph-Mamba Model for Traffic Prediction in Wireless Cellular Networks. IEEE Wireless Communications Letters (2025)

  32. [32]

    Harsh Mehta, Ankit Gupta, Ashok Cutkosky, and Behnam Neyshabur. 2023. Long Range Language Modeling via Gated State Spaces. In International Conference on Learning Representations

  33. [33]

    Huy Nguyen, Pedram Akbarian, Fanqi Yan, and Nhat Ho. 2023. Statistical perspective of top-k sparse softmax gating mixture of experts. arXiv preprint arXiv:2309.13850 (2023)

  34. [34]

    Huy Nguyen, Nhat Ho, and Alessandro Rinaldo. 2024. On least square estimation in softmax gating mixture of experts. arXiv preprint arXiv:2402.02952 (2024)

  35. [35]

    Mohammad Amin Shabani, Amir H Abdi, Lili Meng, and Tristan Sylvain. [n. d.]. Scaleformer: Iterative Multi-scale Refining Transformers for Time Series Forecasting. In The Eleventh International Conference on Learning Representations

  36. [36]

    Zhi Sheng, Yuan Yuan, Jingtao Ding, and Yong Li. 2025. Unveiling the Power of Noise Priors: Enhancing Diffusion Models for Mobile Traffic Prediction. arXiv preprint arXiv:2501.13794 (2025)

  37. [37]

    Xiaoming Shi, Shiyu Wang, Yuqi Nie, Dianqi Li, Zhou Ye, Qingsong Wen, and Ming Jin. 2024. Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts. arXiv preprint arXiv:2409.16040 (2024)

  38. [38]

    Jimmy TH Smith, Andrew Warrington, and Scott W Linderman. 2023. Simplified State Space Layers for Sequence Modeling. In ICLR

  39. [39]

    Chao Song, Youfang Lin, Shengnan Guo, and Huaiyu Wan. 2020. Spatial-temporal synchronous graph convolutional networks: A new framework for spatial-temporal network data forecasting. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34. 914–921

  40. [40]

    Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, 11 (2008)

  41. [41]

    Shuo Wang, Yanran Li, Jiang Zhang, Qingye Meng, Lingwei Meng, and Fei Gao. PM2.5-GNN: A domain knowledge enhanced graph neural network for PM2.5 forecasting. In Proceedings of the 28th international conference on advances in geographic information systems. 163–166

  42. [42]

    Tan Wang, Zhongqi Yue, Jianqiang Huang, Qianru Sun, and Hanwang Zhang. Self-supervised learning disentangled group representation as feature. Advances in Neural Information Processing Systems 34 (2021), 18225–18240

  43. [43]

    Yuankai Wu, Dingyi Zhuang, Aurelie Labbe, and Lijun Sun. 2021. Inductive graph neural networks for spatiotemporal kriging. In Proceedings of the AAAI conference on artificial intelligence, Vol. 35. 4478–4485

  44. [44]

    Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, Xiaojun Chang, and Chengqi Zhang. 2020. Connecting the dots: Multivariate time series forecasting with graph neural networks. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. 753–763

  45. [45]

    Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, and Chengqi Zhang. 2019. Graph wavenet for deep spatial-temporal graph modeling. arXiv preprint arXiv:1906.00121 (2019)

  46. [46]

    Xiongxiao Xu, Canyu Chen, Yueqing Liang, Baixiang Huang, Guangji Bai, Liang Zhao, and Kai Shu. 2024. SST: Multi-Scale Hybrid Mamba-Transformer Experts for Long-Short Range Time Series Forecasting. arXiv preprint arXiv:2404.14757 (2024)

  47. [47]

    Yang Yao, Bo Gu, Zhou Su, and Mohsen Guizani. 2021. MVSTGN: A multi-view spatial-temporal graph network for cellular traffic prediction. IEEE Transactions on Mobile Computing 22, 5 (2021), 2837–2849

  48. [48]

    Bing Yu, Haoteng Yin, and Zhanxing Zhu. 2017. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. arXiv preprint arXiv:1709.04875 (2017)

  49. [49]

    Haonan Yuan, Qingyun Sun, Zhaonan Wang, Xingcheng Fu, Cheng Ji, Yongjian Wang, Bo Jin, and Jianxin Li. 2025. DG-Mamba: Robust and Efficient Dynamic Graph Structure Learning with Selective State Space Models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 22272–22280

  50. [50]

    Zijian Zhang, Ze Huang, Zhiwei Hu, Xiangyu Zhao, Wanyu Wang, Zitao Liu, Junbo Zhang, S Joe Qin, and Hongwei Zhao. 2023. Mlpst: Mlp is all you need for spatio-temporal prediction. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 3381–3390

  51. [51]

    Zijian Zhang, Xiangyu Zhao, Qidong Liu, Chunxu Zhang, Qian Ma, Wanyu Wang, Hongwei Zhao, Yiqi Wang, and Zitao Liu. 2023. Promptst: Prompt-enhanced spatio-temporal multi-attribute prediction. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 3195–3205

  52. [52]

    Ling Zhao, Yujiao Song, Chao Zhang, Yu Liu, Pu Wang, Tao Lin, Min Deng, and Haifeng Li. 2019. T-GCN: A temporal graph convolutional network for traffic prediction. IEEE transactions on intelligent transportation systems 21, 9 (2019), 3848–3858

  53. [53]

    Barret Zoph. 2022. Designing effective sparse expert models. In 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 1044–1044