pith. machine review for the scientific record.

arxiv: 2604.27981 · v1 · submitted 2026-04-30 · 💻 cs.LG · cs.AI

Recognition: unknown

ITS-Mina: A Harris Hawks Optimization-Based All-MLP Framework with Iterative Refinement and External Attention for Multivariate Time Series Forecasting

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 06:34 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: multivariate time series forecasting · all-MLP architecture · iterative refinement · external attention · Harris Hawks Optimization · dropout tuning · residual mixer

The pith

An all-MLP model using repeated shared layers, learnable-memory attention, and automatic dropout tuning matches or exceeds Transformer accuracy on multivariate time series forecasts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ITS-Mina as a fully MLP-based system for forecasting multiple related time series at once. It repeatedly applies the same residual mixer layers to build richer temporal patterns without adding new parameters each time. It swaps standard self-attention for an external module that stores global information in a small set of learnable memory units, keeping computation linear in sequence length. It also lets a Harris Hawks Optimization routine choose the dropout rate for each dataset. If these pieces deliver the reported gains, they suggest that carefully engineered MLP stacks can replace heavier Transformer designs for tasks such as energy load prediction and traffic flow modeling while using fewer resources.
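The review only names the pieces, so here is a minimal sketch of the weight-sharing idea behind the iterative refinement loop: one residual mixer block, in the spirit of TSMixer-style time and channel mixing, applied several times with the same weights. All names, shapes, and the block composition are assumptions for illustration, not the paper's code.

```python
# Hypothetical PyTorch sketch: a single residual mixer block reused T times,
# so effective depth grows while the parameter count stays fixed.
import torch
import torch.nn as nn


class ResidualMixer(nn.Module):
    """One residual block: an MLP across time steps, then an MLP across channels."""

    def __init__(self, seq_len: int, n_channels: int, hidden: int, dropout: float):
        super().__init__()
        self.time_norm = nn.LayerNorm(n_channels)
        self.time_mlp = nn.Sequential(nn.Linear(seq_len, seq_len), nn.GELU(), nn.Dropout(dropout))
        self.feat_norm = nn.LayerNorm(n_channels)
        self.feat_mlp = nn.Sequential(
            nn.Linear(n_channels, hidden), nn.GELU(), nn.Dropout(dropout), nn.Linear(hidden, n_channels)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq_len, n_channels)
        # Time mixing: transpose so the linear layer acts along the time axis.
        x = x + self.time_mlp(self.time_norm(x).transpose(1, 2)).transpose(1, 2)
        # Channel mixing along the feature axis.
        return x + self.feat_mlp(self.feat_norm(x))


class IterativeRefiner(nn.Module):
    """Reapplies the *same* block n_refinements times, then projects to the horizon."""

    def __init__(self, seq_len: int, n_channels: int, horizon: int,
                 hidden: int = 64, n_refinements: int = 3, dropout: float = 0.1):
        super().__init__()
        self.block = ResidualMixer(seq_len, n_channels, hidden, dropout)
        self.n_refinements = n_refinements
        self.head = nn.Linear(seq_len, horizon)  # per-channel forecast head

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq_len, n_channels)
        for _ in range(self.n_refinements):  # shared weights on every pass
            x = self.block(x)
        return self.head(x.transpose(1, 2)).transpose(1, 2)  # (batch, horizon, n_channels)
```

The point the sketch makes concrete: tripling `n_refinements` triples the compute per forward pass but adds no new parameters, which is the trade the paper leans on.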

Core claim

By combining an iterative refinement loop that reuses a shared-parameter residual mixer stack to deepen temporal representations, an external attention block that captures cross-sample dependencies through learnable memory units at linear cost, and Harris Hawks Optimization to set dataset-specific dropout rates, the resulting all-MLP model attains state-of-the-art or highly competitive accuracy on six standard benchmarks against eleven baseline models across multiple forecasting horizons.

What carries the argument

The iterative refinement mechanism that reapplies a shared-parameter residual mixer stack, together with external attention that uses a fixed set of learnable memory units in place of pairwise self-attention.
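Because the memory units carry the second half of that argument, here is a hedged sketch of external attention in the style of Guo et al. (reference 17): the input attends to a fixed-size learnable memory implemented as two linear layers, so the cost grows linearly with sequence length. The normalization order and module names are assumptions; the paper's exact variant is not described in this review.

```python
# Hypothetical sketch of external attention with a fixed bank of learnable memory units.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExternalAttention(nn.Module):
    def __init__(self, d_model: int, n_memory: int = 64):
        super().__init__()
        self.mem_k = nn.Linear(d_model, n_memory, bias=False)  # memory "keys"
        self.mem_v = nn.Linear(n_memory, d_model, bias=False)  # memory "values"

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq_len, d_model)
        attn = self.mem_k(x)                              # (batch, seq_len, n_memory)
        attn = F.softmax(attn, dim=-1)                    # normalize over memory slots
        attn = attn / (attn.sum(dim=1, keepdim=True) + 1e-9)  # second normalization over the sequence axis (one common variant)
        return self.mem_v(attn)                           # (batch, seq_len, d_model)
```

Every token scores against `n_memory` slots instead of every other token, which is where the linear-in-sequence-length cost comes from.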

If this is right

  • Effective model depth can increase without a matching rise in parameter count or memory footprint.
  • Global dependency modeling becomes possible at linear rather than quadratic complexity in sequence length.
  • Regularization strength can be adapted automatically to each forecasting task without manual intervention.
  • Computational cost for real-time applications in energy and finance can drop while maintaining accuracy.
  • Forecasting pipelines can avoid the training overhead of full self-attention layers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The shared-parameter iteration pattern could be tested on other sequence tasks such as speech or video to reduce model size.
  • Learnable memory units might serve as a lightweight substitute for attention in domains where quadratic scaling currently limits sequence length.
  • If the gains persist under matched tuning budgets, the result would weaken the case for defaulting to Transformer backbones in multivariate forecasting.
  • Evaluating the model on streaming data with concept drift would reveal whether the iterative refinement remains stable in non-stationary settings.

Load-bearing premise

That the reported performance edge comes from the three added components rather than from greater hyperparameter search effort or dataset-specific tuning advantages over the baselines.

What would settle it

An ablation experiment that removes iterative refinement, external attention, or HHO dropout tuning one at a time and measures whether accuracy falls on the same six datasets and horizons, or a re-run of all eleven baselines given identical hyperparameter search budgets.
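A sketch of what that experiment could look like as a harness, with each component toggled off one at a time under an identical tuning budget. `Config` and `train_and_eval` are placeholders standing in for the paper's (unavailable) training code.

```python
# Hypothetical ablation harness for the test the review asks for: drop one
# component at a time, holding the hyperparameter-search budget fixed.
from dataclasses import dataclass, replace
from itertools import product


@dataclass(frozen=True)
class Config:
    iterative_refinement: bool = True
    external_attention: bool = True
    hho_dropout: bool = True       # if False, fall back to a fixed dropout rate
    hpo_trials: int = 50           # identical budget for every variant (and baseline)


def run_ablation(datasets, horizons, train_and_eval):
    full = Config()
    variants = {
        "full model": full,
        "no iterative refinement": replace(full, iterative_refinement=False),
        "no external attention": replace(full, external_attention=False),
        "no HHO dropout tuning": replace(full, hho_dropout=False),
    }
    results = {}
    for (name, cfg), dataset, horizon in product(variants.items(), datasets, horizons):
        # train_and_eval would return, e.g., {"mse": ..., "mae": ...} for each run
        results[(name, dataset, horizon)] = train_and_eval(cfg, dataset, horizon)
    return results
```

If accuracy barely moves when a component is removed, that component is not carrying the reported gains.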

Figures

Figures reproduced from arXiv: 2604.27981 by Aida Pakniyat, Amirhossein Sadr, Dara Rahmati, Pourya Zamanvaziri.

Figure 1: Overall architecture of ITS-Mina. The input is instance-normalized, refined by … (caption truncated at source).
Figure 2: External attention in ITS-Mina (notation aligned with Section 5.1). Rows of … (caption truncated at source).
Original abstract

Multivariate time series forecasting plays a pivotal role in numerous real-world applications, including financial analysis, energy management, and traffic planning. While Transformer-based architectures have gained popularity for this task, recent studies reveal that simpler MLP-based models can achieve competitive or superior performance with significantly reduced computational cost. In this paper, we propose ITS-Mina, a novel all-MLP framework for multivariate time series forecasting that integrates three key innovations: (1) an iterative refinement mechanism that progressively enhances temporal representations by repeatedly applying a shared-parameter residual mixer stack, effectively deepening the model's computational capacity without multiplying the number of distinct parameters; (2) an external attention module that replaces traditional self-attention with learnable memory units, capturing cross-sample global dependencies at linear computational complexity; and (3) a Harris Hawks Optimization (HHO) algorithm for automatic dropout rate tuning, enabling adaptive regularization tailored to each dataset. Extensive experiments on six widely-used benchmark datasets demonstrate that ITS-Mina achieves state-of-the-art or highly competitive performance compared to eleven baseline models across multiple forecasting horizons.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes ITS-Mina, an all-MLP framework for multivariate time series forecasting. It integrates three components: (1) iterative refinement via a shared-parameter residual mixer stack that deepens computation without increasing distinct parameters, (2) an external attention module using learnable memory units to capture cross-sample global dependencies at linear complexity (replacing self-attention), and (3) Harris Hawks Optimization (HHO) for automatic per-dataset dropout tuning. The central claim is that extensive experiments on six benchmark datasets show ITS-Mina achieves state-of-the-art or highly competitive performance versus eleven baselines across multiple forecasting horizons.

Significance. If the empirical results hold after proper controls, the work could meaningfully contribute to the MLP-vs-Transformer debate in time series forecasting by demonstrating that targeted architectural simplifications plus adaptive regularization can match or exceed more complex models at lower cost. The shared-parameter iterative refinement and memory-based external attention are conceptually interesting efficiency ideas; if ablations confirm they add value beyond HHO tuning, this would strengthen the case for parameter-efficient all-MLP designs in practical applications such as energy and traffic forecasting.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): The central performance claim asserts SOTA or competitive results on six datasets versus eleven baselines, yet the abstract supplies no numerical metrics (MAE/MSE), ablation tables, error bars, or statistical significance tests. This is load-bearing because the entire contribution rests on empirical superiority; without these data it is impossible to evaluate whether the three proposed components deliver genuine gains.
  2. [§4 and §3.3] §4 (Experiments) and §3.3 (HHO dropout): No ablation studies isolate the contributions of iterative refinement and external attention from the effects of HHO hyperparameter search on dropout. The skeptic concern is valid here: if the eleven baselines did not receive equivalent automated HPO effort, reported improvements may be artifacts of unequal regularization tuning rather than the all-MLP innovations. A concrete test (e.g., re-tuning baselines with the same HHO budget) is required to support the claim.
  3. [§3.2] §3.2 (External attention): The description of the learnable memory units for cross-sample dependencies lacks any complexity analysis or direct comparison to prior memory-augmented attention mechanisms (e.g., those in existing linear-attention or memory-network literature). Without this, it is unclear whether the linear-complexity claim is novel or merely reimplements known techniques, which bears on the significance of the architectural contribution.
minor comments (2)
  1. [§3] Notation for the shared-parameter residual mixer and the external attention memory units should be introduced with explicit equations and dimension annotations to improve reproducibility.
  2. [§4] The manuscript should include a clear statement of the baseline hyperparameter tuning protocol (grid search, random search, or none) to allow fair comparison with the HHO-tuned dropout.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each of the major comments below and outline the revisions we plan to incorporate in the updated version to strengthen the presentation of our results and contributions.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): The central performance claim asserts SOTA or competitive results on six datasets versus eleven baselines, yet the abstract supplies no numerical metrics (MAE/MSE), ablation tables, error bars, or statistical significance tests. This is load-bearing because the entire contribution rests on empirical superiority; without these data it is impossible to evaluate whether the three proposed components deliver genuine gains.

    Authors: We concur that the abstract would benefit from including key quantitative results. In the revised manuscript, we will modify the abstract to include representative MAE and MSE values demonstrating the performance of ITS-Mina relative to the baselines. The experiments section (§4) contains the full tables with metrics, but we will augment these with error bars representing standard deviations from multiple random seeds and include statistical significance tests (such as paired t-tests or Wilcoxon tests) to validate the improvements. Ablation tables are already included but will be made more prominent and expanded if necessary. revision: yes

  2. Referee: [§4 and §3.3] §4 (Experiments) and §3.3 (HHO dropout): No ablation studies isolate the contributions of iterative refinement and external attention from the effects of HHO hyperparameter search on dropout. The skeptic concern is valid here: if the eleven baselines did not receive equivalent automated HPO effort, reported improvements may be artifacts of unequal regularization tuning rather than the all-MLP innovations. A concrete test (e.g., re-tuning baselines with the same HHO budget) is required to support the claim.

    Authors: We recognize the importance of isolating the contributions of our architectural components from the hyperparameter optimization. The baselines were reproduced using the hyperparameters reported in their respective original papers, following common practice for fair comparisons in the time series forecasting literature. To directly address this, we will add ablation studies in the revised §4 where the dropout rate is fixed (without HHO tuning) and show the incremental benefits of the iterative refinement and external attention modules. We will also include a discussion on the computational cost of applying HHO to the baselines and, if feasible within revision time, provide results for re-tuned versions of a few key baselines. This will help demonstrate that the gains stem from the proposed innovations. revision: partial

  3. Referee: [§3.2] §3.2 (External attention): The description of the learnable memory units for cross-sample dependencies lacks any complexity analysis or direct comparison to prior memory-augmented attention mechanisms (e.g., those in existing linear-attention or memory-network literature). Without this, it is unclear whether the linear-complexity claim is novel or merely reimplements known techniques, which bears on the significance of the architectural contribution.

    Authors: We will revise the description in §3.2 to include a formal complexity analysis. The external attention mechanism using learnable memory units has linear complexity O(N) with respect to the number of samples, as it computes attention between the input and a fixed-size memory bank rather than pairwise among all inputs. We will also provide direct comparisons to related techniques in the memory-augmented networks and linear attention literature (e.g., citing works like Neural Turing Machines, MemNet, and efficient Transformers such as Reformer or Performer), clarifying the distinctions: our memory units are specifically designed for capturing global cross-sample dependencies in a time series context and are integrated within an all-MLP iterative framework. This will better establish the novelty of the approach. revision: yes
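The promised complexity comparison is short enough to state here; this is a back-of-the-envelope version under the usual assumptions (a sequence of N tokens of width d, a memory bank of fixed size M with M much smaller than N), not a quotation from the paper.

```latex
% Per-layer cost of the attention step, under the assumptions above
\begin{aligned}
\text{self-attention:} \quad & \mathcal{O}(N^{2} d) \\
\text{external attention:} \quad & \mathcal{O}(N M d) \;=\; \mathcal{O}(N) \ \text{for fixed } M, d
\end{aligned}
```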

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks

Full rationale

The paper proposes an all-MLP architecture with iterative refinement (shared-parameter residual mixer), external attention (learnable memory units), and HHO-based dropout tuning. Performance claims are supported by experiments on six external benchmark datasets against eleven baselines. No equations, derivations, or self-referential definitions appear in the provided text that reduce a claimed result to its own inputs by construction. HHO is presented as a standard hyperparameter optimizer applied to dropout rates; this does not constitute a fitted-input-called-prediction pattern because the core model components are not defined in terms of the tuned outputs. No self-citations are invoked as load-bearing uniqueness theorems. The paper's empirical claims are grounded in external benchmarks, so the derivation chain contains no circular steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the unverified assumption that the three listed innovations produce additive gains; the abstract supplies no independent evidence for any of them beyond the final benchmark numbers.

free parameters (1)
  • dropout rate
    Automatically selected per dataset by Harris Hawks Optimization rather than fixed in advance or derived from first principles (a hedged tuning sketch follows this ledger).
axioms (1)
  • domain assumption: Recent studies show simpler MLP-based models can match or exceed Transformer performance on time series tasks
    Invoked in the opening paragraph to justify the all-MLP choice.
invented entities (1)
  • external attention module with learnable memory units (no independent evidence)
    purpose: Capture cross-sample global dependencies at linear complexity
    New module introduced to replace self-attention; no independent falsifiable prediction supplied in abstract.
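The ledger flags the HHO-tuned dropout as the one free parameter, so here is a deliberately simplified sketch of how a Harris Hawks loop could pick it against validation loss. Only the exploration and soft/hard-besiege moves of HHO (Heidari et al., 2019) appear; the progressive rapid-dive branches and Lévy flights are omitted, updates are accepted greedily for brevity, and `validation_loss` is a placeholder for "train with this dropout, return validation error".

```python
# Simplified, hypothetical HHO loop for a single scalar (the dropout rate).
import random


def hho_tune_dropout(validation_loss, lb=0.0, ub=0.5, n_hawks=6, n_iters=20):
    hawks = [random.uniform(lb, ub) for _ in range(n_hawks)]
    fitness = [validation_loss(x) for x in hawks]
    best = min(zip(fitness, hawks))                   # (loss, dropout) of the "rabbit"

    for t in range(n_iters):
        mean_pos = sum(hawks) / n_hawks
        for i in range(n_hawks):
            e0 = 2 * random.random() - 1
            energy = 2 * e0 * (1 - t / n_iters)       # escaping energy decays over time
            if abs(energy) >= 1:                      # exploration phase
                if random.random() >= 0.5:
                    x_rand = random.choice(hawks)
                    new = x_rand - random.random() * abs(x_rand - 2 * random.random() * hawks[i])
                else:
                    new = (best[1] - mean_pos) - random.random() * (lb + random.random() * (ub - lb))
            else:                                     # exploitation around the best hawk
                jump = 2 * (1 - random.random())
                if abs(energy) >= 0.5:                # soft besiege
                    new = (best[1] - hawks[i]) - energy * abs(jump * best[1] - hawks[i])
                else:                                 # hard besiege
                    new = best[1] - energy * abs(best[1] - hawks[i])
            new = min(max(new, lb), ub)               # keep dropout inside its bounds
            loss = validation_loss(new)
            if loss < fitness[i]:                     # greedy acceptance (a simplification)
                hawks[i], fitness[i] = new, loss
                best = min(best, (loss, new))
    return best[1]                                    # tuned dropout rate
```

Each call to `validation_loss` is a full training run, so the hawk count and iteration budget are exactly the search effort the referee wants matched across baselines.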

pith-pipeline@v0.9.0 · 5506 in / 1538 out tokens · 35915 ms · 2026-05-07T06:34:22.652136+00:00 · methodology


Reference graph

Works this paper leans on

26 extracted references · 7 canonical work pages · 2 internal anchors

  1. [1] H. Wu, J. Xu, J. Wang, M. Long, Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting, in: Advances in Neural Information Processing Systems, Vol. 34, 2021, pp. 22419–22430.
  2. [2] T. Zhou, Z. Ma, Q. Wen, X. Wang, L. Sun, R. Jin, FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting, in: International Conference on Machine Learning, 2022, pp. 27268–27286.
  3. [3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, Vol. 30, 2017.
  4. [4] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, W. Zhang, Informer: Beyond efficient transformer for long sequence time-series forecasting, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, 2021, pp. 11106–11115.
  5. [5] A. Zeng, M. Chen, L. Zhang, Q. Xu, Are transformers effective for time series forecasting?, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37, 2023, pp. 11121–11128.
  6. [6] I. O. Tolstikhin, N. Houlsby, A. Kolesnikov, L. Beyer, X. Zhai, T. Unterthiner, J. Yung, A. Steiner, D. Keysers, J. Uszkoreit, et al., MLP-Mixer: An all-MLP architecture for vision, in: Advances in Neural Information Processing Systems, Vol. 34, 2021, pp. 24261–24272.
  7. [7] S.-A. Chen, C.-L. Li, N. C. Yoder, S. Ö. Arık, T. Pfister, TSMixer: An all-MLP architecture for time series forecasting, Transactions on Machine Learning Research (2023).
  8. [8] Y. Nie, N. H. Nguyen, P. Sinthong, J. Kalagnanam, A time series is worth 64 words: Long-term forecasting with transformers, in: International Conference on Learning Representations, 2023.
  9. [9] S. Liu, H. Yu, C. Liao, J. Li, W. Lin, A. X. Liu, S. Dustdar, Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting, in: International Conference on Learning Representations, 2022.
  10. [10] Y. Zhang, J. Yan, Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting, in: International Conference on Learning Representations, 2023.
  11. [11] T. Zhang, Y. Zhang, W. Cao, J. Bian, X. Yi, S. Zheng, J. Li, LightTS: Lightweight time series classification with adaptive ensemble distillation, arXiv preprint arXiv:2302.11468 (2023).
  12. [12] M. Li, X. Zhao, C. Lyu, M. Zhao, R. Wu, R. Guo, Mlp4rec: A pure MLP architecture for sequential recommendation, arXiv preprint arXiv:2204.11510 (2022).
  13. [13] M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, L. Kaiser, Universal transformers, in: International Conference on Learning Representations, 2019.
  14. [14] A. Graves, Adaptive computation time for recurrent neural networks, arXiv preprint arXiv:1603.08983 (2016).
  15. [15] M. Elbayad, J. Gu, E. Grave, M. Auli, Depth-adaptive transformer, arXiv preprint arXiv:1910.10073 (2019).
  16. [16] K.-S. Ng, Q. Wang, Loop neural networks for parameter sharing, arXiv preprint arXiv:2410.12012 (2024).
  17. [17] M.-H. Guo, Z.-N. Liu, T.-J. Mu, S.-M. Hu, Beyond self-attention: External attention using two linear layers for visual analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (5) (2022) 5436–5447.
  18. [18] A. A. Heidari, S. Mirjalili, H. Faris, I. Aljarah, M. Mafarja, H. Chen, Harris hawks optimization: Algorithm and applications, Future Generation Computer Systems 97 (2019) 849–872.
  19. [19] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  20. [20] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15 (1) (2014) 1929–1958.
  21. [21] T. Akiba, S. Sano, T. Yanase, T. Ohta, M. Koyama, Optuna: A next-generation hyperparameter optimization framework, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019, pp. 2623–2631.
  22. [22] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2015).
  23. [23] Y. Liu, T. Hu, H. Zhang, H. Wu, S. Wang, L. Ma, M. Long, iTransformer: Inverted transformers are effective for time series forecasting, in: International Conference on Learning Representations, 2024.
  24. [24] Z. Gong, Y. Tang, J. Liang, PatchMixer: A patch-mixing architecture for long-term time series forecasting, arXiv preprint arXiv:2310.00655 (2023).
  25. [25] H. Wu, T. Hu, Y. Liu, H. Zhou, J. Wang, M. Long, TimesNet: Temporal 2D-variation modeling for general time series analysis, in: International Conference on Learning Representations, 2023.
  26. [26] H. Wang, J. Peng, F. Huang, J. Wang, J. Chen, Y. Xiao, MICN: Multi-scale local and global context modeling for long-term series forecasting (2023).