CAST: Causal Anchored Simplex Transport for Distribution-Valued Time Series

arXiv: 2605.16919 · v1 · pith:2ABKKND4 · submitted 2026-05-16 · stat.ML · cs.LG

Jiecheng Lu, Jieqi Di, Runhua Wu, Yuwei Zhou

Pith reviewed 2026-05-19 19:32 UTC · model grok-4.3

keywords: distribution-valued time series, causal forecasting, simplex transport, transition kernel aliasing, probability simplex, autoregressive forecasting, stochastic transport, compositional time series

The pith

CAST uses causal context to retrieve non-aliased successors, then anchors and transports them on the simplex to forecast distribution-valued time series.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies forecasting for time series in which each time step is a full probability distribution, such as mobility shares, queue occupancies, or air-quality profiles. It identifies latent transition-kernel aliasing as a structural failure mode in which the same or similar observed distributions evolve differently under different hidden regimes. CAST addresses this by retrieving empirical successors from causal context, stabilizing them with a persistence anchor, and applying bounded local stochastic transport on ordered supports, with every step preserving the simplex by construction. The CAST hypothesis class contains the regime-aware Bayes successor, so it is not subject to the irreducible weighted Jensen-Shannon excess-risk lower bound that any forecaster depending only on an aliased summary must incur. On eleven benchmarks spanning ecology, energy, mobility, queueing, and other domains, the method records the best average rank for one-step KL (1.27) and autoregressive rollout JSD (1.91), and places in the top two on all eleven for offline KL.

Core claim

CAST is a successor-local operator that retrieves empirical successors from causal context, stabilizes them with a persistence anchor, and applies a bounded local stochastic transport on ordered supports; every stage preserves the simplex by construction. The operator class contains the regime-aware Bayes successor. For ordered supports an additional Pinsker separation holds whenever the transported successor lies outside the no-transport anchor hull. Any forecaster depending only on an aliased summary incurs an irreducible weighted Jensen-Shannon excess-risk lower bound.
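The three stages can be illustrated with a minimal sketch. The anchor formula a_t = λ_t p_t + (1 − λ_t) r_t is taken from the Figure 1 caption; everything else is an assumption made for illustration. In particular, the tridiagonal row-stochastic matrix below is only one simple way to realize "bounded local transport on ordered supports", and the names `local_transport_matrix`, `cast_step`, `up`, `down`, and `max_move` are hypothetical rather than the paper's. Retrieval is not modeled: the retrieved successor `r_t` is passed in as an input.

```python
import numpy as np

def anchor(p_t, r_t, lam):
    """Persistence-retrieval anchor a_t = lam * p_t + (1 - lam) * r_t."""
    return lam * np.asarray(p_t, float) + (1.0 - lam) * np.asarray(r_t, float)

def local_transport_matrix(up, down):
    """Row-stochastic tridiagonal matrix on an ordered support (assumed form).

    Bin i sends a fraction up[i] of its mass to bin i+1, a fraction down[i]
    to bin i-1, and keeps the rest. Mass cannot leave the support.
    """
    up = np.array(up, dtype=float)      # copies, so callers' arrays are untouched
    down = np.array(down, dtype=float)
    k = len(up)
    up[-1] = 0.0                        # last bin has no right neighbour
    down[0] = 0.0                       # first bin has no left neighbour
    T = np.diag(1.0 - up - down)
    idx = np.arange(k - 1)
    T[idx, idx + 1] = up[:-1]
    T[idx + 1, idx] = down[1:]
    return T

def cast_step(p_t, r_t, lam, up, down, max_move=0.25):
    """Anchor, then apply bounded local transport; the output stays on the simplex."""
    assert 0.0 <= lam <= 1.0 and 0.0 <= max_move <= 0.5
    up = np.clip(up, 0.0, max_move)     # bounding keeps every diagonal entry >= 0
    down = np.clip(down, 0.0, max_move)
    a_t = anchor(p_t, r_t, lam)
    return a_t @ local_transport_matrix(up, down)
```

Because the anchor is a convex combination of two distributions and the transport matrix has nonnegative rows summing to one, the output is nonnegative and sums to one without any renormalization. Setting `up` and `down` to zero recovers the anchor itself, which is the no-transport case.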

What carries the argument

The Causal Anchored Simplex Transport operator, which retrieves empirical successors from causal context, stabilizes them via persistence anchoring, and performs bounded stochastic transport on ordered supports to produce simplex-preserving forecasts.

If this is right

Any forecaster relying solely on an aliased summary of the current distribution incurs a positive weighted Jensen-Shannon excess-risk lower bound.
The CAST hypothesis class contains the regime-aware Bayes successor for the underlying transition kernels.
When supports are ordered, an extra Pinsker separation bound applies to transported points lying outside the no-transport anchor hull.
Component ablations and synthetic aliasing tests isolate the contribution of causal retrieval and anchoring to the observed gains.
The method applies directly to compositional data arising in ecology, energy, mobility, and queueing systems.
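The lower bound in the first item rests on a standard decomposition of weighted KL risk, written out here as a sketch. The symbols π_z (regime weights) and u_z (regime-specific successors) follow the page's excerpt of Theorem 1; the mixture notation \bar u is introduced here for convenience and is not taken from the paper.

```latex
\sum_{z=1}^{K} \pi_z \,\mathrm{KL}(u_z \,\|\, q)
  \;=\; \underbrace{\sum_{z=1}^{K} \pi_z \,\mathrm{KL}(u_z \,\|\, \bar u)}_{\mathrm{JS}_\pi(u_1,\dots,u_K)}
  \;+\; \mathrm{KL}(\bar u \,\|\, q),
\qquad
\bar u \;=\; \sum_{z=1}^{K} \pi_z\, u_z .
```

Since KL(\bar u ‖ q) ≥ 0 with equality only at q = \bar u, a forecaster that must issue a single q for all regimes can do no better than the mixture, and its excess risk over the regime-aware successors is exactly JS_π, which is positive whenever at least two regimes with positive weight have different successors.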

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Data pipelines for distribution-valued series should prioritize collection of causal context variables to enable disambiguation of regimes.
Domains with naturally ordered supports, such as severity profiles or occupancy fractions, are likely to see the largest benefit from the transport stage.
The aliasing lower bound supplies a diagnostic: persistent high error on identical distributions may indicate hidden regime structure that context-aware methods can exploit.
Adaptive tuning of anchor strength or transport radius from data could further reduce the gap to the Bayes successor in non-stationary settings.

Load-bearing premise

Causal context must suffice to retrieve non-aliased empirical successors, and supports must be ordered for the additional Pinsker separation to hold when the transported successor falls outside the no-transport anchor hull.
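For orientation, the generic form of Pinsker's inequality shows how a geometric gap turns into a KL gap. This is a sketch of the standard inequality only; the paper's exact statement and its definition of the anchor hull are not reproduced on this page, so H and δ below are placeholder symbols.

```latex
\mathrm{KL}(u \,\|\, q) \;\ge\; \tfrac{1}{2}\,\lVert u - q \rVert_1^{2}
\quad\Longrightarrow\quad
\inf_{q \in H} \mathrm{KL}(u \,\|\, q) \;\ge\; \tfrac{1}{2}\,\delta^{2}
\quad \text{whenever} \quad \inf_{q \in H} \lVert u - q \rVert_1 \ge \delta .
```

Read this way, any forecaster confined to a set H pays at least δ²/2 in KL (in nats) against a successor u that sits at L1 distance at least δ from H.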

What would settle it

A controlled experiment with deliberately aliased transitions in which a forecaster given only the current distribution exhibits the predicted weighted Jensen-Shannon excess risk while CAST does not.
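A toy version of such an experiment can be sketched in a few lines. The two regime successors and the equal regime weights below are hypothetical values chosen for illustration, not data from the paper. The sketch compares the best forecast available without regime information (the mixture) against a regime-aware forecaster that outputs the correct successor.

```python
import numpy as np

def kl(p, q):
    """KL divergence in nats, with the convention 0 * log(0 / q) = 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical setup: one observed current distribution, two hidden regimes,
# each with its own successor distribution.
successors = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.2, 0.7]])
weights = np.array([0.5, 0.5])

def expected_risk(q):
    """Regime-weighted KL risk of issuing the single forecast q."""
    return sum(w * kl(u, q) for w, u in zip(weights, successors))

# Best current-only forecast is the mixture; its risk is the weighted JS divergence.
mixture = weights @ successors
js_excess = expected_risk(mixture)

# A regime-aware forecaster outputs the right successor in each regime.
regime_aware_risk = sum(w * kl(u, u) for w, u in zip(weights, successors))

# Random search over the simplex: no single forecast should beat the mixture.
rng = np.random.default_rng(0)
candidates = rng.dirichlet(np.ones(3), size=2000)
best_random = min(expected_risk(q) for q in candidates)
```

In this construction `js_excess` is strictly positive, `regime_aware_risk` is zero, and `best_random` never falls below `js_excess`, which is the qualitative pattern the lower bound predicts.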

Figures

Figures reproduced from arXiv: 2605.16919 by Jiecheng Lu, Jieqi Di, Runhua Wu, Yuwei Zhou.

**Figure 1.** CAST overview. (1) Real-world systems evolve as distributions on a simplex rather than scalar trajectories; (2) the same p_t can arise from histories with different successors, so (3) current-only forecasters collapse to a mixture average; (4) CAST encodes history, retrieves empirical successors r_t, forms a persistence–retrieval anchor a_t = λ_t p_t + (1 − λ_t) r_t, and applies bounded local transport T_t on ordere…

**Figure 2.** Main empirical results and component ablations. (A) Average offline KL and rollout JSD ranks across 11 sections for all 16 methods (lower is better). (B) Per-dataset rank heatmap for offline KL and rollout JSD; cooler = stronger rank. (C) Dataset-level standing vs. the best non-CAST baseline: red circles = CAST's rank, gray squares = non-CAST winner when CAST is not top, horizontal segments = the rank gap.…

**Figure 3.** Qualitative offline and rollout on held-out queueing systems. (a) Offline one-step residual heat maps (annotated by mean KL): CAST's errors are diffuse and low-magnitude; baselines show structured errors aligned with target ridges. (b) Autoregressive rollout (annotated by mean JSD): CAST preserves the moving occupancy mass and changing support shape; persistence and Comp. ETS drift toward static bands, N-H…

**Figure 4.** Synthetic latent-kernel aliasing construction. Panel (a): the same current distribution with …

**Figure 5.** CAST random-seed robustness over five seeds (benchmark seed plus robustness seeds …

**Figure 6.** Additional offline one-step residual visualizations across homogeneous, nonhomogeneous, …

**Figure 7.** Extended rollout visualization for the selected examples. Each row shows the ground-truth …

Original abstract

Many decision-facing stochastic systems are observed through aggregate distributions rather than scalar trajectories: queue occupancies, mobility shares, public-health mixtures, generation-source shares, ecological compositions, and air-quality severity profiles all live on the probability simplex and evolve over time. We study causal (online) forecasting for these distribution-valued time series and argue that the transition operator itself should be structured around the simplex. We introduce CAST (Causal Anchored Simplex Transport), a successor-local operator that (i) retrieves empirical successors from causal context, (ii) stabilizes them with a persistence anchor, and (iii) applies a bounded local stochastic transport on ordered supports; every stage preserves the simplex by construction. We identify a structural failure mode, latent transition-kernel aliasing, where similar observed distributions evolve differently under different contextual regimes, and prove that any forecaster depending only on an aliased summary incurs an irreducible weighted Jensen-Shannon excess-risk lower bound, while the CAST hypothesis class contains the regime-aware Bayes successor; for ordered supports an additional Pinsker separation holds whenever the transported successor lies outside the no-transport anchor hull. On eleven public and simulated benchmarks spanning ecology, energy, diet, mortality, employment, air quality, severe weather, mobility, and G/G/1, G_t/G/1 queue occupancy, CAST attains the best average rank on both one-step KL (1.27) and autoregressive rollout JSD (1.91), winning 8/11 sections on each metric against a broad statistical, compositional, recurrent, convolutional, and Transformer baseline set, and top-2 on all 11 sections for offline KL. Component ablations and a controlled synthetic aliasing experiment corroborate the theory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CAST gives a new structured operator for simplex time series with solid empirical ranks, but the Pinsker separation only applies to ordered supports and several benchmarks are unordered categories.

The main takeaway is that CAST combines causal successor retrieval, a persistence anchor, and bounded local transport to forecast distribution-valued series while addressing latent aliasing. The paper shows this operator contains the regime-aware Bayes successor and proves an irreducible Jensen-Shannon excess-risk bound for any aliased forecaster. On eleven datasets it posts the best average rank for one-step KL and autoregressive JSD, with top-2 offline KL across the board, plus ablations and a synthetic aliasing check that line up with the claims. That empirical pattern and the explicit handling of the simplex are the parts worth noting. The theory is scoped to ordered supports for the extra Pinsker separation when the transported point falls outside the no-transport hull. Many of the reported benchmarks use categorical distributions like employment shares or diet compositions that lack a natural order, so the separation guarantee does not transfer directly unless an ordering is added or the claim rests only on the empirical procedure. The abstract states the distinction clearly, but without the full proofs it is hard to see how the method extends or whether the bound is used only where it fits. This is aimed at researchers who already work with compositional or aggregate time series in ecology, mobility, or queueing. A reader looking for a new hypothesis class with some benchmark wins would find concrete material here. It is worth sending to peer review so the proofs, the ordering assumption, and the dataset details can be checked properly.

Referee Report

2 major / 1 minor

Summary. The paper introduces CAST (Causal Anchored Simplex Transport), a successor-local operator for causal online forecasting of distribution-valued time series on the probability simplex. It retrieves empirical successors from causal context, stabilizes them via a persistence anchor, and applies bounded local stochastic transport on ordered supports, with all stages preserving the simplex. The central theoretical claims are that latent transition-kernel aliasing incurs an irreducible weighted Jensen-Shannon excess-risk lower bound for any forecaster using only an aliased summary, while the CAST hypothesis class contains the regime-aware Bayes successor; for ordered supports an additional Pinsker separation holds when the transported successor lies outside the no-transport anchor hull. Empirically, CAST achieves the best average rank on one-step KL (1.27) and autoregressive rollout JSD (1.91), winning 8/11 sections on each metric across eleven benchmarks from ecology, energy, diet, mortality, employment, air quality, severe weather, mobility, and queueing systems, against statistical, compositional, recurrent, convolutional, and Transformer baselines.

Significance. If the theoretical guarantees hold and extend appropriately, the work provides a principled simplex-structured approach to distribution time series forecasting that explicitly addresses regime-dependent aliasing, a structural failure mode not commonly isolated in prior compositional or recurrent models. The explicit containment of the regime-aware Bayes successor and the derivation of an aliasing lower bound are notable strengths, as is the empirical demonstration of consistent top performance and component ablations on a diverse benchmark suite. These elements could influence forecasting practice in aggregate-data domains such as public health, mobility, and environmental monitoring, provided the ordered-support assumption is reconciled with the categorical benchmarks used.

major comments (2)

[Abstract / Theoretical claims] Abstract and theoretical analysis: The aliasing lower-bound and Pinsker separation results are stated to require ordered supports ('for ordered supports an additional Pinsker separation holds whenever the transported successor lies outside the no-transport anchor hull'), yet the eleven benchmarks explicitly include unordered categorical distributions such as mobility shares, diet compositions, employment categories, and air-quality profiles. If the local stochastic transport step is applied without an explicit ordering (or with arbitrary ordering), the separation guarantee does not transfer, leaving the claim that CAST avoids the aliasing lower bound dependent on an unstated extension or solely on the empirical method. This gap is load-bearing for interpreting the theoretical contribution relative to the reported wins.
[Empirical evaluation] Empirical evaluation section: The reported average ranks (KL 1.27, JSD 1.91) and win counts (8/11 sections) rest on eleven datasets whose selection criteria, preprocessing steps, handling of support ordering, and statistical significance testing are not detailed. Without these, it is difficult to assess whether the benchmark wins are robust to the unordered-support issue raised above or to variations in how causal context is retrieved.

minor comments (1)

[Method definition] Notation for the persistence anchor and transport operator could be clarified with an explicit equation reference to show how simplex preservation is enforced at each stage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope of our theoretical results and the presentation of the empirical evaluation. We address each major comment below and outline the revisions we will make.

Referee: [Abstract / Theoretical claims] Abstract and theoretical analysis: The aliasing lower-bound and Pinsker separation results are stated to require ordered supports ('for ordered supports an additional Pinsker separation holds whenever the transported successor lies outside the no-transport anchor hull'), yet the eleven benchmarks explicitly include unordered categorical distributions such as mobility shares, diet compositions, employment categories, and air-quality profiles. If the local stochastic transport step is applied without an explicit ordering (or with arbitrary ordering), the separation guarantee does not transfer, leaving the claim that CAST avoids the aliasing lower bound dependent on an unstated extension or solely on the empirical method. This gap is load-bearing for interpreting the theoretical contribution relative to the reported wins.

Authors: We appreciate the referee pointing out this distinction. The core theoretical contributions (the weighted Jensen-Shannon excess-risk lower bound for aliased summaries and the containment of the regime-aware Bayes successor within the CAST hypothesis class) hold for general, possibly unordered, supports and do not rely on the ordered-support assumption. The additional Pinsker separation is explicitly qualified as holding only for ordered supports. For the categorical benchmarks, the local stochastic transport step is either bypassed or performed after imposing a fixed but arbitrary category ordering; performance improvements in those cases derive primarily from the empirical successor retrieval and the persistence anchor. We will revise the abstract, theoretical section, and empirical discussion to separate the general results from the ordered-support extension and to document the handling of unordered supports, thereby clarifying that the aliasing-avoidance claim does not depend on the Pinsker result. revision: partial
Referee: [Empirical evaluation] Empirical evaluation section: The reported average ranks (KL 1.27, JSD 1.91) and win counts (8/11 sections) rest on eleven datasets whose selection criteria, preprocessing steps, handling of support ordering, and statistical significance testing are not detailed. Without these, it is difficult to assess whether the benchmark wins are robust to the unordered-support issue raised above or to variations in how causal context is retrieved.

Authors: We agree that additional documentation is required for reproducibility and to address robustness concerns. In the revised manuscript we will expand the empirical evaluation section with: (i) explicit selection criteria and public sources for each of the eleven benchmarks, (ii) preprocessing details including support construction and ordering decisions (or lack thereof) for categorical versus ordered data, (iii) the precise procedure used to retrieve causal context windows, and (iv) statistical significance tests (paired Wilcoxon signed-rank tests with Holm correction) comparing CAST against baselines. A new paragraph will also discuss the application of CAST components to unordered supports. revision: yes
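The Holm step promised in item (iv) can be sketched as follows. This is a minimal illustration, not the authors' procedure: it takes raw p-values as input (in practice they might come from a paired test such as `scipy.stats.wilcoxon`, one per CAST-versus-baseline comparison), and the function name `holm_adjust` and the example p-values in the usage note are hypothetical.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order.

    The i-th smallest raw p-value (i = 0, 1, ...) is multiplied by (m - i),
    capped at 1, and forced to be non-decreasing along the sorted order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted
```

For example, `holm_adjust([0.01, 0.04, 0.03])` scales the smallest p-value by 3, the next by 2, and the largest by 1, then applies the monotonicity step; a comparison is declared significant at level α when its adjusted value is at most α.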

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper defines CAST as a successor-local operator with three explicit stages (retrieve empirical successors from causal context, stabilize with persistence anchor, apply bounded local stochastic transport on ordered supports) that preserve the simplex by construction. It proves an irreducible weighted JSD excess-risk lower bound for any forecaster depending only on an aliased summary and shows that the CAST hypothesis class contains the regime-aware Bayes successor. These steps are presented as independent mathematical results rather than reductions to fitted parameters or self-referential definitions. No equations in the provided abstract or description equate a claimed prediction to its own inputs by construction, and the empirical results on eleven benchmarks are reported separately from the theoretical containment claim. The ordered-support assumption for the additional Pinsker separation is an applicability condition, not a circularity in the derivation itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated beyond the implicit modeling choice of ordered supports and causal context retrieval. The persistence anchor and local transport operator are introduced as part of the new method.

pith-pipeline@v0.9.0 · 5845 in / 1235 out tokens · 42323 ms · 2026-05-19T19:32:46.927091+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Theorem 1 (Latent-kernel aliasing lower bound) … inf_q ∑_z π_z KL(u_z ∥ q) = JS_π(u_1, …, u_K) > 0

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: bounded local stochastic transport … on ordered supports; … Pinsker separation

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages

[1]

Queueing, predictions, and large language models: Challenges and open problems.Stochastic Systems, 15(3):195–219, 2025

Michael Mitzenmacher and Rana Shahout. Queueing, predictions, and large language models: Challenges and open problems.Stochastic Systems, 15(3):195–219, 2025. doi: 10.1287/stsy.2025.0106

work page doi:10.1287/stsy.2025.0106 2025
[2]

TLC Trip Record Data

New York City Taxi and Limousine Commission. TLC Trip Record Data. https://www.nyc.gov/ site/tlc/about/tlc-trip-record-data.page, 2025. Accessed 2026-05-05

work page 2025
[3]

The statistical analysis of compositional data.Journal of the Royal Statistical Society: Series B (Methodological), 44(2):139–160, 1982

John Aitchison. The statistical analysis of compositional data.Journal of the Royal Statistical Society: Series B (Methodological), 44(2):139–160, 1982. doi: 10.1111/j.2517-6161.1982.tb01195.x

work page doi:10.1111/j.2517-6161.1982.tb01195.x 1982
[4]

Chapman and Hall, London, 1986

John Aitchison.The Statistical Analysis of Compositional Data. Chapman and Hall, London, 1986

work page 1986
[5]

Snyder, J

Ralph D. Snyder, J. Keith Ord, Anne B. Koehler, Keith R. McLaren, and Adrian Beaumont. Forecasting compositional time series: A state space approach. Working Paper 11/15, Department of Econometrics and Business Statistics, Monash University, 2015

work page 2015
[6]

Snyder, J

Ralph D. Snyder, J. Keith Ord, Anne B. Koehler, Keith R. McLaren, and Adrian Beaumont. Forecasting compositional time series: A state space approach.International Journal of Forecasting, 33(2):502–512,

work page
[7]

doi: 10.1016/j.ijforecast.2016.11.008

work page doi:10.1016/j.ijforecast.2016.11.008 2016
[8]

Compositional V ARIMA time series

Carles Barceló-Vidal, Lucía Aguilar, and Josep Antoni Martín-Fernández. Compositional V ARIMA time series. InCompositional Data Analysis: Theory and Applications, pages 87–103. John Wiley & Sons,

work page
[9]

doi: 10.1002/9781119976462.ch7

work page doi:10.1002/9781119976462.ch7
[10]

Wasserstein autoregressive models for density time series.Journal of Time Series Analysis, 43(1):30–52, 2022

Chao Zhang, Piotr Kokoszka, and Alexander Petersen. Wasserstein autoregressive models for density time series.Journal of Time Series Analysis, 43(1):30–52, 2022. doi: 10.1111/jtsa.12590

work page doi:10.1111/jtsa.12590 2022
[11]

Wasserstein multivariate auto-regressive models for modeling distributional time series, 2022

Yiye Jiang and Jérémie Bigot. Wasserstein multivariate auto-regressive models for modeling distributional time series, 2022

work page 2022
[12]

Bellemare, Will Dabney, and Rémi Munos

Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. InProceedings of the 34th International Conference on Machine Learning, volume 70 ofProceedings of Machine Learning Research, pages 449–458. PMLR, 2017

work page 2017
[13]

Brandeau

Xiaocheng Li, Huaiyang Zhong, and Margaret L. Brandeau. Quantile markov decision processes.Opera- tions Research, 70(3):1428–1447, 2022. doi: 10.1287/opre.2021.2123

work page doi:10.1287/opre.2021.2123 2022
[14]

Risk-sensitive and robust decision-making: A CVaR optimization approach

Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone. Risk-sensitive and robust decision-making: A CVaR optimization approach. InAdvances in Neural Information Processing Systems, volume 28, pages 1522–1530, 2015

work page 2015
[15]

Distributionally robust convex optimization

Wolfram Wiesemann, Daniel Kuhn, and Melvyn Sim. Distributionally robust convex optimization. Operations Research, 62(6):1358–1376, 2014. doi: 10.1287/opre.2014.1314

work page doi:10.1287/opre.2014.1314 2014
[16]

D. V . Lindley. The theory of queues with a single server.Mathematical Proceedings of the Cambridge Philosophical Society, 48(2):277–289, 1952. doi: 10.1017/S0305004100027638

work page doi:10.1017/s0305004100027638 1952
[17]

Cambridge University Press, 2013

Mor Harchol-Balter.Performance Modeling and Design of Computer Systems: Queueing Theory in Action. Cambridge University Press, 2013

work page 2013
[18]

Learning lindley’s recursion

Sergio Palomo and Jamol Pender. Learning lindley’s recursion. InProceedings of the 2020 Winter Simulation Conference, pages 644–655, 2020. doi: 10.1109/WSC48552.2020.9384121

work page doi:10.1109/wsc48552.2020.9384121 2020
[19]

Transformer-based next-step prediction for queue length distribution

Jieqi Di, Jiecheng Lu, Runhua Wu, and Yuwei Zhou. Transformer-based next-step prediction for queue length distribution. InNeurIPS 2025 Workshop on Mathematical Foundations and Operational Integration of Machine Learning for Uncertainty-Aware Decision-Making, 2025. URL https://openreview.net/ forum?id=ErSFgi45jD. Published on OpenReview

work page 2025
[20]

DeepAR: Probabilistic forecasting with autoregressive recurrent networks.International Journal of Forecasting, 36(3):1181–1191, 2020

David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. DeepAR: Probabilistic forecasting with autoregressive recurrent networks.International Journal of Forecasting, 36(3):1181–1191, 2020. doi: 10.1016/j.ijforecast.2019.07.001. 10

work page doi:10.1016/j.ijforecast.2019.07.001 2020
[21]

Arik, Nicolas Loeff, and Tomas Pfister

Bryan Lim, Sercan O. Arik, Nicolas Loeff, and Tomas Pfister. Temporal fusion transformers for inter- pretable multi-horizon time series forecasting.International Journal of Forecasting, 37(4):1748–1764,

work page
[22]

doi: 10.1016/j.ijforecast.2021.03.012

work page doi:10.1016/j.ijforecast.2021.03.012 2021
[23]

Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio

Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. InInternational Conference on Learning Representations, 2020

work page 2020
[25]

v38i17.29868

Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. InProceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 11106–11115, 2021. doi: 10.1609/aaai. v35i12.17325

work page doi:10.1609/aaai 2021
[26]

Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting

Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. InAdvances in Neural Information Processing Systems, volume 34, pages 22419–22430, 2021

work page 2021
[27]

FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting

Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. InProceedings of the 39th International Conference on Machine Learning, volume 162 ofProceedings of Machine Learning Research, pages 27268–27286. PMLR, 2022

work page 2022
[28]

Are transformers effective for time series forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 11121--11128, 2023

Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? InProceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 11121–11128, 2023. doi: 10.1609/aaai.v37i9.26317

work page doi:10.1609/aaai.v37i9.26317 2023
[29]

Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam

Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. InInternational Conference on Learning Representations, 2023

work page 2023
[30]

TimesNet: Tem- poral 2d-variation modeling for general time series analysis

Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. TimesNet: Tem- poral 2d-variation modeling for general time series analysis. InInternational Conference on Learning Representations, 2023

work page 2023
[31]

Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting

Yunhao Zhang and Junchi Yan. Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting. InInternational Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=vSVLM2j9eie

work page 2023
[32]

Long-term forecasting with TiDE: Time-series dense encoder.Transactions on Machine Learning Research, 2023

Abhimanyu Das, Weihao Kong, Andrew Leach, Shaan Mathur, Rajat Sen, and Rose Yu. Long-term forecasting with TiDE: Time-series dense encoder.Transactions on Machine Learning Research, 2023. URLhttps://openreview.net/forum?id=pCbC3aQB5W

work page 2023
[33]

Yoder, Sercan O

Si-An Chen, Chun-Liang Li, Nathanael C. Yoder, Sercan O. Arik, and Tomas Pfister. TSMixer: An all-MLP architecture for time series forecasting.Transactions on Machine Learning Research, 2023. URL https://openreview.net/forum?id=wbpxTuXgm0

work page 2023
[34]

ARM: Refining multivariate forecasting with adaptive temporal- contextual learning

Jiecheng Lu, Xu Han, and Shihao Yang. ARM: Refining multivariate forecasting with adaptive temporal- contextual learning. InInternational Conference on Learning Representations, 2024. URL https: //openreview.net/forum?id=JWpwDdVbaM

work page 2024
[35]

iTrans- former: Inverted transformers are effective for time series forecasting

Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. iTrans- former: Inverted transformers are effective for time series forecasting. InInternational Conference on Learning Representations, 2024

work page 2024
[36]

TimeXer: Empowering transformers for time series forecasting with exogenous variables

Yuxuan Wang, Haixu Wu, Jiaxiang Dong, Guo Qin, Haoran Zhang, Yong Liu, Yunzhong Qiu, Jianmin Wang, and Mingsheng Long. TimeXer: Empowering transformers for time series forecasting with exogenous variables. InAdvances in Neural Information Processing Systems, volume 37, 2024. URL https://openreview.net/forum?id=l3MOy7AydX

work page 2024
[37]

CATS: Enhancing multivariate time series forecasting by constructing auxiliary time series as exogenous variables

Jiecheng Lu, Xu Han, Yan Sun, and Shihao Yang. CATS: Enhancing multivariate time series forecasting by constructing auxiliary time series as exogenous variables. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research. PMLR,

work page
[38]

URLhttps://openreview.net/forum?id=1lDAGDe0UR. 11

work page
[39]

TimeMixer: Decomposable multiscale mixing for time series forecasting

Shiyu Wang, Haixu Wu, Xiaoming Shi, Tengge Hu, Huakun Luo, Lintao Ma, James Zhang, and Jun Zhou. TimeMixer: Decomposable multiscale mixing for time series forecasting. InInternational Conference on Learning Representations, 2024

[40] Jiecheng Lu, Xu Han, Yan Sun, and Shihao Yang. WAVE: Weighted autoregressive varying gate for time series forecasting. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research. PMLR, 2025. URL https://openreview.net/forum?id=Qqn5ktBUxH

[41] Abdul Fatir Ansari, Lorenzo Stella, Ali Caner Türkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, Syama Sundar Rangapuram, Sebastian Pineda Arango, Shubham Kapoor, Jasper Zschiegner, Danielle C. Maddix, Hao Wang, Michael W. Mahoney, Kari Torkkola, Andrew Gordon Wilson, Michael Bohlke-Schneider, and Yuyang Wang. Chronos: Learning the langu...

[42] Jiecheng Lu and Shihao Yang. HyperMLP: An integrated perspective for sequence modeling, 2026.

[43] Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. A decoder-only foundation model for time-series forecasting. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research. PMLR, 2024.

[44] Yubin Kim, Viresh Pati, Jevon Twitty, Vinh Pham, Shihao Yang, and Jiecheng Lu. StretchTime: Adaptive time series forecasting via symplectic attention, 2026.

[45] Ziyue Wang and Yuko Araki. Functional time series forecasting of distributions: A Koopman-Wasserstein approach. Behaviormetrika, 2025. doi: 10.1007/s41237-025-00278-1

[46] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, volume 26, pages 2292–2300, 2013.

[47] Gabriel Peyré and Marco Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5–6):355–607, 2019. doi: 10.1561/2200000073

[48] Jiecheng Lu, Yan Sun, and Shihao Yang. In-context time series predictor. In International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=dCcY2pyNIO

[49] Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y. Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, and Qingsong Wen. Time-LLM: Time series forecasting by reprogramming large language models. In International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Unb5CVPtae

[50] Jiecheng Lu and Shihao Yang. Free energy mixer. In International Conference on Learning Representations. URL https://openreview.net/forum?id=vjQnKToCnV

[52] Gerald Woo, Chenghao Liu, Akshat Kumar, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. Unified training of universal time series forecasting transformers. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research. PMLR, 2024.

[53] Viresh Pati, Yubin Kim, Vinh Pham, Jevon Twitty, Shihao Yang, and Jiecheng Lu. CAPS: Unifying attention, recurrence, and alignment in transformer-based time series forecasting, 2026.

[54] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 5156–5165. PMLR, 2020.

[55] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. Rethinking attention with performers. In International Conference on Learning Representations. URL https://openreview.net/forum?id=Ua6zuk0WRH

[57] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=tEYskw1VY2

[58] Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research. PMLR, 2024.

[59] Jiecheng Lu and Shihao Yang. Linear transformers as VAR models: Aligning autoregressive attention mechanisms with autoregressive forecasting. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research. PMLR, 2025. URL https://openreview.net/forum?id=SxJUV9mnyt

[60] Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to Transformer for large language models, 2023.

[61] Jiecheng Lu, Xu Han, Yan Sun, Viresh Pati, Yubin Kim, Siddhartha Somani, and Shihao Yang. ZeroS: Zero-sum linear attention for efficient transformers. In Advances in Neural Information Processing Systems, volume 38, 2025. URL https://openreview.net/forum?id=Ms6IXbfzzX. Spotlight.

[62] Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song. Recurrent marked temporal point processes: Embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1555–1564. doi: 10.1145/2939672.2939875

[64] Hongyuan Mei and Jason M. Eisner. The neural Hawkes process: A neurally self-modulating multivariate point process. In Advances in Neural Information Processing Systems, volume 30, pages 6754–6764, 2017.

[65] Oleksandr Shchur, Ali Caner Türkmen, Tim Januschowski, and Stephan Günnemann. Neural temporal point processes: A review. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, pages 4585–4593, 2021. doi: 10.24963/ijcai.2021/623

[66] George Sugihara and Robert M. May. Nonlinear forecasting as a way of distinguishing chaos from measurement error in time series. Nature, 344:734–741, 1990. doi: 10.1038/344734a0

[67] Maria Dornelas, Laura H. Antão, Amanda E. Bates, et al. BioTIME 2.0: Expanding and improving a database of biodiversity time series. Global Ecology and Biogeography, 34(5):e70003, 2025. doi: 10.1111/geb.70003

[68] Ember. Monthly Electricity Data. https://ember-energy.org/data/data-product/monthly-electricity-data, 2026. Monthly full-release long-format CSV and Ember Electricity Data Methodology; accessed 2026-04-15.

[69] Hannah Ritchie, Pablo Rosado, and Max Roser. Diet compositions. Our World in Data, 2023. https://ourworldindata.org/diet-compositions

[70] National Center for Health Statistics. Weekly Provisional Counts of Deaths by State and Select Causes, 2020–2023. https://data.cdc.gov/d/muzy-jte6, 2023. Centers for Disease Control and Prevention; accessed 2026-04-15.

[71] U.S. Bureau of Labor Statistics. Quarterly Census of Employment and Wages. https://www.bls.gov/cew/, 2026. Annual singlefile CSV data files for 2010–2024; accessed 2026-04-15.

[72] U.S. Environmental Protection Agency. AirData Daily AQI by County. https://aqs.epa.gov/aqsweb/airdata/download_files.html, 2026. Daily AQI by county CSV ZIP files for 2000–2024; accessed 2026-04-15.

[73] NOAA National Centers for Environmental Information. Storm Events Database. https://www.ncei.noaa.gov/stormevents/, 2026. Storm Events details CSV files for 2000–2024; accessed 2026-04-15.

A Theory: Full Proofs and Supporting Results

We give the full proofs of the three main theorems stated in Section 4, together with approximation, retrieval-consiste...

