Recognition: unknown
In-Context Learning Under Regime Change
Pith reviewed 2026-05-10 06:54 UTC · model grok-4.3
The pith
Transformers can be built to detect unknown regime shifts in data sequences and adapt their in-context predictions without retraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We formalize regime shifts as an in-context change-point detection problem and prove the existence of transformer models that solve it for piecewise stationary sequences from known parametric families. The construction shows that the required model size depends on the available information about the change-point location, ranging from no prior knowledge to exact timing. Experiments confirm that trained transformers match optimal baselines on synthetic linear regression and dynamical system tasks, and that injecting change-point knowledge boosts the performance of pretrained models on infectious disease and financial volatility forecasting.
What carries the argument
A layered transformer construction that encodes change-point location information at varying precision levels to enable in-context detection and adaptation for linear models.
If this is right
- Transformers achieve optimal in-context performance on linear regression and linear dynamical systems once they detect the regime shift (a minimal oracle-vs-naive sketch follows this list).
- Providing more precise information about change timing reduces the number of layers and parameters needed.
- Pretrained foundation models improve on real tasks like disease and volatility forecasting when change-point signals are added without any retraining.
- The same architecture works across synthetic and real non-stationary sequences when the parametric family assumption holds.
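To make the first bullet concrete, here is a hedged sketch of the gap the construction is meant to close: an oracle that knows the change point discards pre-change evidence and fits only post-change data, while a naive in-context learner pools the whole context. All dimensions, noise levels, and the seed are illustrative assumptions, not values from the paper.

```python
# Hedged sketch, not the paper's construction: the oracle-vs-naive gap
# on a single regime shift in linear regression.
import numpy as np

rng = np.random.default_rng(0)
d, T, tau = 5, 60, 30          # feature dim, context length, true change point

w_pre, w_post = rng.normal(size=d), rng.normal(size=d)
X = rng.normal(size=(T, d))
y = np.concatenate([X[:tau] @ w_pre, X[tau:] @ w_post]) + 0.1 * rng.normal(size=T)

# Oracle with exact timing: discard pre-change evidence, fit OLS on the rest.
w_oracle, *_ = np.linalg.lstsq(X[tau:], y[tau:], rcond=None)
# Naive in-context learner: pools all evidence, ignoring the shift.
w_naive, *_ = np.linalg.lstsq(X, y, rcond=None)

x_query = rng.normal(size=d)
print("oracle prediction error:", abs(x_query @ (w_oracle - w_post)))
print("naive prediction error: ", abs(x_query @ (w_naive - w_post)))
```

The naive fit is biased toward a blend of the two regimes; detecting the shift and downweighting stale evidence is exactly what the construction asks the transformer to do in-context.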
Where Pith is reading between the lines
- Pretrained models could be extended with lightweight change detectors to handle streaming data better than current fine-tuning approaches.
- The scaling of model complexity with timing knowledge suggests a design principle for other sequence models facing abrupt shifts.
- Testing the construction on nonlinear or unknown-family regimes would reveal where the current existence proof stops applying.
Load-bearing premise
The underlying data consists of piecewise stationary segments drawn from known parametric families that admit efficient in-context solutions.
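For the linear dynamical system side of this premise, a minimal sketch of the assumed data-generating process: a state sequence whose transition matrix switches once at an unknown time. All constants (dimensions, scales, the seed) are illustrative assumptions.

```python
# Minimal sketch, assuming the paper's LDS setting: piecewise stationary
# dynamics with a single switch at time tau.
import numpy as np

rng = np.random.default_rng(1)
d, T, tau = 3, 100, 50
A_pre = 0.9 * np.eye(d)                       # pre-change dynamics
A_post = rng.normal(scale=0.3, size=(d, d))   # post-change dynamics

x = np.zeros((T, d))
x[0] = rng.normal(size=d)
for t in range(1, T):
    A = A_pre if t < tau else A_post
    x[t] = A @ x[t - 1] + 0.05 * rng.normal(size=d)
# An in-context learner observes only x and must detect tau and re-identify A.
```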
What would settle it
A demonstration that no fixed-size transformer can match the performance of an oracle that knows the change point when the data segments belong to unknown or non-parametric families.
Original abstract
Non-stationary sequences arise naturally in control, forecasting, and decision-making. The data-generating process shifts at unknown times, and models must detect the change, discard or downweight obsolete evidence, and adapt to new dynamics on the fly. Transformer-based foundation models increasingly rely on in-context learning for time series forecasting, tabular prediction, and continuous control. As these models are deployed in non-stationary environments, understanding their ability to detect and adapt to regime shifts is important. We formalize this as an in-context change-point detection problem and formally establish the existence of transformer models that solve this problem. Our construction demonstrates that model complexity, in layers and parameters, depends on the level of information available about the change-point location, from no knowledge to knowing exact timing. We validate our results with experiments on synthetic linear regression and linear dynamical systems, where trained transformers match the performance of optimal baselines across information levels. We also show that encoding and incorporating changepoint knowledge indeed improves the real-world performance of a pretrained foundation model on infectious disease forecasting and on financial volatility forecasting around Federal Open Market Committee (FOMC) announcements without retraining, demonstrating practical applicability to real-world regime changes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formalizes in-context learning under regime changes as an in-context change-point detection task for piecewise-stationary sequences drawn from known parametric families (linear regression and linear dynamical systems). It provides an existence construction for transformer models that solve the task, with the required depth and parameter count explicitly depending on the amount of side information supplied about the change-point location (none, partial, or exact). The construction is validated by showing that trained transformers recover the performance of optimal statistical baselines on synthetic tasks and that injecting change-point knowledge improves a pretrained foundation model on two real forecasting domains (infectious-disease and financial-volatility series) without retraining.
Significance. If the existence result and the matching to optimal baselines hold, the work supplies a concrete theoretical account of how transformer depth and attention can realize finite-state change-point detectors whose complexity scales with available information. The explicit construction and the zero-shot improvement on real data are strengths that distinguish the contribution from purely empirical studies of non-stationarity.
major comments (2)
- [§3.2, Theorem 3.1] The existence proof reduces the problem to realizing a finite-state machine via attention and feed-forward layers, but the explicit state-transition table and the corresponding attention-head construction for the 'no knowledge' regime are only sketched; without the full layer-by-layer mapping it is impossible to verify the claimed parameter scaling.
- [§4.2, Table 2] The reported equivalence between trained transformers and the optimal Bayesian filter holds only for the listed noise variances and segment lengths; the paper does not show that the match survives when the segment-length distribution is misspecified relative to the training distribution, which is load-bearing for the claim that transformers 'match optimal baselines across information levels'.
minor comments (2)
- [§2 and §5] Notation for the change-point indicator variable is introduced in §2 but reused with a different meaning in the real-data experiments (§5); a single consistent symbol would improve readability.
- [Abstract] The abstract states that 'model complexity … depends on the level of information'; the precise functional dependence (e.g., O(log T) layers for exact timing) is only stated in the theorem and should be repeated in the abstract or introduction for clarity.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. The comments highlight important aspects of the existence construction and the empirical validation. We address each major comment below with clarifications and proposed revisions.
Point-by-point responses
- Referee: [§3.2, Theorem 3.1] The existence proof reduces the problem to realizing a finite-state machine via attention and feed-forward layers, but the explicit state-transition table and the corresponding attention-head construction for the 'no knowledge' regime are only sketched; without the full layer-by-layer mapping it is impossible to verify the claimed parameter scaling.
  Authors: We agree that the no-knowledge case in Theorem 3.1 is presented at a higher level of abstraction than the partial- and exact-knowledge regimes, focusing on the reduction to an FSM simulator whose depth and width scale with the number of states. The parameter bounds follow from the standard construction of attention-based FSMs (as in prior work on transformer universality for finite automata), but we acknowledge that an explicit state-transition table and per-layer attention-head wiring for the no-knowledge regime would aid verification. In the revision we will add a dedicated appendix subsection that supplies the full layer-by-layer mapping, the explicit transition table for the no-knowledge detector, and the corresponding attention-head and feed-forward configurations. This addition will make the claimed scaling fully verifiable while leaving the theorem statement unchanged. Revision: partial.
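To illustrate the kind of detector this exchange refers to, here is a hedged two-state sketch in the spirit of Page's CUSUM [7]: a machine that stays in PRE until an accumulated residual statistic crosses a threshold, then transitions once to POST. The threshold, drift, and state names are illustrative assumptions; the paper's actual FSM and its attention-layer realization are what the promised appendix would specify.

```python
# Hedged sketch of a two-state change detector (PRE -> POST) driven by a
# CUSUM-style residual statistic; not the paper's construction.
def fsm_detector(residuals, threshold=5.0, drift=0.5):
    state, cusum = "PRE", 0.0
    for t, r in enumerate(residuals):
        cusum = max(0.0, cusum + abs(r) - drift)   # accumulate excess residual mass
        if state == "PRE" and cusum > threshold:
            return "POST", t                       # one-way transition: shift declared
    return state, None                             # no shift detected in this context
```

The depth-and-width scaling claim amounts to saying that an attention stack can simulate this state update, with more states (and hence more layers and parameters) needed as side information about the change point decreases.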
- Referee: [§4.2, Table 2] The reported equivalence between trained transformers and the optimal Bayesian filter holds only for the listed noise variances and segment lengths; the paper does not show that the match survives when the segment-length distribution is misspecified relative to the training distribution, which is load-bearing for the claim that transformers 'match optimal baselines across information levels'.
  Authors: The referee correctly notes that the current experiments in §4.2 use segment-length distributions that match those seen during training. While this demonstrates that transformers can recover the optimal filter when the generative assumptions are aligned, it does not yet address robustness under misspecification. To strengthen the claim, we will add a new set of experiments in the revision that evaluate both the trained transformers and the Bayesian filter on test sequences whose segment lengths are drawn from a deliberately misspecified distribution (e.g., switching from uniform to geometric or vice versa, while keeping noise variances fixed). We will report the resulting performance gap (or lack thereof) for each information level. This extension directly addresses the load-bearing aspect of the claim. Revision: partial.
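A sketch of the check the authors propose, under assumed distributions: calibrate on uniform segment lengths, evaluate on geometric lengths of roughly matched mean, and score detectors across the two. The ranges, the mean of 40, and the function name are illustrative assumptions, not values from the paper.

```python
# Sketch of the proposed misspecification experiment: train-time segment
# lengths uniform, test-time lengths geometric, noise variance held fixed.
import numpy as np

def sample_segment_lengths(dist, n, rng):
    if dist == "uniform":
        return rng.integers(20, 60, size=n)                      # train-time assumption
    if dist == "geometric":
        return np.clip(rng.geometric(1 / 40, size=n), 2, None)   # misspecified test
    raise ValueError(dist)

rng = np.random.default_rng(2)
train_lengths = sample_segment_lengths("uniform", 1000, rng)
test_lengths = sample_segment_lengths("geometric", 1000, rng)
# Sequences built from test_lengths would then score both the trained
# transformer and the Bayesian filter, per information level.
print(train_lengths.mean(), test_lengths.mean())
```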
Circularity Check
No significant circularity in existence construction
Full rationale
The paper's central result is an existence proof via explicit construction of transformer architectures that realize in-context change-point detection for piecewise-stationary data drawn from known parametric families (linear regression or LDS). Model depth and parameter count are shown to scale with the amount of side information supplied about change-point location. The construction reduces the problem to encoding a finite-state detector using attention and feed-forward layers; synthetic experiments confirm trained models recover the performance of the corresponding optimal statistical procedure. Real-world transfer to disease and volatility forecasting is shown without retraining. No load-bearing step reduces by definition to its own inputs, renames a fitted quantity as a prediction, or relies on a self-citation chain whose validity is internal to the present work. The argument is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Data consists of piecewise stationary segments drawn from known model classes (linear regression or linear dynamical systems).
Reference graph
Works this paper leans on
- [1] A. Willsky and H. Jones. A generalized likelihood ratio approach to the detection and estimation of jumps in linear systems. IEEE Transactions on Automatic Control, 21(1):108–112, 1976.
- [2] Michèle Basseville and Igor V. Nikiforov. Detection of Abrupt Changes: Theory and Application. 1993.
- [3] O. L. V. Costa, M. D. Fragoso, and R. P. Marques. Discrete-Time Markov Jump Linear Systems. Probability and Its Applications. Springer-Verlag, 2005.
- [4] James D. Hamilton. A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica, 57(2):357–384, 1989.
- [5] Andrew Ang and Allan Timmermann. Regime changes and financial markets. Working paper, National Bureau of Economic Research, June 2011.
- [6] Carson Dudley et al. From sparse data to smart decisions: Region-specific policy evaluation via simulation. medRxiv, 2025.
- [7] E. S. Page. Continuous inspection schemes. Biometrika, 41, 1954.
- [8] Tze Leung Lai. Sequential changepoint detection in quality control and dynamical systems. Journal of the Royal Statistical Society: Series B, 57(4), 1995.
- [9] Ryan Prescott Adams and David J. C. MacKay. Bayesian online changepoint detection, 2007.
- [10] Ashish Vaswani, Noam Shazeer, et al. Attention is all you need, 2023.
- [11] Tom B. Brown et al. Language models are few-shot learners, 2020.
- [12] Shivam Garg et al. What can transformers learn in-context? A case study of simple function classes, 2023.
- [13] Johannes von Oswald et al. Transformers learn in-context by gradient descent, 2023.
- [14] Ekin Akyürek et al. What learning algorithm is in-context learning? Investigations with linear models, 2023.
- [15] Yingcong Li, M. Emrullah Ildiz, Dimitris Papailiopoulos, and Samet Oymak. Transformers as algorithms: Generalization and stability in in-context learning, 2023.
- [16] Yufeng Zhang, Fengzhuo Zhang, Zhuoran Yang, and Zhaoran Wang. What and how does in-context learning learn? Bayesian model averaging, parameterization, and generalization, 2023.
- [17] Ruiqi Zhang, Spencer Frei, and Peter L. Bartlett. Trained transformers learn linear models in-context, 2023.
- [18] Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. A decoder-only foundation model for time-series forecasting, 2024.
- [19] Abdul Fatir Ansari et al. Chronos: Transformer-based language models for time-series forecasting, 2024.
- [20] Noah Hollmann, Samuel Müller, et al. Accurate predictions on small data with a tabular foundation model. Nature, 637:319–326, 2025.
- [21] Yingcong Li et al. Gating is weighting: Understanding gated linear attention through in-context learning, 2025.
- [22] Carson Dudley, Yutong Bi, Xiaofeng Liu, and Samet Oymak. Transformers as adaptive estimators: In-context learning under regime change, 2026. Technical report.
- [23] Chung-ki Min and Arnold Zellner. Bayesian and non-Bayesian methods for combining models and forecasts with applications to forecasting international growth rates. Journal of Econometrics, 56(1-2):89–118, 1993.
- [24] Jennifer A. Hoeting, David Madigan, Adrian E. Raftery, and Chris T. Volinsky. Bayesian model averaging: A tutorial. Statistical Science, 14(4):382–417, 1999.
- [25] Carson Dudley et al. Simulation as supervision: Mechanistic pretraining for scientific discovery. arXiv preprint arXiv:2507.08977, 2025.
- [26] Ege Onur Taga, M. Emrullah Ildiz, and Samet Oymak. TimePFN: Effective multivariate time series forecasting with synthetic data. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025.
- [27] Carson Dudley and Marisa Eisenberg. Learning from simulators: A theory of simulation-grounded learning. arXiv preprint arXiv:2509.18990, 2025.
- [28] Carson Dudley et al. Mantis: A foundation model for mechanistic disease forecasting. arXiv preprint arXiv:2508.12260, 2025.
- [29] Suprabhath Kalahasti et al. Foundation time series models for forecasting and policy evaluation in infectious disease epidemics. medRxiv, February 2025.
- [30] Shi Bin Hoo, Samuel Müller, David Salinas, and Frank Hutter. From tables to time: Extending TabPFN-v2 to time series forecasting, 2026.
- [31] Kenneth N. Kuttner. Monetary policy surprises and interest rates: Evidence from the fed funds futures market. Journal of Monetary Economics, 47(3):523–544, 2001.
- [32] Ben S. Bernanke and Kenneth N. Kuttner. What explains the stock market's reaction to Federal Reserve policy? The Journal of Finance, 60(3):1221–1257, 2005.
- [33] Board of Governors of the Federal Reserve System (US). Federal funds effective rate [DFF]. https://fred.stlouisfed.org/series/DFF, 2026. Retrieved from FRED, Federal Reserve Bank of St. Louis.
- [34] S&P Dow Jones Indices LLC. S&P 500 [SP500]. https://fred.stlouisfed.org/series/SP500, 2026. Retrieved from FRED, Federal Reserve Bank of St. Louis.
- [35] Board of Governors of the Federal Reserve System. Federal Open Market Committee historical materials by year. https://www.federalreserve.gov/monetarypolicy/fomc_historical_year.htm, 2026.