pith. machine review for the scientific record.

arxiv: 2604.16988 · v1 · submitted 2026-04-18 · 💻 cs.LG · cs.AI

Recognition: unknown

In-Context Learning Under Regime Change

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 06:54 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords in-context learning · change-point detection · transformers · regime shifts · non-stationary sequences · linear regression · dynamical systems · time series adaptation

The pith

Transformers can be built to detect unknown regime shifts in data sequences and adapt their in-context predictions without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that transformers can handle non-stationary data with sudden shifts through in-context change-point detection. It proves that transformer constructions exist for this task when the data consists of piecewise stationary segments drawn from known families such as linear regression or linear dynamical systems, and that the required depth and parameter count grow as less is known about the timing of the shift. This matters for foundation models used in forecasting and control, where real data often changes regimes: such models can incorporate timing knowledge to improve adaptation on the fly.

Core claim

We formalize regime shifts as an in-context change-point detection problem and prove the existence of transformer models that solve it for piecewise stationary sequences from known parametric families. The construction shows that required model size depends on available information about the change-point location, ranging from no prior knowledge to exact timing. Experiments confirm that trained transformers match optimal baselines on synthetic linear regression and dynamical system tasks, and that injecting changepoint knowledge boosts performance of pretrained models on infectious disease and financial volatility forecasting.

What carries the argument

A layered transformer construction that encodes change-point location information at varying precision levels to enable in-context detection and adaptation for linear models.

If this is right

  • Transformers achieve optimal in-context performance on linear regression and linear dynamical systems once they detect the regime shift.
  • Providing more precise information about change timing reduces the number of layers and parameters needed.
  • Pretrained foundation models improve on real tasks like disease and volatility forecasting when change-point signals are added without any retraining.
  • The same architecture works across synthetic and real non-stationary sequences when the parametric family assumption holds.
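The first two points can be illustrated with a toy version of the synthetic linear-regression setting (a minimal sketch with illustrative dimensions and noise levels, not the paper's construction or experimental settings): an oracle that knows the change point and fits only post-change samples recovers the new regime far better than a pooled fit over the full history.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, tau = 5, 200, 120                          # dimension, length, change point
w1, w2 = rng.normal(size=d), rng.normal(size=d)  # pre- and post-change weights

X = rng.normal(size=(T, d))
w = np.where(np.arange(T)[:, None] < tau, w1, w2)  # regime switches at tau
y = (X * w).sum(axis=1) + 0.1 * rng.normal(size=T)

def lsq(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

w_all = lsq(X, y)                # pooled fit: ignores the regime shift
w_post = lsq(X[tau:], y[tau:])   # oracle fit: discards pre-change evidence

# The pooled estimate lands between w1 and w2; the oracle recovers w2.
print(np.linalg.norm(w_all - w2), np.linalg.norm(w_post - w2))
```

The gap between the two errors is what in-context change-point detection must close without being told tau.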

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pretrained models could be extended with lightweight change detectors to handle streaming data better than current fine-tuning approaches.
  • The scaling of model complexity with timing knowledge suggests a design principle for other sequence models facing abrupt shifts.
  • Testing the construction on nonlinear or unknown-family regimes would reveal where the current existence proof stops applying.

Load-bearing premise

The underlying data consists of piecewise stationary segments drawn from known parametric families that admit efficient in-context solutions.

What would settle it

A demonstration that no fixed-size transformer can match the performance of an oracle that knows the change point when the data segments belong to unknown or non-parametric families.

Figures

Figures reproduced from arXiv: 2604.16988 by Carson Dudley, Samet Oymak, Xiaofeng Liu, Yutong Bi.

Figure 1. Piecewise-linear regression results. Mean squared error averaged over 5,000 test trajectories.
Figure 2. Piecewise-linear dynamical system results. Mean squared error (aggregated over 5,000
Figure 3. Mean absolute error for infectious disease forecasting as a function of forecast origin time.
Figure 4. Mean absolute error for federal funds rate (a) and S&P 500 index (b) forecasting and
Original abstract

Non-stationary sequences arise naturally in control, forecasting, and decision-making. The data-generating process shifts at unknown times, and models must detect the change, discard or downweight obsolete evidence, and adapt to new dynamics on the fly. Transformer-based foundation models increasingly rely on in-context learning for time series forecasting, tabular prediction, and continuous control. As these models are deployed in non-stationary environments, understanding their ability to detect and adapt to regime shifts is important. We formalize this as an in-context change-point detection problem and formally establish the existence of transformer models that solve this problem. Our construction demonstrates that model complexity, in layers and parameters, depends on the level of information available about the change-point location, from no knowledge to knowing exact timing. We validate our results with experiments on synthetic linear regression and linear dynamical systems, where trained transformers match the performance of optimal baselines across information levels. We also show that encoding and incorporating changepoint knowledge indeed improves the real-world performance of a pretrained foundation model on infectious disease forecasting and on financial volatility forecasting around Federal Open Market Committee (FOMC) announcements without retraining, demonstrating practical applicability to real-world regime changes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper formalizes in-context learning under regime changes as an in-context change-point detection task for piecewise-stationary sequences drawn from known parametric families (linear regression and linear dynamical systems). It provides an existence construction for transformer models that solve the task, with the required depth and parameter count explicitly depending on the amount of side information supplied about the change-point location (none, partial, or exact). The construction is validated by showing that trained transformers recover the performance of optimal statistical baselines on synthetic tasks and that injecting change-point knowledge improves a pretrained foundation model on two real forecasting domains (infectious-disease and financial-volatility series) without retraining.

Significance. If the existence result and the matching to optimal baselines hold, the work supplies a concrete theoretical account of how transformer depth and attention can realize finite-state change-point detectors whose complexity scales with available information. The explicit construction and the zero-shot improvement on real data are strengths that distinguish the contribution from purely empirical studies of non-stationarity.

major comments (2)
  1. [§3.2, Theorem 3.1] The existence proof reduces the problem to realizing a finite-state machine via attention and feed-forward layers, but the explicit state-transition table and the corresponding attention-head construction for the 'no knowledge' regime are only sketched; without the full layer-by-layer mapping it is impossible to verify the claimed parameter scaling.
  2. [§4.2, Table 2] The reported equivalence between trained transformers and the optimal Bayesian filter holds only for the listed noise variances and segment lengths; the paper does not show that the match survives when the segment-length distribution is misspecified relative to the training distribution, which is load-bearing for the claim that transformers 'match optimal baselines across information levels'.
minor comments (2)
  1. [§2 and §5] Notation for the change-point indicator variable is introduced in §2 but reused with a different meaning in the real-data experiments (§5); a single consistent symbol would improve readability.
  2. [Abstract] The abstract states that 'model complexity … depends on the level of information'; the precise functional dependence (e.g., O(log T) layers for exact timing) is only stated in the theorem and should be repeated in the abstract or introduction for clarity.
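For context on the kind of recursive detector the construction must realize, here is a minimal CUSUM-style sketch in the spirit of Page's scheme (an illustrative classical baseline, not the paper's attention construction; the means, noise level, threshold, and data are assumptions made for the example):

```python
import numpy as np

def cusum_detect(x, mu0, mu1, threshold):
    """One-sided CUSUM for a known mean shift mu0 -> mu1 under a
    unit-variance Gaussian model. Returns the first index at which
    the statistic exceeds the threshold, or None if it never does."""
    s = 0.0
    for t, xt in enumerate(x):
        # log-likelihood-ratio increment, clipped at zero (the reset state)
        s = max(0.0, s + (mu1 - mu0) * (xt - (mu0 + mu1) / 2.0))
        if s > threshold:
            return t
    return None

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 0.5, 100),   # pre-change regime
                    rng.normal(2.0, 0.5, 100)])  # post-change regime
t_hat = cusum_detect(x, mu0=0.0, mu1=2.0, threshold=8.0)
print(t_hat)  # fires a few steps after the true change point at t=100
```

A single clipped running statistic like this is exactly the sort of state a fixed transformer layer stack would have to carry in-context, which is why verifying the layer-by-layer mapping in the no-knowledge regime matters.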

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments highlight important aspects of the existence construction and the empirical validation. We address each major comment below with clarifications and proposed revisions.

Point-by-point responses
  1. Referee: [§3.2, Theorem 3.1] the existence proof reduces the problem to realizing a finite-state machine via attention and feed-forward layers, but the explicit state-transition table and the corresponding attention-head construction for the 'no knowledge' regime are only sketched; without the full layer-by-layer mapping it is impossible to verify the claimed parameter scaling.

    Authors: We agree that the no-knowledge case in Theorem 3.1 is presented at a higher level of abstraction than the partial- and exact-knowledge regimes, focusing on the reduction to an FSM simulator whose depth and width scale with the number of states. The parameter bounds follow from the standard construction of attention-based FSMs (as in prior work on transformer universality for finite automata), but we acknowledge that an explicit state-transition table and per-layer attention-head wiring for the no-knowledge regime would aid verification. In the revision we will add a dedicated appendix subsection that supplies the full layer-by-layer mapping, the explicit transition table for the no-knowledge detector, and the corresponding attention-head and feed-forward configurations. This addition will make the claimed scaling fully verifiable while leaving the theorem statement unchanged. revision: partial

  2. Referee: [§4.2, Table 2] the reported equivalence between trained transformers and the optimal Bayesian filter holds only for the listed noise variances and segment lengths; the paper does not show that the match survives when the segment-length distribution is misspecified relative to the training distribution, which is load-bearing for the claim that transformers 'match optimal baselines across information levels'.

    Authors: The referee correctly notes that the current experiments in §4.2 use segment-length distributions that match those seen during training. While this demonstrates that transformers can recover the optimal filter when the generative assumptions are aligned, it does not yet address robustness under misspecification. To strengthen the claim, we will add a new set of experiments in the revision that evaluate both the trained transformers and the Bayesian filter on test sequences whose segment lengths are drawn from a deliberately misspecified distribution (e.g., switching from uniform to geometric or vice versa, while keeping noise variances fixed). We will report the resulting performance gap (or lack thereof) for each information level. This extension directly addresses the load-bearing aspect of the claim. revision: partial

Circularity Check

0 steps flagged

No significant circularity in existence construction

Full rationale

The paper's central result is an existence proof via explicit construction of transformer architectures that realize in-context change-point detection for piecewise-stationary data drawn from known parametric families (linear regression or LDS). Model depth and parameter count are shown to scale with the amount of side information supplied about change-point location. The construction reduces the problem to encoding a finite-state detector using attention and feed-forward layers; synthetic experiments confirm trained models recover the performance of the corresponding optimal statistical procedure. Real-world transfer to disease and volatility forecasting is shown without retraining. No load-bearing step reduces by definition to its own inputs, renames a fitted quantity as a prediction, or relies on a self-citation chain whose validity is internal to the present work. The argument is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the underlying processes are piecewise stationary and belong to tractable parametric families; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Data consists of piecewise stationary segments drawn from known model classes (linear regression or linear dynamical systems).
    Required for the change-point detection problem to be well-posed and for the transformer construction to be feasible.
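This domain assumption is straightforward to simulate; a minimal sketch of a piecewise-stationary linear dynamical system with one change point (the rotation matrices, stability scaling, and noise level are illustrative choices, not the paper's experimental settings):

```python
import numpy as np

rng = np.random.default_rng(0)

def piecewise_lds(T, tau, A1, A2, noise=0.1):
    """Simulate x_{t+1} = A x_t + w_t where the transition matrix
    switches from A1 to A2 at the change point tau."""
    d = A1.shape[0]
    x = np.zeros((T, d))
    x[0] = rng.normal(size=d)
    for t in range(T - 1):
        A = A1 if t < tau else A2
        x[t + 1] = A @ x[t] + noise * rng.normal(size=d)
    return x

# Two stable regimes: slow vs. fast rotation, scaled below spectral radius 1.
R = lambda th: 0.95 * np.array([[np.cos(th), -np.sin(th)],
                                [np.sin(th),  np.cos(th)]])
traj = piecewise_lds(T=300, tau=150, A1=R(0.1), A2=R(0.8))
print(traj.shape)  # (300, 2)
```

Each segment is stationary in isolation; everything the paper claims rests on being able to identify which segment the current observation belongs to.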

pith-pipeline@v0.9.0 · 5507 in / 1398 out tokens · 57634 ms · 2026-05-10T06:54:29.538600+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

36 extracted references · 3 canonical work pages · 2 internal anchors

  1. A. Willsky and H. Jones. A generalized likelihood ratio approach to the detection and estimation of jumps in linear systems. IEEE Transactions on Automatic Control, 21(1):108–112, 1976.
  2. Michèle Basseville and Igor V. Nikiforov. Detection of Abrupt Changes: Theory and Application. 1993.
  3. O. L. V. Costa, M. D. Fragoso, and R. P. Marques. Discrete-Time Markov Jump Linear Systems. Probability and Its Applications. Springer-Verlag, 2005.
  4. James D. Hamilton. A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica, 57(2):357–384, 1989.
  5. Andrew Ang and Allan Timmermann. Regime changes and financial markets. Working paper, National Bureau of Economic Research, June 2011.
  6. Carson Dudley et al. From sparse data to smart decisions: Region-specific policy evaluation via simulation. medRxiv, 2025.
  7. E. S. Page. Continuous inspection schemes. Biometrika, 41, 1954.
  8. Tze Leung Lai. Sequential changepoint detection in quality control and dynamical systems. Journal of the Royal Statistical Society: Series B, 57(4), 1995.
  9. Ryan Prescott Adams and David J. C. MacKay. Bayesian online changepoint detection, 2007.
  10. Ashish Vaswani, Noam Shazeer, et al. Attention is all you need, 2023.
  11. Tom B. Brown et al. Language models are few-shot learners, 2020.
  12. Shivam Garg et al. What can transformers learn in-context? A case study of simple function classes, 2023.
  13. Johannes von Oswald et al. Transformers learn in-context by gradient descent, 2023.
  14. Ekin Akyürek et al. What learning algorithm is in-context learning? Investigations with linear models, 2023.
  15. Yingcong Li, M. Emrullah Ildiz, Dimitris Papailiopoulos, and Samet Oymak. Transformers as algorithms: Generalization and stability in in-context learning, 2023.
  16. Yufeng Zhang, Fengzhuo Zhang, Zhuoran Yang, and Zhaoran Wang. What and how does in-context learning learn? Bayesian model averaging, parameterization, and generalization, 2023.
  17. Ruiqi Zhang, Spencer Frei, and Peter L. Bartlett. Trained transformers learn linear models in-context, 2023.
  18. Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. A decoder-only foundation model for time-series forecasting, 2024.
  19. Abdul Fatir Ansari et al. Chronos: Transformer-based language models for time-series forecasting, 2024.
  20. Noah Hollmann, Samuel Müller, et al. Accurate predictions on small data with a tabular foundation model. Nature, 637:319–326, 2025.
  21. Yingcong Li et al. Gating is weighting: Understanding gated linear attention through in-context learning, 2025.
  22. Carson Dudley, Yutong Bi, Xiaofeng Liu, and Samet Oymak. Transformers as adaptive estimators: In-context learning under regime change. Technical report, 2026.
  23. Chung-ki Min and Arnold Zellner. Bayesian and non-Bayesian methods for combining models and forecasts with applications to forecasting international growth rates. Journal of Econometrics, 56(1-2):89–118, 1993.
  24. Jennifer A. Hoeting, David Madigan, Adrian E. Raftery, and Chris T. Volinsky. Bayesian model averaging: A tutorial. Statistical Science, 14(4):382–417, 1999.
  25. Carson Dudley et al. Simulation as supervision: Mechanistic pretraining for scientific discovery. arXiv preprint arXiv:2507.08977, 2025.
  26. Ege Onur Taga, M. Emrullah Ildiz, and Samet Oymak. TimePFN: Effective multivariate time series forecasting with synthetic data. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025.
  27. Carson Dudley and Marisa Eisenberg. Learning from simulators: A theory of simulation-grounded learning. arXiv preprint arXiv:2509.18990, 2025.
  28. Carson Dudley et al. Mantis: A foundation model for mechanistic disease forecasting. arXiv preprint arXiv:2508.12260, 2025.
  29. Suprabhath Kalahasti et al. Foundation time series models for forecasting and policy evaluation in infectious disease epidemics. medRxiv, February 2025.
  30. Shi Bin Hoo, Samuel Müller, David Salinas, and Frank Hutter. From tables to time: Extending TabPFN-v2 to time series forecasting, 2026.
  31. Kenneth N. Kuttner. Monetary policy surprises and interest rates: Evidence from the fed funds futures market. Journal of Monetary Economics, 47(3):523–544, 2001.
  32. Ben S. Bernanke and Kenneth N. Kuttner. What explains the stock market's reaction to Federal Reserve policy? The Journal of Finance, 60(3):1221–1257, 2005.
  33. Board of Governors of the Federal Reserve System (US). Federal funds effective rate [DFF]. https://fred.stlouisfed.org/series/DFF, 2026. Retrieved from FRED, Federal Reserve Bank of St. Louis.
  34. S&P Dow Jones Indices LLC. S&P 500 [SP500]. https://fred.stlouisfed.org/series/SP500, 2026. Retrieved from FRED, Federal Reserve Bank of St. Louis.
  35. Board of Governors of the Federal Reserve System. Federal Open Market Committee historical materials by year. https://www.federalreserve.gov/monetarypolicy/fomc_historical_year.htm, 2026.
    The BMA predictor (2) collapses to a single term: ˆyBMA t =m n∗ 1(t) =x⊤ t µn∗ 1, sinceαn∗ 1(t) = 1. The construction from Theorem 1 simplifies as follows. Layer 1 (accumulation).Instead of computing prefix sums at every position and retrieving later, the transformer can directly accumulate only post-change statistics. A single attention head uses positio...