Recognition: unknown
In-Context Learning Under Regime Change
Pith reviewed 2026-05-10 06:54 UTC · model grok-4.3
The pith
Transformers can be built to detect unknown regime shifts in data sequences and adapt their in-context predictions without retraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We formalize regime shifts as an in-context change-point detection problem and prove the existence of transformer models that solve it for piecewise stationary sequences from known parametric families. The construction shows that the required model size depends on the available information about the change-point location, ranging from no prior knowledge to exact timing. Experiments confirm that trained transformers match optimal baselines on synthetic linear regression and dynamical system tasks, and that injecting change-point knowledge boosts the performance of pretrained models on infectious disease and financial volatility forecasting.
What carries the argument
A layered transformer construction that encodes change-point location information at varying precision levels to enable in-context detection and adaptation for linear models.
If this is right
- Transformers achieve optimal in-context performance on linear regression and linear dynamical systems once they detect the regime shift (a minimal oracle-vs-naive sketch follows this list).
- Providing more precise information about change timing reduces the number of layers and parameters needed.
- Pretrained foundation models improve on real tasks like disease and volatility forecasting when change-point signals are added without any retraining.
- The same architecture works across synthetic and real non-stationary sequences when the parametric family assumption holds.
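To make the first bullet concrete, here is a hedged sketch of the gap the construction is meant to close: an oracle that knows the change point discards pre-change evidence and fits only post-change data, while a naive in-context learner pools the whole context. All dimensions, noise levels, and the seed are illustrative assumptions, not values from the paper.

```python
# Hedged sketch, not the paper's construction: the oracle-vs-naive gap
# on a single regime shift in linear regression.
import numpy as np

rng = np.random.default_rng(0)
d, T, tau = 5, 60, 30          # feature dim, context length, true change point

w_pre, w_post = rng.normal(size=d), rng.normal(size=d)
X = rng.normal(size=(T, d))
y = np.concatenate([X[:tau] @ w_pre, X[tau:] @ w_post]) + 0.1 * rng.normal(size=T)

# Oracle with exact timing: discard pre-change evidence, fit OLS on the rest.
w_oracle, *_ = np.linalg.lstsq(X[tau:], y[tau:], rcond=None)
# Naive in-context learner: pools all evidence, ignoring the shift.
w_naive, *_ = np.linalg.lstsq(X, y, rcond=None)

x_query = rng.normal(size=d)
print("oracle prediction error:", abs(x_query @ (w_oracle - w_post)))
print("naive prediction error: ", abs(x_query @ (w_naive - w_post)))
```

The naive fit is biased toward a blend of the two regimes; detecting the shift and downweighting stale evidence is exactly what the construction asks the transformer to do in-context.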
Where Pith is reading between the lines
- Pretrained models could be extended with lightweight change detectors to handle streaming data better than current fine-tuning approaches.
- The scaling of model complexity with timing knowledge suggests a design principle for other sequence models facing abrupt shifts.
- Testing the construction on nonlinear or unknown-family regimes would reveal where the current existence proof stops applying.
Load-bearing premise
The underlying data consists of piecewise stationary segments drawn from known parametric families that admit efficient in-context solutions.
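For the linear dynamical system side of this premise, a minimal sketch of the assumed data-generating process: a state sequence whose transition matrix switches once at an unknown time. All constants (dimensions, scales, the seed) are illustrative assumptions.

```python
# Minimal sketch, assuming the paper's LDS setting: piecewise stationary
# dynamics with a single switch at time tau.
import numpy as np

rng = np.random.default_rng(1)
d, T, tau = 3, 100, 50
A_pre = 0.9 * np.eye(d)                       # pre-change dynamics
A_post = rng.normal(scale=0.3, size=(d, d))   # post-change dynamics

x = np.zeros((T, d))
x[0] = rng.normal(size=d)
for t in range(1, T):
    A = A_pre if t < tau else A_post
    x[t] = A @ x[t - 1] + 0.05 * rng.normal(size=d)
# An in-context learner observes only x and must detect tau and re-identify A.
```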
What would settle it
A demonstration that no fixed-size transformer can match the performance of an oracle that knows the change point when the data segments belong to unknown or non-parametric families.
Original abstract
Non-stationary sequences arise naturally in control, forecasting, and decision-making. The data-generating process shifts at unknown times, and models must detect the change, discard or downweight obsolete evidence, and adapt to new dynamics on the fly. Transformer-based foundation models increasingly rely on in-context learning for time series forecasting, tabular prediction, and continuous control. As these models are deployed in non-stationary environments, understanding their ability to detect and adapt to regime shifts is important. We formalize this as an in-context change-point detection problem and formally establish the existence of transformer models that solve this problem. Our construction demonstrates that model complexity, in layers and parameters, depends on the level of information available about the change-point location, from no knowledge to knowing exact timing. We validate our results with experiments on synthetic linear regression and linear dynamical systems, where trained transformers match the performance of optimal baselines across information levels. We also show that encoding and incorporating changepoint knowledge indeed improves the real-world performance of a pretrained foundation model on infectious disease forecasting and on financial volatility forecasting around Federal Open Market Committee (FOMC) announcements without retraining, demonstrating practical applicability to real-world regime changes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formalizes in-context learning under regime changes as an in-context change-point detection task for piecewise-stationary sequences drawn from known parametric families (linear regression and linear dynamical systems). It provides an existence construction for transformer models that solve the task, with the required depth and parameter count explicitly depending on the amount of side information supplied about the change-point location (none, partial, or exact). The construction is validated by showing that trained transformers recover the performance of optimal statistical baselines on synthetic tasks and that injecting change-point knowledge improves a pretrained foundation model on two real forecasting domains (infectious-disease and financial-volatility series) without retraining.
Significance. If the existence result and the matching to optimal baselines hold, the work supplies a concrete theoretical account of how transformer depth and attention can realize finite-state change-point detectors whose complexity scales with available information. The explicit construction and the zero-shot improvement on real data are strengths that distinguish the contribution from purely empirical studies of non-stationarity.
major comments (2)
- [§3.2, Theorem 3.1] The existence proof reduces the problem to realizing a finite-state machine via attention and feed-forward layers, but the explicit state-transition table and the corresponding attention-head construction for the 'no knowledge' regime are only sketched; without the full layer-by-layer mapping it is impossible to verify the claimed parameter scaling.
- [§4.2, Table 2] The reported equivalence between trained transformers and the optimal Bayesian filter holds only for the listed noise variances and segment lengths; the paper does not show that the match survives when the segment-length distribution is misspecified relative to the training distribution, which is load-bearing for the claim that transformers 'match optimal baselines across information levels'.
minor comments (2)
- [§2 and §5] Notation for the change-point indicator variable is introduced in §2 but reused with a different meaning in the real-data experiments (§5); a single consistent symbol would improve readability.
- [Abstract] The abstract states that 'model complexity … depends on the level of information'; the precise functional dependence (e.g., O(log T) layers for exact timing) is only stated in the theorem and should be repeated in the abstract or introduction for clarity.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. The comments highlight important aspects of the existence construction and the empirical validation. We address each major comment below with clarifications and proposed revisions.
Point-by-point responses
- Referee: [§3.2, Theorem 3.1] The existence proof reduces the problem to realizing a finite-state machine via attention and feed-forward layers, but the explicit state-transition table and the corresponding attention-head construction for the 'no knowledge' regime are only sketched; without the full layer-by-layer mapping it is impossible to verify the claimed parameter scaling.
  Authors: We agree that the no-knowledge case in Theorem 3.1 is presented at a higher level of abstraction than the partial- and exact-knowledge regimes, focusing on the reduction to an FSM simulator whose depth and width scale with the number of states. The parameter bounds follow from the standard construction of attention-based FSMs (as in prior work on transformer universality for finite automata), but we acknowledge that an explicit state-transition table and per-layer attention-head wiring for the no-knowledge regime would aid verification. In the revision we will add a dedicated appendix subsection that supplies the full layer-by-layer mapping, the explicit transition table for the no-knowledge detector, and the corresponding attention-head and feed-forward configurations. This addition will make the claimed scaling fully verifiable while leaving the theorem statement unchanged. Revision: partial.
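To illustrate the kind of detector this exchange refers to, here is a hedged two-state sketch in the spirit of Page's CUSUM [7]: a machine that stays in PRE until an accumulated residual statistic crosses a threshold, then transitions once to POST. The threshold, drift, and state names are illustrative assumptions; the paper's actual FSM and its attention-layer realization are what the promised appendix would specify.

```python
# Hedged sketch of a two-state change detector (PRE -> POST) driven by a
# CUSUM-style residual statistic; not the paper's construction.
def fsm_detector(residuals, threshold=5.0, drift=0.5):
    state, cusum = "PRE", 0.0
    for t, r in enumerate(residuals):
        cusum = max(0.0, cusum + abs(r) - drift)   # accumulate excess residual mass
        if state == "PRE" and cusum > threshold:
            return "POST", t                       # one-way transition: shift declared
    return state, None                             # no shift detected in this context
```

The depth-and-width scaling claim amounts to saying that an attention stack can simulate this state update, with more states (and hence more layers and parameters) needed as side information about the change point decreases.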
- Referee: [§4.2, Table 2] The reported equivalence between trained transformers and the optimal Bayesian filter holds only for the listed noise variances and segment lengths; the paper does not show that the match survives when the segment-length distribution is misspecified relative to the training distribution, which is load-bearing for the claim that transformers 'match optimal baselines across information levels'.
  Authors: The referee correctly notes that the current experiments in §4.2 use segment-length distributions that match those seen during training. While this demonstrates that transformers can recover the optimal filter when the generative assumptions are aligned, it does not yet address robustness under misspecification. To strengthen the claim, we will add a new set of experiments in the revision that evaluate both the trained transformers and the Bayesian filter on test sequences whose segment lengths are drawn from a deliberately misspecified distribution (e.g., switching from uniform to geometric or vice versa, while keeping noise variances fixed). We will report the resulting performance gap (or lack thereof) for each information level. This extension directly addresses the load-bearing aspect of the claim. Revision: partial.
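A sketch of the check the authors propose, under assumed distributions: calibrate on uniform segment lengths, evaluate on geometric lengths of roughly matched mean, and score detectors across the two. The ranges, the mean of 40, and the function name are illustrative assumptions, not values from the paper.

```python
# Sketch of the proposed misspecification experiment: train-time segment
# lengths uniform, test-time lengths geometric, noise variance held fixed.
import numpy as np

def sample_segment_lengths(dist, n, rng):
    if dist == "uniform":
        return rng.integers(20, 60, size=n)                      # train-time assumption
    if dist == "geometric":
        return np.clip(rng.geometric(1 / 40, size=n), 2, None)   # misspecified test
    raise ValueError(dist)

rng = np.random.default_rng(2)
train_lengths = sample_segment_lengths("uniform", 1000, rng)
test_lengths = sample_segment_lengths("geometric", 1000, rng)
# Sequences built from test_lengths would then score both the trained
# transformer and the Bayesian filter, per information level.
print(train_lengths.mean(), test_lengths.mean())
```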
Circularity Check
No significant circularity in existence construction
Full rationale
The paper's central result is an existence proof via explicit construction of transformer architectures that realize in-context change-point detection for piecewise-stationary data drawn from known parametric families (linear regression or LDS). Model depth and parameter count are shown to scale with the amount of side information supplied about change-point location. The construction reduces the problem to encoding a finite-state detector using attention and feed-forward layers; synthetic experiments confirm trained models recover the performance of the corresponding optimal statistical procedure. Real-world transfer to disease and volatility forecasting is shown without retraining. No load-bearing step reduces by definition to its own inputs, renames a fitted quantity as a prediction, or relies on a self-citation chain whose validity is internal to the present work. The argument is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Data consists of piecewise stationary segments drawn from known model classes (linear regression or linear dynamical systems).
Reference graph
Works this paper leans on
- [1] A. Willsky and H. Jones. A generalized likelihood ratio approach to the detection and estimation of jumps in linear systems. IEEE Transactions on Automatic Control, 21(1):108–112, 1976.
- [2] Michèle Basseville and Igor V. Nikiforov. Detection of Abrupt Changes: Theory and Application. 1993.
- [3] O. L. V. Costa, M. D. Fragoso, and R. P. Marques. Discrete-Time Markov Jump Linear Systems. Probability and Its Applications. Springer-Verlag, 2005.
- [4] James D. Hamilton. A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica, 57(2):357–384, 1989.
- [5] Andrew Ang and Allan Timmermann. Regime changes and financial markets. Working paper, National Bureau of Economic Research, June 2011.
- [6] Carson Dudley et al. From sparse data to smart decisions: Region-specific policy evaluation via simulation. medRxiv, 2025.
- [7] E. S. Page. Continuous inspection schemes. Biometrika, 41, 1954.
- [8] Tze Leung Lai. Sequential changepoint detection in quality control and dynamical systems. Journal of the Royal Statistical Society: Series B, 57(4), 1995.
- [9] Ryan Prescott Adams and David J. C. MacKay. Bayesian online changepoint detection, 2007.
- [10] Ashish Vaswani, Noam Shazeer, et al. Attention is all you need, 2023.
- [11] Tom B. Brown et al. Language models are few-shot learners, 2020.
- [12] Shivam Garg et al. What can transformers learn in-context? A case study of simple function classes, 2023.
- [13] Johannes von Oswald et al. Transformers learn in-context by gradient descent, 2023.
- [14] Ekin Akyürek et al. What learning algorithm is in-context learning? Investigations with linear models, 2023.
- [15] Yingcong Li, M. Emrullah Ildiz, Dimitris Papailiopoulos, and Samet Oymak. Transformers as algorithms: Generalization and stability in in-context learning, 2023.
- [16] Yufeng Zhang, Fengzhuo Zhang, Zhuoran Yang, and Zhaoran Wang. What and how does in-context learning learn? Bayesian model averaging, parameterization, and generalization, 2023.
- [17] Ruiqi Zhang, Spencer Frei, and Peter L. Bartlett. Trained transformers learn linear models in-context, 2023.
- [18] Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. A decoder-only foundation model for time-series forecasting, 2024.
- [19] Abdul Fatir Ansari et al. Chronos: Transformer-based language models for time-series forecasting, 2024.
- [20] Noah Hollmann, Samuel Müller, et al. Accurate predictions on small data with a tabular foundation model. Nature, 637:319–326, 2025.
- [21] Yingcong Li et al. Gating is weighting: Understanding gated linear attention through in-context learning, 2025.
- [22] Carson Dudley, Yutong Bi, Xiaofeng Liu, and Samet Oymak. Transformers as adaptive estimators: In-context learning under regime change, 2026. Technical report.
- [23] Chung-ki Min and Arnold Zellner. Bayesian and non-Bayesian methods for combining models and forecasts with applications to forecasting international growth rates. Journal of Econometrics, 56(1-2):89–118, 1993.
- [24] Jennifer A. Hoeting, David Madigan, Adrian E. Raftery, and Chris T. Volinsky. Bayesian model averaging: A tutorial. Statistical Science, 14(4):382–417, 1999.
- [25] Carson Dudley et al. Simulation as supervision: Mechanistic pretraining for scientific discovery. arXiv preprint arXiv:2507.08977, 2025.
- [26] Ege Onur Taga, M. Emrullah Ildiz, and Samet Oymak. TimePFN: Effective multivariate time series forecasting with synthetic data. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025.
- [27] Carson Dudley and Marisa Eisenberg. Learning from simulators: A theory of simulation-grounded learning. arXiv preprint arXiv:2509.18990, 2025.
- [28] Carson Dudley et al. Mantis: A foundation model for mechanistic disease forecasting. arXiv preprint arXiv:2508.12260, 2025.
- [29] Suprabhath Kalahasti et al. Foundation time series models for forecasting and policy evaluation in infectious disease epidemics. medRxiv, February 2025.
- [30] Shi Bin Hoo, Samuel Müller, David Salinas, and Frank Hutter. From tables to time: Extending TabPFN-v2 to time series forecasting, 2026.
- [31] Kenneth N. Kuttner. Monetary policy surprises and interest rates: Evidence from the fed funds futures market. Journal of Monetary Economics, 47(3):523–544, 2001.
- [32] Ben S. Bernanke and Kenneth N. Kuttner. What explains the stock market's reaction to Federal Reserve policy? The Journal of Finance, 60(3):1221–1257, 2005.
- [33] Board of Governors of the Federal Reserve System (US). Federal funds effective rate [DFF]. https://fred.stlouisfed.org/series/DFF, 2026. Retrieved from FRED, Federal Reserve Bank of St. Louis.
- [34] S&P Dow Jones Indices LLC. S&P 500 [SP500]. https://fred.stlouisfed.org/series/SP500, 2026. Retrieved from FRED, Federal Reserve Bank of St. Louis.
- [35] Board of Governors of the Federal Reserve System. Federal Open Market Committee historical materials by year. https://www.federalreserve.gov/monetarypolicy/fomc_historical_year.htm, 2026.