Recognition: 1 theorem link
Does Your Neural Network Extrapolate? Feature Engineering as Identifiability Bias for OOD Generalization
Pith reviewed 2026-05-14 21:30 UTC · model grok-4.3
The pith
Out-of-distribution extrapolation is non-identifiable from any single training window without an explicit structural commitment to features, labels, and model class.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
From a single training window, OOD extrapolation is non-identifiable: infinitely many DGPs are ε-observationally equivalent on the training data but diverge arbitrarily outside it, and no in-distribution criterion alone reliably breaks the tie. A structural commitment (the feature map φ, label map ψ, and model class M) dictates the assumed DGP and governs OOD generalization while leaving ID performance essentially unchanged. When the commitment is correct and identifiable, OOD error vanishes, for example when Fourier coordinates turn periodic extrapolation into interpolation on the circle.
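To see the non-identifiability claim in miniature, here is a hedged numerical sketch (not taken from the paper's code; the particular pair of processes and the window [0, 1] are illustrative choices): two DGPs agree to within 10⁻³ everywhere on the training window yet differ by orders of magnitude a short distance outside it, so with label noise larger than that gap no in-distribution criterion can tell them apart.

```python
import numpy as np

# Illustrative pair of DGPs, in the spirit of the paper's x^2-vs-exponential example:
# eps-observationally equivalent on the window W = [0, 1], divergent outside it.
P1 = lambda x: x ** 2
P2 = lambda x: x ** 2 + 1e-3 * x ** 10

x_w = np.linspace(0.0, 1.0, 1001)        # training window W
x_far = np.array([2.0, 5.0, 10.0])       # OOD probe points

print("sup |P1 - P2| on W :", np.abs(P1(x_w) - P2(x_w)).max())   # <= 1e-3
print("|P1 - P2| at OOD x :", np.abs(P1(x_far) - P2(x_far)))     # ~1, ~1e4, ~1e7

# With label noise sigma >> 1e-3, samples drawn from P1 or P2 on W are statistically
# indistinguishable, so no ID-only model-selection rule reliably breaks the tie.
```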
What carries the argument
The structural commitment consisting of feature map φ, label map ψ, and model class M, which selects among observationally equivalent data-generating processes and thereby fixes out-of-distribution behavior.
If this is right
- When architecture, pretraining, or augmentation implicitly supplies the correct commitment, out-of-distribution success follows at unchanged in-distribution loss.
- Correct features are necessary but not sufficient: the model class must be able to express the target function and the transformed training data must cover the relevant representation space.
- In mass-action chemistry, Keplerian exoplanet prediction, and cross-species DNA detection, the right feature commitment enables accurate out-of-distribution predictions (a Kepler-style sketch follows this list).
- Periodic extrapolation reduces to interpolation once the input is expressed in the appropriate coordinate system such as Fourier features.
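One of these settings is easy to mimic with synthetic data. The sketch below is not the paper's NASA Exoplanet Archive experiment; it simulates Kepler's third law (T² = a³/M⋆ in solar units, with the stellar-mass range, noise level, and train/test split chosen purely for illustration) and shows that ordinary least squares on log features recovers the exponents from close-in planets and transfers to much wider orbits.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic Kepler-like data (solar units): T^2 = a^3 / M_star, with mild log-normal noise.
def sample(n, a_lo, a_hi):
    a = rng.uniform(a_lo, a_hi, n)                   # semi-major axis [AU]
    m = rng.uniform(0.5, 1.5, n)                     # stellar mass [M_sun]
    T = np.sqrt(a ** 3 / m) * np.exp(0.01 * rng.standard_normal(n))  # period [yr]
    return a, m, T

a_tr, m_tr, T_tr = sample(500, 0.05, 1.0)            # ID: close-in planets
a_te, m_te, T_te = sample(500, 1.0, 10.0)            # OOD: wide orbits

# Structural commitment: phi = log features, psi = log T, model class = linear (OLS).
phi = lambda a, m: np.column_stack([np.ones_like(a), np.log(a), np.log(m)])
w, *_ = np.linalg.lstsq(phi(a_tr, m_tr), np.log(T_tr), rcond=None)

rel_err = np.abs(np.exp(phi(a_te, m_te) @ w) - T_te) / T_te
print("learned exponents (expect ~1.5 and ~-0.5):", w[1:])
print("median OOD relative error:", np.median(rel_err))
```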
Where Pith is reading between the lines
- Many reported out-of-distribution failures may be diagnosed and mitigated by auditing the implicit feature commitments rather than by scaling models or data alone.
- Systematic testing of alternative feature maps on the same training set could serve as a practical diagnostic for whether poor extrapolation stems from identifiability rather than from other causes (see the sketch after this list).
- The same logic may apply to time-series forecasting and sequential decision tasks in which the relevant state representation is not directly supplied by the observations.
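A minimal form of that feature-map audit is sketched below (the candidate maps, the toy DGP, and the probe grid are all illustrative assumptions, not a procedure from the paper): fit each candidate commitment to the same training window, then compare their predictions at unlabeled probe points beyond it. Comparably low ID losses combined with large mutual disagreement on the probe grid is the signature of an identifiability problem rather than an optimization or capacity problem, and it requires no OOD labels.

```python
import numpy as np

rng = np.random.default_rng(2)
x_tr = rng.uniform(0.0, 2.0, 300)
y_tr = x_tr ** 2 + 0.05 * rng.standard_normal(300)    # true DGP unknown to the auditor
x_probe = np.linspace(2.0, 8.0, 50)                   # unlabeled probe points beyond the window

candidates = {
    "poly":    lambda x: np.vander(x, 6, increasing=True),
    "exp":     lambda x: np.column_stack([np.ones_like(x), x, np.exp(x), np.exp(-x)]),
    "fourier": lambda x: np.column_stack([np.ones_like(x), np.sin(x), np.cos(x),
                                          np.sin(2 * x), np.cos(2 * x)]),
}

id_loss, probe_pred = {}, {}
for name, phi in candidates.items():
    w, *_ = np.linalg.lstsq(phi(x_tr), y_tr, rcond=None)
    id_loss[name] = float(np.mean((phi(x_tr) @ w - y_tr) ** 2))
    probe_pred[name] = phi(x_probe) @ w

disagreement = np.ptp(np.stack(list(probe_pred.values())), axis=0)
print("ID losses:", {k: round(v, 4) for k, v in id_loss.items()})
print("max disagreement between commitments on the probe grid:", float(disagreement.max()))
```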
Load-bearing premise
That a structural commitment to features, labels, and model class can be chosen independently of the training data while leaving in-distribution performance essentially unchanged, and that this choice is what actually governs out-of-distribution behavior.
What would settle it
A controlled experiment that applies two different structural commitments to identical training data at the same in-distribution loss and shows that only the commitment matching the true process produces near-zero out-of-distribution error.
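On a toy periodic task, that experiment can be run in a few lines. The sketch below is a hedged stand-in for such a study, not the paper's protocol: the model class (ordinary least squares over a feature map), the training window, and the noise level are all illustrative. Both commitments are fit to identical data and reach essentially the same ID loss; only the Fourier commitment, which matches the true periodic process, extrapolates.

```python
import numpy as np

rng = np.random.default_rng(0)

f = np.sin                                   # true periodic process
x_tr = rng.uniform(0.0, 2.0, 300)            # identical training data for both commitments
y_tr = f(x_tr) + 0.01 * rng.standard_normal(300)
x_ood = rng.uniform(10.0, 20.0, 300)         # evaluation far outside the window

def fit_eval(phi):
    """Same model class (linear least squares); only the feature map phi changes."""
    w, *_ = np.linalg.lstsq(phi(x_tr), y_tr, rcond=None)
    id_mse = np.mean((phi(x_tr) @ w - y_tr) ** 2)
    ood_mse = np.mean((phi(x_ood) @ w - f(x_ood)) ** 2)
    return float(id_mse), float(ood_mse)

poly = lambda x: np.vander(x, 6, increasing=True)                             # mismatched commitment
fourier = lambda x: np.column_stack([np.ones_like(x), np.sin(x), np.cos(x)])  # matching commitment

print("polynomial (ID MSE, OOD MSE):", fit_eval(poly))
print("Fourier    (ID MSE, OOD MSE):", fit_eval(fourier))
```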
read the original abstract
Successful deep neural networks discover salient features of data. We show when and why they fail to learn out-of-distribution (OOD)-relevant representations from an in-distribution (ID) training window. This requires decoupling feature learning from data-generating-process (DGP) identifiability. From a single training window, OOD extrapolation is non-identifiable: infinitely many DGPs are $\varepsilon$-observationally equivalent on the training data but diverge arbitrarily outside it, and no in-distribution criterion alone reliably breaks the tie. A structural commitment, the feature map, label map, and model class $(\varphi, \psi, \mathcal{M})$, dictates the assumed DGP and governs OOD generalization while leaving ID performance essentially unchanged. When architecture, pretraining, augmentation, input formats, or domain knowledge implicitly inject the missing commitment, the model succeeds. When it cannot infer OOD-relevant structure from ID evidence, it fails. Changing only the representation can make the same architecture, at the same in-distribution loss, differ by ${\sim}520\times$ out of distribution. When the commitment is correct and identifiable, OOD error vanishes. For example, Fourier coordinates turn periodic extrapolation into interpolation on $\mathbb{S}^1$. The same mechanism predicts outcomes in three natural-science settings (mass-action chemistry; Kepler's-third-law exoplanet prediction, $n=2{,}362$; and cross-species coding-DNA detection) and in a 264-run positional-encoding study across Transformer, Mamba, and S4D. Finally, a controlled study shows: correct features are necessary but not sufficient. The model class must express the target, and the transformed training data must cover the relevant representation space.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript argues that OOD extrapolation from a single ID training window is non-identifiable: infinitely many DGPs are ε-observationally equivalent on the training data yet diverge arbitrarily outside it, with no ID-only criterion able to select among them. A structural commitment (feature map φ, label map ψ, model class M) is claimed to dictate the assumed DGP and thereby govern OOD behavior while leaving ID performance essentially unchanged. Empirical support includes a ~520× OOD gap obtained by altering only the representation, a 264-run positional-encoding study across Transformer/Mamba/S4D, and successful predictions in mass-action chemistry, Kepler-third-law exoplanet data (n=2,362), and cross-species DNA detection; a controlled study concludes that correct features are necessary but not sufficient, requiring both model expressivity and coverage of the transformed space.
Significance. If the central non-identifiability argument and the isolation of the structural commitment hold, the work supplies a coherent account of why explicit feature engineering and inductive biases succeed at OOD tasks where pure ID optimization fails. The concrete scientific examples and the demonstration that representation change can convert extrapolation into interpolation on S¹ provide falsifiable predictions that could usefully guide architecture and preprocessing choices in extrapolation settings.
major comments (3)
- [Abstract] The claim that the same architecture at the same in-distribution loss can differ by ~520× OOD when only the representation is changed is load-bearing for the non-identifiability thesis, yet the manuscript reports neither the precise ID loss values attained under each representation nor a control that holds the learned ID function fixed (e.g., via post-hoc projection) while swapping only the coordinate system for OOD evaluation.
- [Positional-encoding study] In the 264-run positional-encoding study, altering positional encodings necessarily changes the coordinate system in which gradients are computed and the conditioning of the loss landscape; without an explicit demonstration that the optimizer reaches ID functions of equivalent quality across encodings, the observed OOD gaps cannot be attributed solely to the injected structural commitment rather than to differences in reachable minima.
- [Controlled study] The assertion that 'the transformed training data must cover the relevant representation space' is central to the necessity claim, but the manuscript provides no quantitative measure (e.g., coverage radius or density in the transformed coordinates) and no ablation that isolates coverage failure from expressivity failure.
minor comments (1)
- [Abstract] Notation for the structural commitment is introduced as (φ, ψ, M) but later appears as (ϕ, ψ, M); a single consistent symbol set would remove ambiguity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below with clarifications and indicate where revisions will be made to strengthen the manuscript.
read point-by-point responses
- Referee: [Abstract] The claim that the same architecture at the same in-distribution loss can differ by ~520× OOD when only the representation is changed is load-bearing for the non-identifiability thesis, yet the manuscript reports neither the precise ID loss values attained under each representation nor a control that holds the learned ID function fixed (e.g., via post-hoc projection) while swapping only the coordinate system for OOD evaluation.
Authors: We agree that explicit reporting of ID losses is needed to support the claim. The revised manuscript adds a table listing the precise ID losses (all within 0.001 of each other) for the representations achieving the 520× gap. Regarding a post-hoc projection control, we maintain that it would not isolate the effect cleanly because our argument concerns the full training dynamics under the structural commitment; projecting after training would change the effective model class. We have added a clarifying paragraph explaining why the from-scratch training on transformed inputs already holds the architecture and ID loss fixed while varying only the representation. revision: yes
- Referee: [Positional-encoding study] In the 264-run positional-encoding study, altering positional encodings necessarily changes the coordinate system in which gradients are computed and the conditioning of the loss landscape; without an explicit demonstration that the optimizer reaches ID functions of equivalent quality across encodings, the observed OOD gaps cannot be attributed solely to the injected structural commitment rather than to differences in reachable minima.
Authors: We appreciate this observation. Re-inspection of the experimental data shows that ID validation losses across all 264 runs converge to within 3% relative difference regardless of encoding. The revised manuscript includes this statistic together with loss-trajectory plots demonstrating comparable convergence. While we acknowledge that conditioning differs, the matched ID performance allows us to attribute the OOD gaps primarily to the structural bias injected by the encoding rather than optimization artifacts. revision: yes
- Referee: [Controlled study] The assertion that 'the transformed training data must cover the relevant representation space' is central to the necessity claim, but the manuscript provides no quantitative measure (e.g., coverage radius or density in the transformed coordinates) and no ablation that isolates coverage failure from expressivity failure.
Authors: We agree a quantitative coverage metric would improve clarity. The controlled study already varies training-data span in the transformed coordinates to induce coverage failure, but we will add an explicit coverage-density measure (fraction of the target representation space covered by the transformed training points) and a dedicated ablation that holds the model class fixed while sweeping only the coverage range. These additions will be included in the revision. revision: yes
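For readers wanting a concrete instance of such a coverage-density measure, here is a hedged sketch (not the authors' metric; the match radius and the S¹ positional example with period P = 64 are assumptions chosen for illustration): report the fraction of the evaluation points, mapped into the representation space, that lie within a small radius of some transformed training point.

```python
import numpy as np

def coverage_fraction(z_train, z_target, radius):
    """Fraction of target-representation points within `radius` of some transformed
    training point; a simple coverage-density measure in representation space."""
    # Brute-force pairwise distances (fine for small samples; use a KD-tree for large ones).
    d = np.linalg.norm(z_target[:, None, :] - z_train[None, :, :], axis=-1)
    return float(np.mean(d.min(axis=1) <= radius))

# Illustrative periodic task with period P; phi maps integer positions onto the circle S^1.
P = 64
phi = lambda t: np.column_stack([np.cos(2 * np.pi * t / P), np.sin(2 * np.pi * t / P)])

t_train_full = np.arange(0, 4 * P)      # a 4P-long window tiles S^1 completely
t_train_half = np.arange(0, P // 2)     # a P/2-long window covers only half the circle
t_ood = np.arange(1000, 1000 + P)       # far-OOD positions, mapped through the same phi

print("coverage (full window):", coverage_fraction(phi(t_train_full), phi(t_ood), 0.2))
print("coverage (half window):", coverage_fraction(phi(t_train_half), phi(t_ood), 0.2))
```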
Circularity Check
No significant circularity in the derivation chain
full rationale
The paper's central non-identifiability claim—that infinitely many DGPs are ε-observationally equivalent on a single training window yet diverge arbitrarily outside it—is a general theoretical observation, not derived by construction from fitted parameters or self-citations. The structural commitment (φ, ψ, M) is introduced as an independent modeling choice whose OOD effects are demonstrated empirically (e.g., Fourier coordinates converting extrapolation to interpolation, 520× OOD gaps at matched ID loss across 264 runs and natural-science examples). No step reduces a prediction to a fitted input, renames a known result, or relies on a load-bearing self-citation whose content is itself unverified. The reported outcomes are presented as direct experimental measurements rather than quantities forced by the same data used to define the commitments.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: OOD extrapolation from a single training window is non-identifiable without an external structural commitment.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
tag: unclear · Relation between the paper passage and the cited Recognition theorem.
From a single training window, OOD extrapolation is non-identifiable: infinitely many DGPs are ε-observationally equivalent on the training data but diverge arbitrarily outside it... A structural commitment, the feature map, label map, and model class (φ, ψ, M), dictates the assumed DGP
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.