Recognition: 1 theorem link
Does Your Neural Network Extrapolate? Feature Engineering as Identifiability Bias for OOD Generalization
Pith reviewed 2026-05-14 21:30 UTC · model grok-4.3
The pith
Out-of-distribution extrapolation is non-identifiable from any single training window without an explicit structural commitment to features, labels, and model class.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
From a single training window, OOD extrapolation is non-identifiable: infinitely many DGPs are ε-observationally equivalent on the training data but diverge arbitrarily outside it, and no in-distribution criterion alone reliably breaks the tie. A structural commitment (the feature map φ, label map ψ, and model class M) dictates the assumed DGP and governs OOD generalization while leaving ID performance essentially unchanged. When the commitment is correct and identifiable, OOD error vanishes, for example when Fourier coordinates turn periodic extrapolation into interpolation on the circle.
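To see the non-identifiability claim in miniature, here is a hedged numerical sketch (not taken from the paper's code; the particular pair of processes and the window [0, 1] are illustrative choices): two DGPs agree to within 10⁻³ everywhere on the training window yet differ by orders of magnitude a short distance outside it, so with label noise larger than that gap no in-distribution criterion can tell them apart.

```python
import numpy as np

# Illustrative pair of DGPs, in the spirit of the paper's x^2-vs-exponential example:
# eps-observationally equivalent on the window W = [0, 1], divergent outside it.
P1 = lambda x: x ** 2
P2 = lambda x: x ** 2 + 1e-3 * x ** 10

x_w = np.linspace(0.0, 1.0, 1001)        # training window W
x_far = np.array([2.0, 5.0, 10.0])       # OOD probe points

print("sup |P1 - P2| on W :", np.abs(P1(x_w) - P2(x_w)).max())   # <= 1e-3
print("|P1 - P2| at OOD x :", np.abs(P1(x_far) - P2(x_far)))     # ~1, ~1e4, ~1e7

# With label noise sigma >> 1e-3, samples drawn from P1 or P2 on W are statistically
# indistinguishable, so no ID-only model-selection rule reliably breaks the tie.
```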
What carries the argument
The structural commitment consisting of feature map φ, label map ψ, and model class M, which selects among observationally equivalent data-generating processes and thereby fixes out-of-distribution behavior.
If this is right
- When architecture, pretraining, or augmentation implicitly supplies the correct commitment, out-of-distribution success follows at unchanged in-distribution loss.
- Correct features are necessary but not sufficient: the model class must be able to express the target function and the transformed training data must cover the relevant representation space.
- In mass-action chemistry, Keplerian exoplanet prediction, and cross-species DNA detection, the right feature commitment enables accurate out-of-distribution predictions (a Kepler-style sketch follows this list).
- Periodic extrapolation reduces to interpolation once the input is expressed in the appropriate coordinate system such as Fourier features.
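One of these settings is easy to mimic with synthetic data. The sketch below is not the paper's NASA Exoplanet Archive experiment; it simulates Kepler's third law (T² = a³/M⋆ in solar units, with the stellar-mass range, noise level, and train/test split chosen purely for illustration) and shows that ordinary least squares on log features recovers the exponents from close-in planets and transfers to much wider orbits.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic Kepler-like data (solar units): T^2 = a^3 / M_star, with mild log-normal noise.
def sample(n, a_lo, a_hi):
    a = rng.uniform(a_lo, a_hi, n)                   # semi-major axis [AU]
    m = rng.uniform(0.5, 1.5, n)                     # stellar mass [M_sun]
    T = np.sqrt(a ** 3 / m) * np.exp(0.01 * rng.standard_normal(n))  # period [yr]
    return a, m, T

a_tr, m_tr, T_tr = sample(500, 0.05, 1.0)            # ID: close-in planets
a_te, m_te, T_te = sample(500, 1.0, 10.0)            # OOD: wide orbits

# Structural commitment: phi = log features, psi = log T, model class = linear (OLS).
phi = lambda a, m: np.column_stack([np.ones_like(a), np.log(a), np.log(m)])
w, *_ = np.linalg.lstsq(phi(a_tr, m_tr), np.log(T_tr), rcond=None)

rel_err = np.abs(np.exp(phi(a_te, m_te) @ w) - T_te) / T_te
print("learned exponents (expect ~1.5 and ~-0.5):", w[1:])
print("median OOD relative error:", np.median(rel_err))
```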
Where Pith is reading between the lines
- Many reported out-of-distribution failures may be diagnosed and mitigated by auditing the implicit feature commitments rather than by scaling models or data alone.
- Systematic testing of alternative feature maps on the same training set could serve as a practical diagnostic for whether poor extrapolation stems from identifiability rather than from other causes (see the sketch after this list).
- The same logic may apply to time-series forecasting and sequential decision tasks in which the relevant state representation is not directly supplied by the observations.
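A minimal form of that feature-map audit is sketched below (the candidate maps, the toy DGP, and the probe grid are all illustrative assumptions, not a procedure from the paper): fit each candidate commitment to the same training window, then compare their predictions at unlabeled probe points beyond it. Comparably low ID losses combined with large mutual disagreement on the probe grid is the signature of an identifiability problem rather than an optimization or capacity problem, and it requires no OOD labels.

```python
import numpy as np

rng = np.random.default_rng(2)
x_tr = rng.uniform(0.0, 2.0, 300)
y_tr = x_tr ** 2 + 0.05 * rng.standard_normal(300)    # true DGP unknown to the auditor
x_probe = np.linspace(2.0, 8.0, 50)                   # unlabeled probe points beyond the window

candidates = {
    "poly":    lambda x: np.vander(x, 6, increasing=True),
    "exp":     lambda x: np.column_stack([np.ones_like(x), x, np.exp(x), np.exp(-x)]),
    "fourier": lambda x: np.column_stack([np.ones_like(x), np.sin(x), np.cos(x),
                                          np.sin(2 * x), np.cos(2 * x)]),
}

id_loss, probe_pred = {}, {}
for name, phi in candidates.items():
    w, *_ = np.linalg.lstsq(phi(x_tr), y_tr, rcond=None)
    id_loss[name] = float(np.mean((phi(x_tr) @ w - y_tr) ** 2))
    probe_pred[name] = phi(x_probe) @ w

disagreement = np.ptp(np.stack(list(probe_pred.values())), axis=0)
print("ID losses:", {k: round(v, 4) for k, v in id_loss.items()})
print("max disagreement between commitments on the probe grid:", float(disagreement.max()))
```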
Load-bearing premise
That a structural commitment to features, labels, and model class can be chosen independently of the training data while leaving in-distribution performance essentially unchanged, and that this choice is what actually governs out-of-distribution behavior.
What would settle it
A controlled experiment that applies two different structural commitments to identical training data at the same in-distribution loss and shows that only the commitment matching the true process produces near-zero out-of-distribution error.
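On a toy periodic task, that experiment can be run in a few lines. The sketch below is a hedged stand-in for such a study, not the paper's protocol: the model class (ordinary least squares over a feature map), the training window, and the noise level are all illustrative. Both commitments are fit to identical data and reach essentially the same ID loss; only the Fourier commitment, which matches the true periodic process, extrapolates.

```python
import numpy as np

rng = np.random.default_rng(0)

f = np.sin                                   # true periodic process
x_tr = rng.uniform(0.0, 2.0, 300)            # identical training data for both commitments
y_tr = f(x_tr) + 0.01 * rng.standard_normal(300)
x_ood = rng.uniform(10.0, 20.0, 300)         # evaluation far outside the window

def fit_eval(phi):
    """Same model class (linear least squares); only the feature map phi changes."""
    w, *_ = np.linalg.lstsq(phi(x_tr), y_tr, rcond=None)
    id_mse = np.mean((phi(x_tr) @ w - y_tr) ** 2)
    ood_mse = np.mean((phi(x_ood) @ w - f(x_ood)) ** 2)
    return float(id_mse), float(ood_mse)

poly = lambda x: np.vander(x, 6, increasing=True)                             # mismatched commitment
fourier = lambda x: np.column_stack([np.ones_like(x), np.sin(x), np.cos(x)])  # matching commitment

print("polynomial (ID MSE, OOD MSE):", fit_eval(poly))
print("Fourier    (ID MSE, OOD MSE):", fit_eval(fourier))
```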
read the original abstract
Successful deep neural networks discover salient features of data. We show when and why they fail to learn out-of-distribution (OOD)-relevant representations from an in-distribution (ID) training window. This requires decoupling feature learning from data-generating-process (DGP) identifiability. From a single training window, OOD extrapolation is non-identifiable: infinitely many DGPs are $\varepsilon$-observationally equivalent on the training data but diverge arbitrarily outside it, and no in-distribution criterion alone reliably breaks the tie. A structural commitment, the feature map, label map, and model class $(\varphi, \psi, \mathcal{M})$, dictates the assumed DGP and governs OOD generalization while leaving ID performance essentially unchanged. When architecture, pretraining, augmentation, input formats, or domain knowledge implicitly inject the missing commitment, the model succeeds. When it cannot infer OOD-relevant structure from ID evidence, it fails. Changing only the representation can make the same architecture, at the same in-distribution loss, differ by ${\sim}520\times$ out of distribution. When the commitment is correct and identifiable, OOD error vanishes. For example, Fourier coordinates turn periodic extrapolation into interpolation on $\mathbb{S}^1$. The same mechanism predicts outcomes in three natural-science settings (mass-action chemistry; Kepler's-third-law exoplanet prediction, $n=2{,}362$; and cross-species coding-DNA detection) and in a 264-run positional-encoding study across Transformer, Mamba, and S4D. Finally, a controlled study shows: correct features are necessary but not sufficient. The model class must express the target, and the transformed training data must cover the relevant representation space.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript argues that OOD extrapolation from a single ID training window is non-identifiable: infinitely many DGPs are ε-observationally equivalent on the training data yet diverge arbitrarily outside it, with no ID-only criterion able to select among them. A structural commitment (feature map φ, label map ψ, model class M) is claimed to dictate the assumed DGP and thereby govern OOD behavior while leaving ID performance essentially unchanged. Empirical support includes a ~520× OOD gap obtained by altering only the representation, a 264-run positional-encoding study across Transformer/Mamba/S4D, and successful predictions in mass-action chemistry, Kepler-third-law exoplanet data (n=2,362), and cross-species DNA detection; a controlled study concludes that correct features are necessary but not sufficient, requiring both model expressivity and coverage of the transformed space.
Significance. If the central non-identifiability argument and the isolation of the structural commitment hold, the work supplies a coherent account of why explicit feature engineering and inductive biases succeed at OOD tasks where pure ID optimization fails. The concrete scientific examples and the demonstration that representation change can convert extrapolation into interpolation on S¹ provide falsifiable predictions that could usefully guide architecture and preprocessing choices in extrapolation settings.
major comments (3)
- [Abstract] The claim that the same architecture at the same in-distribution loss can differ by ~520× OOD when only the representation is changed is load-bearing for the non-identifiability thesis, yet the manuscript reports neither the precise ID loss values attained under each representation nor a control that holds the learned ID function fixed (e.g., via post-hoc projection) while swapping only the coordinate system for OOD evaluation.
- [Positional-encoding study] In the 264-run positional-encoding study, altering positional encodings necessarily changes the coordinate system in which gradients are computed and the conditioning of the loss landscape; without an explicit demonstration that the optimizer reaches ID functions of equivalent quality across encodings, the observed OOD gaps cannot be attributed solely to the injected structural commitment rather than to differences in reachable minima.
- [Controlled study] The assertion that 'the transformed training data must cover the relevant representation space' is central to the necessity claim, but the manuscript provides no quantitative measure (e.g., coverage radius or density in the transformed coordinates) and no ablation that isolates coverage failure from expressivity failure.
minor comments (1)
- [Abstract] Notation for the structural commitment is introduced as (φ, ψ, M) but later appears as (ϕ, ψ, M); a single consistent symbol set would remove ambiguity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below with clarifications and indicate where revisions will be made to strengthen the manuscript.
read point-by-point responses
- Referee: [Abstract] The claim that the same architecture at the same in-distribution loss can differ by ~520× OOD when only the representation is changed is load-bearing for the non-identifiability thesis, yet the manuscript reports neither the precise ID loss values attained under each representation nor a control that holds the learned ID function fixed (e.g., via post-hoc projection) while swapping only the coordinate system for OOD evaluation.
Authors: We agree that explicit reporting of ID losses is needed to support the claim. The revised manuscript adds a table listing the precise ID losses (all within 0.001 of each other) for the representations achieving the 520× gap. Regarding a post-hoc projection control, we maintain that it would not isolate the effect cleanly because our argument concerns the full training dynamics under the structural commitment; projecting after training would change the effective model class. We have added a clarifying paragraph explaining why the from-scratch training on transformed inputs already holds the architecture and ID loss fixed while varying only the representation. revision: yes
- Referee: [Positional-encoding study] In the 264-run positional-encoding study, altering positional encodings necessarily changes the coordinate system in which gradients are computed and the conditioning of the loss landscape; without an explicit demonstration that the optimizer reaches ID functions of equivalent quality across encodings, the observed OOD gaps cannot be attributed solely to the injected structural commitment rather than to differences in reachable minima.
Authors: We appreciate this observation. Re-inspection of the experimental data shows that ID validation losses across all 264 runs converge to within 3% relative difference regardless of encoding. The revised manuscript includes this statistic together with loss-trajectory plots demonstrating comparable convergence. While we acknowledge that conditioning differs, the matched ID performance allows us to attribute the OOD gaps primarily to the structural bias injected by the encoding rather than optimization artifacts. revision: yes
- Referee: [Controlled study] The assertion that 'the transformed training data must cover the relevant representation space' is central to the necessity claim, but the manuscript provides no quantitative measure (e.g., coverage radius or density in the transformed coordinates) and no ablation that isolates coverage failure from expressivity failure.
Authors: We agree a quantitative coverage metric would improve clarity. The controlled study already varies training-data span in the transformed coordinates to induce coverage failure, but we will add an explicit coverage-density measure (fraction of the target representation space covered by the transformed training points) and a dedicated ablation that holds the model class fixed while sweeping only the coverage range. These additions will be included in the revision. revision: yes
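For readers wanting a concrete instance of such a coverage-density measure, here is a hedged sketch (not the authors' metric; the match radius and the S¹ positional example with period P = 64 are assumptions chosen for illustration): report the fraction of the evaluation points, mapped into the representation space, that lie within a small radius of some transformed training point.

```python
import numpy as np

def coverage_fraction(z_train, z_target, radius):
    """Fraction of target-representation points within `radius` of some transformed
    training point; a simple coverage-density measure in representation space."""
    # Brute-force pairwise distances (fine for small samples; use a KD-tree for large ones).
    d = np.linalg.norm(z_target[:, None, :] - z_train[None, :, :], axis=-1)
    return float(np.mean(d.min(axis=1) <= radius))

# Illustrative periodic task with period P; phi maps integer positions onto the circle S^1.
P = 64
phi = lambda t: np.column_stack([np.cos(2 * np.pi * t / P), np.sin(2 * np.pi * t / P)])

t_train_full = np.arange(0, 4 * P)      # a 4P-long window tiles S^1 completely
t_train_half = np.arange(0, P // 2)     # a P/2-long window covers only half the circle
t_ood = np.arange(1000, 1000 + P)       # far-OOD positions, mapped through the same phi

print("coverage (full window):", coverage_fraction(phi(t_train_full), phi(t_ood), 0.2))
print("coverage (half window):", coverage_fraction(phi(t_train_half), phi(t_ood), 0.2))
```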
Circularity Check
No significant circularity in the derivation chain
full rationale
The paper's central non-identifiability claim—that infinitely many DGPs are ε-observationally equivalent on a single training window yet diverge arbitrarily outside it—is a general theoretical observation, not derived by construction from fitted parameters or self-citations. The structural commitment (φ, ψ, M) is introduced as an independent modeling choice whose OOD effects are demonstrated empirically (e.g., Fourier coordinates converting extrapolation to interpolation, 520× OOD gaps at matched ID loss across 264 runs and natural-science examples). No step reduces a prediction to a fitted input, renames a known result, or relies on a load-bearing self-citation whose content is itself unverified. The reported outcomes are presented as direct experimental measurements rather than quantities forced by the same data used to define the commitments.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: OOD extrapolation from a single training window is non-identifiable without an external structural commitment.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
tag: unclear · Relation between the paper passage and the cited Recognition theorem.
From a single training window, OOD extrapolation is non-identifiable: infinitely many DGPs are ε-observationally equivalent on the training data but diverge arbitrarily outside it... A structural commitment, the feature map, label map, and model class (φ, ψ, M), dictates the assumed DGP
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.