pith. machine review for the scientific record.

arxiv: 2605.08539 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links · Lean Theorem

Continuity Laws for Sequential Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:27 UTC · model grok-4.3

classification: 💻 cs.LG · cs.AI
keywords: sequential models · state space models · continuity · inductive bias · S4 · Mamba · temporal structure · subsampling

The pith

State-space models like S4 converge to continuous trajectories under temporal refinement while S6 does not, and this property aligns with better performance on tasks that have continuous temporal structure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether sequential models inspired by continuous-time systems actually exhibit continuous behavior when time is discretized. It defines model continuity through convergence of predictions as the sampling interval shrinks, then measures task continuity directly from the temporal spacing and smoothness in datasets. Experiments reveal that S4 maintains stable continuity across input amplitudes, but S6, the backbone of Mamba, shows sensitivity tied to its selective mechanism. Across multiple benchmarks the degree of model continuity tracks both task continuity and final accuracy. The same continuity property also supports a practical temporal-subsampling trick that raises speed and performance together.

Core claim

A model is continuous if its output sequence converges to an underlying continuous function as the temporal discretization is refined. S4 satisfies this convergence reliably, whereas S6 can diverge depending on input scaling and its selective state updates, even though both architectures originate from continuous dynamical systems. A simple metric computed from the spacing and variation in a dataset's time stamps quantifies task continuity. On benchmarks that range from low to high task continuity, model continuity predicts performance, and the continuous models admit a subsampling procedure that reduces computation while preserving or improving accuracy.
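
Stated compactly, and hedged as a reconstruction from this summary rather than a quotation of the paper (the exact norm and interpolation scheme are not reproduced here), the convergence criterion reads:

    % Hedged reconstruction, not the paper's verbatim definition.
    % u : [0,T] -> R^d is an underlying signal, u_Delta its samples on a grid of
    % step Delta, and f(u_Delta) the model's outputs interpolated back to [0,T].
    \[
      f \ \text{is continuous in time} \iff \exists \, y : [0,T] \to \mathbb{R}^{m}
      \ \text{such that} \quad
      \lim_{\Delta \to 0} \; \sup_{t \in [0,T]} \bigl\| f(u_{\Delta})(t) - y(t) \bigr\| = 0 .
    \]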

What carries the argument

Convergence under temporal refinement, which formalizes model continuity, together with a metric that extracts task continuity from the spacing and smoothness of observed time points.
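
The paper's metric is not reproduced on this page. Purely as an illustration of what a spacing-and-variation score could look like, the hypothetical helper below maps densely sampled, slowly varying data to values near 1 and jumpy or sparsely sampled data toward 0; the name, normalization, and functional form are assumptions, not the authors' definition.

    import numpy as np

    def task_continuity_score(t, x, eps=1e-12):
        """Illustrative only: a rough continuity score built from time-stamp
        spacing and signal variation. Not the paper's metric."""
        t = np.asarray(t, dtype=float)                   # time stamps, shape (N,)
        x = np.asarray(x, dtype=float).reshape(len(t), -1)
        dt = np.diff(t)                                  # spacing between samples
        dx = np.linalg.norm(np.diff(x, axis=0), axis=1)  # per-step variation
        variation_per_unit_time = dx.sum() / (t[-1] - t[0] + eps)
        typical_step = np.median(dt)
        scale = np.ptp(x) + eps                          # overall signal range
        # Fraction of the signal's range traversed in a typical step:
        # a small fraction means smooth, densely sampled, so score near 1.
        roughness = variation_per_unit_time * typical_step / scale
        return float(1.0 / (1.0 + roughness))

    # usage: score = task_continuity_score(timestamps, values)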

If this is right

  • Models that pass the temporal-refinement test should be preferred for any sequence task whose data exhibit smooth temporal evolution.
  • Temporal subsampling becomes a reliable efficiency lever once a model is known to be continuous (a minimal sketch follows this list).
  • Performance differences between S4 and S6 on real-world sequences may be explained in part by their differing continuity properties rather than by selectivity alone.
  • Continuity can serve as an additional axis for model selection or architecture search in sequential learning.
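
For the second bullet above: if the model exposes an explicit discretization step, as S4-style state-space models do, the lever can be as simple as striding the input and scaling the step to match. This is a hedged sketch under that assumption, not the paper's procedure; `delta` stands for whatever step parameter the model uses.

    import numpy as np

    def temporal_subsample(x, delta, stride=2):
        """Keep every `stride`-th sample and enlarge the discretization step so a
        continuity-respecting model sees the same trajectory at lower resolution.
        Illustrative sketch, not the paper's method."""
        x = np.asarray(x)
        return x[::stride], delta * stride

    # usage: x_sub, delta_sub = temporal_subsample(x, delta, stride=4)
    # Compute drops roughly by the stride factor; if the model converges under
    # temporal refinement, its predictions on (x_sub, delta_sub) should stay
    # close to those on the full-resolution input.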

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architectures beyond state-space models could be screened with the same refinement test to predict their suitability for temporally smooth data.
  • Enforcing the convergence property during training might improve generalization on video, audio, or sensor streams without hand-crafted regularization.
  • The subsampling benefit may extend to any model that can be shown to converge, offering a general way to trade resolution for speed on continuous tasks.

Load-bearing premise

That convergence of predictions under finer time steps is the right way to capture the continuity inductive bias and that the dataset metric faithfully measures how continuous a task's underlying structure is.
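
The premise is checkable numerically. Below is a hedged sketch of such a check on a toy one-dimensional linear SSM discretized by zero-order hold: sample one fixed smooth signal at finer and finer grids and see whether the final output stabilizes. The toy dynamics, constants, and function names are illustrative assumptions, not the paper's experimental setup.

    import numpy as np

    def zoh_ssm_output(u, dt, a=-1.0, b=1.0, c=1.0):
        """Toy scalar SSM x' = a*x + b*u, y = c*x, discretized by zero-order
        hold with step dt; returns the output after the last sample."""
        abar = np.exp(a * dt)
        bbar = (abar - 1.0) / a * b
        x = 0.0
        for uk in u:
            x = abar * x + bbar * uk
        return c * x

    def refinement_gaps(signal, T=1.0, levels=(64, 128, 256, 512, 1024)):
        """Evaluate the toy SSM on ever finer samplings of the same signal and
        return the gaps between successive resolutions."""
        outs = [zoh_ssm_output(signal(np.linspace(0.0, T, n, endpoint=False)), T / n)
                for n in levels]
        return [abs(hi - lo) for lo, hi in zip(outs, outs[1:])]

    # usage: refinement_gaps(lambda t: np.sin(2 * np.pi * t))
    # Shrinking gaps as the grid is refined are the signature of the convergence
    # property the paper calls model continuity.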

What would settle it

Finding a dataset scored as highly continuous by the metric on which a demonstrably non-convergent model still achieves higher accuracy than S4, or showing that the subsampling strategy fails to help even when the model satisfies the convergence test.

read the original abstract

Inductive biases influence the behavior and performance of sequential models. In this work, we study an underexplored inductive bias in sequential modeling: continuity in time. We ask a simple question: do models motivated by continuous-time formulations, such as state-space models, actually behave continuously in time, and does this translate into better performance on tasks with continuous temporal structure? To answer this, we formalize model continuity as convergence under temporal refinement, where a model is continuous if its predictions approach an underlying continuous trajectory as the temporal discretization is refined. We show that S4 exhibits stable continuous behavior, whereas S6 (the core of Mamba) can be more sensitive to input amplitude and selective dynamics, despite being derived from a continuous dynamical system. To study whether this distinction matters for learning, we also need a corresponding notion of task continuity. We therefore introduce a metric to quantify the continuity of datasets directly from their temporal structure. Across benchmarks, we find a clear empirical alignment between task continuity, model continuity, and model performance. Beyond an inductive bias, continuity also has practical consequences: we show that it enables a simple temporal subsampling strategy that improves both efficiency and performance.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that sequential models exhibit an inductive bias toward continuity in time, formalized as convergence of predictions under temporal refinement of the input discretization. It shows that S4 exhibits stable continuous behavior while S6 (core of Mamba) can be sensitive to input amplitude and selective dynamics despite its continuous-time derivation. A metric quantifying task continuity from temporal structure is introduced, with empirical results across benchmarks showing alignment between task continuity, model continuity, and performance. Continuity is also shown to enable a simple temporal subsampling strategy that improves efficiency and performance.

Significance. If the central claims hold, the work provides a useful lens on inductive biases in state-space models by distinguishing S4 from S6/Mamba on continuity grounds and linking this to task structure and performance. The practical subsampling result offers a concrete efficiency benefit, and the introduced metrics could help guide model selection for temporally structured data.

major comments (2)
  1. [§3] §3 (formalization of model continuity): The definition of continuity as convergence under temporal refinement is load-bearing for the S4 vs. S6 distinction and all downstream claims, yet the manuscript does not sufficiently address whether this metric can be satisfied by non-physical limits arising from S6's selective discretization and amplitude-dependent dynamics. A concrete counter-example or additional analysis showing that the refinement procedure interacts with selectivity in a way that violates the intended continuous-time bias would be needed.
  2. [§4] §4 (task continuity metric): The metric for quantifying task continuity directly from temporal structure is central to the reported alignment with model performance. It is unclear whether the metric isolates continuous dynamics or simply correlates with low-frequency content or smoothness; without an ablation or alternative metric that breaks this degeneracy, the empirical alignment does not establish that continuity (rather than a proxy property) is the operative factor.
minor comments (2)
  1. The abstract and introduction should explicitly list the benchmarks and report quantitative effect sizes for the performance alignment and subsampling gains rather than qualitative statements.
  2. [Figures] Figures illustrating convergence under refinement should include error bars or multiple runs to show stability of the observed S4 vs. S6 differences.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us identify areas for improvement in the manuscript. We address each major comment below and outline the revisions we plan to make.

read point-by-point responses
  1. Referee: [§3] §3 (formalization of model continuity): The definition of continuity as convergence under temporal refinement is load-bearing for the S4 vs. S6 distinction and all downstream claims, yet the manuscript does not sufficiently address whether this metric can be satisfied by non-physical limits arising from S6's selective discretization and amplitude-dependent dynamics. A concrete counter-example or additional analysis showing that the refinement procedure interacts with selectivity in a way that violates the intended continuous-time bias would be needed.

    Authors: We appreciate this observation and agree that further analysis is warranted to distinguish physical continuity from artifacts of S6's discretization. The manuscript already shows through experiments that S6's predictions do not converge under refinement due to its selective and amplitude-dependent behavior. To provide the requested concrete counter-example, we will include in the revision a theoretical analysis of a simplified selective state-space model where the discretization step size interacts with the selection mechanism to produce non-convergent or amplitude-dependent limits, explicitly violating the continuous-time inductive bias. This will be accompanied by numerical verification (a toy illustration of this mechanism is sketched after these responses). revision: yes

  2. Referee: [§4] §4 (task continuity metric): The metric for quantifying task continuity directly from temporal structure is central to the reported alignment with model performance. It is unclear whether the metric isolates continuous dynamics or simply correlates with low-frequency content or smoothness; without an ablation or alternative metric that breaks this degeneracy, the empirical alignment does not establish that continuity (rather than a proxy property) is the operative factor.

    Authors: We acknowledge the referee's concern about potential degeneracy with low-frequency content. Our task continuity metric specifically evaluates the stability of the task output under successive temporal refinements of the input, which measures the degree to which the task relies on fine-grained temporal information rather than just overall signal smoothness. To resolve this, we will add an ablation in the revised manuscript comparing our metric to a baseline based on low-frequency energy (e.g., via Fourier transform) or simple smoothness measures like total variation. We expect to show that our metric better correlates with the performance gap between S4 and S6, thereby establishing continuity as the key factor. revision: yes
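
As a purely illustrative companion to the counter-example promised in response 1, and not the authors' construction, the toy recurrence below makes its effective step a function of the input value (in the spirit of input-dependent discretization) rather than of the grid spacing. Its normalized output then changes with input amplitude, whereas a linear zero-order-hold discretization, such as the toy sketched earlier, is exactly homogeneous in amplitude.

    import numpy as np

    def selective_toy_output(u, w=1.0):
        """Toy 'selective' recurrence: the step softplus(w * u_k) is set by the
        input value, not by the physical grid spacing. Illustrative only."""
        x = 0.0
        for uk in u:
            step = np.log1p(np.exp(w * uk))   # softplus, always positive
            x = np.exp(-step) * x + step * uk
        return x

    t = np.linspace(0.0, 1.0, 2048, endpoint=False)
    base = np.sin(2 * np.pi * t)
    for amp in (0.5, 1.0, 2.0, 4.0):
        print(amp, selective_toy_output(amp * base) / amp)
    # A linear, fixed-step discretization gives the same normalized output for
    # every amplitude; here the values differ because the effective step (and
    # hence the implied continuous-time dynamics) shifts with input scale, a
    # caricature of the selectivity/amplitude interaction at issue.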

Circularity Check

0 steps flagged

No significant circularity; definitions and metric are independent inputs to empirical claims

full rationale

The paper defines model continuity via convergence under temporal refinement and introduces a separate metric for task continuity based on dataset temporal structure. These serve as inputs to empirical comparisons of S4/S6 behavior, alignment with performance, and a subsampling strategy. No quoted equation or step reduces a claimed result to a fitted parameter, self-citation chain, or definitional equivalence by construction; the derivation chain remains self-contained against external benchmarks and observations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims depend on the introduced definition of continuity and the validity of the task metric; no free parameters or invented entities are mentioned.

axioms (1)
  • domain assumption: Model continuity is formalized as convergence of predictions under temporal refinement
    This definition is introduced to study the inductive bias and is central to all subsequent claims about S4, S6, and task alignment.

pith-pipeline@v0.9.0 · 5501 in / 1279 out tokens · 65560 ms · 2026-05-12T01:27:35.619347+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
