arxiv: 2605.09545 · v1 · submitted 2026-05-10 · 🧮 math.OC · cs.SY · eess.SY


Diagnostic Certificates of Data Quality and Regression Identifiability for Koopman Identification

Yue Wu

Pith reviewed 2026-05-12 04:12 UTC · model grok-4.3

classification 🧮 math.OC · cs.SY · eess.SY

keywords Koopman identification · EDMD with control · data quality diagnostics · regression identifiability · persistent excitation · spectral gap · system identification · diagnostic certificates


The pith

Certificates isolate state coverage, lifted feature degeneracy, and regression conditioning failures in Koopman data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a layered diagnostic framework to locate why data may fail to support reliable Koopman operator identification even when inputs appear rich. It separates checks for state-space coverage and clustering, nondegeneracy of lifted features, and the conditioning of the final regression problem. The regression-spectrum certificate supplies direct theoretical control over the smallest singular value of the active design matrix together with Fisher-information and one-step stability interpretations. A finite-sample lower bound on this certificate holds when the population exhibits a spectral gap. Experiments on Duffing, Van der Pol, and Lorenz systems demonstrate that the layers separate in practice and that prediction performance is not monotone in any single certificate.

Core claim

The paper establishes that data quality for EDMD with control is governed by the joint distribution of lifted state features and inputs, not by input richness alone. It introduces certificates that separately diagnose state-space coverage, lifted-feature nondegeneracy, and regression-spectrum conditioning. The regression-spectrum certificate directly controls the smallest singular value of the active standardized design matrix, carries Fisher-information and one-step EDMDc stability interpretations, and admits a finite-sample lower bound under a population spectral gap. Structural examples and a Schur-complement condition show that the four diagnostics (state, lifted-feature, input, and regression) cannot be substituted for one another.
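For orientation, the Fisher-information and one-step stability readings can be sketched for a generic least-squares EDMDc fit. The notation ($Z$, $Y$, $\Theta$, $\sigma_\varepsilon$) is assumed here and is not taken from the paper, whose statements concern the active standardized design and may differ in detail.

```latex
% Rows of Z stack [\psi(x_k)^\top, u_k^\top]; rows of Y stack \psi(x_{k+1})^\top.
\hat{\Theta} = \arg\min_{\Theta} \|Y - Z\Theta\|_F^2 = Z^{+} Y,
\qquad \Theta = [A \;\; B]^\top .

% Sensitivity of the fit to a perturbation \Delta Y of the targets
% (Z assumed to have full column rank):
\|Z^{+} \Delta Y\|_2 \le \frac{\|\Delta Y\|_2}{\sigma_{\min}(Z)} .

% Fisher information for each column of \Theta under i.i.d. Gaussian
% noise of variance \sigma_\varepsilon^2:
\mathcal{I} = \frac{1}{\sigma_\varepsilon^{2}} Z^\top Z,
\qquad
\lambda_{\min}(\mathcal{I}) = \frac{\sigma_{\min}(Z)^{2}}{\sigma_\varepsilon^{2}} .
```

Both quantities degrade as $\sigma_{\min}(Z) \to 0$, which is one way to see why a lower bound on the smallest singular value can carry both interpretations.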

What carries the argument

The regression-spectrum certificate carries the argument. It is defined as the smallest singular value of the active standardized design matrix formed by lifted features and controls, and it admits a finite-sample lower bound under a population spectral gap.
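A minimal sketch of how such a certificate might be computed, assuming a hypothetical degree-2 monomial dictionary, column-wise standardization, and a simple variance threshold for deciding which columns are "active". None of these choices is confirmed by the source; the function names `lift` and `regression_certificate` are illustrative only.

```python
import numpy as np

def lift(X):
    """Hypothetical dictionary: monomials of degree 1 and 2 of a 2-D state."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x1 * x2, x2**2])

def regression_certificate(X, U, tol=1e-12):
    """Smallest singular value of the standardized design [lift(X), U].

    Columns are centered and scaled to unit variance, and the matrix is
    divided by sqrt(N) so values are comparable across sample sizes.
    Columns with (near-)zero variance are dropped as inactive; this
    selection rule is an assumption, not the paper's definition.
    """
    Z = np.column_stack([lift(X), U])
    Z = Z - Z.mean(axis=0)
    std = Z.std(axis=0)
    active = std > tol
    Zs = Z[:, active] / std[active]
    return np.linalg.svd(Zs / np.sqrt(len(Zs)), compute_uv=False).min()

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 2))

# Input drawn independently of the state.
U_rich = rng.uniform(-1.0, 1.0, size=(500, 1))
# Input that varies over a wide range but is a linear function of the state.
U_feedback = -2.0 * X[:, [0]] + 0.5 * X[:, [1]]

c_rich = regression_certificate(X, U_rich)
c_feedback = regression_certificate(X, U_feedback)
print(f"independent input:  {c_rich:.3f}")
print(f"state-feedback input: {c_feedback:.2e}")
```

The feedback input is far from constant, yet the joint design is rank deficient because the input column lies in the span of the lifted features. This is the kind of failure that an input-only richness check would not flag.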

If this is right

State, lifted-feature, input, and regression diagnostics cannot replace one another, as shown by structural counter-examples and the Schur-complement relation.
IGPE-DOPT scores candidate trajectory segments using the certificates to improve data collection.
Budget allocation and weighting in sampling shift which certificate layer becomes the active bottleneck.
Downstream prediction or control accuracy is not guaranteed to improve when only one certificate is optimized.
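The non-substitutability point in the first item can be illustrated with the standard Schur-complement characterization of a block Gram matrix. The block notation below is assumed for illustration, and the paper's own condition may be stated differently.

```latex
% Gram matrix of the joint design, partitioned into lifted-feature and input blocks:
G = \frac{1}{N} Z^\top Z =
\begin{bmatrix}
G_{\psi\psi} & G_{\psi u} \\
G_{u\psi} & G_{uu}
\end{bmatrix},
\qquad
S = G_{uu} - G_{u\psi}\, G_{\psi\psi}^{-1}\, G_{\psi u} .

% Positive definiteness of the joint Gram matrix:
G \succ 0
\iff
G_{\psi\psi} \succ 0 \ \text{ and } \ S \succ 0 .

% The joint spectrum is no better than either part:
\lambda_{\min}(G) \le \min\{ \lambda_{\min}(G_{\psi\psi}),\ \lambda_{\min}(S) \} .

% Example: a linear feedback u_k = K \psi(x_k) gives
% G_{uu} = K G_{\psi\psi} K^\top and G_{u\psi} = K G_{\psi\psi}, hence S = 0.
```

In the feedback example both $G_{\psi\psi}$ and $G_{uu}$ can be well conditioned on their own while $S = 0$, so the regression problem is singular even though the lifted-feature and input diagnostics each look healthy.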

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The certificates could be monitored online to trigger adaptive sampling or input redesign during real-time identification.
Similar layered checks might diagnose identifiability problems in other embedding-based or lifted linear models beyond Koopman.
If the spectral-gap assumption can be verified from data, the bound supplies a practical stopping criterion for trajectory collection.

Load-bearing premise

The finite-sample lower bound on the regression spectrum holds only when the underlying population of lifted features and inputs has a spectral gap.
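One generic way such a conditional bound can arise is through Weyl's inequality. The sketch below reads the spectral gap as a positive smallest eigenvalue of the population Gram matrix; that reading, and the symbols $\Sigma$, $\gamma$, and $\epsilon$, are assumptions here, and the paper's actual bound and constants may differ.

```latex
% Population and empirical Gram matrices of the design rows z_k:
\Sigma = \mathbb{E}\left[ z z^\top \right],
\qquad
\hat{G}_N = \frac{1}{N} \sum_{k=1}^{N} z_k z_k^\top .

% If \lambda_{\min}(\Sigma) \ge \gamma > 0 and \|\hat{G}_N - \Sigma\|_2 \le \epsilon, then by Weyl's inequality
\lambda_{\min}(\hat{G}_N) \ge \gamma - \epsilon ,

% so that, whenever \epsilon < \gamma,
\sigma_{\min}\!\left( \tfrac{1}{\sqrt{N}} Z \right) \ge \sqrt{\gamma - \epsilon} .
```

When $\gamma$ is zero or very small, no sample size makes the right-hand side informative, which is why the premise is load-bearing.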

What would settle it

Measure the actual smallest singular value of the regression design matrix on trajectories from the Duffing oscillator and check whether it falls below the certificate's finite-sample lower bound whenever the population spectral gap is removed or made small.
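A rough sketch of the measurement half of this test, under stated assumptions. The Duffing parameters, the degree-2 monomial dictionary, and the mixing parameter `theta` are all hypothetical choices made for illustration. Setting `theta = 1` makes the input an exact linear function of the state, which removes the population gap in the joint design. The paper's finite-sample bound is not reproduced because the source does not state it, so only the measured singular value is reported.

```python
import numpy as np

def duffing_step(x, u, dt=0.01, delta=0.3, alpha=-1.0, beta=1.0):
    """One RK4 step of x1' = x2, x2' = -delta*x2 - alpha*x1 - beta*x1**3 + u.

    Parameter values are illustrative, not taken from the paper.
    """
    def f(s):
        return np.array([s[1], -delta * s[1] - alpha * s[0] - beta * s[0] ** 3 + u])
    k1 = f(x)
    k2 = f(x + 0.5 * dt * k1)
    k3 = f(x + 0.5 * dt * k2)
    k4 = f(x + dt * k3)
    return x + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def lift(X):
    """Hypothetical dictionary: monomials of degree 1 and 2."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x1 * x2, x2**2])

def sigma_min_standardized(Z):
    """Smallest singular value of the centered, unit-variance, 1/sqrt(N)-scaled design."""
    Z = Z - Z.mean(axis=0)
    Z = Z / Z.std(axis=0)
    return np.linalg.svd(Z / np.sqrt(len(Z)), compute_uv=False).min()

def collect(theta, n_steps=4000, seed=0):
    """Simulate one trajectory; theta blends random input (0) with state feedback (1)."""
    rng = np.random.default_rng(seed)
    x = np.array([0.5, 0.0])
    X, U = [], []
    for _ in range(n_steps):
        feedback = -1.0 * x[0] - 0.5 * x[1]
        u = (1.0 - theta) * rng.uniform(-1.0, 1.0) + theta * feedback
        X.append(x)
        U.append(u)
        x = duffing_step(x, u)
    return np.array(X), np.array(U)[:, None]

results = {}
for theta in (0.0, 0.5, 0.9, 1.0):
    X, U = collect(theta)
    results[theta] = sigma_min_standardized(np.column_stack([lift(X), U]))
    print(f"theta={theta:.1f}  sigma_min={results[theta]:.3e}")
```

A full test would compare each measured value against the paper's bound evaluated for the same sample size and an estimate of the population gap. The blend parameter is only one convenient way to shrink the gap; narrowing the visited state region or using a redundant dictionary would probe other layers.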

Figures

Figures reproduced from arXiv: 2605.09545 by Yue Wu.

**Figure 2.** Regression certificate and quantities associated with the theory.

**Figure 3.** Budget sensitivity.

**Figure 4.** Nonmonotonicity in downstream tasks.

**Figure 5.** Sanity check for the definitional identity between Creg and the minimum singular value.

Original abstract

Classical persistent excitation criteria usually assess whether an input or regressor signal is sufficiently rich. In Koopman and EDMD with control (EDMDc), however, data quality is determined by the concatenation of lifted state features and control inputs. Input-rich data can still visit a narrow state region, well-spread state samples can still produce degenerate lifted features, and both can fail to condition the final regression problem. This paper develops a diagnostic certificate framework for locating these failures. The certificates separate state-space coverage and clustering, lifted-feature nondegeneracy, and the final regression spectrum. The regression-spectrum certificate is the layer with direct theoretical guarantees: it controls the active standardized design's smallest singular value, has Fisher-information and one-step EDMDc stability interpretations, and admits a finite-sample lower bound under a population spectral gap. We also give structural examples and a Schur-complement condition showing why state, lifted, input, and regression diagnostics cannot be substituted for one another. As a sampling example, IGPE-DOPT uses these certificates to score candidate trajectory segments. Experiments on Duffing, Van der Pol, and Lorenz systems compare input-, state-, lifted-, and regression-oriented baselines. The results show that certificate layers separate, budget and weights shift bottlenecks, and downstream prediction or control performance is not monotone in any single certificate. The framework is therefore intended as an interpretable diagnostic and data-collection guide, not as a universal optimality claim.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a practical layered diagnostic for spotting distinct data failures in EDMDc, backed by structural examples, but the regression bound's finite-sample guarantee rests on an unverified population spectral gap.


The main thing here is a framework that splits data quality checks into state coverage, lifted-feature nondegeneracy, input richness, and regression-spectrum conditioning for Koopman identification with control. The structural examples using Schur complements make a clear case that these layers are not interchangeable, which is the part that feels genuinely useful for someone trying to collect better trajectories.

Referee Report

1 major / 3 minor

Summary. The manuscript develops a layered diagnostic certificate framework for data quality assessment in Koopman identification with control inputs (EDMDc). The certificates address state-space coverage and clustering, lifted-feature nondegeneracy, and the regression spectrum of the active standardized design matrix. The regression-spectrum certificate is equipped with interpretations in terms of Fisher information and one-step EDMDc stability, along with a finite-sample lower bound conditioned on a population spectral gap. Structural examples and a Schur-complement condition demonstrate that the state, lifted-feature, input, and regression diagnostics are non-substitutable. Experiments on the Duffing, Van der Pol, and Lorenz systems compare input-, state-, lifted-, and regression-oriented baselines and show that downstream performance is not monotone in any single certificate. IGPE-DOPT is presented as a sampling example that uses the certificates to score candidate trajectory segments.

Significance. Should the finite-sample bound and non-substitutability results hold under the stated assumptions, the work provides a principled, interpretable toolkit for diagnosing data deficiencies in data-driven Koopman modeling. This is significant for practical applications in system identification where poor data conditioning can lead to unreliable models. The explicit demonstration that state, lifted, input, and regression diagnostics cannot substitute for each other is a notable contribution, as is the experimental evidence that no single layer suffices for optimal data selection. The framework's strength in offering both theoretical guarantees and practical sampling guidance enhances its potential impact in the field of nonlinear system identification.

major comments (1)

The finite-sample lower bound on the smallest singular value of the active standardized design (regression-spectrum certificate) is derived under the assumption of a positive population spectral gap in the lifted feature space. However, the manuscript does not verify the existence or magnitude of this gap for the Duffing, Van der Pol, or Lorenz systems after lifting, nor does it provide sensitivity analysis showing how the bound performs when the gap is small or absent. This assumption is load-bearing for the theoretical guarantee and requires explicit checking or relaxation to support the claims.

minor comments (3)

The notation and definitions for the four certificate layers would benefit from a consolidated summary table to improve readability.
Details on data exclusion criteria, number of trials, and error bar computation are missing from the experimental section, which would strengthen the reproducibility of the reported comparisons.
Some figure captions could more explicitly link the visualized certificate values to the theoretical interpretations in the regression-spectrum layer.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review, as well as the positive assessment of the manuscript's significance. We address the single major comment below and will revise the manuscript to incorporate the requested verification and analysis.


Referee: The finite-sample lower bound on the smallest singular value of the active standardized design (regression-spectrum certificate) is derived under the assumption of a positive population spectral gap in the lifted feature space. However, the manuscript does not verify the existence or magnitude of this gap for the Duffing, Van der Pol, or Lorenz systems after lifting, nor does it provide sensitivity analysis showing how the bound performs when the gap is small or absent. This assumption is load-bearing for the theoretical guarantee and requires explicit checking or relaxation to support the claims.

Authors: We agree that the finite-sample lower bound is conditional on a positive population spectral gap and that the current manuscript lacks explicit verification of this gap (or sensitivity to its magnitude) for the Duffing, Van der Pol, and Lorenz examples. In the revised version we will add numerical estimates of the gap in the lifted feature spaces for these three systems, obtained from long trajectories or known dynamics where feasible. We will also include a sensitivity analysis that illustrates bound degradation as the gap approaches zero. This directly addresses the load-bearing assumption while leaving the conditional statement of the theorem unchanged. The experimental results on certificate utility and non-substitutability remain independent of the bound and continue to support the practical contribution.

Revision: yes

Circularity Check

0 steps flagged

No circularity: finite-sample bound conditioned on external population gap; layers separated by independent arguments


The regression-spectrum certificate's finite-sample lower bound is explicitly conditioned on a population spectral gap (positive gap in the population Gram matrix of lifted features), which is an external modeling assumption rather than a quantity fitted or defined from the same regression data. Structural examples and the Schur-complement condition establish non-substitutability of the four certificate layers without reducing any layer to another by construction. No self-definitional steps, fitted-input-called-prediction patterns, or load-bearing self-citations appear in the derivation chain. The framework remains self-contained against external benchmarks such as the stated interpretations (Fisher information, EDMDc stability).

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on standard Koopman lifting assumptions plus one key domain assumption for the bound; no free parameters or invented physical entities are mentioned in the abstract.

axioms (1)

domain assumption: A population spectral gap exists that enables the finite-sample lower bound on the regression-spectrum certificate. It is explicitly invoked in the abstract as the condition under which the theoretical guarantee holds.


