pith. machine review for the scientific record.

arxiv: 2605.08517 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.CV · physics.med-ph

Recognition: 2 Lean theorem links

A Deep Risk Estimator for Known Operator Learning

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 01:52 UTC · model grok-4.3

classification 💻 cs.LG · cs.CV · physics.med-ph
keywords risk estimation · known operator learning · hybrid deep networks · sample complexity · computed tomography · generalization bounds · physics-informed neural networks

The pith

A deep risk estimator for hybrid networks shows that replacing any learned layer with a known operator tightens the overall bound and cuts the number of training samples required in proportion to the trainable parameters of the replaced layer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives a statistical risk estimator for deep networks that combine learned layers with known operators. The estimator decomposes total expected error into a sum of per-layer terms, where known operators contribute nothing and each learned layer adds a Barron-style approximation error plus an estimation error that falls with more training samples. This yields the central result: inserting a known operator shrinks the bound, and the sample size needed for a target accuracy drops in direct proportion to the number of trainable parameters removed with the replaced layer. The approach is tested on computed tomography reconstruction, where the predicted savings match the sparsity of an analytic filter-plus-backprojection decomposition and the calibrated bound stays within a factor of two of the observed test error across training-set sizes.
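The abstract does not reproduce the bound itself, but its shape can be sketched from the fragments quoted elsewhere on this page (the term $A_m\,\mathbb{E}\lVert e_m \rVert_2^2$ appears in the Eq. (4) excerpt under the Lean links below). The $C_m P_m / N$ form of the estimation term is an illustrative assumption, not the paper's exact expression:

```latex
% Hedged sketch of the additive per-layer risk bound. The sum runs over
% learned layers only; known operators contribute no term at all.
\mathbb{E}\,\mathcal{R}(f)
  \;\le\; \sum_{m \,\in\, \text{learned}}
  \Bigl(
    \underbrace{A_m\,\mathbb{E}\lVert e_m \rVert_2^2}_{\text{approximation (Barron-style)}}
    \;+\;
    \underbrace{C_m\,\tfrac{P_m}{N}}_{\text{estimation}}
  \Bigr)
```

Here $P_m$ counts the trainable parameters of layer $m$ and $N$ is the training-set size; replacing learned layer $m$ by a known operator deletes its entire summand, which is the monotone improvement the paper claims.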

Core claim

We derive a deep risk estimator that connects the expected error of a layered network to the size of the training sample by decomposing the total risk into a sum over learned layers; every known operator contributes zero to this sum, while every learned layer adds an approximation term inspired by Barron's classic work and an estimation term that decreases with the number of training samples. We are able to show that the bound shrinks whenever a learned layer is replaced by a known operator and that the corresponding sample requirement scales with the number of trainable parameters of the layer that is replaced.

What carries the argument

The deep risk estimator, a sum of per-layer approximation and estimation terms in which known operators add zero while learned layers contribute positive terms that shrink with sample size.
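Read operationally, the estimator is a bookkeeping sum. A minimal Python sketch under the hedged additive form above (the Layer fields, the constants, and the $C_m P_m / N$ estimation term are illustrative assumptions, not the paper's code):

```python
from dataclasses import dataclass

@dataclass
class Layer:
    known: bool          # True for a fixed analytic (known) operator
    n_params: int        # trainable parameter count P_m
    approx_const: float  # calibrated approximation constant A_m
    est_const: float     # calibrated estimation constant C_m

def risk_bound(layers: list[Layer], n_samples: int) -> float:
    """Additive per-layer bound: known operators contribute exactly zero."""
    total = 0.0
    for layer in layers:
        if layer.known:
            continue  # known operator: no approximation or estimation term
        total += layer.approx_const                             # approximation
        total += layer.est_const * layer.n_params / n_samples   # estimation
    return total
```

Swapping a learned layer (known=False) for a known operator (known=True) deletes both of its terms from the sum, so the bound can only decrease, with the estimation budget falling by exactly that layer's parameter count.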

If this is right

  • The risk bound decreases whenever a learned layer is replaced by a known operator.
  • The number of training samples required scales directly with the number of trainable parameters in the replaced layer.
  • In CT reconstruction the predicted parameter ratio matches the structural sparsity exposed by decomposing into a circulant filter and sparse backprojection.
  • After calibrating the per-layer constants, the estimator tracks empirical test MSE within a factor of two across all training-set sizes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers of physics-informed neural networks can use the estimator to decide which physical operations to hardcode by quantifying the resulting drop in required data.
  • The same per-layer decomposition could be applied to hybrid architectures in fluid simulation or differential equation solving to predict minimal dataset sizes for a target accuracy.
  • The scaling law may let practitioners invert the estimator to choose a training-set size before any learning begins (a sketch of this inversion follows below).
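To make the last point concrete: under the assumed additive form, the inversion has a closed-form answer because only the estimation terms depend on $N$. A minimal sketch reusing the Layer records from the snippet above (again an illustration of the assumed form, not the paper's procedure):

```python
import math

def required_samples(layers: list[Layer], target_error: float) -> int:
    """Smallest N with risk_bound(layers, N) <= target_error.

    Closed form under the assumed A + C * P / N shape:
        N >= sum(C_m * P_m) / (eps - sum(A_m)),
    valid only when eps sits above the approximation floor sum(A_m).
    """
    learned = [l for l in layers if not l.known]
    approx_floor = sum(l.approx_const for l in learned)
    if target_error <= approx_floor:
        raise ValueError("target error is below the approximation floor")
    est_budget = sum(l.est_const * l.n_params for l in learned)
    return math.ceil(est_budget / (target_error - approx_floor))
```

Replacing a learned layer with a known operator removes its contribution from both est_budget and approx_floor, so the required N falls in proportion to the parameters removed, which is the scaling the paper reports.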

Load-bearing premise

The derivation assumes that previously established maximal training error bounds for known operator learning and Barron-style approximation bounds apply directly to the learned layers inside the hybrid architecture.

What would settle it

If the calibrated estimator, after fitting per-layer constants on training sweeps, fails to track empirical test MSE within a factor of two at every training-set size, or if replacing a learned layer with a known operator does not reduce observed sample complexity proportionally to the number of parameters in that layer.
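The first falsifier is mechanical to score. A minimal sketch (the symmetric two-sided window is an assumption about how "within a factor of two" is judged; the paper may intend a one-sided criterion for an upper bound):

```python
def tracks_within_factor_two(predicted_bounds, observed_mse) -> bool:
    """True iff the calibrated bound stays within 2x of the empirical
    test MSE at every training-set size in the sweep."""
    return all(max(p / o, o / p) <= 2.0
               for p, o in zip(predicted_bounds, observed_mse))
```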

Figures

Figures reproduced from arXiv: 2605.08517 by Andreas Maier, Md Hasan, Paula Andrea Perez-Toro, Paulina Conrad.

Figure 1. Known operator networks blend analytic, fixed layers (gray) with trainable layers (green). [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2. Computed tomography networks compared in this paper. The operator-aware network [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3. Sample-efficiency sweeps at five operating points: CPU surrogate at [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4. Reconstructions at H = 128 for the unified phantom pool with identical seeds. Two test phantoms (rows) under the operator-aware and fully connected networks at six sample sizes N. The KO reconstruction is already faithful at N = 4; the FC catches up around N = 1024–2048. [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗
Figure 5. Reconstructions at H = 256 for the unified phantom pool with identical seeds. Two test phantoms (rows) under the operator-aware and fully connected networks at six sample sizes N. The KO reconstruction matches the phantom from N = 4 onwards; the FC remains visibly blurred even at N = 2048, consistent with the larger approximation-budget gap reported in [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗
read the original abstract

We describe an approach for estimating the statistical risk of deep networks that contain a mix of learned and known operators. Building on the maximal training error bounds previously established for known operator learning, we derive a deep risk estimator that connects the expected error of a layered network to the size of the training sample. The estimator decomposes the total risk into a sum over learned layers; every known operator contributes zero to this sum, while every learned layer adds an approximation term inspired by Barron's classic work and an estimation term that decreases with the number of training samples. We are able to show that the bound shrinks whenever a learned layer is replaced by a known operator and that the corresponding sample requirement scales with the number of trainable parameters of the layer that is replaced. As an application, we use computed tomography as an example and compare an operator-aware filtered backprojection network with a fully connected substitute that collapses the entire reconstruction pipeline into a single learned dense matrix. The predicted parameter ratio coincides with the structural sparsity that the analytic decomposition into a circulant filter and a sparse backprojection exposes. We confirm the predicted scaling on CPU at small image scale and on GPU at medium image scale, all on the same scaling law. Beyond CT reconstruction, the estimator applies to physics-informed neural networks that hardcode a known physical operation in its architecture, and we expect the result to be of interest for a broad community working on operator-aware deep learning. Calibrating the per-layer constants on each sweep yields a bound that tracks the empirical test MSE within a factor of two at every training-set size, so the estimator can be inverted to predict how many training samples are required to reach a target error.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper derives a deep risk estimator for hybrid neural networks containing both learned layers and known operators. Building on prior maximal training-error bounds for known-operator learning, the estimator decomposes total risk additively over learned layers only (known operators contribute zero), combining a Barron-style approximation term with a sample-size-dependent estimation term. The central claims are that the bound shrinks when any learned layer is replaced by a known operator and that the required training-sample size scales with the number of trainable parameters of the replaced layer. The approach is illustrated on CT reconstruction by comparing an operator-aware filtered-backprojection network against a fully learned dense-matrix substitute; after calibrating per-layer constants on each training-size sweep, the bound tracks empirical test MSE within a factor of two at every sample size. The estimator is also positioned for physics-informed networks that hard-code known physical operations.

Significance. If the derivation and transfer of Barron-style bounds hold under the hybrid composition, the work supplies a concrete, architecture-driven tool for predicting sample complexity and for quantifying the benefit of inserting known operators. The CT example demonstrates alignment between the predicted parameter ratio and the structural sparsity of the analytic decomposition, and the scaling law is shown to be consistent across CPU and GPU regimes. These strengths would be of direct interest to the operator-aware and physics-informed deep-learning communities, provided the post-hoc calibration can be replaced or justified by a priori constants.

major comments (3)
  1. The derivation (abstract and central claims) assumes that Barron-style per-layer approximation bounds and previously established maximal training-error bounds apply directly to each learned layer even after its input has been transformed by preceding known operators. Because known operators alter the effective domain, Lipschitz constants, and smoothness of the target function for subsequent layers, the per-layer constants and scaling may change; no explicit derivation of cross terms or error-propagation bounds is provided to justify the additive decomposition without additional remainder terms.
  2. The reported calibration of per-layer constants on each training-size sweep (abstract) is performed on the same data used to validate that the bound tracks empirical MSE within a factor of two. This introduces a circularity: the constants are fitted post-hoc to match observed error, so the estimator is not parameter-free and its predictive use for new architectures or sample sizes rests on quantities tuned to the validation distribution.
  3. The manuscript states that the bound shrinks and sample requirement scales with the number of trainable parameters when a learned layer is replaced by a known operator, yet the abstract provides no explicit theorem statement, proof sketch, or equation showing how the additive decomposition yields this scaling once the known-operator term is set to zero. Without the full derivation or the cited prior bounds reproduced, it is impossible to verify that the claimed scaling follows directly rather than from the calibration step.
minor comments (2)
  1. Notation for the per-layer constants and the precise form of the Barron-inspired approximation term should be introduced with explicit definitions and referenced to the cited prior work on maximal training-error bounds.
  2. The CT experiment description would benefit from a brief statement of the data-exclusion rules, image sizes, and exact training-set sizes used in the scaling sweeps to allow independent reproduction of the factor-of-two tracking result.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment point by point below, outlining the revisions we will make to improve the clarity and rigor of the manuscript.

read point-by-point responses
  1. Referee: The derivation (abstract and central claims) assumes that Barron-style per-layer approximation bounds and previously established maximal training-error bounds apply directly to each learned layer even after its input has been transformed by preceding known operators. Because known operators alter the effective domain, Lipschitz constants, and smoothness of the target function for subsequent layers, the per-layer constants and scaling may change; no explicit derivation of cross terms or error-propagation bounds is provided to justify the additive decomposition without additional remainder terms.

    Authors: We acknowledge that the current presentation does not explicitly derive the error-propagation bounds through known operators. In the revised manuscript we will add a dedicated subsection deriving the cross terms under the assumption that known operators are Lipschitz continuous, showing that the remainder terms remain bounded and can be absorbed into the per-layer constants, preserving the additive structure and the claimed scaling (the standard composition step this rests on is sketched after this exchange). A proof sketch will be included. revision: yes

  2. Referee: The reported calibration of per-layer constants on each training-size sweep (abstract) is performed on the same data used to validate that the bound tracks empirical MSE within a factor of two. This introduces a circularity: the constants are fitted post-hoc to match observed error, so the estimator is not parameter-free and its predictive use for new architectures or sample sizes rests on quantities tuned to the validation distribution.

    Authors: We agree that post-hoc calibration on the validation sweeps limits a priori predictive claims. In the revision we will (i) explicitly state that the constants are architecture-dependent and can be bounded from operator properties (e.g., known Lipschitz constants) without test data, (ii) add an experiment using constants calibrated on a disjoint small pilot set, and (iii) discuss the estimator’s use for relative comparisons across architectures even when absolute constants are calibrated once. revision: partial

  3. Referee: The manuscript states that the bound shrinks and sample requirement scales with the number of trainable parameters when a learned layer is replaced by a known operator, yet the abstract provides no explicit theorem statement, proof sketch, or equation showing how the additive decomposition yields this scaling once the known-operator term is set to zero. Without the full derivation or the cited prior bounds reproduced, it is impossible to verify that the claimed scaling follows directly rather than from the calibration step.

    Authors: The scaling follows directly from setting the known-operator contribution to zero in the additive risk estimator (Eq. 7 in the manuscript) and retaining only the parameter-dependent estimation term for learned layers; this is shown in Section 3 using the cited maximal training-error bounds. To improve accessibility we will insert a concise theorem statement and the key scaling equation into the abstract and introduction, and reproduce the relevant prior bounds in an appendix (a one-line version of the scaling step is sketched below). revision: yes
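Two steps in these responses can be made concrete. The composition argument promised in response 1 rests on a standard Lipschitz chaining step; the sketch below assumes, as the response does, that every layer is Lipschitz, and writes $\ell_j$ for the Lipschitz constant of layer $j$ (notation introduced here, not taken from the paper):

```latex
% Standard chaining: a perturbation entering at the input (or at any
% intermediate layer) reaches the output amplified by at most the
% product of the downstream Lipschitz constants.
\lVert f_L(x) - f_L(\tilde{x}) \rVert_2
  \;\le\; \ell_L \,\lVert f_{L-1}(x) - f_{L-1}(\tilde{x}) \rVert_2
  \;\le\; \cdots \;\le\; \Bigl(\textstyle\prod_{j=1}^{L} \ell_j\Bigr)\,
  \lVert x - \tilde{x} \rVert_2 .
```

An error committed at layer $m$ is therefore scaled by at most $\prod_{j>m} \ell_j$ on its way to the output, which is what would let remainder terms be folded into per-layer constants without breaking the additive structure.

The scaling invoked in response 3 likewise follows in one line from the hedged additive form sketched earlier on this page, assuming both architectures share the approximation floor $A$ and estimation constant $C$ and must meet a target error $\varepsilon > A$:

```latex
% Required samples under the assumed A + C P / N form; the ratio of
% sample requirements reduces to the ratio of trainable parameters.
C\,\frac{P}{N} \;\le\; \varepsilon - A
  \quad\Longrightarrow\quad
  N(\varepsilon) \;=\; \frac{C\,P}{\varepsilon - A},
  \qquad
  \frac{N_{\mathrm{KO}}(\varepsilon)}{N_{\mathrm{FC}}(\varepsilon)}
  \;=\; \frac{P_{\mathrm{KO}}}{P_{\mathrm{FC}}} .
```

This ratio is the quantity the CT experiment compares against the structural sparsity of the filter-plus-backprojection decomposition.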

Circularity Check

2 steps flagged

Fitted per-layer constants and imported prior bounds make the scaling predictions dependent on the data and assumptions that feed them

specific steps
  1. fitted input called prediction [Abstract (final paragraph)]
    "Calibrating the per-layer constants on each sweep yields a bound that tracks the empirical test MSE within a factor of two at every training-set size, so the estimator can be inverted to predict how many training samples are required to reach a target error."

    Constants are adjusted on the same training-size sweeps used for validation to match observed MSE; the resulting bound is then inverted to 'predict' sample requirements and scaling when a learned layer is replaced. The scaling law is therefore forced by the calibration step rather than derived independently.

  2. self citation load bearing [Abstract (opening derivation sentence)]
    "Building on the maximal training error bounds previously established for known operator learning, we derive a deep risk estimator that connects the expected error of a layered network to the size of the training sample. The estimator decomposes the total risk into a sum over learned layers; every known operator contributes zero to this sum"

    The central claims—that the bound shrinks on replacement by a known operator and that sample requirement scales with trainable parameters—rest on the imported maximal bounds and the zero-contribution assumption for known operators. No independent derivation of the additive decomposition or absence of cross terms is provided for the hybrid case.

full rationale

The estimator decomposes risk using previously established maximal training error bounds and adds Barron-inspired terms, but calibrates per-layer constants directly to observed MSE on each training-size sweep. This allows the bound to track data within a factor of two, after which the estimator is inverted to predict sample requirements and the effect of replacing layers. The scaling with trainable parameters is thus a consequence of the fit rather than an independent first-principles derivation. The decomposition assumes known operators contribute exactly zero with no cross-layer error propagation, but this is imported from the cited bounds without re-derivation for the hybrid architecture.
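What the flagged calibration amounts to, as a minimal sketch (the A + C * P / N model and the least-squares fit are assumptions about the fitting procedure, which the abstract does not specify):

```python
import numpy as np

def calibrate_constants(sweep_sizes, observed_mse, n_params):
    """Fit (A, C) in  bound(N) = A + C * P / N  to one training-size sweep.

    This is the post-hoc step the audit flags: the same sweep that fixes
    (A, C) is then used to score the factor-of-two tracking.
    """
    n = np.asarray(sweep_sizes, dtype=float)
    X = np.column_stack([np.ones_like(n), n_params / n])
    (a_const, c_const), *_ = np.linalg.lstsq(
        X, np.asarray(observed_mse, dtype=float), rcond=None)
    return a_const, c_const
```

Any monotone-in-1/N family fitted this way will track the sweep it was fitted on; the audit's point is that agreement on that sweep cannot, by itself, validate the bound.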

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

Central claim rests on prior maximal training error bounds for known operator learning and on Barron approximation theory; free parameters are the per-layer constants calibrated to data.

free parameters (1)
  • per-layer constants
    Constants in the approximation and estimation terms are calibrated on each sweep so that the bound tracks empirical test MSE within a factor of two.
axioms (2)
  • domain assumption Maximal training error bounds previously established for known operator learning
    Explicitly stated as the foundation for the new estimator.
  • domain assumption Barron's approximation bounds apply to the learned layers in the hybrid network
    Used to supply the approximation term for each learned layer.

pith-pipeline@v0.9.0 · 5608 in / 1466 out tokens · 60581 ms · 2026-05-12T01:52:46.238045+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost Jcost_unit0 — echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    The estimator decomposes the total risk into a sum over learned layers; every known operator contributes zero to this sum... Replacing learned layer m by a known operator removes the term $A_m\,\mathbb{E}\lVert e_m \rVert_2^2$ from Eq. (4).

  • IndisputableMonolith/Foundation/BranchSelection branch_selection — echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    the bound shrinks whenever a learned layer is replaced by a known operator and that the corresponding sample requirement scales with the number of trainable parameters of the layer that is replaced

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

  1. Andrew R. Barron. Approximation and estimation bounds for artificial neural networks. Machine Learning, 14(1):115–133, 1994.

  2. Avinash C. Kak and Malcolm Slaney. Principles of Computerized Tomographic Imaging. SIAM, 2001.

  3. George Em Karniadakis, Ioannis G. Kevrekidis, Lu Lu, Paris Perdikaris, Sifan Wang, and Liu Yang. Physics-informed machine learning. Nature Reviews Physics, 3:422–440, 2021. doi: 10.1038/s42254-021-00314-5.

  4. Hao Liu, Jiahui Cheng, and Wenjing Liao. Deep neural networks are adaptive to function regularity and data distribution in approximation and estimation. Journal of Machine Learning Research, 26(213):1–56, 2025.

  5. Andreas Maier, Harald Köstler, Marco Heisig, Patrick Krauss, and Seung Hee Yang. Known operator learning and hybrid machine learning in medical imaging—a review of the past, the present, and the future. Progress in Biomedical Engineering, 4(2):022002, 2022.

  6. Andreas K. Maier, Christopher Syben, Bernhard Stimpel, Tobias Würfl, Mathis Hoffmann, Frank Schebesch, Weilin Fu, Leonid Mill, Lasse Kling, and Silke Christiansen. Learning with known operators reduces maximum error bounds. Nature Machine Intelligence, 1:373–380, 2019. doi: 10.1038/s42256-019-0077-5.

  7. Roberto Molinaro, Yunan Yang, Björn Engquist, and Siddhartha Mishra. Neural inverse operators for solving PDE inverse problems. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 25105–25139, 2023.

  8. Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019. doi: 10.1016/j.jcp.2018.10.045.

  9. Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics, 48(4), 2020. doi: 10.1214/19-AOS1875.

  10. Christopher Syben, Markus Michen, Bernhard Stimpel, Stephan Seitz, Stefan B. Ploner, and Andreas K. Maier. Technical note: PYRO-NN: Python reconstruction operators in neural networks. Medical Physics, 46(11):5110–5115, 2019. doi: 10.1002/mp.13753.

  11. Tobias Würfl, Florin C. Ghesu, Vincent Christlein, and Andreas Maier. Deep learning computed tomography. In Medical Image Computing and Computer-Assisted Intervention, pages 432–440. Springer, 2016.

  12. Tobias Würfl, Mathis Hoffmann, Vincent Christlein, Katharina Breininger, Yixing Huang, Mathias Unberath, and Andreas K. Maier. Deep learning computed tomography: learning projection-domain weights from image domain in limited angle problems. IEEE Transactions on Medical Imaging, 37(6):1454–1463, 2018. doi: 10.1109/TMI.2018.2833499.